Dynamic AI
Co-Creation

A Human-Centered Approach
by Will Luers

Created through the Digital Publishing Initiative at The Creative Media and Digital Culture program, with support from the OER Grants at Washington State University Vancouver.


Chapter 7: AI Audio

1. AI Music Composition

Lejaren Hiller - Illiac Suite for String Quartet [1/4] | The Illiac Suite, by Lejaren Hiller and Leonard Isaacson, is the first musical composition for traditional instruments composed with a computer.

Pioneers like Lejaren Hiller and Leonard Isaacson, with their 1957 Illiac Suite, demonstrated the early potential of computers in music composition. Today, tools like Google's Magenta and OpenAI's MuseNet build on this legacy, generating polyphonic compositions that challenge the conventional boundaries of music. These tools provide a starting point for human creators, who can use iterative processes to refine and shape AI-generated music, voice, and sound effects into works that reflect their personal artistic vision.

AI generative music represents a profound intersection of technology and creativity, where machines collaborate with human artists to push the boundaries of what is sonically possible. The complexity and historical challenges of computer-generated music stem from the need to understand and replicate the intricate aspects of musical composition, including harmony, rhythm, and structure. To generate coherent and compelling music, AI models must be capable of producing long sequences of audio that maintain both temporal and spectral complexity.

The temporal complexity of music lies in its unfolding nature—sound evolves over time, requiring AI to predict not just the next note, but how it fits within the broader context of timing, tempo, and progression. This is inherently more complex than spatial relationships in images, as it involves sustaining coherence over longer sequences. Spectral complexity, on the other hand, involves capturing the rich and varied frequencies present in music, from deep bass notes to high treble tones. AI models must analyze and generate this wide range of frequencies simultaneously to produce sound that feels natural and authentic.
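
To make this concrete, a short-time Fourier transform (STFT) turns a waveform into exactly this two-dimensional picture: frequency content unfolding over time. The minimal Python sketch below uses the librosa library; the file path is a placeholder.

```python
import librosa
import numpy as np

# Load one second of audio (the file path here is just a placeholder).
y, sr = librosa.load("audio.wav", sr=22050, duration=1.0)

# Short-time Fourier transform: rows are frequency bins, columns are
# time frames -- the two axes of spectral and temporal complexity.
stft = librosa.stft(y, n_fft=2048, hop_length=512)
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

print(spectrogram_db.shape)  # (frequency_bins, time_frames), e.g. (1025, 44)
```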

Harmony, the combination of notes played together, is another crucial element. AI must grasp not only individual chords but also how chord progressions work to create a sense of resolution and tension within a piece. Rhythm, the pattern of sounds and silences in time, requires a deep understanding to ensure that generated music feels consistent and engaging to listeners. Similarly, the structure of music—whether it's the verse-chorus form in popular music or more complex arrangements in classical compositions—must be understood by AI to generate pieces that are coherent and complete.

Achieving coherence over long sequences of music is particularly challenging. AI must balance repetition and variation to avoid creating music that is either monotonous or disjointed. This requires sophisticated models like Convolutional Neural Networks (CNNs), which excel at processing and understanding the complex structures within audio data. CNNs analyze the spectral content of sound, breaking it down into components like rhythms, harmonies, and textures, which are essential for music composition.
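
As a rough illustration of how a CNN processes audio, the PyTorch sketch below treats a mel spectrogram as a one-channel image and passes it through a small stack of convolutions, as one might for a genre-tagging task. All layer sizes are illustrative assumptions, not a production architecture.

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Toy CNN that treats a mel spectrogram as a 1-channel image."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # halve time and frequency
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),              # collapse to one vector
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                         # x: (batch, 1, mels, frames)
        return self.classifier(self.features(x).flatten(1))

model = SpectrogramCNN()
dummy = torch.randn(4, 1, 128, 431)               # a batch of ~10 s mel spectrograms
print(model(dummy).shape)                         # torch.Size([4, 10])
```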

Recurrent Neural Networks (RNNs) are another critical tool in AI music generation. Designed for processing sequential data, RNNs are particularly well-suited for tasks that involve time-series analysis, such as music. Long Short-Term Memory networks (LSTMs), a special kind of RNN, overcome the limitations of traditional RNNs by remembering information over long periods. This capability is crucial for maintaining coherence in music over extended compositions.
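
As a sketch of how an LSTM might be applied to music, the toy PyTorch model below reads a sequence of MIDI pitch numbers and predicts a distribution over the next pitch at each step; the vocabulary size and layer dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NextNoteLSTM(nn.Module):
    """Predicts the next MIDI pitch from a sequence of previous pitches."""
    def __init__(self, n_pitches=128, embed_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_pitches, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_pitches)

    def forward(self, pitches):                   # pitches: (batch, seq_len)
        h, _ = self.lstm(self.embed(pitches))
        return self.out(h)                        # logits for every time step

model = NextNoteLSTM()
seq = torch.randint(0, 128, (2, 64))              # two toy 64-note sequences
logits = model(seq)                               # (2, 64, 128)
next_note = logits[:, -1].argmax(dim=-1)          # greedy pick of the next pitch
```

Sampling from the output distribution step by step, feeding each prediction back in as input, is the basic loop behind many RNN-based melody generators.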

Generative Adversarial Networks (GANs), the same technology used in image generation, further enhance the realism of AI-generated music. In a GAN, two neural networks—the generator and the discriminator—are trained simultaneously. The generator aims to produce music that is indistinguishable from real compositions, while the discriminator's task is to distinguish between the generator's output and actual music. This adversarial process drives the generator to create increasingly realistic and sophisticated music.
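
The adversarial setup can be summarized in a few lines of PyTorch. The sketch below uses deliberately tiny fully connected networks and random tensors in place of real audio clips; a practical audio GAN would use much deeper convolutional architectures.

```python
import torch
import torch.nn as nn

latent_dim, clip_len = 100, 16384                 # illustrative sizes

# Toy generator and discriminator for short raw-audio clips.
G = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                  nn.Linear(512, clip_len), nn.Tanh())
D = nn.Sequential(nn.Linear(clip_len, 512), nn.LeakyReLU(0.2),
                  nn.Linear(512, 1))
loss_fn = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(8, clip_len)                   # stand-in for real audio clips
fake = G(torch.randn(8, latent_dim))

# Discriminator step: learn to tell real clips from generated ones.
d_loss = (loss_fn(D(real), torch.ones(8, 1)) +
          loss_fn(D(fake.detach()), torch.zeros(8, 1)))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: learn to fool the discriminator into calling fakes real.
g_loss = loss_fn(D(fake), torch.ones(8, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```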

Together, these advanced AI models—CNNs, RNNs, LSTMs, and GANs—work in concert to address the inherent complexities of music generation. They allow AI to produce not just individual notes or short loops, but entire compositions that evolve naturally over time, demonstrating the potential of AI to not only mimic human creativity but to extend it in new and innovative ways.

2. AI Voice

The VODER (1939) - Early Speech Synthesizer | VintageCG
Identity Upgrade (Chapter 2) | Digital artist Jhave collaborates with GPT-4, DALL·E 3, SDXL, PikaLabs, and audio tools such as Riffusion and ElevenLabs.

The evolution of AI voice synthesis has been a remarkable journey, from the mechanical speech synthesis of the early 20th century to the seamless mimicry of human speech achieved by modern services like ElevenLabs and HeyGen. Early efforts, such as the Voder demonstrated at the 1939 World's Fair, relied on manual control of a keyboard and foot pedals to produce speech sounds. While innovative, these early systems were limited in expressiveness and naturalness.

In the 1950s and 1960s, formant synthesis emerged, using knowledge-based algorithms to simulate the resonant frequencies of the human vocal tract. Although this approach represented a significant advancement, the resulting speech often sounded robotic. The 1990s saw the rise of concatenative synthesis, which involved stitching together small units of recorded speech. This method greatly improved naturalness but still struggled with intonation and emotion.
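
The source-filter idea behind formant synthesis is simple enough to sketch directly. In the illustrative NumPy/SciPy example below, an impulse train stands in for the vocal cords and a cascade of resonant filters stands in for the vocal tract; the formant frequencies are rough textbook values for an /a/-like vowel.

```python
import numpy as np
from scipy.signal import lfilter

fs, f0, dur = 16000, 110, 1.0                     # sample rate, pitch, seconds

# Source: a glottal-like impulse train at the fundamental frequency.
source = np.zeros(int(fs * dur))
source[::fs // f0] = 1.0

# Filter: cascaded resonators at rough /a/-vowel formants (Hz, bandwidth).
signal = source
for freq, bw in [(730, 90), (1090, 110), (2440, 120)]:
    r = np.exp(-np.pi * bw / fs)                  # pole radius from bandwidth
    theta = 2 * np.pi * freq / fs                 # pole angle from frequency
    signal = lfilter([1 - r], [1, -2 * r * np.cos(theta), r**2], signal)

signal /= np.abs(signal).max()                    # normalize before playback
```

The buzzy, robotic character of the result is exactly the limitation described above.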

Recent advances in deep learning have revolutionized AI voice generation. WaveNet, introduced by DeepMind in 2016, models raw audio waveforms sample by sample using stacks of dilated causal convolutions, producing speech that closely mimics the human voice in terms of naturalness, intonation, and emotion. Models in this lineage are capable of generating highly realistic and expressive synthetic voices that can be tailored to a wide range of applications, from audiobook narration and virtual assistants to video game character voices.
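
The core trick in WaveNet is the dilated causal convolution: each layer looks only at past samples, and doubling the dilation at every layer grows the receptive field exponentially. A minimal PyTorch sketch of that stack follows; the sizes are illustrative, and the real model adds gated activations, residual connections, and a sample-level output distribution.

```python
import torch
import torch.nn as nn

class CausalBlock(nn.Module):
    """One dilated causal convolution, as in the WaveNet architecture."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.pad = (2 - 1) * dilation             # left-pad so no future leaks in
        self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

    def forward(self, x):                         # x: (batch, channels, samples)
        return torch.tanh(self.conv(nn.functional.pad(x, (self.pad, 0))))

# Doubling dilations (1, 2, 4, ...) grow the receptive field exponentially,
# which is how WaveNet sees enough past samples to shape intonation.
stack = nn.Sequential(*[CausalBlock(32, 2**i) for i in range(8)])
x = torch.randn(1, 32, 16000)                     # one second of features at 16 kHz
print(stack(x).shape)                             # torch.Size([1, 32, 16000])
```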

However, the powerful capabilities of AI voice modeling come with significant challenges and risks. One of the primary challenges is achieving naturalness and expressiveness in synthesized speech, which involves capturing human emotions, tone, and style in a way that feels authentic. Additionally, there is the need for language and accent diversity, which requires extensive datasets and sophisticated modeling to ensure inclusivity and accuracy across a vast array of dialects and languages. Computational demands also pose a challenge, as high-quality voice synthesis requires substantial resources, making real-time processing with minimal latency difficult to achieve on low-power devices.

Another critical aspect is personalization and adaptability, where AI systems must be capable of adapting to individual voices and speaking styles to create personalized speech generation. Despite these challenges, the benefits of AI voice synthesis are profound, particularly in enhancing accessibility. AI-generated voices can provide high-quality audio readings of text for visually impaired individuals, offering greater access to information and literature. These tools can also assist those who have difficulty reading due to dyslexia or other learning disabilities by converting written content into easily understandable audio.

Beyond accessibility, AI voice technology has the potential to revolutionize education by enabling the creation of personalized and engaging learning materials. It can enhance user experiences in customer service, entertainment, and communication by providing natural-sounding, responsive, and emotionally expressive voices in various applications. As AI voice synthesis continues to advance, it is essential to balance its powerful capabilities with considerations of ethical use, ensuring that the technology serves to enhance human experience while safeguarding against potential harms. The potential for misuse, such as creating deepfakes or impersonating individuals, underscores the need for responsible use and ethical guidelines in the development and deployment of AI voice technologies.

3. AI Sound Effects

ElevenLabs Text to Sound Effects

AI is not only transforming music and voice synthesis but also expanding the possibilities of sound effects. The ability to generate rich and immersive soundscapes is vital for enhancing interactive experiences across various media, from video games and virtual reality to films and multimedia installations.

Generative models trained on vast libraries of sound effects can create realistic and context-sensitive audio environments such as the rustling of leaves in a virtual forest or the bustling streets of a digital metropolis. These AI-driven sound effects are integral to storytelling, allowing creators to craft immersive and believable worlds that engage the audience's senses on a deeper level.
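
Text-to-sound-effects services expose this capability through simple APIs. The sketch below assumes the ElevenLabs Python SDK in its v1 form; method and parameter names may differ across SDK versions, and the API key and file name are placeholders.

```python
# A minimal sketch assuming the ElevenLabs Python SDK (v1-style API);
# method and parameter names may differ across SDK versions.
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")       # placeholder key

# Describe the sound you want in plain language, as with image prompting.
audio = client.text_to_sound_effects.convert(
    text="Leaves rustling in a light wind, distant birdsong",
    duration_seconds=5.0,
)

with open("forest_ambience.mp3", "wb") as f:
    for chunk in audio:                           # the SDK streams audio bytes
        f.write(chunk)
```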

Moreover, AI sound effect generation can be tailored to specific needs, enabling the creation of unique and novel sound palettes that push the boundaries of traditional sound design. This versatility makes AI an invaluable tool for industries ranging from gaming and film to advertising and interactive installations, where compelling audio experiences are crucial for engaging audiences.

4. Generative Audio Tools

The rise of generative audio tools such as Stable Audio, Boomy, and Suno.ai has democratized the field of sound production, empowering creators with unprecedented ease and accessibility. These platforms harness the power of AI to offer innovative tools for composing, sculpting, and manipulating soundscapes, breaking down barriers that once restricted access to professional-grade audio production.

Stable Audio, for instance, leverages advanced generative models to create a wide range of audio content, from music and sound effects to ambiences and Foley. Creators can provide textual descriptions or seed audio samples, and the AI will generate complex and nuanced audio sequences tailored to their needs.
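
A minimal sketch of that workflow, assuming the open-weights Stable Audio Open model served through the Hugging Face diffusers StableAudioPipeline (and a GPU), might look like this; the prompt and settings are illustrative.

```python
import torch
import soundfile as sf
from diffusers import StableAudioPipeline

# Assumes the open-weights Stable Audio Open checkpoint and a CUDA GPU.
pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
).to("cuda")

result = pipe(
    "warm analog synth arpeggio, 120 BPM, light tape hiss",
    negative_prompt="low quality",
    num_inference_steps=100,
    audio_end_in_s=10.0,                          # ten-second clip
)

audio = result.audios[0].T.float().cpu().numpy()  # (samples, channels)
sf.write("arpeggio.wav", audio, pipe.vae.sampling_rate)
```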

Similarly, Boomy and Suno.ai offer intuitive interfaces that allow users to explore and manipulate audio in ways that were previously unimaginable. These tools enable creators to experiment with sound design, remixing, and audio manipulation, empowering them to bring their artistic visions to life with unprecedented freedom and creativity.

The democratization of audio production through AI-powered tools has opened up new avenues for expression, fostering a vibrant ecosystem of audio creators from professionals to hobbyists and paving the way for innovative collaborations between human artists and artificial intelligence.

  • LANDR - An all-in-one platform for music creation, collaboration, mastering, distribution, and promotion.
  • Descript - A powerful AI-based audio and video editing tool, offering transcription, video editing, audio splicing, and voice cloning.
  • Mubert - Combines human-created audio with AI to generate music from text prompts, useful for creating theme songs, jingles, and more.
  • AIVA - A composition tool that creates soundtracks for ads, video games, and movies, and can also generate variations of existing songs.
  • Soundful - Generates royalty-free background music for videos, streams, and podcasts, offering a wide range of templates and customization options.
  • Ecrett Music - An AI composer that creates royalty-free music based on scene, mood, and genre, ideal for video content and media projects.
  • Magenta Studio - A collection of music plugins built on Magenta's open-source tools, available as standalone applications or as plugins for DAWs like Ableton Live.
  • Boomy - Allows users to create original music and get paid for it on streaming platforms, ideal for generating music for various media.
  • Soundraw - Generates original music files based on user-defined parameters like genre, mood, and instruments.
  • Google Tone Transfer - Transforms the tonal qualities of one audio clip to another, allowing musicians to create unique instrument sounds.
  • Loudly - An AI-powered platform for creating royalty-free music and audio content, offering tools for both music production and mastering.
  • Stability AI - A company that develops AI models for creative applications, including image, audio, and video generation.
  • Udio - An AI music generation platform that produces full songs, including vocals, from text prompts and lyrics.
  • Suno - An AI-based generative audio tool for creating music and soundscapes using text prompts and descriptions.
  • ElevenLabs - An AI-powered platform for voice synthesis, offering realistic and expressive voice generation for various applications.
  • Stable Audio - An AI tool by Stability AI focused on generating high-quality audio and music content from text descriptions.
  • Riffusion - An AI tool that generates music by producing spectrogram images with a diffusion model and converting them into audio, letting users create and manipulate music through spectrograms.

5. AI in Music

Break Free - The First AI-Composed Pop Song | Lyrics by Taryn Southern
Learn how Björk worked with AI on her new composition, Kórsafn | Microsoft In Culture

AI's influence on the music industry extends far beyond composition, touching upon various aspects of the creative process and challenging the traditional roles and boundaries within the industry. From Taryn Southern's AI-collaborated album "I AM AI" and Björk's generative installation "Kórsafn" to real-time interactive performances, AI is redefining the dynamics of musical creation and expression.

In music composition, AI serves as a powerful co-creator, offering suggestions and inspiration that can augment the musician's toolkit and unlock new realms of artistic exploration. This collaborative dynamic blurs the lines between the artist and the tool, inviting a reexamination of the creative process itself.

Beyond composition, AI is also making significant inroads in music analysis, processing, and real-time interaction. Advanced algorithms can analyze and deconstruct existing music, providing insights into structure, harmony, and emotion, enabling musicians to refine and enhance their craft. Additionally, AI-powered tools for audio processing and mastering offer new levels of precision and control, allowing artists to shape their sonic visions with unprecedented clarity.
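
Even a few lines of Python hint at what such analysis looks like in practice. The sketch below uses the librosa library, with a placeholder file path, to estimate tempo and extract a chromagram, a common starting point for harmonic analysis.

```python
import librosa

# Estimate tempo and beat positions, then summarize harmonic content
# with a chromagram (the file path is a placeholder).
y, sr = librosa.load("song.wav")
tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
chroma = librosa.feature.chroma_cqt(y=y, sr=sr)

print(f"Estimated tempo: {float(tempo):.1f} BPM")
print(f"Chromagram shape: {chroma.shape}")        # (12 pitch classes, frames)
```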

In live performances, AI opens up exciting possibilities for real-time interaction and improvisation. Generative models can respond to and adapt to the performers' inputs, creating a symbiotic relationship between the human artist and the artificial intelligence, resulting in truly unique and dynamic musical experiences.

As AI continues to permeate the music industry, it invites a reevaluation of the roles and boundaries within the creative process, fostering a collaborative environment where human artistry and technological innovation converge to push the boundaries of musical expression.

6. Unit Exercise

In this exercise, students will explore the creative possibilities of AI in audio production by creating an original soundtrack or audio narrative that includes music, voice, and sound effects. The focus will be on guiding the AI to produce non-traditional, imaginative music and/or soundscapes rather than simply mimicking human-created audio.

Generate a Fictional Band

Create a fictional band from scratch using ChatGPT, then produce songs from the band's fictional album using AI audio tools. Develop the band's identity, including their name, member names, backgrounds, and stylistic influences, in a chat session with ChatGPT. Then generate a few tracks for the band and corresponding images for each song.

  • Band Name and Member Creation: Use ChatGPT to brainstorm and come up with a unique band name. Then, create detailed profiles for each band member, including their names, backgrounds, personalities, and roles in the band.
  • Stylistic Influences: Define the band's stylistic influences by discussing with ChatGPT. Consider genres like pop, jazz, experimental sound art, and others that resonate with the band's identity. Determine how these influences will shape the band's overall sound and image.
  • Track Production: Use generative audio tools to create a few sample tracks that represent the band’s unique sound. Focus on incorporating the stylistic elements defined earlier, ensuring that each track aligns with the band's identity.
  • Image Generation: For each track, generate an image using AI image generation tools that visually represents the song. The images should capture the mood, theme, or story of the track, enhancing the overall concept of the album.

Iteration and Refinement:

  • Avoid settling for the first output.
  • Use iterative prompts and multiple generations to refine the AI's output, ensuring it aligns with your creative vision.
  • Experiment with modifying AI-generated audio through traditional audio editing software to further mold the music or sound art.

7. Discussion Questions

The discourse surrounding AI's impact on audio creation touches upon a wide range of thought-provoking topics, inviting deeper contemplation of the relationship between human artistry and technological innovation. Key discussion questions include:

  • How does AI reshape the music industry and what is the potential for new forms of creativity and artistic expression?
  • Can AI truly capture the emotional depth and human essence of music, or is it limited to mimicry and recombination of existing patterns?
  • As AI-generated voices become increasingly indistinguishable from human ones, how do we navigate the ethical landscape surrounding issues of consent, intellectual property, and the commodification of individual vocal identities?
  • What are the implications of AI-generated sound effects on industries like film, gaming, and virtual reality? Can these AI-generated soundscapes truly capture the nuances and emotional resonance of human-crafted sound design, or do they risk creating a sense of artificiality and disconnection?
  • How can we foster a collaborative relationship between human artists and AI systems where both parties contribute their unique strengths and perspectives to create truly innovative and boundary-pushing audio experiences?
  • As AI continues to advance, what safeguards and ethical frameworks need to be put in place to ensure that technological progress does not come at the expense of human creativity, artistic expression, and cultural diversity?

8. Bibliography

  • Briot, Jean-Pierre, Gaëtan Hadjeres, and François-David Pachet. Deep Learning Techniques for Music Generation. Springer, 2019.
  • du Sautoy, Marcus. The Creativity Code: How AI is Learning to Write, Paint and Think. Harvard University Press, 2019.
  • Huang, Cheng-Zhi Anna, et al. "Music Transformer: Generating Music with Long-Term Structure." Proceedings of the International Conference on Learning Representations (ICLR), 2019.
  • Oord, Aaron van den, et al. "WaveNet: A Generative Model for Raw Audio." Proceedings of the 9th ISCA Speech Synthesis Workshop, 2016.
  • Roads, Curtis. The Music Machine: Selected Readings from Computer Music Journal. MIT Press, 1989.
  • Sturm, Bob L., and Oded Ben-Tal. "AI in Music: The Challenges and Opportunities." Journal of New Music Research, vol. 48, no. 4, 2019, pp. 321–333.