OpenAI today released Jukebox, a machine learning framework that generates music — including rudimentary songs — as raw audio in a range of genres and musical styles. Provided with a genre, artist, and lyrics as input, Jukebox outputs a new music sample produced from scratch. The code and model are available on GitHub, along with a tool to explore the generated samples.
Jukebox might not be the most practical application of AI and machine learning, but as OpenAI notes, music generation pushes the boundaries of generative models. Synthesizing songs at the audio level is challenging because the sequences are very long: a typical 4-minute song at CD quality (44.1 kHz, 16-bit) has over 10 million timesteps. As a result, learning the high-level semantics of music requires models to deal with very long-range dependencies.
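The timestep figure follows directly from the sample rate. As a rough check, assuming one channel sampled at the standard CD rate of 44.1 kHz (the function name here is illustrative, not taken from Jukebox's code):

```python
SAMPLE_RATE_HZ = 44_100  # CD-quality sample rate
SONG_MINUTES = 4


def timesteps(minutes: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of audio samples (timesteps) per channel in a clip."""
    return int(minutes * 60 * sample_rate_hz)


n = timesteps(SONG_MINUTES)
print(f"{n:,} timesteps")  # 10,584,000 timesteps
```

Every one of those samples is a position a raw-audio model would have to predict, which is why the sequence length dominates the design.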
Here’s a Jukebox-generated country song in the style of Alan Jackson:
Here’s classic pop in the style of Frank Sinatra:
And here’s jazz in the style of Ella Fitzgerald:
Jukebox tackles the long-sequence problem by using what’s called an autoencoder, which compresses raw audio to a lower-dimensional space by discarding some of the perceptually irrelevant information. The model can then be trained to generate audio in this compressed space and upsample back to the raw audio space.
Jukebox’s autoencoder model processes audio with an approach called Vector Quantized Variational AutoEncoder (VQ-VAE). Three levels of VQ-VAE compress 44.1 kHz raw audio by factors of 8, 32, and 128 into discrete “music codes”; the bottom-level encoding (8 times) produces the highest-quality reconstruction, while the top-level encoding (128 times) retains only essential musical information, such as pitch, timbre, and volume.
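To make the compression levels concrete, here is a minimal sketch of what each factor means for sequence length, plus the nearest-neighbor lookup at the heart of vector quantization. The codebook size, latent dimension, and function names are illustrative assumptions, not Jukebox's actual configuration or code.

```python
import numpy as np

SAMPLE_RATE_HZ = 44_100
COMPRESSION = {"bottom": 8, "middle": 32, "top": 128}  # factors from the article


def codes_per_second(factor: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Discrete codes emitted per second of audio at a given compression factor."""
    return sample_rate_hz / factor


def quantize(latents: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each latent vector (T, D) to the index of its nearest codebook row (K, D)."""
    # Squared Euclidean distance between every latent and every code: shape (T, K).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


for level, factor in COMPRESSION.items():
    print(level, codes_per_second(factor))
# bottom 5512.5
# middle 1378.125
# top 344.53125

# Toy codebook (hypothetical): 16 codes of dimension 4, spaced far apart.
codebook = np.arange(16 * 4, dtype=float).reshape(16, 4)
latents = codebook[[3, 7, 7, 0]] + 0.1  # slightly perturbed encoder outputs
codes = quantize(latents, codebook)
print(codes.tolist())  # [3, 7, 7, 0]
```

At the top level, a 4-minute song shrinks from about 10.6 million samples to roughly 82,700 codes, which is what makes modeling long-range structure tractable.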
Within Jukebox, a family of prior models was trained to learn the distribution of the codes and generate music in the compressed space: a top-level prior that generates the most compressed music codes encoded by the VQ-VAE, and two upsampling priors that synthesize less compressed codes. The top-level prior models the long-range structure of music, so samples decoded from it have lower audio quality but capture high-level semantics (like singing and melodies), while the middle and bottom upsampling priors add local musical structures like timbre, significantly improving the audio quality.
Model training was performed using a simplified variant of OpenAI’s Sparse Transformers architecture against a corpus of 1.2 million songs (600,000 in English), which were sourced from the web and paired with both lyrics and metadata (e.g., artist, album genre, year, common mood, and playlist keywords) from LyricWiki. Every song was 32-bit at 44.1 kHz, and OpenAI augmented the corpus by randomly downmixing the right and left channels to produce mono audio.
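The article does not say how the random downmix is weighted. One plausible reading, sketched below purely as an assumption, is a random convex combination of the two channels; the function name and the uniform weighting are hypothetical, not OpenAI's documented method.

```python
import numpy as np


def random_downmix(stereo: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Mix a stereo clip of shape (2, T) down to mono (T,) with a random channel weight."""
    if stereo.ndim != 2 or stereo.shape[0] != 2:
        raise ValueError("expected stereo audio with shape (2, T)")
    w = rng.uniform(0.0, 1.0)  # assumed: weight drawn uniformly per clip
    return w * stereo[0] + (1.0 - w) * stereo[1]


rng = np.random.default_rng(0)
# Toy clip: left channel all +1, right channel all -1.
stereo = np.stack([np.ones(8), -np.ones(8)])
mono = random_downmix(stereo, rng)
print(mono.shape)  # (8,)
```

Because the weights sum to one, the mono signal never exceeds the louder of the two channels, so the augmentation cannot introduce clipping by itself.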
To condition Jukebox on particular artists and genres, a top-level Transformer model was trained on the task of predicting compressed audio tokens given each song’s artist and genre labels, which enabled Jukebox to achieve better quality in any musical style and allowed researchers to steer the model to generate in a style of their choosing. And to provide the framework with more lyrical context, OpenAI added a lyrics encoder along with attention layers that use queries from Jukebox’s music decoder to attend to keys and values from that encoder, allowing Jukebox to learn more precise alignments of lyrics and music.
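The lyrics mechanism described here is a form of encoder-decoder (cross) attention. The sketch below shows the generic single-head operation in NumPy, with random weights and toy dimensions chosen for illustration; it is not Jukebox's implementation, and all names and sizes are assumptions.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(music_states, lyric_states, w_q, w_k, w_v):
    """Single-head attention: queries from the music side, keys/values from the lyrics."""
    q = music_states @ w_q  # (T_music, d)
    k = lyric_states @ w_k  # (T_lyrics, d)
    v = lyric_states @ w_v  # (T_lyrics, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (T_music, T_lyrics)
    weights = softmax(scores, axis=-1)  # each music step's distribution over lyric tokens
    return weights @ v, weights


rng = np.random.default_rng(0)
d_model = 8  # toy hidden size
music = rng.normal(size=(6, d_model))   # 6 music-decoder positions
lyrics = rng.normal(size=(4, d_model))  # 4 lyric-encoder positions
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

context, weights = cross_attention(music, lyrics, w_q, w_k, w_v)
print(context.shape, weights.shape)  # (6, 8) (6, 4)
```

Each row of `weights` says which lyric tokens a given music position is drawing on, which is how a model of this kind can learn where in the audio each word is sung.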
Jukebox’s models required an immense amount of compute — and time — to train:
- The VQ-VAE, which contained over 2 million parameters (variables), was trained on 256 Nvidia V100 graphics cards for three days.
- The upsamplers, which contained over 1 billion variables, were trained on 128 Nvidia V100 graphics cards for two weeks.
- The top-level prior, which contained over 5 billion variables, was trained on 512 Nvidia V100 graphics cards for four weeks.
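Those figures can be put on a common scale as GPU-days. This back-of-the-envelope tally assumes every listed card ran for the full stated duration and that the upsampler figure covers a single run; the article does not confirm either point.

```python
# (number of V100s, training days) as reported above
runs = {
    "VQ-VAE": (256, 3),
    "upsamplers": (128, 14),
    "top-level prior": (512, 28),
}

gpu_days = {name: gpus * days for name, (gpus, days) in runs.items()}
total = sum(gpu_days.values())

for name, value in gpu_days.items():
    print(f"{name}: {value:,} GPU-days")
print(f"total: {total:,} GPU-days")
# VQ-VAE: 768 GPU-days
# upsamplers: 1,792 GPU-days
# top-level prior: 14,336 GPU-days
# total: 16,896 GPU-days
```

Under these assumptions, the top-level prior accounts for roughly 85% of the total.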
In all these respects, Jukebox is a quantum leap over OpenAI’s previous work, MuseNet, which explored synthesizing music based on large amounts of MIDI data. With raw audio, Jukebox models learn to handle diversity and long-range structure while mitigating errors in short-, medium-, or long-term timing. And the results aren’t half bad.
But Jukebox has its limitations. While the songs it generates are fairly musically coherent and feature traditional chord patterns (and even solos), they lack larger structures like repeating choruses. Moreover, they contain discernible noise, and the models are painfully slow to sample from: it takes roughly 9 hours to render one minute of audio.
Fortunately, OpenAI plans to distill Jukebox’s models into a parallel sampler that should “significantly” speed up sampling. It also intends to train Jukebox on songs from other languages and parts of the world beyond English and the West.
“Our audio team is continuing to work on generating audio samples conditioned on different kinds of priming information. In particular, we’ve seen early success conditioning on MIDI files and stem files,” wrote OpenAI. “We hope this will improve the musicality of samples (in the way conditioning on lyrics improved the singing), and this would also be a way of giving musicians more control over the generations. We expect human and model collaborations to be an increasingly exciting creative space.”
Musical AI is evolving fast. In late 2018, Project Magenta, a Google Brain effort “exploring the role of machine learning as a tool in the creative process,” presented Music Transformer, a model capable of generating songs with recognizable repetition. And in March 2019, Google released an algorithmic Google Doodle that let users create melodic homages to Bach.