Musical AI is evolving fast. In March, Google released an algorithmic Google Doodle that let users create melodic homages to Bach. And late last year, Project Magenta, a Google Brain effort “exploring the role of machine learning as a tool in the creative process,” presented Music Transformer, a model capable of generating songs with recognizable repetition.
In what might be characterized as a small but noteworthy step forward in autonomous music generation research, San Francisco capped-profit firm OpenAI today detailed MuseNet, an AI system that can create four-minute compositions with 10 different instruments across styles “from country to Mozart to the Beatles.”
OpenAI plans to livestream pieces composed by MuseNet on Twitch later today starting at 12 p.m. Pacific, and has released a MuseNet-powered music tool that’ll be available through May 12. The MuseNet composer has two modes: simple mode, which plays an uncurated sample from a composer or style (optionally beginning with the start of a famous piece), and advanced mode, which lets you interact with the model directly to create a novel piece.
Here’s MuseNet prompted with the first 5 notes of Chopin:
As OpenAI technical staff member Christine Payne explains in a blog post, MuseNet, as with all deep neural networks, contains neurons (mathematical functions loosely modeled after biological neurons) arranged in interconnected layers that transmit “signals” from input data and slowly adjust the synaptic strength — weights — of each connection. But uniquely, it has attention: Every output element is connected to every input element, and the weightings between them are calculated dynamically. MuseNet isn’t explicitly programmed with an understanding of music, then, but instead discovers patterns of harmony, rhythm, and style by learning to predict tokens — notes encoded in a way that combines the pitch, volume, and instrument information — in hundreds of thousands of MIDI files. (It’s informed by OpenAI’s recent work on Sparse Transformer, which in turn was based on Google’s Transformer neural network architecture.)
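To make the idea of attention concrete, here is a minimal sketch in plain Python of scaled dot-product attention, the general mechanism behind Transformer-style models. It is an illustration under simplifying assumptions, not MuseNet's code: the function names and the two-dimensional toy vectors are invented for the example, and it omits the learned projections, the causal masking, and the sparse attention patterns a model like MuseNet would rely on.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Every query (output position) is compared with every key (input
    position), so the weights depend on the data, not on fixed wiring.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy "embeddings" for three positions in a sequence.
notes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attend(notes, notes, notes)
```

Each row of `weights` says how strongly one position draws on every other position, and those numbers are recomputed for every new input.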
MuseNet was trained on MIDI samples from a range of different sources, including ClassicalArchives, BitMidi, and the open source Maestro corpus. Payne and colleagues transformed them in various ways to improve the model’s generalizability, first by transposing them (by raising and lowering the pitches) and then by turning up or turning down the overall volumes of the various samples and slightly slowing or speeding up the pieces. To lend more “structural context,” they added mathematical representations (learned embeddings) that helped to track the passage of time in MIDI files. And they implemented an “inner critic” component that predicted whether a given sample was truly from the data set or if it was one of the model’s own past generations.
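The three augmentations described above (transposition, volume changes, and small tempo changes) can be sketched in a few lines of Python. This is a hedged illustration only: the `Note` structure, the function names, and the specific amounts are assumptions made for the example, and OpenAI has not published its pipeline in this form.

```python
from typing import NamedTuple

class Note(NamedTuple):
    pitch: int       # MIDI note number, 0-127
    velocity: int    # MIDI velocity (loudness), 1-127
    start: float     # onset time in seconds
    duration: float  # length in seconds

def transpose(notes, semitones):
    """Shift every pitch; drop notes that would leave the MIDI range."""
    return [n._replace(pitch=n.pitch + semitones)
            for n in notes if 0 <= n.pitch + semitones <= 127]

def scale_volume(notes, factor):
    """Turn the overall volume up or down, clamped to valid velocities."""
    return [n._replace(velocity=max(1, min(127, round(n.velocity * factor))))
            for n in notes]

def stretch_time(notes, factor):
    """A factor above 1 slows the piece down; below 1 speeds it up."""
    return [n._replace(start=n.start * factor, duration=n.duration * factor)
            for n in notes]

def augment(notes, semitones=0, volume=1.0, stretch=1.0):
    """Apply all three transformations to one training sample."""
    return stretch_time(scale_volume(transpose(notes, semitones), volume), stretch)

# A hypothetical three-note phrase: C4, E4, G4.
phrase = [Note(60, 80, 0.0, 0.5), Note(64, 80, 0.5, 0.5), Note(67, 90, 1.0, 1.0)]
variant = augment(phrase, semitones=2, volume=1.1, stretch=1.05)
```

Here `variant` is the same phrase a whole step higher, slightly louder, and about 5% slower, which gives a model another view of the same musical idea.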
MuseNet’s additional token types, one for composer and another for instrumentation, afford greater control over the kinds of samples it can generate, Payne explains. During training, these tokens were prepended to each music sample so that MuseNet learned to use this information in making note predictions. Then, at generation time, the model was conditioned to create samples in a chosen style by starting with a prompt like a Rachmaninoff piano start or the band Journey’s piano, bass, guitar, and drums.
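A rough sketch of how such conditioning could work follows. The token spellings (`composer:...`, `instruments:...`, `piano:v72:C5`) and the function names are hypothetical, chosen only to show the idea of combining pitch, volume, and instrument into one token and placing style tokens at the front of the sequence. MuseNet's real vocabulary may look different.

```python
def note_token(instrument, volume, pitch):
    """Fold instrument, volume, and pitch into a single token string."""
    return f"{instrument}:v{volume}:{pitch}"

def build_sequence(composer, instruments, notes):
    """Prepend composer and instrumentation tokens to the note tokens."""
    header = [f"composer:{composer}", "instruments:" + "+".join(instruments)]
    return header + [note_token(*note) for note in notes]

def training_pairs(tokens):
    """Next-token prediction: each prefix is the context for the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# A hypothetical two-note sample labeled as Chopin on piano.
sequence = build_sequence("chopin", ["piano"],
                          [("piano", 72, "C5"), ("piano", 72, "E5")])
pairs = training_pairs(sequence)

# At generation time, the same header acts as the prompt, and it can be
# combined with instruments the composer never wrote for.
prompt = build_sequence("chopin", ["piano", "drums", "bass", "guitar"], [])
```

Because the style tokens sit at the start of every context in `pairs`, each note prediction can take them into account, which is what makes steering by prompt possible.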
“Since MuseNet knows many different styles, we can blend generations in novel ways,” she added. “[For example, the model was] given the first six notes of a Chopin Nocturne, but is asked to generate a piece in a pop style with piano, drums, bass, and guitar. [It] manages to blend the two styles convincingly.”
Payne notes that MuseNet isn’t perfect — because it generates each note by calculating the probabilities across all possible notes and instruments, it occasionally makes unharmonious choices. And predictably, it has a difficult time with incongruous pairings of styles and instruments, such as Chopin with bass and drums.
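The effect is easy to reproduce in miniature. In the sketch below, which assumes made-up scores for three candidate notes and is not taken from MuseNet, a next note is drawn at random in proportion to its probability. The unlikely, clashing note is rarely picked, but over thousands of draws it does turn up.

```python
import math
import random
from collections import Counter

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(candidates, logits, rng, temperature=1.0):
    """Draw one candidate at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]

# Hypothetical scores: two notes that fit the harmony, one that clashes.
candidates = ["piano:G4", "piano:A4", "piano:C#4"]
logits = [4.0, 3.0, 0.0]

rng = random.Random(0)  # fixed seed so the run is repeatable
counts = Counter(sample_next(candidates, logits, rng) for _ in range(10_000))
```

Lowering `temperature` makes the clashing note rarer still, at the cost of more predictable output.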
But she says that it’s an excellent test for AI architectures with attention, because it’s easy to hear whether the model is capturing long-term structure on the training data set’s tokens. “It’s much more obvious if a music model messes up structure by changing the rhythm, in a way that it’s less clear if a text model goes on a brief tangent,” she wrote.