AI can generate storyboard animations from scripts, spot potholes and cracks in roads, and teach four-legged robots to recover when they fall. But what about adapting one person’s singing style to that of another? Yep, it’s got that down pat, too. In a paper published on the preprint server Arxiv.org (“Unsupervised Singing Voice Conversion”), scientists at Facebook AI Research and Tel Aviv University describe a system that directly converts audio of one singer to the voice of another. All the more impressive, it’s unsupervised, meaning it learns to perform the conversion from unclassified, unannotated data.
The team claims that their model was able to learn to convert between singers from just 5-30 minutes of their singing voices, thanks in part to an innovative training scheme and data augmentation technique.
“[Our approach] could lead, for example, to the ability to free oneself from some of the limitations of one’s own voice,” the paper’s authors wrote. “The proposed network is not conditioned on the text or on the notes [and doesn’t] require parallel training data between the various singers, nor [does it] employ a transcript of the audio to either text … or to musical notes … While existing pitch correction methods … correct local pitch shifts, our work offers flexibility along the other voice characteristics.”
As the researchers explain, their method builds on a WaveNet autoencoder. WaveNet is a generative model for raw audio waveforms developed by Google’s DeepMind, and an autoencoder is a type of AI that learns representations of data without supervision. The method also employs backtranslation, which involves converting one data sample to a target sample (in this case, one singer’s voice to another), translating the result back, and adjusting the system if the round trip doesn’t match the original. Additionally, the team trained on synthetic samples from “virtual identities” that are closer to the source singer than other singers are, and used a “confusion network” that ensured the system’s internal representation remained singer-agnostic.
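To make the backtranslation idea concrete, here is a deliberately tiny sketch, not the paper's model. It assumes a toy world in which a singer's "voice" is a single number added to the audio, and it swaps in a mean-squared error where the paper uses a softmax-based loss. The functions `encode`, `decode`, `convert` and `backtranslation_loss` are hypothetical names chosen for illustration.

```python
def encode(audio):
    """Toy singer-agnostic encoder: strip the mean, keep only the 'content'."""
    mean = sum(audio) / len(audio)
    return [x - mean for x in audio]


def decode(content, singer):
    """Toy decoder: re-voice the content by adding a scalar singer embedding."""
    return [x + singer for x in content]


def convert(audio, target_singer):
    """Convert audio to the target singer's 'voice'."""
    return decode(encode(audio), target_singer)


def backtranslation_loss(audio, source_singer, target_singer):
    """Convert to the target, convert back, and measure the round-trip error."""
    converted = convert(audio, target_singer)
    restored = convert(converted, source_singer)
    # Assumption: mean-squared error stands in for the paper's softmax-based loss.
    return sum((r - a) ** 2 for r, a in zip(restored, audio)) / len(audio)


original = [0.5, 1.5, 1.0]   # toy clip sung by a singer whose embedding is 1.0
converted = convert(original, 3.0)
loss = backtranslation_loss(original, source_singer=1.0, target_singer=3.0)
```

In a real system the loss would be nonzero at first, and training would adjust the encoder and decoder to shrink it. In this toy the round trip is exact only when the source embedding matches the clip, which is the signal backtranslation exploits.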
The AI was trained in two phases. In the first, a softmax-based reconstruction loss (a mathematical function that measures how far the output is from the original) was applied to the samples of each singer separately. In the second, the system generated samples of novel singers by mixing the vector embeddings (i.e., numerical representations) of the training singers, and those samples fed into the backtranslation step.
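Mixing embeddings can be pictured as a weighted average of two singers' vectors. The sketch below is an illustration under assumptions: the article does not say how the mixing weight is chosen, so a weight drawn uniformly between 0 and 1 is assumed here, and `mix_embeddings`, `sample_virtual_singer` and the three-singer table are hypothetical.

```python
import random


def mix_embeddings(emb_a, emb_b, alpha):
    """Blend two singer embeddings; alpha=1.0 returns emb_a, alpha=0.0 returns emb_b."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(emb_a, emb_b)]


def sample_virtual_singer(embeddings, rng):
    """Pick two distinct training singers and blend them with a random weight."""
    name_a, name_b = rng.sample(sorted(embeddings), 2)
    alpha = rng.random()  # assumption: uniform weight in [0, 1)
    return mix_embeddings(embeddings[name_a], embeddings[name_b], alpha)


# Hypothetical two-dimensional embeddings for three training singers.
embeddings = {
    "singer_a": [1.0, 0.0],
    "singer_b": [0.0, 1.0],
    "singer_c": [1.0, 1.0],
}
virtual = sample_virtual_singer(embeddings, random.Random(0))
blend = mix_embeddings(embeddings["singer_a"], embeddings["singer_b"], 0.25)
```

Each blended vector describes a "singer" who never existed but sits between two real ones, which gives the system new conversion targets to practice on.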
To augment the training data sets, the authors transformed audio clips by playing the signals backward and by imperceptibly shifting the phase. “[It] increases by fourfold the size of the dataset,” they wrote. “The first augmentation creates a gibberish song that is nevertheless identifiable as the same singer; the second augmentation creates a perceptually indistinguishable but novel signal for training.”
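The fourfold figure follows from combining the two transformations: each clip yields the original, a reversed copy, a phase-shifted copy, and a copy that is both. The sketch below assumes the phase shift is a 180-degree inversion (multiplying every sample by -1), which is one inaudible way to change the phase; the article does not specify the exact shift. The function `augment` is a hypothetical name, and the clip is a made-up list of samples.

```python
def augment(samples):
    """Return four variants of a mono waveform given as a list of floats."""
    reversed_clip = samples[::-1]                       # played backward
    inverted_clip = [-x for x in samples]               # assumed 180-degree phase shift
    reversed_inverted = [-x for x in reversed_clip]     # both at once
    return [list(samples), reversed_clip, inverted_clip, reversed_inverted]


clip = [0.0, 0.25, -0.5, 1.0]   # made-up samples standing in for a recording
variants = augment(clip)
```

Because inverting a waveform changes no magnitudes, the inverted copy sounds the same to a listener while still giving the network different numbers to learn from.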
In experiments, the team sourced two publicly available data sets containing songs performed by various singers: Stanford’s Digital Archive of Mobile Performances (DAMP) corpus and the National University of Singapore’s Sung and Spoken Corpus (NUS-48E). From the first, they randomly selected five singers with 10 songs each (nine of which they used to train the AI system), and from the second, they chose 12 singers with four songs each, all of which they used for training.
Next, they had human reviewers judge on a scale of 1-5 the similarity of generated voices to the target singing voice, and used an automatic test involving a classification system to evaluate the samples’ quality a bit more objectively. The reviewers gave the converted audio an average score of about 4 (which is considered good quality), while the automated test found that the identification accuracy of the generated samples was almost as high as that of the reconstructed samples.
They leave to future work methods that can perform the conversion in the presence of background music.