Last year, Baidu unveiled its Deep Voice AI, which could clone a human voice with just 30 minutes of training material. Since then, it has become much better, doing the same job with just a few seconds’ worth of audio. And last month, the Chinese AI titan announced that its AI had learned some new tricks. Besides cloning a voice faster than ever, the AI can now swap the gender of the voice and remove its accent.
Deep Voice is better now
In a recently published whitepaper, the Baidu team discussed two different training methods. One generates a more believable output but requires additional audio input. The other generates cloned audio much faster, but at lower quality. You can listen to some examples on the team’s GitHub page.
However, both models are faster than Baidu’s prior attempt with Deep Voice. The researchers also say that with tweaked algorithms and a more comprehensive dataset, the AI could be improved further. In a company blog post, they claim:
“In terms of naturalness of the speech and similarity to the original speaker, both demonstrate good performance, even with very few cloning audios.”
Through this research, the Baidu team wants to prove that machines, like humans, can learn complex tasks from limited datasets. Although mimicking voices seems to be a narrow use case, it matters to the researchers as a way to reduce a model’s footprint by fine-tuning or replacing clumsy algorithms.
“Humans can learn most new generative tasks from only a few examples, and it has motivated research on few-shot generative models,” the team says.
Deep Voice isn’t perfect yet; the AI’s voice still sounds a bit robotic. However, considering that this was barely possible even a year ago, the progress is quite impressive.
Moving towards a fake world
While some people find this voice-cloning AI quite interesting, I find it quite alarming. The world has already seen an AI that swaps one person’s face onto another person’s body, which was used for porn. Besides, NVIDIA developed an AI that can generate realistic images of people who don’t even exist. So, it’s easy to imagine what a fake voice could do.
That said, listening to Donald Trump’s speech in Queen Elizabeth’s voice, with her British accent, wouldn’t be that bad.