A Montreal-based startup developed a set of deep learning algorithms that can copy anyone’s voice with only 60 seconds of sample audio.
Lyrebird, a startup spun off from the MILA lab at the University of Montréal and advised by Aaron Courville and Yoshua Bengio, claims to be the first of its kind to let users copy a voice in a matter of minutes and control the emotion of the generated speech.
Using CUDA, TITAN X Pascal GPUs and cuDNN with the Theano deep learning framework, they trained their recurrent neural network on two speakers, one male and one female, each reading ten hours of audiobooks. Once trained, the algorithm can generate 1,000 sentences in less than half a second. Their related paper, “SampleRNN: An Unconditional End-to-End Neural Audio Generation Model,” provides more details about their model.
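As the paper's title suggests, a model of this kind produces audio one sample at a time, with each new sample conditioned on the ones before it. The sketch below is only a toy illustration of that autoregressive loop, not Lyrebird's code or the SampleRNN architecture: `toy_next_sample_logits` is a hypothetical stand-in for a trained recurrent network, and the 256 quantization levels are an assumption made for the example.

```python
import math
import random

NUM_LEVELS = 256  # assumed 8-bit quantization of audio amplitudes


def toy_next_sample_logits(history):
    """Hypothetical stand-in for a trained recurrent network.

    Returns one score per quantization level, favoring levels close to
    the previous sample so the output varies smoothly.
    """
    prev = history[-1]
    return [-abs(level - prev) / 4.0 for level in range(NUM_LEVELS)]


def sample_from_logits(logits, rng):
    """Draw an index from the softmax distribution over the logits."""
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(score - peak) for score in logits]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def generate(num_samples, seed=0, start_level=NUM_LEVELS // 2):
    """Generate quantized audio one sample at a time, feeding each output back in."""
    rng = random.Random(seed)
    samples = [start_level]
    for _ in range(num_samples):
        logits = toy_next_sample_logits(samples)
        samples.append(sample_from_logits(logits, rng))
    return samples[1:]  # drop the artificial starting value


def dequantize(level):
    """Map a level in [0, NUM_LEVELS - 1] to an amplitude in [-1.0, 1.0]."""
    return 2.0 * level / (NUM_LEVELS - 1) - 1.0


# 1,600 samples would be a tenth of a second at an assumed 16 kHz sample rate.
waveform = [dequantize(level) for level in generate(1600)]
```

In a real system the stand-in function would be replaced by the trained network, which is where the GPU work described above comes in.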
The company unveiled an impressive public demo this week consisting of a series of audio samples imitating Donald Trump, Barack Obama, and Hillary Clinton. The samples are not completely believable yet, but are expected to improve over time.
The resulting speech can be put to a wide range of uses, says Lyrebird, including “reading of audio books with famous voices, for connected devices of any kind, for speech synthesis for people with disabilities, for animation movies or for video game studios.”
Lyrebird’s developer API is still under development, with no timetable for its release, but more than 6,000 people have registered for early access.