AI Helps Transform Audio Into Music-Playing Avatars

Researchers from Facebook, Stanford, and the University of Washington developed a deep learning-based method that transforms audio of musical instruments into skeleton predictions, which can then be used to animate an avatar.

“The key idea is to create an animation of an avatar that moves their hands similarly to how a pianist or violinist would do, just from audio,” the researchers stated in their paper. “We believe the correlation between audio to human body is very promising for a variety of applications in VR/AR and recognition.”

Using NVIDIA Tesla GPUs, the team trained their system on hours of violin and piano performance footage found on YouTube.

“The intuition behind our choice of video was to have clear high quality music sound, no background noise, no accompanying instruments, solo performance. On the video quality side, we searched for videos of high resolution, stable fixed camera, and bright lighting. We preferred longer videos for continuity,” the researchers said.

Method overview: (a) the method takes an audio signal as input, e.g., piano music, (b) which is fed into an LSTM network to predict body movement points, (c) which in turn are used to animate an avatar playing the input music on a piano (the avatar and piano are 3D models, while the rest is a real apartment background).
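The core idea above, a recurrent network that maps a sequence of audio feature frames to per-frame body keypoints, can be sketched as follows. This is a minimal illustrative forward pass, not the paper's actual architecture: the feature dimension, hidden size, keypoint count, and random weights are all assumptions for demonstration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W, U, b stack the input/forget/output/candidate
    gates along the first axis (4*H rows for hidden size H)."""
    z = W @ x + U @ h + b
    H = h.shape[0]
    i = sigmoid(z[:H])          # input gate
    f = sigmoid(z[H:2 * H])     # forget gate
    o = sigmoid(z[2 * H:3 * H]) # output gate
    g = np.tanh(z[3 * H:])      # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def audio_to_keypoints(frames, params, num_keypoints=21):
    """Map audio feature frames (T, audio_dim) to per-frame 2D keypoints
    via an LSTM followed by a linear output layer."""
    W, U, b, W_out, b_out = params
    H = U.shape[1]
    h, c = np.zeros(H), np.zeros(H)
    out = []
    for x in frames:
        h, c = lstm_step(x, h, c, W, U, b)
        # project hidden state to (x, y) coordinates per keypoint
        out.append((W_out @ h + b_out).reshape(num_keypoints, 2))
    return np.stack(out)  # (T, num_keypoints, 2)

# Illustrative sizes: 13-dim audio features, 32 hidden units, 21 keypoints.
rng = np.random.default_rng(0)
audio_dim, H, K = 13, 32, 21
params = (rng.standard_normal((4 * H, audio_dim)) * 0.1,
          rng.standard_normal((4 * H, H)) * 0.1,
          np.zeros(4 * H),
          rng.standard_normal((2 * K, H)) * 0.1,
          np.zeros(2 * K))
frames = rng.standard_normal((100, audio_dim))  # 100 audio feature frames
kps = audio_to_keypoints(frames, params)
print(kps.shape)  # (100, 21, 2)
```

Each audio frame updates the LSTM's hidden state, so a predicted pose can depend on the notes that came before it, which is what makes a recurrent model a natural fit for this task.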

To increase realism in the VR/AR predictions, the team said it plans to complement the training data with sensor information or MIDI files.

Read more>