Meet the AI Choreographer: This New Model Can Help You With Your Next Dance Video

To help automatically create dance videos, NVIDIA researchers, in collaboration with the University of California, Merced, developed a deep learning model that can automatically compose new dance moves that are diverse, style-consistent, and matched to the musical beat.

“This is a challenging but interesting generative task with the potential to assist and expand content creations in arts and sports, such as a theatrical performance, rhythmic gymnastics, and figure skating,” the NVIDIA researchers stated in a paper presented this week at the 2019 Conference on Neural Information Processing Systems (NeurIPS 2019) in Vancouver, Canada. 

Generated dance sequences (second row: music beats; third row: kinematic beats)

At the core of the work is a decomposition-to-composition framework, which first learns how to move and then how to compose.

A schematic overview of the decomposition-to-composition framework. In the top-down decomposition phase, the team normalizes dance units segmented from real dance sequences using a kinematic beat detector, then trains the DU-VAE to model the dance units. In the bottom-up composition phase, given a paired piece of music and dance, the team leverages the MM-GAN to learn how to organize the dance units conditioned on the given music. At test time, the researchers extract style and beats from the input music, synthesize a sequence of dance units in a recurrent manner, and finally apply the beat warper to the generated dance-unit sequence to render the output dance.
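The decomposition phase hinges on splitting a motion sequence into dance units at kinematic beats. As a rough illustration of the idea (not the paper's detector), kinematic beats can be approximated as moments where overall joint motion nearly pauses; the sketch below detects local minima of mean joint speed and segments the pose sequence there. All function names are illustrative.

```python
import numpy as np

def kinematic_beats(poses):
    """Approximate kinematic beats as local minima of joint motion magnitude.

    poses: array of shape (T, J, 2) -- 2D joint positions over T frames.
    Returns frame indices where overall motion nearly pauses.
    """
    velocity = np.diff(poses, axis=0)                      # (T-1, J, 2)
    speed = np.linalg.norm(velocity, axis=2).mean(axis=1)  # mean joint speed per step
    beats = []
    for t in range(1, len(speed) - 1):
        # a local minimum of speed marks a candidate kinematic beat
        if speed[t] <= speed[t - 1] and speed[t] < speed[t + 1]:
            beats.append(t)
    return beats

def segment_dance_units(poses, beats):
    """Split a pose sequence into dance units at the detected beats."""
    bounds = [0] + list(beats) + [len(poses)]
    return [poses[a:b] for a, b in zip(bounds[:-1], bounds[1:]) if b > a]
```

In the actual framework, each resulting dance unit would be normalized and fed to the DU-VAE, while the MM-GAN learns to chain units conditioned on the music.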

To train the generative adversarial network used in the system, the team collected dance videos from three representative dance categories: ballet, Zumba, and hip-hop. In total, the team acquired more than 361,000 clips, or approximately 71 hours of dance footage.

For pose processing, the team used OpenPose, an open-source, real-time multi-person system developed by Carnegie Mellon University that jointly detects human body, hand, facial, and foot keypoints in single images.

Researchers’ Conference Submission Video

The model was trained using the PyTorch deep learning framework on NVIDIA V100 GPUs; inference runs on the same GPUs. In future iterations of the work, the team plans to add more dance styles, such as pop dance and partner dance.


Photo Realistic Video Example

“Extensive qualitative and quantitative evaluations demonstrate that the synthesized dances by the proposed method are not only realistic and diverse but also style-consistent and beat-matching,” the researchers stated in their paper. 
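Beat-matching can be quantified in a straightforward way, for instance as the fraction of kinematic beats that land close to a music beat. The sketch below is one such simple proxy metric, not the specific evaluation protocol used in the paper.

```python
def beat_hit_rate(kinematic_beats, music_beats, tolerance=2):
    """Fraction of kinematic beats that fall within `tolerance` frames
    of some music beat -- a simple proxy for beat-matching quality."""
    if not kinematic_beats:
        return 0.0
    hits = sum(
        1 for kb in kinematic_beats
        if any(abs(kb - mb) <= tolerance for mb in music_beats)
    )
    return hits / len(kinematic_beats)
```

For example, with kinematic beats at frames 10, 20, and 33 and music beats at 10, 21, and 40, two of the three kinematic beats fall within two frames of a music beat, giving a hit rate of 2/3.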

The source code and models will be published on GitHub after the conference. 



This paper is among several research projects presented by NVIDIA Research at the NeurIPS conference this week. Overall, the NVIDIA Research team consists of more than 200 scientists around the globe, focusing on areas including AI, computer vision, self-driving cars, robotics, and graphics.