Hello World! Robot Responds to Human Gestures

By: Madeleine Waldie, Abhinav Ayalur, Jackson Moffet, and Nikhil Suresh
This summer a team of four high school interns, the Neural Ninjas, developed a gesture recognition neural network using Python and C++ to teach a robot to recognize a human wave.
Working with robots is familiar territory for them. All four are members of teams in the FIRST Robotics Competition (FRC), an international high school robotics competition.
“Working on this project at NVIDIA was an incredibly rewarding experience,” the students said. “Each of us learned valuable information about deep learning and how to implement it on a Jetson.”

Developing a Neural Network

The interns used NVIDIA Tesla V100 GPUs and the cuDNN-accelerated TensorFlow deep learning framework to train a fully connected feedforward neural network, which they call “Post OpenPose Neural Network (POPNN).” POPNN detects a person waving even when there are multiple people in the frame.
As an added challenge, they used PyTorch to build a Long Short-Term Memory Neural Network (LSTM) that also detects waving.
“The LSTM was more complex than POPNN, so creating it involved a lot of creative thinking,” the interns said.
Both neural networks can be extended to multi-class detection, but the team found that the best-performing models were binary. Even so, they were able to train good models that distinguish waves, x-poses, y-poses, dabs, and none of the above.
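To make the idea of a fully connected feedforward network concrete, here is a minimal sketch of a forward pass for a binary wave / no-wave classifier. It is written in plain Python rather than TensorFlow so it runs anywhere. The layer sizes, the ReLU and sigmoid activations, and the name wave_probability are assumptions for illustration; the article does not describe POPNN's actual architecture.

```python
import math
from typing import List


def dense(inputs: List[float], weights: List[List[float]], biases: List[float]) -> List[float]:
    """One fully connected layer: every output unit sees every input value."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values: List[float]) -> List[float]:
    """Zero out negative activations (an assumed choice of nonlinearity)."""
    return [max(0.0, v) for v in values]


def sigmoid(z: float) -> float:
    """Squash a real-valued score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def wave_probability(features, hidden_w, hidden_b, out_w, out_b) -> float:
    """Hypothetical two-layer network: features -> hidden layer -> one output."""
    hidden = relu(dense(features, hidden_w, hidden_b))
    (score,) = dense(hidden, [out_w], [out_b])
    return sigmoid(score)
```

A binary model like this needs only one output unit. A multi-class version would instead end in one output per gesture, typically followed by a softmax.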

Pose Estimation

The team used TF-Pose-Estimation, a TensorFlow implementation of the pose estimation neural network OpenPose, to classify images of humans and identify key points on the human body such as noses, elbows, and wrists.
To collect the input data for the neural network, the team ran TF-Pose-Estimation inference on a Jetson TX2 and saved the position of each body part in every frame. They then computed the differences between the body part positions across four consecutive frames and saved them to a data file. Once generated, the data files were fed into POPNN.
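The frame-differencing step can be sketched as follows. This is an illustration only, not the team's code: the body part names, the (x, y) tuple layout, and the function names frame_deltas and sliding_windows are all assumptions, and the sketch assumes every body part is detected in every frame.

```python
from typing import Dict, Iterator, List, Tuple

Keypoints = Dict[str, Tuple[float, float]]  # body part name -> (x, y)

# Hypothetical subset of body parts; a real pose estimator reports more.
BODY_PARTS = ["nose", "right_elbow", "right_wrist"]


def frame_deltas(frames: List[Keypoints], parts: List[str] = BODY_PARTS) -> List[float]:
    """Flatten the per-part (dx, dy) between each pair of consecutive frames."""
    features: List[float] = []
    for prev, curr in zip(frames, frames[1:]):
        for part in parts:
            px, py = prev[part]
            cx, cy = curr[part]
            features.extend([cx - px, cy - py])
    return features


def sliding_windows(frames: List[Keypoints], window: int = 4) -> Iterator[List[float]]:
    """Yield one feature vector for every run of `window` consecutive frames."""
    for start in range(len(frames) - window + 1):
        yield frame_deltas(frames[start:start + window])
```

With four frames and three tracked parts, each feature vector holds 3 frame pairs x 3 parts x 2 coordinates = 18 values. Using differences rather than raw positions means the features describe motion, so they do not depend on where the person stands in the frame.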
Generating data is no easy task and can be very time-consuming. To make the data collection process easier, the team captured videos of people waving and not waving. They were able to use the videos to generate data, so they didn’t have to stand in front of a camera for hours on end.
To speed up TF-Pose-Estimation’s inference, the team multithreaded the pose estimator. They divided the work into four separate threads: video capture, display, pose estimation, and optical flow. As a result, the camera frame rate increased tenfold.
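One common way to split such a pipeline across threads is to connect the stages with queues, so that a slow stage does not block frame capture. The sketch below uses Python's standard threading and queue modules. The linear video -> pose -> optical flow -> display wiring, the queue size, and the names run_pipeline, estimate_pose, and optical_flow are assumptions; the article does not say how the team's four threads exchange data.

```python
import queue
import threading

STOP = object()  # sentinel that tells each downstream stage to shut down


def stage(work, inbox, outbox):
    """Apply `work` to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # pass the shutdown signal along
            break
        outbox.put(work(item))


def run_pipeline(frames, estimate_pose, optical_flow):
    """Run four threads (video, pose, optical flow, display) and return what was shown."""
    raw, posed, tracked = (queue.Queue(maxsize=4) for _ in range(3))
    shown = []

    def video():
        # Stand-in for reading frames from a camera.
        for frame in frames:
            raw.put(frame)
        raw.put(STOP)

    def display():
        # Stand-in for drawing results on screen.
        while True:
            item = tracked.get()
            if item is STOP:
                break
            shown.append(item)

    threads = [
        threading.Thread(target=video),
        threading.Thread(target=stage, args=(estimate_pose, raw, posed)),
        threading.Thread(target=stage, args=(optical_flow, posed, tracked)),
        threading.Thread(target=display),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shown
```

Because each queue has a single producer and a single consumer, frames leave the pipeline in the order they entered. In CPython, threads like these help most when the stages spend their time in camera I/O or in native code that releases the global interpreter lock, so the speedup on other workloads may differ.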

Implementation on a Robot

The students then deployed their trained neural networks to a Jetson TX2, which they installed on a humanoid robot. The team used the Robot Operating System (ROS) to send data from the Jetson TX2 to the robot.
“When someone first enters the robot’s field of view, the system uses the ROS data to watch for a wave,” the interns said. “When the person begins waving, the robot turns its head, centers on the person within the frame, and then turns its body to face the person.”
Using the information sent over ROS, the robot gestures back. When someone waves at the robot, it waves back and then asks the person a question, such as “Are you enjoying your day so far?” The person can answer with an x-pose or a y-pose to signify yes or no. If someone dabs at the robot, the robot dabs back.
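The interaction described above behaves like a small state machine, sketched below. The state names, the action strings, and the step function are hypothetical. In particular, the mapping of y-pose to yes and x-pose to no is an assumption, and so is the robot's response after it hears an answer, since the article does not spell out either.

```python
from typing import List, Tuple

IDLE, AWAITING_ANSWER = "idle", "awaiting_answer"

# Assumption: the article does not say which pose means which answer.
ANSWERS = {"y-pose": "yes", "x-pose": "no"}


def step(state: str, gesture: str) -> Tuple[str, List[str]]:
    """Return the next state and the robot's actions for one detected gesture."""
    if gesture == "dab":
        # The robot dabs back without changing the conversation state.
        return state, ["dab"]
    if state == IDLE and gesture == "wave":
        return AWAITING_ANSWER, ["wave", "ask: Are you enjoying your day so far?"]
    if state == AWAITING_ANSWER and gesture in ANSWERS:
        # Placeholder reaction: just record which answer was understood.
        return IDLE, ["heard: " + ANSWERS[gesture]]
    # Any other gesture (or "none of the above") is ignored.
    return state, []
```

Keeping the state explicit means an x-pose or y-pose is treated as an answer only after the robot has asked its question.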
“In addition to gaining new technical skills, we learned what it’s like to work in a corporate environment. From brainstorming to the final stages of the project, we learned to collaborate, delegate tasks, and work more efficiently,” the interns said.
To learn more about the project, visit their GitHub repository.
https://github.com/NVIDIA-Jetson/Gesture-Recognition