NVIDIA Research Puts Smart Picking Within Grasp

As another step toward enabling robots to work effectively in complex environments, robotics researchers from NVIDIA have developed a novel deep learning-based system that allows a robot to perceive household objects in its environment for the purpose of grasping the objects and interacting with them.

With this technique, the robot is able to perform simple pick-and-place operations on known household objects, such as handing an object to a person or grasping an object out of a person’s hand.

The research was first introduced at the Conference on Robot Learning (CoRL) in Zurich, Switzerland. The work is now being shown at the Conference on Neural Information Processing Systems in Montreal, Canada (NeurIPS) this week. 

The research, which builds on previous work developed by NVIDIA researchers, allows robots to precisely infer the pose of objects around them from a standard RGB camera.  Knowing the 3D position and orientation of objects in a scene, often referred to as 6-DoF (degrees of freedom) pose is critical, as it allows robots to manipulate objects even when those objects are not in the same place every time.

“We want robots to be able to interact with their environment in a safe and skillful manner,” said Stan Birchfield, a Principal Research Scientists at NVIDIA. “With our algorithm, and a single image, a robot can infer the 3D pose of an object for the purpose of grasping and manipulating it,” he explained.

The algorithm, which performs more robustly than leading methods, aims to solve a disconnect in computer vision and robotics, namely, that most robots currently do not have the perception they need to be able to handle disturbances in the environment.  This work is important because it is the first time in computer vision that an algorithm trained only on synthetic data (generated by a computer) is able to beat a state-of-the-art network trained on real images for object pose estimation on several objects of a standard benchmark.  Synthetic data has the advantage over real data in that it is possible to generate an almost unlimited amount of labeled training data for deep neural networks.

“Most industrial robots being sold today lack perception, they don’t really have a sense of the world around them,” Birchfield explained. “We’re laying the groundwork for the next generation robot, and we’re a step closer to collaborative robots with this work.”  

Using NVIDIA Tesla V100 GPUs on a DGX Station, with the cuDNN-accelerated PyTorch deep learning framework, the researchers trained a deep neural network on synthetic data generated by a custom plugin developed by NVIDIA for Unreal Engine.  This plugin is publicly available for other researchers to use.

“Specifically, we use a combination of non-photorealistic domain randomized (DR) data and photorealistic data to leverage the strengths of both,” the researchers stated in their paper. “These two types of data complement one another, yielding results that are much better than those achieved by either alone. Synthetic data has an additional advantage in that it avoids overfitting to a particular dataset distribution, thus producing a network that is robust to lighting changes, camera variations, and backgrounds,” the team explained.

Example images from the domain randomized (left) and photorealistic (right) datasets used for training

Inference was performed on an NVIDIA TITAN X GPU. The inference code is also publicly available

“We have shown that a network trained only on synthetic data can achieve state-of-the-art performance compared with a network trained on real data, and that the resulting poses are of sufficient accuracy for robotic manipulation.”

The NVIDIA team was comprised of researchers Jonathan Tremblay, Thang To, Balakumar Sundaralingam, Yu Xiang, Dieter Fox, and Stan Birchfield.

Read more>