Estimating 6D Pose from Regular 2D Images with AI

Researchers from NVIDIA, along with collaborators from academia, developed a deep learning-based system, DeepIM, that performs 6D object pose estimation from a standard 2D color image, with accuracy the authors report as close to that of methods that also rely on depth images.

In robotics, a robotic arm needs to know the location and orientation of the objects in its vicinity to detect and move them successfully. This allows the robot to operate safely and effectively alongside humans. The position and orientation of an object in a scene is referred to as its 6D pose, where the D stands for degrees of freedom: three for position and three for orientation.
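To make the six degrees of freedom concrete, here is a minimal sketch of how a 6D pose could be stored and used to move a point from object coordinates into camera coordinates. The class name `Pose6D`, the roll/pitch/yaw convention, and the units are assumptions chosen for illustration; the article does not say how the researchers represent poses internally.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose6D:
    # Position along the camera x, y, z axes, in metres (3 degrees of freedom).
    x: float
    y: float
    z: float
    # Orientation as roll, pitch, yaw in radians (3 more degrees of freedom).
    roll: float
    pitch: float
    yaw: float

    def rotation_matrix(self):
        """Return the 3x3 matrix R = Rz(yaw) * Ry(pitch) * Rx(roll)."""
        cr, sr = math.cos(self.roll), math.sin(self.roll)
        cp, sp = math.cos(self.pitch), math.sin(self.pitch)
        cy, sy = math.cos(self.yaw), math.sin(self.yaw)
        return [
            [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
            [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
            [-sp, cp * sr, cp * cr],
        ]

    def transform(self, point):
        """Map a 3D point from object coordinates into camera coordinates."""
        R = self.rotation_matrix()
        t = (self.x, self.y, self.z)
        return tuple(
            sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)
        )


# An object half a metre in front of the camera, turned 90 degrees about z.
pose = Pose6D(x=0.1, y=0.0, z=0.5, roll=0.0, pitch=0.0, yaw=math.pi / 2)
corner_in_camera = pose.transform((1.0, 0.0, 0.0))
```

Other rotation encodings, such as quaternions or axis-angle vectors, carry the same three rotational degrees of freedom; Euler angles are used here only because they are easy to read.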

“Our method significantly outperforms the state-of-the-art 6D pose estimation methods using color images only. The performance of our method is already close to methods that use depth images for pose refinement such as using the iterative closest point algorithm,” the researchers stated in their paper.  

Using NVIDIA Tesla V100 GPUs on a DGX Station, with the cuDNN-accelerated MXNet framework, the team trained their system on thousands of images from the LINEMOD dataset.

“For every image, we generate 10 random poses near the ground truth pose, resulting in 2,000 training samples for each object in the training set,” the team said.  “Furthermore, we generate 10,000 synthetic images for each object where the pose distribution is similar to the real training set. Thus, we have a total of 12,000 training samples for each object in training.”
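The sample counts in the quote can be checked with a short sketch that perturbs each ground-truth pose ten times. The figure of 200 real images per object is implied by the quoted numbers (2,000 / 10) rather than stated outright, and the Gaussian noise scales, the pose tuple layout, and the function names below are assumptions for illustration, not the team's actual settings.

```python
import math
import random


def perturb_pose(gt_pose, rng, trans_sigma=0.01, rot_sigma_deg=15.0):
    """Return a noisy copy of gt_pose = (x, y, z, roll, pitch, yaw).

    The noise scales are illustrative assumptions, not values from the article.
    """
    x, y, z, roll, pitch, yaw = gt_pose
    rot_sigma = math.radians(rot_sigma_deg)
    return (
        x + rng.gauss(0.0, trans_sigma),
        y + rng.gauss(0.0, trans_sigma),
        z + rng.gauss(0.0, trans_sigma),
        roll + rng.gauss(0.0, rot_sigma),
        pitch + rng.gauss(0.0, rot_sigma),
        yaw + rng.gauss(0.0, rot_sigma),
    )


def make_training_poses(gt_poses, poses_per_image=10, seed=0):
    """Generate poses_per_image perturbed poses for every ground-truth pose."""
    rng = random.Random(seed)
    return [perturb_pose(gt, rng) for gt in gt_poses for _ in range(poses_per_image)]


# 200 real images per object is inferred from the quote: 2,000 samples / 10 poses.
real_images_per_object = 200
gt_poses = [(0.0, 0.0, 0.5, 0.0, 0.0, 0.0)] * real_images_per_object

real_samples = make_training_poses(gt_poses)
synthetic_samples = 10_000  # synthetic images per object, from the quote
total_samples = len(real_samples) + synthetic_samples
print(len(real_samples), total_samples)  # prints "2000 12000"
```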

DeepIM uses a FlowNetSimple backbone to predict a relative transformation to match the observed and rendered image of an object. Additional mask and flow losses improve stability during training.

Once trained, the neural network can match the pose of an object from 2D color images alone. It outputs a relative pose transformation that can be applied to an initial pose estimate to improve the 6D pose estimate, the team said.
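Applying a relative transformation to an initial pose can be sketched as a standard rigid-body composition. The code below is a generic illustration under stated assumptions: poses are stored as a 3x3 rotation matrix plus a translation vector, `toy_predictor` is a hypothetical stand-in for the trained network, and repeating the update for several iterations is an assumption here. The paper's own parameterization of the relative transformation may differ.

```python
def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


def matvec3(a, v):
    """Multiply a 3x3 matrix by a length-3 vector."""
    return tuple(sum(a[i][k] * v[k] for k in range(3)) for i in range(3))


def apply_relative_pose(R, t, dR, dt):
    """Compose the relative transform (dR, dt) with the current pose (R, t).

    A point p maps to dR * (R * p + t) + dt, so the updated pose is
    R_new = dR * R and t_new = dR * t + dt.
    """
    R_new = matmul3(dR, R)
    rotated_t = matvec3(dR, t)
    t_new = tuple(rotated_t[i] + dt[i] for i in range(3))
    return R_new, t_new


def refine_pose(R, t, predict_delta, iterations=4):
    """Repeatedly ask predict_delta for a correction and apply it."""
    for _ in range(iterations):
        dR, dt = predict_delta(R, t)
        R, t = apply_relative_pose(R, t, dR, dt)
    return R, t


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
TARGET_T = (0.16, 0.0, 1.0)  # made-up "true" translation for the toy example


def toy_predictor(R, t):
    """Hypothetical stand-in for the network: step halfway to TARGET_T."""
    dt = tuple(0.5 * (TARGET_T[i] - t[i]) for i in range(3))
    return IDENTITY, dt


R_final, t_final = refine_pose(IDENTITY, (0.0, 0.0, 1.0), toy_predictor)
```

With this toy predictor, each pass halves the remaining translation error, so four passes leave one sixteenth of the initial offset. A real predictor would instead compare the observed image with a rendering of the object at the current pose.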

For inference, the researchers used an NVIDIA GeForce GTX 1080 Ti GPU.

“This work opens up various directions for future research. For instance, we expect that a stereo version of DeepIM could further improve pose accuracy. Furthermore, DeepIM indicates that it is possible to produce accurate 6D pose estimates using color images only, enabling the use of cameras that capture high-resolution images at high frame rates with a large field of view, providing estimates useful for applications such as robot manipulation,” the researchers said.

The team, composed of researchers from Tsinghua University, the University of Washington, and NVIDIA, is presenting its research at ECCV in Munich, Germany, this week.
