Google Researchers Use AI to Bring Still Photos to Life

Researchers from Google have developed a deep learning-based system that can create short video clips from still images shot on stereo cameras, VR cameras, and dual-lens cameras such as the iPhone 7 Plus or iPhone X.

“Given two images with known camera parameters, our goal is to learn a deep neural net to infer a global scene representation suitable for synthesizing novel views of the same scene, and in particular extrapolating beyond the input views,” the researchers wrote in their research paper.

Using NVIDIA Tesla P100 GPUs and the cuDNN-accelerated TensorFlow deep learning framework, the team trained its system on over 7,000 real estate videos posted on YouTube.

“Our view synthesis system based on multiplane images (MPIs) can handle both indoor and outdoor scenes,” the researchers said. “We successfully applied it to scenes which are quite different from those in our training dataset. The learned MPIs are effective at representing surfaces which are partially reflective or transparent.”

The team says its system outperforms previous methods and can effectively magnify the narrow baseline of stereo imagery captured by cell phones and stereo cameras.

“We show that our method achieves better numerical performance on a held-out test set, and also produces more spatially stable output imagery, since our inferred scene representation is shared for synthesizing all target views.”

Overview of the Google Research end-to-end learning pipeline. Given an input stereo image pair, the team uses a fully-convolutional deep network to infer the multiplane image representation. For each plane, the alpha image is directly predicted by the network, and the color image is blended by using the reference source and the predicted background image, where the blending weights are also output from the network. During training, the network is optimized to predict an MPI representation that reconstructs the target views using a differentiable rendering module. During testing, the MPI representation is only inferred once for each scene, which can then be used to synthesize novel views with minimal computation (homography + alpha compositing).
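The final rendering step described above composites the MPI planes from back to front using the standard “over” operator. The following is a minimal NumPy sketch of that alpha-compositing step only, not the team’s code; the homography warp of each plane into the target view, and the network that predicts the planes, are omitted, and the function and variable names are illustrative.

```python
import numpy as np

def composite_mpi(planes):
    """Back-to-front alpha compositing ("over" operator) of MPI planes.

    planes: list of (color, alpha) pairs ordered from farthest to nearest,
    where color has shape (H, W, 3) and alpha has shape (H, W, 1),
    with values in [0, 1]. Returns the rendered (H, W, 3) image.
    """
    out = np.zeros_like(planes[0][0])
    for color, alpha in planes:  # farthest plane first
        # Each nearer plane covers the accumulated image by its alpha.
        out = color * alpha + out * (1.0 - alpha)
    return out

# Toy example: an opaque red far plane behind a half-transparent green near plane.
h, w = 4, 4
far = (np.tile([1.0, 0.0, 0.0], (h, w, 1)), np.ones((h, w, 1)))
near = (np.tile([0.0, 1.0, 0.0], (h, w, 1)), np.full((h, w, 1), 0.5))
img = composite_mpi([far, near])  # each pixel blends half red, half green
```

Because this compositing (plus a per-plane homography) is all that is needed at test time, each additional novel view is cheap once the MPI has been inferred.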

The team concedes the model isn’t perfect, but believes the method can be extended to extrapolate from two input images and generate light fields that allow view movement in multiple dimensions.

The research was published on arXiv today.
