Google has released the latest version of its automatic image captioning model, which is more accurate and much faster to train than the original system.
“The TensorFlow implementation released today achieves the same level of accuracy with significantly faster performance: time per training step is just 0.7 seconds in TensorFlow compared to 3 seconds in DistBelief (a system Google previously used for generating image captions) on an NVIDIA K20 GPU, meaning that total training time is just 25 percent of the time previously required,” Chris Shallue, a software engineer on the Google Brain team, wrote in a blog post.
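As a rough check on that figure, the short Python sketch below compares the two per-step times quoted above. It assumes the number of training steps is the same on both systems, which the quote implies but does not state outright; the variable and function names are illustrative and do not come from Google's code.

```python
# Per-step training times quoted in the blog post (seconds per step).
tensorflow_step_s = 0.7
distbelief_step_s = 3.0


def relative_training_time(new_step_s: float, old_step_s: float) -> float:
    """Fraction of the old training time needed, assuming an equal step count."""
    return new_step_s / old_step_s


fraction = relative_training_time(tensorflow_step_s, distbelief_step_s)
speedup = 1 / fraction

print(f"fraction of previous training time: {fraction:.0%}")  # 23%
print(f"speedup per step: {speedup:.1f}x")                    # 4.3x
```

Under that assumption the ratio works out to about 23 percent, in line with the roughly 25 percent cited in the post.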
Using CUDA and the TensorFlow deep learning framework, Google trains the model, called Show and Tell, by showing it images along with captions that people wrote for those images. Sometimes, if the model thinks it sees something in a new image that’s exactly like a previous image it has seen, it falls back on the caption for that previous image. But at other times, Show and Tell is able to come up with original captions. “Moreover,” Shallue wrote, “it learns how to express that knowledge in natural-sounding English phrases despite receiving no additional language training other than reading the human captions.”
The initial training phase took nearly two weeks on a single Tesla K20 GPU, and the team notes that running the same code on a CPU would be 10 times slower.