Today, Facebook introduced a new feature that automatically generates text descriptions of pictures using advanced object recognition technology.
Until now, people using screen readers would only hear the name of the person who shared the photo, followed by the term “photo” when they came upon an image in News Feed. Now they will get a richer description of what’s in a photo. For instance, someone could now hear, “Image may contain three people, smiling, outdoors.”
The Facebook researchers noted that it took nearly ten months to roll the feature out publicly, as they had to train their deep learning models to recognize more than just the people in the images. People mostly care about who is in a photo and what they are doing, but sometimes the background is what makes the photo interesting or significant.
While that may be intuitive to humans, it is quite challenging to teach a machine to provide as much useful information as possible while acknowledging the social context.
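As a rough illustration of that prioritization, the sketch below orders detected concepts so that people come first, then what they are doing, then the setting, and joins them into the kind of sentence quoted earlier. The category names, the priority table, and the `build_description` function are hypothetical and are not taken from Facebook's system.

```python
# Hypothetical category priorities (an assumption): lower numbers are read out first.
CATEGORY_PRIORITY = {"people": 0, "action": 1, "setting": 2}


def build_description(detections):
    """Turn (concept, category) pairs into a screen-reader sentence.

    Returns None when nothing was detected, so the caller can fall back
    to whatever it announced before.
    """
    if not detections:
        return None
    # Unknown categories sort last; sorted() is stable, so ties keep input order.
    ordered = sorted(
        detections,
        key=lambda item: CATEGORY_PRIORITY.get(item[1], len(CATEGORY_PRIORITY)),
    )
    concepts = ", ".join(concept for concept, _category in ordered)
    return f"Image may contain {concepts}."


detections = [
    ("outdoors", "setting"),
    ("three people", "people"),
    ("smiling", "action"),
]
print(build_description(detections))
```

With the sample input, the people concept is moved to the front even though the setting was listed first.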
Their neural network models contain a million parameters, but the researchers carefully selected a set of about 100 concepts based on prominence in photos as well as the accuracy of the visual recognition system. They also favored concepts with very specific meanings, such as smiling, jewelry, cars, and boats, and avoided concepts open to interpretation. Currently, they tune the object detection algorithm so that each concept it reports has a minimum precision of 0.8.
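One common way to enforce a minimum precision is to pick, for each concept, the lowest confidence cutoff at which the detector's precision on labeled validation data still meets the target. The sketch below assumes that approach; the function name and the sample scores are hypothetical, and the article does not say how Facebook actually sets its thresholds.

```python
def threshold_for_precision(scores, labels, min_precision=0.8):
    """Return the lowest score cutoff whose precision meets min_precision.

    scores: detector confidences for one concept on validation images.
    labels: 1 if the concept is really present in the image, else 0.
    Returns None if no cutoff reaches the target precision.
    """
    # Rank validation examples from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best_cutoff = None
    true_positives = 0
    for i, (score, label) in enumerate(ranked):
        true_positives += label
        # Only evaluate a cutoff once every example sharing this score is counted.
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        # Precision when accepting everything scored at or above this cutoff.
        if last_of_tie and true_positives / (i + 1) >= min_precision:
            best_cutoff = score
    return best_cutoff


# Hypothetical validation data for a single concept.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50]
labels = [1, 1, 1, 0, 0, 1]
cutoff = threshold_for_precision(scores, labels)
print(cutoff)
```

A lower cutoff lets the concept be reported in more photos, so choosing the lowest cutoff that still meets the precision target trades as little coverage as possible for the required reliability.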