Teaching an AI to Detect Key Actors in Multi-person Videos

Researchers from Google and Stanford have taught their computer vision model to detect the most important person in a multi-person video scene – for example, who the shooter is in a basketball game which typically contains dozens or hundreds of people in a scene.

Using 20 Tesla K40 GPUs and the cuDNN-accelerated Tensorflow deep learning framework to train their recurrent neural network on 257 NCAA basketball games from YouTube, an attention mask selects which of the several people are most relevant to the action being performed, then tracks relevance of each object as time proceeds. The team published a paper detailing more of their work.

Multi People Important Target
The distribution of attention for the model with tracking, at the beginning of “free-throw success”. The attention is concentrated at a specific defender’s position. Free-throws have a distinctive defense formation, and observing the defenders can be helpful as shown in the sample images in the top row.

Over time the system can identify not only the most important actor, but potential important actors and the events with which they are associated – such as, the ability to understand the player going up for a layup could be important, but that the most important player is the one who then blocks the shot.