Predicting what will happen in the future is challenging. Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory developed an algorithm that can predict whether two individuals will hug, kiss, shake hands or slap five in the next scene.
Using a Tesla K40 GPU with the cuDNN-accelerated Caffe deep learning framework, the researchers trained their network on 600 hours of prime-time television shows including The Office and Desperate Housewives.
When predicting which of the four actions the person would perform one second later, the algorithm was correct more than 43 percent of the time. By comparison, humans who have been watching TV for years predicted the next action with 71 percent accuracy.
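At its core, the four-way task is a classification problem: map features extracted from a video frame to one of four candidate actions. The toy sketch below (with made-up random weights and a hypothetical feature vector, not the researchers' trained network) only illustrates the shape of that problem:

```python
import numpy as np

# Hypothetical illustration, not the authors' model: a linear softmax
# classifier mapping a pooled frame-feature vector to one of four actions.
ACTIONS = ["hug", "kiss", "handshake", "slap five"]

rng = np.random.default_rng(0)
feature_dim = 128

# Placeholder weights; a real system would learn these from hours of video.
W = rng.normal(size=(feature_dim, len(ACTIONS)))
b = np.zeros(len(ACTIONS))

def predict_action(frame_features: np.ndarray) -> str:
    """Return the most likely next action for one feature vector."""
    logits = frame_features @ W + b
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return ACTIONS[int(np.argmax(probs))]

# Stand-in for a CNN embedding of the current frame.
features = rng.normal(size=feature_dim)
print(predict_action(features))
```

The hard part, of course, is not the classifier but learning frame features predictive of the *future*, which is what the hours of training video are for.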
In a second study, the algorithm was shown frames from a video and asked to predict what object would appear five seconds later. For example, if someone opens a microwave, it might predict that a coffee cup is likely to come out.
More training brings better predictive capability. MIT PhD student and paper co-author Carl Vondrick says this research could potentially be applied to robots that develop better action plans, and to recommendation systems that suggest products or services based on what they anticipate a person will do.
“I’m excited to see how much better the algorithms get if we can feed them a lifetime’s worth of videos,” says Vondrick. “We might see some significant improvements that would get us closer to using predictive-vision in real-world situations.”