Keynote 1: Real-World Video AI
Cees Snoek, University of Amsterdam
Progress in video understanding has been astonishing in the past decade. Classifying actions, tracking objects and even segmenting actor instances at the pixel level are now commonplace, thanks to supervised deep learning on labeled data. Yet it is becoming increasingly clear that deep learning architectures for video understanding may perform well on academic datasets that are constrained in domain diversity, recording circumstances and label vocabulary, but have difficulty generalizing to video in an open world, where sensory, spatiotemporal and semantic conditions differ considerably from those perceived during training. In this talk I will present recent work from my lab that strives for real-world video understanding by 1) leveraging sound, in addition to sight, under challenging vision conditions, 2) localizing previously unseen activities in space and time from a few video examples, without class labels, interval bounds or bounding boxes, and 3) recognizing and localizing previously unseen actions without any examples, by relying on off-the-shelf object detectors and multilingual word embeddings. Our experiments demonstrate state-of-the-art performance in the traditional closed-world setting, while enabling video recognition and retrieval in an open world.
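The third direction can be illustrated with a minimal sketch: score an unseen action class by combining object-detector confidences with word-embedding similarity between the action name and the detected object names. This is not the speaker's actual implementation; the embedding vectors, labels, and detector output below are toy values invented for illustration (in practice the embeddings would come from a pretrained multilingual model and the detections from a real object detector).

```python
import numpy as np

# Toy word embeddings -- assumption: real systems would use a pretrained
# multilingual embedding model; these 3-d vectors are made up for illustration.
embeddings = {
    "cycling":    np.array([0.9, 0.1, 0.0]),
    "bicycle":    np.array([0.8, 0.2, 0.1]),
    "helmet":     np.array([0.6, 0.3, 0.2]),
    "frying_pan": np.array([0.0, 0.9, 0.4]),
}

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_action_score(action, detections):
    """Score an action never seen in training: sum each detected object's
    confidence, weighted by how semantically close its name is to the
    action name in embedding space."""
    return sum(conf * cosine(embeddings[action], embeddings[obj])
               for obj, conf in detections)

# Hypothetical detector output for one video: (object label, confidence).
detections = [("bicycle", 0.95), ("helmet", 0.60), ("frying_pan", 0.05)]
score = zero_shot_action_score("cycling", detections)
```

Because the scoring needs no labeled examples of the action itself, the vocabulary of recognizable actions is limited only by the word-embedding space, which is what makes the open-world setting reachable.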