Computer Science Department
School of Computer Science, Carnegie Mellon University
Volumetric Features for Video Event Detection
The amount of digital video has grown exponentially in recent years. We are at a nexus in time where video capture technology, computing power, storage capacity, and broadband networking have matured sufficiently to fuel an explosion in consumer videos. A key part of this ecosystem is the ability to search over vast amounts of video data. While traditional methods have relied on text, such as that extracted from closed captioning, speech analysis, or manual annotation, we would like to search based on the automated recognition of the visual events in the video. This would enable more general searches to be performed without relying on previously labeled data. We propose a method for visual event detection of human actions that occur in crowded, dynamic environments. Crowded scenes pose a difficult challenge for current approaches to video event detection because distracting motion from other objects in the scene makes it hard to segment the actor from the background. Our technique reliably identifies actions in the presence of partial occlusion and background clutter. Our approach is based on three key ideas: (1) we efficiently match the volumetric representation of an event against over-segmented spatio-temporal video volumes; (2) we augment our shape-based features using flow; (3) rather than treating an event template as an atomic entity, we separately match by parts (both in space and time), enabling robustness against occlusions and actor variability. Our experiments on human actions, such as picking up a dropped object or waving in a crowd, show reliable detection with few false positives.