Computer Science Department
School of Computer Science, Carnegie Mellon University


Advanced Tools for Video and Multimedia Mining

Jia-Yu Pan

May 2006

Ph.D. Thesis

Keywords: Multimedia data mining, video mining, multi-modal pattern discovery, biomedical data mining, independent component analysis, random walk with restarts, image captioning, time series and text mining

How do we automatically find patterns in large multimedia databases, to make these databases useful and accessible? We focus on two problems: (1) mining "uni-modal patterns" that summarize the characteristics of a single data modality, and (2) mining "cross-modal correlations" among multiple modalities. Uni-modal patterns such as "news videos have static scenes and speech-like sounds," and cross-modal correlations such as "a blue region at the top of a natural scene image is likely to be the `sky'," provide insight into multimedia content and have many applications.

For uni-modal pattern discovery, we propose the method "AutoSplit." AutoSplit provides a framework for mining meaningful "independent components" in multimedia data, and can find patterns in a wide variety of data modalities (e.g., video, audio, text, and time sequences). For example, in video clips, AutoSplit finds characteristic visual/auditory patterns, and can classify news and commercial clips with 81% accuracy. In time sequences such as stock prices, AutoSplit finds hidden variables like "general growth trend" and "Internet bubble," and can detect outliers (e.g., lackluster stocks). Based on AutoSplit, we design a system, ViVo, for mining biomedical images. ViVo automatically constructs a biologically meaningful visual vocabulary and can classify 9 biological conditions with 84% accuracy. Moreover, ViVo supports biomedical data mining tasks such as highlighting biologically interesting image regions.
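The "independent components" that AutoSplit mines are in the spirit of independent component analysis (ICA), as the keywords indicate. As a hypothetical illustration only (not the thesis's actual code), the following sketch uses scikit-learn's FastICA to recover two hidden sources from observed mixtures, analogous to recovering hidden variables such as a "general growth trend" from stock-price sequences; the signals and mixing matrix are made up for the example:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two hypothetical hidden sources (e.g., underlying market trends)
t = np.linspace(0, 8, 1000)
s1 = np.sin(2 * t)             # smooth oscillation
s2 = np.sign(np.cos(3 * t))    # square wave
S = np.c_[s1, s2]              # true sources, shape (1000, 2)

# Observed data is an unknown linear mixture of the sources
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])     # illustrative mixing matrix
X = S @ A.T                    # observations, shape (1000, 2)

# ICA recovers statistically independent components from X alone
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)   # estimated sources, shape (1000, 2)
```

Each column of S_est should match one true source up to sign and scale, which is the sense in which ICA uncovers hidden variables without supervision.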

For cross-modal correlation discovery, we propose "MAGIC," a graph-based framework for multimedia correlation mining. When applied to news video databases, MAGIC can identify relevant video shots and transcript words for event summarization. On the task of automatic image captioning, MAGIC achieves a 58% relative improvement in captioning accuracy over recent machine learning techniques.
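MAGIC's graph-based correlation mining uses random walk with restarts (RWR), as the keywords note. A minimal sketch of RWR on a toy graph, where the adjacency matrix (imagine nodes 0-1 as image regions and nodes 2-4 as caption words) and the restart probability are illustrative choices, not values from the thesis:

```python
import numpy as np

# Hypothetical undirected graph linking image regions and caption words
A = np.array([
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
], dtype=float)
W = A / A.sum(axis=0)           # column-normalize: transition probabilities

c = 0.15                        # restart probability (illustrative)
e = np.zeros(5)
e[0] = 1.0                      # restart vector: query is node 0
r = e.copy()
for _ in range(100):            # power iteration to the steady state
    r = (1 - c) * (W @ r) + c * e
# r[i] is the steady-state relevance of node i to the query node
```

Nodes near the query accumulate more random-walk probability, so ranking by r surfaces the words most correlated with a given region; this cross-modal relevance score is what a graph framework like MAGIC exploits for captioning.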

212 pages
