CMU-CS-22-147
Computer Science Department
School of Computer Science, Carnegie Mellon University




A Self-Supervised Study of Multimodal
Interactions in Opinion Videos

Jiaxin Shi

M.S. Thesis

August 2022



Keywords: Sentiment Analysis, Multimodal Interactions, Self-Supervised Learning

Our experience of the world is inherently multimodal. Analyzing human multimodal language is an increasingly popular area of research that often focuses on sentiment analysis and emotion recognition, where three main modalities are present: language, acoustic, and visual. Advances in deep learning rely heavily on abundant data from which models can learn rich patterns. Because annotating large-scale data is labor-intensive, it is worth exploring what self-supervised learning methods can achieve. In this work, we propose a self-supervised task to study the cross-modal interactions present in multimodal language datasets (with language, acoustic, and visual modalities). We study bimodal interactions between two source modalities by generating the third modality, the target modality, from the two source modalities. In other words, we quantify the information overlap between the source and target modalities while studying which multimodal interactions are used for this self-supervised task. A secondary advantage of the proposed task is that it can also be used in downstream tasks where one of the modalities is missing. Our approach builds on the intuition that the observed modalities may carry information about the missing modality. For example, people may be able to imagine the voice of a speaker when watching muted videos. In summary, this thesis is a self-supervised study of multimodal interactions in opinion videos. Our work investigates how much information overlap exists between different modalities, quantifies the amount of cross-modal interaction, and evaluates how much information about a missing modality can be learned from the other available modalities.

39 pages

Thesis Committee:
Louis-Philippe Morency (Chair)
Robert E. Frederking

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

