Multimodal Machine Learning is a multi-disciplinary research field based on integrating and modeling multiple modalities, e.g., acoustics and vision. This course covers fundamental concepts of multimodal learning, such as multimodal alignment, multimodal fusion, joint learning, temporal learning, and multimodal representation learning. We will mainly cover recent state-of-the-art papers that propose effective computational algorithms for a diverse spectrum of applications.
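To give a flavor of one of these fundamental concepts, the toy sketch below (hypothetical, plain Python; function names and weights are illustrative, not from the course materials) contrasts two common fusion strategies: early fusion, which combines per-modality features before prediction, and late fusion, which combines per-modality prediction scores.

```python
# Hypothetical sketch of two basic multimodal fusion strategies.

def early_fusion(audio_feats, visual_feats):
    """Early fusion: concatenate per-modality feature vectors into a
    single joint representation before any prediction is made."""
    return audio_feats + visual_feats  # list concatenation

def late_fusion(audio_score, visual_score, w_audio=0.5):
    """Late fusion: combine per-modality prediction scores, here by a
    weighted average (w_audio is an illustrative weight)."""
    return w_audio * audio_score + (1 - w_audio) * visual_score

# Toy example: acoustic and visual affect scores for one video clip.
fused_feats = early_fusion([0.1, 0.4], [0.7, 0.2, 0.9])
fused_score = late_fusion(0.8, 0.6)
print(fused_feats)  # [0.1, 0.4, 0.7, 0.2, 0.9]
print(fused_score)  # 0.7
```

In practice the course's papers go well beyond this, e.g., learning the fusion jointly with deep networks, but the early-versus-late distinction is the usual starting point.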
The first part mainly focuses on human face analysis from multimedia content, such as images and videos, together with the related machine learning methods. The tasks include both traditional ones (e.g., face detection & recognition, age & gender estimation) and more recent ones (e.g., face synthesis & expression manipulation). The related deep learning approaches will also be introduced, such as convolutional neural networks, recurrent neural networks, and some specific generative models.
The second part of this course focuses on human behavior understanding, which includes verbal and nonverbal behavior analysis and multimodal affect recognition. In this context, we will cover several datasets, sensing approaches, and computational methodologies for detecting and understanding various social and psychological phenomena. We will also point out existing limitations and outline possible future directions.
The course evaluation will be based on a small project carried out by a group of students (i.e., teamwork).