Leonid Sigal

Associate Professor, University of British Columbia

Course Information

Multimodal machine learning is a multi-disciplinary research field which addresses some of the core goals of artificial intelligence by integrating and modeling two or more data modalities (e.g., visual, linguistic, acoustic, etc.). This course will teach fundamental concepts related to multimodal machine learning, including (1) representation learning, (2) translation and mapping, and (3) modality alignment. While the fundamental techniques covered in this course are applicable broadly, the focus will be on studying them in the context of joint reasoning and understanding of images/videos and language (text).

In addition to fundamentals, we will study the rich body of recent research at the intersection of vision and language, including problems of (i) generating image descriptions using natural language, (ii) visual question answering, (iii) retrieval of images based on textual queries (and vice versa), (iv) generating images/videos from textual descriptions, (v) language grounding and many other related topics. On the technical side, we will be studying neural network architectures of various forms, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), memory networks, attention models, neural language models, and structured prediction models.

Leonid Sigal (lsigal@cs.ubc.ca)

Suhail Mohammed (suhail33@cs.ubc.ca)
Tanzila Rahman (tanzila.himu@gmail.com)

Office hours:
TBD and by appointment (all communication to go through Piazza)

Class meets:
Tuesday, Thursday 11:00 - 12:30 pm, Zoom


Prerequisites: You are required to have taken CPSC 340 or equivalent, with a satisfactory grade. Courses in Computer Vision or Natural Language Processing are a plus. In summary, this is intended to be a demanding graduate-level course and should not be your first encounter with Machine Learning. If you are unsure whether you have the background for this course, please e-mail or talk to me. Also, this course is heavy on programming assignments, which will be done exclusively in Python. No programming tutorials will be offered, so please ensure that you are comfortable with programming and Python.

Computational Requirements: Due to the size of the data, most of the assignments in the class will require a CUDA-capable GPU with at least 4GB of GPU RAM to execute the code. A GPU will also be needed to develop the course project, which is a very significant part of the grade. You are welcome to use your own GPU if you have one. We also encourage you to use Google Colab, which should be sufficient for your assignments but likely not for the project. You may also register for student accounts with Amazon AWS (free $25 credit) or with Microsoft Azure (free $100 credit). Note that while the TAs will do their best to help with your specific environment setup, due to the heterogeneity of setups that may result from these choices, it is ultimately up to the student to sort out the details.
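One way to check a machine against the stated requirement is sketched below. This is an illustrative sketch, not a course-provided tool: the function names are hypothetical, "4GB" is read as 4 * 10^9 bytes (an assumption, since nominal 4GB cards often report slightly less than 4 GiB), and only the first visible CUDA device is inspected.

```python
# Minimum GPU memory from the requirement above.
# Assumption: "4GB" is interpreted as decimal gigabytes (4 * 10**9 bytes).
REQUIRED_BYTES = 4 * 10**9


def meets_requirement(total_bytes, required_bytes=REQUIRED_BYTES):
    """Return True if a device with total_bytes of memory satisfies the requirement."""
    return total_bytes >= required_bytes


def check_gpu():
    """Describe the first visible CUDA device and whether it has enough memory."""
    try:
        import torch  # imported lazily so the script still runs without PyTorch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA-capable GPU detected"
    props = torch.cuda.get_device_properties(0)
    size_gb = props.total_memory / 10**9
    verdict = "OK" if meets_requirement(props.total_memory) else "below 4GB"
    return f"{props.name}: {size_gb:.1f} GB ({verdict})"


if __name__ == "__main__":
    print(check_gpu())
```

The same script can be run in a Google Colab notebook or on a cloud instance to see which GPU was allocated.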

Audit Policy: If you are a registered auditor, you are expected to complete assignments but not present papers or participate in the final project. Those unregistered who would like to audit are not expected, or required, to do any assignments or readings. Unregistered auditors are welcome to contact me and I will assess the feasibility of letting you attend the lectures via Zoom.


Assignments (five assignments in total) 40%
   Assignment #0: Introduction to PyTorch (0% -- ungraded)
   Assignment #1: Neural Networks Introduction (5%)
   Assignment #2: Convolutional Neural Networks (5%)
   Assignment #3: Recurrent Neural Network Language Models (10%)
   Assignment #4: Neural Model for Image Captioning / Retrieval (10%)
   Assignment #5: Advanced Neural Architectures (10%)
Research papers 20%
   Readings and reviews: Two papers a week after the break (10%)
   Presentations and discussion: One paper per semester (10%)
Project (proposal, final presentation and web report) 40%
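As a quick sanity check on the breakdown above, the weights can be combined as a simple weighted sum. This is only a sketch: the page gives the weights, but how each component is scored and whether any curve or late policy applies is not stated, so the component names and the `final_grade` helper are hypothetical. Assignment #0 is omitted because it is ungraded.

```python
# Weights (percent of the final grade) copied from the breakdown above.
WEIGHTS = {
    "assignment_1": 5,
    "assignment_2": 5,
    "assignment_3": 10,
    "assignment_4": 10,
    "assignment_5": 10,
    "paper_reviews": 10,
    "paper_presentation": 10,
    "project": 40,
}


def final_grade(scores):
    """Weighted sum of component scores.

    Each score is a fraction in [0, 1]; a missing component counts as 0.
    Assumption: components are combined linearly with no curve.
    """
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())


if __name__ == "__main__":
    perfect = {name: 1.0 for name in WEIGHTS}
    print(final_grade(perfect))
    # Skipping one 10% assignment while acing everything else.
    skipped = {name: 1.0 for name in WEIGHTS if name != "assignment_3"}
    print(final_grade(skipped))
```

Under these assumptions, skipping a single assignment removes its full weight from the total before accounting for any knock-on effect on later work.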

Assignments (40% of the grade)

Assignments in the course are designed to build the fundamental skills necessary for you to understand how to implement most state-of-the-art papers in vision, language or the intersection of the two. The assignments are designed to build on one another and will lay the foundation for your final project. So while an individual assignment may not be worth a lot of points in isolation, not doing one will likely have a significant effect on your grade as well as your overall understanding of the material.

Research papers (20% of the grade)

In the second half of the course, every week we will read 2 papers as a class (additional papers will be presented in class, but will not be required as reading for the whole class). Each student is expected to read all assigned required papers and write reviews of the selected papers. Each student will also need to participate in a paper presentation. Note that all students are expected to attend all classes, including those where your peers present.

Reviews: Reviews should be succinct and to the point; bulleted lists are welcomed when applicable. When you present, you do not need to hand in the review. Reviews are expected to be < 1 page in 11 pt Times New Roman with 1 inch margins (or equivalent).

Structure of the reviews:
  Short summary of the paper (3-4 sentences)
  Main contributions (2-3 bullet points)
  Positive and negative points (2-3 bullet points each)
  What did you not understand or was unclear about the paper? (2-3 bullet points)

Deadlines: The reviews will be due one day before the class at 11:59pm. The reviews should be submitted via Canvas.

Paper Presentation: Each student will need to present a paper in class (either individually or as a group depending on enrollment). Students will be assigned to papers based on their preference. A list of papers will be given out and students will be expected to submit a ranked list of their preferences. The presentation itself should be accompanied by slides, and be clear and practiced. We are likely to resort to pre-recorded presentations, rather than live ones. You are allowed to take material from presentations on the web as long as you cite ALL sources fairly (including other papers if needed). However, you need to make the material your own and present it in the context of the class. More details will be given as we get closer to paper readings and presentations.

Structure of the paper presentation:
  High-level overview of the problem and motivation
  Clear statement of the problem
  Overview of the technical details of the method, including necessary background
  Relationship of the approach and method to others discussed in class
  Discussion of strengths and weaknesses of the approach
  Discussion of strengths and weaknesses of the evaluation
  Discussion of potential extensions (published or potential)

Project (40% of the grade)

Details on projects to follow ....


Date Topic Reading
W1: Jan 12 Introduction to the Course (slides)
- What is multi-modal learning?
- Challenges in multi-modal learning
- Course expectations and grading
(optional) The Development of Embodied Cognition: Six Lessons from Babies by Smith and Gasser
Multimodal Machine Learning: A Survey and Taxonomy by Baltrusaitis, Ahuja and Morency
W1: Jan 14 Assignment 0 out (download)
Credit: Assignment 0 is adapted from the course assignment given out by Justin Johnson in EECS 498/598 at University of Michigan (see link). The adaptation was done by our own Suhail Mohammed.
W1: Jan 14 Introduction to Deep Learning [Part 1] (slides)
- Multi-layer Perceptron (MLP)
- Stochastic Gradient Descent
- Computational graphs
- NN as Universal Approximators

Assignment 1 out (download)
Deep Learning in Nature by LeCun et al.
Automatic Differentiation in Machine Learning: a Survey by Baydin et al.
W2: Jan 19 Introduction to Deep Learning [Part 2] (slides)
- Regularization (L1, L2, batch norm, dropout)
- Terminology and practical advice on optimization
- Simple loss functions
- Structure, parameters and hyper-parameters
W2: Jan 21 Introduction to Computer Vision (slides)
- History
- Basic operations and problems
- Image filtering and features

Convolutional Neural Networks [part 1] (slides)
- CNN Basics
- CNN as a feature representation

Assignment 1 due
Assignment 2 out (download, data [~10gb])
Chapter 9, 9.1-9.3 of Deep Learning Book
W3: Jan 26 Convolutional Neural Networks [part 2] (slides)
- Regularization, Data Augmentation
- Pre-training and transferability
- AlexNet, VGG
CNNs for Computer Vision by Srinivas et al
W3: Jan 28 Convolutional Neural Networks [part 3] (slides)
- GoogLeNet, ResNet
- Segmentation networks
- Fully convolutional networks (transpose convolutions)
W4: Feb 1 Assignment 2 due
W4: Feb 2 Convolutional Neural Networks [part 4] (slides)
- CNNs for object detection (RCNN, Fast RCNN, Faster RCNN, Mask RCNN, YOLO)

Visualizing CNNs (slides)
- Guided BackProp
- Gradient ascent
- Adversarial examples

Assignment 3 out (download)
R-CNN by Girshick et al
Mask R-CNN by He et al
A Survey of Deep Learning-based Object Detection by Jiao et al
W4: Feb 4 Introduction to Natural Language Processing (slides)
- Tasks in NLP
- Why NLP is difficult
- Representing words and text

Recurrent Neural Networks [part I] (slides)
- Recurrent Neural Networks
- Long Short Term Memory Networks (LSTMs)
Word Representations in Vector Space by Mikolov et al
Chapter 10 of Deep Learning Book
W5: Feb 9 Recurrent Neural Networks [part 2] (slides)
- Recurrent Neural Networks
- Long Short Term Memory Networks (LSTMs)
- Gated Recurrent Units (GRUs)
W5: Feb 11 Recurrent Neural Networks [part 3] (slides)
- Encoder-decoder RNNs
- Translation models
- Attention models
- Transformer
W6: Feb 16 No Class: Winter Break
W6: Feb 18 No Class: Winter Break
W7: Feb 22 Assignment 3 due
W7: Feb 23 Recurrent Neural Networks [part 4] (slides)
- Word Vector Representations (CBOW, Skip-gram)
- Applications: Language Translation, BERT, Image Captioning

Assignment 4 out (download)
W7: Feb 25 Recurrent Neural Networks [part 5] (slides)
- Applications: Question Answering, Activity Recognition

Unsupervised Representation Learning [part I] (slides)
- Autoencoders, Denoising Autoencoders
- Stacked Autoencoders, Context Encoders
- Bottleneck Theory
W8: Mar 2 Unsupervised Representation Learning [part 2] (slides)
- Denoising Autoencoders
- Stacked Autoencoders, Context Encoders
- Bottleneck Theory

Multimodal Learning [part I] (slides)
- Intro to Multimodal Learning
- Multimodal Joint Representations
Unified Visual-Semantic Embeddings by Kiros et al
W8: Mar 4 Multimodal Learning [part 2] (slides)
- Canonical Correlation Analysis (CCA)
- Joint embedding models
- Applications

W9: Mar 8 Assignment 4 due
W9: Mar 9 Multimodal Learning [part 3] (slides)
- Applications

Generative Models [part 1] (slides)
- PixelRNN, PixelCNN
W9: Mar 11 Generative Models [part 2] (slides)
- Variational Autoencoders (VAEs)
- Applications
W10: Mar 15 Project Proposals due
W10: Mar 16 Generative Models [part 3] (slides)
- Conditional VAEs, Temporal VAEs
- VAE Applications
- Generative Adversarial Networks (GANs)
- DCGAN, Conditional GAN
W10: Mar 18 Generative Models [part 4] (slides)
- Image-to-Image Translation: pix2pix, CycleGAN
- Laplacian Pyramid GAN, InfoGAN, Adversarial Autoencoders
- Hybrid VAE + GAN Architectures (e.g., layout-to-image)

Graph Neural Networks [part 1] (slides)
- Graph Convolutional Neural Networks (Graph CNNs)
- GNNs with Edge Embedding
W11: Mar 23 Graph Neural Networks [part 2] (slides)
- GNNs with Attention
- GNN Applications: Language Grounding, Scene Graph Generation

Deep Reinforcement Learning [part 1] (slides)
- Introduction
W11: Mar 25 Deep Reinforcement Learning [part 2] (slides)
- Value-based RL, Policy-based RL, Q-Learning, REINFORCE
- RL Applications

Assignment 5 out (download)
W12: Mar 30 Guest Lecture: Recent Advances in Vision-and-Language Navigation

Peter Anderson, Research Scientist at Google

Abstract: Vision-and-Language Navigation (VLN) requires an agent to follow natural language navigation instructions in a previously unseen photorealistic environment. It's a challenging embodied AI problem that's seen renewed interest, driven in part by the increasing availability of high-quality 3D reconstructions to use as training environments. In this talk I will introduce the problem and the main datasets (including RxR, a new large-scale multilingual dataset), discuss the role of 'speaker' models for pragmatic reasoning and data augmentation, and present a first attempt at sim-to-real transfer to a robot. Finally, I will introduce our recent efforts towards building a generic visual world model for indoor navigation that can predict around corners.

Bio: I am a Research Scientist in the Language team at Google Research. My research interests include computer vision, natural language processing, and problems at the intersection of these fields in particular. My recent work has focused on grounded language understanding, particularly in large-scale visually-realistic 3D environments. I completed my PhD in Computer Science at the Australian National University in 2018 where I was advised by Stephen Gould. Prior to joining Google I was a Research Scientist at Georgia Tech working with Dhruv Batra and Devi Parikh.
W12: Apr 1 Guest Lecture: DIDAN: Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News

Kate Saenko, Associate Professor at Boston University

Abstract: Large-scale dissemination of disinformation online intended to mislead or deceive the general population is a major societal problem. Rapid progression in image, video, and natural language generative models has only exacerbated this situation and intensified our need for an effective defense mechanism. While existing approaches have been proposed to defend against neural fake news, they are generally constrained to the very limited setting where articles only have text and metadata such as the title and authors. In this paper, we introduce the more realistic and challenging task of defending against machine-generated news that also includes images and captions. To identify the possible weaknesses that adversaries can exploit, we create a NeuralNews dataset composed of 4 different types of generated articles as well as conduct a series of human user study experiments based on this dataset. In addition to the valuable insights gleaned from our user study, we provide a relatively effective approach based on detecting visual-semantic inconsistencies, which will serve as an effective first line of defense and a useful reference for future work in defending against machine-generated disinformation.

Bio: Kate Saenko is an Associate Professor at the Department of Computer Science at Boston University, and the director of the Computer Vision and Learning Group and member of the IVC Group. She received her PhD from MIT. Previously, she was an Assistant Professor at the Department of Computer Science at UMass Lowell, a Postdoctoral Researcher at the International Computer Science Institute, a Visiting Scholar at UC Berkeley EECS and a Visiting Postdoctoral Fellow in the School of Engineering and Applied Science at Harvard University. Her research interests are in the broad area of Artificial Intelligence with a focus on Adaptive Machine Learning, Learning for Vision and Language Understanding, and Deep Learning.
W13: Apr 6 Guest Lecture: Measuring and Mitigating Biases in Vision and Language

Vicente Ordóñez Román, Assistant Professor at University of Virginia

Abstract: Deep learning and large-scale datasets have enabled increasingly accurate models that are able to capture a multitude of patterns and visual cues that would be hard to manually engineer. Despite this progress, some of the patterns learned by these models remain obscure and some might not generalize well beyond specific datasets. In this talk, I will present some of our work aimed at uncovering, measuring, and mitigating some of the biases introduced by vision and language models with emphasis on the issue of bias amplification. We have shown that models not only tend to replicate biases present in the data but also amplify them -- even when these biases are with respect to latent variables and when reasonable effort was dedicated to produce a balanced and somewhat diverse dataset. We find some of these issues arise in both computer vision and natural language processing problems.

Bio: Vicente Ordonez is an assistant professor in the Department of Computer Science at the University of Virginia. His research interests lie at the intersection of computer vision, natural language processing and machine learning. He is a recipient of a Best Paper Award at the conference on Empirical Methods in Natural Language Processing (EMNLP) 2017 and the Best Paper Award -- Marr Prize -- at the International Conference on Computer Vision (ICCV) 2013. Vicente obtained his PhD in Computer Science at the University of North Carolina at Chapel Hill and, in the past, he has also been a Visiting Fellow at the Allen Institute for Artificial Intelligence and a Visiting Professor at Adobe Research.

Paper Presentation due
W13: Apr 8 Guest Lecture: Learning from Unlabeled Videos

Yale Song, Senior Researcher at Microsoft Research

Abstract: Videos provide new opportunities in self-supervised learning with dynamic signals absent from images, such as temporal and multimodal information. However, modeling complex spatio-temporal and multimodal patterns poses unique algorithmic challenges. They are also notoriously difficult to work with due to the heavy compute and memory requirements. In this talk, I will give an overview of some of our recent efforts on self-supervised learning from unlabeled videos, focusing on multimodal learning with compute/memory-efficient models. I will show how contrastive learning with audio-visual correspondence leads to generalizable video representations and share how we obtain diverse and informative negative samples for contrastive learning. I will also discuss some of the major open challenges, including obtaining large-scale datasets without tedious human efforts for self-supervised learning.

Bio: Yale Song is a Senior Researcher at Microsoft Research in Redmond. His research is centered around computer vision and machine learning, especially on learning from unlabeled and noisy real-world visual data. Prior to MSR, he was a Senior Research Scientist/Research Lead at Yahoo Research NYC where he led the company's various research efforts on video understanding. Some of his works have been deployed in Microsoft Store, Yahoo Sports, Flickr and Tumblr, and have been featured in MIT News, Economist, NVIDIA PodCast, among others. He has served as Program Chair of ICMI 2019 and frequently serves as Area Chair at various computer vision and machine learning venues such as CVPR, ICCV, NeurIPS, and ICLR. He has been organizing the workshop series on Learning from Unlabeled Videos (LUV) at CVPR. He received his Ph.D. in Computer Science from MIT where he was a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL).
W14: Apr 13 Invited Talk and/or Paper Presentations

Assignment 5 due
Apr 19 Final Project Writeups due
Apr 21~23 Final Project Presentations due


Related Classes

This course was very heavily inspired by courses in other places. Most notably:

as well as:


  • Deep Learning, Ian Goodfellow, Aaron Courville, and Yoshua Bengio, MIT Press


  • PyTorch: popular deep learning library that we will use in class; note the links to examples and tutorials
  • Keras: easy to use high-level neural networks API capable of running on top of TensorFlow, CNTK, Theano
  • TensorFlow: popular deep learning library from Google
  • Theano: another popular deep learning library
  • CNTK: Microsoft's deep learning cognitive toolkit library
  • scikit: Machine learning in Python


  • ImageNet: Large-scale image classification dataset
  • VQA: Visual Question Answering dataset
  • Microsoft COCO: Large-scale image detection, segmentation, and captioning dataset
  • LSMDC: Large-Scale Movie Description Dataset and challenge
  • Madlibs: Visual fill-in-the-blank dataset
  • ReferIt: Dataset of visual referring expressions
  • VisDial: Visual dialog dataset
  • ActivityNet Captions: A large-scale benchmark for dense-captioning of events in video
  • VisualGenome: A large-scale image-language dataset that includes region captions, relationships, visual questions, attributes and more
  • VIST: VIsual StoryTelling dataset
  • CLEVR: Compositional Language and Elementary Visual Reasoning dataset
  • COMICS: Dataset of annotated comics with visual panels and dialog transcriptions
  • Toronto COCO-QA: Toronto question answering dataset
  • Text-to-image coreference: multi-sentence descriptions of RGB-D scenes, annotations for image-to-text and text-to-text coreference
  • MovieQA: automatic story comprehension dataset from both video and text.
  • Charades Dataset: Large-scale video dataset for activity recognition and commonsense reasoning
  • imSitu: Situational recognition datasets with annotations of main activities, participating actors, objects, substances, and locations and the roles these participants play in the activity.
  • MIT-SoundNet: Flickr video dataset for cross-modal recognition, including audio and video.