Leonid Sigal

Associate Professor, University of British Columbia

Course Information

Multimodal machine learning is a multi-disciplinary research field which addresses some of the core goals of artificial intelligence by integrating and modeling two or more data modalities (e.g., visual, linguistic, acoustic, etc.). This course will teach fundamental concepts related to multimodal machine learning, including (1) representation learning, (2) translation and mapping, and (3) modality alignment. While the fundamental techniques covered in this course are applicable broadly, the focus will be on studying them in the context of joint reasoning and understanding of images/videos and language (text).

In addition to fundamentals, we will study the rich body of recent research at the intersection of vision and language, including problems of (i) generating image descriptions using natural language, (ii) visual question answering, (iii) retrieval of images based on textual queries (and vice versa), (iv) generating images/videos from textual descriptions, (v) language grounding and many other related topics. On the technical side, we will be studying neural network architectures of various forms, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), memory networks, attention models, neural language models, and structured prediction models.

Content Delivery and Covid Precautions: The lectures will be offered in-person only and no recordings will be made. Unfortunately, for this reason, a hybrid delivery of material will not be available. We will experiment with hybrid office hours, as we believe this will benefit the students. Students are strongly encouraged and expected (but not required) to wear masks in class. This is largely for the benefit of your fellow students with whom you will sit in close proximity. The instructor will not wear a mask when lecturing (this improves delivery of the material) but will put one on in close-interaction settings or when requested by students. If at any point a student is diagnosed with COVID or has symptoms, they are expected to follow UBC and provincial guidelines and isolate at home. Please inform the instructor of such cases and he will do his best to provide accommodations.

Leonid Sigal (lsigal@cs.ubc.ca)

Rayat Hossain (rayat137@cs.ubc.ca)
Tanzila Rahman (tanzila.himu@gmail.com)

Office hours:
TBD and by appointment (all communication to go through Piazza)

Class meets:
Tuesday, Thursday 11:00 - 12:30 pm, ICICS 246


Prerequisites: You are required to have taken CPSC 340 or equivalent, with a satisfactory grade. Courses in Computer Vision or Natural Language Processing are a plus. In summary, this is intended to be a demanding graduate level course and should not be your first encounter with Machine Learning. If you are unsure whether you have the background for this course please e-mail or talk to me. Also, this course is heavy on programming assignments, which will be done exclusively in Python. No programming tutorials will be offered, so please ensure that you are comfortable with programming and Python.

Computational Requirements: Due to the size of the data, most of the assignments in the class will require a CUDA-capable GPU with at least 4GB of GPU RAM to execute the code. A GPU will also be needed to develop the course project, which is a very significant part of the grade. You are welcome to use your own GPU if you have one. We also encourage you to use Google Colab, which should be sufficient for your assignments but likely not for the project. You may also register for student accounts with Amazon AWS (free $25 credit) or with Microsoft Azure (free $100 credit). Note that while TAs will do their best to help with your specific environment setup, due to the heterogeneity of setups that may result from these choices, it is ultimately up to the student to sort out the details.
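As a rough way to check a machine against the GPU requirement above, the following sketch asks the NVIDIA driver tool nvidia-smi for the total memory of each visible GPU. It is only an illustration: the helper names are hypothetical, "4GB" is read as 4096 MiB (an assumption, since some cards report slightly less than their advertised size), and it is not an official course tool.

```python
import shutil
import subprocess

# Assumption: "4GB" is interpreted as 4 GiB = 4096 MiB. Treat this as a rough
# threshold, since reported totals can differ slightly from the advertised size.
REQUIRED_MIB = 4 * 1024


def parse_memory_totals(smi_output: str) -> list[int]:
    """Parse one integer (MiB) per line of nvidia-smi CSV output, skipping other lines."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip().isdigit()]


def query_gpu_memory_mib() -> list[int]:
    """Return total memory in MiB for each visible NVIDIA GPU, or [] if none is found."""
    if shutil.which("nvidia-smi") is None:
        return []  # NVIDIA driver tools are not on PATH
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_memory_totals(result.stdout)


def meets_requirement(totals_mib: list[int], required_mib: int = REQUIRED_MIB) -> bool:
    """True if at least one GPU reports the required amount of memory."""
    return any(total >= required_mib for total in totals_mib)


if __name__ == "__main__":
    totals = query_gpu_memory_mib()
    print("GPU memory (MiB):", totals)
    print("Meets the 4GB requirement:", meets_requirement(totals))
```

Inside a PyTorch session, torch.cuda.is_available() gives a quicker yes/no answer about whether a CUDA device can be used at all.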

Audit Policy: If you are a registered auditor, you are expected to complete assignments but not present papers or participate in the final project. Those unregistered who would like to audit are not expected, or required, to do any assignments or readings. Unregistered auditors are welcome to contact me and I will assess the feasibility of letting you attend the lectures (subject to space in the lecture room).


Assignments (five assignments in total) 40%
   Assignment #0: Introduction to PyTorch (0% -- ungraded)
   Assignment #1: Neural Networks Introduction (5%)
   Assignment #2: Convolutional Neural Networks (5%)
   Assignment #3: Recurrent Neural Network Language Models (10%)
   Assignment #4: Neural Model for Image Captioning / Retrieval (10%)
   Assignment #5: Advanced Neural Architectures (10%)
Research papers 20%
   Readings and reviews: Two papers a week after the break (10%)
   Presentations and discussion: One paper per semester (10%)
Project (proposal, final presentation and web report) 40%
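
To see how the weights above combine, here is a small sketch that computes a final percentage as a plain weighted sum. The component names and the linear weighting are assumptions made for illustration; the page does not state how rounding, late penalties or other adjustments are handled.

```python
# Weights (percent of the final grade) copied from the breakdown above.
# The dictionary keys are hypothetical names chosen for this sketch.
WEIGHTS = {
    "assignment_0": 0,
    "assignment_1": 5,
    "assignment_2": 5,
    "assignment_3": 10,
    "assignment_4": 10,
    "assignment_5": 10,
    "paper_reviews": 10,
    "paper_presentation": 10,
    "project": 40,
}


def final_grade(scores: dict[str, float]) -> float:
    """Weighted sum of component scores, each given as a percentage in [0, 100].

    Components missing from `scores` count as 0 (an assumption of this sketch).
    """
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items()) / 100.0


if __name__ == "__main__":
    # A student with full marks everywhere except a skipped Assignment #3.
    scores = {name: 100.0 for name in WEIGHTS if name != "assignment_3"}
    print(final_grade(scores))
```

Under these assumptions, skipping a single 10% assignment caps the final grade at 90%, before counting any knock-on effect on later assignments that build on it.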

Assignments (40% of the grade)

Assignments in the course are designed to build the fundamental skills necessary for you to understand how to implement most state-of-the-art papers in vision, language or the intersection of the two. The assignments are designed to build on one another and will lay the foundation for your final project. So while an individual assignment may not be worth a lot of points in isolation, not doing one will likely have a significant effect on the grade as well as on overall understanding of the material.

Research papers (20% of the grade)

In the second half of the course, every week we will read 2 papers as a class (additional papers will be presented in class, but will not be required as reading for the whole class). Each student is expected to read all required papers and write reviews of the selected papers. Each student will also need to take part in a paper presentation. Note that all students are expected to attend all classes, including those where their peers present.

Reviews: Reviews should be succinct and to the point; bulleted lists are welcomed when applicable. When you present, you do not need to hand in the review. Reviews are expected to be < 1 page in 11 pt Times New Roman with 1 inch margins (or equivalent).

Structure of the reviews:
  Short summary of the paper (3-4 sentences)
  Main contributions (2-3 bullet points)
  Positive and negative points (2-3 bullet points each)
  What did you not understand, or what was unclear in the paper? (2-3 bullet points)

Deadlines: The reviews will be due one day before the class at 11:59pm. The reviews should be submitted via Canvas.

Paper Presentation: Each student will need to present a paper in class (either individually or as a group depending on enrollment). Students will be assigned to papers based on their preference. A list of papers will be given out and students will be expected to submit a ranked list of their preferences. The presentation itself should be accompanied by slides, and be clear and practiced. We are likely to resort to pre-recorded presentations rather than live ones. You are allowed to take material from presentations on the web as long as you cite ALL sources fairly (including other papers if needed). However, you need to make the material your own and present it in the context of the class. More details will be given as we get closer to paper readings and presentations.

Structure of the paper presentation:
  High-level overview of the problem and motivation
  Clear statement of the problem
  Overview of the technical details of the method, including necessary background
  Relationship of the approach and method to others discussed in class
  Discussion of strengths and weaknesses of the approach
  Discussion of strengths and weaknesses of the evaluation
  Discussion of potential extensions (published or potential)

Project (40% of the grade)

Details on projects to follow ....


Date Topic Reading and Resources
W1: Sept 6 Grad Classes Canceled
W1: Sept 8 Introduction to the Course (slides)
- What is multi-modal learning?
- Challenges in multi-modal learning
- Course expectations and grading
1. (optional) Reading List for Topics in Multimodal Machine Learning by Liang
2. (optional) The Development of Embodied Cognition: Six Lessons from Babies by Smith and Gasser
3. Multimodal Machine Learning: A Survey and Taxonomy by Baltrusaitis, Ahuja and Morency
4. Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods by Mogadala, Kalimuthu and Klakow
W1: Sept 9 Assignment 0 out (download)
Credit: Assignment 0 is adapted from the course assignment given out by Justin Johnson in EECS 498/598 at University of Michigan (see link). The adaptation was done by our own Suhail Mohammed.
W2: Sept 13 Introduction to Deep Learning [Part 1] (slides)
- Multi-layer Perceptron (MLP)
- Stochastic Gradient Descent
- Computational graphs
- NN as Universal Approximators

Assignment 1 out (download)
Deep Learning in Nature by LeCun et al.
Automatic Differentiation in Machine Learning: a Survey by Baydin et al.
W2: Sept 15 Introduction to Deep Learning [Part 2] (slides)
- More on activation functions
- Regularization (L1, L2, batch norm, dropout)
- Terminology and practical advice on optimization
- Simple loss functions
- Structure, parameters and hyper-parameters
W3: Sept 20 Introduction to Deep Learning [Part 3] (slides)
- Debugging strategies and techniques

Introduction to Computer Vision (slides)
- History
- Basic operations and problems
- Image filtering and features

Convolutional Neural Networks [Part 1] (slides)
- CNN Basics
- CNN layer
Chapter 9, 9.1-9.3 of Deep Learning Book
W3: Sept 21 Assignment 1 due
W3: Sept 22 Convolutional Neural Networks [Part 2] (slides)
- CNN, Pooling Layers
- Invariance vs. Equivariance
- Regularization, Data Augmentation
- Pre-training and transferability

Assignment 2 out (download, data [~10gb])
CNNs for Computer Vision by Srinivas et al
W4: Sept 27 Convolutional Neural Networks [Part 3] (slides)
- CNNs learning positional information
- Model ensembling and soups
- Static vs. dynamic computational graphs
- Image classification
- AlexNet, VGG
- GoogLeNet, ResNet
CNNs for Computer Vision by Srinivas et al
W4: Sept 29 Convolutional Neural Networks [Part 4] (slides)
- Vanishing and exploding gradients
- ResNet (+ theory)
- Segmentation networks
- Fully convolutional networks (transpose convolutions)
- CNNs for object detection (RCNN)
Mask R-CNN by He et al
W5: Oct 3 Assignment 2 due
W5: Oct 4 Convolutional Neural Networks [Part 4] (slides)
- CNNs for object detection (RCNN, Fast RCNN, Faster RCNN, Mask RCNN, YOLO)

Visualizing CNNs (slides)
- Guided BackProp
- Gradient ascent
- Adversarial examples

Introduction to Natural Language Processing (slides)
- Tasks in NLP
- Why NLP is difficult
- Representing words and text

Assignment 3 out (download)
W5: Oct 6 Recurrent Neural Networks [Part 1] (slides)
- Representing words and text
- Intro to language modeling
- Recurrent Neural Networks (RNNs)
- Encoder-decoder RNNs
Word Representations in Vector Space by Mikolov et al
Chapter 10 of Deep Learning Book
W6: Oct 11 Recurrent Neural Networks [Part 2] (slides)
- Encoder-decoder RNNs
- Translation models
- Long Short Term Memory Networks (LSTMs)
- Gated Recurrent Units (GRUs)
- Attention models
W6: Oct 13 Recurrent Neural Networks [Part 3] (slides)
- Attention models
- Forms of attention
- Transformer
- Applications: Language Translation, BERT, Image Captioning
W7: Oct 17 Recurrent Neural Networks Applications [Part 2] (slides)
- Masked Language Modeling (BERT)
- Sequential Language Modeling (GPT3)
- Image Captioning
- Visual Question Answering, Visual Dialogs
W7: Oct 20 Recurrent Neural Networks Applications [Part 3] (slides)
- Activity Recognition
- Vision Transformers, SWIN Transformers
- DETR, Language Grounding

Unsupervised Representation Learning (slides)
- Autoencoders, Denoising Autoencoders
- Stacked Autoencoders, Context Encoders
- Bottleneck Theory

Project Teams Formed
Assignment 4 out (download)
W7: Oct 23 Assignment 3 due
W8: Oct 25 Unsupervised Representation Learning (slides)
- Bottleneck Theory

Multimodal Learning [part I] (slides)
- Intro to Multimodal Learning
- Multimodal Joint Representations
- Canonical Correlation Analysis (CCA)
Unified Visual-Semantic Embeddings by Kiros et al
W8: Oct 27 Multimodal Learning [part 2] (slides)
- Joint embedding models
- Applications

Generative Models [part 1] (slides)
- PixelRNN, PixelCNN
W9: Nov 1 Final Project Pitches (Part 1)
W9: Nov 3 Final Project Pitches (Part 2)
W10: Nov 8 Generative Models [part 2] (slides)
- Variational Autoencoders (VAEs)
- Vector Quantized Variational Autoencoders (VQ-VAEs)
- Applications

Assignment 4 due
W10: Nov 10 No Class
W11: Nov 15 Generative Models [part 3] (slides)
- Vector Quantized Variational Autoencoders (VQ-VAEs)
- Generative Adversarial Networks (GANs)
- DCGAN, Conditional GAN
- Image-to-Image Translation: pix2pix, CycleGAN
- Laplacian Pyramid GAN, InfoGAN, Adversarial Autoencoders
W11: Nov 16 Paper presentation selection quiz due
W11: Nov 17 Diffusion Models (slides)
Guest lecture by Saeid Naderiparizi. Assignment 5 out (download)
Generative Modeling by Estimating Gradients of the Data Distribution by Song and Ermon
or Denoising Diffusion Probabilistic Models by Ho, Jain and Abbeel
(Only read ONE. Song has a nice blog explaining this.)
W12: Nov 22 (slides) Reading:
Graph Attention Networks by Velickovic et al.
W12: Nov 24 (slides)
W13: Nov 29 Deep Reinforcement Learning (slides)
- Introduction
- Value-based RL, Policy-based RL, Q-Learning, REINFORCE
- RL Applications

W13: Dec 1 (slides)
W13: Dec 2 Paper presentation due
W14: Dec 6 (slides)
Assignment 5 due


Related Classes

This course was heavily inspired by courses in other places. Most notably:

as well as:


  • Deep Learning, Ian Goodfellow, Aaron Courville, and Yoshua Bengio, MIT Press


  • PyTorch: popular deep learning library that we will use in class; note the links to examples and tutorials
  • Keras: easy to use high-level neural networks API capable of running on top of TensorFlow, CNTK, Theano
  • TensorFlow: popular deep learning library from Google
  • Theano: another popular deep learning library
  • CNTK: Microsoft's deep learning cognitive toolkit library
  • scikit: Machine learning in Python


  • ImageNet: Large-scale image classification dataset
  • VQA: Visual Question Answering dataset
  • Microsoft COCO: Large-scale image detection, segmentation, and captioning dataset
  • LSMDC: Large-Scale Movie Description Dataset and challenge
  • Madlibs: Visual fill-in-the-blank dataset
  • ReferIt: Dataset of visual referring expressions
  • VisDial: Visual dialog dataset
  • ActivityNet Captions: A large-scale benchmark for dense-captioning of events in video
  • VisualGenome: A large-scale image-language dataset that includes region captions, relationships, visual questions, attributes and more
  • VIST: VIsual StoryTelling dataset
  • CLEVR: Compositional Language and Elementary Visual Reasoning dataset
  • COMICS: Dataset of annotated comics with visual panels and dialog transcriptions
  • Toronto COCO-QA: Toronto question answering dataset
  • Text-to-image coreference: multi-sentence descriptions of RGB-D scenes, annotations for image-to-text and text-to-text coreference
  • MovieQA: automatic story comprehension dataset from both video and text.
  • Charades Dataset: Large-scale video dataset for activity recognition and commonsense reasoning
  • imSitu: Situational recognition datasets with annotations of main activities, participating actors, objects, substances, and locations and the roles these participants play in the activity.
  • MIT-SoundNet: Flickr video dataset for cross-modal recognition, including audio and video.