Leonid Sigal

Associate Professor, University of British Columbia

Course Information

Multimodal machine learning is a multi-disciplinary research field that addresses some of the core goals of artificial intelligence by integrating and modeling two or more data modalities (e.g., visual, linguistic, acoustic). This course will teach fundamental concepts related to multimodal machine learning, including (1) representation learning, (2) translation and mapping, and (3) modality alignment. While the fundamental techniques covered in this course are broadly applicable, the focus will be on studying them in the context of joint reasoning about and understanding of images/videos and language (text).

In addition to the fundamentals, we will study the recent rich body of research at the intersection of vision and language, including the problems of (i) generating image descriptions using natural language, (ii) visual question answering, (iii) retrieval of images based on textual queries (and vice versa), (iv) generating images/videos from textual descriptions, (v) language grounding, and many other related topics. On the technical side, we will study neural network architectures of various forms, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), memory networks, attention models, neural language models, and structured prediction models.

Leonid Sigal (lsigal@cs.ubc.ca)

Mohit Bajaj (mbajaj01@cs.ubc.ca)
Siddhesh Khandelwal (skhandel@cs.ubc.ca)

Office hours:
TBD and by appointment (all communication should go through Piazza)

Class meets:
Tuesday, Thursday 9:30 - 11:00 am, DMP 101


Prerequisites: You are required to have taken CPSC 340 or equivalent, with a satisfactory grade. Courses in Computer Vision or Natural Language Processing are a plus. In summary, this is intended to be a demanding graduate-level course and should not be your first encounter with Machine Learning. If you are unsure whether you have the background for this course, please e-mail or talk to me. Also, this course is heavy on programming assignments, which will be done exclusively in Python. No programming tutorials will be offered, so please ensure that you are comfortable with programming and with Python.

Computational Requirements: Due to the size of the data, most of the assignments in the class will require a CUDA-capable GPU with at least 4GB of GPU RAM to execute the code. A GPU will also be needed for the course project, which is a very significant part of the grade. You are welcome to use your own GPU if you have one. We will also provide credits for the use of the Microsoft Azure cloud service for all students in the class. Note that the amount of credits will be limited and not replenishable, which means you will have to be judicious about their use and execution times. An optional (but extremely useful) tutorial on using Microsoft Azure will be given during the first 2 weeks of classes outside of the regular course meeting time.
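As a quick sanity check of your environment, the following sketch (an illustrative snippet, not part of any assignment) tests whether an NVIDIA GPU driver is visible on your machine; if you already have PyTorch installed, `torch.cuda.is_available()` is the more direct check.

```python
# Illustrative environment check: confirms that the NVIDIA driver tool
# (nvidia-smi) is installed and runs, a reasonable proxy for a usable GPU.
import shutil
import subprocess

def has_nvidia_gpu() -> bool:
    """Return True if nvidia-smi is on PATH and exits successfully."""
    if shutil.which("nvidia-smi") is None:
        return False
    return subprocess.run(["nvidia-smi"], capture_output=True).returncode == 0

if __name__ == "__main__":
    if has_nvidia_gpu():
        print("NVIDIA GPU driver detected")
    else:
        print("No NVIDIA driver found; code will fall back to (slow) CPU execution")
```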

Audit Policy: Registered auditors are expected to complete the assignments, but not to present papers or participate in the final project. Unregistered auditors are not expected, or required, to do any assignments or readings; they are welcome, but will only be accommodated to the extent there is physical room in the class.


Assignments (four assignments in total) 30%
   Assignment #1: Neural Networks Introduction (5%)
   Assignment #2: Convolutional Neural Networks (5%)
   Assignment #3: Recurrent Neural Network Language Models (10%)
   Assignment #4: Neural Model for Image Captioning / Retrieval (10%)
Research papers 20%
   Readings and reviews: Two papers a week after the break (10%)
   Presentations and discussion: One paper per semester (10%)
Group project (proposal, final presentation, and web report) 50%

Assignments (30% of the grade)

Assignments in the course are designed to build the fundamental skills necessary for you to understand how to implement most state-of-the-art papers in vision, language, or the intersection of the two. The assignments build on one another and will lay the foundation for your final project. So while an individual assignment may not be worth a lot of points in isolation, skipping one will likely have a significant effect on your grade as well as your overall understanding of the material.

Research papers (20% of the grade)

In the second half of the course, every week we will read 2 papers as a class (additional papers will be presented in class, but will not be required reading for the whole class). Each student is expected to read all assigned required papers and write reviews of the selected papers. Each student will also need to participate in a paper presentation and debate. In other words, each student will need to present (defend) and argue against (attack) one paper in class (depending on the enrollment, this is likely to be done in small groups). Note that all students are expected to attend all classes, including those where your peers present. While I will not be taking attendance, if you miss too many classes for unspecified reasons, I reserve the right to deduct, at my discretion, up to 10% from your final grade.

Reviews: Reviews should be succinct and to the point; bulleted lists are welcome when applicable. When you present, you do not need to hand in the review. Reviews are expected to be < 1 page in 11 pt Times New Roman with 1 inch margins (or equivalent).

Structure of the reviews:
  Short summary of the paper (3-4 sentences)
  Main contributions (2-3 bullet points)
  Positive and negatives points (2-3 bullet points each)
  What did you not understand or was unclear about the paper? (2-3 bullet points)

Deadlines: The reviews will be due one day before the class at 11:59pm. The reviews should be submitted via Piazza as private notes to the instructor.

Paper Presentation and Debate: Each student will need to present a paper in class (either individually or as a group, depending on enrollment). Students will be assigned to papers based on their preferences: a list of papers will be given out, and students will be expected to submit a ranked list of their preferences. The presentation itself should be accompanied by slides, and should be clear and practiced. The student(s) should read the assigned paper and related work in enough detail to be able to lead a discussion and answer questions. A presentation should be roughly 30 minutes long (although this may be adjusted based on enrollment). You are allowed to take material from presentations on the web as long as you cite ALL sources fairly (including other papers if needed). However, you need to make the material your own and present it in the context of the class. In addition, another student (or group of students) will be assigned to attack the paper and the shortcomings it may have. The goal of this structure (which is different from past offerings of this course) is to generate a healthy scientific debate about the papers.

Structure of the paper presentation:
  High-level overview of the problem and motivation
  Clear statement of the problem
  Overview of the technical details of the method, including necessary background
  Relationship of the approach and method to others discussed in class
  Discussion of strengths and weaknesses of the approach
  Discussion of strengths and weaknesses of the evaluation
  Discussion of potential extensions (published or potential)

Deadline: Each student (or student group) is required to have slides ready and to meet with the instructor at least two days before the presentation, to obtain and incorporate feedback. Students are responsible for scheduling these meetings; make sure you reach out to the instructor well in advance.

Project (50% of the grade)

A major component of the course is a student-led research project. Due to the size of the class, these are encouraged to be group projects with approximately 2-3 students (3 highly encouraged); individual projects are possible under certain circumstances with instructor approval. The scope of the project will need to be scaled appropriately based on the group size. The projects will be research oriented, and each student in the group needs to contribute significantly to the algorithmic components and implementation. Please start thinking about the project as early as possible. The expectation is that you will have a well-formed idea by spring break and will present it right after.

The project can be on any interesting topic related to the course that the students come up with on their own or with the help of the instructor. Some project ideas will be suggested in class. Note that re-implementing an existing paper is not sufficient: the project needs to attempt to go beyond an existing publication. The grade will depend on the project definition, how well you present it in the report, how well you position your work in the related literature, how thorough your experiments are, and how thoughtful your conclusions are.

When thinking about the project, and for proposal, you should think about:
  The overall problem you want to solve
  What dataset you are going to use (see sample list below)
  What model you will use and/or from what paper you will start
  What is the related literature you should look at
  Who on the team will specifically work on what
  How will you evaluate the performance

Deadline: In the middle of the semester, you will need to hand in a project proposal and give a quick (5 minutes or less) presentation of what you intend to do. Prior to this, you need to discuss the project idea with the instructor (in person or via e-mail). Final project presentations will be given during finals week at the scheduled time, where each group will present their findings in a roughly 10-15 minute presentation. The final writeup will take the form of a (password protected) project webpage and should contain links to the GitHub repository with the code. This writeup will be due on the day of the final.


Date Topic Reading
W1: Jan 3 Introduction to the Course (slides)
- What is multi-modal learning?
- Challenges in multi-modal learning
- Course expectations and grading
(optional) The Development of Embodied Cognition: Six Lessons from Babies by Smith and Gasser
W2: Jan 8 Introduction to Deep Learning [Part 1] (slides)
- Multi-layer Perceptron (MLP)
- Stochastic Gradient Descent
- Computational graphs
- NN as Universal Approximators
Assignment 1 out (download)
Deep Learning in Nature by LeCun et al.
Automatic Differentiation in Machine Learning: a Survey by Baydin et al.
W2: Jan 10 Introduction to Deep Learning [Part 2] (slides)
- Regularization (L1, L2, batch norm, dropout)
- Terminology and practical advice on optimization
- Simple loss functions
- Structure, parameters and hyper-parameters
Introduction to Computer Vision (slides)
- History
- Basic operations and problems
- Image filtering and features
W3: Jan 14 Assignment 1 due
W3: Jan 15 Convolutional Neural Networks [part I] (slides)
- CNN Basics
- CNN as a feature representation
Assignment 2 out (download, data [~10gb])
Chapter 9, 9.1-9.3 of Deep Learning Book
W3: Jan 17 Convolutional Neural Networks [part II] (slides)
- Regularization, Data Augmentation
- Pre-training and transferability
- AlexNet, VGG, GoogLeNet, ResNet
CNNs for Computer Vision by Srinivas et al
W4: Jan 22 Convolutional Neural Networks [part III] (slides)
- Fully convolutional networks (transpose convolutions)
- CNNs for object detection (RCNN, Fast RCNN, Faster RCNN, YOLO)
R-CNN by Girshick et al
W4: Jan 23 Assignment 2 due
W4: Jan 24 Visualizing CNNs (slides)
- Guided BackProp
- Gradient ascent
- Adversarial examples
Introduction to Natural Language Processing (slides)
- Tasks in NLP
- Why NLP is difficult
- Representing words and text
Assignment 3 out (download)
Efficient Estimation of Word Representations in Vector Space by Mikolov et al
Chapter 10 of Deep Learning Book
W5: Jan 29 Recurrent Neural Networks (part I) (slides)
- Recurrent Neural Networks
- Long Short Term Memory Networks (LSTMs)
- Gated Recurrent Units (GRUs)
W5: Jan 31 Recurrent Neural Networks (part II) (slides)
- Encoder-decoder RNNs
- Translation models
- Attention models
W6: Feb 5 Recurrent Neural Networks (part III) (slides)
- Applications: Image Captioning, Question Answering, Activity Recognition
Unsupervised Representation Learning (part I) (slides)
- Autoencoders, Denoising Autoencoders
Assignment 3 due
Assignment 4 out (download)
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models by Kiros et al
W6: Feb 7 Unsupervised Representation Learning (part II) (slides)
- Stacked Autoencoders, Context Encoders
- Bottleneck Theory
Multimodal Learning (part I) (slides)
- Intro to Multimodal Learning
- Multimodal Joint Representations
- Canonical Correlation Analysis (CCA)
W7: Feb 12 Snow Day
W7: Feb 14 Multimodal Learning (part II) (slides)
- Joint embedding models
- Applications
Generative Models (slides)
- PixelRNN, VAEs (intro)
W7: Feb 15 Assignment 4 due
W8: Feb 19 Spring Break (no class)
W8: Feb 21 Spring Break (no class)
W9: Feb 26 Final Project Pitches
W9: Feb 28 Final Project Pitches
W10: Mar 5 Final Project Pitches
W10: Mar 7 Generative Models (slides)
- VAEs
- GANs
- Applications
W11: Mar 12 Graph Neural Networks (slides)
- Convolutions on a Graph
- GNNs as Message Passing Networks
- Variants of GNNs
Deep Reinforcement Learning (slides)
- Introduction
- Value-based RL, Policy-based RL, REINFORCE
- Applications
Relational inductive biases, deep learning, and graph networks by Battaglia et al
Gated Graph Sequence Neural Networks by Li et al
W11: Mar 14 Paper Readings 1: Image Captioning and Understanding
  1. Neural Baby Talk, J. Lu, J. Yang, D. Batra, D. Parikh, CVPR 2018.
    Presented by Shane Sims, Austin Rothwell, Soojin Lee (slides)
  2. Graph R-CNN for Scene Graph Generation, J. Yang, J. Lu, S. Lee, D. Batra, D. Parikh, ECCV 2018.
    Presented by Zicong Fan, Xiang Liu, Keegan Lensink (slides)
Graph R-CNN for Scene Graph Generation by Yang et al
W12: Mar 19 Paper Readings 2: Visual Question Answering
  1. Neural module networks, J. Andreas, M. Rohrbach, T. Darrell, D. Klein, CVPR, 2016.
    Presented by Behnami Delaram, Golara Javadi, Ghazal Sahebzamani, Pardiss Danaei (slides)
  2. Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning, A. Das, S. Kottur, J. Moura, S. Lee, D. Batra, ICCV, 2017.
    Presented by Matt Dietrich, Ali Mohammad Mehr, Ignacio Iturralde, Amir Refaee (slides)
Neural module networks, by Andreas et al
W12: Mar 21 Paper Readings 3: Architecture Design Choices
  1. FiLM: Visual Reasoning with a General Conditioning Layer, E. Perez, F. Strub, H. Vries, V. Dumoulin, A. Courville, AAAI, 2018.
    Presented by Arya Rashtchian, Johann Lingohr (slides)
  2. AutoAugment: Learning Augmentation Policies from Data, E. Cubuk, B. Zoph, D. Mané, V. Vasudevan, Q. Le.
    Presented by Maryam Tayyab, Gorisha Agarwal, Wen Ruochen (slides)
  3. Attention is all you need, A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, I. Polosukhin, NIPS 2017.
    Presented by Andreas Mund, Curtis Huebner, Wan Shing Martin Wang (slides)
FiLM: Visual Reasoning with a General Conditioning Layer by Perez et al
W13: Mar 26 Guest Lecture
W13: Mar 28 Paper Readings 4: Neural Architecture Search / AutoML
  1. Learning Transferable Architectures for Scalable Image Recognition, B. Zoph, V. Vasudevan, J. Shlens, Q. Le, CVPR 2018.
    Presented by Chris Yoon, Kyle Leeners, Nam Hee Kim, Mir Rayat Imtiaz Hossain (slides)
  2. Distilling the Knowledge in a Neural Network, G. Hinton, O. Vinyals, J. Dean.
    Presented by Si Yi (Cathy) Meng, Muhammad Shayan, Farnoosh Javadi, Jiefei Li (slides)
Learning Transferable Architectures for Scalable Image Recognition by Zoph et al
-- OR --
Distilling the Knowledge in a Neural Network by Hinton et al
W14: April 2 Paper Readings 5: Having Fun with Modalities
  1. Diverse Image-to-Image Translation via Disentangled Representations, H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, M.-H. Yang.
    Presented by Xin (Kathy) Zhao, Jay Fu, Mona Fadaviaradakani, Sam (McConnell) (slides)
  2. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features, A. Owens, A. A. Efros, ECCV 2018.
    Presented by Yuan Yao, Eric Semeniuc, Hansen Jan, Yuchi Zhang (slides)
Diverse Image-to-Image Translation via Disentangled Representations by Lee et al
W14: April 4 Paper Readings 6: Learning with Little Data
  1. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, C. Finn, P. Abbeel, S. Levine, ICML, 2017.
    Presented by Cheng Xie, Ariel Shann, Josh Leland, German Novakovskiy (slides)
  2. Taskonomy: Disentangling Task Transfer Learning, A. Zamir, A. Sax, W. Shen, L. Guibas, J. Malik, S. Savarese, CVPR, 2018.
    Presented by Tanzila Rahman, Peyman Bateni, Vibudh Agrawal, Porto Lucas (slides)
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks by Finn et al
TBD Final Project Presentations
Final Project Writeups due


Related Classes

This course was heavily inspired by courses offered at other institutions.


Books
  • Deep Learning, Ian Goodfellow, Aaron Courville, and Yoshua Bengio, MIT Press


Software
  • PyTorch: popular deep learning library that we will use in class; note the links to examples and tutorials
  • Keras: easy to use high-level neural networks API capable of running on top of TensorFlow, CNTK, Theano
  • TensorFlow: popular deep learning library from Google
  • Theano: another popular deep learning library
  • CNTK: Microsoft's deep learning cognitive toolkit library
  • scikit: Machine learning in Python
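To get a feel for what these libraries automate, here is a minimal illustrative sketch (using only NumPy, not any of the libraries above): a two-layer MLP trained on XOR with manually derived backpropagation, exactly the kind of forward/backward computation that PyTorch and TensorFlow express as computational graphs with automatic differentiation.

```python
# Illustrative sketch (not course material): a two-layer MLP learning XOR with
# plain NumPy and hand-derived gradients for a sum-of-squares loss (constant
# factors are folded into the learning rate).
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Parameters: 2 inputs -> 8 hidden units -> 1 output.
W1 = rng.normal(0, 1.0, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 1.0, (8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for step in range(2000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)          # hidden activations, shape (4, 8)
    p = sigmoid(h @ W2 + b2)          # predictions, shape (4, 1)
    # Backward pass (chain rule through sigmoid and tanh).
    dp = (p - y) * p * (1 - p)        # gradient at the output pre-activation
    dW2 = h.T @ dp;  db2 = dp.sum(0)
    dh = dp @ W2.T * (1 - h ** 2)     # gradient at the hidden pre-activation
    dW1 = X.T @ dh;  db1 = dh.sum(0)
    # Gradient descent updates.
    W1 -= lr * dW1;  b1 -= lr * db1
    W2 -= lr * dW2;  b2 -= lr * db2

preds = (sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2) > 0.5).astype(int)
print(preds.ravel())  # with enough training this should match XOR
```

In PyTorch or TensorFlow, the entire backward pass above is replaced by a single call to the autodiff engine (e.g., `loss.backward()` in PyTorch).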


Datasets
  • ImageNet: Large-scale image classification dataset
  • VQA: Visual Question Answering dataset
  • Microsoft COCO: Large-scale image detection, segmentation, and captioning dataset
  • LSMDC: Large-Scale Movie Description Dataset and challenge
  • Madlibs: Visual fill-in-the-blank dataset
  • ReferIt: Dataset of visual referring expressions
  • VisDial: Visual dialog dataset
  • ActivityNet Captions: A large-scale benchmark for dense-captioning of events in video
  • VisualGenome: A large-scale image-language dataset that includes region captions, relationships, visual questions, attributes and more
  • VIST: VIsual StoryTelling dataset
  • CLEVR: Compositional Language and Elementary Visual Reasoning dataset
  • COMICS: Dataset of annotated comics with visual panels and dialog transcriptions
  • Toronto COCO-QA: Toronto question answering dataset
  • Text-to-image coreference: multi-sentence descriptions of RGB-D scenes, annotations for image-to-text and text-to-text coreference
  • MovieQA: automatic story comprehension dataset from both video and text.
  • Charades Dataset: Large-scale video dataset for activity recognition and commonsense reasoning
  • imSitu: Situational recognition datasets with annotations of main activities, participating actors, objects, substances, and locations and the roles these participants play in the activity.
  • MIT-SoundNet: Flickr video dataset for cross-modal recognition, including audio and video.