Leonid Sigal

Associate Professor, University of British Columbia

Course Information

Multimodal machine learning is a multi-disciplinary research field which addresses some of the core goals of artificial intelligence by integrating and modeling two or more data modalities (e.g., visual, linguistic, acoustic, etc.). This course will teach fundamental concepts related to multimodal machine learning, including (1) representation learning, (2) translation and mapping, and (3) modality alignment. While the fundamental techniques covered in this course are applicable broadly, the focus will be on studying them in the context of joint reasoning and understanding of images/videos and language (text).

In addition to fundamentals, we will study the rich body of recent research at the intersection of vision and language, including problems of (i) generating image descriptions using natural language, (ii) visual question answering, (iii) retrieval of images based on textual queries (and vice versa), (iv) generating images/videos from textual descriptions, (v) language grounding and many other related topics. On the technical side, we will be studying neural network architectures of various forms, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), memory networks, attention models, neural language models, and structured prediction models.

Leonid Sigal (lsigal@cs.ubc.ca)

Shikib Mehri (mehrishikib@gmail.com)

Office hours:
by appointment (send e-mail)

Class meets:
Tuesday, Thursday 11:00 am - 12:30 pm, ICICS 246 DMP 101


Prerequisites: You are required to have taken CPSC 340 or equivalent, with a satisfactory grade. Courses in Computer Vision or Natural Language Processing are a plus. In summary, this is intended to be a demanding graduate-level course and should not be your first encounter with Machine Learning. If you are unsure whether you have the background for this course please e-mail or talk to me. Also, this course is heavy on programming assignments, which will be done exclusively in Python. No programming tutorials will be offered, so please ensure that you are comfortable with programming and Python.

Computational Requirements: Due to the size of the data, most of the assignments in the class will require a CUDA-capable GPU with at least 4GB of GPU RAM to execute the code. A GPU will also be needed to develop the course project, which is a very significant part of the grade. You are welcome to use your own GPU if you have one. We will also provide credits for the use of the Microsoft Azure cloud service for all students in the class. Note that the amount of credits will be limited and not replenishable, which means you have to be judicious about their use and execution times. An optional (but extremely useful) tutorial on using Microsoft Azure will be given during the first 2 weeks of classes outside of the regular course meeting time.
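To check whether a machine meets the GPU requirement, something along the following lines may help. This is only a minimal sketch: it assumes PyTorch (the library used in class) is the way you query the GPU, and the helper names `bytes_to_gib` and `describe_gpu` are made up for illustration. Note that a card sold as "4GB" may report slightly less than 4 GiB of total memory.

```python
REQUIRED_GIB = 4  # stated course minimum: 4GB of GPU RAM


def bytes_to_gib(num_bytes):
    """Convert a byte count to gibibytes (1 GiB = 1024**3 bytes)."""
    return num_bytes / (1024 ** 3)


def describe_gpu():
    """Return a short status string for the first CUDA device, if any.

    PyTorch is imported lazily so this function does not fail on
    machines where it is not installed.
    """
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA-capable GPU is visible to PyTorch"
    props = torch.cuda.get_device_properties(0)
    gib = bytes_to_gib(props.total_memory)
    return f"{props.name}: {gib:.1f} GiB (course minimum: {REQUIRED_GIB} GiB)"


if __name__ == "__main__":
    print(describe_gpu())
```

Running the script on an Azure instance or on your own machine before starting an assignment gives a quick indication of whether the code is likely to fit in GPU memory.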

Audit Policy: If you are a registered auditor, you are expected to complete assignments but not present papers or participate in the final project. Those unregistered who would like to audit are not expected, or required, to do any assignments or readings. Unregistered auditors are welcome, but will only be accommodated to the extent there is physical room in the class.


Assignments (four assignments in total) 30%
   Assignment #1: Neural Networks Introduction (5%)
   Assignment #2: Convolutional Neural Networks (5%)
   Assignment #3: Recurrent Neural Network Language Models (10%)
   Assignment #4: Neural Model for Image Captioning / Retrieval (10%)
Research papers 20%
   Readings and reviews: Two papers a week after the break (10%)
   Presentations and discussion: One paper per semester (10%)
Group project (proposal, final presentation and web report) 50%
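As a rough illustration of how the weights in the grading breakdown above combine, here is a minimal sketch in Python. The component names and the `final_grade` function are hypothetical, and treating a missing component as zero is an assumption made for the example; the actual grade computation is up to the instructor.

```python
# Weights (percent of the final grade) taken from the grading breakdown.
WEIGHTS = {
    "assignment_1": 5,
    "assignment_2": 5,
    "assignment_3": 10,
    "assignment_4": 10,
    "paper_reviews": 10,
    "paper_presentation": 10,
    "group_project": 50,
}


def final_grade(scores):
    """Combine per-component scores (each out of 100) into a final percentage.

    Components missing from `scores` count as zero (an assumption
    made for this illustration).
    """
    assert sum(WEIGHTS.values()) == 100
    total = sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)
    return total / 100.0


if __name__ == "__main__":
    # Full marks everywhere except a skipped Assignment #3.
    scores = {name: 100.0 for name in WEIGHTS if name != "assignment_3"}
    print(final_grade(scores))  # 90.0
```

The example also shows the direct cost of skipping a single assignment; the indirect cost on later assignments and the project is not captured here.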

Assignments (30% of the grade)

Assignments in the course are designed to build the fundamental skills necessary for you to understand how to implement most state-of-the-art papers in vision, language or the intersection of the two. The assignments are designed to build on one another and will lay the foundation for your final project. So while an individual assignment may not be worth a lot of points in isolation, not doing one will likely have a significant effect on your grade as well as on your overall understanding of the material.

Deadline: Assignments will always be due on Fridays at 5pm PST.

Research papers (20% of the grade)

In the second half of the course, every week we will read 2 papers as a class (additional papers will be presented in class, but will not be required as reading for the whole class). Each student is expected to read all assigned required papers and write reviews of the selected papers. Each student will also need to present one paper in class (depending on enrollment, this is likely to be done in groups).

Reviews: Reviews should be succinct and to the point; bulleted lists are welcomed when applicable. When you present, you do not need to hand in the review. Reviews are expected to be < 1 page in 11 pt Times New Roman with 1 inch margins.

Structure of the reviews:
  Short summary of the paper (3-4 sentences)
  Main contributions (2-3 bullet points)
  Positive and negative points (2-3 bullet points each)
  What did you not understand, or what was unclear about the paper? (2-3 bullet points)

Deadlines: The reviews will be due one day before the class at 11:59pm. The reviews should be submitted via Piazza as private notes to the instructor.

Presentation: Each student will need to present a paper in class (either individually or as a group depending on enrollment). Students will be assigned to papers based on their preference (a list of choices will be solicited from the students with a ranked list of papers they want to present). The presentation itself should be accompanied by slides, and be clear and practiced. The student(s) should read the assigned paper and related work in enough detail to be able to lead a discussion and answer questions. A presentation should be roughly 30 minutes long (although that may be adjusted based on enrollment). You are allowed to take material from presentations on the web as long as you cite ALL sources fairly (including other papers if needed). However, you need to make the material your own and present it in the context of the class.

Structure of the paper presentation:
  High-level overview of the problem and motivation
  Clear statement of the problem
  Overview of the technical details of the method, including necessary background
  Relationship of the approach and method to others discussed in class
  Discussion of strengths and weaknesses of the approach
  Discussion of strengths and weaknesses of the evaluation
  Discussion of potential extensions (published or potential)

Deadline: Each student (or student group) is required to have slides ready and meet with the instructor at least 2 days before the presentation, to obtain and incorporate feedback. Students are responsible for scheduling these meetings.

Project (50% of the grade)

A major component of the course is a student project. Due to the size of the class, these are encouraged to be group projects with approximately 2-3 students (3 highly encouraged); individual projects are possible under certain circumstances with instructor approval. The scope of the project would need to be scaled appropriately based on the group size. The projects will be research oriented and each student in the group needs to contribute significantly to algorithmic components and implementation. Please start thinking about the project early.

The project can be on any interesting topic related to the course that the student comes up with himself/herself or with the help of the instructor. Some project ideas will be suggested in class. Note that re-implementing an existing paper is not sufficient. The project needs to attempt to go beyond an existing publication. The grade will depend on the project definition, how well you present it in the report, how well you position your work in the related literature, how thorough your experiments are and how thoughtful your conclusions are.

When thinking about the project, and for the proposal, you should think about:
  The overall problem you want to solve
  What dataset you are going to use (see sample list below)
  What model you will use and/or from what paper you will start
  What is the related literature you should look at
  Who on the team will specifically work on what
  How will you evaluate the performance

Deadline: In the middle of the semester you will need to hand in a project proposal and give a quick (5 minutes or less) presentation of what you intend to do. Prior to this you need to discuss the project idea with the instructor (in person or via e-mail). Final project presentations will be given during the last two lectures in the semester, where each group will present their findings in a roughly 10-15 minute presentation. The final writeup will take the form of a (password protected) project webpage and should contain links to the GitHub repository with the code. This writeup will be due on the last day of classes.


Date Topic Reading
W1: Jan 4 Introduction (slides)
- What is multi-modal learning?
- Challenges in multi-modal learning
- Course expectations and grading
W2: Jan 9 Introduction to Deep Learning (slides)
- Multi-layer Perceptron (MLP)
- Stochastic Gradient Descent
- Computational graphs
- Structure, parameters and hyper-parameters
Assignment 1 out (download here)
Deep Learning in Nature by LeCun et al.
Automatic Differentiation in Machine Learning: a Survey by Baydin et al.
W2: Jan 11 Introduction to Deep Learning (cont.) (slides)
- Regularization (L1, L2, batch norm, dropout)
- Terminology and practical advice on optimization
Introduction to Computer Vision (slides)
- History
- Basic operations and problems
- Image filtering and features
W3: Jan 15 Assignment 1 due
W3: Jan 16 Convolutional Neural Networks (part I) (slides)
- CNN Basics
- CNN as a feature representation
- Pre-training and transferability
Assignment 2 out (download here)
Chapter 9, 9.1-9.3 of Deep Learning Book
W3: Jan 18 Convolutional Neural Networks (part II) (slides)
- AlexNet, VGG, GoogLeNet, ResNet
- Fully convolutional networks (transpose convolutions)
- CNNs for object detection (RCNN, Fast RCNN, Faster RCNN, YOLO)
CNNs for Computer Vision by Srinivas et al
R-CNN by Girshick et al
W4: Jan 23 Visualizing CNNs (slides)
- Guided BackProp
- Gradient ascent
- Adversarial examples
Introduction to Natural Language Processing (slides)
- Tasks in NLP
- Why NLP is difficult
- Representing words and text
Word Representations in Vector Space by Mikolov et al
W4: Jan 24 Assignment 2 due
W4: Jan 25 Recurrent Neural Networks (part I) (slides)
- Recurrent Neural Networks
- Long Short Term Memory Networks (LSTMs)
- Gated Recurrent Units (GRUs)
Chapter 10 of Deep Learning Book
W4: Jan 27 Assignment 3 out (download here)
W5: Jan 30 Recurrent Neural Networks (part II) (slides)
- Encoder-decoder RNNs
- Translation models
- Attention models
- Applications: Image Captioning, Question Answering, Activity Recognition
W5: Feb 1 Unsupervised Representation Learning, Multimodal Learning (slides)
- Autoencoders, Denoising Autoencoders
- Stacked Autoencoders, Context Encoders
- Intro to Multimodal Learning
- Multimodal Joint Representations
W6: Feb 6 Coordinated Multimodal Learning (slides)
- Canonical Correlation Analysis (CCA)
- Joint embedding models
Unified Visual-Semantic Embeddings by Kiros et al
W6: Feb 7 Assignment 3 due
W6: Feb 8 Generative Models (slides)
- PixelRNN, VAEs
Assignment 4 out (download here)
W7: Feb 13 Generative Models (slides)
- GANs
Deep Reinforcement Learning (slides)
- Introduction
W7: Feb 15 Final Project Pitches
W7: Feb 16 Assignment 4 due
W8: Feb 20 Spring Break (no class)
W8: Feb 22 Spring Break (no class)
W9: Feb 27 Image Captioning
  1. Generating Visually Descriptive Language from Object Layouts, X. Yin, V. Ordonez, EMNLP, 2017.
    Presented by Xing Zeng -- slides
  2. Towards Diverse and Natural Image Descriptions via Conditional GAN, B. Dai, S. Fidler, R. Urtasun, D. Lin, ICCV, 2017.
    Presented by Kevin Dsouza, Ainaz Hajimoradlou -- slides
Towards Diverse and Natural Image Descriptions via Conditional GAN by Dai et al
W9: Mar 1 Image Grounding
  1. Grounding of Textual Phrases in Images by Reconstruction, A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, B. Schiele, ECCV, 2016.
    Presented by Jiaxuan Chen, Meng Li -- slides
  2. Commonly Uncommon: Semantic Sparsity in Situation Recognition, M. Yatskar, V. Ordonez, L. Zettlemoyer, A. Farhadi, CVPR, 2017.
    Presented by Xiaomeng Ju, Saeid Naderiparizi -- slides
Either of the two papers
W10: Mar 6 Visual Question Answering
  1. Ask Your Neurons: A Neural-based Approach to Answering Questions about Images, M. Malinowski, M. Rohrbach, M. Fritz, ICCV, 2015.
    Presented by Hooman Shariati, Wen Xiao -- slides
  2. Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning, A. Das, S. Kottur, J. Moura, S. Lee, D. Batra, ICCV, 2017.
    Presented by Maria Lubeznov, Weining Hu -- slides
Either of the two papers
W10: Mar 8 Video captioning
  1. Dense-captioning events in videos, R. Krishna, K. Hata, F. Ren, L. Fei-Fei, J. C. Niebles, ICCV, 2017.
    Presented by Parisa Asgharzadeh, Sijia Tian -- slides
  2. End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering, Y. Yu, H. Ko, J. Choi, G. Kim, CVPR, 2017.
    Presented by Bicheng Xu, Weirui Kong -- slides
Dense-captioning events in videos by Krishna et al
W11: Mar 13 Compositional Networks
  1. Modeling Relationships in Referential Expressions with Compositional Modular Networks, R. Hu, M. Rohrbach, J. Andreas, T. Darrell, K. Saenko, CVPR, 2017.
    Presented by Minzhi Liao, Hooman Hashemi -- slides
  2. Inferring and Executing Programs for Visual Reasoning, J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. L. Zitnick, R. Girshick, ICCV, 2017.
    Presented by Gursimran Singh, Borna Ghotbi -- slides
Neural module networks, J. Andreas, M. Rohrbach, T. Darrell, D. Klein, CVPR, 2016.
W11: Mar 15 Memory-augmented Networks
  1. Visual Reference Resolution using Attention Memory for Visual Dialog, P. Seo, A. Lehrmann, B. Han, L. Sigal, NIPS, 2017.
    Presented by Siddhesh Khandelwal, Anand Jayarajan -- slides
  2. Dynamic memory networks for visual and textual question answering, C. Xiong, S. Merity, R. Socher, ICML, 2016.
    Presented by Zaccary Alperstein, Mohit Bajaj -- slides
Either of the two papers
W12: Mar 20 Visual Storytelling
  1. The Amazing Mysteries of the Gutter: Drawing Inferences between Panels in Comic Book Narratives, M. Iyyer, V. Manjunatha, A. Guha, Y. Vyas, J. Boyd-Graber, H. Daume, L. Davis, CVPR, 2017.
    Presented by Dana Bazazeh, Michael Przystupa -- slides
  2. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books, Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, S. Fidler, ICCV, 2015.
    Presented by Itrat Akhter, Setareh Cohan -- slides
Aligning Books and Movies by Zhu et al
W12: Mar 22 Video and Sound / Multiple Modalities
  1. SoundNet: Learning Sound Representations from Unlabeled Video, Y. Aytar, C. Vondrick, A. Torralba, NIPS, 2016.
    Presented by William Qi -- slides
  2. Learning Aligned Cross-Modal Representations from Weakly Aligned Data, L. Castrejón, Y. Aytar, C. Vondrick, H. Pirsiavash, A. Torralba, CVPR, 2016.
    Presented by Alexandra Kim -- slides
Learning Aligned Cross-Modal Representations from Weakly Aligned Data by Castrejón et al
W13: Mar 27 Image-to-image translation
  1. Scribbler: Controlling Deep Image Synthesis with Sketch and Color, P. Sangkloy, J. Lu, C. Fang, F. Yu, J. Hays, CVPR, 2017.
    Presented by Hung Yu Ling, Marjan Albooyeh -- slides
  2. Image-to-image translation using cycle-consistent Adversarial Neural Networks, J.-Y. Zhu, T. Park, P. Isola, A. Efros, ICCV, 2017.
    Presented by Mohammed Suhail, Kevin Woo -- slides
Either of the two papers
W13: Mar 29 Text to Image Generation
  1. Generative Adversarial Text to Image Synthesis, S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, H. Lee, ICML, 2016.
    Presented by Ke Ma, Taylor Lundy -- slides
  2. StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks, H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, D. Metaxas, ICCV, 2017.
    Presented by Tianyang (Thomas) Liu, Polina Zablotskaia -- slides
Either of the two papers
W14: April 3 Final Project Presentations
W14: April 5 Final Project Presentations
W15: April 9 Final Project Writeups due


Related Classes

This course was very heavily inspired by courses in other places. Most notably:

as well as:


  • Deep Learning, Ian Goodfellow, Aaron Courville, and Yoshua Bengio, MIT Press


  • PyTorch: popular deep learning library that we will use in class; note the links to examples and tutorials
  • Keras: easy to use high-level neural networks API capable of running on top of TensorFlow, CNTK, Theano
  • TensorFlow: popular deep learning library from Google
  • Theano: another popular deep learning library
  • CNTK: Microsoft's deep learning cognitive toolkit library
  • scikit: Machine learning in Python


  • ImageNet: Large-scale image classification dataset
  • VQA: Visual Question Answering dataset
  • Microsoft COCO: Large-scale image detection, segmentation, and captioning dataset
  • LSMDC: Large-Scale Movie Description Dataset and challenge
  • Madlibs: Visual fill-in-the-blank dataset
  • ReferIt: Dataset of visual referring expressions
  • VisDial: Visual dialog dataset
  • ActivityNet Captions: A large-scale benchmark for dense-captioning of events in video
  • VisualGenome: A large-scale image-language dataset that includes region captions, relationships, visual questions, attributes and more
  • VIST: VIsual StoryTelling dataset
  • CLEVR: Compositional Language and Elementary Visual Reasoning dataset
  • COMICS: Dataset of annotated comics with visual panels and dialog transcriptions
  • Toronto COCO-QA: Toronto question answering dataset
  • Text-to-image coreference: multi-sentence descriptions of RGB-D scenes, annotations for image-to-text and text-to-text coreference
  • MovieQA: automatic story comprehension dataset from both video and text.
  • Charades Dataset: Large-scale video dataset for activity recognition and commonsense reasoning
  • imSitu: Situation recognition dataset with annotations of main activities, participating actors, objects, substances, and locations, and the roles these participants play in the activity.
  • MIT-SoundNet: Flickr video dataset for cross-modal recognition, including audio and video.