\documentclass{article}

\usepackage{fullpage}
\usepackage{color}
\usepackage{amsmath}
\usepackage{url}
\usepackage{verbatim}
\usepackage{graphicx}
\usepackage{parskip}
\usepackage{amssymb}
\usepackage{nicefrac}
\usepackage{listings} % For displaying code
\usepackage{algorithm2e} % pseudo-code

% Answers
\def\ans#1{\par\gre{Answer: #1}}
%\def\ans#1{} % Comment this line to produce document with answers

% Colors
\definecolor{blu}{rgb}{0,0,1}
\def\blu#1{{\color{blu}#1}}
\definecolor{gre}{rgb}{0,.5,0}
\def\gre#1{{\color{gre}#1}}
\definecolor{red}{rgb}{1,0,0}
\def\red#1{{\color{red}#1}}
\def\norm#1{\|#1\|}

% Math
\def\R{\mathbb{R}}
\def\argmax{\mathop{\rm arg\,max}}
\newcommand{\argmin}[1]{\mathop{\hbox{argmin}}_{#1}}
\newcommand{\mat}[1]{\begin{bmatrix}#1\end{bmatrix}}
\newcommand{\alignStar}[1]{\begin{align*}#1\end{align*}}
\def\half{\frac 1 2}
\def\cond{\; | \;}

% LaTeX
\newcommand{\fig}[2]{\includegraphics[width=#1\textwidth]{a2f/#2}}
\newcommand{\centerfig}[2]{\begin{center}\includegraphics[width=#1\textwidth]{a2f/#2}\end{center}}
\def\items#1{\begin{itemize}#1\end{itemize}}
\def\enum#1{\begin{enumerate}#1\end{enumerate}}


\begin{document}

\title{CPSC 540 Assignment 2 (due February 12 at midnight)}
\author{}
\date{}
\maketitle
\vspace{-4em}

The assignment instructions are the same as for the previous assignment, but for this assignment you can work in groups of 1-3. However, please only hand in one assignment for the group.


\blu{\enum{
\item Name(s):
\item Student ID(s):
}}


\pagebreak

\section{Convexity}

\subsection{From Definitions}

\blu{Show that the following functions are convex, by only using one of the definitions of convexity. In other words, don't use the ``operations that preserve convexity" or other convexity results stated in class,, but instead show that the function follows from one of the definitions of convexity we gave in class}:\footnote{That $C^0$ convex functions are below their chords, that $C^1$ convex functions are above their tangents, or that $C^2$ convex functions have a positive semidefinite Hessian.}
\enum{
\item Upper bound on second-order Taylor expansion: $f(w) = f(v) + \nabla f(v)^T(w-v) + \frac{L}{2}\norm{w-v}^2$, for $L > 0$, which is often used in the analysis of gradient descent.
\item Maximum absolute value: $f(w) = \max_{j \in \{1,2,\dots,d\}}|w_j|$.
\item The exponential loss, $f(w) = \sum_{i=1}^n \exp(-y^iw^\top x^i)$, used by AdaBoost.
}
Hint: Max and absolute value are not differentiable in general, so you cannot use the Hessian for functions involving those operations.

\pagebreak

\subsection{Using Operations that Preserve Convexity}

\blu{Show that the following functions are convex (you can use results from class and operations that preserve convexity if they help)}:
\enum{
\setcounter{enumi}{3}
\item Support vector regression: $f(w) = \sum_{i=1}^N\max\{0, |w^\top x_i - y_i| - \epsilon\} + \frac{\lambda}{2}\norm{w}_2^2$.
\item Mixed-norm regularization: $f(w) = \norm{w}_{p,q} = \left(\sum_{g\in\mathcal{G}}\norm{w_g}^p_q\right)^{\frac{1}{p}}$ (for $p \geq 1$ and $q \geq 1$, and $\mathcal{G}$ partitioning the variables into a set of ``groups''). This is a generalization of L1-regularization that encourages sparsity in groups of variables (setting entire groups to 0). Hint: show that this is a norm.
\item Negative minimum of entropy over pairs of variables: $f(w) = -\min_{\{i,j \; | \; i \neq j\}}\{-w_i\log w_i - w_j\log w_j \}$ subject to $w_i >0$ for all $i$.
}

\subsection{Strong Convexity}

\blu{Which of the functions 1-6 in the previous subsections are strongly convex?}

\pagebreak

\section{Discrete and Gaussian Variables}


\subsection{MLE for General Discrete Distribution}

Consider a density estimation task, where we have two variables ($d=2$) that can each take one of $k$ discrete values. For example, we could have
\[
X = \mat{1 & 3\\4 & 2\\$k$ & 3\\1 & $k-1$}.
\]
The likelihood for example $x^i$ under a general discrete distribution would be
\[
p(x^i_1, x^i_2 \cond \Theta) = \theta_{x_1^i,x_2^i},
\]
where $\theta_{c_1,c_2}$ gives the probability of $x_1$ being in state $c_1$ and $x_2$ being in state $c_2$, for all the $k^2$ combinations of the two variables. In order for this to define a valid probability, we need all elements $\theta_{c_1,c_2}$ to be non-negative and they must sum to one, $\sum_{c_1=1}^k\sum_{c_2=1}^k \theta_{c_1,c_2} = 1$.
\enum{
\item Given $n$ training examples, \blu{derive the MLE for the $k^2$ elements of $\Theta$}.
\item \red{Note that this question was updated on February 4th.}
If we had separate parameters $\theta_{1,c_1}$ and $\theta_{2,c_2}$ for each variable, a standard for the prior would be a product of independent Dirichlet distributions,
\[
p(\Theta) \propto  \left[\prod_{c_1} \theta_{1,c_1}^{\alpha_{c_1}-1}\right]\left[\prod_{c_2} \theta_{c_2}^{\alpha_{c_2}-1}\right],
\]
where in this case the two variables share $k$ hyper-parameters $\alpha_1, \alpha_2, \dots, \alpha_k$. Alternately, we could use a Dirichlet distribution over the $k^2$ variables,
\alignStar{
p(\Theta) & \propto \prod_{c_1=1}^k\prod_{c_2=1}^k \theta_{c_1,c_2}^{\alpha_{c_1,c_2}-1},
}
with $k^2$ hyper-parameters. An alternate approach that could be considered is to use
\alignStar{
p(\Theta) & \propto \prod_{c_1=1}^k\prod_{c_2=1}^k \theta_{c_1,c_2}^{\alpha_{c_1}+\alpha_{c_2}-2},
}
which only uses $k$ hyper-parameters to define a prior over the $k^2$ values.
\blu{Derive the MAP estimate under this prior} (assuming we use $k^2$ variables to parameterize $\Theta$)
\item We often use discrete distributions as parts of more-complicated distributions (like mixture models and graphical models). In these cases we often need to fit a weighted NLL for the form
\[
f(\Theta) = -\sum_{i=1}^n v_i \log p(x_1^i, x_2^i \cond \Theta),
\]
with $n$ non-negative weights. \blu{What is the MLE for the $k^2$ elements of $\Theta$ under this weighting.}
\item Consider the following alternate parameterization of the joint distribution: we use $\theta_{c_1}$ as the (``marginal'') probability that variable 1 is in state $c_1$ ($p(x_1 = c_1) = \theta_{c_1}$) and $\theta_{c_2 \cond c_1}$ as the (``conditional'') probability that variable 2 is in state $c_2$ conditioned on variable 1 being in state $c_1$ ($p(x_2 = c_2 \cond x_1 = c_1) =\theta_{c_2 \cond c_1}$).\footnote{So we could use $p(x_1, x_2) = p(x_2 \cond x_1)p(x_1)$ to get the joint probability.} \blu{Give the MLE for the $k$ elements of $\theta_{c_1}$ and $k^2$ elements of $\theta_{c_2 \cond c_1}$} (you don't need to give the full derivation, but should know how this would be done).
}
Hint: it may be convenient to write the discrete likelihood for an example $i$ in the form
\[
p(x^i \cond \Theta) = \prod_{c \in [k]^2}\theta_c^{\mathcal{I}[x^i = c]},
\]
where $c$ is a vector containing $(c_1,c_2)$, $[x^i = c]$ evaluates to 1 if all elements are equal, and $[k]^2$ is all ordered pairs $(c_1,c_2)$. You can use a Lagrangian to enforce the sum-to-1 constraint on the log-likelihood (bug the TAs to show you how to do this in office hours or tutorial if you don't know how), and you may find it convenient to define $n_c = \sum_{i=1}^n \mathcal{I}[x^i = c]$. Note that the solutions you find using the Lagrangian will already satisfy the non-negative constraint on the parameters.

\pagebreak

\subsection{Gaussian Self-Conjugacy}

Consider $n$ IID samples $x^i$ distributed according to a Gaussian with mean $\mu$ and covariance $\sigma^2 I$,
\[
x^i \sim \mathcal{N}(\mu, \sigma^2 I).
\]
Assume that $\mu$ itself is distributed according to a Gaussian
\[
\mu \sim \mathcal{N}(\mu_0,\Sigma_0),
\]
with mean $\mu_0$ and (positive-definite) covariance $\Sigma_0$. In this setting, the posterior for $\mu$ also follows a Gaussian distribution.\footnote{Because of the prior and posterior coming from the same family, we say that the Gaussian distribution is the ``conjugate prior'' for the Gaussian mean parameter (we'll formally discuss conjugate priors later in the course). Another reason the Gaussian distribution is important is that is the only (non-trivial) continuous distribution that has this ``self-conjugacy'' property.}

 \blu{Derive the form of the posterior distribution, $p(\mu\cond X, \sigma^2, \mu_0, \Sigma_0)$.}\\
Hint: the posterior is a product of Gaussian densities.

\pagebreak

\subsection{Generative Classifiers with Gaussian Assumption}

Consider the 3-class classification dataset in this image:
%\centerfig{.4}{sample}
In this dataset, we have 2 features and each colour represents one of the classes. Note that the classes are highly-structured: the colours each roughly follow a Gausian distribution plus some noisy samples.

Since we have an idea of what the features look like for each class, we might consider classifying  inputs $x$ using a \emph{generative classifier}. In particular, we are going to use Bayes rule to write
\[
p(y^i=c\cond x^i,\Theta) = \frac{p(x^i\cond y^i=c, \Theta) \cdot p(y^i=c\cond\Theta)}{p(x^i\cond\Theta)},
\]
where $\Theta$ represents the parameters of our model. To classify a new example $\tilde{x}^i$, generative classifiers would use
\[
\hat{y}^i = \argmax_{y \in \{1,2,\dots,k\}} p(\tilde{x}^i\cond y^i=c,\Theta)p(y^i=c\cond\Theta),
\]
where in our case the total number of classes $k$ is $3$.\footnote{The denominator $p(\tilde{x}^i\cond\Theta)$ is irrelevant to the classification since it is the same for all $y$.}
% and $\theta_c$ is the set of parameters associated class $c$.
Modeling $p(y^i=c\cond\Theta)$ is easy: we can just use a $k$-state categorical distribution,
\[
p(y^i = c \cond \Theta) = \theta_c,
\]
where $\theta_c$ is a single parameter for class $c$. The maximum likelihood estimate of $\theta_c$ is given by $n_c/n$, the number of times we have $y^i = c$ (which we've called $n_c$) divided by the total number of data points $n$.

Modeling $p(x^i \cond y^i =c, \Theta)$ is the hard part: we need to know the \emph{probability of seeing the feature vector $x^i$ given that we are in class $c$}. This corresponds to solving a density estimation problem for each of the $k$ possible classes. 
To make the density estimation problem tractable, we'll assume that the distribution of $x^i$ given that $y^i=c$ is given by a $\mathcal{N}(\mu_c,\Sigma_c)$ Gaussian distribution for a class-specific $\mu_c$ and $\Sigma_c$,
\[
p(x^i \cond y^i=c, \Theta) = \frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma_c|^{\half}}\exp\left(-\half (x^i-\mu_c)^T\Sigma_c^{-1}(x^i-\mu_c)\right).
\]
Since we are distinguishing between the probability under $k$ different Gaussians to make our classification, this is called \emph{Gaussian discriminant analysis} (GDA). In the special case where we have a constant $\Sigma_c = \Sigma$ across all classes it is known as \emph{linear discriminant analysis} (LDA) since it leads to a linear classifier between any two classes (while the region of space assigned to each class forms a convex polyhedron as in $k$-means clustering and softmax classification). Another common restriction on the $\Sigma_c$ is that they are diagonal matrices, since this only requires $O(d)$ parameters instead of $O(d^2)$ (corresponding to assuming that the features are independent univariate Gaussians given the class label).
Given a dataset $\mathcal{D}=\{(x^i, y^i)\}_{i=1}^n$, where $x^i\in\R^d$ and $y^i\in\{1,\ldots,k\}$, the maximum likelihood estimate (MLE) for the $\mu_c$ and $\Sigma_c$ in the GDA model is the solution to
\[
\argmax_{\mu_1,\mu_2,\dots,\mu_k,\Sigma_1,\Sigma_2,\dots,\Sigma_k} \prod_{i=1}^n p(x^i \cond y^i, \mu_{y^i},\Sigma_{y^i}).
\]
This means that the negative log-likelihood will be  equal to
\alignStar{
- \log p(X\cond y,\Theta) & = -\sum_{i=1}^n \log p(x^i , y^i \cond \mu_{y^i},\Sigma_{y^i})\\
& = \sum_{i=1}^n \frac{1}{2}(x^i - \mu_{y^i})^T\Sigma_{y^i}^{-1}(x^i - \mu_{y^i}) + \half\sum_{i=1}^n \log|\Sigma_{y^i}| + \text{const.}
}
In class we derived the MLE for this model under the assumption that we use full covariance matrices and that each class has its own covariance.
\enum{
\item \blu{Derive the MLE for the GDA model under the assumption of \emph{common diagonal covariance} matrices}, $\Sigma_c = D$ ($d$ parameters). (Each class will have its own mean $\mu_c$.)
\item \blu{Derive the MLE for the GDA model under the assumption of \emph{individual scaled-identity} matrices}, $\Sigma_c = \sigma_c^2 I$ ($k$ parameters).
\item When you run \emph{example\_generative} it loads a variant of the dataset in the figure that has 12 features and 10 classes. This data has been split up into a training and test set, and the code fits a $k$-nearest neighbour classifier to the training set then reports the accuracy on the test data (around $\sim 63\%$ test error). The $k$-nearest neighbour model does poorly here since it doesn't take into account the Gaussian-like structure in feature space for each class label. Write a function \emph{gda} that fits a GDA model to this dataset (using individual full covariance matrices). \blu{Hand in the function and report the test set accuracy}.
\item In this question we would like to replace the Gaussian distribution of the previous problem with the more robust multivariate-t distribution so that it isn't influenced as much by the noisy data.
Unlike the previous case, we don't have a closed-form solution for the parameters. However, if you run \emph{example\_student} it generates random noisy data and fits a multivariate-t model. By using the \emph{studentT} model, write a new function \emph{tda} that implements a generative model that is based on the multivariate-t distribution instead of the Gaussian distribution. \blu{Report the test accuracy  with this model.}
}
Hints: you may be able to substantially simplify the notation in the MLE derivations if you use the notation $\sum_{i \in y_c}$ to mean the sum over all values $i$ where $y^i = c$. Similarly, you can use $n_c$ to denote the number of cases where $y_i = c$, so that we have $\sum_{i \in y_c}1 = n_c$. Note that the determinant of a diagonal matrix is the product of the diagonal entries, and the inverse of a diagonal matrix is a diagonal matrix with the reciprocals of the original matrix along the diagonal. 

For the implementation you can use the result from class regarding the MLE of a general multivariate Gaussian. At test time for GDA, you may find it more numerically reasonable to compare log probabilities rather than probabilities of different classes, and you may find it helpful to use the \emph{logdet}  function to compute the log-determinant in a more numerically-stable way than taking the log of the determinant. (Also, don't forget to center at training and test time.)

For the last question, you may find it helpful to define an empty array that can be filled with $k$ \emph{DensityModel} objects  using:
\begin{verbatim}
   subModel = Array{DensityModel}(undef,k)
\end{verbatim}

\pagebreak

\section{Mixture Models and Expectation Maximization}


\subsection{Categorical Mixture Model}

Consider density estimation with examples $x^i \in \{1,2,\dots,k\}^d$ representing a set of $d$ categorical variables. In this setting, a natural way to model an individual variable $x^i_j$ would be with a categorical distribution,
\[
p(x^i_j = c \cond \theta_{j,1}, \theta_{j,2}, \dots, \theta_{j,k}) = \theta_{j,c},
\]
so the joint distribution would be
\[
p(x^i \cond \theta_{j,1}, \theta_{j,2}, \dots, \theta_{j,k}) = \prod_{j=1}^d \theta_{j,x^i_j},
\]
where all $\theta_{j,c} \geq 0$ and $\sum_{c=1}^k \theta_{j,c} = 1$ for each feature $j$.
However, in this distribution the variables would are independent. One way to model dependent count variables would be with a mixture of $m$ independent categorical distributions,
\[
p(x^i\cond  \Theta) = \sum_{c=1}^{m}p(z^i = c\cond \Theta^t)p(x^i \cond z^i = c, \Theta^t) =  \sum_{c = 1}^m \pi_c \prod_{j=1}^d \theta_{j,x_j^i}^{c},
\]
where:
\items{
\item $\pi_c$ is the probability that the examples comes from mixture $c$, $p(z^i = c \cond \Theta) = \pi_c$. 
\item $\theta_{j,c'}^c$ is the probability that $x_j^i$ is $c'$ for examples from mixture $c$, $p(x_j^i = c' \cond z^i = c, \Theta)$.
\item We use $\Theta$ as the set containing all the $\pi_c$ and $\theta_{j,c'}^c$ values.
}
\blu{Derive the EM update for this mixture model} (treating the $z^i$ as missing values).

Hint: a lot of the work has been done for you in the EM notes on the course webpage.

\pagebreak

\subsection{Semi-Supervised Gaussian Discriminant Analysis}

Consider fitting a GDA model where some of the $y^i$ values are missing at random. In particular, let's assume we have a set of $n$ labeled examples $(x^i,y^i)$ and another set of $t$ unlabeled examples $(x^i)$. This is a special case of \emph{semi-supervised learning}, and fitting generative models with EM is one of the oldest semi-supervised learning techniques. When the classes exhibit clear structure in the feature space, it can be very effective even if the number of labeled examples is very small.

\enum{
\item \blu{Derive the EM update for fitting the parameters of a GDA model (with individual full covariance matrices) in the semi-superivsed setting where we have $n$ labeled examples and $t$ unlabeled examples}.
\item If you run the demo \emph{example\_SSL.jl}, it will load a variant of the dataset from the GDA, but where the number of labeled examples is small and a large number of unlabeled examples are available. The demo first fits a KNN model and then a generative Gaussian model (once you are finished the GDA question and assuming that you put your \emph{gda} function in a file named \emph{gda.jl}).
Because the number of labeled examples it quite small, the performance is worse than in the GDA question. Write a function \emph{gdaSSL} that fits the generative Gaussian model of the previous question using EM to incorporate the unlabeled data. \blu{Hand in the function and report the test error when training on the full dataset}.
\item Repeat the previous part, but using the imputation approach (``hard''-EM) where we explicitly classify all the unlabeled examples before each model update. \blu{How does this change the performance and the number of iterations}?
}

Hint: for the first question most of the work has been done for you in the EM notes on the course webpage. You can use the result (**) and the update of $\theta_c$ from those notes, but you will need to work out the update of the parameters of the Gaussian distribution $p(x^i | y^i, \Theta)$.

For the second question, although EM often leads to simple updates, implementing them correctly can often be a pain. One way to help debug your code is to compute the observed-data log-likelihood after every iteration. If this number goes down, then you know your implementation has a problem. You can also test your updates of sets of variables in this way too. For example, if you hold the $\mu_c$ and $\Sigma_c$ fixed and only update the $\theta_c$, then the log-liklihood should not go down. In this way, you can test each combination of updates on its own to make sure they are correct.

\pagebreak

\section{Very-Short Answer Questions}


Give a short and concise 1-sentence answer to the below questions.
\enum{
\item What is a situation where we can violate the golden rule on our test set, and still have it be a reasonable approximation of test error?
\item Suppose we have a way to generate new IID samples for our problem. If we are fitting a model with SGD, what would be the key advantage of using new IID samples as opposed to sampling from a fixed training set?
\item Is the minimum of a convex function achieved by a unique input value? What about a strictly-convex function or a strongly-convex function?
\item How can we use density estimation for supervised learning?
\item How many parameters does the general discrete distribution have when each feature is a categorical variable that can take up to $k$ values.
\pagebreak
\item What is the relationship between the covariance matrix $\Sigma$ and the precision matrix $\Theta$ of a multivariate Gaussian?
\item Suppose we run the graphical LASSO method and it returns a tri-diagonal precision matrix. What would the graph look like?
\item Suppose we have a lot of extreme outliers in our dataset. Why is this less of a problem for a mixture of Gaussians than if we use a single Gaussian?
\item Why do the $\pi_c$ need to be a convex combination in mixture models?
\pagebreak
\item What is the relationship between the supervised naive Bayes model and the unsupervised mixture of Bernoullis model?
\item What is the difference between ``missing at random'' and ``missing completely at random''?
\item What is an advantage of the Epanechnikov kernel over the Gaussian kernel.
\item How does parameter tieing allow us to model training examples that have different sizes?
}
\pagebreak


\section*{CPSC 540 Project Proposal}

\blu{This section is only to be done by students enrolled in CPSC 540.}

For the final part of this assignment, you must a \blu{submit a project proposal} for your course project. The proposal should be a maximum of 2 pages (and 1 page or half of a page is ok if you can describe your plan concisely). The proposal should be written for me and the TAs, so you don't need to introduce any ML background but you will need to introduce non-ML topics. The projects must be done in groups of 2-3, and will be submitted separately from the assignment.

There is quite a bit of flexibility in terms of the type of project you do, as I believe there are many ways that people can make valuable contributions to research. However, note that ultimately the project will have three parts:
\enum{
\item A very short paper review summarizing the pros and cons of a particular paper on the topic (due with Assignment 3).
\item A short literature review summarizing at least 10 papers on a particular topic (due with Assignment 4).
\item A final report containing at most 6 pages of text (the actual document can be longer due to figures, tables, references, and proofs) that emphasizes a particular ``contribution" (i.e., what doing the project has added to the world).
}

\blu{The three mains ingredients of the project proposal are:
\begin{enumerate}
\item What problem you are focusing on.
\item What you plan to do.
\item What will be the ``contribution".
\end{enumerate}
}
Also, note that for the course project that negative results (like ``we tried something that we thought we would work in a particular setting but it didn't work'') are acceptable (and often unavoidable).

Here are some standard project ``templates" that you might want to follow:

\items{
\item \textbf{Application bake-off}: you pick a specific application (from your research, personal interests, or maybe from Kaggle) or a small number of related applications, and try out a bunch of techniques (e.g., random forests vs. logistic regression vs. generative models). In this case, the contribution would be showing that some methods work better than others for this specific application (or your contribution could be that everything works equally well/badly).
\item \textbf{New application}: you pick an application where where people aren't using ML, and you test out whether ML methods are effective for the task. In this case, the contribution would be knowing whether ML is suitable for the task.
\item \textbf{Scaling up}: you pick a specific machine learning technique, and you try to figure out how to make it run faster or on larger datasets (for example, how do we apply kernel methods when $n$ is very large). In this case, the contribution would be the new technique and an evaluation of its performance, or could be a comparison of different ways to address the problem.
\item \textbf{Improving performance}: you pick a specific machine learning technique, and try to extend it in some way to improve its performance (for example, how can we efficiently use non-linearity within graphical models). In this case, the contribution would be the new technique and an evaluation of its performance.
\item \textbf{Generalization to new setting}: you pick a specific machine learning technique, and try to extend it to a new setting (for example, making a graphical-model version of random forests).  In this case, the contribution would be the new technique and an evaluation of its performance, or could be a comparison of different ways to address the problem.
\item \textbf{Perspective paper}: you pick a specific topic in ML, read a larger number of papers on the topic, then write a report summarizing what has been done on the topic and what are the most promising directions of future work. In this case, the contribution would be your summary of the relationships between the existing works, and your insights about where the field is going.
\item \textbf{Coding project}: you pick a specific method or set of methods (like independent component analysis), and build an implementation of them. In this case, the contribution could be the implementation itself or a comparison of different ways to solve the problem.
\item \textbf{Theory}: you pick a theoretical topic (like the variance of cross-validation or the convergence of proximal stochastic gradient in the non-convex setting), read what has been done about it, and try to prove a new result (usually by relaxing existing assumptions or adding new assumptions). The contribution could be a new analysis of an existing method, or why some approaches to analyzing the method will not work.
}
The above are just suggestions, and many projects will mix several of these templates together, but if you are having trouble getting going then it's best to stick with one of the above templates. Also note that the above includes topics not covered in the course (like random forests), so there is flexibility in the topic, but the topic should be closely-related to ML.

\blu{This question is mandatory but will not be formally marked: it's just a sanity check that you have at least one project idea that fits within the scope of a 540 course project, and it's an excuse for you to allocate some time to thinking about the project.} Also, there is flexibility in the choice of project topics even after the proposal: if you want to explore different topics you can ultimately choose to do a project that is unrelated to the one in your proposal/paper-review/literature-review, although it will likely be easier to do all 4 parts on the same topic.


\end{document}