\documentclass{article}

\usepackage{fullpage}
\usepackage{color}
\usepackage{amsmath}
\usepackage{url}
\usepackage{verbatim}
\usepackage{graphicx}
\usepackage{parskip}
\usepackage{amssymb}
\usepackage{nicefrac}
\usepackage{listings} % For displaying code
\usepackage{algorithm2e} % pseudo-code
\usepackage{enumitem}

% Answers
\def\ans#1{\par\gre{Answer: #1}}
%\def\ans#1{} % Comment this line to produce document with answers

% Colors
\definecolor{blu}{rgb}{0,0,1}
\def\blu#1{{\color{blu}#1}}
\definecolor{gre}{rgb}{0,.5,0}
\def\gre#1{{\color{gre}#1}}
\definecolor{red}{rgb}{1,0,0}
\def\red#1{{\color{red}#1}}
\def\norm#1{\|#1\|}

% Math
\def\R{\mathbb{R}}
\def\argmax{\mathop{\rm arg\,max}}
\def\argmin{\mathop{\rm arg\,min}}
\newcommand{\mat}[1]{\begin{bmatrix}#1\end{bmatrix}}
\newcommand{\alignStar}[1]{\begin{align*}#1\end{align*}}
\def\half{\frac 1 2}

% LaTeX
\newcommand{\fig}[2]{\includegraphics[width=#1\textwidth]{a1f/#2}}
\newcommand{\centerfig}[2]{\begin{center}\includegraphics[width=#1\textwidth]{a1f/#2}\end{center}}
\def\items#1{\begin{itemize}#1\end{itemize}}
\def\enum#1{\begin{enumerate}#1\end{enumerate}}
\def\splitenum{\end{enumerate}\clearpage\begin{enumerate}[resume]}


\begin{document}

\title{CPSC 440/540 Machine Learning (January-April, 2021)\\Assignment 1 (due January 22nd at midnight)}
\author{}
\date{}
\maketitle
\vspace{-4em}


\textbf{IMPORTANT!!!!! Please carefully read the submission instructions that will be posted on Piazza We will deduct 50\% on assignments that do not follow the instructions}.

Most of the questions below are related to topics covered in CPSC 340, or other courses listed on the prerequisite form. There are several ``notes'' available on the webpage which can help with some relevant background.

If you find this assignment to be difficult overall, that is an early warning sign that you may not be prepared to take CPSC 440
at this time. Future assignments will be longer and more difficult than this one.

We use \blu{blue} to highlight the deliverables that you must answer/do/submit with the assignment.

\section*{Basic Information}


\blu{\enum{
\item Name:
\item Student ID:
%\item Faculty (e.g., applied science):
%\item Department (e.g., computer science):
%\item Have you submitted the prereq form?
}}


\clearpage
\section{Very-Short Answer Questions}


Give a short and concise 1-2 sentence answer to the below questions. All acronyms and methods are as used in CPSC 340/532M.

\blu{
\enum{
\item A common pre-processing operation is to \emph{standardize} our features. This operation replaces each $x_j^i$ with $(x_j^i - \mu_j)/\sigma_j$, where $\mu_j$ is the mean of feature $j$ on the training data and $\sigma_j$ is the standard deviation of feature $j$ on the training data. When doing predictions with the model, some codes standardize the test features $\tilde{x}_j^i$ with a mean $\tilde{\mu}_j$ and standard deviation $\tilde{\sigma}_j$ computing based on the test data. Describe why this usually is a mistake, and describe a setting where it can lead to terrible performance even if the training and test data come IID from the same distribution.
\item What is the difference between a test set error and the test error?
\item  In the deep learning world, \emph{neural architecture search} describes the process of searching for the hyper-parameters of a neural network model (like the number of layers, the step size, whether to use convolutions, the regularization parameters, and so on) in order to optimize the performance on a fixed validation set. What is a potential problem with this approach?
\splitenum
\item In a parametric model, what is the effect of the number of features $d$ that our model uses on the training error and on the approximation error (the difference between the training error and test error)?  
\item Give a way to set the depth and width of a neural network that makes the model parametric, and a choice that makes the model non-parametric.
\item You can fit a decision stump by finding the stump that maximizes the number of times $\hat{y}^i = y^i$ in the training data. Why do we not try to maximize this quantity when we fit the regression weights $w$ in linear regression models?
\splitenum
\item  How does $\lambda$ in an L1-regularizer affect the sparsity pattern of the solution (number of $w_j$ set to exactly 0), the training error, and the approximation error?
\item Minimizing the squared error used by in k-means clustering is NP-hard. What would be the significance of an algorithm that minimizes this error that runs in time $O(nd^2 + d^3)$.
\item Consider the regression objective given by $f(w) =\sum_{i=1}^n\max\{0,|y^i - w^Tx^i| - \epsilon\} + \frac{\lambda}{2}\norm{w}^2$ for some number $\epsilon$. Describe a method to minimize this objective function (or closely approximate the minimum).
\splitenum
\item Consider an ensemble clustering method that generates $m$ different bootstraps of the data. It then fits a k-means model (with a random initialization) to each of the boostraps. To form the final clustering for example $x_i$, it chooses the $y_i$ that is most common across the $m$ clusterings. Would this be an effective or an inneffective ensemble method? (Briefly explain.)
\item What does minimizing $f(w)=||Xw-y||_1+\lambda ||w||^2$ correspond to in the MLE/MAP view?
\item If you perform MLE with a Gaussian likelihood, would the coefficient $\hat{w}$ change if you standardized the features? Would the predictions $\hat{y}^i$ change? Which of these would change if you perform MAP estimation with a Gaussian likelihood and Gaussian prior.
\splitenum
\item Do you expect the resulting loss from running NMF 
to be lower, higher, or the same as the loss from running PCA on the same data? 
Briefly justify your answer.
\item Suppose we need to multiply a huge number of matrices to compute a product like $A_1 A_2A_3 \cdots A_k$. The matrices have wildly-different sizes so the order of multiplication will affect the runtime. For example, $A_1(A_2A_3)$ may be faster to compute than $(A_1A_2)A_3$. Describe (at a high level) an $O(k^3)$-time algorithm that finds the lowest-cost order to multiply the matrices.
\item You have a supervised learning dataset $\{X,y\}$. You fit a 1-hidden-layer neural network using stochastic gradient descent to minimize the squared error, that makes predictions of the form $\hat{y}^i = v^\top Wx^i$ where $W$ and $v$ are the parameters. Explain why or why not this neural network can achieve a better training accuracy than the basic linear regression model $\hat{y}^i = w^\top x^i$.
}
}


\clearpage
\section{Coding Questions}



If you have not previously used Julia, there is a list of useful Julia commands (and syntax) among the list of notes on the course webpage.

\subsection{Regularization and Hyper-Parameter Tuning}


Download \emph{a1.zip} from the course webpage, and start Julia (latest version) in a directory containing the extracted files. If you run the script \emph{example\_nonLinear} (from Julia's REPL), it will:
\enum{
\item Load a one-dimensional regression dataset.
\item Fit a least-squares linear regression model.
\item Report the test set error.
\item Draw a figure showing the training/testing data and what the model looks like.
}
This script uses the \emph{JLD} package to load the data and the \emph{PyPlot} package to make the plot. If you have not previously used these packages, they can be installed using:
\begin{verbatim}
using Pkg
Pkg.add("JLD")
Pkg.add("PyPlot")
\end{verbatim}

Unfortunately, this is not a great model of the data, and the figure shows that a linear model is probably not suitable.
\enum{
\item Write a function called \emph{leastSquaresRBFL2} that implements \emph{least squares using Gaussian radial basis functions (RBFs) and L2-regularization}. \\You should start from the \emph{leastSquares} function and use the same conventions: $n$ refers to the number of training examples, $d$ refers to the number of features, $X$ refers to the data matrix, $y$ refers to the targets, $Z$ refers to the data matrix after the change of basis, and so on. Note that you'll have to add two additional input arguments ($\lambda$ for the regularization parameter and $\sigma$ for the Gaussian RBF variance) compared to the \emph{leastSquares} function. To make your code easier to understand/debug, you may want to define a new function \emph{rbfBasis} which computes the Gaussian RBFs for a given training set, testing set, and $\sigma$ value. \blu{Hand in your function and the plot generated with $\lambda = 1$ and $\sigma = 1$.}\\
Note:  the \emph{distancesSquared} function in \emph{misc.jl} is a vectorized way to quickly compute the squared Euclidean distance between all pairs of rows in two matrices.
\item Modify the script to split the training data into a ``train'' and ``validation'' set (you can use half the examples for training and half for validation), and use these to select $\lambda$ and $\sigma$. \blu{Hand in your modified script and the plot you obtain with the best values of $\lambda$ and $\sigma$.}
\item There are reasons why this dataset is particularly well-suited to Gaussian RBFs are that (i) the period of the oscillations stays constant and (ii) we have evenly sampled the training data across its domain. If either of these assumptions are violated, the performance with our Gaussian RBFs might be much worse.
\blu{Consider a scenario where either (i) or (ii) is violated, and describe a way that you could address this problem.}
}






\clearpage
\subsection{Multi-Class Logistic Regression}


The script \emph{example\_multiClass.jl} loads a multi-class classification dataset and fits a ``one-vs-all'' logistic regression classifier using the \emph{findMin} gradient descent implementation, then reports the validation error and shows a plot of the data/classifier. The performance on the validation set is ok, but could be much better. For example, this classifier never even predicts some of the classes.

Using a one-vs-all classifier hurts performance because the classifiers are fit independently, so there is no attempt to calibrate the columns of the matrix $W$. An alternative to this independent model is to use the softmax probability,
\[
p(y^i | W, x^i) = \frac{\exp(w_{y^i}^\top x^i)}{\sum_{c=1}^k\exp(w_c^\top x^i)}.
\]
Here $c$ is a possible label and $w_{c}$ is column $c$ of $W$. Similarly, $y^i$ is the training label, $w_{y^i}$ is column $y^i$ of $W$. The loss function corresponding to the negative logarithm of the softmax probability is given by
\[
f(W) = \sum_{i=1}^n \left[-w_{y^i}^\top x^i + \log\left(\sum_{c' = 1}^k \exp(w_{c'}^\top x^i)\right)\right].
\]
Make a new function, \emph{softmaxClassifier}, which fits $W$ using the softmax loss from the previous section  instead of fitting $k$ independent classifiers. \blu{Hand in the code and report the validation error}.

Hint: you can use the \emph{derivativeCheck} option when calling \emph{findMin} to help you debug the gradient of the softmax loss. Also, note that the \emph{findMin} function treats the parameters as a vector (you may want to use \emph{reshape} when writing the softmax objective).

\clearpage
\subsection{Principal Component Analysis}

The script \emph{example\_PCA} will load a dataset containing 50 examples, each representing an animal. The 85 features are traits of these animals. The script gives two unsatisfying visualizations of it. First it shows a plot of the matrix entries, which has too much information and thus gives little insight into the relationships between the animals. Next it shows a scatterplot based on two random features and displays the name of 10 randomly-chosen animals. Because of the binary features even a scatterplot matrix shows us almost nothing about the data.

The function \emph{PCA} applies the classic PCA method (orthogonal bases via SVD) for a given $k$. Using this function, modify the demo so that the scatterplot uses the latent features $z_i$ from the PCA model with $k=2$. Make a scatterplot of the two columns in $Z$, and use the \emph{annotate} function to label a bunch of the points in the scatterplot.
\enum{
\item  \blu{Hand in your modified demo and the scatterplot.}
\item \blu{Which trait of the animals has the largest influence (absolute value) on the first principal component?} (Make sure not to forget the ``+1'' when looking for the name of the trait in the \emph{dataTable}).
\item \blu{Which trait of the animals has the largest influence (absolute value) on the second principal component?}
\item \blu{How much of the variance in $X$ (after centering) is explained by our two-dimensional representation from the previous question?}\\
Note: you can compute the Frobenius norm of a matrix using the function \emph{norm}. Also, note that the ``variance explained'' formula from CPSC 340 assumes that $X$ is already centered.
\item \blu{How many\ PCs are required to explain 50\% of the variance in the data?}
}


\clearpage
\section{Calculation Questions}






\subsection{Minimizing Strictly-Convex Quadratic Functions}

Solve for the minimizer $w$ of the below strictly-convex quadratic functions:
\blu{\enum{
\item $f(w) = \frac{1}{2}w^\top \Lambda w + u^\top w + \lambda$ (general quadratic).
\item $f(w)= \half (Xw - y)^\top\Sigma^{-1}(Xw-y) + \frac \lambda 2 \norm{w}^2$ (L2-regularized least squares with weight matrix $\Sigma$).
\item $f(w) = \frac{1}{2}\sum_{i=1}^n v_i(w^\top x^i - y^i)^2 + \frac{\lambda}{2}\norm{w-u}^2$ (weighted least squares shrunk towards $u$).
}}
Above we use our usual supervised learning notation. In addition, we assume that $u$ is $d \times 1$ and $v$ is $n \times 1$, $\lambda$ is a positive scalar, and $\Sigma$ and $\Lambda$ are symmetric positive-definite matrices. You can use $V$ as a diagonal matrix with $v$ along the diagonal (with the $v_i$ non-negative). You can use $I$ as an identity matrix. Hint: positive-definite matrices are invertible.



\clearpage
\subsection{MAP Estimation}


In 340, we showed that under the assumptions of a Gaussian likelihood and Gaussian prior,
\[
y^i \sim \mathcal{N}(w^\top x^i,1), \quad w_j \sim \mathcal{N}\left(0,\frac{1}{\lambda}\right),
\]
that the MAP estimate is equivalent to solving the L2-regularized least squares problem
\[
f(w) = \frac{1}{2}\sum_{i=1}^n (w^\top x^i - y^i)^2 + \frac \lambda 2 \sum_{j=1}^d w_j^2,
\]
in the ``loss plus regularizer'' framework.
For each of the alternate assumptions below, write it in the ``loss plus regularizer'' framework (simplifying as much as possible):
\blu{\enum{
\item Laplace likelihood (with a scale of 1) for each training example and Laplace prior centered at $u$ with scale $1/\lambda$.
\[
y^i \sim \mathcal{L}(w^\top x^i,1), \quad w_j \sim \mathcal{L}\left(u,1/\lambda\right),
\]
where $u$ is $d \times 1$.
\item Gaussian likelihood with a separate variance $\sigma_i^2$ for each training example, and Gaussian prior with a separate variance $1/\lambda_j$ for each variable,
\[
y^i \sim \mathcal{N}(w^Tx^i,\sigma_i^2), \quad w_j \sim \mathcal{N}\left(0,\frac{1}{\lambda_j}\right).
\]
\item  Time-independent censored survival analysis likelihood with a uniform prior,
\[
p(y^i, v^i | x^i, w) = \exp(v^iw^Tx^i)\exp(-y^i\exp(w^Tx^i)), \quad p(w_j) \propto \kappa
\]
Here, $y^i$ is a positive number giving the latest time that we observed patient $i$, and $v^i = 1$ if patient $i$ has quit the study while $v^i = 0$ if they are still in it.\footnote{This likelihood can be used in regression settings to estimate survival times when some patients are still alive.}
}}
For this question, you do not need to convert to matrix notation.


\clearpage
\subsection{Machine Learning Model Memory and Time Complexities}


Answer the following questions using big-O notation. 
Your answers may involve $n$, $d$, and perhaps additional quantities defined in the question. 
As an example, (linear) least squares model has $O(d)$ parameters and requires $O(nd^2 + d^3)$ time to train.\footnote{In this course, we assume matrix operations have the ``textbook'' cost where the operations are implemented in a straightforward way with ``for'' loops. For example, we'll assume that multiplying two $n \times n$ matrices or computing a matrix inverse simply costs $O(n^3)$, rather than the $O(n^\omega)$ where $\omega$ is closer to $2$ as discussed in CS algorithm courses.}

\blu{
\enum{
\item What is the storage space required for a naive Bayes classifier with binary features and $k$ class labels?
\item What is the training time required for $k$-means clustering? You can use $t$ as the number of iterations it takes to converge.
\item  What is the training time for linear regression with Gaussian RBF features? You can use $\sigma^2$ as the variance of the Gaussian RBFs.
\item What is the storage space required for an L2-regularized linear regression model with a polynomial basis, where we have used the kernel trick? You can use $\lambda$ as the regularization parameter and $p$ as the degree of the polynomial.
\item What is the cost of trying to minimize the logistic regression loss by running $t$ iterations of stochastic gradient descent?
\item What is the cost of forward propagation (computing one value $\hat{y}_i$) in a neural network (for regression) with 3 fully-connected hidden layers? Use $k_1$ as the number of hidden units in layer 1, $k_2$ as the number of hidden units in layer 2, and $k_3$ as the number of hidden units in layer 3.
}
}



\clearpage
\subsection{Gradients and Hessians in Matrix Notation}

Express the gradient $\nabla f(w)$ and Hessian $\nabla^2 f(w)$ of the following functions in matrix notation, simplifying as much as possible:
\blu{
\enum{
\item The quadratic function
\[
f(w) = w^Tu + u^TAw + \frac{\lambda}{2}w^Tw + w^TAw,
\]
wher $u$ is $d \times 1$ and $A$ is $d \times d$ (not necessarily symmetric).
\item L2-regularized weighted least squares with non-Euclidean quadratic regularizaiton,
\[
f(w) = \frac{1}{2}\sum_{i=1}^n v_i(w^\top x^i - y^i)^2 + \frac{1}{2}\sum_{i=1}^d\sum_{j=1}^d w_iw_j\lambda_{ij}
\]
where you can use $V$ as a matrix with the $v_i$ along the diagonal and $\Lambda$ as a positive-definite $d \times d$ (symmetric) matrix with $\lambda_{ij}$ in position $(i,j)$.
\item Weighted L2-regularized probit regression,
\[
f(w) = - \sum_{i=1}^n \log p(y^i | x^i w) + \frac{1}{2}\sum_{j=1}^d u_jw_j^2.
\]
where $u$ is $d \times 1$, $y^i \in \{-1,+1\}$, and the likelihood of a single example $i$ is given by
\[
p(y^i| x^i, w) = \Phi(y^iw^Tx^i).
\]
where $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution.
}}

Hint: You can use the results from the linear and quadratic gradients and Hessians notes to simplify the derivations. You can use $0$ to represent the zero vector or a matrix of zeroes and $I$ to denote the identity matrix. It will help to convert the second question to matrix notation first. For the last question, it is useful to define a vector $c$ containing the CDF $\Phi(y^iw^Tx^i)$ as element $c_i$ and a vector $p$ containing the corresponding PDF as element $p_i$. For the probit question you'll need to define new vectors to express the gradient and Hessian in matrix notation (and remember the relationship between the PDF and CDF). As a sanity check, make sure that your results have the right dimension.


\clearpage
\subsection{Norm Inequalities}

Show that the following inequalities hold for vectors $w \in \R^d$, $u \in \R^d$, and $X \in \R^{n\times d}$:
\blu{\enum{
\item $\norm{w}_\infty \leq \norm{w}_2 \leq \norm{w}_1$ (relationship between decreasing $p$-norms)
\item $\norm{w}_1 \leq \sqrt{d}\norm{w}_2 \leq d\norm{w}_\infty$ (relationship between increasing $p$-norms)
\item $\sqrt{m}\norm{w}_2 \leq \norm{w}_H \leq \sqrt{M}\norm{w}_2$ (relationship between quadratic norm and Euclidean norm).
}
}
You should use the definitions of the norms, but should not use the 
known equivalences between these norms (since these are the things you are trying to prove).
Hint: for many of these it's easier if you work with squared values (and you may need to ``complete the square"). Beyond non-negativity of norms, it may also help to use the Cauchy-Schwartz inequality, and/or to use that $\norm{x}_1 = x^\top $sign$(x)$. We've used $M$ as the largest eigenvalue of the (positive-definite) matrix $H$ and $m$ as the smallest eigenvalue, so $mI \preceq H \preceq MI$.





 
\end{document}