\documentclass{article}
\usepackage{etoolbox}
\usepackage{fullpage}
\usepackage{color}
\usepackage{amsmath}
\usepackage{url}
\usepackage{verbatim}
\usepackage{graphicx}
\usepackage{parskip}
\usepackage{amssymb}
\usepackage{listings} % For displaying code
\begin{document}
\definecolor{blu}{rgb}{0,0,1}
\def\blu#1{{\color{blu}#1}}
\definecolor{gre}{rgb}{0,.5,0}
\def\gre#1{{\color{gre}#1}}
\definecolor{red}{rgb}{1,0,0}
\def\red#1{{\color{red}#1}}
\def\norm#1{\|#1\|}
\newcommand{\argmin}[1]{\mathop{\hbox{argmin}}_{#1}}
\newcommand{\argmax}[1]{\mathop{\hbox{argmax}}_{#1}}
\def\R{\mathbb{R}}
\newcommand{\fig}[2]{\includegraphics[width=#1\textwidth]{#2}}
\newcommand{\centerfig}[2]{\begin{center}\includegraphics[width=#1\textwidth]{#2}\end{center}}
\def\items#1{\begin{itemize}#1\end{itemize}}
\def\enum#1{\begin{enumerate}#1\end{enumerate}}
\def\answer#1{\iftoggle{answers}{\blu{Answer}:\\#1}}
\def\argmax{\mathop{\rm arg\,max}}
\def\argmin{\mathop{\rm arg\,min}}
\def\half{\frac 1 2}
\newcommand{\code}[1]{\lstinputlisting[language=Matlab]{#1}}
\title{CPSC 540 Assignment 1 (due January 16)}
\author{Fundamentals, Convex Functions, Numerical Optimization}
\date{}
\maketitle
\begin{itemize}
\item You can work in groups of 1-3 people on the assignments, but please only hand in \emph{one} assignment.
\item Place the names and student numbers of all group members on the first page, and \blu{submit all answers as a single PDF file to handin}.
\item Organize your PDF file sequentially according to the section numbers in this document. Place answers (including code/figures) in the appropriate section.
\item You will lose marks if answers are unclear or if the TA can't easily find the answers.
\item At the start of each question, acknowledge sources of help from outside of class/textbook (classmates outside the group, online material, papers, etc.).
\item All Sections are equally weighted, except for Section 0.
\item The code and data referred to in the assignment is available in \emph{a1.zip}.
\item Any modifications/updates/clarifications after the assignment is first put online will be marked in \red{red}.
\item Some of the questions require a bit of work, but no individual question is intended to take a lot of time. If you are really stuck, try coming up with a good question and posting on Piazza (other people are likely stuck on the same issue).
\end{itemize}
\setcounter{section}{-1}
\section{Course Prerequisite Form}
\blu{Graduate students in CPSC or EECE must submit the prerequisite form}:\\
\url{https://www.cs.ubc.ca/~schmidtm/Courses/540_prereqs.pdf}\\
Hand this in at the start of one of the first 3 lectures, or at the start of the first tutorial on January 12th. Students who don't submit the form may be dropped from the course.
\section{Fundamentals}
The purpose of this question is to give you practice using the mathematical and coding notation that we will adopt in the course.
\subsection{Matrix Notation}
For this question we'll use the following Householder-like notation:
\enum{
\item $\alpha$ is a scalar.
\item $w$, $a$, and $b$ are $d$ by $1$ column-vectors.
\item $y$ and $v$ are $n$ by $1$ column-vectors (with elements $y^i$ and $v_i$).
\item $A$ is a $d$ by $d$ matrix, not necessarily symmetric (with elements $a_{ij}$).
\item $V$ is a diagonal matrix with $v$ along the diagonal.
\item $B$ is a diagonal matrix with $b$ along the diagonal.
\item $X$ is a $n$ by $d$ matrix (with rows $(x^i)^T$).
%\item $a$ and $b$ are length-$d$ column-vectors.
%\item Element $i$ of $b$ is denoted by $b_i$.
%\item $A$ and $B$ are $d$ by $d$ matrices.
%\item Row $i$ row of $A$ is denoted by $a_i^T$.
%\item $W$ is a \emph{symmetric} $d$ by $d$ matrix.
}
\blu{Express the gradient $\nabla f(w)$ and Hessian $\nabla^2 f(w)$ of the following functions in matrix notation, simplifying as much as possible}.
\enum{
\item The linear function
\[
f(w) = w^Ta + \alpha + \sum_{j=1}^d w_ja_j.
\]
\item The linear function
\[
f(w) = a^Tw + a^TAw + w^TA^Tb.
\]
\item The quadratic function
\[
f(w) = w^Tw + w^TX^TXw + \sum_{i=1}^d\sum_{j=1}^d w_iw_ja_{ij}.
\]
\item L2-regularized weighted least squares,
\[
f(w) = \frac{1}{2}\sum_{i=1}^n v_i(w^Tx^i - y^i)^2 + \frac{\lambda}{2}\norm{w}^2.
\]
\item Weighted L2-regularized probit regression,
\[
f(w) = - \sum_{i=1}^n \log p(y^i | x^i w) + \frac{1}{2}\sum_{j=1}^d b_jw_j^2.
\]
where $y^i \in \{-1,+1\}$ and the likelihood of a single example $i$ is given by
\[
p(y^i| x^i, w) = \Phi(y^iw^Tx^i).
\]
where $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution.
}
Hint: You can use the results we showed in class to simplify the derivations. You can use $0$ to represent the zero vector or a matrix of zeroes and $I$ to denote the identity matrix. It will help to convert the fourth example to matrix notation first. For the fifth example, it is useful to define a vector $c$ containing the CDF $\Phi(y^iw^Tx^i)$ as element $c_i$ and a vector $p$ containing the corresponding PDF as element $p_i$. For the fifth one you'll need to define new vectors to express the gradient and Hessian in matrix notation (and remember the relationship between the PDF and CDF). As a sanity check, make sure that your results have the right dimension.
\subsection{Regularization and Cross-Validation}
Download \emph{a1.zip} from the course webpage, and start Matlab in a directory containing the extracted files. If you run the script \emph{example\_nonLinear}, it will:
\enum{
\item Load a one-dimensional regression dataset.
\item Fit a least-squares linear regression model.
\item Report the test error.
\item Draw a figure showing the training/testing data and what the model looks like.
}
Unfortunately, this is not a great model of the data, and the figure shows that a linear model is probably not suitable.
\enum{
\item Write a function called \emph{leastSquaresRBFL2} that implements \emph{least squares using Gaussian radial basis functions (RBFs) and L2-regularization}. \\You should start from the \emph{leastSquares} function and use the same conventions: $n$ refers to the number of training examples, $d$ refers to the number of features, $X$ refers to the data matrix, $y$ refers to the targets, $Z$ refers to the data matrix after the change of basis, and so on. Note that you'll have to add two additional input arguments ($\lambda$ for the regularization parameter and $\sigma$ for the Gaussian RBF variance) compared to the \emph{leastSquares} function. To make your code easier to understand/debug, you may want to define a new function \emph{rbfBasis} which computes the Gaussian RBFs for a given training set, testing set, and $\sigma$ value. \blu{Hand in your function and the plot generated with $\lambda = 1$ and $\sigma = 1$.}
\item When dealing with larger datasets, an important issue is the dependence of the computational cost on the number of training examples $n$ and the number of features $d$. \blu{What is the cost in big-O notation of training the model on $n$ training examples with $d$ features under (a) the linear basis, and (b) Gaussian RBFs (for a fixed $\sigma$)? What is the cost of classifying $t$ new examples under these two bases? } Assume that multiplication by an $n$ by $d$ matrix costs $O(nd)$ and that inverting a $d$ by $d$ linear system costs $O(d^3)$.
\item Modify the training/validation procedure to use 10-fold cross-validation on the training set to select $\lambda$ and $\sigma$. \blu{Hand in your cross-validation procedure and the plot you obtain with the best values of $\lambda$ and $\sigma$}
}
Note: If you find that calculating the Euclidean distances between all pairs of points takes too long, the following code will form a matrix containing the squared Euclidean distances between all training and test points:
\begin{verbatim}
[n,d] = size(X);
[t,d] = size(Xtest);
D = X.^2*ones(d,t) + ones(n,d)*(Xtest').^2 - 2*X*Xtest';
\end{verbatim}
Element $D(i,j)$ gives the squared Euclidean distance between training point $i$ and testing point $j$.
\subsection{MAP Estimation}
In class, we showed that under the assumptions
\[
y^i \sim \mathcal{N}(w^Tx^i,1), \quad w_j \sim \mathcal{N}\left(0,\frac{1}{\lambda}\right),
\]
the MAP estimate is equivalent to solving the L2-regularized least squares problem
\[
f(w) = \frac{1}{2}\norm{Xw - y}^2 + \frac \lambda 2 \norm{w}^2,
\]
in the ``loss plus regularizer'' framework.
\blu{For each of the alternate assumptions below, write it in the ``loss plus regularizer'' framework} (simplifying as much as possible, including converting to matrix notation):
\enum{
\item Laplace distribution likelihoods and priors,
\[
y^i \sim \mathcal{L}(w^Tx^i,1), \quad w_j \sim \mathcal{L}\left(0,\frac{1}{\lambda}\right).
\]
\item Gaussians with separate variance for each training example and variable,
\[
y^i \sim \mathcal{N}(w^Tx^i,\sigma_i^2), \quad w_j \sim \mathcal{N}\left(0,\frac{1}{\lambda_j}\right).
\]
\item Poisson-distributed likelihood (for the case where $y^i$ represents discrete counts) and Gaussian prior,
\[
y^i \sim \mathcal{P}(\exp(w^Tx^i)), \quad w_j \sim \mathcal{N}\left(0,\frac{1}{\lambda}\right),
\]
}
\section{Convex Functions}
\subsection{Minimizing Strictly-Convex Quadratic Functions}
Solve for the minimizer of the below strictly-convex quadratic functions:
\blu{\enum{
\item $f(w) = \frac{1}{2}\norm{w-v}^2$ (projection of $v$ onto real space).
\item $f(w)= \frac{1}{2}\norm{Xw - y}^2 + \frac{1}{2}w^T\Lambda w$ (least squares with quadratic-norm regularization).
\item $f(w) = \frac{1}{2}\sum_{i=1}^n v_i (w^Tx_i - y_i)^2 + \frac{\lambda}{2}\norm{w-w^0}^2$ (weighted least squares shrunk towards non-zero $w^0$).
}}
Above we assume that $v$ and $w^0$ are $d$ by $1$ vectors, and $\Lambda$ is a $d$ by $d$ positive-definite matrix. You can use $V$ as a diagonal matrix containing the $v_i$ values along the diagonal.
\subsection{Proving Convexity}
\blu{Show that the following functions are convex using one of the definitions of convexity (without using the ``operations that preserve convexity" or using convexity results stated in class)}:
\begin{center}
\begin{tabular}{lll}
1. Negative logarithm & $f(w) = -\log(aw) $ & $w > 0$\\
2. Quadratic with positive semi-definite $A$ & $f(w) = \half w^TAw + b^Tw + \gamma$ & $w \in \R^d, A \succeq 0$ \\
3. Any norm & $f(w) = \norm{w}_p$ & $w \in \R^d$\\
4. Logistic regression & $f(w) = \sum_{i=1}^n \log(1+\exp(-y_iw^Tx_i))$ & $w \in \R^d$
\end{tabular}
\end{center}
Hint: Norms are not differentiable in general, so you cannot use the Hessian for the third one. For the last one, you can use the Hessian structure we derived in class.
\blu{Use the results above and from class, along with the operations that preserve convexity, to show that the following functions are convex} (with $\lambda \geq 0$):
\begin{center}
\begin{tabular}{lll}
%1. $f(w) = \frac{1}{2}\norm{Xw - y} + \frac{\lambda}{2}\norm{w}_1$ & (LASSO)\\
%4. $f(w) = \sum_{i=1}^N\log(1+\exp(-b_i\red{x}^Ta_i)) + \lambda\norm{x}_1$ & ($\ell_1$-regularized logistic regression)\\
5. $f(w) = \norm{Xw - y}_p + \lambda\norm{Aw}_q$ & (regularized regression with arbitrary $p$-norm and weighted $q$-norm)\\
6. $f(w) = \sum_{i=1}^N\max\{0, |w^Tx_i - y_i| - \epsilon\} + \frac{\lambda}{2}\norm{w}_2^2$ & (support vector regression)\\
7. $f(x) = \max_{ijk}\{|x_i| + |x_j| + |x_k|\}$ & ($3$ largest-magnitude elements)
\end{tabular}
\end{center}
%\textbf{Bonus}: For $x \in \R^d$ and any $k \in \{1,2,\dots,d\}$, show that the sum of the $k$ largest values of $|x_i|$ is convex.
%$f(X) = $ tr$(X^{-1})$ where $X \in \mathbb{S}_{++}^n$ (the set of positive-definite matrices).
\subsection{Robust Regression}
The script \emph{example\_outliers} loads a one-dimensional regression dataset that has a non-trivial number of `outlier' data points. These points do not fit the general trend of the rest of the data, and pull the least squares model away from the main cluster of points. One way to improve the performance in this setting is simply to remove or downweight the outliers. However, in high-dimensions it may be difficult to determine whether points are indeed outliers (or the errors might simply be heavy-tailed). In such cases, it is preferable to replace the squared error with an error that is more robust to outliers.
\enum{
\item
Write a new function, \emph{robustRegression(X,y)}, that adds a bias variable and fits a linear regression model by minimizing the absolute error instead of the square error,
\[
f(w) = \norm{Xw- y}_1.
\]
You should turn this into a \emph{linear program} as shown in class, and you can solve this linear program using Matlab's \emph{linprog} function. \blu{Hand in the new function and report the minimum absolute training error that is possible on this dataset}.
\item There have been several attempts to adapt SVMs to the regression problem. The most common method for support vector regression uses what iscalled the $\epsilon$-insensitive loss,
\[
f(w) = \sum_{i=1}^n \max\{0,|w^Tx^i - y^i| - \epsilon\}.
\]
Here, $\epsilon$ is a parameter and notice that the model only penalizes errors larger than $\epsilon$. \blu{Show how to write minimizing this objective function as a linear program.}
\item Write a new function, \emph{svRegression(X,y,epsilon)}, that minimizes the $\epsilon$-insensitive objective. \blu{Hand in the new function and report the absolute training error achieved with $\epsilon = 1$.}.
}
\section{Numerical Optimization}
\subsection{Gradient Descent and Newton's Method}
The function \emph{example\_gradient} loads a simple binary classification dataset, and fits an $\ell_2$-regularized logistic regression model to it using the \emph{findMin} function which implements a simple gradient descent algorithm. The \emph{findMin} function is generic in the sense that it only needs an anonymous function which computes the objective value and gradient given a parameter vector. On each iteration \emph{findMin} uses a backtracking line-search to find a step-size $\alpha$ that satisfies the Armijo ``sufficient decrease" condition. It always tries $\alpha = 1$ first, and whenever the condition is not satisfied it uses ``cubic Hermite polynomial" interpolation to find a smaller value of $\alpha$ to try. It continues running the algorithm until a pre-specified number of iterations are reached or until the norm of the gradient is sufficiently small. On this dataset, the method only requires 9 iterations before it satisfies its optimality condition, although it backtracks 13 times and evaluates the function and gradient a total of 23 times.
\blu{Report the effect on performance (in terms of number of backtracking iterations and total number of iterations) of making the following changes to \emph{findMin}}:
\enum{
\item When backtracking, replacing the cubic-Hermite interpolation with the simpler strategy of dividing $\alpha$ in half (as suggested in the Boyd \& Vandenberghe book).
\item Instead of resetting $\alpha$ to one after the line-search, set it using the Barzilai-Borwein step-size, given by
\[
\alpha \leftarrow -\alpha\frac{v^T\nabla f(w)}{v^Tv},
\]
where $v$ is the new gradient value minus the old gradient value.
\item Fix the step-size $\alpha$ to $1/L$, where $L$ is given by
\[
L = \frac{1}{4}\max\{\text{eig}(X^TX)\} + \lambda,
\]
which is the Lipschitz constant of the gradient.
\item Instead of using the gradient direction, set $d$ to the Newton direction which is given by
\[
d = [\nabla^2 f(w)]^{-1}\nabla f(w).
\]
}
For the Newton direction, you'll need to make a new objective function that returns the Hessian in addition to the function and gradient, and modify \emph{findMin} to use the Hessian.
\subsection{Hessian-Free Newton}
The dataset used in the previous question leads to an easy optimization problem. If you apply the variations in the previous question to the larger dataset contained in \emph{rcv1\_train\_binary.mat} then the performance is very different:
\setcounter{enumi}{-1}
\enum{
\item Without modifications, \emph{findMin} doesn't work since the objective overflows.\footnote{There are two standard solutions to this problem: we could detect the overflow and backtrack out of regions where it overflows, or we could use the \emph{log-sum-exp} trick to evaluate the objective without overflowing.}
\item Step-size halving avoids the overflow, but the method doesn't reaches a reasonable accuracy even after 500 iterations.
\item With the Barzilai-Borwein step-size, it reaches a solution in a reasonable number of iterations.
\item Computing $L$ is slower than solving the original problem because $d$ is so large.\footnote{An often-faster way to compute the largest eigenvalue is the ``power method''.}
\item Computing the Hessian is slower than solving the original problem because $d$ is so large.
}
In Hessian-free Newton methods, we compute an approximate Newton direction $d$ without ever forming the Hessian. The standard way to do this is to use \emph{conjugate gradient} to solve for $d$, which only requires Hessian-vector products of the form $\nabla^2 f(w)v$ for a vector $v$. Note that Hessian-vector products can always be computed at a similar cost to computing the gradient. For example, for logistic regression we can use
\[
\nabla^2 f(w)v = X^TDXv = X^T(D(Xv)),
\]
and the order of operations leads to a cost of $O(nd)$. This is cheaper than the $O(nd^2)$ cost of forming the Hessian. \blu{Use Matlab's \emph{pcg} function to implement a ``Hessian-free" Newton's method, where you use conjugate gradient to solve the Newton system. Report the output of \emph{findMin} on \emph{rcv1\_train\_binary.mat} when using this strategy and using \emph{optTol} as the tolerance for \emph{pcg}.}
Hint: In \emph{logisticL2}, define a function that computes the Hessian-vector product and has the following header:
\begin{verbatim}
function [Hv] = Hvfunc(w,v,X,y,lambda)
\end{verbatim}
To define the \emph{AFUN} argument needed by \emph{pcg} (which must be a function of $v$ for a fixed $w$) you can use an anonymous function:
\begin{verbatim}
Hv = @(v)Hvfunc(w,v,varargin{:});
\end{verbatim}
\subsection{Multi-Class Logistic Regression}
The function \emph{example\_multiClass} loads a multi-class classification dataset and fits a ``one-vs-all'' logistic regression classifier, then reports the validation error and shows a plot of the data/classifier. The performance on the validation set is ok, but could be much better. For example, this classifier never even predicts some of the classes.
Using a one-vs-all classifier hurts performance because the classifiers are fit independently, so there is no attempt to calibrate the columns of the matrix $W$. An alternative to this independent model is to use the softmax probability,
\[
p(y^i | W, x^i) = \frac{\exp(w_{y^i}^Tx^i)}{\sum_{c=1}^k\exp(w_c^Tx^i)}.
\]
Here $c$ is a possible label and $w_{c}$ is column $c$ of $W$. Similarly, $y^i$ is the training label, $w_{y^i}$ is column $y^i$ of $W$. The loss function corresponding to the negative logarithm of the softmax probability is given by
\[
f(W) = \sum_{i=1}^n \left[-w_{y^i}^Tx^i + \log\left(\sum_{c' = 1}^k \exp(w_{c'}^Tx^i)\right)\right].
\]
Make a new function, \emph{softmaxClassifier}, which fits $W$ using the softmax loss from the previous section instead of fitting $k$ independent classifiers. \blu{Hand in the code and report the validation error}.
Hint: you can use the \emph{derivativeCheck} function to help you debug the gradient of the softmax loss. It can also help you numerically check your answer to several more of this assignment's questions.
\end{document}