\documentclass{article}
\usepackage{fullpage}
\usepackage{color}
\usepackage{amsmath}
\usepackage{url}
\usepackage{verbatim}
\usepackage{graphicx}
\usepackage{parskip}
\usepackage{amssymb}
\usepackage{nicefrac}
\usepackage{listings} % For displaying code
\usepackage{algorithm2e} % pseudo-code
% Answers
\def\ans#1{\par\gre{Answer: #1}}
% Colors
\definecolor{blu}{rgb}{0,0,1}
\def\blu#1{{\color{blu}#1}}
\definecolor{gre}{rgb}{0,.5,0}
\def\gre#1{{\color{gre}#1}}
\definecolor{red}{rgb}{1,0,0}
\def\red#1{{\color{red}#1}}
\def\norm#1{\|#1\|}
% Math
\def\R{\mathbb{R}}
\def\argmax{\mathop{\rm arg\,max}}
\def\argmin{\mathop{\rm arg\,min}}
\newcommand{\mat}[1]{\begin{bmatrix}#1\end{bmatrix}}
\newcommand{\alignStar}[1]{\begin{align*}#1\end{align*}}
\def\half{\frac 1 2}
% LaTeX
\newcommand{\fig}[2]{\includegraphics[width=#1\textwidth]{a2f/#2}}
\newcommand{\centerfig}[2]{\begin{center}\includegraphics[width=#1\textwidth]{a2f/#2}\end{center}}
\newcommand{\matCode}[1]{\lstinputlisting[language=Matlab]{a2f/#1.m}}
\def\items#1{\begin{itemize}#1\end{itemize}}
\def\enum#1{\begin{enumerate}#1\end{enumerate}}
\begin{document}
\title{CPSC 340 Assignment 2 (due Friday October 13 ATE)}
\author{}
\date{}
\maketitle
\vspace{-4em}
\section{Random Forests}
\subsection{Implementation}
Thefile \emph{vowels.jld} contains a supervised learning dataset where we are trying to predict which of the 11 ``steady-state'' English vowels that a speaker is trying to pronounce.
You are provided with a \texttt{decisionTree} as well as a \texttt{randomTree} function in \emph{decisionTree.jl} (both based on information gain). The random tree model differs from the decision tree model in two ways:
it takes a bootstrap sample of the data before fitting and when fitting individual stumps it only considers $\lfloor \sqrt{d} \rfloor$ randomly-chosen features\footnote{The notation $\lfloor x\rfloor$ means the ``floor'' of $x$, or ``$x$ rounded down''.}
In other words, \texttt{RandomTree} is the model we discussed in class that is combined to make up a random forest.
If you run \emph{example\_randomTree.jl}, it will fit both models to the dataset, and you will notice that it overfits badly.
\blu{
\enum{
\item If you set the \emph{depth} parameter to \emph{Inf}, why do the training functions terminate?
\item Why doesn't the random tree model have a training error of 0?
\item Create a function \texttt{randomForest} that takes in hyperparameters \texttt{depth} and \texttt{nTrees} (number of trees), and
fits \texttt{nTrees} random trees each with maximum depth \texttt{depth}. For prediction, have all trees predict and then take the mode. Hand in your function. Hint: you can define an array for holding 10 \emph{GenericModel} types using:\\
\texttt{subModels = Array\{GenericModel\}(10)}.
\item Using 50 trees, and a depth of $\infty$, report the training and testing error. Compare this to what we got with a single \texttt{DecisionTree} and with a single \texttt{RandomTree}. Are the results what you expected? Discuss.
}
}
\subsection{Very-Short Answer Questions}
\blu{\enum{
\item What is a a disadvantage of using a very-large number of trees in a random forest classifier?
\item Your random forest classifier has a training error of 0 and a very high test error. Which ones of the following could help performance?
\enum{
%\item Increase the number of trees in the forest to improve test accuracy.
%\item Decrease the number of trees, since they are giving redundant labels.
\item Increase the maximum depth of the trees in your forest.
\item Decrease the maximum depth of the trees in your forest.
\item Increase the amout of data you consider for each tree (Collect more data and use 2n objects instead of n).
\item Decrease the amount of data you consider for each tree (Use 0.8n objects instead of n).
\item Increase the number of features you consider for each tree.
\item Decrease the number of features you consider for each tree.
}
\item Suppose that you were training on raw audio segments and trying to recognize vowel sounds. What could you do to encourage the final classifier to be invariant to translation?
}
}
\section{K-Means Clustering}
If you run the function \emph{example\_Kmeans}, it will load a dataset with two features and a very obvious clustering structure. It will then apply the $k$-means algorithm with a random initialization. The result of applying the algorithm will thus depend on the randomization, but a typical run might look like this:
%\centerfig{.5}{kmeans.png}
(Note that the colours are arbitrary due to the label switching problem.)
But the `correct' clustering (that was used to make the data) is something more like this:
%\centerfig{.5}{kmeans2.png}
\subsection{Selecting among k-means Initializations}
If you run the demo several times, it will find different clusterings. To select among clusterings for a \emph{fixed} value of $k$, one strategy is to minimize the sum of squared distances between examples $x_i$ and their means $w_{y_i}$,
\[
f(w_1,w_2,\dots,w_k,y_1,y_2,\dots,y_n) = \sum_{i=1}^n \norm{x_i - w_{y_i}}_2^2 = \sum_{i=1}^n \sum_{j=1}^d (x_{ij} - w_{y_ij})^2.
\]
where $y_i$ is the index of the closest mean to $x_i$. This is a natural criterion because the steps of k-means alternately optimize this objective function in terms of the $w_c$ and the $y_i$ values.
\blu{\enum{
\item Write a new function called \emph{kMeansError} that a dataset $X$, a set of cluster assignments $y$, and a set of cluster means $W$, and computes this objective function. Hand in your code.
\item Instead of printing the number of labels that change on each iteration, what trend do you observe if you print the value of \emph{kMeansError} after each iteration of the k-means algorithm?
\item Using the \emph{clustering2Dplot} file, output the clustering obtained by running k-means 50 times (with $k=4$) and taking the one with the lowest error. Note that the k-means training function will run much faster if you set \texttt{doPlot = false} or just remove this argument.
}}
\subsection{Selecting $k$ in k-means}
We now turn to the task of choosing the number of clusters $k$.
\blu{\enum{
\item Explain why the \emph{kMeansError} function should not be used to choose $k$.
\item Explain why even evaluating the \emph{kMeansError} function on test data still wouldn't be a suitable approach to choosing $k$.
\item Hand in a plot of the minimum error found across 50 random initializations, as you vary $k$ from $1$ to $10$.
\item The \emph{elbow method} for choosing $k$ consists of looking at the above plot and visually trying to choose the $k$ that makes the sharpest ``elbow" (the biggest change in slope). What values of $k$ might be reasonable according to this method? Note: there is not a single correct answer here; it is somewhat open to interpretation and there is a range of reasonable answers.
}}
\subsection{$k$-Medians}
The data in \emph{clusterData2.mat} is the exact same as the above data, except it has 4 outliers that are very far away from the data.
\blu{\enum{
\item Using the \emph{clustering2Dplot} function, output the clustering obtained by running k-means 50 times (with $k=4$) on \emph{clusterData2.mat} and taking the one with the lowest error. Are you satisfied with the result?
\item What values of $k$ might be chosen by the elbow method for this dataset?
\item Implement the $k$-\emph{medians} algorithm, which assigns examples to the nearest $w_c$ in the L1-norm and to updates the $w_c$ by setting them to the ``median" of the points assigned to the cluster (we define the $d$-dimensional median as the concatenation of the median of the points along each dimension). Hand in your code and plot obtained with 50 random initializations for $k = 4$.
\item Using the L1-norm version of the error (where $y_i$ now represents the closest median in the L1-norm),
\[
f(w_1,w_2,\dots,w_k,y_1,y_2,\dots,y_n) = \sum_{i=1}^n \norm{x_i - w_{y_i}}_1 = \sum_{i=1}^n \sum_{j=1}^d |x_{ij} - w_{y_ij}|,
\]
what value of $k$ would be chosen by the elbow method under this strategy? Are you satisfied with this result?
}
}
\subsection{Very-Short Answer Questions}
\blu{
\enum{
\item Does the standard k-means clustering algorithm always yield the optimal clustering solution for a given k?
\item If your set out to minimize the distance between each point and its mean in a $k$-means clustering, what value of $k$ minimizes this cost? Is this value useful?
\item Describe a dataset with $k$ clusters where k-means would not be able to find the true clusters.
}}
\section{More Unsupervised Learning}
\subsection{Density-Based Clustering}
If you run the function \emph{example\_dbCluster}, it will apply the basic density-based clustering algorithm to the dataset from the previous part. The final output should look like this:\\
%\fig{.49}{density}\fig{.49}{density2}\\
(The right plot is zoomed in to show the non-outlier part of the data.)
Even though we know that each object was generated from one of four clusters (and we have 4 outliers), the algorithm finds 6 clusters and does not assign some of the original non-outlier objects to any cluster. However, the clusters will change if we change the parameters of the algorithm. Find and report values for the two parameters (\emph{radius} and \emph{minPts}) such that the density-based clustering method finds:
\blu{\enum{
\item The 4 ``true" clusters.
\item 3 clusters (merging the top two, which also seems like a reasonable interpretaition).
\item 2 clusters.
\item 1 cluster (consisting of the non-outlier points).
}
}
\subsection{Vector Quantization}
Discovering object groups is one motivation for clustering. Another motivation is \emph{vector quantization}, where we find a prototype point for each cluster and replace points in the cluster by their prototype. If our inputs are images, we could use vector quantization on the set of RGB pixel values as a simple image compression algorithm.
Your task is to implement this simple image compression algorithm by writing a \texttt{quantizeImage} and a \texttt{deQuantizeImage} function. The \texttt{quantizeImage} function should take the name of an image file (like ``dog.png'' for the provided image) and a number $b$ as input. It should use the pixels in the image as examples and the 3 colour channels as features, and run $k$-means clustering on this data with $2^b$ clusters. The code should store the cluster means and return four arguments: the cluster assignments $y$, the means $W$, the number of rows in the image $nRows$, and the number of columns $nCols$. The \texttt{deQuantizeImage} function should take these four arguments and return a version of the image (the same size as the original) where each pixel's original colour is replaced with the nearest prototype colour.
To understand why this is compression, consider the original image space. Say the image can take on the values $0,1,\ldots,254,255$ in each colour channel. Since $2^8=256$ this means we need 8 bits to represent each colour channel, for a total of 24 bits per pixel. Using our method, we are restricting each pixel to only take on one of $2^b$ colour values. In other words, we are compressing each pixel from a 24-bit colour representation to a $b$-bit colour representation by picking the $2^b$ prototype colours that are ``most representative'' given the content of the image. So, for example, if $b=6$ then we have 4x compression.
Note: if you install the ``Images'' package then you can read in the image using the ``imread'' function (it takes a file name and returns a $nRows$ by $nCols$ by $3$ array containing the images RGB values). Similarly, the ``imshow'' function can display an image represented in this format. You may find it help to use the ``reshape'' function.
\blu{\enum{
\item Hand in your \emph{quantizeImage} and \emph{deQuantizeImage} functions.
\item Show the image obtained if you encode the colours using $1$, $2$, $4$, and $6$ bits per pixel (instead of the original 24-bits).
}}
\subsection{Very-Short Answer Questions}
\blu{
\enum{
\item Suppose that you had only two features and that they have very-different scales (like kilograms vs. milligrams). How would this affect the result of density-based clustering?
\item Name a key advantage and drawback of using a supervised outlier detection method rather than an unsupervised method?
\item Given an $n \times 2$ matrix $X$ and a test query $\hat{x}$, what is the cost of finding all rows $i$ in $X$ where $\norm{x_i - \hat{x}} \leq r$ for some $r > 0$? How does this cost change if I give you a hash table that assigns rows of $X$ to keys that divide the space into a 2D grid of squares with radius $r$, if we use $k$ to denote the maximum number of points hashed to the same key value?
}}
\section{Matrix Notation and Linear Regression}
\subsection{Converting to Matrix/Vector/Norm Notation}
Using our standard supervised learning notation ($X$, $y$, $w$)
express the following functions in terms of vectors, matrices, and norms (there should be no summations or maximums).
\blu{\enum{
\item $\sum_{i=1}^n |w^Tx_i - y_i|$.
\item $\max_{i \in \{1,2,\dots,n\}} |w^Tx_i - y_i| + \frac{\lambda}{2}\sum_{j=1}^n w_j^2$.
\item $\sum_{i=1}^n v_i (w^Tx_i - y_i)^2 + \lambda \sum_{j=1}^{d} |w_j|$.
}}
You can use $V$ to denote a diagonal matrix that has the values $v_i$ along the diagonal.
\subsection{Minimizing Quadratic Functions as Linear Systems}
Write finding a minimizer $w$ of the functions below as a system of linear equations (using vector/matrix notation and simplifying as much as possible). Note that all the functions below are convex so finding a $w$ with $\nabla f(w) = 0$ is sufficient to minimiize the functions (but show your work in getting to this point).
\blu{\enum{
\item $f(w) = \frac{1}{2}\norm{w-u}^2$.
\item $f(w) = \frac{1}{2}\norm{w}^2 + w^TX^Ty$ .
\item $f(w)= \frac{1}{2}\norm{Xw - y}^2 + \frac{1}{2}w^T\Lambda w$.
\item $f(w) = \frac{1}{2}\sum_{i=1}^n v_i (w^Tx_i - y_i)^2$.
}}
Above we assume that $u$ is a $d$ by $1$ vector, and $\Lambda$ is a $d$ by $d$ diagonal matrix with positive entries along the diagonal.
Hint: Once you convert to vector/matrix notation, you can use the results from class to quickly compute these quantities term-wise. As a sanity check for your derivation, make sure that your results have the right dimensions.
%In class we discuss fitting a linear regression model by minimizing the squared error.
%This classic model is the simplest version of many of the more complicated models we will discuss in the course. However, it typically performs very poorly in practice. One of the reasons it performs poorly is that it assumes that the target $y_i$ is a linear function of the features $x_i$ with an intercept of zero. This drawback can be addressed by adding a bias variable and using nonlinear bases (although nonlinear bases may increase to over-fitting).
%In this question, you will start with a data set where least squares performs poorly. You will then explore how adding a bias variable and using nonlinear (polynomial) transforms can drastically improve the performance. You will also explore how the complexity of a basis affects both the training error and the test error. In the final part of the question, it will be up to you to design a basis with better performance than polynomial bases. If you are not familiar with Matlab, to get you started please see the notes on Matlab commands on the course webpage.
\subsection{Linear Regresion with Bias Variable}
If you run the script \emph{example\_nonLinear}, it will:
\enum{
\item Load a one-dimensional regression dataset.
\item Fit a least-squares linear regression model.
\item Report the training error.
\item Report the test error (on a dataset not used for training).
\item Draw a figure showing the training data and what the linear model looks like.
}
Unfortunately, this is an awful model of the data. The average squared training error on the data set is over 28000 (as is the test error), and the figure produced by the demo confirms that the predictions are usually nowhere near the training data:
%\centerfig{.5}{leastSquares}
The y-intercept of this data is clearly not zero (it looks like it's closer to $200$), so we should expect to improve performance by adding a \emph{bias} variable, so that our model is
\[
y_i = w^Tx_i + w_0.
\]
instead of
\[
y_i = w^Tx_i.
\]
\blu{Write a new function, \emph{leastSquaresBias}, that has the same input/model/predict format as the \emph{leastSquares} function, but that adds a \emph{bias} variable $w_0$. Hand in your new function, the updated plot, and the updated training/test error.}
Hint: recall that adding a bias $w_0$ is equivalent to adding a column of ones to the matrix $X$. Don't forget that you need to do the same transformation in the \emph{predict} function.
\subsection{Linear Regression with Polynomial Basis}
Adding a bias variable improves the prediction substantially, but the model is still problematic because the target seems to be a \emph{non-linear} function of the input. Write a new function, \emph{leastSquaresBasis(x,y,p)}, that takes a data vector $x$ (i.e., assuming we only have one feature) and the polynomial order $p$. The function should perform a least squares fit based on a matrix $Z$ where each of its rows contains the values $(x_{i})^j$ for $j=0$ up to $p$. E.g., \emph{leastSquaresBasis(x,y,3)} should form the matrix
\[
Z =
\left[\begin{array}{cccc}
1 & x_1 & (x_1)^2 & (x_1)^3\\
1 & x_2 & (x_2)^2 & (x_2)^3\\
\vdots\\
1 & x_n & (x_n)^2 & (x_N)^3\\
\end{array}
\right],
\]
and fit a least squares model based on it.
\blu{Hand in the new function, and report the training and test error for $p = 0$ through $p= 10$. Explain the effect of $p$ on the training error and on the test error.}
Note: for this question we'll assume $d=1$ (we'll discuss polynomial bases with more input features later in the course).
Hints: To keep the code simple and reduce the chance of having errors, you may want to write a new function \emph{polyBasis} that you can use for transforming both the training and testing data.
\subsection{Manual Search for Optimal Basis}
Polynomials are a flexible class of functions, but there is structure in this data that is not well-modelled by polynomials. Try to find a nonlinear basis that gives the best performance on this dataset in terms of test error. \blu{Report the basis that you use and the training/test score that you achieve}.
Hint: the data seems to have periodic behaviour, and it's possible to obtain training and test errors below 60.
\subsection{Very-Short Answer Questions}
\blu{
\enum{
\item In this question, why are we computing the squared error $(y_i - \hat{y}_i)^2$ and not testing the equality $(y_i = \hat{y}_i)$?
\item Describe a simple 2-feature ($d=2$) case where the least squares estimate would not be unique.
\item What is the computational complexity of computing the closed-form (exact) solution to a linear least squares problem where we have one feature ($d = 1$) and use polynomial basis of degree $p$?
\item In what circumstance would a regression tree with linear regressions at the leaves be a better choice than a linear least squares regression model?
}}
\section{Robust Regression and Gradient Descent}
The script \emph{example\_outliers} loads a one-dimensional regression dataset that has a non-trivial number of `outlier' data points. These points do not fit the general trend of the rest of the data, and pull the least squares model away from the main downward trend that most data points exhibit:
%\centerfig{.7}{outliers}
\subsection{Weighted Least Squares in One Dimension}
One of the most common variations on least squares is \emph{weighted} least squares. In this formulation, we have a weight $v_i$ for every training example. To fit the model, we minimize the weighted squared error,
\[
f(w) = \frac{1}{2}\sum_{i=1}^n v_i(w^Tx_i - y_i)^2.
\]
In this formulation, the model focuses on making the error small for examples $i$ where $v_i$ is high. Similarly, if $v_i$ is low then the model allows a larger error.
Write a model function, \emph{weightedLeastSquares(X,y,v)}, that implements this model (note that a previous question asks you to show how this formulation can be solved as a linear system).
Apply this model to the data containing outliers, setting $v_i = 1$ for the first $400$ data points and $v_i = 0.1$ for the last $100$ data points (which are the outliers). \blu{Hand in your function and the updated plot}.
\subsection{Smooth Approximation to the L1-Norm}
Unfortunately, we typically do not know the identities of the outliers. In situations where we suspect that there are outliers, but we do not know which examples are outliers, it makes sense to use a loss function that is more robust to outliers. In class, we discussed using the sum of absolute values objective,
\[
f(w) = \sum_{i=1}^n |w^Tx_i - y_i|.
\]
This is less sensitive to outliers than least squares, but it is non-differentiable and harder to optimize. Nevertheless, there are various smooth approximations to the absolute value function that are easy to optimize. One possible approximation is to use the log-sum-exp approximation of the max function\footnote{Other possibilities are the Huber loss, $|r| \approx \sqrt{r^2 + \epsilon}$ for some small $\epsilon$.}
\[
|r| \approx \log(\exp(r) + \exp(-r)).
\]
%for some parameter $\alpha$. This approximation becomes exact as $\alpha$ goes to $\infty$, but for any fixed $\alpha$ the function will be differentiable.
Using this approximation, we obtain an objective of the form
\[
f(w) = \sum_{i=1}^n \log\left(\exp(w^Tx_i - y_i) + \exp(y_i - w^Tx_i)\right).
\]
which is smooth but less sensitive to outliers than the squared error. \blu{Derive
the gradient $\nabla f$ of this function with respect to $w$. You should show your work but you do not have to express the final result in matrix notation.}
\subsection{Robust Regression}
The function \emph{example\_gradient} is the same as \emph{example\_outlier}, except that it fits the least squares model using a \emph{gradient descent} method. You'll see that it produces the same fit as we obtained using the normal equations.
%One advantage of this strategy is that it only costs $O(nd)$ for an iteration of the gradient method, which is faster than forming $X^TX$ which costs $O(nd^2)$. Of course, we need to know the \emph{number} of gradient iterations in order to precisely compare these two strategies, but for now we will assume that the number of gradient iterations is typically often reasonable.
The typical input to a gradient method is a function that, given $w$, returns $f(w)$ and $\nabla f(w)$. See \emph{funObj} in the \emph{leastSquaresGradient} function for an example. Note that \emph{leastSquaresGradient} also has a numerical check that the gradient code is approximately correct, since implementing gradients is often error-prone.\footnote{Though sometimes the numerical gradient checker itself can be wrong. For a lot more on numerical differentiation you can take CPSC 303.}
An advantage of gradient-based strategies is that they are able to solve problems that do not have closed-form solutions, such as the formulation from the previous section. The function \emph{robustRegression} has most of the implementation of a gradient-based strategy for fitting the robust regression model under the log-sum-exp approximation. The only part missing is the function and gradient calculation inside the \emph{funObj} code. \blu{Modify this function to implement the objective function and gradient based on the smooth approximation to the absolute value function (from the previous section). Hand in your code, as well as the plot obtained using this robust regression appraoch.}
\subsection{Very-Short Answer Questions}
\blu{
\enum{
\item In class we considered 4 general strategies for outlier detection (model-based, graph-based, cluster-based, distance-based). Pick two of these and describe whether they would be effective for detecting the outliers in this dataset.
\item When should we consider using gradient descent to approximate the solution to the least squares problem instead of exactly solving it with the closed form solution?
\item Why are we smoothing the absolute value? Why can't we just set the gradient to 0 and solve a linear system?
}}
\end{document}