\documentclass{article}
\usepackage{amsmath,amssymb,amsthm}
\usepackage{dsfont}
\usepackage{thmtools}
\usepackage{etoolbox}
\usepackage{enumitem}
\usepackage{fullpage}
\usepackage{parskip}
\usepackage{xcolor}
\usepackage{verbatim}
\usepackage[colorlinks,linkcolor=red!80!black]{hyperref}
\usepackage[noabbrev,capitalize]{cleveref}
\newlist{qlist}{enumerate}{1}
\setlist[qlist,1]{leftmargin=*, label=\textbf{(\alph*)}, ref={(\alph*)}}
\crefname{qlisti}{part}{parts}
\AtBeginEnvironment{qlist}{\color{blue}}
\crefname{section}{Question}{Questions}
\definecolor{answer}{rgb}{0,.5,0}
\newcommand{\ans}[1]{\par\textcolor{answer}{Answer: #1}}
\newenvironment{answer}{\par\begingroup\color{answer}Answer: }{\endgroup}
\definecolor{points}{rgb}{0.6,0.3,0}
\newcommand{\pts}[1]{\textcolor{points}{[#1~points]}}
\newcommand{\hint}[1]{\textcolor{black!60!white}{\emph{Hint: #1}}}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\newcommand{\cH}{\mathcal{H}}
\newcommand{\cX}{\mathcal{X}}
\newcommand{\cY}{\mathcal{Y}}
\DeclareMathOperator{\E}{\mathbb{E}}
\newcommand{\op}{\mathit{op}}
\newcommand{\R}{\mathbb{R}}
\DeclareMathOperator{\spn}{span}
\newcommand{\tp}{^\mathsf{T}}
\DeclareMathOperator{\tr}{tr}
\newcommand{\dist}{\mathcal{D}}
\newcommand{\xdist}{\mathcal{D}_x}
\makeatletter
% \abs{} uses regular-sized bars, \abs*{} uses auto-sized ones
\newcommand{\abs}{\@ifstar{\@Abs}{\@abs}}
\newcommand{\@abs}[1]{\lvert #1 \rvert}
\newcommand{\@Abs}[1]{\left\lvert #1 \right\rvert}
\newcommand{\norm}{\@ifstar{\@Norm}{\@norm}}
\newcommand{\@norm}[1]{\lVert #1 \rVert}
\newcommand{\@Norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\ceil}{\@ifstar{\@Ceil}{\@ceil}}
\newcommand{\@ceil}[1]{\lceil #1 \rceil}
\newcommand{\@Ceil}[1]{\left\lceil #1 \right\rceil}
\newcommand{\floor}{\@ifstar{\@Floor}{\@floor}}
\newcommand{\@floor}[1]{\lfloor #1 \rfloor}
\newcommand{\@Floor}[1]{\left\lfloor #1 \right\rfloor}
\makeatother
\declaretheorem[name=Theorem,numberwithin=section]{thm}
\declaretheorem[name=Proposition,sibling=thm]{prop}
\begin{document}
\begin{center}
\Large
CPSC 532S: Assignment 1 --
due Thursday, 20 Jan 2022, 11:59pm
\end{center}
Prepare your answers to these questions using \LaTeX;
hopefully you're reasonably familiar with it,
but if not, try using Overleaf and looking around for tutorials online.
Feel free to ask questions if you get stuck on things on Piazza
(but remove any details about the actual answers to the questions\dots
feel free to make a private post if that's tough).
If you prefer, the \texttt{.tex} source for this file is available on the course website,
and you can put your answers in
\verb|\begin{answ|\verb|er} My answer here... \end{answer}|
% that weird formatting so my answer-stripping script doesn't catch it :p
environments to make them stand out if so;
feel free to delete whatever boilerplate you want (or not, I'm not printing them out).
Or answer in a fresh document;
just make it clear which question you're answering where.
\textbf{If you're submitting to ICML} (or another deadline between January 18th and 31st),
you can instead submit handwritten solutions for this assignment only;
also note that your lowest assignment grade will be dropped when computing your final grade,
so you can skip one assignment if you prefer.
You should do this assignment \textbf{alone}, and do the whole thing.
(The first of these instructions, and probably the second as well, will change for future assignments.)
If you look stuff up anywhere other than in SSBD,
please \textbf{cite your sources}:
just say in the answer to that question where you looked.
Please do not look at solution manuals / etc for SSBD,
or look up proofs of the standard results we're proving in \cref{q:pca}.
Submit your answers as a single PDF on Gradescope:
link and login instructions on \href{https://canvas.ubc.ca/courses/83445}{the Canvas site},
which you should now be able to get to even if you're not yet officially enrolled.
You'll be prompted to mark where each question is in your PDF;
make sure you mark all relevant pages for each part.
(This saves me a surprising amount of time in grading,
although it's kind of annoying for this one since \cref{q:pca} has a lot of parts; sorry.)
Please \textbf{put your name on the first page} as a backup, just in case.
If something goes wrong, you can also email your assignment to me directly (\texttt{dsuth@cs.ubc.ca}).
\vspace{3em}
Some quick meta-notes about the questions (feel free not to read):
\Cref{q:shais}, \nameref{q:shais}, lives up to its name;
it's a few questions right out of the SSBD book.
They, hopefully, should help reinforce the basic definitions / etc.
You should be able to do \cref{q:shais:unbiased} and probably \cref{q:shais:circles} now;
material for \cref{q:shais:bayesopt} will be covered in lecture 2.
\Cref{q:pca}, \nameref{q:pca}, is about doing some proofs with linear algebra, which will be important later in the course, but mostly isn't connected to this first week of material.
Only \cref{q:pca:loss} directly has to do with lecture, and will be covered in class 2;
you can do the rest of the question now and come back to that part afterwards.
This question is in here
(i) to remind you how linear algebra works (since this will matter later),
(ii) because I think it's at a nice level of ``walk you through a meaningful proof in a way where you do still have to think a bit'',
and (iii) because it proves some stuff that I think is nice to know that isn't covered in CPSC 340.
\clearpage
\section{Some of the Shais' Problems \pts{50}} \label{q:shais}
These problems are from the book of Shalev-Shwartz and Ben-David,
though I tweaked some notation slightly to agree with what we're using in lecture.
For each of these problems, use the 0-1 loss (the misclassification rate, as discussed in lecture 1 / chapter 2 of SSBD).
\begin{qlist}
\item \pts{10} \label{q:shais:unbiased}
\begingroup\color{black}
[SSBD 2.2]
Let $\cH$ be a class of binary classifiers over a domain $\cX$.
Let $\xdist$ be an unknown distribution over $\cX$,
and let $f$ be the target hypothesis in $\cH$.
Fix some $h \in \cH$.
Use the 0-1 loss (the misclassification rate).
\endgroup
Show that the expected value of $L_S(h)$, over the choice of a training set $S$ of $n$ points drawn i.i.d.\ from $\xdist$ and labeled by $f$, equals $L_{\xdist,f}(h)$:
namely, $\E_S[L_S(h)] = L_{\xdist,f}(h)$.
\item \pts{20} \label{q:shais:circles}
\begingroup\color{black}
[SSBD 3.3]
Let $\cX = \R^2$, $\cY = \{0, 1\}$,
and let $\cH$ be the class of concentric circles in the plane
-- that is, $\cH = \{ h_r : r \in \R_+ \}$,
where $h_r(x) = \mathds{1}_{[ \norm x \le r]}$
(a function which is 1 if $\norm x \le r$, 0 otherwise).
\endgroup
Prove that $\cH$ is PAC learnable (assuming realizability),
and its sample complexity is bounded by
$n_{\mathcal H}(\varepsilon, \delta) \le \ceil*{\frac1\varepsilon \log(1 / \delta)}$.
\item \pts{20} \label{q:shais:bayesopt}
\textcolor{black}{[SSBD 3.7]}
Show that for every probability distribution $\dist$,
the Bayes-optimal predictor $f_\dist$ is optimal,
in the sense that for every classifier $g : \cX \to \{0, 1\}$,
$L_\dist(f_\dist) \le L_\dist(g)$.
\end{qlist}
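(As a non-proof aside: if you'd like to see the statement in \cref{q:shais:unbiased} numerically before proving it, here's a quick Monte Carlo on a toy setup of my own invention -- uniform $\xdist$ on $\{0,\dots,9\}$ with a made-up $f$ and $h$. An illustration only, definitely not a substitute for the proof.)

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy setup (my own, not from SSBD): X = {0,...,9} uniform,
# target f(x) = 1[x < 5], fixed hypothesis h(x) = 1[x < 3].
f = lambda x: (x < 5).astype(int)
h = lambda x: (x < 3).astype(int)

# True risk under the 0-1 loss: h and f disagree exactly on {3, 4},
# so L_{D,f}(h) = P(x in {3, 4}) = 0.2.
true_risk = 0.2

n, trials = 10, 20000
emp_losses = np.empty(trials)
for t in range(trials):
    S = rng.integers(0, 10, size=n)          # i.i.d. sample from D_x
    emp_losses[t] = np.mean(h(S) != f(S))    # empirical 0-1 loss L_S(h)

# The average of L_S(h) over many draws of S approaches L_{D,f}(h).
assert abs(emp_losses.mean() - true_risk) < 0.01
```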
\clearpage
\section{Principal Component Analysis From First Principles \pts{50}} \label{q:pca}
We mentioned in class that the loss minimization framework can also apply to cases without a simple label, such as unsupervised learning.
Let's use this idea to study perhaps the most common unsupervised linear framework: PCA.
We'll introduce it as we go; don't worry if you haven't seen it before.
In PCA, we're given a dataset $X \in \R^{n \times d}$,
which contains $n$ points $x_i \in \R^d$ stored as the rows of $X$.\footnote{
As usual in machine learning, we'll think of these as column vectors, even though they're the rows of $X$.
As a student in CPSC 340 said last term, ``that's dumb,'' but oh well.}
We're trying to find a linear transformation $W \in \R^{k \times d}$,
for $k \le d$,
where $W$ has orthonormal rows ($W W\tp = I_k$),
and we want $W x \in \R^k$ to contain ``as much information as it can'' about $x$.
For instance, we might want to do this with $k = 2$ to plot high-dimensional data.
We can do this to everything in $X$ by taking $X W\tp$;
you should make sure it makes sense to you that this is the same as stacking up $W x_i$ for each $i$.
To map a point $z \in \R^k$ back to $\R^d$,
we just take $W\tp z$ (why?).
Thus projecting and reconstructing the entire dataset is $X W\tp W$.
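If you want to sanity-check this bookkeeping numerically, here's a quick numpy snippet (an aside, not part of the assignment) verifying that $X W\tp$ stacks the projections $W x_i$ and that $X W\tp W$ stacks the reconstructions $W\tp W x_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 6, 4, 2

X = rng.normal(size=(n, d))          # rows are the points x_i

# Build a W with orthonormal rows via QR on a random d x k matrix.
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))  # Q: d x k, orthonormal columns
W = Q.T                                        # k x d, so W W^T = I_k
assert np.allclose(W @ W.T, np.eye(k))

# Projecting all of X at once equals stacking W x_i row by row.
projected = X @ W.T
assert np.allclose(projected, np.stack([W @ x for x in X]))

# Mapping each z back with W^T z; for the whole dataset this is X W^T W.
recon = X @ W.T @ W
assert np.allclose(recon, np.stack([W.T @ (W @ x) for x in X]))
```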
We'll choose $W$ to minimize the squared reconstruction error on our dataset:
\begin{equation}
\argmin_{W : W W\tp = I_k} \norm{X W\tp W - X}_F^2
\tag{PCA} \label{eq:pca-def}
,\end{equation}
where $\norm{X}_F^2 = \sum_{ij} X_{ij}^2 = \tr(X\tp X)$
is the squared Frobenius norm.
You may find it helpful for this question to recall ``trace rotation'': $\tr(A B) = \tr(B A)$.
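(Another optional numerical aside, checking the two identities just stated -- trace rotation and the Frobenius-norm-as-trace formula:)

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 5))
B = rng.normal(size=(5, 3))

# Trace rotation: tr(AB) = tr(BA), even though AB is 3x3 and BA is 5x5.
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# Squared Frobenius norm: sum of squared entries = tr(X^T X).
X = rng.normal(size=(4, 6))
fro2 = np.sum(X**2)
assert np.isclose(fro2, np.trace(X.T @ X))
assert np.isclose(fro2, np.linalg.norm(X, 'fro')**2)
```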
\begin{qlist}
\item \label{q:pca:loss} \pts{5}
Frame \eqref{eq:pca-def} as empirical risk minimization using the terminology from class
(also see page 48 of Shai+Shai).
Specifically,
what are the data domain $\mathcal Z$,
the sample $S = (z_1, \dots, z_n) \in \mathcal Z^n$,
the hypothesis class $\mathcal H$,
and the loss function $\ell : \mathcal H \times \mathcal Z \to \R$
such that the set of ERMs is exactly the set of solutions to \eqref{eq:pca-def}?
\end{qlist}
Recall that the sample covariance\footnote{
This is the unbiased sample covariance; the same is true for the biased estimator.}
of points $z_1, \dots, z_n \in \R^d$
is $\frac{1}{n-1} \sum_{i=1}^n \left( z_i - \bar z \right) \left( z_i - \bar z \right)\tp \in \R^{d \times d}$,
where $\bar z = \frac1n \sum_{i=1}^n z_i$ is the sample mean.
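(If you want to double-check that formula against a library implementation -- again just an aside -- the sum-of-outer-products expression above matches \texttt{np.cov}:)

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
Z = rng.normal(size=(n, d))          # rows are the points z_i

zbar = Z.mean(axis=0)                # sample mean
# Unbiased sample covariance, written exactly as in the text.
S = sum(np.outer(z - zbar, z - zbar) for z in Z) / (n - 1)

# np.cov expects variables in rows and observations in columns, hence Z.T;
# its default normalization is also 1/(n-1).
assert np.allclose(S, np.cov(Z.T))
```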
Another view of PCA is maximizing the variance of the projected points:
\begin{qlist}[resume*]
\item \label{q:pca:cov} \pts{5}
\begingroup\color{black}
Suppose, for this part only, that $X$ is centred: $\frac1n \mathbf 1_n\tp X = \mathbf 0_d\tp$,
where $\mathbf 1_n \in \R^n$ is an all-ones vector and $\mathbf 0_d \in \R^d$ is an all-zeros vector.
(This is standard in PCA; it gives us more flexibility in the fit.)
\endgroup
Show \eqref{eq:pca-def} is equivalent to maximizing the trace of the sample covariance of $X W\tp$.
\end{qlist}
Even requiring orthonormal $W$, there are many equivalent solutions to \eqref{eq:pca-def}.\footnote{%
Note that
$(R W)\tp (R W) = W\tp (R\tp R) W$
and $(R W) (R W)\tp = R W W\tp R\tp = R R\tp$,
so any unitary matrix (where $R\tp R = I = R R\tp$)
transforms any solution $W$ into an equivalent one.
For instance, you can rotate, permute, or flip signs of the components.}
To reduce the number of valid solutions (and because it has its own advantages),
we usually require that $W$ be consistent with PCA using \emph{sequential fitting}:
first, we solve \eqref{eq:pca-def} for $k = 1$.
Once we've fit the first $j-1$ components,
to get the $j$th we remove the already-fit components by forming the residual $X - X W\tp W$ (using the $W$ fit so far),
solve \eqref{eq:pca-def} with $k = 1$ on that residual,
and make the resulting row the $j$th row of $W$.
We'll prove shortly that this doesn't hurt our reconstruction performance.
Thus, for the next little bit, we're going to think only about the solution for $k = 1$.
For notational convenience, let the vector $w \in \R^d$ be the first (and only) row of $W$,
so that $W = w\tp$.
Plugging in to \cref{q:pca:cov},
we obtain that solutions to \eqref{eq:pca-def}
are exactly the maximizers of $\norm{X w}$ over unit vectors $w$. (There's a bit of a hint for \cref{q:pca:cov}!)
What $w$ are these?
We can answer that with the following result,
remembering $\norm{X w}^2 = w\tp (X\tp X) w$:
\begin{prop} \label{thm:rayleigh}
Let $A \in \R^{m \times m}$ have real eigenvalues $\lambda_1 \ge \cdots \ge \lambda_m$,
with corresponding real orthonormal eigenvectors $q_i$.
Let $k$ be the largest index such that $\lambda_1 = \lambda_k$
(i.e.\ the top $k$ eigenvalues are the same).
Then
\[
\max_{v : \norm v = 1} v\tp A v = \lambda_1
;\]
this max is achieved if and only if $v$ is a unit vector in
$\spn(\{ q_1, \dots, q_k \}) = \{ \sum_{i=1}^k \alpha_i q_i : \alpha_i \in \R \}$.
\end{prop}
\begin{proof}
Pick $\alpha \in \R^m$ such that $v = \sum_{i=1}^m \alpha_i q_i$.
\begin{qlist}[resume*]
\item \pts{5} Prove this.
\end{qlist}
\end{proof}
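(A quick numerical check of \cref{thm:rayleigh}, if it helps your intuition while proving it -- the top eigenvector of a symmetric matrix attains the max, and random unit vectors never beat it:)

```python
import numpy as np

rng = np.random.default_rng(3)
m = 5
M = rng.normal(size=(m, m))
A = M.T @ M                          # symmetric PSD: real eigenvalues/vectors

evals, evecs = np.linalg.eigh(A)     # eigenvalues in ascending order
lam1 = evals[-1]                     # lambda_1, the largest
q1 = evecs[:, -1]                    # corresponding unit eigenvector

# The top eigenvector attains v^T A v = lambda_1 ...
assert np.isclose(q1 @ A @ q1, lam1)

# ... and no random unit vector does better.
for _ in range(1000):
    v = rng.normal(size=m)
    v /= np.linalg.norm(v)
    assert v @ A @ v <= lam1 + 1e-9
```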
Recall that the singular value decomposition, the SVD,\footnote{%
Here we're using the ``compact'' SVD, which I just find a little more convenient to think about. Everything would be basically the same with the ``full'' SVD.}
of a real rank-$r$ matrix $A \in \R^{a \times b}$
is $A = U \Sigma V\tp = \sum_{i=1}^r \sigma_i u_i v_i\tp$,
where $\Sigma$ is a diagonal matrix with entries $\sigma_1 \ge \cdots \ge \sigma_r > 0$,
$U \in \R^{a \times r}$ has orthonormal columns $u_i \in \R^a$ so that $U\tp U = I_r$,
and $V \in \R^{b \times r}$ has orthonormal columns $v_i \in \R^b$ so that $V\tp V = I_r$.
The $\sigma_i$ are the square roots of the nonzero eigenvalues of $A\tp A$,
which coincide with those of $A A\tp$.
$U$ contains the corresponding eigenvectors of $A A\tp$,
while $V$ has those of $A\tp A$.
Note that the SVD is not unique: you can always flip the signs of both $u_i$ and $v_i$,
and if there are non-distinct singular values, you can rotate the singular vectors in the corresponding subspace.
Let $A_k = \sum_{i=1}^k \sigma_i u_i v_i\tp = U_{1:k} \Sigma_{1:k} V_{1:k}\tp$
be the rank-$k$ approximation obtained by truncating the SVD.
Here $U_{1:k}$ and $V_{1:k}$ take the first $k$ columns of $U$ or $V$.
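(One more optional numpy aside, confirming the compact-SVD facts just stated and the two equivalent forms of $A_k$:)

```python
import numpy as np

rng = np.random.default_rng(4)
a, b, k = 6, 4, 2
A = rng.normal(size=(a, b))
r = np.linalg.matrix_rank(A)         # generically r = min(a, b) = 4 here

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # compact SVD
assert np.allclose(U @ np.diag(s) @ Vt, A)
assert np.allclose(U.T @ U, np.eye(r))             # U^T U = I_r
assert np.allclose(Vt @ Vt.T, np.eye(r))           # V^T V = I_r

# sigma_i^2 are the (nonzero) eigenvalues of A^T A.
assert np.allclose(np.sort(s**2), np.sort(np.linalg.eigvalsh(A.T @ A)))

# Rank-k truncation: sum of outer products = sliced matrix product.
A_k = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(k))
assert np.allclose(A_k, U[:, :k] @ np.diag(s[:k]) @ Vt[:k])
```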
\begin{qlist}[resume*]
\item \label{q:pca:recon} \pts{10}
Show that the $k$th principal component of $X$,
obtained via the sequential algorithm described above,
is a valid choice for the $k$th right singular vector,
so $W = V_{1:k}\tp$ for one valid choice of $V$.
Furthermore, show $X W\tp W = X_k$.
\end{qlist}
We now know that for any $k$, the PCA reconstructions $X W\tp W$ are equal to $X_k$.
That this is a global minimum for \eqref{eq:pca-def} is a consequence of the following theorem:
\begin{restatable}[Eckart-Young, Frobenius]{thm}{eckfrob} \label{thm:eckart-young:frobenius}
For any matrix $B$ of rank at most $k$,
$\norm{A - A_k}_F \le \norm{A - B}_F$.
\end{restatable}
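(Before proving it, here's a cheap numerical sanity check of the theorem's statement: no random rank-$k$ matrix beats the truncated SVD in Frobenius error. Obviously not a proof -- random matrices aren't all matrices.)

```python
import numpy as np

rng = np.random.default_rng(5)
a, b, k = 8, 6, 2
A = rng.normal(size=(a, b))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # best rank-k approximation
best = np.linalg.norm(A - A_k, 'fro')

# Random matrices of rank at most k should never do better.
for _ in range(200):
    B = rng.normal(size=(a, k)) @ rng.normal(size=(k, b))   # rank <= k
    assert best <= np.linalg.norm(A - B, 'fro') + 1e-9
```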
We're now going to prove this theorem together.
To do it, we'll first prove the version for the spectral norm
(also known as the operator norm).
Recall that the spectral norm is
$\norm{A}_\op = \sup_x \norm{A x} / \norm{x}$,
which \cref{thm:rayleigh} implies is equal to the largest singular value of $A$.
\begin{thm}[Eckart-Young, operator] \label{thm:eckart-young:op}
For any matrix $B$ of rank at most $k$,
$\norm{A - A_k}_\op \le \norm{A - B}_\op$.
\end{thm}
\begin{proof}
Recall that $A$ is $a \times b$ and of rank $r$.
Assume, without loss of generality, that $k < r \le b \le a$.
\begin{qlist}[resume*]
\item \pts{3} Explain why this doesn't lose generality.
\item \pts{2} Show that $\norm{A - A_k}_\op = \sigma_{k+1}$.
\hint{You probably already basically did this as part of \cref{q:pca:recon}.}
\end{qlist}
Now, assume for the sake of contradiction that there is some $B$ of rank at most $k$ with $\norm{A - B}_\op < \sigma_{k+1}$.
\begin{qlist}[resume*]
\item \label{q:pca:null-B} \pts{5}
Let $y$ be in the null space of $B$, i.e.\ $B y = 0$, with $y \ne 0$.
Show $\norm{A y} < \sigma_{k+1} \norm y$.
\item \label{q:pca:span-top} \pts{5}
Let $z \in \spn(\{ v_1, \dots, v_{k+1} \})$.
Show $\norm{A z} \ge \sigma_{k+1} \norm z$.
\item \pts{5}
Argue that this is a contradiction.
\hint{Try a dimension-counting argument.}
\end{qlist}
\end{proof}
Okay -- almost done!
The theorem we actually wanted to show was:
\eckfrob*
\begin{proof}
Let $B$ be any $a \times b$ matrix of rank at most $k$.
\begin{qlist}[resume*]
\item \pts{5} Prove the theorem.
\hint{
Try using \cref{thm:eckart-young:op},
recalling that
\[
\norm{A}_F^2
= \tr(A\tp A)
= \tr(V \Sigma U\tp U \Sigma V\tp)
= \tr(\underbrace{V\tp V}_I \Sigma \underbrace{U\tp U}_I \Sigma)
= \tr(\Sigma^2)
= \sum_{i=1}^r \sigma_i^2
.\]}
\end{qlist}
\end{proof}
Alternatively, if you want -- it's basically the same proof -- you could show instead:
\begin{thm}[Eckart-Young, more general]
Let $f(A) = f(\sigma_1, \dots, \sigma_{\min(a, b)})$
be a function of the singular values of the $a \times b$ matrix $A$ (where $\sigma_i = 0$ if $i > r$).
Suppose that if $\sigma_i' \ge \sigma_i$ for all $i$,
then $f(\sigma_1', \dots, \sigma_{\min(a, b)}') \ge f(\sigma_1, \dots, \sigma_{\min(a, b)})$.
Then, for all matrices $B$ of rank at most $k$, $f(A - A_k) \le f(A - B)$.
\end{thm}
In particular, this will include all \emph{unitarily invariant} matrix norms,
i.e.\ matrix norms such that $\norm{R A S} = \norm{A}$ for all unitary matrices $R$ and $S$.
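(Last aside: unitary invariance is easy to see numerically, since $R A S$ has the same singular values as $A$ -- here checked for the Frobenius and spectral norms with random orthogonal $R$ and $S$:)

```python
import numpy as np

rng = np.random.default_rng(6)
a, b = 5, 3
A = rng.normal(size=(a, b))

# Random orthogonal (real unitary) matrices via QR.
R, _ = np.linalg.qr(rng.normal(size=(a, a)))
S, _ = np.linalg.qr(rng.normal(size=(b, b)))

# Frobenius and spectral norms are unitarily invariant: ||R A S|| = ||A||.
assert np.isclose(np.linalg.norm(R @ A @ S, 'fro'), np.linalg.norm(A, 'fro'))
assert np.isclose(np.linalg.norm(R @ A @ S, 2), np.linalg.norm(A, 2))

# The reason: R A S has exactly the same singular values as A.
sv_A = np.linalg.svd(A, compute_uv=False)
sv_RAS = np.linalg.svd(R @ A @ S, compute_uv=False)
assert np.allclose(sv_A, sv_RAS)
```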
\end{document}