\documentclass{article}
\usepackage{amsmath,amssymb,amsthm}
\usepackage{csquotes}
\usepackage{dsfont}
\usepackage{thmtools}
\usepackage{etoolbox}
\usepackage{enumitem}
\usepackage{fullpage}
\usepackage{parskip}
\usepackage{xcolor}
\usepackage{verbatim}
\usepackage[colorlinks,linkcolor=red!80!black]{hyperref}
\usepackage[noabbrev,capitalize]{cleveref}
\definecolor{question}{rgb}{0,0,1}
\crefname{section}{Question}{Questions}
\newlist{qlist}{enumerate}{1}
\setlist[qlist,1]{leftmargin=*, label=\textbf{[\thesection.\arabic*]}, ref={[\thesection.\arabic*]}}
\crefname{qlisti}{Question}{Questions}
\AtBeginEnvironment{qlist}{\color{question}}
\definecolor{answer}{rgb}{0,.5,0}
\newcommand{\ans}[1]{\par\textcolor{answer}{Answer: #1}}
\newenvironment{answer}{\par\begingroup\color{answer}Answer: }{\endgroup}
\definecolor{points}{rgb}{0.6,0.3,0}
\newcommand{\pts}[1]{\textcolor{points}{[#1~points]}}
\newcommand{\hint}[1]{\textcolor{black!60!white}{\emph{Hint: #1}}}
\newcommand{\meta}[1]{\textcolor{black!60!white}{\emph{#1}}}
\newcommand{\TODO}{\textcolor{red}{TODO}}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\newcommand{\bsigma}{\boldsymbol{\sigma}}
\DeclareMathOperator{\bigO}{\mathcal{O}}
\newcommand{\cA}{\mathcal{A}}
\newcommand{\cB}{\mathcal{B}}
\newcommand{\cF}{\mathcal{F}}
\newcommand{\cH}{\mathcal{H}}
\newcommand{\cI}{\mathcal{I}}
\newcommand{\cP}{\mathcal{P}}
\newcommand{\cQ}{\mathcal{Q}}
\newcommand{\cX}{\mathcal{X}}
\newcommand{\cY}{\mathcal{Y}}
\newcommand{\cZ}{\mathcal{Z}}
\DeclareMathOperator{\conv}{conv}
\DeclareMathOperator*{\E}{\mathbb{E}}
\DeclareMathOperator{\ERM}{ERM}
\newcommand{\eps}{\varepsilon}
\newcommand{\indic}{\mathds{1}}
\DeclareMathOperator{\MMD}{MMD}
\newcommand{\NN}{\mathbb{N}}
\newcommand{\op}{\mathit{op}}
\newcommand{\R}{\mathbb{R}}
\DeclareMathOperator{\Rad}{Rad}
\newcommand{\SG}{\mathcal{SG}}
\DeclareMathOperator{\sgn}{sgn}
\DeclareMathOperator{\spn}{span}
\newcommand{\tp}{^\mathsf{T}}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator{\Unif}{Unif}
\DeclareMathOperator{\VCdim}{VCdim}
\newcommand{\dist}{\mathcal{D}}
\newcommand{\xdist}{\mathcal{D}_x}
%%begin novalidate % overleaf code check gives false positives here
\makeatletter
% \abs{} uses regular-sized bars, \abs*{} uses auto-sized ones
\newcommand{\abs}{\@ifstar{\@Abs}{\@abs}}
\newcommand{\@abs}[1]{\lvert #1 \rvert}
\newcommand{\@Abs}[1]{\left\lvert #1 \right\rvert}
\newcommand{\norm}{\@ifstar{\@Norm}{\@norm}}
\newcommand{\@norm}[1]{\lVert #1 \rVert}
\newcommand{\@Norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\ceil}{\@ifstar{\@Ceil}{\@ceil}}
\newcommand{\@ceil}[1]{\lceil #1 \rceil}
\newcommand{\@Ceil}[1]{\left\lceil #1 \right\rceil}
\newcommand{\floor}{\@ifstar{\@Floor}{\@floor}}
\newcommand{\@floor}[1]{\lfloor #1 \rfloor}
\newcommand{\@Floor}[1]{\left\lfloor #1 \right\rfloor}
\newcommand{\inner}{\@ifstar{\@Inner}{\@inner}}
\newcommand{\@inner}[2]{\langle #1, #2 \rangle}
\newcommand{\@Inner}[2]{\left\langle #1, #2 \right\rangle}
\makeatother
%%end novalidate
\newcommand{\lec}[2]{\href{https://cs.ubc.ca/~dsuth/532D/23w1/notes/#1.pdf}{#2}}
\declaretheorem[name=Theorem,numberwithin=section]{thm}
\declaretheorem[name=Proposition,sibling=thm]{prop}
\declaretheorem[name=Lemma,sibling=thm]{lemma}
\begin{document}
\begin{center}
\Large
CPSC 532D, Fall 2023: Assignment 4
\\
due Wednesday December 20th, \textbf{11:59 pm}
\end{center}
\meta{%
Use \LaTeX, like usual.
}
You can do this with a partner if you'd like (there's a ``find a group'' post on Piazza).
If so, \textbf{do not just split the questions up};
if you hand in an assignment with your name on it,
you're pledging that you participated in and understand all of the solutions.
(If you work with a partner on some problems and then end up doing some of them separately,
hand in separate answers and put a note in each question saying whether you did it with a partner or not.)
\meta{%
If you look stuff up anywhere other than in SSBD or MRT,
\textbf{cite your sources}:
just say in the answer to that question where you looked.
If you ask anyone else for help, \textbf{cite that too}.
Please do not look at solution manuals / search for people proving the things we're trying to prove / etc.
Also, please do \emph{not} ask ChatGPT or similar models.
It's okay to talk to others outside your group about general strategies -- if so, just say who and for which questions --
but \textbf{not} to sit down and do the assignment together.
}
\meta{%
Submit your answers as a single PDF on Gradescope:
\href{https://canvas.ubc.ca/courses/123461/external_tools/46711?display=borderless}{here's the link}.
}
Make sure to use the Gradescope group feature if you're working in a group.
\meta{%
You'll be prompted to mark where each question is in your PDF;
make sure you mark all relevant pages for each part (which saves a surprising amount of grading time).
}
Please \textbf{put your name(s) on the first page} as a backup, just in case.
If something goes wrong, you can also email your assignment to me directly (\texttt{dsuth@cs.ubc.ca}).
\clearpage
\section{Expectation bounds imply PAC-learning \pts{10}} \label{q:pac-and-exp}
\meta{Our SGD bound, as well as the stability bound that we actually proved (as opposed to the one that relies on a complicated proof we didn't cover), only showed learning in expectation. This problem establishes that learning in expectation is equivalent to PAC learning, albeit maybe with a bad rate.}
Let $\cA$ be a learning algorithm, $\dist$ a probability distribution,
and $\ell$ a loss function bounded in $[0, 1]$.
For brevity's sake, let $L$ be the random variable $L_\dist(\cA(S))$.
\textcolor{question}{Prove that the following two statements are equivalent:
\begin{enumerate}
\item There is some $m(\eps, \delta)$
such that for every $\eps, \delta \in (0, 1)$, for all $m \ge m(\eps, \delta)$,
$\Pr_{S \sim \dist^m}(L > \eps) < \delta$.
\item $\cA$'s expected loss is asymptotically zero: $\lim_{m \to \infty} \E_{S \sim \dist^m} L = 0$.
\end{enumerate}
}
\begin{answer}\TODO\end{answer}
\clearpage
\section{A really hard convex-Lipschitz-bounded problem \pts{15}}
\meta{%
Recall that we showed in class that regularized loss minimization
can learn any convex-Lipschitz-bounded problem:
if $h \mapsto \ell(h, z)$ is convex and $\rho$-Lipschitz for each $z \in \cZ$,
$\cH$ is convex, and there is some strongly convex function $R(h)$ -- e.g.\ $R(h) = \frac12 \norm{h}^2$ -- such that $R(h^*) \le \frac12 B^2$,
then regularized loss minimization with the right choice of regularization weight
can find $\hat h$ such that $L_\dist(\hat h) \le L_\dist(h^*) + \bigO(1 / \sqrt m)$,
either appealing to the complicated paper or by our expectation bound plus \cref{q:pac-and-exp}.
We also showed that in this setting,
gradient descent can implement regularized loss minimization
up to $\eps$ accuracy using $\bigO(1 / \eps^2)$ gradient steps.\footnote{You can actually show $\bigO(1/\eps)$ by exploiting strong convexity; the bound we proved didn't assume it.}
Thus, any convex-Lipschitz-bounded problem can be PAC-learned in polynomially many gradient steps.
}
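\meta{If it helps to see this concretely, here's a toy subgradient-descent sketch of regularized loss minimization (everything here -- the absolute loss, the sample distribution, the step size $1/(\lambda t)$ -- is just an illustrative choice, not something from class):}

```python
import numpy as np

# Toy convex-Lipschitz-bounded setup: H = [0, 1], loss ell(h, z) = |h - z|
# (convex, 1-Lipschitz), regularizer (lam/2) h^2.  Subgradient descent on
# the regularized empirical objective, projecting back onto H each step.
rng = np.random.default_rng(0)
S = rng.uniform(0.2, 0.8, size=100)  # training samples z_i
lam = 0.5
T = 10_000
h = 0.0
avg = 0.0
for t in range(1, T + 1):
    g = np.mean(np.sign(h - S)) + lam * h      # subgradient of L_S(h) + (lam/2) h^2
    h = min(max(h - g / (lam * t), 0.0), 1.0)  # step size 1/(lam t), project onto [0, 1]
    avg += (h - avg) / t                       # running average of the iterates
print(avg)  # the averaged iterate, close to the regularized minimizer
```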
\meta{%
This \emph{doesn't} guarantee that convex-Lipschitz-bounded problems can be efficiently learned.
}
Let $\cH = [0, 1]$ -- nice and simple -- but let the example domain $\cZ$ be the class of all pairs of Turing machines $T$ and input strings $s$.
Define
\[
\ell(h, (T, s)) = \begin{cases}
\mathds{1}(\text{$T$ halts on the input $s$}) & \text{if } h = 0 \\
\mathds{1}(\text{$T$ does not halt on the input $s$}) & \text{if } h = 1 \\
(1-h) \ell(0, (T, s)) + h \ell(1, (T, s)) & \text{if } 0 < h < 1
.\end{cases}
\]
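\meta{To be clear about what this loss computes, here's a sketch of the interpolation, where the \texttt{halts} flag is an oracle input standing in for the (uncomputable) halting indicator:}

```python
# Sketch of the loss above; `halts` is given as an input bit, the whole
# point being that no program can actually compute it in general.
def loss(h, halts):
    """Linearly interpolate between the two endpoint losses."""
    ell0 = 1.0 if halts else 0.0  # ell(0, (T, s))
    ell1 = 0.0 if halts else 1.0  # ell(1, (T, s))
    return (1 - h) * ell0 + h * ell1

print(loss(0.25, True), loss(0.25, False))  # 0.75 0.25
```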
\textcolor{question}{Prove that this problem is convex-Lipschitz-bounded, but no computable algorithm can PAC-learn it.}
\hint{Think about what the loss minimizer $h^*$, or the ERM, represents with this loss.}
\hint{If you have no idea what I'm talking about: look up the ``halting problem.''}
\begin{answer}\TODO\end{answer}
\clearpage
\section{Learning without concentration \pts{25}}
We're going to do an unsupervised learning task,
where we try to estimate the mean of a distribution,
but we do it with some \emph{missing} observations.
Specifically,
let $\cB$ be the closed unit ball $\cB = \{ w \in \R^d : \norm w \le 1 \}$,
and let the samples be in
$\cZ = \cB \times \{0, 1\}^d$,
where a sample $z = (x, \alpha)$ consists of a point $x$ together with a binary ``mask'' vector $\alpha$ indicating which entries of $x$ are missing.
We want to estimate the mean ignoring the missing entries,
i.e.\ $\cH = \cB$ and
\[
\ell(w, (x, \alpha)) = \sum_{i=1}^d \begin{cases}
0 & \text{if } \alpha_i = 1 \\
(x_i - w_i)^2 & \text{if } \alpha_i = 0
.\end{cases}
\]
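\meta{Concretely, this loss is just the squared error summed over the observed ($\alpha_i = 0$) coordinates; a minimal sketch:}

```python
import numpy as np

# Squared error over observed coordinates only; masked (alpha_i = 1)
# coordinates contribute zero.
def masked_loss(w, x, alpha):
    w, x, alpha = np.asarray(w), np.asarray(x), np.asarray(alpha)
    return float(np.sum((1 - alpha) * (x - w) ** 2))

x = np.array([0.5, -0.5, 0.0])
alpha = np.array([0, 1, 0])                # middle entry is missing
print(masked_loss(np.zeros(3), x, alpha))  # 0.25: only coords 0 and 2 count
```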
\begin{qlist}
\item \pts{10} Show that regularized loss minimization can PAC-learn this problem with a sample complexity independent of $d$.
\hint{Feel free to use the result of \cref{q:pac-and-exp} and results from class.}
\begin{answer}\TODO\end{answer}
\item \pts{10}
\begingroup\color{black}
Let $\dist$ be a distribution where $x$ is always the fixed vector $0$, and $\alpha$ has its entries i.i.d.\ $\Unif(\{0, 1\}) = \operatorname{Bernoulli}(1/2)$.
Let $m_\dist(\eps, \delta)$ denote the sample complexity of uniform convergence for this $\dist$,
so that if $m \ge m_\dist(\eps, \delta)$, then
\[
\Pr_{S \sim \dist^m}\left( \sup_{w \in \cH} \big[ L_\dist(w) - L_S(w) \big] \le \eps \right) \ge 1 - \delta
.\]
\endgroup
Show that for some particular values of $\eps > 0$ and $\delta > 0$, $m_\dist(\eps, \delta)$ increases with $d$.
% Show that for any $n$,
% \[ \lim_{d \to \infty} \E_{S \sim \dist^n} \sup_{h \in \cH} L_\dist(h) - L_S(h) = \frac12 .\]
\hint{Show that if $d$ is large enough relative to $m$, you're likely to get at least one dimension $j$ where $(\alpha_i)_j = 1$ for \emph{all} $m$ samples $(x_i, \alpha_i) \in S$.}
\begin{answer}\TODO\end{answer}
\item \pts{5} Describe a problem where RLM is a PAC learner, but uniform convergence doesn't hold. Why doesn't this contradict the fundamental theorem of statistical learning?
\begin{answer}\TODO\end{answer}
\end{qlist}
\clearpage
\section{Maximizing differences \pts{40 + 5 challenge}} \label{q:max-diffs}
Let's consider learning a kernel classifier with the somewhat unusual \emph{linear loss},
$\ell(h, (x, y)) = -y h(x)$,
where $y \in \{-1, 1\}$.
Assume a continuous kernel $k : \cX \times \cX \to \R$ with associated RKHS $\cF$
and canonical feature map $\varphi : \cX \to \cF$ with $k(x, x') = \inner{\varphi(x)}{\varphi(x')}_\cF$.
\begin{qlist}
\item \pts{10} Find the regularized loss minimizer
\begin{equation} \label{eq:min-lin}
\hat h_\lambda = \argmin_{h \in \cF} L_S(h) + \tfrac12 \lambda \norm{h}_{\cF}^2
\tag{RLM}
,\end{equation}
for a training sample $S = ( (x_1, y_1), \dots, (x_n, y_n) )$ and $\lambda > 0$.
\begin{answer}\TODO\end{answer}
\item \pts{5} Show that $L_S(\hat h_\lambda) = - \frac{1}{\lambda} \norm*{ \frac1n \sum_{i : y_i = 1} \varphi(x_i) - \frac1n \sum_{i : y_i = -1} \varphi(x_i) }_{\cF}^2$.
\begin{answer}\TODO\end{answer}
\item \pts{10} Find a (data-dependent) value of $\lambda$, call it $\hat\lambda$, such that $\norm{\hat h_{\hat\lambda}}_{\cF} = 1$,
and simplify the expression for $L_S(\hat h_{\hat\lambda})$.
\begin{answer}\TODO\end{answer}
\item \pts{5} Argue that $\hat h_{\hat\lambda}$ is a solution to
\begin{equation}
\min_{h \in \cF : \norm h_\cF \le 1} L_S(h)
\tag{ERM}
\label{eq:norm-1-erm}
.\end{equation}
Further argue that solving \eqref{eq:norm-1-erm} is equivalent to solving
\begin{equation}
\max_{h \in \cF : \norm h_\cF \le 1} \sum_{i : y_i = 1} h(x_i) - \sum_{i: y_i = -1} h(x_i)
\tag{MAX}
\label{eq:max-discrep}
,\end{equation}
i.e.\ finding a function high on the positively-labeled points and low on the negatively-labeled ones.
\begin{answer}\TODO\end{answer}
\end{qlist}
Let $\cP$ and $\cQ$ be probability distributions.
A distribution-level version of \eqref{eq:max-discrep} is known as the \emph{maximum mean discrepancy},
\[
\MMD(\cP, \cQ)
= \sup_{f \in \cF : \norm{f}_\cF \le 1} \E_{X \sim \cP} f(X) - \E_{Y \sim \cQ} f(Y)
.\]
Let $\varphi : \cX \to \cF$ be the canonical feature map
$k(x, x') = \inner{\varphi(x)}{\varphi(x')}_\cF$,
and assume for simplicity that $\sup_{x \in \cX} \norm{\varphi(x)}_\cF \le \kappa < \infty$.
Define the \emph{kernel mean embedding} of a distribution $\cP$ as
$\mu_\cP = \E_{X \sim \cP} \varphi(X)$;
for bounded kernels, this is guaranteed to exist.\footnote{As long as $\cP$ is a Borel measure, which is the kind of very mild assumption we don't worry about in this class.}
Moreover, you can move the expectation inside or outside of inner products:
for any $f \in \cF$,
\[
\inner{\mu_\cP}{f}_\cF
= \inner*{\E_{X \sim \cP} \varphi(X)}{f}_\cF
= \E_{X \sim \cP} \inner{\varphi(X)}{f}_\cF
= \E_{X \sim \cP} f(X)
.\]
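\meta{A quick numerical illustration of the mean embedding, using an explicit two-dimensional feature map $\varphi(x) = (x, x^2)$ as a stand-in for a general $\varphi$, with the plug-in estimate $\norm{\hat\mu_\cP - \hat\mu_\cQ}$:}

```python
import numpy as np

# With an explicit finite-dimensional feature map phi, the empirical mean
# embedding is just the average feature vector, and the plug-in estimate
# of the embedding distance is the norm of the difference of the averages.
def phi(x):
    return np.stack([x, x ** 2], axis=-1)  # phi(x) = (x, x^2)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=5000)  # samples from P = N(0, 1)
Y = rng.normal(0.5, 1.0, size=5000)  # samples from Q = N(0.5, 1)
mu_P = phi(X).mean(axis=0)           # hat-mu_P = (1/n) sum_i phi(x_i)
mu_Q = phi(Y).mean(axis=0)
mmd_hat = np.linalg.norm(mu_P - mu_Q)
print(mmd_hat)  # close to ||(0, 1) - (0.5, 1.25)|| = sqrt(0.3125)
```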
\begin{qlist}[resume]
\item \pts{10}
Prove that
\[
\MMD(\cP, \cQ) = \norm{\mu_\cP - \mu_\cQ}_\cF
\]
and
\[
\MMD^2(\cP, \cQ) = \E_{\substack{X, X' \sim \cP\\Y, Y' \sim \cQ}}\Big[ k(X, X') - 2 k(X, Y) + k(Y, Y') \Big]
.\]
\begin{answer}\TODO\end{answer}
\item \pts{2 challenge}
\textcolor{black}{Let $\cX$ be a compact metric space.}
Prove that if $k$ is universal,
then $\MMD(\cP, \cQ) = 0$ implies $\cP = \cQ$.
\hint{You can use the following helpful result, where $C(\cX)$ is as usual the space of all bounded continuous functions $\cX \to \R$.
\begin{lemma}
Two Borel probability measures $\cP$ and $\cQ$ on a metric space $\cX$
are equal if and only if
for all $f \in C(\cX)$, $\E_{X \sim \cP} f(X) = \E_{Y \sim \cQ} f(Y)$.
\end{lemma}
}
\begin{answer}\TODO\end{answer}
\item \pts{3 challenge}
Prove that $k(x, y) = \norm{x} + \norm{y} - \norm{x - y}$,
where $\norm\cdot$ is the norm of any Hilbert space,
is a valid kernel.
Further show that the MMD with this kernel is exactly the
\emph{energy distance}, whose square is
\[
\rho(\cP, \cQ)^2
= 2 \E_{X \sim \cP, Y \sim \cQ} \norm{X - Y} - \E_{X, X' \sim \cP} \norm{X - X'} - \E_{Y, Y' \sim \cQ} \norm{Y - Y'}
.\]
\hint{You can use without proof that for all $n \ge 1$, for all $x_1, \dots, x_n$ and $c_1, \dots, c_n$ such that $\sum_{i=1}^n c_i = 0$, it holds that \[ \sum_{i=1}^n \sum_{j=1}^n c_i \norm{x_i - x_j} c_j \le 0 .\] You'll need to fiddle a bit from this inequality to get the desired result: how to get the $\norm x$ in $k$?}
\begin{answer}\TODO\end{answer}
\end{qlist}
\meta{We won't prove this, but it turns out that the energy distance is positive for any $\cP \ne \cQ$, even though this $k$ actually \emph{isn't} universal.}
\clearpage
\section{Lasso and stability \pts{5 challenge}}
The Lasso algorithm uses
linear predictors $h_w(x) = w \cdot x$,
the square loss $\ell(h, (x, y)) = (h(x) - y)^2$,
and a $\norm w_1 = \sum_{j=1}^d \abs{w_j}$ regularizer:
\[
A_\lambda(S) \in \argmin_{w \in \R^d} L_S(h_w) + \lambda \norm w_1
.\]
(If there are multiple minimizers, let's have $A_\lambda$ return one uniformly at random from the set of possible minimizers.)
The Lasso algorithm is nice because it often returns sparse solutions,
i.e.\ $w$ with many $w_j = 0$.
Let's use $\cZ = \cX \times \cY = \{ x \in \R^d : \norm x \le C \} \times [-M, M]$ for simplicity.
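\meta{To see the sparsity claim numerically, here's a bare-bones coordinate-descent Lasso sketch (it uses the standard $\frac{1}{2n}\norm{y - Xw}^2 + \lambda \norm w_1$ scaling; the constant in front of the data term doesn't affect the point being made):}

```python
import numpy as np

# Coordinate descent for (1/(2n))||y - Xw||^2 + lam * ||w||_1.
def soft(z, t):
    """Soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, iters=200):
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(iters):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]           # residual excluding coord j
            w[j] = soft(X[:, j] @ r / n, lam) / col_sq[j]
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0]                  # only feature 0 actually matters
w = lasso_cd(X, y, lam=0.1)
print(np.sum(np.abs(w) > 1e-8))    # typically just a few nonzero coordinates
```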
\begin{qlist}
\item \pts{5} Show that the Lasso algorithm is not uniformly stable.
\textcolor{black}{That is, there is no $\beta(m)$ satisfying Definition 3 of \lec{13-stability}{the stability notes} such that $\beta(m) \to 0$ as $m \to \infty$.}
\hint{There's a reason I mentioned multiple minimizers above.}
\begin{answer}\TODO\end{answer}
\end{qlist}
\meta{%
I \emph{think} the Lasso algorithm for any $\lambda > 0$ \emph{is} actually on-average-replace-one stable under these assumptions on $\cZ$,
because any algorithm that on-average learns $\dist$ is on-average-replace-one-stable.
We can show this under these assumptions for the Lagrange dual problem to the Lasso,
ERM with $\cH = \{ h_w : \norm w_1 \le B \}$,
with Rademacher bounds (depending on $B$, $C$, and $M$).
But the relationship of $B$ to $\lambda$ is complicated,
and I don't even know how to get a worst-case upper bound on it,
though something might be possible.\footnote{
In fact, I'm not 100\% sure that Lasso even \emph{does} learn without any distributional assumptions.
For typical analyses with some distributional assumptions,
see Chapter 8 of the Bach book;
e.g.\ Exercise 8.5 is pretty close.
}
}
\end{document}