\documentclass{article}
\usepackage{amsmath,amssymb,amsthm}
\usepackage{dsfont}
\usepackage{thmtools}
\usepackage{etoolbox}
\usepackage{enumitem}
\usepackage{fullpage}
\usepackage{parskip}
\usepackage{xcolor}
\usepackage{verbatim}
\usepackage[colorlinks,linkcolor=red!80!black]{hyperref}
\usepackage[noabbrev,capitalize]{cleveref}
% https://tex.stackexchange.com/a/42728/9019
\newcommand{\tagthis}[1]{\stepcounter{equation}\tag{\theequation}\label{#1}}
\definecolor{question}{rgb}{0,0,1}
\newlist{qlist}{enumerate}{1}
\setlist[qlist,1]{leftmargin=*, label=\textbf{(\alph*)}, ref={(\alph*)}}
\crefname{qlisti}{part}{parts}
\AtBeginEnvironment{qlist}{\color{question}}
\newcommand{\ask}[1]{\textcolor{question}{#1}}
\newenvironment{asking}{\begingroup\color{question}}{\endgroup}
\crefname{section}{Question}{Questions}
\definecolor{answer}{rgb}{0,.5,0}
\newcommand{\ans}[1]{\par\textcolor{answer}{Answer: #1}}
\newenvironment{answer}{\par\begingroup\color{answer}Answer: }{\endgroup}
\definecolor{points}{rgb}{0.6,0.3,0}
\newcommand{\pts}[1]{\textcolor{points}{[#1~points]}}
\newcommand{\hint}[1]{\textcolor{black!60!white}{\emph{Hint: #1}}}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator{\bigO}{\mathcal{O}}
\newcommand{\Lip}{\mathrm{Lip}}
\DeclareMathOperator*{\E}{\mathbb{E}}
\newcommand{\dist}{\mathcal{D}}
\newcommand{\xdist}{\mathcal{D}_x}
\DeclareMathOperator{\sign}{sign}
\DeclareMathOperator{\spn}{span}
\DeclareMathOperator{\tr}{tr}
\newcommand{\tp}{^\mathsf{T}}
\newcommand{\op}{\mathit{op}}
\newcommand{\cH}{\mathcal{H}}
\newcommand{\cX}{\mathcal{X}}
\newcommand{\cY}{\mathcal{Y}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\ud}{\mathrm{d}}
\newcommand{\ERM}{\mathrm{ERM}}
\newcommand{\SRM}{\mathrm{SRM}}
\newcommand{\RLM}{\mathrm{RLM}}
\DeclareMathOperator{\VCdim}{VCdim}
\DeclareMathOperator{\emprad}{\hat{\mathfrak{R}}_S}
\DeclareMathOperator{\rad}{\mathfrak{R}_n}
\newcommand{\bsigma}{\boldsymbol{\sigma}}
\makeatletter
% \abs{} uses regular-sized bars, \abs*{} uses auto-sized ones
\newcommand{\abs}{\@ifstar{\@Abs}{\@abs}}
\newcommand{\@abs}[1]{\lvert #1 \rvert}
\newcommand{\@Abs}[1]{\left\lvert #1 \right\rvert}
\newcommand{\norm}{\@ifstar{\@Norm}{\@norm}}
\newcommand{\@norm}[1]{\lVert #1 \rVert}
\newcommand{\@Norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\ceil}{\@ifstar{\@Ceil}{\@ceil}}
\newcommand{\@ceil}[1]{\lceil #1 \rceil}
\newcommand{\@Ceil}[1]{\left\lceil #1 \right\rceil}
\newcommand{\floor}{\@ifstar{\@Floor}{\@floor}}
\newcommand{\@floor}[1]{\lfloor #1 \rfloor}
\newcommand{\@Floor}[1]{\left\lfloor #1 \right\rfloor}
\makeatother
\declaretheorem[name=Theorem,numberwithin=section]{thm}
\declaretheorem[name=Proposition,sibling=thm]{prop}
\declaretheorem[name=Lemma,sibling=thm]{lemma}
\begin{document}
\begin{center}
\Large
CPSC 532S: Assignment 3 --
due Friday, 25 Mar 2022, 11:59pm
\end{center}
Default late policy: -5 points on the assignment if you're 0-24 hours late, -10 if you're 24-48 hours late, not accepted after that.
If you have something significant going on / are sick / whatever,
write to me, and I can be more flexible there.
This assignment is split into \textcolor{red}{four} questions (the parts with big section headers; most have sub-parts).
You can solve each question in groups of up to three.
Groups don't need to be consistent between problems;
you can do Q1 and Q2 alone, Q3 with Alice, and Q4 with Bob and Carlos if you want.
Please \textbf{do not} just split the questions up and do the parts separately.
\textbf{\color{red} If your name is on a solution, you are pledging that you contributed significantly to the solution and understand it fully.}
There is a separate Gradescope assignment for each problem;
use the Gradescope groups feature to submit once and associate with each of you,
but also put all of your names on the first page as a backup.
Prepare your answers to these questions \textbf{using \LaTeX}.
Hopefully you're reasonably familiar with it,
but if not, try using Overleaf and looking around for tutorials online.
(Note that free Overleaf accounts can only share with one ``named collaborator,''
but you can collaborate with more people by sending them an edit link.
Make sure you only share the parts of the homework you're handing in together!)
Feel free to ask questions if you get stuck on things on Piazza
(but remove any details about the actual answers\dots
feel free to make a private post if that's tough).
If you look stuff up anywhere other than in one of the two course textbooks,
please \textbf{cite your sources}:
just say in the answer to that question where you looked.
(A link is fine, no need for a formal citation.)
Please do not look at solution manuals or so on.
If you accidentally come across a solution while looking for something related,
still write the argument up in your own words, link to wherever you found it, and be clear about what happened.
If you like, the \texttt{.tex} source for this file is available on the course website,
and you can put your answers in
\verb|\begin{answ|\verb|er} My answer here... \end{answer}|
% that weird formatting so my answer-stripping script doesn't catch it :p
environments to make them stand out if so;
feel free to delete whatever boilerplate you want (or not, I'm not printing them out).
Or answer in a fresh document;
just make it clear which question you're answering where.
If you're using a consistent group and want to write your answers in one document,
you could split the PDF with e.g.\ \texttt{qpdf a3.pdf --pages . 2-3 -- q1.pdf}
or through the GUI of a PDF viewer.
Or you can upload the full file four times and just make sure you assign pages appropriately.
Submit your answers as a PDF on Gradescope:
\href{https://piazza.com/class/ky0u4qahwdn2xj?cid=17}{instructions on Piazza}.
You'll be prompted to mark where each sub-part is in your PDF;
make sure you mark all relevant pages for each part.
(This saves me a surprising amount of time in grading.)
If something goes wrong, you can also email your assignment to me directly (\texttt{dsuth@cs.ubc.ca}).
\clearpage
\section{ERM vs SRM vs RLM \pts{25}}
Let's say we're given a training set
$S = ((x_1, y_1), \dots, (x_n, y_n)) \sim \dist^n$,
with $x_i \in \R^d$ and \textcolor{red}{$y_i \in [-1, 1]$}.
We're going to try to learn a homogeneous linear function $h$ from
$\cH = \{ x \mapsto w\tp x : w \in \R^d \}$,
\textcolor{red}{using the loss function $\ell^\text{abs}(h, (x, y)) = \abs{h(x) - y}$.}
Let's use the notation $w_h$ to refer to the $w$ inside $h$, i.e.\ $h(x) = w_h\tp x$.
You can assume $n > d \ge 3$,
and that $\Pr(\norm x \le R) = 1$ for some finite $R$.
\begin{qlist}
\item \label{q:erm-gen} \pts{5}
Suppose we just run empirical risk minimization,
$h^\ERM_S = \argmin_{h \in \cH} L_S^\text{abs}(h)$.
What can we say about the generalization gap $L_\dist^\text{abs}(h^\ERM_S) - L_S^\text{abs}(h^\ERM_S)$ using a type of bound covered in class?
\end{qlist}
Instead, let's break $\cH$ into parts:
$\cH_1 = \{ x \mapsto w\tp x : \norm w \le a \}$,
$\cH_2 = \{ x \mapsto w\tp x : \norm w \le 2a \}$,
etc,
for a constant $a > 0$ to be determined later.
Now, run SRM, using ``weight'' $6 / (\pi^2 k^2)$ for $\cH_k$ as in class.
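(As a quick sanity check that these are legitimate SRM weights: by the Basel identity $\sum_{k=1}^\infty k^{-2} = \pi^2/6$,
\[
\sum_{k=1}^{\infty} \frac{6}{\pi^2 k^2}
= \frac{6}{\pi^2} \sum_{k=1}^{\infty} \frac{1}{k^2}
= \frac{6}{\pi^2} \cdot \frac{\pi^2}{6}
= 1
,\]
so the weights over the classes $\cH_1, \cH_2, \dots$ sum to one, as the SRM setup from class requires.)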
\begin{qlist}[resume]
\item \label{q:srm-setup} \pts{5}
Write the solution $h^\SRM_S$ in the form $\argmin_h f(L_S^\text{abs}(h), \norm{w_h}, a, n, R, \delta)$, for some function $f$, where $w_h$ is the vector $w$ ``inside'' $h$.
\textcolor{black}{Don't include terms explicitly depending on $\rad(\cH_k)$ or similar; it should be an expression that would be straightforward to implement in code. You can use the ceil function $\ceil* \cdot$ (rounding up) or floor $\floor* \cdot$ (rounding down), if you'd like.}
\hint{When we discussed SRM before, we focused on 0-1 loss. But there's nothing actually in the argument that requires it, as long as we have a uniform convergence bound on $\cH_k$. You may want to just re-derive it from the framework of minimizing an upper bound on $L_\dist(h)$.}
\item \label{q:srm-gen} \pts{5}
What can we say about the generalization gap $L_\dist^\text{abs}(h^\SRM_S) - L_S^\text{abs}(h^\SRM_S)$ using a type of bound covered in class?
\textcolor{black}{Again, this should be a closed-form expression in terms of $\norm{w_h}$, $a$, $R$, $n$, and $\delta$.}
\end{qlist}
Now, the last of our candidate algorithms is regularized loss minimization:
\[
h^\RLM_S = \argmin_{h \in \cH} L_S^\text{abs}(h) + \lambda \norm{w_h}^2
.\]
\begin{qlist}[resume]
\item \label{q:rlm-gen} \pts{5}
What can we say about the generalization gap $L_\dist^\text{abs}(h^\RLM_S) - L_S^\text{abs}(h^\RLM_S)$ using a type of bound covered in class?
\item \label{q:srm-vs-rlm} \pts{5}
Can we pick $\lambda$ and $a$ (maybe depending on $n$) so that RLM looks roughly like SRM?
\textcolor{black}{(It's okay to be a little approximate here, e.g.\ ignoring logarithmic terms; we're just talking about motivations.)}
\end{qlist}
\clearpage
\section{Logistic regression \pts{25}}
Let $\cX = \{ x \in \R^d : \norm x \le R \}$,
and $\cH = \{ w \in \R^d : \norm w \le B \}$.
We'll learn a linear predictor based on logistic loss,
\[
\ell(w, (x, y))
= \log(1 + \exp(- y w\tp x))
.\]
Let $S$ be a sample $((x_1, y_1), \dots, (x_n, y_n)) \in \left(\cX \times \{-1, 1\}\right)^n$,
and let $X \in \R^{n \times d}$, $y \in \R^n$ stack up the features and labels accordingly.
\begin{qlist}
\item \pts{9} Show that $L_S(w)$ is a convex function of $w$.
\item \pts{8} Show $L_S(w)$ is $\rho$-Lipschitz, and give a (reasonably tight) upper bound on $\rho$.
\item \pts{8} Show $L_S(w)$ is $\beta$-smooth, and give a (reasonably tight) upper bound on $\beta$.
\end{qlist}
As $\norm w$ is bounded, this means that logistic regression is both Convex-Lipschitz-Bounded and Convex-Smooth-Bounded.
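Entirely optional, but if you'd like a numerical sanity check on your answers here, the sketch below (in NumPy; the function names and the random-pair estimator are just illustrative, not anything from the course) computes $L_S(w)$ stably and estimates a crude lower bound on its Lipschitz constant from secant slopes -- any $\rho$ you prove should come out at least this large.

```python
import numpy as np

def logistic_empirical_risk(w, X, y):
    """L_S(w) = (1/n) sum_i log(1 + exp(-y_i w^T x_i)).

    np.logaddexp(0, t) computes log(1 + e^t) without overflow.
    """
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))

def empirical_lipschitz_estimate(X, y, B, trials=1000, seed=0):
    """Crude lower bound on the Lipschitz constant of L_S over {||w|| <= B}:
    the largest secant slope |L_S(w) - L_S(w')| / ||w - w'|| seen over
    random pairs of points on the sphere of radius B."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    best = 0.0
    for _ in range(trials):
        w, v = rng.normal(size=(2, d))
        w *= B / max(np.linalg.norm(w), 1e-12)  # scale onto the radius-B sphere
        v *= B / max(np.linalg.norm(v), 1e-12)
        gap = abs(logistic_empirical_risk(w, X, y)
                  - logistic_empirical_risk(v, X, y))
        best = max(best, gap / max(np.linalg.norm(w - v), 1e-12))
    return best
```

Since every secant slope of a $\rho$-Lipschitz function is at most $\rho$, a proved bound that falls below this estimate has a bug somewhere.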
\clearpage
\section{From Bounded Expected Risk to Agnostic PAC Learning \pts{25}} \label{q:pac}
Our stability and SGD analyses mostly bounded only the expected risk;
we'll now show that such a bound implies agnostic PAC learning.
Let $A$ be a \textcolor{red}{proper} learning algorithm (\textcolor{red}{one returning hypotheses in $\cH$}) that guarantees: if $n \ge n_\cH(\varepsilon)$,
then for every distribution $\dist$, it holds that
\[
\E_{S \sim \dist^n} L_{\dist}(A(S)) \le \inf_{h \in \cH} L_\dist(h) + \varepsilon
.\]
\textcolor{red}{You can assume that $L_\dist(h) = \E_{z \sim \dist} \ell(h, z)$ for a loss $\ell(h, z)$ bounded in $[0, 1]$.}
\begin{qlist}
\item \label{q:exp-loss} \pts{10}
Show that for every $\delta \in (0, 1)$, if $n \ge n_\cH(\varepsilon \delta)$, then with probability of at least $1 - \delta$ it holds that $L_\dist(A(S)) \le \inf_{h \in \cH} L_\dist(h) + \varepsilon$.
\hint{Observe that the random variable $L_\dist(A(S)) - \inf_{h \in \cH} L_\dist(h)$ is nonnegative, and rely on Markov's inequality.}
\item \label{q:exp-pac} \pts{15} For every $\delta \in (0, 1)$, let
\[
\color{red}
n_\cH(\varepsilon, \delta)
= n_\cH(\tfrac \varepsilon 4) \ceil*{\log_2 \tfrac2\delta}
+ 8 \ceil*{\frac{\log \frac4\delta + \log \ceil{\log_2 \tfrac2\delta}}{\varepsilon^2}}
.\]
Suggest a procedure that agnostic PAC learns the problem with sample complexity of $n_\cH(\varepsilon, \delta)$, assuming that the loss function is bounded by 1.
\hint{Let $k = \ceil{\log_2(\tfrac{\color{red} 2}\delta)}$. Divide the data into $k+1$ chunks, where each of the first $k$ chunks has at least $n_\cH(\varepsilon / {\color{red} 4})$ examples. Run $A$ on each of the first $k$ chunks. Using \cref{q:exp-loss}, argue that the probability that every one of these chunks gives $L_\dist(A(S)) > \inf_{h \in \cH} L_\dist(h) + \varepsilon {\color{red} {} / 2}$ is at most $2^{-k} \le \delta/2$. Finally, use the last chunk as a validation set.}
\end{qlist}
\clearpage
\section{Perceptrons \pts{25}}
Let $S = ((x_1, y_1), \dots, (x_n, y_n)) \in \left( \R^d \times \{ \pm 1 \} \right)^n$.
Assume that a
\[
w^* \in \argmin_{w \in \R^d : \; \forall i \in [n], \; y_i w\tp x_i \ge 1} \norm{w}
\]
exists.
Let $R = \max_i \norm{x_i}$,
and let \[
f(w) = \max_{i \in [n]} \left( 1 - y_i w\tp x_i \right)
.\]
\begin{qlist}
\item \pts{5} Show that $\min_{w : \norm w \le \norm{w^*}} f(w) = 0$, and that any $w$ for which $f(w) < 1$ achieves $L_S^\mathrm{0-1}(w) = 0$.
\item \pts{5} Show how to calculate a subgradient of $f$.
\hint{Recall that, at points for which $f$ is differentiable, the gradient is a valid subgradient. For points where it's not, think about the structure of $f$: try drawing a sketch with $d = 1$ and $n = 2$.}
\item \label{q:subgd} \pts{5} Describe subgradient descent on the function $f$ (initializing from $w = 0$). \textcolor{red}{UPDATE: no need to analyze the algorithm.}
\item \pts{5} Compare the resulting algorithm \textcolor{red}{(not its analysis anymore)} to the Batch Perceptron algorithm given in section 9.1.2 of SSBD. (We didn't discuss this in class, but just reading the algorithm block should be enough.)
\item \pts{5} Suppose that the training set $S$ is such that the training instances are linearly separable with a margin of $\gamma$. Refine the bound \textcolor{red}{of SSBD's Theorem 9.1} to include $\gamma$.
\end{qlist}
\end{document}