\documentclass{article}

\usepackage{fullpage}
\usepackage{color}
\usepackage{amsmath}
\usepackage{url}
\usepackage{verbatim}
\usepackage{graphicx}
\usepackage{parskip}
\usepackage{amssymb}
\usepackage{nicefrac}
\usepackage{listings} % For displaying code
\usepackage{algorithm2e} % pseudo-code

% Answers
\def\ans#1{\par\gre{Answer: #1}}
%\def\ans#1{} % Comment this line to produce document with answers

% Colors
\definecolor{blu}{rgb}{0,0,1}
\def\blu#1{{\color{blu}#1}}
\definecolor{gre}{rgb}{0,.5,0}
\def\gre#1{{\color{gre}#1}}
\definecolor{red}{rgb}{1,0,0}
\def\red#1{{\color{red}#1}}
\def\norm#1{\|#1\|}

% Math
\def\R{\mathbb{R}}
\def\E{\mathbb{E}}
\def\argmax{\mathop{\rm arg\,max}}
\newcommand{\argmin}[1]{\mathop{\hbox{argmin}}_{#1}}
\newcommand{\mat}[1]{\begin{bmatrix}#1\end{bmatrix}}
\newcommand{\alignStar}[1]{\begin{align*}#1\end{align*}}
\def\half{\frac 1 2}
\def\cond{\; | \;}

% LaTeX
\newcommand{\fig}[2]{\includegraphics[width=#1\textwidth]{a3f/#2}}
\newcommand{\centerfig}[2]{\begin{center}\includegraphics[width=#1\textwidth]{a3f/#2}\end{center}}
\def\items#1{\begin{itemize}#1\end{itemize}}
\def\enum#1{\begin{enumerate}#1\end{enumerate}}


\begin{document}

\title{CPSC 540 Assignment 3 (due Friday March 12 at midnight)}
\author{}
\date{}
\maketitle
\vspace{-4em}


\blu{\enum{
\item Name(s):
\item Student ID(s):
}}


\section{Markov Chains}

\subsection{Inference with Discrete States}

The function \emph{example\_markovChain.jl} loads the initial state probabilities and transition probabilities for a Markov chain model,
\[
p(x_1,x_2,\dots,x_d) = p(x_1)\prod_{j=2}^{d}{p(x_j\cond x_{j-1})},
\]
corresponding to the ``grad student Markov chain'' from class.

\enum{
\item Write a function, \emph{sampleAncestral}, that uses ancestral sampling to sample a sequence $x$ from this Markov chain of length $d$. \blu{Hand in this code and report the univariate marginal probabilities for time 50 using a Monte Carlo estimate based on 10000 samples.}\\
Hint: you can use \emph{sampleDiscrete} in \emph{misc.jl} to sample from a discrete probability mass function using the inverse transform method.
\item Write a function, \emph{marginalCK}, that uses the CK equations to compute the exact univariate marginals up to a given time $d$. \blu{Hand in this code, report all exact univariate marginals at time 50, and report how this differs from the marginals in the previous question.}
\item What is the state $c$ with highest marginal probability, $p(x_j = c)$, for each time $j$?
\item Write a function, \emph{viterbiDecode}, that uses the Viterbi decoding algorithm for Markov chains to find the optimal decoding up to a time $d$. \blu{Hand in this code and report the optimal decoding of the Markov chain up to time 50 and up to 100.}
\pagebreak
\item \blu{Report all the univariate conditional probabilities at time 50 if the student starts in grad school, $p(x_{50} = c \cond  x_1 = 3)$ for all $c$}. Hint: you should be able to do this by changing the input to the CK equations.
\item \blu{Report for all $c$ the univariate conditional probabilities $p(x_5 = c \cond  x_{10} = 6)$ (``where  you were likely to be 5 years after graduation if you ended up in academia after 10 years'') obtained using a Monte Carlo estimate based on 10000 samples and rejection sampling. Also report the number of samples accepted among the 10000 samples.}
\item \blu{Give code implementing a dynamic programming approach to exactly compute $p(x_5 = c \cond  x_{10} = 6)$, and report the exact values for all $c$.}
\item Why is $p(x_j = 7 \cond x_{10} = 6)$ equal to zero for all $j$ less than $10$?
}
Hint: for some of the quesitons you may find it helpful to use a $k$ by $d$ matrix $M$ to represent a dynamic programming table

\pagebreak

\subsection{Inference with Gaussian States}

Consider a continuous-state Markov chain where the initial distribution is given by
\[
x_0 \sim \mathcal{N}(m_0, v_0^2),
\]
and the transition distributions for $j > 1$ are given by
\[
x_j \cond x_{j-1} \sim \mathcal{N}(w_jx_{j-1} + m_j, v_j^2).
\]
This model could be used to model an object moving through $\R$.\footnote{In practical applications like object tracking, we typically have that the states $x_j$ are 2- or 3-dimensional if we are modeling an object moving through space, or even higher-dimensional if we are modeling things like stock prices.} Because of the Gaussian assumptions, this defines a joint Gaussian distribution over the variables while the marginal distributions are also Gaussian. \blu{For a generic $j > 1$, derive the form of the marginal distribution of $x_j$, expressing the marginal parameters $\mu_j$ and $\sigma_j$ recursively in terms of the parameters $\mu_{j-1}$ and $\sigma_{j-1}$ of the previous marginal, $p(x_{j-1}) \sim \mathcal{N}(\mu_{j-1}, \sigma_{j-1}^2)$.}

Hint: You can use Theorem 4.4.1 of Murphy's book.
\pagebreak

\subsection{Learning with Discrete States}

If you run \emph{example\_rain} it will: load the Vancouver rain data set, split it into a training and validation set, fit an indepdendent (and homogeneous) Bernoulli model to the training set, and then compute the negative log-likeilhood (NLL) of this model on the validation set (a lower validation NLL means a better fit). As discussed in class, we expect that a Markov chain could be a better model of this dataset.

\enum{
\item \blu{Give code for finding the MLE for the initial probabilities and transition probabilities in a homogeneous Markov chain, and report the MLE values for the training set.}
\item \blu{Report the NLL of the Markov chain model on the validation set.}
}
\pagebreak

\section{Directed Acyclic Graphical Models}


\subsection{D-Separation}

Consider a  directed acyclic graphical (DAG) model with the following graph structure:
%\centerfig{.4}{DAG}

Assuming that the conditional independence properties are faithful to the graph, using d-separation \blu{briefly explain why the following are true or false:}
\enum{
\item $H \perp I$.
\item $H \perp I \cond A$.
\item $H \perp I \cond J$.
\item $H \perp I \cond A, J$.
\item $C \perp I$.
\item $C \perp I \cond J$.
\item $C \perp I \cond H$.
\item $C \perp I \cond A, H$.
\item $C \perp I \cond E, H, J$.
}

\pagebreak

\subsection{Exact Inference}

Consider a  directed acyclic graphical (DAG) model with the following graph structure:
%\centerfig{.4}{DAG2}
Assume that all variables are binary and that we use the following parameterization of the network:
\begin{align*}
& p(A = 1) = 0.7\\
& p(B = 1 \cond A = 0) = 0.8\\
& p(B = 1 \cond A = 1) = 1.0\\
& p(C = 1) = 0.8\\
& p(D = 1 \cond B = 0) = 0.8\\
& p(D = 1 \cond B = 1) = 0.6\\
& p(E = 1 \cond B = 0, C = 0) = 0.3\\
& p(E = 1 \cond B = 0, C = 1) = 0.7\\
& p(E = 1 \cond B = 1, C = 0) = 0.4\\
& p(E = 1 \cond B = 1, C = 1) = 0.5\\
& p(F = 1 \cond A = 0) = 0.5\\
& p(F = 1 \cond A = 1) = 0.9\\
& p(G = 1 \cond E = 0, F = 0) = 0.5\\
& p(G = 1 \cond E = 0, F = 1) = 0\\
& p(G = 1 \cond E = 1, F = 0) = 0.1\\
& p(G = 1 \cond E = 1, F = 1) = 0.1
\end{align*}


\blu{Compute the following quantities}:
\enum{
\item $p(A = 0)$.
\item $p(B = 1 \cond A = 0)$.
\item $p(B = 1)$.
\item $p(D = 1)$.
\item $p(B = 1 \cond D = 1)$.
\item $p(B = 1 \cond C = 1)$.
\item $p(B = 1 \cond A = 0, C = 1, F = 1)$.
}
Hints: some of the above quantities can be read from the table, some require using that probabilities sum to 1, some require the marginalization rule, some require Bayes rule, some require using [conditional] independence, and some will be simplified using calculations from previous sub-questions.
\pagebreak


\subsection{Learning in DAGs}

The file \emph{onTime.jld} contains a matrix $X$ containing samples from the following DAG model:\footnote{from here: \url{https://www.uib.no/en/rg/ml/119695/bayesian-networks}}
%\centerfig{.9}{onTime}
The first column of $X$ is the ``alarm'' variable, the second is ``bus late'', the third is ``over-slept'', and the last is ``on time''.
\enum{
\item Assuming the DAG structure above, \blu{give code for computing the MLE of the parameters in the model, and report the MLE values up to 2 decimal places}.
\item \blu{Give code for generating samples using the MLE parameters}.
\item The combination $(0,1,0,1)$ does not occur in the training data, so if we fit the MLE general discrete distribution to this data we would get that it has a probability of 0. This is in contrast to the true model above, where we have $p(0,1,0,1) = 0.0004$. \blu{Does your model give a better or worse estimate of the true proabibility of this event than the general discrete distribution? Why do you think it gives a better/worse estimate?}
}
\pagebreak

\section{Undirected Graphical Models}

Cancelled.

\pagebreak

\section{Very-Short Answer Questions}


Give a short and concise 1-sentence answer to the below questions.
\enum{
\item What is the difference between computing marginals and computing the stationary distribution of a Markov chain.
\item What is the inverse transform method used for?
\item Describe how we could use ancestral sampling to sample from the joint density over $x$ and $y$ defined by a Gaussian discriminant analysis model.
\item Suppose you had a black box that could generate IID samples from a distribution. Describe how you could use a Monte Carlo method to approximate $p(x \leq c)$ for this distribution.
\item What is the cost of generating a sample from a Markov chain of length $d$ with $k$ possible states for each time? What is the cost of decoding?
\pagebreak
\item What is the difference between inference and decoding in Markov chains?
\item Suppose we are using a hidden Markov model to track the location of a submarine using sonar measurements. What would $x_j$ and $z_j$ represent in this example?
\item What is ``explaining away''?
\item If two variables are not d-separated, are they necessarily depenent? If two variables are d-separated, are they necessarily independent?
\item What is an advantage and a disavantage of using logistic regression to parameterize the CPDs in DAGs compared to a tabular representation?
\pagebreak
\item When decoding a DAG, why does the order that we compute the messages matter?
\item What is the relationship between multivariate Gaussians and UGMs?
\item What is the relevance of the Markov blanket in ICM?
\item Why might we do ``thinning'' of the samples when we use Gibbs sampling?
}

\pagebreak

\section*{CPSC 540 Relevant Papers for Project}

\blu{This section is only to be done by students enrolled in CPSC 540}. For students in CPSC 440, the description for your project will come with Assignment 4.

\subsection*{Finding Relevant Papers}

To help you make progress on your project, for this part you should \blu{hand in a list of 10 academic papers} related to your current project topic. Finding related work is often one of the first steps towards getting a new project started, and it gives you an idea of what has (and has not) been explored. Some strategies for finding related papers are:
\enum{
\item Use Google: try the keywords you think are most relevant. Asking people in your lab (or related labs) for references is also often a good starting point.
\item Once you have found a few related papers, read their introduction section to find references that these papers think are worth mentioning.
\item Once you have found a few related papers, use Google Scholar to look through the list of references that are \emph{citing} these papers (particularly for recent ones). You may have to do some sifting if there are a lot of citations. Reasonable criteria to sift through large reference lists include looking for the ones with the most citations or focusing on the more recent ones (then returning to Step 2 to find the more-relevant older references).
}
For this question you only need to provide a list, but in the final assignment you will have to do a survey of at least 10 papers. So it's worth trying to identify papers that are both relevant and important at this point.
For some types of projects it will be easier to find papers than others. If you are having trouble, post on Piazza.

Although the papers do not need to all be machine learning papers, the course project does need to be related to machine learning in some way, so at least a subset of the papers should be machine learning papers. Here is a rough guide to some of the most reputable places to where you see machine learning works published:
\items{
\item The International Conference on Machine Learning (ICML) and the conference on Advances in Neural Information Processing (NeurIPS) are the top places to publish machine learning work. The Journal of Machine Learning Research (JMLR) is the top journal, although in this field conference publications are usually viewed as more prestigious.
\item Other good venues include AISTATS (emphasis on statistics), UAI (emphasis on graphical models), COLT (emphasis on theory), ICLR (emphasis on deep learning), ECML-PKDD (European version of ICML), CVPR and ICCV/ECCV (emphasis on computer vision), ACL and EMNLP (emphasis on language), KDD (emphasis on data mining),  AAAI/IJCAI (emphasis on AI more broadly), JRSSB and Annals of Stats (emphasis on statistics more broadly), and Science and Nature (emphasis on science more broadly).
}

\subsection*{Paper Review}

Among your list of 10 papers, choose one paper and \blu{write a short review of this paper}. It makes sense to choose a paper that is closely-related to your project or to choose one of the most important-looking papers. The review should have two parts:
\enum{
\item A short summary of the contributions of the paper. Say what problem the paper is addressing, why this is an important problems, what is proposed, and how it is being evaluated.
\item A list of strengths and weaknesses of the paper, and whether the claims are appropriately evaluated. For ideas of what issues to discuss, see the JMLR guidelines for reviewers:\\
\url{http://www.jmlr.org/reviewer-guide.html}
}
Note that you should include a non-empty list of strengths \emph{and} weaknesses. 
Many students when doing their first reviews focus either purely on strengths or purely on weaknesses. It's important to recognize that all papers have weaknesses or limitations (even ones written by famous people or that are published in impressive places or that proved to be historically important) and all papers have strengths or at least a motivation (the authors must have thought it was worth writing for some reason).


\end{document}