In [18]:

```
from matplotlib import pyplot as plt
import numpy as np
%matplotlib inline
```

From this point it's just algebra (with some completing the square) to manipulate this into the form: $$ \exp\left(-\frac{(\mu - a)^2}{2b^2}\right) $$ for some $a$ and $b$ (which you need to find!). That implies that $p(\mu | x,\alpha, \sigma^2, \gamma^2)\sim \mathcal{N}(a, b^2)$.
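If the completing-the-square step is rusty, the general identity (deliberately not the specific $a$ and $b$, which are left as the exercise) is: for an exponent that is quadratic in $\mu$, $$ -\frac{1}{2}\left(A\mu^2 - 2B\mu\right) = -\frac{A}{2}\left(\mu - \frac{B}{A}\right)^2 + \frac{B^2}{2A}, $$ so after collecting the $\mu^2$ and $\mu$ coefficients (the $A$ and $B$ here) from the likelihood and the prior, you can read off $a = B/A$ and $b^2 = 1/A$.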

We began by going through Section 2.1 of Mark's notes on EM for semi-supervised learning. There's a lot going on, so I'd encourage you to go through them again.

Those notes show that our E-step requires us to compute: $$ r_0^i = p(\tilde{y}^i = 0|\tilde{x}_i, \theta_t)\qquad\text{and}\qquad r_1^i = p(\tilde{y}^i = 1|\tilde{x}_i, \theta_t) $$

From question 2, we have that: $$ \begin{align} p(\tilde{y}^i = 0|\tilde{x}_i, \theta_t) &= p(x = \tilde{x}_i|y=0,\mu^0_t, \Sigma^0_t) \cdot p(y=0|\theta_t) \,/\, Z\\ &= \mathrm{MVN}(x=\tilde{x}_i|\mu^0_t, \Sigma^0_t) \cdot \theta^0_t \,/\, Z, \end{align} $$ where MVN is the pdf of the multivariate normal. So we can calculate $r_0^i$ and $r_1^i$ by just plugging our parameters at iteration $t$ into these formulae and then re-normalizing ($Z$ is the sum of the two unnormalized terms).
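As a concrete sketch of this E-step (in plain numpy; the names `mu0`, `Sigma0`, `theta0`, etc. are hypothetical parameter arrays at iteration $t$, not code from the assignment):

```python
import numpy as np

def mvn_pdf(X, mu, Sigma):
    """Multivariate normal density, evaluated row-wise on X (shape (n, d))."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) for each row
    expo = -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(expo) / norm

def e_step(X_tilde, mu0, Sigma0, mu1, Sigma1, theta0):
    """Responsibilities r_0^i, r_1^i for each unlabeled point (rows of X_tilde)."""
    # Unnormalized posteriors: class-conditional density times class prior
    p0 = mvn_pdf(X_tilde, mu0, Sigma0) * theta0
    p1 = mvn_pdf(X_tilde, mu1, Sigma1) * (1 - theta0)
    Z = p0 + p1  # per-example normalizer
    return p0 / Z, p1 / Z
```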

For the M-step, we have to derive the maximum likelihood estimates for our parameters from the following equation (from Mark's notes): $$ Q(\theta|\theta_t) = \sum_{i=1}^n \log p(y_i ,x_i |\theta)+ \sum_{i=1}^t\sum_{\tilde{y}_i\in\{0,1\}} r_{\tilde{y}_i}^i\log p(\tilde{y}_i ,\tilde{x}_i |\theta). $$ So, for example, we'd solve for $\mu_0$ in $\nabla_{\mu_0}Q(\theta|\theta_t) = 0$.

Now, notice that this is very similar to what we had to do in question 2. The first term is exactly the same as what we had previously, and the second term (the sum over the $t$ unlabeled points) is just a weighted sum of what we had before, where we imagine that $\tilde{y}_i = c$. So using what we did last week, we have: $$ \begin{align} \nabla_{\mu_0} Q(\theta|\theta_t) &= \sum_{i=1}^n \mathbb{I}(y_i = 0)\Sigma_0^{-1}(x_i - \mu_0) + \sum_{i=1}^t\sum_{\tilde{y}_i\in\{0,1\}} r_{\tilde{y}_i}^i \mathbb{I}(\tilde{y}_i = 0)\Sigma_0^{-1}(\tilde{x}_i - \mu_0)\\ &= \sum_{i=1}^n \mathbb{I}(y_i = 0)\Sigma_0^{-1}(x_i - \mu_0) + \sum_{i=1}^t r_{0}^i\Sigma_0^{-1}(\tilde{x}_i - \mu_0). \end{align} $$ Setting this equal to zero we get: $$ \begin{align} \sum_{i=1}^n \mathbb{I}(y_i = 0)\Sigma_0^{-1}(x_i - \mu_0) + \sum_{i=1}^t r_{0}^i\Sigma_0^{-1}(\tilde{x}_i - \mu_0) &= 0\\ \sum_{i=1}^n \mathbb{I}(y_i = 0)x_i + \sum_{i=1}^t r_{0}^i \tilde{x}_i &= \mu_0 \sum_{i=1}^n \mathbb{I}(y_i = 0) + \sum_{i=1}^t r_{0}^i\mu_0 \\ \mu_0 &= \frac{1}{n_0 + r_{0}} \left( \sum_{i=1}^n \mathbb{I}(y_i = 0)x_i + \sum_{i=1}^t r_{0}^i \tilde{x}_i \right) \end{align} $$ where $n_0 = \sum_{i=1}^n \mathbb{I}(y_i = 0)$ and $r_{0} = \sum_{i=1}^t r_{0}^i$. Notice that this formula makes sense, because the $r_{0}^i$ act as a sort of "soft" indicator function.

You can derive the rest of the parameters similarly.
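The $\mu_0$ update above can be sketched in a few lines of numpy (the argument names `X_lab`, `y_lab`, `X_unlab`, `r0` are hypothetical, standing in for the labeled data, its labels, the unlabeled data, and the E-step responsibilities):

```python
import numpy as np

def m_step_mu0(X_lab, y_lab, X_unlab, r0):
    """Update mu_0: labeled points with y = 0 count fully; unlabeled
    points count with weight r_0^i (the 'soft' indicator)."""
    mask = (y_lab == 0)  # hard indicator for the labeled data
    num = X_lab[mask].sum(axis=0) + (r0[:, None] * X_unlab).sum(axis=0)
    den = mask.sum() + r0.sum()  # n_0 + sum_i r_0^i
    return num / den
```

With `r0` all ones this reduces to the ordinary sample mean over the labeled $y=0$ points plus all the unlabeled points, as you'd hope.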

In [7]:

```
from keras.datasets import mnist # you can download MNIST from the course website, but I had this package available
(X_train, y_train), (X_test, y_test) = mnist.load_data()
```

In [8]:

```
def binarize(X):
    # Convert pixel intensities (0-255) to 0/1 values
    return 1. * (X / 255. > 0.5)

X_train = binarize(X_train)
X_test = binarize(X_test)
```

In [13]:

```
plt.imshow(X_train[5, :], cmap='Greys', interpolation='nearest')
```

Out[13]:

In [14]:

```
n_1 = X_train.sum(axis=0)      # pixel-wise counts of 1s
n_0 = (1 - X_train).sum(axis=0)  # pixel-wise counts of 0s
theta = n_1 / (n_0 + n_1) # p(x = 1 | theta)
```

In [15]:

```
plt.imshow(theta, cmap='Greys', interpolation='nearest')
```

Out[15]:

In [20]:

```
def nll(X, theta):
    # Negative log-likelihood (only the x = 1 term of the Bernoulli is scored here)
    return -(X * np.log(theta)[None, :, :]).sum()
```

In [22]:

```
nll(X_test, theta)  # nan: some pixels have theta = 0, so log(theta) = -inf and 0 * -inf = nan
```

Out[22]:

In [16]:

```
theta_lap = (n_1 + 1) / ((n_0 + 1) + (n_1 + 1))  # Laplace smoothing: one pseudo-count per outcome keeps theta away from 0 and 1
```

In [17]:

```
plt.imshow(theta_lap, cmap='Greys', interpolation='nearest')
```

Out[17]:

In [25]:

```
nll(X_test, theta_lap) / X_test.shape[0] # I'm dividing by the number of test examples to get average nll.
```

Out[25]:

We talked briefly about mixture of experts models (which you can fit via EM) and mixture density networks (which you fit by just optimizing the likelihood function). These models treat the parameters of a mixture of gaussian model, $\pi, \mu$ and $\sigma$ as functions of some features in your data, so we have $\pi(x_i), \mu(x_i)$ and $\sigma(x_i)$. This allows you to use mixtures of gaussians for supervised density estimation.
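As a rough sketch of what "parameters as functions of the features" buys you, here is the per-example mixture negative log-likelihood you would optimize in a mixture density network (plain numpy; the shapes are hypothetical, with a network assumed to have produced `pi`, `mu`, `sigma` of shape `(n, K)` for each of `n` examples and `K` components):

```python
import numpy as np

def mdn_nll(pi, mu, sigma, y):
    """Negative log-likelihood of 1-D targets y under a Gaussian mixture whose
    parameters (pi, mu, sigma) are predicted separately for each example.
    pi, mu, sigma: shape (n, K); y: shape (n,)."""
    # Gaussian density of each target under each of its K components
    comp = np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    # Mix the components with weights pi, then take the negative log
    return -np.log((pi * comp).sum(axis=1)).sum()
```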

For a nice discussion of mixture of experts models, see either Murphy or Chris Bishop's Pattern Recognition and Machine Learning. Bishop also has a good discussion of mixture density networks (he's the author of the original paper on the topic).

Alex Graves generated renewed interest in mixture density networks by using them in his handwriting models, which use a recurrent network to output $\pi(x_i), \mu(x_i)$ and $\sigma(x_i)$. These models are a nice, flexible approach to modeling continuous distributions without having to discretize them.
