# How to use the HMM toolbox

## HMMs with discrete outputs

### Maximum likelihood parameter estimation using EM (Baum Welch)

The script dhmm_em_demo.m gives an example of how to learn an HMM with discrete outputs. Let there be Q=2 states and O=3 output symbols. We create random stochastic matrices as follows.
```O = 3;
Q = 2;
prior0 = normalise(rand(Q,1));
transmat0 = mk_stochastic(rand(Q,Q));
obsmat0 = mk_stochastic(rand(Q,O));
```
Now we sample nex=20 sequences of length T=10 each from this model, to use as training data.
```T=10;
nex=20;
data = dhmm_sample(prior0, transmat0, obsmat0, nex, T);
```
Here data is 20x10. Now we make a random guess as to what the parameters are,
```prior1 = normalise(rand(Q,1));
transmat1 = mk_stochastic(rand(Q,Q));
obsmat1 = mk_stochastic(rand(Q,O));
```
and improve our guess using 5 iterations of EM...
```[LL, prior2, transmat2, obsmat2] = dhmm_em(data, prior1, transmat1, obsmat1, 'max_iter', 5);
```
LL(t) is the log-likelihood after iteration t, so we can plot the learning curve.

### Sequence classification

To evaluate the log-likelihood of a trained model given test data, proceed as follows:
```loglik = dhmm_logprob(data, prior2, transmat2, obsmat2)
```
Note: the discrete alphabet is assumed to be {1, 2, ..., O}, where O = size(obsmat, 2). Hence data cannot contain any 0s.

To classify a sequence into one of k classes, train up k HMMs, one per class, and then compute the log-likelihood that each model gives to the test sequence; if the i'th model is the most likely, then declare the class of the sequence to be class i.

### Computing the most probable sequence (Viterbi)

First you need to evaluate B(i,t) = P(y_t | Q_t=i) for all t,i:
```B = multinomial_prob(data, obsmat);
```
Then you can use
```[path] = viterbi_path(prior, transmat, B)
```

## HMMs with mixture of Gaussians outputs

### Maximum likelihood parameter estimation using EM (Baum Welch)

Let us generate nex=50 vector-valued sequences of length T=50; each vector has size O=2.
```O = 2;
T = 50;
nex = 50;
data = randn(O,T,nex);
```
Now let use fit a mixture of M=2 Gaussians for each of the Q=2 states using K-means.
```M = 2;
Q = 2;
left_right = 0;

prior0 = normalise(rand(Q,1));
transmat0 = mk_stochastic(rand(Q,Q));

[mu0, Sigma0] = mixgauss_init(Q*M, reshape(data, [O T*nex]), cov_type);
mu0 = reshape(mu0, [O Q M]);
Sigma0 = reshape(Sigma0, [O O Q M]);
mixmat0 = mk_stochastic(rand(Q,M));

```
Finally, let us improve these parameter estimates using EM.
```[LL, prior1, transmat1, mu1, Sigma1, mixmat1] = ...
mhmm_em(data, prior0, transmat0, mu0, Sigma0, mixmat0, 'max_iter', 2);
```
Since EM only finds a local optimum, good initialisation is crucial. The initialisation procedure illustrated above is very crude, and is probably not adequate for real applications... Click
here for a real-world example of EM with mixtures of Gaussians using BNT.

### What to do if the log-likelihood becomes positive?

It is possible for p(x) > 1 if p(x) is a probability density function, such as a Gaussian. (The requirements for a density are p(x)>0 for all x and int_x p(x) = 1.) In practice this usually means your covariance is shrinking to a point/delta function, so you should increase the width of the prior (see below), or constrain the matrix to be spherical or diagonal, or clamp it to a large fixed constant (not learn it at all). It is also very helpful to ensure the components of the data vectors have small and comparable magnitudes (use e.g., KPMstats/standardize).

This is a well-known pathology of maximum likelihood estimation for Gaussian mixtures: the global optimum may place one mixture component on a single data point, and give it 0 covariance, and hence infinite likelihood. One usually relies on the fact that EM cannot find the global optimum to avoid such pathologies.

### What to do if the log-likelihood decreases during EM?

Since I implicitly add a prior to every covariance matrix (see below), what increases is loglik + log(prior), but what I print is just loglik, which may occasionally decrease. This suggests that one of your mixture components is not getting enough data. Try a better initialization or fewer clusters (states).

### What to do if the covariance matrix becomes singular?

Estimates of the covariance matrix often become singular if you have too little data, or if too few points are assigned to a cluster center due to a bad initialization of the means. In this case, you should constrain the covariance to be spherical or diagonal, or adjust the prior (see below), or try a better initialization.

### How do I add a prior to the covariance matrix?

Buried inside of KPMstats/mixgauss_Mstep you will see that cov_prior is initialized to 0.01*I. This is added to the maximum likelihood estimate after every M step. To change this, you will need to modify the mhmm_em function so it calls mixgauss_Mstep with a different value.

### Sequence classification

To classify a sequence (e.g., of speech) into one of k classes (e.g., the digits 0-9), proceed as in the DHMM case above, but use the following procedure to compute likelihood:
```loglik = mhmm_logprob(data, prior, transmat, mu, Sigma, mixmat);
```

### Computing the most probable sequence (Viterbi)

First you need to evaluate B(t,i) = P(y_t | Q_t=i) for all t,i:
```B = mixgauss_prob(data(:,:,ex), mu, Sigma, mixmat);
```
where data(:,:,ex) is OxT where O is the size of the observation vector. Finally, use
```[path] = viterbi_path(prior, transmat, B);
```

## HMMs with Gaussian outputs

This is just like the mixture of Gaussians case, except we have M=1, and hence there is no mixing matrix.

## Online EM for discrete HMMs/ POMDPs

For some applications (e.g., reinforcement learning/ adaptive control), it is necessary to learn a model online. The script dhmm_em_online_demo gives an example of how to do this.