Hidden Markov Model (HMM) Toolbox

Written by Kevin Murphy, 1998.
Last updated: 14 May 2001.

This toolbox supports inference and learning for HMMs with discrete outputs (dhmm's), Gaussian outputs (ghmm's), or mixture-of-Gaussians outputs (mhmm's). The Gaussians can be full, diagonal, or spherical (isotropic). It also supports discrete inputs, as in a POMDP. The inference routines support filtering, smoothing, and fixed-lag smoothing.

What is an HMM?

An HMM is a Markov chain, where each state generates an observation. You only see the observations, and the goal is to infer the hidden state sequence. HMMs are very useful for time-series modelling, since the discrete state-space can be used to approximate many non-linear, non-Gaussian systems.

HMMs and some common variants (e.g., input-output HMMs) can be concisely explained using the language of Bayesian networks, as we now demonstrate.

Consider the Bayesian network in Figure (a), which represents a hidden Markov model (HMM). (Circles denote continuous-valued random variables, squares denote discrete-valued, clear means hidden, shaded means observed.) This encodes the joint distribution

P(Q,Y) = P(Q_1) P(Y_1|Q_1) P(Q_2|Q_1) P(Y_2|Q_2) ...
For a sequence of length T, we simply "unroll" the model for T time steps. In general, such a dynamic Bayesian network (DBN) can be specified by just drawing two time slices (this is sometimes called a 2TBN); the structure (and parameters) are assumed to repeat.

The Markov property states that the future is independent of the past given the present, i.e., Q_{t+1} is independent of Q_{t-1} given Q_t. We can parameterize this Markov chain using a transition matrix, M_{ij} = P(Q_{t+1}=j | Q_t=i), and a prior distribution, \pi_i = P(Q_1 = i).
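For concreteness, here is a toy two-state example (the numbers are arbitrary and are not part of the toolbox):
M = [0.9 0.1;       % M(i,j) = P(Q_{t+1}=j | Q_t=i); each row sums to 1
     0.2 0.8];
prior = [0.5 0.5];  % prior(i) = P(Q_1=i)
PQ2 = prior * M     % marginal distribution of Q_2, here [0.55 0.45]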

We have assumed that this is a homogeneous Markov chain, i.e., the parameters do not vary with time. This assumption can be made explicit by representing the parameters as nodes: see Figure (b): P1 represents \pi, P2 represents the transition matrix, and P3 represents the parameters for the observation model. If we think of these parameters as random variables (as in the Bayesian approach), parameter estimation becomes equivalent to inference. If we think of the parameters as fixed, but unknown, quantities, parameter estimation requires a separate learning procedure (usually EM). In the latter case, we typically do not represent the parameters in the graph; shared parameters (as in this example) are implemented by specifying that the corresponding CPDs are "tied".

An HMM is a hidden Markov model because we don't see the states of the Markov chain, Q_t, but just a function of them, namely Y_t. For example, if Y_t is a vector, we might define P(Y_t=y|Q_t=i) = N(y; \mu_i, \Sigma_i). A richer model, widely used in speech recognition, is to model the output (conditioned on the hidden state) as a mixture of Gaussians. This is shown below.
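Concretely, the mixture-of-Gaussians observation model is P(Y_t=y | Q_t=i) = \sum_m C(i,m) N(y; \mu_{i,m}, \Sigma_{i,m}), where C(i,m) = P(M_t=m | Q_t=i) is the mixing matrix and M_t is a hidden variable that selects the mixture component at time t.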

Some popular variations on the basic HMM theme are illustrated below (from left to right: an input-output HMM, a factorial HMM, a coupled HMM). (In the input-output model, the CPD P(Q|U) could be a softmax function, or a neural network.) If we have software to handle inference and learning in general Bayesian networks (such as my Bayes Net Toolbox), all of these models become trivial to implement.

This software is designed for simple HMMs. If you want to use these more complicated alternatives, use my Bayes Net Toolbox.

How to use the HMM toolbox

HMMs with discrete outputs

Maximum likelihood parameter estimation using EM (Baum Welch)

The script learn_dhmm_demo.m gives an example of how to learn an HMM with discrete outputs. Let there be Q=2 states and O=3 output symbols. We create random stochastic matrices as follows.
O = 3;
Q = 2;
prior0 = normalise(rand(Q,1));
transmat0 = mk_stochastic(rand(Q,Q));
obsmat0 = mk_stochastic(rand(Q,O));  
Now we sample nex=10 sequences of length T=10 each from this model, to use as training data.
T = 10;
nex = 10;
data = sample_dhmm(prior0, transmat0, obsmat0, T, nex);
Now we make a random guess as to what the parameters are,
prior1 = normalise(rand(Q,1));
transmat1 = mk_stochastic(rand(Q,Q));
obsmat1 = mk_stochastic(rand(Q,O));
and improve our guess using 5 iterations of EM...
max_iter = 5;
[LL, prior2, transmat2, obsmat2] = learn_dhmm(data, prior1, transmat1, obsmat1, max_iter);
LL(t) is the log-likelihood after iteration t, so we can plot the learning curve.
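For example, at the MATLAB prompt:
plot(LL, 'o-');
xlabel('EM iteration'); ylabel('log-likelihood');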

Sequence classification

To evaluate the log-likelihood of a trained model given test data, proceed as follows:
loglik = log_lik_dhmm(data, prior, transmat, obsmat)
Note: the discrete alphabet is assumed to be {1, 2, ..., O}, where O = size(obsmat, 2). Hence data cannot contain any 0s.
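For example, if your observation symbols happen to be numbered 0, ..., O-1, a simple (hypothetical) fix is to shift them up by one before training:
data = data + 1;   % assuming data is a matrix of integer symbol indices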

To classify a sequence into one of k classes, train up k HMMs, one per class, and then compute the log-likelihood that each model gives to the test sequence; if the i'th model is the most likely, then declare the class of the sequence to be class i.
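Here is a minimal sketch of this procedure for discrete-output HMMs (the variables k, test_data, priors, transmats, and obsmats are assumptions for illustration, not part of the demos):
% priors{c}, transmats{c}, obsmats{c} hold the parameters learned for class c
loglik = zeros(1, k);
for c = 1:k
  loglik(c) = log_lik_dhmm(test_data, priors{c}, transmats{c}, obsmats{c});
end
[best_ll, predicted_class] = max(loglik);   % declare the most likely class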

Computing the most probable sequence (Viterbi)

First you need to evaluate B(t,i) = P(y_t | Q_t=i) for all t,i:
B = mk_dhmm_obs_lik(data, obsmat) 
Then you can use
[path, loglik] = viterbi_path(prior, transmat, B)   

HMMs with mixture of Gaussians outputs

Maximum likelihood parameter estimation using EM (Baum Welch)

Let us generate nex=10 vector-valued sequences of length T=5; each vector has size O=2.
O = 2;
T = 5;
nex = 10;
data = randn(O,T,nex);
Now let us fit a mixture of M=2 Gaussians for each of the Q=2 states using K-means.
M = 2;
Q = 2;
left_right = 0;
[prior0, transmat0, mixmat0, mu0, Sigma0] =  init_mhmm(data, Q, M, 'diag', left_right);
Finally, let us improve these parameter estimates using EM.
max_iter = 5;
[LL, prior1, transmat1, mu1, Sigma1, mixmat1] = ...
    learn_mhmm(data, prior0, transmat0, mu0, Sigma0, mixmat0, max_iter);
Since EM only finds a local optimum, good initialisation is crucial. The procedure implemented in init_mhmm is very crude, and is probably not adequate for real applications...

Sequence classification

To classify a sequence (e.g., of speech) into one of k classes (e.g., the digits 0-9), proceed as in the DHMM case above, but use the following procedure to compute likelihood:
loglik = log_lik_mhmm(data, prior, transmat, mixmat, mu, Sigma);

Computing the most probable sequence (Viterbi)

First you need to evaluate B(t,i) = P(y_t | Q_t=i) for all t,i:
B = mk_mhmm_obs_lik(data, mu, Sigma, mixmat);
Finally, use
[path, loglik] = viterbi_path(prior, transmat, B);

HMMs with Gaussian outputs

This is just like the mixture of Gaussians case, except we have M=1, and hence there is no mixing matrix. The learning routine is called as follows:
[LL, prior1, transmat1, mu1, Sigma1] = ...
    learn_mhmm(data, prior0, transmat0, mu0, Sigma0,  max_iter);
The classification routine is called as follows:
loglik = log_lik_ghmm(data, prior, transmat, mu, Sigma);
The likelihood routine is called as
B = mk_ghmm_obs_lik(data, mu, Sigma);
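As in the discrete and mixture-of-Gaussians cases above, the resulting matrix B can then be passed to the Viterbi routine:
[path, loglik] = viterbi_path(prior, transmat, B);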

Online EM for discrete HMMs / POMDPs

For some applications (e.g., reinforcement learning / adaptive control), it is necessary to learn a model online. The script online_em_demo gives an example of how to do this.

Other matlab packages for HMMs

Recommended reading