This is a set of Matlab routines I wrote for the course STAT535D: Statistical
Computing and Monte Carlo Methods
by A. Doucet. It implements
different Markov Chain Monte Carlo (MCMC) strategies for sampling from the
posterior distribution over the
parameter values for binary Probit and Logistic Regression models with a
Gaussian prior on the parameter values. Specifically, we are sampling from:
P(y=1|w,x) = f(w'x)
w ~ N(0,v)
In the above, x is a set of p features, y is the class label (-1
or 1), w is the parameters we want to estimate, N(0,v) denotes
the prior Normal distribution on w (with mean 0 and inverse
covariance matrix v). For Logistic Regression, f(a) is the
sigmoid function 1/(1+exp(-a)), while for Probit Regression it is the
Gaussian cumulative distribution function.
The second Probit Regression sampling strategy (probit2Sample.m) uses the same model, but implements the Composition sampler of Holmes and Held ("Bayesian auxiliary variable models for binary and polychotomous regression", 2004). This model jointly samples w and z, by directly sampling z from its marginal distribution (integrating over w).
Logistic Regression: There are 3 strategies implemented for sampling from the Logistic model. The first strategy (logist2SampleMH.m) uses the Metropolis-Hastings algorithm outlined in Johnson and Albert ("Ordinal Data Modeling", Springer 1999). The Iteratively-Reweighted Least Squares algorithm is used to find the Maximum a Posteriori (MAP) estimate of w, and this value is used to initialize the Markov Chain. The Asymptotic Covariance Matrix and an adaptively updated kernel width parameter are used to make proposals.
The 2nd strategy for the Logistic model (logist2Sample.m) is the Logistic variant of the Holmes and Held Probit Regression sampler. Rather than having a unit variance as in the Probit model, in the Logistic model the variances of the z variables lambda are obtained in this model by sampling from a Kolmogorv-Smirnov distribution. This block-Gibbs sampler updates z and w jointly conditioned on lambda (as in the Probit model), then samples lambda conditioned on z and w.
The 3rd strategy for the Logistic model (logist2Sample2.m) is the 2nd block-Gibbs sampling strategy of Holmes and Held. In this second approach, z and lambda are updated jointly given w (z is sampled from a truncated Logistic distribution), then w is sampled conditioned on z and lambda.
Sparse Logistic Regression: A 4th strategy was implemented for a slightly different Logistic model (logist_FS_Sample.m). In this model, we have an additional set of variables gamma that indicate whether a variable is included in the model. The effect of this is that each sample only depends on a subset of the variables, and sampling gamma lets us examine a posterior distribution over whether each variable is 'relevant' to the classification. This function implements the method described in Holmes and Held, which augments the 2nd Logistic strategy above with reversible-jump trans-dimensional moves to update gamma.
Also included is IRLS code that returns the MAP
estimate of the Logistic model (and optionally the Asymptotic Covariance matrix). This code has the
interface:
w = L2LogReg_IRLS(X,y,v)
The complete set of .m files are available here. The report for this class project is available here. Some of the samplers also use RANDRAW.
The blogreg package contains many sub-directories that must be present on the Matlab path for the files to work. You can add these sub-directories to the Matlab path by typing (in Matlab) 'addpath(genpath(blogerg_dir))', where 'blogreg_dir' is the directory that the zip file is extracted to.
Note that a bug in the sampleLambda.m function was fixed on July 18, 2013 (thanks to Sharon Chiang for pointing this out).