CS 540 Term 2 suggested projects

Ratings: 1 = easy, 2 = medium, 3 = hard. For a 2-person project, please pick a project with rating 2 or 3.
Some projects refer to ESL, which is the book "Elements of statistical learning" 2nd edn.

Rating Title Person Goal

. . . .

1 MLcomp . mlcomp.org is a website that allows one to fill in the matrix of methods x datasets in a distributed, web-based fashion. Specifically, any user can upload a method (for classification, regression, collaborative filtering, etc.) Any user can also upload a dataset. And any user can request that method m be run on dataset d, using Amazon's EC2 cloud computing architecture. The resulting performance measure, R(m,d), will then be computed and displayed on the web page for all to see. Matt has written a pmtk function "mlcompCompiler" that can bundle up any set of pmtk functions into the required format. These can then be run in octave on mlcomp's servers. This opens the door to systematic comparison of our methods compared to other people's. The problem is, right now, each method is only tested on a single train/test split of the data, so differences may not be statistically significant. The goal of this project is to try to construct a statistically significant MxD matrix of methods by data, where the methods contain representative implementations of the top 5-10 standard techhniques, such as linear and kernelized logistic regression with L2 and L1 regularization, naive Bayes, SVMs, etc. The datasets should contain a variety of datasets (about 10 in total), varying in number of cases N and features D and classes C. Do this also for regression problems. For each dataset, create 5 different train/test splits from the entire original dataset. Upload them all separately. Run all methods on these train/test folds. Extract resuls from web (preferably automatically!). Make boxplots and compute mean and standard error and make latex table. Repeat until some trends appear (e.g., does method A systematically dominate method B on data of type C?) The goal is to make this publishable in the book.

2 Space-efficient inference for HMMs . The goal is to implement (in Matlab/pmtk) the algorithm described in Space-efficient inference in dynamic probabilistic networks, Binder, Murphy, Russell, IJCAI'97. You can focus on the special case of HMMs, rather than general DBNs. (The special case is desribed in Exact alpha-beta computation in logarithmic space with application to map word graph construction, Zweig, ICSLP'00.) Your algorithm should have exactly the same input/output behavior as forwards-backwards, but should take log(T) space instead of O(T) space. You should demonstrate that your method produces the exact answers but uses less space by making a plot of memory vs sequence length for the 2 techniques.

3 Inference and learning in hidden semi-Markov models Jiarui Ding HSMMs are widely used in computational biology, e.g., for gene finding and chip-seq analysis (see e.g., A Hierarchical Semi-Markov Model for Detecting Enrichment with Application to ChIP-Seq Experiments, TR 2009). The goal is to implement (in matlab/pmtk) the forwards-backwards algorithm for HSMMs using the numerically stable algorithm described in Estimating hidden semi-markov chains from discrete sequences Guedon, 2003. You should then implement the M step for the standard waiting time distributions; M step code for the emission model should be the same as in a regular HMM. Make a demo of inference and learning on a toy data set. Bonus (making it rating 3): make it work on some real data (e.g., vision, biology or speech). Note: no knowledge of biology is required for this project! A good review article on HSMMs is here.

3 Hybrid EM/gradient methods . The goal is to implement (in matlab/pmtk) the algorithm in Optimization with EM and Expectation-Conjugate-Gradient, Salakhutdinov, Roweis, Ghahramani, ICML2003, and then reproduce 2 or 3 of their examples and figures. Be sure to include mixtures of Gaussians and PPCA in your set of examples. Figure out how to do gradient descent for GMMs, including the softmax reparameterizatino and cholesky decomposition of Sigma.

2 Over-relaxed EM Krishna Nand K. The goal is to implement (in matlab/pmtk) the algorithm in Adaptive Overrelaxed Bound Optimization Methods, Salakhutdinov, Roweis, ICML2003, and then reproduce 2 or 3 of their examples and figures. See also Triple jump acceleration of EM.

2 MCMC for horseshoe . Implement the Gibbs sampling method described here. Compare to lasso and other non-convex penalizers such as NEG. Test model selection accuracy on the standard benchmark synthetic datasets such as those in the original lasso paper. Test predictive accuracy on some real data. Bonus (making it rating 3): extend from linear regression to probit regression.

2 MCMC for elastic net . The goal is to implement (in matlab/pmtk) the Gibbs sampler in Bayesian elastic net, Bayesian Analysis 2010. Then reproduce all their examples and figures. A Matlab function to sample from the GIG distribution is here.

3 VB for elastic net . The goal is to implement (in matlab/pmtk) the variational Bayes method for fitting the elastic net model described here. This model is slightly different than the one described in "MCMC for enet", as is the experimental validation. Ideally two people would work on both of these, so the two methods could be compared on the same data, in addition to comparing to regular enet and lasso.

2 Feature selection challenge . The goal is to try to do well on the 4 datasets in the nips 2003 feature selection challenge using some of the methods we talked about in class. Note that we mostly focussed on linear models, which is probably not adequate. You could try doing basis function expansion and using group lasso. Or you could try neural nets and ARD (see ESL p410).

2 Learning "causal" structures with hidden variables Simon Aloysius The goal is to reproduce (using pmtk) figs 26.2 and 26.3 from my book (2jan edition). These figures are taken (with permission) from "A Bayesian approach to Causal Discovery". Fig 26.2 is easy: just enumerate all DAGS on 5 nodes and comptue their marginal likleihood. Fig 26.3 requires implementing the Cheeseman-Stutz approximation.

2 VB for scoring DAGs with missing data Robert Tseng A more accurate alternative to cheeseman-stutz is to use variational bayes. The goal is to implement in pmtk the method described here, and reproduce their examples. Ideally you would partner with the person doing the causal learning project and compare VB and CS on the same data.

1 Mclust in matlab . The mclust package for R implements EM for mixtures of Gaussians, but has various bells and whistles and lots of good demos. The goal is to reproduce this functionality, and the demos, in pmtk.

2 Online EM . The goal is to implement (in pmtk) the online EM algorithm described here, and to create an example of sequentially fitting a mixture of Gaussians as the data streams in. (The online solution should converge to something similar to the batch case.) Try alos some of the more interesting examples in this recent paper here. For some related work, and useful background reading, see this UBC MSc thesis here.

3 Structural pattern recognition Hai-Feng Kao, Zhenyu Guo Participate in the synthetic visual reasoning task. This requires knowledge of computer vision.

2 Clustering flow-ctyometry data Hannes Bretschneider, Andrew Roth Participate in this challenge. This is simialr to the mixtures of Students hoemwrowk exercise. I propose you use variational bayes instead of EM. Reqires some knowledge of biology.

3 Online Bayesian inference for generative classifiers . The goal is to implement the method and reproduce the results in this paper. The model (called Aclass) simply represents each class conditional density using an "infinite" mixture model. This is non-parametric but can handle missing data etc, unlike SVMs. The inference method is sequential Monte Carlo, so it can learn more complex models as the data streams in. (There is a closely related paper where they use offline MCMC available here. They have matlab code for this here.

3 Empirical Bayes for nearest shrunken centroid . The nearest shrunken centroid classifier has several aspects that are hard to justify, such as: (1) we could not justify why mj=xbar(j) was optimal without making an apparently unreasonable assumption; (2) setting $\tau_{cj} = \frac{s_j}{\lambda \sqrt{N_c}}$ in the Laplace prior seems ad hoc; (3) using $\ell_c = 1/\sqrt{N_c} - 1/\sqrt{N}$ instead of $1/\sqrt{N_c}$ seems ad hoc; (4) using s_j^2 + s_0 seems ad hoc. Try to eliminate this ad hocery by deriving an empirical Bayes approach to estimating all the hyper-parameters. Implement your method and compare it to the original approach on a few datasets (including SRBCT).

1 Convolutional NNs on mnist Simon Hastings The goal is to reproduce the classification performance of the convolutional neural net described here which is based on Simard03, but uses the conventional train/ test split to get 99.26% accuracy on MNIST digits. Use one of the two existing Matlab packages for convolutional neural nets, namely myCNN and CNN class (the latter currently needs the neural net toolbox). Most of the work seems to have been done for you already. Your goal is to make sure the code runs (ideally try both packages) and that you are able to retrain from scratch and reproduce the performance above. Try to reproduce fig 14.7 of my book (2jan edition), which plots the 82 incorrectly labeled test digits.

2 Sparse Bayesian multi-task learning . There are several goals for this project. First, apply the methodology from hw8 to the conjoint analysis problem; use the "computer buyers" data from Lenk96 (email me for the data), but follow the experimental methodology of Argyriou08. Second, use mixtures of sparsity-promoting priors (e.g., Laplace or NormalGamma) instead of mixtures of Gaussians, so as to perform simultaneous feature selection. Third, try other inference methods besides EM, such as variational Bayes or MCMC, as in Xue07.

2 Variational Bayes for non-orthogonal PCA . Implement (in pmtk) the VB algorithm of Bishop ICANN'99. This is a special case of factor analysis where the measurement noise is isotropic. But it is not quite the same as PCA since W is not constrained to be orthogonal. It is important that your method evaluate the lower bound, so we can autoamtically determine the number of latent dimensions. Test your code using ppcaEvidenceDemo (which calls Minka's laplace approximation method). It is simple to extend this VB method to handle missing data. Matlab software implementing this extension is available here. However, it is horrible code: he has hard coded special values like "900" or "999" to indicate missing values, instead of using NaNs. Please reimplement from scratch. Also, make it return the lower bound so we can do model selection and check for proper convergence. (See mixGaussVbFit for an example of how the code should look.) Test your code on pccaDemoOilFlowMissing. (Some other Matlab/C code for the VB method with missing data is available here. This page also contains a techreport with lots of details.)

3 Semi-supervised learning with MCMC . Reproduce the examples in Liang07, where they show how to combine unlabeled and labeled data in simple settings

3 Biclustering using VBEM . Derive a VB and/or EM algorithm analogous to the gibbs sampler in this paper and use it to reproduce their synthetic demos.

2 Empirical Bayes for Markov models of language . The goal is to implement the EB method of Mackay and Peto and apply it to some natural language data. You will likely find Tom Minka's fastfit library, for optimizing the parameters of a Dirichlet, very useful. Some other reading that might be helpful is Hanna Wallach's thesis

1-2 Isolated word speech recognition using HMMs and CRFs . The demo "isolatedWordClassificationHmmDemo" in pmtk3 trains 2 separate HMMs, one on a sequence of someone saying "four", and another on someone saying "five". It then parses a test sequence by running the Viterbi algorithm. The first task (which is trivial) is to modify the code so it runs with pmkt3 (it currently only works with pmtk1). The second task (which is also fairly easy) is to use Mark Schmidt's UGM Matlab code to train up a linear CRF to perform the same task. The third task (which changes the rating of this project from 1 to 2) is to acquire some larger speech data sets (e.g., the TIMIT digit dataset) and re-do your experiments on that. In this case, you may need to make the HMM emission model a mixture of Gaussians instead of a single Gaussian, and you may need to make the CRF do feature expansion before using the logistic regression local potentials.

2 N-best list for HMMs . Implement the sequential N-best list method of Nilsson 2001 in pmtk3. Check correctness by comparing to exhaustive enumeration of all possible paths (this will only be possible for short sequences). Make a demo illustrating the usefulness of the N-best list (as opposed to vanilla Viterbi) on any data set of your choosing.

Rating	Title	Person	Goal
.	.	.	.
1	MLcomp	.	mlcomp.org is a website that allows one to fill in the matrix of methods x datasets in a distributed, web-based fashion. Specifically, any user can upload a method (for classification, regression, collaborative filtering, etc.) Any user can also upload a dataset. And any user can request that method m be run on dataset d, using Amazon's EC2 cloud computing architecture. The resulting performance measure, R(m,d), will then be computed and displayed on the web page for all to see. Matt has written a pmtk function "mlcompCompiler" that can bundle up any set of pmtk functions into the required format. These can then be run in octave on mlcomp's servers. This opens the door to systematic comparison of our methods compared to other people's. The problem is, right now, each method is only tested on a single train/test split of the data, so differences may not be statistically significant. The goal of this project is to try to construct a statistically significant MxD matrix of methods by data, where the methods contain representative implementations of the top 5-10 standard techhniques, such as linear and kernelized logistic regression with L2 and L1 regularization, naive Bayes, SVMs, etc. The datasets should contain a variety of datasets (about 10 in total), varying in number of cases N and features D and classes C. Do this also for regression problems. For each dataset, create 5 different train/test splits from the entire original dataset. Upload them all separately. Run all methods on these train/test folds. Extract resuls from web (preferably automatically!). Make boxplots and compute mean and standard error and make latex table. Repeat until some trends appear (e.g., does method A systematically dominate method B on data of type C?) The goal is to make this publishable in the book.
2	Space-efficient inference for HMMs	.	The goal is to implement (in Matlab/pmtk) the algorithm described in Space-efficient inference in dynamic probabilistic networks, Binder, Murphy, Russell, IJCAI'97. You can focus on the special case of HMMs, rather than general DBNs. (The special case is desribed in Exact alpha-beta computation in logarithmic space with application to map word graph construction, Zweig, ICSLP'00.) Your algorithm should have exactly the same input/output behavior as forwards-backwards, but should take log(T) space instead of O(T) space. You should demonstrate that your method produces the exact answers but uses less space by making a plot of memory vs sequence length for the 2 techniques.
3	Inference and learning in hidden semi-Markov models	Jiarui Ding	HSMMs are widely used in computational biology, e.g., for gene finding and chip-seq analysis (see e.g., A Hierarchical Semi-Markov Model for Detecting Enrichment with Application to ChIP-Seq Experiments, TR 2009). The goal is to implement (in matlab/pmtk) the forwards-backwards algorithm for HSMMs using the numerically stable algorithm described in Estimating hidden semi-markov chains from discrete sequences Guedon, 2003. You should then implement the M step for the standard waiting time distributions; M step code for the emission model should be the same as in a regular HMM. Make a demo of inference and learning on a toy data set. Bonus (making it rating 3): make it work on some real data (e.g., vision, biology or speech). Note: no knowledge of biology is required for this project! A good review article on HSMMs is here.
3	Hybrid EM/gradient methods	.	The goal is to implement (in matlab/pmtk) the algorithm in Optimization with EM and Expectation-Conjugate-Gradient, Salakhutdinov, Roweis, Ghahramani, ICML2003, and then reproduce 2 or 3 of their examples and figures. Be sure to include mixtures of Gaussians and PPCA in your set of examples. Figure out how to do gradient descent for GMMs, including the softmax reparameterizatino and cholesky decomposition of Sigma.
2	Over-relaxed EM	Krishna Nand K.	The goal is to implement (in matlab/pmtk) the algorithm in Adaptive Overrelaxed Bound Optimization Methods, Salakhutdinov, Roweis, ICML2003, and then reproduce 2 or 3 of their examples and figures. See also Triple jump acceleration of EM.
2	MCMC for horseshoe	.	Implement the Gibbs sampling method described here. Compare to lasso and other non-convex penalizers such as NEG. Test model selection accuracy on the standard benchmark synthetic datasets such as those in the original lasso paper. Test predictive accuracy on some real data. Bonus (making it rating 3): extend from linear regression to probit regression.
2	MCMC for elastic net	.	The goal is to implement (in matlab/pmtk) the Gibbs sampler in Bayesian elastic net, Bayesian Analysis 2010. Then reproduce all their examples and figures. A Matlab function to sample from the GIG distribution is here.
3	VB for elastic net	.	The goal is to implement (in matlab/pmtk) the variational Bayes method for fitting the elastic net model described here. This model is slightly different than the one described in "MCMC for enet", as is the experimental validation. Ideally two people would work on both of these, so the two methods could be compared on the same data, in addition to comparing to regular enet and lasso.
2	Feature selection challenge	.	The goal is to try to do well on the 4 datasets in the nips 2003 feature selection challenge using some of the methods we talked about in class. Note that we mostly focussed on linear models, which is probably not adequate. You could try doing basis function expansion and using group lasso. Or you could try neural nets and ARD (see ESL p410).
2	Learning "causal" structures with hidden variables	Simon Aloysius	The goal is to reproduce (using pmtk) figs 26.2 and 26.3 from my book (2jan edition). These figures are taken (with permission) from "A Bayesian approach to Causal Discovery". Fig 26.2 is easy: just enumerate all DAGS on 5 nodes and comptue their marginal likleihood. Fig 26.3 requires implementing the Cheeseman-Stutz approximation.
2	VB for scoring DAGs with missing data	Robert Tseng	A more accurate alternative to cheeseman-stutz is to use variational bayes. The goal is to implement in pmtk the method described here, and reproduce their examples. Ideally you would partner with the person doing the causal learning project and compare VB and CS on the same data.
1	Mclust in matlab	.	The mclust package for R implements EM for mixtures of Gaussians, but has various bells and whistles and lots of good demos. The goal is to reproduce this functionality, and the demos, in pmtk.
2	Online EM	.	The goal is to implement (in pmtk) the online EM algorithm described here, and to create an example of sequentially fitting a mixture of Gaussians as the data streams in. (The online solution should converge to something similar to the batch case.) Try alos some of the more interesting examples in this recent paper here. For some related work, and useful background reading, see this UBC MSc thesis here.
3	Structural pattern recognition	Hai-Feng Kao, Zhenyu Guo	Participate in the synthetic visual reasoning task. This requires knowledge of computer vision.
2	Clustering flow-ctyometry data	Hannes Bretschneider, Andrew Roth	Participate in this challenge. This is simialr to the mixtures of Students hoemwrowk exercise. I propose you use variational bayes instead of EM. Reqires some knowledge of biology.
3	Online Bayesian inference for generative classifiers	.	The goal is to implement the method and reproduce the results in this paper. The model (called Aclass) simply represents each class conditional density using an "infinite" mixture model. This is non-parametric but can handle missing data etc, unlike SVMs. The inference method is sequential Monte Carlo, so it can learn more complex models as the data streams in. (There is a closely related paper where they use offline MCMC available here. They have matlab code for this here.
3	Empirical Bayes for nearest shrunken centroid	.	The nearest shrunken centroid classifier has several aspects that are hard to justify, such as: (1) we could not justify why mj=xbar(j) was optimal without making an apparently unreasonable assumption; (2) setting $\tau_{cj} = \frac{s_j}{\lambda \sqrt{N_c}}$ in the Laplace prior seems ad hoc; (3) using $\ell_c = 1/\sqrt{N_c} - 1/\sqrt{N}$ instead of $1/\sqrt{N_c}$ seems ad hoc; (4) using s_j^2 + s_0 seems ad hoc. Try to eliminate this ad hocery by deriving an empirical Bayes approach to estimating all the hyper-parameters. Implement your method and compare it to the original approach on a few datasets (including SRBCT).
1	Convolutional NNs on mnist	Simon Hastings	The goal is to reproduce the classification performance of the convolutional neural net described here which is based on Simard03, but uses the conventional train/ test split to get 99.26% accuracy on MNIST digits. Use one of the two existing Matlab packages for convolutional neural nets, namely myCNN and CNN class (the latter currently needs the neural net toolbox). Most of the work seems to have been done for you already. Your goal is to make sure the code runs (ideally try both packages) and that you are able to retrain from scratch and reproduce the performance above. Try to reproduce fig 14.7 of my book (2jan edition), which plots the 82 incorrectly labeled test digits.
2	Sparse Bayesian multi-task learning	.	There are several goals for this project. First, apply the methodology from hw8 to the conjoint analysis problem; use the "computer buyers" data from Lenk96 (email me for the data), but follow the experimental methodology of Argyriou08. Second, use mixtures of sparsity-promoting priors (e.g., Laplace or NormalGamma) instead of mixtures of Gaussians, so as to perform simultaneous feature selection. Third, try other inference methods besides EM, such as variational Bayes or MCMC, as in Xue07.
2	Variational Bayes for non-orthogonal PCA	.	Implement (in pmtk) the VB algorithm of Bishop ICANN'99. This is a special case of factor analysis where the measurement noise is isotropic. But it is not quite the same as PCA since W is not constrained to be orthogonal. It is important that your method evaluate the lower bound, so we can autoamtically determine the number of latent dimensions. Test your code using ppcaEvidenceDemo (which calls Minka's laplace approximation method). It is simple to extend this VB method to handle missing data. Matlab software implementing this extension is available here. However, it is horrible code: he has hard coded special values like "900" or "999" to indicate missing values, instead of using NaNs. Please reimplement from scratch. Also, make it return the lower bound so we can do model selection and check for proper convergence. (See mixGaussVbFit for an example of how the code should look.) Test your code on pccaDemoOilFlowMissing. (Some other Matlab/C code for the VB method with missing data is available here. This page also contains a techreport with lots of details.)
3	Semi-supervised learning with MCMC	.	Reproduce the examples in Liang07, where they show how to combine unlabeled and labeled data in simple settings
3	Biclustering using VBEM	.	Derive a VB and/or EM algorithm analogous to the gibbs sampler in this paper and use it to reproduce their synthetic demos.
2	Empirical Bayes for Markov models of language	.	The goal is to implement the EB method of Mackay and Peto and apply it to some natural language data. You will likely find Tom Minka's fastfit library, for optimizing the parameters of a Dirichlet, very useful. Some other reading that might be helpful is Hanna Wallach's thesis
1-2	Isolated word speech recognition using HMMs and CRFs	.	The demo "isolatedWordClassificationHmmDemo" in pmtk3 trains 2 separate HMMs, one on a sequence of someone saying "four", and another on someone saying "five". It then parses a test sequence by running the Viterbi algorithm. The first task (which is trivial) is to modify the code so it runs with pmkt3 (it currently only works with pmtk1). The second task (which is also fairly easy) is to use Mark Schmidt's UGM Matlab code to train up a linear CRF to perform the same task. The third task (which changes the rating of this project from 1 to 2) is to acquire some larger speech data sets (e.g., the TIMIT digit dataset) and re-do your experiments on that. In this case, you may need to make the HMM emission model a mixture of Gaussians instead of a single Gaussian, and you may need to make the CRF do feature expansion before using the logistic regression local potentials.
2	N-best list for HMMs	.	Implement the sequential N-best list method of Nilsson 2001 in pmtk3. Check correctness by comparing to exhaustive enumeration of all possible paths (this will only be possible for short sequences). Make a demo illustrating the usefulness of the N-best list (as opposed to vanilla Viterbi) on any data set of your choosing.