B.Sc. (Hons), University of British Columbia (1986); M. Math., U. Waterloo (1988); Ph.D., U. Maryland, College Park (1992); Assistant Professor, University of British Columbia (1992 - 1997); Associate Professor, University of British Columbia (1997 - 2002); Professor, University of British Columbia (2002 - ).
Although they have had varying degrees of success, the data mining tools developed thus far by and large share the same computational model: a tool takes some inputs from the user, crunches away to find the patterns it was designed for, and at the end returns some answers to the user. The key problems here are that the user is not allowed to participate in the discovery process, and that ultimately the user often cannot relate to the answers and is left wondering what the so-called discovered knowledge is.
To provide for the most effective data mining from large databases, we believe that (a) we should recognize that data mining is a multi-step process, and that (b) the human user must be allowed to be front and centre in the mining process. We aim to develop tools that strike a careful division of labour between the computer and the human user. Specifically, the computer does what it does best, such as counting, aggregating and searching large databases, and the human user does what a human does best, such as abstracting, hypothesizing and focusing. To achieve this division of labour, our data mining tools must provide feedback to the user frequently, incorporate user guidance in the computation, and be very efficient (as close to real-time as possible) to engage the user. Based on these principles, my data mining research has four specific technical focuses: (a) the development of tools for constraint-based mining; (b) the development of a unified model and algebra for analysis and mining; (c) performance optimization; and (d) the development of new data mining capabilities, such as outlier detection and fascicle compression. See the affiliated web pages for more information.
My bioinformatics research focuses on problem domains where intelligent decision support and knowledge discovery tools can assist in the understanding of genomic and biomedical data. Specifically, by applying data mining techniques, we link clinical and genomic data to help cancer researchers obtain profiles of cancer at the gene expression level. The biological questions we focus on are: (a) whether there are subtypes of cancer detectable at the gene expression level; (b) whether cancer of one tissue type, say A, is closer to cancer of tissue type B than to cancer of tissue type C; and (c) whether, for a given tissue type, there are signature genes for cancer. Apart from cancer analysis, we develop infrastructure to support cancer researchers in their analyses. One example is the gene expression analyzer we developed. The analyzer sits on top of a relational database management system, and provides both querying and data mining capabilities for gene expression data. See the affiliated web pages for more information.
My research on multimedia data management focuses on providing database support for non-traditional visual data, in the form of still images or motion videos. For still images, our focus is on image querying and indexing. For motion videos, our focus is on the analysis of trajectories of objects extracted from the videos. Research of this kind is important to surveillance, sports and entertainment applications, among others. See the affiliated web pages for more information.
Carenini, G., Ng, R. and Zhou, X., "Summarizing Emails with Conversational Cohesion and Subjectivity," Proc. of the 46th Annual Meeting of the Association for Computational Linguistics, June 2008.
Chari, R., Lonergan, K., Ng, R., MacAulay, C., Lam, W. and Lam, S., "Effect of Active Smoking on the Human Bronchial Epithelium Transcriptome," BMC Genomics, 8:297, pp. 1-13, August 2007.
Cohen-Freue, G., Hollander, Z., Ng, R. et al., "MDQC: a new quality assessment method for microarrays based on quality control reports," Bioinformatics Journal, 23, 23, pp. 3162-3139, 2007.
Cheung, K.J., Shah, S., Ng, R. et al., "Genome-wide profiling of follicular lymphoma by array comparative genomic hybridization reveals prognostically significant DNA copy number imbalances," to appear in the Blood Journal.
Shah, S., Lam, W., Ng, R. and Murphy, K., "Modelling Recurrent DNA Copy Number Alterations in array CGH Data," Bioinformatics Journal, 23, 13, pp. i450-i458, August 2007.