Database Usage in Steerable Multidimensional Scaling

CPSC 533C Project Proposal

Allan Rempel - agr@cs.ubc.ca

November 4, 2005

Domain, Task and Dataset
Personal Expertise
Proposed Infovis Solution
Scenario of Use
Proposed Implementation Approach
Milestones
References

Domain, Task and Dataset

Multidimensional scaling (MDS) represents high-dimensional (multivariate) data sets in a 2D or 3D form that can be displayed understandably on a 2D computer monitor. A straightforward approach would be to geometrically project the high-dimensional data onto a 2D plane. However, the techniques presented in the literature use the high-dimensional distance (or dissimilarity) between pairs of data elements, rather than the data elements themselves, to determine their placement in the 2D space.
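To make the role of pairwise distances concrete, the following is a minimal sketch, not code taken from MDSteer++. It assumes Euclidean distance as the dissimilarity measure and uses a normalized stress value, a common MDS quality measure, to compare high-dimensional distances with the distances in a candidate 2D layout. The names Point, euclideanDistance and stress are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// A data element or a layout position: one coordinate per dimension.
using Point = std::vector<double>;

// Euclidean distance between two points of equal dimensionality.
double euclideanDistance(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}

// Normalized stress: the sum over all pairs of squared differences between
// high-dimensional and layout distances, divided by the sum of squared
// high-dimensional distances. Zero means every pairwise distance is preserved.
double stress(const std::vector<Point>& high, const std::vector<Point>& low) {
    assert(high.size() == low.size());
    double numerator = 0.0;
    double denominator = 0.0;
    for (std::size_t i = 0; i < high.size(); ++i) {
        for (std::size_t j = i + 1; j < high.size(); ++j) {
            const double dHigh = euclideanDistance(high[i], high[j]);
            const double dLow = euclideanDistance(low[i], low[j]);
            numerator += (dHigh - dLow) * (dHigh - dLow);
            denominator += dHigh * dHigh;
        }
    }
    return denominator > 0.0 ? numerator / denominator : 0.0;
}
```

Note that the layout quality depends only on the pairwise distances, never on the original coordinates directly. The double loop visits every pair, which is quadratic in the number of points and only practical for small sets.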

Previous research has resulted in the development of the MDSteer++ system, which allows users to steer the scaling process to follow the most interesting directions first, and provides techniques to place small sets of points into bins for further processing in an effort to efficiently provide better information progressively as the system runs. Theoretically, the system is able to handle over one million points [2]. In practice, however, the system is limited by the number of points that fit into the working memory of the computer on which it runs. For larger data sets, performance would be expected to decrease precipitously. The data set used thus far in existing tests of MDSteer++ is the Lahman baseball archive; however, other, larger data sets will be of interest as well for the purposes of this project.
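The memory limit can be illustrated with some back-of-the-envelope arithmetic. The sketch below is only an estimate under stated assumptions: the helper names are hypothetical, and the dimension count and memory size used in the usage note are illustrative figures, not measurements of MDSteer++ or of the Lahman archive.

```cpp
#include <cstdint>

// Bytes needed to hold n points of d dimensions each,
// with bytesPerValue bytes per coordinate.
constexpr std::uint64_t rawDataBytes(std::uint64_t n, std::uint64_t d,
                                     std::uint64_t bytesPerValue) {
    return n * d * bytesPerValue;
}

// Number of unordered pairs among n points: n(n-1)/2.
constexpr std::uint64_t pairCount(std::uint64_t n) {
    return n < 2 ? 0 : n * (n - 1) / 2;
}

// True if the required number of bytes fits in the available memory.
constexpr bool fitsInMemory(std::uint64_t requiredBytes,
                            std::uint64_t availableBytes) {
    return requiredBytes <= availableBytes;
}
```

For example, one million points with a hypothetical 100 dimensions stored as 4-byte floats need 400,000,000 bytes for the raw data alone, which would fit in a hypothetical 512 MiB of memory with little room to spare. The same million points have 499,999,500,000 pairs, so storing every pairwise distance explicitly would be far out of reach.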

Personal Expertise

I have 22 years of computer programming experience, including 15 years in C/C++, the language in which MDSteer++ is written. Of that, 10 years is in industry, and the remainder is university or personal experience. Most of my experience is with a variety of flavours of Unix, with SGI IRIX and Linux being the most recent. On those platforms, I also have several years of experience writing database code, particularly MySQL in C++ with Qt. My information visualization expertise is limited to the CPSC 533C class for which this project is being developed.

Proposed Infovis Solution

I plan to modify MDSteer++ to incorporate the use of a MySQL database for data storage, so that not all data needs to be resident in memory. I plan to use the same data set used in [1], the Lahman baseball archive. In addition, I intend to use another, larger data set, yet to be determined, which would exceed the memory capacity of a typical computer on which MDSteer++ would run. I also plan to analyze runs of the software with and without the modifications, on data sets of different sizes, to see whether a database can buy us some scalability when a data set exhausts the available memory. If it can, I want to determine what that scalability costs and at what point the benefits of using a database outweigh the costs.
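One way to keep only part of the data resident is a bounded cache in front of the database. The following is a minimal sketch under assumptions, not the planned MDSteer++ design: PointCache and FetchFn are hypothetical names, the fetch callback stands in for a database query that loads one point by id, and least-recently-used eviction is just one possible policy.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

using Point = std::vector<float>;
// Hypothetical stand-in for a database lookup of one point by id.
using FetchFn = std::function<Point(std::size_t)>;

// Keeps at most `capacity` points in memory, evicting the least recently used.
class PointCache {
public:
    PointCache(std::size_t capacity, FetchFn fetch)
        : capacity_(capacity), fetch_(std::move(fetch)) {}

    // Returns a copy of the point, so later evictions cannot invalidate it.
    Point get(std::size_t id) {
        auto it = index_.find(id);
        if (it != index_.end()) {
            // Hit: mark as most recently used by moving it to the front.
            lru_.splice(lru_.begin(), lru_, it->second);
            return it->second->second;
        }
        // Miss: go to the backing store.
        ++misses_;
        Point p = fetch_(id);
        if (capacity_ == 0) return p;
        if (lru_.size() >= capacity_) {
            // Evict the least recently used entry at the back of the list.
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
        lru_.emplace_front(id, p);
        index_[id] = lru_.begin();
        return p;
    }

    std::size_t misses() const { return misses_; }
    std::size_t size() const { return lru_.size(); }

private:
    using Entry = std::pair<std::size_t, Point>;
    std::size_t capacity_;
    FetchFn fetch_;
    std::list<Entry> lru_;  // front = most recently used
    std::unordered_map<std::size_t, std::list<Entry>::iterator> index_;
    std::size_t misses_ = 0;
};
```

Counting misses gives a rough handle on the cost side of the trade-off: each miss corresponds to one round trip to the database, so comparing miss counts and run times across cache capacities is one way to see where the database starts to pay off.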

The basic (empty) MDSteer++ main user interface window is shown below, next to an image from [1] that shows what the main window looks like when the program is running on a sample data set:

Scenario of Use

The use scenario will be the same as it currently is for MDSteer++. The user runs the MDSteer++ executable on a particular data set and then watches while the system places the points in the window in accordance with the algorithm. The system is interactive in that the user can click on a region of the MDSteer++ window to steer the computation in the direction that the user is interested in.

One additional feature will be menu options that allow the user to obtain a data set from an existing database server and table or set of tables.

More information about the use of the system is available in the README file [2].

Proposed Implementation Approach

I plan to use the Qt library, whose good MySQL support should facilitate the development of the database code in MDSteer++. I have already gotten the system to run on the SuSE Linux machines in the terminal rooms (CS 106, CS 306), which I expect to use as my computing platform. As MDSteer++ is written in C++, that is the language I will use as well.

Milestones

  1. Get MDSteer++ running under Linux. (Already done.)
  2. Obtain information on how to use MySQL (existing servers, databases?) in this computing environment.
  3. Learn the MDSteer++ code base and make appropriate modifications.
  4. Run tests, gather data, and analyze results.
  5. (Outside the scope of the project for this class): Fold results into [1] and submit for publication.

References

  1. D. Westrom, T. Munzner, and M. Tory. Progressive Binning for Steerable Multidimensional Scaling. Unpublished, 2005.
  2. Authors Unspecified. A Guide to Using MDSteer++ Alpha Release (README file), Version 0.5, February 18th 2005.
  3. M. Williams and T. Munzner. Steerable, Progressive Multidimensional Scaling. In Proc. IEEE Symposium on Information Visualization, pages 57-64, 2004.
  4. F. Jourdan and G. Melançon. Multiscale hybrid MDS. In Intl. Conf. on Information Visualization (London), pages 388-393, 2004.