A Guide to Using MDSteer++ Alpha Release

Version 0.5
February 18th 2005

1. Intro

Current implementations of Multidimensional Scaling (MDS), an approach that attempts to best represent data point similarity in a low-dimensional representation, are not suited for many of today's large-scale datasets. We present an extension to the spring model approach that allows the user to interactively explore datasets that are far beyond the scale of previous implementations of MDS.

MDSteer++ is a steerable MDS computation engine and visualization tool that progressively computes an MDS layout and, in theory, handles datasets of over one million points. Our technique employs hierarchical data structures and progressive layouts so that the user can steer the computation toward the interesting areas of the dataset. The algorithm alternates between a layout stage, in which a sub-selection of points is added to the set of active points affected by the MDS iteration, and a binning stage, which increases the depth of the bin hierarchy. This binning strategy lets the user select onscreen regions of the layout to focus the MDS computation on the areas of the dataset assigned to the selected bins. The main contribution of MDSteer++ lies in progressive binning. Previous versions conduct a global pass over all points during the binning stage, whereas MDSteer++ sorts points on an as-needed basis. This should allow MDSteer++ to scale to a much higher cardinality and dimensionality.

For a paper discussing the previous version of this program (from which the above introduction borrows heavily) go here. MDSteer++ is very similar in design to the original, so read it if you are looking for something more in depth than this cursory guide. A paper on MDSteer++ is in preparation for the IEEE Symposium on Information Visualization 2005.

2. Algorithm

After loading a dataset and starting a layout, the system randomly selects a small subset of points from the dataset and assigns them 2D positions by randomly placing each data point in an initial region. Normal MDS computations are then run on this set of points until it is determined that the layout is no longer improving. At this time another small subset of the remaining pool of points is randomly chosen to be introduced into the system. We then use the 2D locations of the points already in the layout and a hierarchy of data structures we call bins to find initial positions for each point of the new sample.
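
The loop described above can be sketched roughly as follows. This is only an illustration and not the MDSteer++ source: the function names, the batch size, the step size and the stopping tolerance are all assumptions, and new points are simply dropped at random positions here, whereas the real system uses the bin hierarchy to find their initial positions.

```python
import math
import random


def stress(points, pos, dist=math.dist):
    """Normalised squared difference between original and 2D distances."""
    ids = list(pos)
    num = den = 0.0
    for a, i in enumerate(ids):
        for j in ids[a + 1:]:
            d_high = dist(points[i], points[j])
            d_low = math.dist(pos[i], pos[j])
            num += (d_high - d_low) ** 2
            den += d_high ** 2
    return num / den if den else 0.0


def spring_step(points, pos, dist=math.dist, step=0.05):
    """Move every placed point a little along its net spring force."""
    moved = {}
    for i in pos:
        fx = fy = 0.0
        for j in pos:
            if i == j:
                continue
            dx, dy = pos[j][0] - pos[i][0], pos[j][1] - pos[i][1]
            d_low = math.hypot(dx, dy) or 1e-9
            # Positive when the pair is too far apart in 2D, so i moves toward j.
            pull = (d_low - dist(points[i], points[j])) / d_low
            fx += pull * dx
            fy += pull * dy
        n = max(1, len(pos) - 1)
        moved[i] = (pos[i][0] + step * fx / n, pos[i][1] + step * fy / n)
    pos.update(moved)


def progressive_layout(points, batch=5, tol=1e-4, max_iters=200, seed=0):
    """Add points in small random batches, relaxing the layout after each."""
    rng = random.Random(seed)
    remaining = list(range(len(points)))
    rng.shuffle(remaining)
    pos = {}
    while remaining:
        new, remaining = remaining[:batch], remaining[batch:]
        for i in new:
            pos[i] = (rng.random(), rng.random())  # simplification: random start
        previous = stress(points, pos)
        for _ in range(max_iters):
            spring_step(points, pos)
            current = stress(points, pos)
            if previous - current < tol:  # layout is no longer improving
                break
            previous = current
    return pos
```

Calling progressive_layout(points) on a list of equal-length numeric tuples returns a dictionary that maps each point index to its 2D position.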

2.1. Bins

As the layout progresses and more points inhabit the layout, we subdivide the 2D space into rectangular regions we call bins. Bins provide a way to focus computation on selected regions, and they help place new points that we wish to introduce into the layout.

There are two stages to modeling a dataset: layout and binning. The layout stage executes the MDS algorithm on the points inhabiting the selected bins. The binning stage does three things. Points that have wandered out of their original bin are moved into the bin that now contains them. Bins that have become too full are subdivided, and their data points are divided among the new sub-bins. Finally, as we proceed to the next layout stage, new points are introduced into any selected bins.
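
As a rough illustration of the subdivision step, the sketch below splits a rectangular bin once it holds too many points. The class, the capacity threshold, the depth limit and the choice to halve the bin across its longer side are all assumptions made for this example. This guide does not specify how MDSteer++ actually chooses its split.

```python
from dataclasses import dataclass, field


@dataclass
class Bin:
    x0: float
    y0: float
    x1: float
    y1: float
    depth: int = 0
    capacity: int = 4      # hypothetical "too full" threshold
    max_depth: int = 16    # hypothetical guard against endless splitting
    points: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def contains(self, p):
        return self.x0 <= p[0] < self.x1 and self.y0 <= p[1] < self.y1

    def insert(self, p):
        """Store p in the leaf bin that contains it, splitting if too full."""
        if self.children:
            for child in self.children:
                if child.contains(p):
                    child.insert(p)
                    return
            raise ValueError(f"{p} lies outside this bin")
        self.points.append(p)
        if len(self.points) > self.capacity and self.depth < self.max_depth:
            self.split()

    def split(self):
        """Halve the bin across its longer side and hand its points down."""
        if self.x1 - self.x0 >= self.y1 - self.y0:
            mid = (self.x0 + self.x1) / 2
            halves = [(self.x0, self.y0, mid, self.y1), (mid, self.y0, self.x1, self.y1)]
        else:
            mid = (self.y0 + self.y1) / 2
            halves = [(self.x0, self.y0, self.x1, mid), (self.x0, mid, self.x1, self.y1)]
        self.children = [
            Bin(*h, depth=self.depth + 1, capacity=self.capacity, max_depth=self.max_depth)
            for h in halves
        ]
        pending, self.points = self.points, []
        for p in pending:
            self.insert(p)
```

Moving a point that has wandered out of its bin could then be modeled as removing it from its old leaf and inserting it again at the root.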

Selected bins can become inactive if the system determines that there are no more new points to place in their region. On the next layout stage such a bin is removed from the set of active bins and appears with a greyed background. These bins can still be re-activated for a small number of layout stages if re-selected by the user.

2.2. Global Pass

To help with a problem we've dubbed "crusting", in which a visible buildup of data points appears in inactive bins that lie next to long-active bins, we've introduced a global pass that occurs every X iterations and lasts for Y layout stages. During a global pass all bins are activated, including inactive ones, so all points are active. At the time of writing, X = 1000 and Y = 2.

2.3. Anchoring & Permanent Anchoring

Anchoring is a relatively new feature in which points with low stress are fixed permanently at their current positions. At the end of a layout stage, we choose a random sample of the points in the layout and anchor the fraction of that sample with the lowest stress. This helps us to peg down a specific view of the layout. The anchored fraction of the sample can be adjusted by moving the "Anchor Fraction" slider on the debug panel.
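
The selection rule can be sketched as follows. The function name, its parameters and the dictionary of per-point stress values are hypothetical and chosen only for illustration. How MDSteer++ computes per-point stress and sizes its sample is not covered in this guide.

```python
import random


def choose_anchors(stress_by_id, sample_size, anchor_fraction, rng):
    """Pick a random sample of placed points and anchor its lowest-stress part."""
    ids = list(stress_by_id)
    sample = rng.sample(ids, min(sample_size, len(ids)))
    sample.sort(key=lambda point_id: stress_by_id[point_id])
    keep = int(len(sample) * anchor_fraction)  # rounds down
    return set(sample[:keep])
```

For example, with ten placed points, a sample size of ten and an anchor fraction of 0.5, the five points with the lowest stress would be anchored.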

It is also possible to load a set of "permanent" anchor points from a file before starting the layout of a dataset. This allows a consistent layout pattern to persist from one session to the next. Permanent anchor points should be chosen from a dataset very similar to the one being laid out. A permanent anchor point file can be created by saving points after a layout session. The file can be loaded in a subsequent session by providing the path and file name as a command line argument.

2.4. Auto-Run Mode

If the selected region becomes inactive because all of its points have been placed, the system does not fall idle. Instead it chooses the closest selectable bins and activates them. This ensures useful work is always being done near the area the user is interested in, and the system will eventually lay out the entire dataset if left unattended for a long period.
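
One way to picture this choice is the sketch below. It assumes that closeness is measured between bin centres and that a fixed number of bins is activated. Both are assumptions for illustration, as are the function name and the dictionary layout used for bins.

```python
import math


def closest_selectable_bins(finished_center, bins, count=2):
    """Return the names of the selectable bins nearest to a finished region."""
    candidates = [b for b in bins if b["selectable"]]
    candidates.sort(key=lambda b: math.dist(finished_center, b["center"]))
    return [b["name"] for b in candidates[:count]]
```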

3. GUI

The main window is displayed in Figure 1. It contains two main features: the menu bar on the top, and the blue layout area on the bottom.

Figure 1: Main Window

3.1. Menu Bar

The menu bar contains a number of options for interacting with the system. Starting from left to right:

The File menu contains 6 options.
Open Table File: Opens a file chooser to select a dataset to lay out. Use this option if your data is in the form of a table in CSV format.
Open Distance File: opens a file chooser to select the distance matrix to use. Use this option if distances between data points have already been calculated and stored in a distance matrix. After choosing a distance matrix file, the program asks whether to load a point attribute file to colour the points. If you wish to colour the points according to some extra information, select "yes" and then choose the CSV file.
Save Anchor Points: Displays a popup asking for the percentage of anchor points to save. After a selection is made, a file chooser opens to select where to save the anchor file. The program then saves the lowest-stress points to a file in CSV format. These can later be read in as permanent anchors by providing the path and file name as a command line argument when MDSteer++ is run.
Clear Model: Clears the layout completely.
Help: Displays this help file.
Exit: Exits the program.

File properties panel: This displays three pieces of information about the data set currently being modeled: the file name, the dimensionality, and cardinality. This panel appears upon loading a dataset.

Start / Stop / Step: After a table file has been loaded, the model can be controlled using these buttons. The Start button runs the model until completion or until one of the other buttons is pressed. The Stop button halts progression of the model while still allowing bin selection. The Step button continues the model until the next binning stage.

Bin Depth scale: A colour scale helps users see how deep a particular bin is in the bin tree. The top-level root bin is coloured yellow and very deep bins are coloured green, with bins of intermediate depth coloured accordingly.

3.2. Layout Area

This is the area where data points are represented in two-dimensional space. Once modeling has begun on a loaded dataset and this region becomes increasingly populated with data points, we continually divide the layout space into what we call bins. Bins are regions you can select to focus computations on, and in this way "steer" the program. Only in the selected bins does the system allow points to participate in the MDS algorithm (i.e. move around) and introduce new points.

If a bin has a greyed background, the system has determined that no more points can be placed in it, and it will be removed from the selected bin set after the next binning stage. Points in these bins are not completely finished in the model, however. Inactive bins can be activated for a small number of iterations during a global pass or if explicitly selected by the user. In addition, data points in inactive bins can always participate in random neighbour samples.

3.3. Debug Panel

The third main GUI feature is the debug panel, shown in Figure 2. This is a popup window that is displayed once a dataset has been loaded.

Figure 2: Debug Panel

The Debug Panel contains 7 sections:

The topmost section titled "Top Level Storage Data" displays information about the main array. Placed indicates the number of data points that have been placed, or given a 2D location in the model. Potential indicates how many points have been partially sorted in the bin tree but not yet given a 2D position. Inactive indicates the number of data points that have not been introduced into the model.

The Anchoring Fraction slider adjusts what fraction of the stress sample set is anchored at a binning stage. See section 2.3 for more information on anchors.

Dimension Colouring selects the dimension to colour on. The lowest-value data points appear dark brown and the highest-value data points appear bright yellow.

The See All button will resize the layout area to fit the top level bin, minimizing blank space.

The Point Class Colouring button currently colours points added to this layout stage as green, and re-colours them white if they cross out of their original placement bin. Anchored points are coloured blue and permanent anchors are coloured red. All other placed points are coloured pink. (Intended for developers).

The Rep Point Colouring button only draws representative points. This was used to make sure representative points are being updated properly. You may not see much in this mode as representative points are often occluded by bin boundary lines. (Intended for developers).

The Reset button will begin the model again from the start. This does not require the data to be loaded from disk again, so use this instead of reloading the dataset if you want to run multiple layouts of the same dataset (especially if modeling a large dataset).

4. Commands

Besides the GUI features above, there are a number of other ways to interact with MDSteer++.

4.1. Command-line Arguments

The path and filename of a CSV file containing permanent anchor points may be provided as a command line argument. This is currently the only way to load permanent anchor points. If no filename is given, permanent anchors will not be used.

4.2. Mouse Commands

Left Click - Selects a bin to activate on the next layout stage. If no region is selected (a simple point click), information about the underlying bin is printed to standard output.

Left Click + Drag - Creates a selection region. All bins intersecting this region will become activated on the next layout stage.

Right Click + Drag - Moves the layout area according to mouse drags.

Middle Click + Drag - Zooms the layout area in and out.

Mouse Wheel In - Zooms the layout area in.

Mouse Wheel Out - Zooms the layout area out.

4.3. Key Commands

(+) means press both keys at the same time to carry out the described operation.

(->) means press one key then the next to carry out the described operation.

--------------------------

Shift + Bin Selection - Selects multiple bins to activate on the next layout stage.

Space Bar - Holding the space bar while selecting a region (left click and drag with the mouse) selects an enclosed point. Selection currently works for one data point only, so enclosing a group of points will return the "first" of those points. The selected point will be coloured cyan. The selected point’s current neighbours will also be highlighted (in different colours). Information about this point (and its neighbours) is printed to standard output.

Return -> Number -> Return - Another method for selecting points is to press Return, then type the ID number of the point you wish to select, then press Return again. This will mark that data point as the selected point and display information about it when it becomes placed in the layout. If no number is entered, data point zero is chosen (intended for developers).

Return -> Number -> D - This will display the distance between the currently selected point and the point specified by Number (intended for developers).

Comma - This will colour data points by the previous dimension. So if the dimensions are X,Y,Z and currently colouring on dimension Y, pressing comma will now colour by dimension X.

Period - This will colour data points by the next dimension. So if the dimensions are X,Y,Z and currently colouring on dimension Y, pressing period will now colour by dimension Z.

G - Toggles bin lines on and off (think Grid).

F - Toggles anchoring of selected point.

I - Perform one iteration of the model.

Command + O - Opens a file chooser to select a table file.

Command + D - Opens a file chooser to select a distance matrix.

Command + S - Prompts the user for a percent of the available anchor points, then saves their data values and positions to a file.

Command + C - Clears the model, removing the dataset from memory.

Command + H - Displays this help file in a popup window.

Command + Q - Quits the program.

5. File Formats

5.1. Table Data File (Stores a table-type dataset)
Comma-separated values (CSV) format, with 3-4 special lines at the top:

Line 1: Name of each dimension
Line 2: Data type of each dimension (currently "double" and "string" are supported)
Line 3: (optional) Whether each dimension will be used in layout calculations ("normal") or only for colouring points ("colour"). Note: string dimensions will always be used for colouring only.
Line 4: Number of items in this data set. Note: this line must not contain any commas.
Remaining lines: 1 line per data point with comma-separated data values.
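
To make the layout concrete, here is a minimal reader for this format. It is a sketch and not the loader MDSteer++ uses. It assumes that the optional third line is absent when the third line is a bare integer, and that every dimension defaults to "normal" in that case. The function name parse_table is hypothetical.

```python
import csv


def parse_table(text):
    """Parse a table data file into (names, types, roles, rows)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    names = [s.strip() for s in lines[0].split(",")]
    types = [s.strip() for s in lines[1].split(",")]
    if lines[2].strip().isdigit():
        # Optional role line is missing; assume every dimension is "normal".
        roles, count_at = ["normal"] * len(names), 2
    else:
        roles, count_at = [s.strip() for s in lines[2].split(",")], 3
    # String dimensions are always used for colouring only.
    roles = ["colour" if t == "string" else r for t, r in zip(types, roles)]
    count = int(lines[count_at])
    rows = [
        [float(v) if t == "double" else v for v, t in zip(record, types)]
        for record in csv.reader(lines[count_at + 1:])
    ]
    if len(rows) != count:
        raise ValueError(f"expected {count} data points, found {len(rows)}")
    return names, types, roles, rows
```

A small file that this reader accepts would look like this:

```
x,y,label
double,double,string
normal,colour,normal
2
1.0,2.0,a
3.5,4.5,b
```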

5.2. Distance Matrix File (Stores pre-calculated distances between data points)

First line: number of points / IDs
Remaining lines: one line per distance value with format
ID1 ID2 distance

Notes:
- IDs are expected to be integers ranging from 1 to the total number of IDs, with no missing numbers.
- With the exception of the first line, order of lines in the file does not matter.
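
A reader for this format might look like the sketch below. It assumes that the three fields on each line are separated by whitespace, that distances are symmetric, and that pairs missing from the file default to zero. None of these points is stated in this guide, and the function name is hypothetical.

```python
def parse_distance_file(text):
    """Parse 'ID1 ID2 distance' lines into a dense n-by-n matrix."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0])
    matrix = [[0.0] * n for _ in range(n)]
    for ln in lines[1:]:
        first, second, value = ln.split()
        i, j = int(first) - 1, int(second) - 1  # file IDs start at 1
        if not (0 <= i < n and 0 <= j < n):
            raise ValueError(f"ID out of range in line: {ln!r}")
        matrix[i][j] = matrix[j][i] = float(value)  # assumes symmetry
    return matrix
```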

5.3. Colouring Attributes File (Stores additional information about points whose distances are stored in a distance matrix. Typically used to colour and identify points.)

Use the same CSV format as for the Table Data File (section 5.1).

Notes:
- Rows in the file must be in ascending order by item ID.
- All dimensions will be used for colouring only.
- Including the ID as a dimension in the file is optional.

5.4. Anchor Point File (Stores data values and positions for permanent anchor points)

Use the same CSV format as for the Table Data File (section 5.1), with 3 extra dimensions:
1st dimension: Permanent ID of the point.
Last 2 dimensions: x and y positions of the point.

This file type will be created automatically when using the "Save Anchor Points" feature on the File menu.
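
For readers who want to build such a file by hand, the sketch below shows one plausible way to assemble it. The column names "id", "x" and "y", the use of the "double" type for every column and the omission of the optional third header line are assumptions. Compare against a file saved by MDSteer++ before relying on this layout.

```python
def format_anchor_file(names, rows):
    """Build anchor file text from (point_id, values, (x, y)) tuples."""
    header = ["id"] + list(names) + ["x", "y"]  # hypothetical column names
    types = ["double"] * len(header)            # assumes numeric data only
    lines = [",".join(header), ",".join(types), str(len(rows))]
    for point_id, values, (x, y) in rows:
        lines.append(",".join(str(v) for v in [point_id, *values, x, y]))
    return "\n".join(lines) + "\n"
```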

6. Known Bugs and Issues

Using a filename longer than ~12 characters will overflow the file properties panel and hide the Start, Stop, and Step buttons.

Shrinking the window too much will hide the Start, Stop, and Step buttons.