Spatial Data Mining

Spatial data mining is the branch of data mining that deals with spatial (location) data. Consider a map of the city of Vancouver containing various natural and man-made geographic features, together with clusters of points, where each point marks the location of a particular house. The houses might be noteworthy because of their size, their historical interest, or their current market value. Clustering algorithms assign each point to exactly one cluster, with the number of clusters defined by the user.

Clustering is a particularly useful technique in spatial data mining applications. It divides the initial set of objects into a number of non-overlapping subsets in order to identify classes or groups of objects whose locations (in some k-dimensional space) are close to each other. This requires a suitable similarity function, and Euclidean distance is a natural choice for spatial data. There are many clustering approaches (e.g., hierarchical methods, and partitioning methods such as k-means and k-medoids), and efficiency becomes very important when the clusters contain many points.
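
As a minimal illustration of this kind of partitioning (not the specific clustering method used in our work), the sketch below assigns 2-D house locations to a user-chosen number of clusters with k-means under Euclidean distance; the coordinates, the value of k, and the use of scikit-learn are assumptions made purely for the example.

    # Sketch: partitioning 2-D house locations into k non-overlapping clusters
    # with k-means (Euclidean distance).  Coordinates and k are illustrative.
    import numpy as np
    from sklearn.cluster import KMeans

    houses = np.array([[120.0, 340.0], [150.0, 360.0], [135.0, 355.0],
                       [900.0, 120.0], [920.0, 140.0], [880.0, 110.0]])
    k = 2                                    # number of clusters chosen by the user
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(houses)
    print(labels)                            # each point belongs to exactly one cluster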

One area of our research identifies spatial relationships between clusters and features. For example, some of the houses in a particular cluster may border a major park or may be near some important geographic feature(s). The objective is to identify likely relationships, and the discovered relationships are ranked so that the most interesting information is presented first. For example, trivial information such as "50% of the houses (in a particular cluster) are within 5 kilometres of an elementary school" is not very interesting because there are many schools in a major city. It is much more interesting to note that "64% of the houses are within 500 metres of John Oliver High School" or that "22% of the houses border Queen Elizabeth Park". Such information could be of value to realtors, investors, or prospective home buyers; however, the techniques employed also apply to other spatial applications (e.g., satellite images, photographs, oil and gas exploration).
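
To make the form of these statements concrete, the sketch below computes the percentage of a cluster's houses that lie within a given distance of a feature; the coordinates are hypothetical, and the feature is treated as a single point rather than a polygon for simplicity.

    # Sketch: computing a proximity statement of the form
    # "p% of the houses in this cluster are within d metres of feature F".
    import numpy as np

    def pct_within(cluster_xy, feature_xy, d):
        """Percentage of cluster points within distance d of a point feature."""
        dists = np.linalg.norm(cluster_xy - feature_xy, axis=1)
        return 100.0 * np.mean(dists <= d)

    cluster = np.array([[120.0, 340.0], [150.0, 360.0], [700.0, 900.0]])  # metres
    school = np.array([140.0, 350.0])
    print(f"{pct_within(cluster, school, 500):.0f}% of the houses are within 500 m")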

Given a cluster of houses, how can we tell which features are close? This problem is not trivial: there may be tens of thousands of features to consider, and the features of interest can vary significantly from user to user. Furthermore, many map datasets are stored in free format without accompanying indexes. We need to be able to detect relationships among large numbers of objects without incurring significant overhead.

Aggregate Proximity Relationships

One area of our research involves the problem of efficiently determining aggregate proximity relationships in spatial data mining. We use the word "aggregate" because most clusters are unlikely to have a uniform distribution of houses, so it makes sense to assign a greater weight to houses nearer a relevant feature's boundary. This is a form of nearest neighbour search. Although index structures exist to provide fast look-up, significant pre-processing is required (e.g., O(n log n) for tree structures, or O(k^2 n log n) for the order-k Voronoi diagrams used to find the k nearest neighbours). Such pre-processing makes index construction questionable in a data mining context because the number of features (n) can be very large.
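
Purely for illustration, one simple way to weight houses by their proximity to a feature's boundary is an inverse-distance aggregate; the decay function below is a hypothetical choice, not the measure defined in our work.

    # Sketch: an aggregate proximity measure in which houses nearer the
    # feature's boundary contribute more than distant ones.  The inverse-
    # distance weighting is an assumption made for this example.
    import numpy as np

    def aggregate_proximity(dists_to_boundary, scale=100.0):
        """Sum of weights that decay with distance to the feature boundary."""
        d = np.asarray(dists_to_boundary, dtype=float)
        return float(np.sum(scale / (scale + d)))

    print(aggregate_proximity([10.0, 50.0, 400.0]))  # nearer houses dominate the score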

We have implemented a linear-time algorithm with a small constant factor (Algorithm CRH) to find approximate proximity relationships using filtering, memoization, buckets, and selection. Our filtering approach approximates the features and clusters with a succession of convex shapes (circles, then encompassing rectangles, then convex hulls). We do not require indexes, and we have shown that excellent performance can be obtained on large ad hoc datasets. Our approach supports scalability and incrementality, both of which are important properties in spatial data mining.
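
The sketch below conveys the filtering idea only: the cluster is enclosed by successively tighter convex approximations, and a feature that is far from a coarse approximation must be far from the cluster itself, so most features are discarded before the finer tests are applied. The use of shapely, the distance threshold, and the details of each test are assumptions for the example, not the actual CRH implementation.

    # Sketch of circle/rectangle/hull filtering: each approximation encloses
    # the cluster, so the distance from a feature to the approximation is a
    # lower bound on its distance to the cluster, and far-away features can
    # be discarded safely at the coarse stages.
    import numpy as np
    from shapely.geometry import MultiPoint, Point

    def close_features(cluster_pts, feature_polys, threshold):
        pts = MultiPoint([tuple(p) for p in cluster_pts])
        hull = pts.convex_hull                            # tightest approximation
        rect = hull.envelope                              # encompassing rectangle
        cx, cy = np.mean(cluster_pts, axis=0)
        centre = Point(cx, cy)
        radius = max(centre.distance(Point(x, y)) for x, y in cluster_pts)
        circle = centre.buffer(radius)                    # coarsest approximation
        survivors = []
        for f in feature_polys:
            if f.distance(circle) > threshold:            # circle test
                continue
            if f.distance(rect) > threshold:              # rectangle test
                continue
            if f.distance(hull) <= threshold:             # convex-hull test
                survivors.append(f)
        return survivors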

Concept Generalization

This area of research expands on our work involving aggregate proximity relationships. Given n input clusters, we want to associate the clusters with classes of features (e.g., educational institutions, which in turn may consist of grade schools and post-secondary institutions ... grade schools, which may consist of public and private institutions ... and private grade schools, which may consist of boys' schools and girls' schools). Specifically, we want to find classes of features that are in close proximity to most (or all) of the clusters. For example, if 2 of 3 clusters are close to girls' private schools, and the other cluster is close to a boys' private school, then we can generalize these observations to conclude that all 3 clusters are close to some private grade school. Alternatively, given two input clusters, we want to find discriminating classes of features that distinguish one cluster from the other.
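
The following sketch illustrates only the generalization step on a toy concept hierarchy: observations made at the level of girls' and boys' private schools are climbed one level up to the common class "private grade school". The hierarchy and the per-cluster observations are invented for the example.

    # Sketch: generalizing per-cluster feature classes up a concept hierarchy
    # until a single common class covers every observation.
    PARENT = {"girls' private school": "private grade school",
              "boys' private school":  "private grade school",
              "private grade school":  "grade school",
              "grade school":          "educational institution"}

    def generalize(observed_classes):
        """Climb the hierarchy until all observations map to one class."""
        classes = set(observed_classes)
        while len(classes) > 1 and all(c in PARENT for c in classes):
            classes = {PARENT[c] for c in classes}
        return classes.pop() if len(classes) == 1 else None

    # Clusters 1 and 2 are close to girls' private schools, cluster 3 to a
    # boys' private school.
    print(generalize(["girls' private school",
                      "girls' private school",
                      "boys' private school"]))   # -> private grade school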

We have proposed two algorithms that use concept generalization to find such classes of features:

  1. Algorithm GenCom, which is used to extract commonalities. For example, if n expensive housing clusters are given as input, GenCom may discover the general pattern that they are all very close to the shoreline. Moreover, using concept generalization, GenCom considers not only individual features, but also classes of features in pattern extraction. For instance, while it may not be true that every input cluster is close to one particular golf course, GenCom may find out that each cluster is close to some golf course -- not necessarily the same one.

  2. Algorithm GenDis, which is used to find "discriminating" classes of features that distinguish one cluster from another. For example, if an expensive housing cluster and a poor housing cluster are given as input, GenDis may find that "golf course" and "police station" are discriminators, in that while the expensive housing cluster is close to a golf course, the poor housing cluster is not, and while the poor housing cluster is close to a police station, the expensive housing cluster is not. A small sketch of this idea follows the list.
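
As with the generalization sketch above, the feature classes assigned to each cluster below are hypothetical; the point is only that a class discriminates between two clusters when exactly one of them is close to some member of that class.

    # Sketch of the discrimination step: a feature class is a discriminator
    # if one cluster is close to some member of the class and the other
    # cluster is not.
    def discriminators(classes_a, classes_b):
        a, b = set(classes_a), set(classes_b)
        return (a - b) | (b - a)    # classes close to exactly one cluster

    expensive = {"golf course", "shoreline", "elementary school"}
    poor      = {"police station", "elementary school"}
    print(discriminators(expensive, poor))
    # -> {'golf course', 'shoreline', 'police station'}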

Boundary Shape Matching

We are researching a specific problem in knowledge discovery among spatial objects -- namely, partial boundary shape matching. As above, our focus is on mining spatial data in which a large number of objects called features (represented as polygons) are compared with one or more point sets called clusters. Specifically, a cluster of points is compared to a large number of natural or man-made features to detect partial or total matches between the facing boundaries of the cluster and the feature. We use a geometric construct called an alpha-shape to characterize the "shape" of an arbitrary cluster of points, thereby obtaining a set of edges denoting the cluster's boundary. Our work results in an approach for detecting a boundary shape match between the facing curves of the cluster and the feature; furthermore, the degree of such a match can be quantified.
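
As a rough sketch of how a cluster's boundary edges can be obtained from an alpha-shape (using the standard Delaunay-based construction; the points and the alpha value below are arbitrary, and this is not our matching algorithm itself):

    # Sketch: alpha-shape boundary edges of a 2-D point set.  Keep Delaunay
    # triangles whose circumradius is below alpha; edges that belong to
    # exactly one kept triangle form the boundary.
    from collections import Counter
    import numpy as np
    from scipy.spatial import Delaunay

    def alpha_shape_edges(points, alpha):
        tri = Delaunay(points)
        edge_count = Counter()
        for ia, ib, ic in tri.simplices:
            a, b, c = points[ia], points[ib], points[ic]
            la = np.linalg.norm(b - c)
            lb = np.linalg.norm(c - a)
            lc = np.linalg.norm(a - b)
            area = 0.5 * abs((b[0] - a[0]) * (c[1] - a[1])
                             - (b[1] - a[1]) * (c[0] - a[0]))
            if area > 0 and (la * lb * lc) / (4.0 * area) < alpha:
                for e in ((ia, ib), (ib, ic), (ic, ia)):
                    edge_count[tuple(sorted(e))] += 1
        return [e for e, n in edge_count.items() if n == 1]

    pts = np.random.default_rng(0).random((40, 2))
    print(alpha_shape_edges(pts, alpha=0.25)[:5])   # a few boundary edges (index pairs)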