Data Management and Mining Lab

Distance-Based Outliers

Most existing work in data mining has focused on the discovery of patterns. For some applications, however, the patterns are well-established, and it is the exceptions to those patterns that are of interest.

We are conducting ongoing research on the identification, explanation, and generalization of distance-based outliers (DB-outliers). In statistics, an outlier is a data value that appears out of place with respect to the rest of the data. Formally, given user-defined parameters p and D and a distance function F, an object O in a dataset T is a distance-based outlier if at least a fraction p of the objects in T lie at a distance greater than D from O, as measured by F.
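The definition above can be illustrated with a minimal Python sketch. It is a naive check that compares every pair of objects and is not one of the algorithms from the papers listed below. The function names (db_outliers, euclidean), the choice of Euclidean distance for F, and the sample data are assumptions made purely for illustration.

```python
import math
from typing import Callable, List, Sequence

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    """Assumed distance function F: plain Euclidean distance."""
    return math.dist(a, b)


def db_outliers(
    data: Sequence[Point],
    p: float,
    d: float,
    dist: Callable[[Point, Point], float] = euclidean,
) -> List[int]:
    """Return indices of objects that satisfy the DB-outlier definition.

    An object O qualifies if at least a fraction p of the objects in the
    dataset lie at a distance greater than d from O. This is a direct,
    quadratic-time reading of the definition, not an optimized algorithm.
    """
    n = len(data)
    outliers = []
    for i, o in enumerate(data):
        # Count the objects farther than d from o (o itself is at distance 0).
        far = sum(1 for q in data if dist(o, q) > d)
        if far >= p * n:
            outliers.append(i)
    return outliers


# Hypothetical sample data: four clustered points and one distant point.
sample = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
result = db_outliers(sample, p=0.75, d=3.0)
print(result)
```

With these sample values, only the distant point has at least 75% of the dataset farther than 3 units away, so it is the only object reported. Because every object is compared with every other object, the cost grows quadratically with the size of the dataset, so this sketch is only practical for small inputs.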

Our research has been applied to identify outliers among players in the National Hockey League, based on the players' performance statistics. We have also applied our work to stock market, mutual fund, education, insurance, and video surveillance data.

Detailed information about distance-based outliers can be found in:

Edwin M. Knorr and Raymond T. Ng. "Algorithms for Mining Distance-Based Outliers in Large Datasets", Proceedings of the 24th VLDB Conference, New York, August 24-27, 1998, pp. 392-403.

Edwin M. Knorr and Raymond T. Ng. "Finding Intensional Knowledge of Distance-Based Outliers", Proc. VLDB, Edinburgh, Scotland, September 7-10, 1999, pp. 211-222.

Edwin M. Knorr, Raymond T. Ng, and Ruben H. Zamar. "Robust Space Transformations for Distance-Based Operations", Proc. SIGKDD, San Francisco, August 26-29, 2001, pp. 126-135.

More information on outlier detection in video surveillance can be found in:

Edwin M. Knorr, Raymond T. Ng, and Vladimir Tucakov. "Distance-Based Outliers: Algorithms and Applications", The VLDB Journal, 8(3), February 2000, pp. 237-253.