Distance-Based Outliers
Most existing work in data mining has focused on the discovery
of patterns. For some applications, however, the patterns are well-established,
and it is the exceptions to those patterns that are of interest.
We are conducting ongoing research on the identification, explanation,
and generalization of distance-based outliers (DB-outliers). In
statistics, an outlier is any data value that seems to be out of
place with respect to the rest of the data. Formally, given
user-defined parameters p and D and a distance function F, an object
O in a dataset T is a distance-based outlier if at least a fraction
p of the objects in T lie at a distance greater than D from O.
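The definition can be illustrated with a minimal sketch. It is a naive
quadratic scan written only to make the roles of p, D, and F concrete; it
is not one of the algorithms from the papers listed below. The function
names db_outliers and euclidean are hypothetical, and the choice of
Euclidean distance for F is an assumption made for the example.

```python
import math


def euclidean(a, b):
    """Assumed distance function F: Euclidean distance between two tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def db_outliers(T, p, D, F=euclidean):
    """Return the objects O in T for which at least a fraction p of the
    objects in T lie at a distance greater than D from O.

    Naive O(n^2) scan, for illustration only.
    """
    n = len(T)
    outliers = []
    for O in T:
        # Count the objects in T (O itself is at distance 0) farther than D from O.
        far = sum(1 for Q in T if F(O, Q) > D)
        if far >= p * n:
            outliers.append(O)
    return outliers


# Four clustered points and one distant point (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(db_outliers(points, p=0.75, D=3.0))  # [(10, 10)]
```

Here four of the five objects (a fraction of 0.8) lie farther than D = 3
from (10, 10), so it qualifies for p = 0.75. Raising p to 0.9 would leave
no outliers in this dataset.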
Our research has been applied to identify outliers among players
in the National Hockey League, based on the players' performance
statistics. We have also applied our work to stock market, mutual
fund, education, insurance, and video surveillance data.
Detailed information about distance-based outliers can be found
in:
Edwin M. Knorr and Raymond T. Ng. "Algorithms for Mining Distance-Based
Outliers in Large Datasets", Proceedings of the 24th VLDB Conference,
New York, August 24-27, 1998, pp. 392-403.
Edwin M. Knorr and Raymond T. Ng. "Finding Intensional Knowledge
of Distance-Based Outliers", Proc. VLDB, Edinburgh, Scotland, September
7-10, 1999, pp. 211-222.
Edwin M. Knorr, Raymond T. Ng, and Ruben H. Zamar. "Robust Space
Transformations for Distance-based Operations", Proc. SIGKDD, San
Francisco, August 26-29, 2001, pp. 126-135.
More information on outlier detection in video surveillance can
be found in:
Edwin M. Knorr, Raymond T. Ng, and Vladimir Tucakov. "Distance-Based
Outliers: Algorithms and Applications", The VLDB Journal, 8(3),
February 2000, pp. 237-253.