The Optimized Segment Support Map for the Mining of Frequent Patterns

ID
TR-2001-18
Authors
Carson Kai-Sang Leung, Raymond T. Ng and Heikki Mannila
Publishing date
November 15, 2001
Length
25 pages
Abstract
Computing the frequency of a pattern is a key operation in data mining algorithms. We describe a simple, yet powerful, way of speeding up any form of frequency counting satisfying the monotonicity condition. Our method, the optimized segment support map (OSSM), is based on a simple observation about data: Real life data sets are not necessarily be uniformly distributed. The OSSM is a light-weight structure that partitions the collection of transactions into m segments, so as to reduce the number of candidate patterns that require frequency counting. We study the following problems: (i) What is the optimal value of m, the number of segments to be used (the segment minimization problem)? (ii) Given a user-determined m, what is the best segmentation/composition of the m segments (the constrained segmentation problem)? For the segment minimization problem, we provide a thorough analysis and a theorem establishing the minimum value of m for which there is no accuracy lost in using the OSSM. For the constrained segmentation problem, we develop various algorithms and heuristics to help facilitate segmentation. Our experimental results on both real and synthetic data sets show that our segmentation algorithms and heuristics can efficiently generate OSSMs that are compact and effective.