
Minimizing memory requirements

One frequent criticism of nearest-neighbour methods is that they require far more memory than neural network algorithms. However, this expectation of high memory requirements seems to be based on an invalid numerical comparison between the number of hidden units typically used in the back-propagation approach and the much larger number of exemplars in the training database. These are not directly comparable, because each hidden unit normally maintains a weight for every possible discrete feature value, or for each value range in a distributed representation, whereas each training exemplar requires memory only for a specific set of feature values. An example is provided by the NETtalk problem to be described below, in which an 80-hidden-unit back-propagation network contains 18,629 weights, which actually requires more memory to store than the 1000-word training set. For other data sets with a less distributed representation, the nearest-neighbour approach will often require more memory, but the difference may not be as significant as is commonly implied.
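For a rough sense of this comparison, the following back-of-the-envelope arithmetic contrasts weight storage with exemplar storage. Only the 18,629 weights and the 1000-word training set are taken from the text; the bytes-per-value figures, the average word length, and the per-letter label counts are illustrative assumptions, not figures from the paper.

    # Rough memory comparison between a back-propagation network and a
    # stored training set.  Only the weight count (18,629) and the number
    # of training words (1000) come from the text; the remaining figures
    # are assumptions for illustration.

    BYTES_PER_WEIGHT = 4       # assume 32-bit floating-point weights
    BYTES_PER_VALUE = 1        # assume one byte per stored symbol

    n_weights = 18629          # 80-hidden-unit NETtalk network (from the text)
    network_bytes = n_weights * BYTES_PER_WEIGHT

    n_words = 1000             # training-set size (from the text)
    avg_letters_per_word = 7   # assumed average word length
    values_per_letter = 3      # assumed: letter, phoneme label, stress label
    exemplar_bytes = (n_words * avg_letters_per_word
                      * values_per_letter * BYTES_PER_VALUE)

    print("network weights: %6.1f KB" % (network_bytes / 1024.0))
    print("training set:    %6.1f KB" % (exemplar_bytes / 1024.0))

Under these assumed encodings the stored exemplars occupy roughly a quarter of the memory taken by the network weights, which is consistent with the comparison made above.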

On the other hand, it is clear that it is often unnecessary to retain all training data in regions of the input space that have unambiguous classifications. Nearest-neighbour learning algorithms can reduce their memory usage by retaining the full density of training exemplars only where they are needed, near classification boundaries, and thinning them in other regions. There has been a considerable amount of research on this problem in the classical nearest-neighbour literature, as summarized in the survey by Dasarathy (1991). Most of this work is relevant only to a single-nearest-neighbour classifier, but the papers by Chang (1974) and Tomek (1976) give approaches that are applicable to a k-nearest-neighbour method.

For VSM learning, we have developed a simple editing procedure that removes data from regions with unambiguous classifications. The method deletes an exemplar if its J nearest neighbours all agree on the same classification using the VSM classifier, and all of the neighbours assign this classification a probability above 0.6. For our experiments, we have used J = 10. There is no requirement that the removed exemplar have the same classification as its neighbours, so the procedure can also remove ``noise'' exemplars. The procedure is repeated until fewer than 5% of the remaining exemplars are deleted on any iteration. This is a very conservative method that deletes exemplars only within regions that have consistent classifications. The surprising result is that it actually improves generalization performance by a small but consistent amount in our experiments. This is clearly a topic on which further research would be useful.
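A short sketch of this editing procedure is given below. The function vsm_classify, its (exemplars, labels, query) -> (label, probability) interface, the Euclidean neighbour search, and the choice to mark all deletions within a pass before removing them together are assumptions made for the sake of the example; they are not specified by the description above.

    import numpy as np

    def edit_exemplars(X, y, vsm_classify, J=10, prob_threshold=0.6,
                       stop_fraction=0.05):
        """Conservative editing of a training set (illustrative sketch).

        vsm_classify is assumed to map (exemplars, labels, query) to a
        (predicted_label, probability) pair; its form depends on the VSM
        classifier and is not defined here."""
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        while True:
            n = len(X)
            keep = np.ones(n, dtype=bool)
            for i in range(n):
                # Find the J nearest neighbours of exemplar i (excluding itself).
                d = np.linalg.norm(X - X[i], axis=1)
                d[i] = np.inf
                neighbours = np.argsort(d)[:J]
                # Classify each neighbour using the current exemplar set.
                results = [vsm_classify(X, y, X[j]) for j in neighbours]
                labels = {label for label, _ in results}
                # Delete exemplar i only if all neighbours agree on one class
                # and each assigns it a probability above the threshold.
                if len(labels) == 1 and all(p > prob_threshold
                                            for _, p in results):
                    keep[i] = False
            deleted = n - int(np.count_nonzero(keep))
            X, y = X[keep], y[keep]
            # Stop once an iteration deletes fewer than 5% of the exemplars.
            if deleted < stop_fraction * n:
                break
        return X, y

Because the agreement test is applied to the neighbours rather than to the deleted exemplar itself, isolated mislabelled points inside a consistent region are removed along with redundant ones, while exemplars near classification boundaries are never touched.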






