
Identifying Co-referential Names across Large Corpora
L. Lloyd, A. Mehler, S. Skiena

This paper presents an algorithm for identifying sets of co-referential names, i.e., multiple names that refer to a single logical entity, across a large collection of texts. It introduces the notion of morphological similarity, whereby two names can be judged co-referential based on the surface forms of the names themselves. The algorithm identifies co-referring name sets at scale in three steps: morphological similarity detection, contextual similarity analysis, and clustering.
1- Morphological similarity is computed on a syntactic basis through morphologically sound hashing techniques.
2- Contextual similarity: the contexts in which names are used determine the similarity of a pair of names. Co-occurrence analysis against other entities is used to estimate the probability that two names are co-referent by context.
3- The two similarity measures are combined to cluster the names. Contextual evidence is weaker for unpopular names.
Note the difference from traditional cross-document co-reference analysis, where the problem is to cluster documents mentioning the same name into sets referring to the same entity.
Here, the goal is to cluster the multiple names that refer to a given entity across a set of documents. The flow and scale of news text further complicate the problem. Contextual analysis relies solely on entity co-occurrence lists and high-speed dimension reduction techniques for efficiency. Morphological hashing techniques dispense with the need for pairwise similarity testing of all name pairs and are tuned using variable-precision phonetic hashing with an adjustable parameter, which gives a range of operating points along the tokenization path with different precision/recall tradeoffs. The sequence of transformations that tokenizes and weakens a string is modeled as a graph, with each edge weight reflecting how drastic the transformation is.
Morphological similarity – Most pairs of co-referential names result from morphological transformations such as subsequence similarity, pronunciation similarity, stemming and abbreviations.
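The idea of variable-precision phonetic hashing can be illustrated with a small sketch. This is a simplified Soundex-style scheme, not the paper's actual transformation graph; the `precision` parameter stands in for the adjustable operating point: longer hashes keep more phonetic detail (higher precision), shorter hashes merge more names into the same bucket (higher recall).

```python
# Minimal sketch of variable-precision phonetic hashing, assuming a
# Soundex-style consonant-class coding; the paper's actual hashing
# scheme and transformation weights are not reproduced here.

SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}

def phonetic_hash(name: str, precision: int = 4) -> str:
    """Hash a name; larger `precision` keeps more phonetic detail
    (fewer collisions), smaller values merge more name variants."""
    name = name.lower()
    if not name:
        return ""
    first, rest = name[0], name[1:]
    digits = []
    prev = SOUNDEX_CODES.get(first, "")
    for ch in rest:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code
    return (first.upper() + "".join(digits))[:precision].ljust(precision, "0")

# Names hashing to the same bucket become candidate co-referential pairs,
# avoiding pairwise comparison of all names:
print(phonetic_hash("Schwarzenegger"))
print(phonetic_hash("Shwarzeneger"))
```

Bucketing by hash value turns the quadratic all-pairs comparison into a lookup: only names sharing a bucket are tested further.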
Contextual similarity – Co-occurrences associated with co-referential names are more likely to be similar than those of morphologically similar but non-co-referential pairs; therefore, a vector of co-occurrence counts for each name serves as the feature space for contextual similarity. Two issues arise: dimension reduction, and similarity computation between two co-occurrence lists.
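The co-occurrence feature space described above could be built as sparse count vectors; the sketch below uses hypothetical entity names and represents each document as its set of mentioned entities, which is an assumption about the preprocessing.

```python
from collections import defaultdict

# Hypothetical sketch: build a sparse co-occurrence count vector for each
# name by counting which other entities appear in the same document.

documents = [
    {"George Bush", "Dick Cheney", "Iraq"},
    {"George W. Bush", "Dick Cheney", "White House"},
    {"Iraq", "White House", "George W. Bush"},
]

def cooccurrence_vectors(docs):
    vectors = defaultdict(lambda: defaultdict(int))
    for entities in docs:
        for name in entities:
            for other in entities:
                if other != name:
                    vectors[name][other] += 1
    return {name: dict(counts) for name, counts in vectors.items()}

vectors = cooccurrence_vectors(documents)
print(vectors["George W. Bush"])
```

Each name's vector has one dimension per co-occurring entity, which is why the full space is extremely sparse and high-dimensional, motivating the reduction step below.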
Dimension reduction – The feature space is "extremely sparse" and high-dimensional because each entity introduces a new dimension but interacts with only a few hundred other entities. The dimension reduction techniques are based on k-means clustering and graph partitioning. Contextual similarity is measured once the co-occurrence lists for the two names being compared are projected onto the reduced-dimensional space, using either KL-divergence (an information-theoretic measure of the dissimilarity between two probability distributions) or cosine similarity, which views the two contexts as vectors in a high-dimensional space and computes the cosine of the angle between them. Optimizing this contextual similarity phase depends on the proper choice of dimension reduction algorithm, number of dimensions, and contextual similarity measure.
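The two similarity measures named above can be sketched directly on sparse co-occurrence count vectors (dicts mapping co-occurring entity to count); the smoothing constant and the example counts are assumptions for illustration, and the projection onto a reduced space is omitted.

```python
import math

# Sketch of the two contextual similarity measures: cosine similarity on
# count vectors, and KL-divergence on the normalized (smoothed)
# distributions. Smoothing avoids log(0) for entities one list lacks.

def cosine_similarity(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def kl_divergence(a, b, smooth=1e-9):
    """KL(P||Q): dissimilarity of distribution P (from a) w.r.t. Q (from b)."""
    keys = set(a) | set(b)
    total_a = sum(a.values()) + smooth * len(keys)
    total_b = sum(b.values()) + smooth * len(keys)
    kl = 0.0
    for k in keys:
        p = (a.get(k, 0) + smooth) / total_a
        q = (b.get(k, 0) + smooth) / total_b
        kl += p * math.log(p / q)
    return kl

x = {"Iraq": 3, "White House": 2}
y = {"Iraq": 2, "White House": 2, "Senate": 1}
print(cosine_similarity(x, y))
print(kl_divergence(x, y))
```

Note the asymmetry: KL-divergence is not symmetric in its arguments, whereas cosine similarity is; either way, a higher cosine (or lower divergence) indicates more similar contexts.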
Once the morphologically similar pairs of names and their degrees of morphological and contextual similarity are established, the two similarities are combined into a single probability that two names are co-referential, and the names are clustered into co-reference sets using either a single-link or average-link clustering algorithm.
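The final step can be sketched for the single-link case, where linking every pair whose combined probability clears a threshold reduces to finding connected components (here via union-find); the threshold, pair probabilities, and names are illustrative assumptions, and average-link clustering would require a different merge criterion.

```python
# Minimal single-link clustering sketch: names joined by any pair whose
# combined co-reference probability exceeds the threshold end up in the
# same co-reference set (connected components via union-find).

def single_link_clusters(pair_probs, threshold=0.8):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for (a, b), p in pair_probs.items():
        find(a)
        find(b)
        if p >= threshold:
            union(a, b)

    clusters = {}
    for name in parent:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())

pairs = {
    ("George Bush", "George W. Bush"): 0.95,
    ("George W. Bush", "G. W. Bush"): 0.90,
    ("George Bush", "George Clooney"): 0.10,
}
clusters = single_link_clusters(pairs)
print(clusters)
```

Single-link is permissive: one strong pair suffices to merge two sets, which is why weaker contextual evidence for unpopular names can still be absorbed through a chain of confident links.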

-- JamilaSalari - 02 Jun 2009
