|Title:||Optimising Locality-Sensitive Hashing on Sequences in the Context of Motif Finding|
Department of Computer Science, University of British Columbia
Locality-sensitive hashing has been applied in several problems in bioinformatics to quickly search large sequences by examining the n-grams of the sequences. We observe in these applications a bias in the random generation of hashes and define an effective equivalence for hashes that generate nearly identical results. Methods that address these two points demonstrate an improvement to the rate at which unique hashes are generated, especially when many hashes are generated. These methods can be easily integrated with existing applications of locality-sensitive hashing, resulting in improved performance by reducing the time algorithms spend revisiting duplicate hashes and eliminating search biases.