Difference: JaysJournal (68 vs. 69)

Revision 69 - 2010-08-11 - jayzhang

Line: 1 to 1
 
META TOPICPARENT name="NGSAlignerProject"
May 2010 archive
Line: 88 to 88
 
  • Benchmarks! I might be able to integrate the index into saligner without too much trouble, so we can get some pretty accurate comparisons?
  • Implement saving/loading of the whole index. Currently, I support saving/loading of the rank structure, and the FM index is just a few more data structures so it shouldn't be too bad.
Added:
>
>

08/11/10

Implemented a save/load feature for the FM index. I'm also thinking of implementing a "partial" load feature, where only the BWT string is loaded and the other data structures are built at load time. The reason is that the BWT string takes the most memory (and maybe time, too) to build and should be the same for every index over a given reference, while the other structures differ depending on sampling rates and memory restrictions. So the BWT string can easily be passed between machines with different memory restrictions, while the other data structures can't.
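To make the "partial" load idea concrete, here is a minimal Python sketch. It is not the project's code: the function names (`build_bwt`, `save_bwt`, `partial_load`, `build_rank_checkpoints`, `rank`) are hypothetical, the BWT is built naively by sorting suffixes, and the only auxiliary structure shown is a set of rank checkpoints. The point is just that one saved BWT can be reloaded with different sampling rates.

```python
import io
from collections import Counter

def build_bwt(text):
    """Naive BWT of text + '$' via sorted suffixes (tiny inputs only)."""
    s = text + "$"
    order = sorted(range(len(s)), key=lambda i: s[i:])
    return "".join(s[i - 1] for i in order)

def build_rank_checkpoints(bwt, rate):
    """Store cumulative symbol counts every `rate` BWT positions."""
    counts = Counter()
    checkpoints = []
    for i, ch in enumerate(bwt):
        if i % rate == 0:
            checkpoints.append(dict(counts))  # counts for bwt[:i]
        counts[ch] += 1
    return checkpoints

def rank(bwt, checkpoints, rate, ch, i):
    """Number of occurrences of ch in bwt[:i], using the nearest checkpoint."""
    block = min(i // rate, len(checkpoints) - 1)
    return checkpoints[block].get(ch, 0) + bwt.count(ch, block * rate, i)

def save_bwt(bwt, stream):
    """Write only the BWT string; it does not depend on any sampling rate."""
    stream.write(bwt.encode("ascii"))

def partial_load(stream, rate):
    """Read the BWT and rebuild the rate-dependent structures locally."""
    bwt = stream.read().decode("ascii")
    return bwt, build_rank_checkpoints(bwt, rate)
```

A machine with little memory could call `partial_load(stream, rate=256)` and one with plenty could call `partial_load(stream, rate=16)` on the same saved file; both answer `rank` queries identically, only the time/space trade-off changes.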

I also did a few preliminary benchmarks, and the times were not great on the Locate function. I think this might be because we don't implement Locate the "proper" way, which guarantees a successful locate within a number of queries no larger than the sampling rate. Following Chris' suggestion, I tried graphing the number of backtracks it takes before a successful location is found on a random readset, and here are the results:
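For reference, here is a minimal Python sketch of one locate scheme that does give a bound tied to the sampling rate: suffix array entries are kept for every text position that is a multiple of the rate. This is an illustration under assumptions, not the project's implementation; `build_index`, `lf` and `locate` are hypothetical names, and the index is built naively. Each LF step moves one position backwards in the text, so a sampled position is reached after at most rate - 1 steps.

```python
def build_index(text, rate):
    """Naive FM index pieces: BWT, C table, and text-position SA samples."""
    s = text + "$"
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = "".join(s[i - 1] for i in sa)
    # c_table[ch] = number of symbols in s strictly smaller than ch
    c_table, total = {}, 0
    for ch in sorted(set(s)):
        c_table[ch] = total
        total += s.count(ch)
    # Keep SA values only where the text position is a multiple of `rate`.
    samples = {row: pos for row, pos in enumerate(sa) if pos % rate == 0}
    return bwt, c_table, samples

def lf(bwt, c_table, row):
    """LF mapping: row of the suffix that starts one position earlier."""
    ch = bwt[row]
    return c_table[ch] + bwt.count(ch, 0, row)

def locate(bwt, c_table, samples, row):
    """Return (text position, backtrack steps) for a BWT row."""
    steps = 0
    while row not in samples:
        row = lf(bwt, c_table, row)
        steps += 1
    # The sampled row is `steps` positions before the original suffix.
    return samples[row] + steps, steps
```

If text positions are spread evenly, the expected number of steps under this scheme is about (rate - 1) / 2, so roughly 31.5 at a sampling rate of 64.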

The reference sequence was 65,751,813 bases long and consisted of the E. coli genome (4.6 Mbp) repeated multiple times. Reads were randomly generated 10-base sequences, and only the first 10 matches were "located". This was run at a sampling rate of 64 bases, and the average distance to locate was 341.68, which isn't very good. The x axis of the graph represents the distance it takes to locate, and the y axis is the number of alignments with that distance (on a log scale). There were a total of 855,150 locates performed.

  • Reference length = 65,751,813 bp (consisting of the E. coli genome repeated multiple times)
  • Randomly generated 10-base sequences, only the first 10 alignments were "located"
  • Run at a sampling rate of 64 bases for the locate structure
  • Average distance = 341.68, which isn't very good
  • x-axis is the distance it takes to locate, y-axis is the number of alignments with that distance (semi-log plot, with the y-axis on a log scale)
  • 855,150 locates performed in total

Locate buckets
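
A histogram like this can be computed from the raw per-locate distances with a few lines of Python. This is only a sketch: `bucket_distances` and `log_counts` are hypothetical helpers, and the bucket width is a free parameter, since the width used for the graph isn't recorded here.

```python
import math
from collections import Counter

def bucket_distances(distances, bucket_width):
    """Count locates per bucket [k * width, (k + 1) * width), keyed by bucket start."""
    buckets = Counter((d // bucket_width) * bucket_width for d in distances)
    return dict(sorted(buckets.items()))

def log_counts(buckets):
    """Base-10 log of each bucket count, for a semi-log plot."""
    return {start: math.log10(count) for start, count in buckets.items()}

def average_distance(distances):
    """Mean number of backtrack steps per locate."""
    return sum(distances) / len(distances)
```

Here `distances` would be the list of backtrack counts recorded for each locate.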

To do:

  • Make a "proper" locate structure and compare.
 
META FILEATTACHMENT attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1278982205" name="rank-graph2.png" path="rank-graph2.png" size="26249" user="jayzhang" version="1.1"
META FILEATTACHMENT attr="h" comment="" date="1280528212" name="rank-graph3.png" path="rank-graph3.png" size="23549" user="jayzhang" version="1.1"
Added:
>
>
META FILEATTACHMENT attr="h" comment="" date="1281556653" name="locate-buckets.png" path="locate-buckets.png" size="79889" user="jayzhang" version="1.1"
 