Difference: JaysJournal (68 vs. 69)

Revision 69 - 2010-08-11 - jayzhang

Line: 1 to 1

META TOPICPARENT	name="NGSAlignerProject"

May 2010 archive

Line: 88 to 88

Benchmarks! I might be able to integrate the index into saligner without too much trouble, so we can get some pretty accurate comparisons?
Implement saving/loading of the whole index. Currently, I support saving/loading of the rank structure, and the FM index is just a few more data structures so it shouldn't be too bad.

Added:

>
>

08/11/10

Implemented a save/load feature for the FM index. I'm also thinking of implementing a "partial" load feature, where only the BWT string is loaded and the other data structures are built at load time. The reason is that the BWT string takes the most memory (and maybe time, too) to build and should be the same for every index over a given reference, while the other structures differ depending on sampling rates and memory restrictions. So the BWT string can easily be passed around between machines with different memory restrictions, while the other data structures can't.
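
A rough sketch of how the partial load could be laid out, in Python just for illustration. The file layout, the function names and the idea of storing the BWT as the first section are all assumptions made for this example, not the actual index format:

```python
import io
import struct


def save_index(stream, bwt, sample_rate, sa_samples):
    """Write the BWT section first, then the parameter-dependent section.

    Hypothetical layout: [len][BWT bytes][sample_rate][count][samples...].
    """
    data = bwt.encode("ascii")
    stream.write(struct.pack("<Q", len(data)))
    stream.write(data)
    stream.write(struct.pack("<IQ", sample_rate, len(sa_samples)))
    stream.write(struct.pack("<%dQ" % len(sa_samples), *sa_samples))


def load_bwt_only(stream):
    """Partial load: read only the BWT section and stop."""
    (length,) = struct.unpack("<Q", stream.read(8))
    return stream.read(length).decode("ascii")


def load_index(stream):
    """Full load: read the BWT section, then the sampled structures."""
    bwt = load_bwt_only(stream)
    sample_rate, count = struct.unpack("<IQ", stream.read(12))
    sa_samples = list(struct.unpack("<%dQ" % count, stream.read(8 * count)))
    return bwt, sample_rate, sa_samples


if __name__ == "__main__":
    buf = io.BytesIO()
    # "annb$aa" is the BWT of "banana$"; the samples are placeholder values.
    save_index(buf, "annb$aa", 2, [6, 4, 2, 0])
    buf.seek(0)
    print(load_bwt_only(buf))
```

Because the BWT comes first, a machine with different memory restrictions can stop reading after that section and rebuild the remaining structures at whatever sampling rate it can afford.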

I also did a few preliminary benchmarks, and the times for the Locate function were not great. I think this might be because we don't implement Locate the "proper" way, which guarantees a successful locate within a number of backtracking steps bounded by the sampling rate. Following Chris' suggestion, I graphed the number of backtracks needed before a successful location is found on a random readset, and here are the results:

The reference sequence was 65,751,813 bases long and consisted of the E. coli genome (4.6 Mbp) repeated multiple times. Reads were randomly generated 10-base sequences, and only the first 10 matches were "located". This was run at a sampling rate of 64 bases, and the average distance to locate was 341.68, which isn't very good. The x axis of the graph represents the distance it takes to locate, and the y axis is the number of alignments with that distance (on a log scale). There were a total of 855,150 locates performed.

Reference length = 65,751,813 bp (consisting of the E. coli genome repeated multiple times)
Randomly generated 10-base sequences, only the first 10 alignments were "located"
Run at a sampling rate of 64 bases for the locate structure
Average distance = 341.68, which isn't very good
x-axis is the distance it takes to locate, y-axis is the number of alignments with that distance (log scale)
855,150 locates performed in total

Locate buckets

To do:

Make a "proper" locate structure and compare.
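
For reference, a small sketch of what a "proper" locate structure could look like, in Python for illustration only. It assumes the usual scheme of sampling suffix array entries whose text position is a multiple of the sampling rate, which bounds the number of backtracks by the rate. Sampling every k-th row is included purely as a contrast; I'm not claiming that is what the current implementation does. All function names are made up for the example, and the suffix array is built naively, so this only works on toy inputs:

```python
def suffix_array(text):
    """Naive suffix array; text must end with a unique '$' sentinel."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lf(text, sa):
    """LF mapping: the row of suffix i maps to the row of suffix i - 1."""
    bwt = [text[i - 1] for i in sa]  # text[-1] is the sentinel when i == 0
    first = {}
    total = 0
    for c in sorted(set(bwt)):
        first[c] = total  # number of characters smaller than c
        total += bwt.count(c)
    seen = {}
    lf = []
    for c in bwt:
        lf.append(first[c] + seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    return lf


def sample_by_text_position(sa, rate):
    """'Proper' sampling: keep rows whose text position is a multiple of rate."""
    return {row: pos for row, pos in enumerate(sa) if pos % rate == 0}


def sample_by_row(sa, rate):
    """Contrast case: keep every rate-th row, no bound on backtracks."""
    return {row: pos for row, pos in enumerate(sa) if row % rate == 0}


def locate(row, lf, samples, n):
    """Backtrack with LF until a sampled row is hit.

    Returns (text position, number of backtracks).
    """
    steps = 0
    while row not in samples:
        row = lf[row]
        steps += 1
    return (samples[row] + steps) % n, steps


def mean_steps(sa, lf, samples):
    """Average number of backtracks over every row of the index."""
    n = len(sa)
    return sum(locate(row, lf, samples, n)[1] for row in range(n)) / n


if __name__ == "__main__":
    text = "ACGTACGTTGCAACGT" * 4 + "$"
    sa = suffix_array(text)
    lf = build_lf(text, sa)
    rate = 8
    print("text-position sampling:", mean_steps(sa, lf, sample_by_text_position(sa, rate)))
    print("row sampling:", mean_steps(sa, lf, sample_by_row(sa, rate)))
```

With text-position sampling, each backtrack moves one position to the left in the text, so a sampled position is reached in fewer than `rate` steps. That bound is what the comparison against the current Locate should check.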

META FILEATTACHMENT	attr="h" comment="Rank graph" date="1278719684" name="rank-graph.png" path="rank-graph.png" size="32863" user="jayzhang" version="1.1"
META FILEATTACHMENT	attr="h" comment="" date="1278982205" name="rank-graph2.png" path="rank-graph2.png" size="26249" user="jayzhang" version="1.1"
META FILEATTACHMENT	attr="h" comment="" date="1280528212" name="rank-graph3.png" path="rank-graph3.png" size="23549" user="jayzhang" version="1.1"

Added:

>
>

META FILEATTACHMENT	attr="h" comment="" date="1281556653" name="locate-buckets.png" path="locate-buckets.png" size="79889" user="jayzhang" version="1.1"
