(r69) JaysJournal < BETA

May 2010 archive

June 2010 archive

July 2010 archive

07/30/10

Started to implement the new StaticDNASequence. The build process is going to get a little ugly, but the ranks should still be relatively clean (which is what matters). To make it easier on myself, I'm just going to do one pass through the whole sequence first to find all the N's and build the new data structure, then build the rest of the rank structure (rank blocks + count blocks). This is kind of wasteful because I can actually build the $ structure while building the rank structure (for just one pass), but I think it might be harder, given my current code. So, the build process will be a bit slower (which I think is okay...).

Finally, I did some benchmarks on the count function. Test parameters are:

read length = 35
reference = e coli genome NC_010473, minus two characters (there was one Y and one R...) (length = 4,686,136 bases)
number of reads = 10,000,000, generated randomly
- this randomly generated set was copied 10 times, for a total of 100,000,000 calls to count.
random seed = 1234
times are reported in milliseconds wall time.
Build version is 64-bit, Release mode (-O3)
Memory usage is just the memory used by the rank structure

Sampling rate (bases)	Memory usage (bytes)	Type	SSE 4.2?	Time (ms)
64	1,784,768	2-bit	No	27,212
256	1,339,168	2-bit	No	46,663
64	2,973,048	3-bit	No	31,454
256	2,081,848	3-bit	No	52,876
2048	1,821,912	3-bit	No	156,832
64	1,784,768	2-bit	Yes	19,924
256	1,339,168	2-bit	Yes	33,002
64	2,973,048	3-bit	Yes	24,371
256	2,081,848	3-bit	Yes	38,524
2048	1,821,912	3-bit	Yes	90,328

For a (very) rough comparison, saligner, using readaligner's FM index, takes 64.461 seconds to do an exact mapping for 2.7 million reads (each 35 bases in length). This means it would take roughly 2,387 seconds to align 100,000,000 reads. However, this is not a very good comparison because that one's doing a locate function which is more complex, has to read from a file and write to an output file, and runs at 32 bits. Also, the rank structure might use a different sampling rate, although I believe it uses a sampling rate of 256, which is why I included that test in. So, this means we have quite a bit of room to work with!

Another note is that for the 3-bit structure, it requires a sampling rate of 2048 to be roughly comparable in memory usage. However, at this rate, it's much slower than the 2-bit structure (roughly 5-6 times slower).

To do:

Finish the $ data structure implementation in the StaticDNASequence

08/04/10

Spent all of yesterday and most of today finishing up the $ data structure for the StaticDNASequence class. I've finished testing it thoroughly (I hope). I've also integrated it with the FM Index and tested the count method to verify the output. Basically, I used the std::string::find method to look through a reference sequence (I used NC_010473, the E. coli genome) to "manually" count the occurrences of some read. Then, I used the actual FMIndex::Count method to count it and verified that they were the same. I tested a bunch of random sequences, and every one of them came out correct. So, I'm pretty sure that my rank structure and count methods are correct.

Also did a few benchmarks using the new 2-bit "enhanced" version and repeated some benchmarks from last time: Test parameters:

read length = 35
reference = e coli genome NC_010473, minus two characters (there was one Y and one R...) (length = 4,686,136 bases)
number of reads = 100,000, generated randomly
- this randomly generated set was copied 1000 times, for a total of 100,000,000 calls to count.
random seed = 1234
times are reported in milliseconds wall time.
Build version is 64-bit, Release mode (-O3)
Memory usage is just the memory used by the rank structure
There was one $ in the sequence, no N's

Sampling rate (bases)	Memory usage (bytes)	Type	SSE 4.2?	Time (ms)	Ratio to 2-bit
64	1,759,632	2-bit	No	27.071	1
256	1,320,312	2-bit	No	46,726	1
2048	1,192,176	2-bit	No	137,583	1
64	1,761,936	2-bit "special"	No	27,778	0.97
256	1,322,640	2-bit "special"	No	47,574	0.98
2048	1,194,728	2-bit "special"	No	138,833	0.99
64	2,931,176	3-bit	No	31,513	0.86
256	2,052,536	3-bit	No	53,110	0.88
2048	1,796,264	3-bit	No	156,759	0.88
64	1,759,632	2-bit	Yes	19,747	1
256	1,320,312	2-bit	Yes	32,898	1
2048	1,192,176	2-bit	Yes	74,230	1
64	1,761,936	2-bit "special"	Yes	20,471	0.96
256	1,322,640	2-bit "special"	Yes	33,809	0.97
2048	1,194,728	2-bit "special"	Yes	76,269	0.99
64	2,931,176	3-bit	Yes	24,855	0.79
256	2,052,536	3-bit	Yes	37,847	0.87
2048	1,796,264	3-bit	Yes	89,793	0.84

To do:

Start doing a Locate method.
Maybe do some code cleanup

08/06/10

Yesterday and today, I did a bunch of cleanup on both the rank structure and FM index classes. I pulled some methods out of StaticDNAStructure::Init so the code can be cleaner. I also rethought the naming scheme of a lot of variables to make the code a little more readable. For the FM index, I also pulled out methods from the FMIndex::Init method and overloaded the method to take in C++ strings (which are used by saligner currently). I also added the building of the BWT string directly into the FM index class instead of making it built externally. Currently, only one sequence is supported, but later on, there will be multi-sequence support.

Chris suggeted in an email that we prefetch ep ranks while doing both counts and locates, which might make it faster. I'll be discussing this with him later.

Finally, I finished the FMIndex::Locate function today. It works as expected, and I've tested it pretty thoroughly. Locate requires a bunch of LF mappings of each position in the BWT string to travel "backwards" until a position is reached whose position has been indexed. However, I can't actually get the specific character at the given position in the BWT to do the LF mapping. The FM index paper suggests doing an LF on both position i and i-1 for each character in the alphabet, since the only one that will differ in LF value is the character I'm currently on. I'm also thinking I could just implement a select function in my rank structure, which shouldn't be too hard.

To do:

Benchmarks! I might be able to integrate the index into saligner without too much trouble, so we can get some pretty accurate comparisons?
Implement saving/loading of the whole index. Currently, I support saving/loading of the rank structure, and the FM index is just a few more data structures so it shouldn't be too bad.

08/11/10

Implemented a save/load feature for the FM index. I'm also thinking of implementing a "partial" load feature, where only the BWT string is loaded, and the other data structures are still built. The reason for this is that the BWT string is the one that takes the most memory (and maybe time, too) to build and should be constant for all indexes, while the other structures will differ depending on sampling rates and memory restrictions. So, the BWT string can be passed around between machines (with different memory restrictions) easily, while the other data structures can't.

I also did a few preliminary benchmarks, and the times were not great on the Locate function. I think this might be because we don't implement Locate the "proper" way, which guarantees a successful locate within sampling rate number of queries. Following Chris' suggestion, I tried graphing the number of backtracks it takes before a successful location is found on a random readset, and here are the results:

The sequence length was 65,751,813 bases long and consisted of the E. coli genome (4.6 Mbp) repeated multiple times. Reads were randomly generated 10-base sequences, and only the first 10 matches were "located". This was run at a sampling rate of 64 bases, and the average distance to locate was 341.68, which isn't very good. The x axis of the graph represents the distance it takes to locate, and the y axis is the number of aligments with that distance (it's logged). There were a total

Reference length = 65,751,813 bp (consisting of the E. coli genome repeated multiple times)
Randomly generated 10-base sequences, only the first 10 alignments were "located"
Run at a sampling rate of 64 bases for the locate structure
Average distance = 341.68, which isn't very good
x-axis is the distance it takes to locate, y-axis is the number of alignments with that distance, semi-logged
855,150 locates performed in total

Locate buckets

To do:

Make a "proper" locate structure and compare.

Topic revision: r69 - 2010-08-11 - jayzhang