Difference: JaysJournal (51 vs. 52)

Revision 522010-07-08 - jayzhang

Line: 1 to 1

May 2010 archive

June 2010 archive

Line: 43 to 43

Investigate above behaviour a little more.
Start implementing precomputed blocks.

Added:

>
>

07/06/10

Had a discussion with Chris about experimental protocols. Turns out my experiment yesterday pretty much all fits in the cache (which is 6 MB), so there wouldn't be that much of a benefit between sequential (interleaved) and parallel (normal). So we decided on a new protocol:

Use larger sequences (>=16,000,000 bases)
Use random access. More specifically, build an array of random numbers, then use those numbers to call random ranks. This should prevent streaming.
Try both hardware and software popcount.

Also, I finished implementing count blocks; currently, each 64-bit block has a count block (which is 32 bits), with 4 count blocks per rank block, one for each base.

So here are the results, run on skorpios at -O3, with 40,000,000 rank calls each. Times are wall time:

Sequence length (*10^6)	Normal (ms)	Interleaved	Normal + HW popcount	Interleaved + HW popcount
1	792	750	244	237
2	864	861	265	250
4	924	935	274	255
8	1134	1063	449	363
16	1567	1460	752	639
64	3857	3018	1242	1034
256	4495	3910	1901	1358
2048	5332	4639	2915	1957

Note that the 2048 million times may not be quite as accurate. The test at 1024M segfaulted for all four implementations (later note: turns out this was a problem in the driver, where I used a signed int by accident).

To do:

Try out using a different count structure (only store 3/4 bases, and calculate the last base) and see the performance hit.
Try out interleaving the count structure as well.
Implement adjustable sampling rates.

07/07/10

Implemented another structure with interleaved counts, as well as a count structure with only 3 bases. Since skorpios is down right now, I just did some preliminary testing with my laptop (32-bit, Intel Core 2 Duo, no hardware popcount, 6 MB cache, 3.2 GB RAM).

Again, times are in wall time, compiled at -O3, single threaded and run with 50,000,000 ranks each.

Sequence length (*10^6)	Normal (ms)	Interleaved	Interleaved + counts	Interleaved, 3-base
1	1046	985	1105	1013
8	1382	1202	1208	1245
16	3276	2768	2137	2522
32	5059	4525	3556	4474
64	5708	5100	4269	5292
128	5901	5303	4617	5545
256	5985	5387	4834	5686
512	6111	5556	4989	5786
1024	7967	7392	6690	7653

The 1024M times are again a bit dubious. I fixed the segfault bug, but I think my computer was running out of memory, and it looks like a big jump in time. Regardless, I'm going to test these again on skorpios when it gets back. It doesn't look like there's too much of a performance hit with the 3-base implementation, but if we stick with interleaved counts, there's really no point in going that route, since the 4 32-bit counts fit nicely into 2 64-bit "rank blocks". Using only 3 counts wouldn't do much for memory, since there will still be those 32 bits available.

I also experimented with some different popcount functions, found on the Wikipedia entry. So far, it seems like the current popcount function (which I stole from readaligner) is the faster than the one proposed by Wikipedia. However, the current popcount function relies on a table of values, which might make it slower if there's a lot of cache misses on it, though it outperformed Wikipedia's function even at 1024M bases (don't have the exact values, another thing to benchmark).

Finally, I started implementing the sampling rates; there's currently a bug somewhere, but it shouldn't be too bad to fix.

PS: Before I forget, I realized I did actually lock the whole single end output processing section in my multithreaded driver (I originally thought I just locked the part where it writes to output, whereas now I realized that it locks before that, when it's processing the SamEntry into a string). I should experiment with locking it at a different place (though at this point, it's not that easy, unless I want to put locks in the sam file class itself.)

To do:

Benchmark above on skorpios.
Finish sampling rate implementation.
Get started on 32-bit implementation?

View topic | History: r73 < r72 < r71 < r70 | More topic actions...