Difference: JaysJournal (51 vs. 52)

Revision 522010-07-08 - jayzhang

Line: 1 to 1
 May 2010 archive

June 2010 archive

Line: 43 to 43
 
  • Investigate above behaviour a little more.
  • Start implementing precomputed blocks.
Added:
>
>

07/06/10

Had a discussion with Chris about experimental protocols. Turns out my experiment yesterday pretty much all fits in the cache (which is 6 MB), so there wouldn't be that much of a benefit between sequential (interleaved) and parallel (normal). So we decided on a new protocol:
  • Use larger sequences (>=16,000,000 bases)
  • Use random access. More specifically, build an array of random numbers, then use those numbers to call random ranks. This should prevent streaming.
  • Try both hardware and software popcount.

Also, I finished implementing count blocks; currently, each 64-bit block has a count block (which is 32 bits), with 4 count blocks per rank block, one for each base.

So here are the results, run on skorpios at -O3, with 40,000,000 rank calls each. Times are wall time:

Sequence length (*10^6) Normal (ms) Interleaved Normal + HW popcount Interleaved + HW popcount
1 792 750 244 237
2 864 861 265 250
4 924 935 274 255
8 1134 1063 449 363
16 1567 1460 752 639
64 3857 3018 1242 1034
256 4495 3910 1901 1358
2048 5332 4639 2915 1957
Note that the 2048 million times may not be quite as accurate. The test at 1024M segfaulted for all four implementations (later note: turns out this was a problem in the driver, where I used a signed int by accident).

To do:

  • Try out using a different count structure (only store 3/4 bases, and calculate the last base) and see the performance hit.
  • Try out interleaving the count structure as well.
  • Implement adjustable sampling rates.

07/07/10

Implemented another structure with interleaved counts, as well as a count structure with only 3 bases. Since skorpios is down right now, I just did some preliminary testing with my laptop (32-bit, Intel Core 2 Duo, no hardware popcount, 6 MB cache, 3.2 GB RAM).

Again, times are in wall time, compiled at -O3, single threaded and run with 50,000,000 ranks each.

Sequence length (*10^6) Normal (ms) Interleaved Interleaved + counts Interleaved, 3-base
1 1046 985 1105 1013
8 1382 1202 1208 1245
16 3276 2768 2137 2522
32 5059 4525 3556 4474
64 5708 5100 4269 5292
128 5901 5303 4617 5545
256 5985 5387 4834 5686
512 6111 5556 4989 5786
1024 7967 7392 6690 7653

The 1024M times are again a bit dubious. I fixed the segfault bug, but I think my computer was running out of memory, and it looks like a big jump in time. Regardless, I'm going to test these again on skorpios when it gets back. It doesn't look like there's too much of a performance hit with the 3-base implementation, but if we stick with interleaved counts, there's really no point in going that route, since the 4 32-bit counts fit nicely into 2 64-bit "rank blocks". Using only 3 counts wouldn't do much for memory, since there will still be those 32 bits available.

I also experimented with some different popcount functions, found on the Wikipedia entry. So far, it seems like the current popcount function (which I stole from readaligner) is the faster than the one proposed by Wikipedia. However, the current popcount function relies on a table of values, which might make it slower if there's a lot of cache misses on it, though it outperformed Wikipedia's function even at 1024M bases (don't have the exact values, another thing to benchmark).

Finally, I started implementing the sampling rates; there's currently a bug somewhere, but it shouldn't be too bad to fix.

PS: Before I forget, I realized I did actually lock the whole single end output processing section in my multithreaded driver (I originally thought I just locked the part where it writes to output, whereas now I realized that it locks before that, when it's processing the SamEntry into a string). I should experiment with locking it at a different place (though at this point, it's not that easy, unless I want to put locks in the sam file class itself.)

To do:

  • Benchmark above on skorpios.
  • Finish sampling rate implementation.
  • Get started on 32-bit implementation?
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2025 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback