> > | 07/06/10
Had a discussion with Chris about experimental protocols. Turns out my experiment yesterday pretty much all fits in the cache (which is 6 MB), so there wouldn't be that much of a benefit between sequential (interleaved) and parallel (normal). So we decided on a new protocol:
- Use larger sequences (>=16,000,000 bases)
- Use random access. More specifically, build an array of random numbers, then use those numbers to call random ranks. This should prevent streaming.
- Try both hardware and software popcount.
Also, I finished implementing count blocks; currently, each 64-bit block has a count block (which is 32 bits), with 4 count blocks per rank block, one for each base.
So here are the results, run on skorpios at -O3 , with 40,000,000 rank calls each. Times are wall time:
Note that the 2048 million times may not be quite as accurate. The test at 1024M segfaulted for all four implementations (later note: turns out this was a problem in the driver, where I used a signed int by accident).
To do:
- Try out using a different count structure (only store 3/4 bases, and calculate the last base) and see the performance hit.
- Try out interleaving the count structure as well.
- Implement adjustable sampling rates.
07/07/10
Implemented another structure with interleaved counts, as well as a count structure with only 3 bases. Since skorpios is down right now, I just did some preliminary testing with my laptop (32-bit, Intel Core 2 Duo, no hardware popcount, 6 MB cache, 3.2 GB RAM).
Again, times are in wall time, compiled at -O3 , single threaded and run with 50,000,000 ranks each.
The 1024M times are again a bit dubious. I fixed the segfault bug, but I think my computer was running out of memory, and it looks like a big jump in time. Regardless, I'm going to test these again on skorpios when it gets back. It doesn't look like there's too much of a performance hit with the 3-base implementation, but if we stick with interleaved counts, there's really no point in going that route, since the 4 32-bit counts fit nicely into 2 64-bit "rank blocks". Using only 3 counts wouldn't do much for memory, since there will still be those 32 bits available.
I also experimented with some different popcount functions, found on the Wikipedia entry. So far, it seems like the current popcount function (which I stole from readaligner ) is the faster than the one proposed by Wikipedia. However, the current popcount function relies on a table of values, which might make it slower if there's a lot of cache misses on it, though it outperformed Wikipedia's function even at 1024M bases (don't have the exact values, another thing to benchmark).
Finally, I started implementing the sampling rates; there's currently a bug somewhere, but it shouldn't be too bad to fix.
PS: Before I forget, I realized I did actually lock the whole single end output processing section in my multithreaded driver (I originally thought I just locked the part where it writes to output, whereas now I realized that it locks before that, when it's processing the SamEntry into a string). I should experiment with locking it at a different place (though at this point, it's not that easy, unless I want to put locks in the sam file class itself.)
To do:
- Benchmark above on
skorpios .
- Finish sampling rate implementation.
- Get started on 32-bit implementation?
|