JaysJournal < BETA

May 2010 archive

06/01/10

Here are a few more timings, this time with all binaries and data files in the /tmp/ folder and times measured with /usr/bin/time.

Using commit 5a82392eeeb5653d1b75b7a8dd7ccdd499605cfa and saligner, with the 125000 readset (real time)

Type	k	max_results	Unmapped (+1662 invalid)	Search Time (s)	Avg time/read (s)
Exact	-	1	17732	4.727	0.000037816
Exact	-	10	17732	5.01	0.00004008
Mismatch	2	1	5466	8.251	0.000066008
Mismatch	2	10	5466	9.438	0.000075504
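
For reference, the last column is just the search time divided by the number of reads in the set (125000 here). A minimal sketch of that arithmetic follows; `avg_time_per_read` is a hypothetical helper name, not something from saligner.

```cpp
#include <cassert>
#include <cstddef>

// Average search time per read, in seconds.
// total_seconds: search time reported for the whole run.
// num_reads:     number of reads in the read set (must be non-zero).
double avg_time_per_read(double total_seconds, std::size_t num_reads) {
    assert(num_reads > 0 && "read set must not be empty");
    return total_seconds / static_cast<double>(num_reads);
}
```

For the first row, `avg_time_per_read(4.727, 125000)` works out to 0.000037816 s.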

Using BWA with flags -l 99 -k 2 outputting to SAM:

Type	k	max_results	Unmapped	Search Time (s)	Avg time/read (s)
Mismatch, I guess	2?	?	?	2.832	0.000022656

Using readaligner with flags -k 2 --sam --fastq:

Type	k	max_results	Unmapped	Search Time (s)	Avg time/read (s)
Mismatch	2	1	?	6.662	0.000053296

Bowtie with version info:

64-bit
Built on sycamore.umiacs.umd.edu
Sat Apr 24 15:56:29 EDT 2010
Compiler: gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)
Options: -O3 -m64  -Wl,--hash-style=both
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}

Mode	Flags	Mapped	Unmapped	Search Time (s)	Average Search Time (s)
0-Mismatch	`-v 0`	105606	19394	1.310	0.00001048
1-Mismatch	`-v 1`	115631	9369	1.850	0.0000148
2/3-Mismatch	`-v 2`	118356	6644	1.878	0.000015024
2/3-Mismatch	`-v 3`	119430	5570	3.441	0.000027528
Seeded Quality	`-n 0`	112317	12683	1.478	0.000011824
Seeded Quality	`-n 1`	117659	7341	1.679	0.000013432
Seeded Quality	`-n 2`	118356	6644	1.974	0.000015792
Seeded Quality	`-n 3`	118330	6670	2.560	0.00002048

Also had a meeting with Chris today, where he clarified some of his previous communications. Our first priority should now be to compare Bowtie's exact matching functions to saligner's or readaligner's exact matching. Daniel took on the job of writing a better locate function for exact matches. I will be looking for more areas to optimize, fixing some bugs, and writing a better saligner wrapper that actually takes arguments. Also, since mismatches in CIGARs are recorded no differently from matches, both mismatch and exact alignments can use a predefined CIGAR (e.g. 35M), instead of comparing reads to references to generate our own CIGAR.

To do:

In Exact and MismatchMappers, just make a CIGAR on the fly instead of passing stuff to SamEntry.
More low level optimizations.
Make a better saligner wrapper.
Fix a bug in CIGAR generation (ticket 53df76).

06/02/10

I made a bunch of minor changes to the IO files to optimize them a little further. Most of the changes were just reserving space for strings/vectors beforehand, cleaning up a few unneeded lines, etc. I also looked into ticket 53df76 (buggy CIGAR output on indels), and found that it's not really a bug in the CIGAR generator. The problem actually comes from using the LocalAligner after a gap alignment to generate the "dashes" that the CIGAR generator needs. The behaviour of LocalAligner is unpredictable when used in this context: sometimes it clips the start or end of the read and throws in mismatches instead of just inserting the gaps (which is actually what it's supposed to do, given that it's a local aligner). So strictly speaking, neither the CIGAR generator nor the LocalAligner is buggy; the fault lies in how the two are combined. Unfortunately, we can't really fix this in the current situation, because there's no other way to put dashes into the aligned ref/reads. The only options are to either figure out how IndelQuery does its CIGAR generation or write our own IndelQuery (which we might just do anyway). Since indel support is lower priority, I'm going to table this bug for later.
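
To illustrate why the dashes matter, here is a minimal sketch of how a CIGAR can be derived from two gapped alignment rows. It is not the project's actual generator: `cigar_from_alignment` is a hypothetical name, and the sketch assumes both rows have equal length, that '-' marks a gap, and that no column has a gap in both rows.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Build a CIGAR string from two gapped alignment rows of equal length.
//   '-' in the reference row -> insertion in the read (I)
//   '-' in the read row      -> deletion from the reference (D)
//   anything else            -> M (CIGAR does not separate match from mismatch)
std::string cigar_from_alignment(const std::string& ref_row,
                                 const std::string& read_row) {
    assert(ref_row.size() == read_row.size());
    std::string cigar;
    char run_op = 0;          // operation of the run being accumulated
    std::size_t run_len = 0;  // length of that run
    for (std::size_t i = 0; i < ref_row.size(); ++i) {
        char op = 'M';
        if (ref_row[i] == '-') op = 'I';
        else if (read_row[i] == '-') op = 'D';
        if (op == run_op) {
            ++run_len;
        } else {
            // Flush the finished run, then start a new one.
            if (run_len > 0) { cigar += std::to_string(run_len); cigar += run_op; }
            run_op = op;
            run_len = 1;
        }
    }
    if (run_len > 0) { cigar += std::to_string(run_len); cigar += run_op; }
    return cigar;
}
```

With the rows `ACG-TA` / `ACGTTA` this gives `3M1I2M`. If the aligner returns the same region without dashes (mismatches instead of gaps), a generator like this can only ever emit a single `M` run, which is consistent with the symptom described above.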

Another optimization I did today was to add a convenience method in SamEntry that generates an entry representing an exact match (as per Chris' suggestion yesterday). Because exact matches are so predictable, it saves us from having to generate the CIGAR and a bunch of the tags. We also don't need to extract the original reference from the index, since it's not really needed. Mismatches should be similar, since mismatches aren't represented differently from matches in CIGARs. We do, however, have an MD:Z tag that records mismatches (like BWA), and I'm not sure whether it is needed at all.
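
A rough sketch of the idea, using hypothetical helper names (`ungapped_cigar`, `ungapped_md`) rather than the real SamEntry interface. It assumes ungapped alignments, equal-length strings, and case-sensitive base comparison. Note the difference it exposes: for an exact match both the CIGAR and the MD:Z value follow from the read length alone, while an MD:Z value for a mismatch alignment needs the reference bases at the mismatch positions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// CIGAR for any ungapped alignment (exact or mismatch-only): "<read length>M".
std::string ungapped_cigar(std::size_t read_length) {
    return std::to_string(read_length) + "M";
}

// MD:Z value for an ungapped alignment: lengths of matching runs,
// separated by the reference base at each mismatch. A run length is
// always written, even when it is 0.
std::string ungapped_md(const std::string& ref, const std::string& read) {
    assert(ref.size() == read.size());
    std::string md;
    std::size_t run = 0;  // matching bases seen since the last mismatch
    for (std::size_t i = 0; i < ref.size(); ++i) {
        if (ref[i] == read[i]) {
            ++run;
        } else {
            md += std::to_string(run);
            md += ref[i];  // MD records the reference base, not the read base
            run = 0;
        }
    }
    md += std::to_string(run);
    return md;
}
```

For example, `ungapped_cigar(35)` gives `35M`, and `ungapped_md("ACGTACGT", "ACGAACGT")` gives `3T4`; for an exact match the MD:Z value is just the read length (`ungapped_md("ACGT", "ACGT")` gives `4`).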

Finally, I added support for arguments in saligner, where we can specify reference, reads and output for convenience, so we don't have to recompile every time we need to use a different read file.
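
A minimal sketch of what such argument handling could look like. The struct, the function name and the positional order (reference, reads, output) are assumptions for illustration; the actual saligner interface may differ.

```cpp
#include <cassert>
#include <string>

// Hypothetical container for the three paths the wrapper needs.
struct AlignerArgs {
    std::string reference;
    std::string reads;
    std::string output;
};

// Expects exactly: <program> <reference> <reads> <output>.
// Returns false (leaving `out` untouched) on any other argument count.
bool parse_args(int argc, const char* const argv[], AlignerArgs& out) {
    assert(argv != nullptr);
    if (argc != 4) return false;
    out.reference = argv[1];
    out.reads = argv[2];
    out.output = argv[3];
    return true;
}
```

A `main` would call `parse_args(argc, argv, args)` and print a usage line when it returns false.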

Anyways, I decided to run some tests on my newest changes using the 2622382 reads file, for completeness and for more accuracy, since I didn't run any tests on that file yesterday.

Under commit c6f55e33aa732c8952d1f56fa4c1fe6aa3875677, with the 2622382 reads file and Release build:

Type	k	max_results	Unmapped (+20957 invalid)	Search Time (s)	Avg time/read (s)
Exact	-	1	350717	86.634	0.000033036
Exact	-	10	350717	95.99	0.000036604

And with Bowtie (same version info as above):

Mode	Flags	Mapped	Unmapped	Search Time (s)	Average Search Time (s)
0-Mismatch	`-v 0`	2250708	371674	27.341	0.000010426

To do:

Do some profiling again... I haven't done that in a while and there have been a lot of changes to the code.
Start reading up on FM index.

06/03/10

Worked on more optimizations, this time guided by Valgrind. From KCacheGrind, I found that about 10-15% of our cycles were being used up in IO, especially making/writing SAM entries. So I went in and did a few more minor optimizations, such as using ostringstream instead of stringstream. Preliminary results (tested on my machine) show that it runs something like 10% faster now.
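
As an illustration of the kind of change involved, here is a sketch of formatting one unpaired SAM record with an output-only `std::ostringstream`. `SamFields` and `format_sam_line` are hypothetical stand-ins, not the project's SamEntry, and the sketch makes no claim about how much time this saves.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical subset of the fields needed for an unpaired SAM record.
struct SamFields {
    std::string qname;
    int flag;
    std::string rname;
    long pos;
    int mapq;
    std::string cigar;
    std::string seq;
    std::string qual;
};

// Format the 11 mandatory SAM columns as one tab-separated line.
std::string format_sam_line(const SamFields& f) {
    // QUAL must be "*" or have the same length as SEQ.
    assert(f.qual == "*" || f.qual.size() == f.seq.size());
    std::ostringstream out;  // output-only stream; we never read back from it
    out << f.qname << '\t' << f.flag << '\t' << f.rname << '\t'
        << f.pos << '\t' << f.mapq << '\t' << f.cigar << '\t'
        << "*\t0\t0\t"  // mate reference, mate position, insert size: unpaired
        << f.seq << '\t' << f.qual << '\n';
    return out.str();
}
```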

Up until now, I've really only been testing exact matches with saligner over the 125,000 reads file. While doing one final test on the 2M reads file (to verify output and to do a preliminary speed test), I noticed a lot of disk activity after about 40k reads! I'm guessing this is what cache misses look like (it could also just be the program reading from the reads file; I still have to confirm this), so I'm going to have to do some more profiling with Callgrind. Granted, my system doesn't have as much memory as skorpios, but this might be one of the areas we can improve on.

Other than that, I think it might be a good idea to start reading up on FM indexes, because I really can't do much else without knowing how the aligner works. So tomorrow, I might start on my reading (probably following in Daniel's footsteps).

To do:

Check if we're really getting lots of cache misses.
Start on my reading!

Update: Ran Cachegrind over 2M reads and didn't get very many cache misses (something like <1%), so I guess the disk activity was actually from reading the input? I'll have to check the log a bit more tomorrow. On a side note, the profiling took about 2.5 hours with 2M reads on my machine (I don't know the exact time because the built-in timer went negative), so it's not a very practical thing to do...

Topic revision: r28 - 2010-06-04 - jayzhang