> > | 05/21/10
Okay, I've been staring at the computer screen for way too long. I tried to figure out what's different between our baligner and the readaligner and makes readaligner so much faster, but I just can't seem to find it. Here are some of my findings:
- On 125,000 reads, readaligner runs in about 9-10 seconds, while baligner runs in between 15-16 seconds.
- If I comment out the lines that actually do the alignment (the call to
Query::align() ), readaligner takes <1 second, while baligner takes about 3 seconds. This means baligner has a lot more overhead when starting and finishing (i.e. not actually "aligning"), but this still doesn't account for the 6-7 second difference.
- On 1000 reads, baligner and readaligner call
Query::align() the same amount of times (if I use the Query class's own reverse functionality and comment out Daniel's). However, it seems calls after that start to diverge, but only slightly. For example, the call after Query::align() , MismatchQuery::firstStep() , gets called 3634 times on baligner, but only 3630 times on readaligner. It looks like this small difference is propagated downward through the calls, and ends in a difference of about 1000 more calls in baligner in a low-level function. This small difference in the beginning might be the source of the problem, since the increase in function calls could potentially get huge in 125,000 reads. However, I haven't profiled that yet, since it would take so long. Perhaps this is a good next step?
- baligner is much more complicated than readaligner in terms of class structure, so this might be giving us some overhead. Also, I've noticed vectors are used a lot when passing information around; this might be a source of overhead, especially since just the vectors seem to be returned, instead of a pointer to a vector. This could be a future point of optimization, but I doubt it would save on very much time.
I think the most promising area to look into right now is still the Query::align() area, and why it's calling MismatchQuery::firstStep() slightly more than the readaligner. I think I'm going to run a profile comparing 125,000 reads on both aligners, just to confirm if it is that portion of the code that's giving us the problem. Also, it might just be a bunch of small optimization problems and overhead from using vectors, etc. that could be resulting in the performance loss, although I kind of doubt it.
Todo: |