05/06/10
I'm starting this journal 3 days late (whoops). Here's a brief overview of what happened during my first three days:
- Read through a bunch of Daniel's journal entries to get myself a little more familiar with the project so far.
- Had a meeting with Daniel and Chris where we discussed my first task:
- First, I should make or beef up the existing test suite.
- Then, I'm to make a vectorized version of the Smith-Waterman aligner that Daniel has working.
- I should also familiarize myself with existing tools (e.g. Bowtie, BWA, Mosaik, and SamTools).
- I ended up spending a lot of time reading through the various guides Daniel posted, as well as the Git guide and the Google C++ style guide to get myself more familiarized with what I should be doing and how I should be doing it.
- I also spent some time reading up on the Smith-Waterman algorithm and reading up on sequence alignment in general to get myself a little more familiar with the project.
- Checked out Daniel's existing code, especially
local_aligner.cc
bit (which I'm assuming my vectorized SW will be based a lot off of). I can understand the majority of the code, although I'm a little iffy on some of the gap penalty calculations.
So after all that, I think I'm finally ready to do a little coding. Today, I'll mostly be working on the test suite and trying to get that out of the way.
End-of-the-day post:
I've looked through the local aligner source code in a bit more detail and I think I know my way around it pretty well now. I'm still a little unclear on the free gap policies (I understand the premise and a little of how it works, but I still need to wrap my head around some of the cases in the algorithm), but Chris told me not to worry too much about it, as it's not a priority for now.
I've managed to modify Daniel's
test/correctness_test.cc
so that it works on the existing code, and all the test cases pass. The test cases Daniel wrote were pretty complete but only tested the
ScoreOnlyAlignment
method in the local aligner. So next up will be writing some test cases for the
FullAlignment
, which may be a little harder since it doesn't just output numbers. I've written a few tests so far, and I think I may have found one or two bugs (but I'm going to have to look into that some more to see if it was my test case or the code that was failing). So tomorrow, I will mostly be doing...
- Finishing up the test cases for the
FullAlignment
method.
- Fixing any bugs that might come up along the way.
- If I get a chance...maybe even start looking into the vectorized methods!
05/07/10
Finished writing the test cases for the local aligner and corrected a very minor bug I found along the way. I mostly used Daniel's test cases for the
ScoreOnlyAlignment
method, but added the
FullAlignment
method to the tests. However, I didn't manage to get middle-of-sequence mismatch and gap tests working, since there are so many different possibilities for the alignment to turn out, and it depends very heavily on the scoring method (and the values of the different penalties) as well.
Chris also gave me the source code for a vectorized Smith-Waterman function (from the SHRiMP library) and I spent the rest of the afternoon looking at that and trying to understand it. I found the documentation for the SSE2 functions on the Intel website (link to come later, not on the right now). From my understanding (and Chris' explanation), the function goes through the matrix using a diagonal vector ("positive slope") operating on 8*16bit values, and calculates the highest score from that vector. It uses the affine gap penalty method to calculate scores. From my discussion with Chris, I should now be working on:
- Integrating the vectorized Smith-Waterman function into the NGSA library (I can either try using the source code given or make my own).
- Configuring CMake so that it detects support for SSE2 (and maybe SSE3) so that it will compile either the vectorized Smith-Waterman or non-vectorized Smith-Waterman, depending on the machine.
- Take a look at the banded Smith-Waterman function and try integrating that, too, perhaps. This function only checks a "strip" of cells going down the diagonal of the score matrix, since in practice, alignments aren't as useful when there are large variations from a full match.
- Finally, in the distant future, start thinking about the free gap algorithms and vectorizing them.
05/10/10
I spent the morning trying to integrate the SHRiMP vectorized Smith-Waterman code into a new local aligner class, but I couldn't get it to pass the test cases and kept getting segmentation faults, etc. So, I decided it was more trouble than it was worth and to try building my own version of the vectorized Smith-Waterman directly. I also had a lot of trouble getting CMake to do the right thing, but I later figured out it was because I forgot to use the
-msse2 -msse3
compiler flags.
I'm about a quarter complete on a score-only vectorized version for the local aligner, and I've got a basic game plan laid out for completing it; I've also found that the best way to understand some code is to try to code it yourself! I now have a much firmer grasp on the whole SHRiMP aligner and on why it does some of the things it does (for example, one question I had was why it stored values for
b_gap
, but no
a_gap
array). I think by tomorrow, I can:
- Get the score-only vectorized Smith-Waterman finished and hopefully tested out.
- Time permitting, get started on the full alignment with traceback.
- Read up on CMake a bit more; it's still giving me a strange amount of trouble...
05/11/10
Today felt pretty pointless. I managed to finish my score-only Smith-Waterman alignment, but it ended up failing a bunch of tests. So I took a look at the SHRiMP code again and made a few changes based on that code to iron out the bugs. In the end, my code ended up looking almost exactly like the SHRiMP code, and yet was still buggy, so I ended up just copying and pasting the code into my version. I decided to cut and paste the SHRiMP code into my own class for a couple of reasons:
- I'll have to use that code again, but modified when I do the full alignment (with traceback).
- The SHRiMP code required a call to initialization, which was troublesome and not really required given the way the NGSA LocalAligner works.
- Since SHRiMP was in C, I couldn't wrap their functions in namespaces.
I'll still have to make sure if my decision was correct with Chris, though.
The good news is, I got all the tests to pass on the score-only alignment. Now all I have to do is get a full alignment working. I also want to try to optimize for the linear gap penalty method, since that case is much simpler than the affine gap penalty. Although the linear case is just affine with a gap open penalty set to 0, I think I can do away with some of the vectors in the calculation, and so make it a little faster. That'll be a low priority, though. So for tomorrow, I think I'll be able to finish:
- The full Smith-Waterman alignment (vectorized)
- Hopefully get that to pass all the tests (i.e. make the behaviour the same as the non-vectorized version)
- Take a look at optimizing for linear gap penalty if I have time.
05/12/10
Chris told me that the full alignment was a lower priority, since the score-only alignment is used much more often. So, I've tried optimizing for linear gap penalty instead. I managed to improve performance by not doing a few calculations required in the affine and also improve the memory usage slightly (by not having to use one genome-length-size array).
I did a few preliminary tests, using a randomly generated genome/read sequences of length 1000 each. Over 1000 iterations, my results were:
- 38.836576365 seconds using affine gap penalty
- 28.907083597 seconds using linear gap penalty
So I get a roughly 25% decrease in calculation time. Tests using more/fewer iterations and using different genome/read lengths show more or less the same amount.
Now it's on to doing the "banded" algorithm. I'm going to try making a non-vectorized version first, then move on to vectorized. I'm not really sure how I'm going to test it though, since the results are going to be a little bit different...