> > | 05/14/10
I finished the vectorized banded implementation, and all test cases pass! I expected it to take longer and was surprised when everything passed. The implementation works only with square matrices and uses vectors of sixteen 8-bit elements, so it should be a substantial speed-up over the regular banded implementation. The odd thing about 8-bit vectors is that the SSE2 instruction set is missing a few operations that exist for 16-bit vectors. For example, there is no 8-bit insert or extract function, so I ended up casting the __m128i's to an int8_t array and inserting/extracting manually; I'm not sure that's a good idea, though. There is also no max function for signed 8-bit integers, so I had to write my own.
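For reference, here is a minimal sketch of how these workarounds can look when restricted to SSE2. It is not the code from this project: the helper names (`max_epi8_sse2`, `extract_epi8_sse2`, `insert_epi8_sse2`) are hypothetical, and the insert/extract helpers go through a store/load on a temporary array rather than a pointer cast, which is an assumption made here to sidestep aliasing and alignment questions.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Lane-wise max of sixteen signed 8-bit values using only SSE2.
 * Hypothetical helper: compare, then select a or b per lane. */
static inline __m128i max_epi8_sse2(__m128i a, __m128i b)
{
    __m128i a_gt_b = _mm_cmpgt_epi8(a, b);            /* 0xFF where a > b */
    return _mm_or_si128(_mm_and_si128(a_gt_b, a),     /* keep a where a > b */
                        _mm_andnot_si128(a_gt_b, b)); /* keep b elsewhere */
}

/* Read one signed 8-bit lane (0..15) by spilling the vector to memory. */
static inline int8_t extract_epi8_sse2(__m128i v, int lane)
{
    int8_t tmp[16];
    _mm_storeu_si128((__m128i *)tmp, v);
    return tmp[lane];
}

/* Return a copy of v with one signed 8-bit lane (0..15) replaced by x. */
static inline __m128i insert_epi8_sse2(__m128i v, int8_t x, int lane)
{
    int8_t tmp[16];
    _mm_storeu_si128((__m128i *)tmp, v);
    tmp[lane] = x;
    return _mm_loadu_si128((const __m128i *)tmp);
}
```

The memory round trip is likely slower than a native instruction, so helpers like these are probably best kept out of the innermost loop where possible.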
For Monday:
- Edit the CMakeLists.txt files so that they compile the vector or non-vector versions based on whether SSE2 is available.
- Polish up all the code I've written and merge.
- Add in my improvements to the original local aligner.
|