Difference: JaysTermTwoJournal (1 vs. 11)

Revision 11 2011-02-11 - jayzhang

Line: 1 to 1

META TOPICPARENT	name="NGSAlignerProject"

01/05/2011

Started reading more in-depth into CUDA. Made some notes (I'll post them later).

Line: 71 to 71

On the flip side, I'm also working on getting a testing framework set up so I can verify the correctness of my results. I'm thinking I'll just generate some random sequences, run them through one of my existing local aligners to get the scores. Then, I'll just make sure the scores match.

Added:

>
>

02/10/2011

Finally finished getting all the gpuocelot stuff in place and the trace generators working. Now to start getting some tests written.

I guess I'll have two different sets of tests. One set will just be a bunch of pairs of sequences, one being the read and the other being the reference (with some mismatches/gaps inserted). The other set will have each read matched with multiple candidate alignments (a variable number per read). The second set should more accurately simulate the alignment process. To verify, I'm just going to run the same alignments through the existing local aligners and check that the scores match.
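A minimal sketch of how the two test sets could be generated, written in Python so it runs without a GPU. All names here (=make_pair_set=, =make_candidate_set=, =mutate=) and the mismatch/gap rates are hypothetical choices for illustration, not part of the NGSA library.

```python
import random

BASES = "ACGT"


def random_sequence(length, rng):
    """Return a uniformly random DNA string of the given length."""
    return "".join(rng.choice(BASES) for _ in range(length))


def mutate(read, rng, mismatch_rate=0.05, gap_rate=0.02):
    """Copy `read` while inserting mismatches and gaps.

    The rates are illustrative assumptions, not values from the journal.
    """
    out = []
    for base in read:
        roll = rng.random()
        if roll < mismatch_rate:
            # Mismatch: substitute a different base.
            out.append(rng.choice([b for b in BASES if b != base]))
        elif roll < mismatch_rate + gap_rate / 2:
            # Gap (deletion): drop this base from the copy.
            continue
        elif roll < mismatch_rate + gap_rate:
            # Gap (insertion): keep the base and add a random extra one.
            out.append(base)
            out.append(rng.choice(BASES))
        else:
            out.append(base)
    return "".join(out)


def make_pair_set(n_pairs, read_len, seed=0):
    """First test set: one (read, reference) pair per test case."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        read = random_sequence(read_len, rng)
        pairs.append((read, mutate(read, rng)))
    return pairs


def make_candidate_set(n_reads, read_len, max_candidates=5, seed=0):
    """Second test set: each read with a variable number of candidates."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n_reads):
        read = random_sequence(read_len, rng)
        n_candidates = rng.randint(1, max_candidates)
        cases.append((read, [mutate(read, rng) for _ in range(n_candidates)]))
    return cases
```

Seeding the generator keeps the test sets reproducible, so the same pairs can be fed to both the GPU kernel and an existing local aligner and the scores compared afterwards.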

-- Main.jayzhang - 05 Jan 2011

Revision 10 2011-02-07 - jayzhang

Line: 1 to 1

META TOPICPARENT	name="NGSAlignerProject"

01/05/2011

Started reading more in-depth into CUDA. Made some notes (I'll post them later).

Line: 56 to 56

Optimize!
Figure out the efficiency of my current implementation

Added:

>
>

02/07/2011

Haven't done much updating lately. As per Chris' suggestion, I've been working on finding ways to assess and benchmark performance gains in the GPGPU kernel code. I've also finished a working version of a full Smith-Waterman alignment (no affine gap scores yet).

Luckily, it turns out that my emulator, =gpuocelot=, is able to do some pretty good performance benchmarks. Some numbers that can be looked at are:

Occupancy (given as a percentage) - this one is generated by a spreadsheet provided by Nvidia. It is basically a measure of how many warps are able to be active at a given time versus the maximum number of allowable warps. This is determined by how much shared memory and how many registers each thread uses, as these are shared resources. The effects of low occupancy, however, are lessened by increasing the number of thread blocks (i.e. kernels should run at least one thread block per multiprocessor). I believe the end aligner will actually use as many thread blocks as possible (one thread block per read, and I can scale the number of reads up however much I want), so occupancy might not be as much of an issue.
Activity factor - I'm not sure if this number is given specifically by gpuocelot; it might just be a number used for other calculations. Basically, activity factor is determined by branching; it is the ratio of the average number of threads active at one time to the maximum number of threads. Activity factor decreases when there are more divergent branches (e.g. if a branch splits a warp in two, then the activity factor may be something like 50%).
Memory intensity - basically, a measure of how much the global memory is being used. Lower memory intensities mean the kernel is more compute-bound than memory-bound.
Memory efficiency - measures how efficiently global memory is being accessed. I think divergent branches and uncoalesced accesses may lower this number (bank conflicts apply to shared memory rather than global memory).
Inter-thread dataflow - measures how much the shared memory is being used. This number will affect the aligner quite a bit, I think, since I make use of shared memory a lot for each alignment.
Parallelism - measures MIMD and SIMD parallelism. These values are a measure of how well the kernel scales if we add more available multiprocessors. MIMD parallelism pretty much depends on the number of blocks (i.e. what benefit adding more multiprocessors will give), while SIMD parallelism is determined by the efficiency of the kernel within a block (I think this is akin to the benefit adding more warps will give? I'm not sure on this count). I believe the SIMD parallelism is a reflection of the activity factor.
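As a rough illustration of the occupancy figure in the list above, here is a minimal Python sketch of the ratio of active warps to the maximum number of warps on one multiprocessor. The function name =occupancy= is hypothetical, and the default hardware limits (24 warps, 8 blocks, 8192 registers, 16 KB of shared memory per multiprocessor) are assumptions chosen for illustration. The sketch also ignores allocation granularity, so Nvidia's spreadsheet may give different numbers.

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_warps=24, max_blocks=8, total_regs=8192,
              total_smem=16384, warp_size=32):
    """Estimate occupancy = active warps / maximum warps per multiprocessor.

    All default limits are assumed values for illustration only.
    """
    # Warps needed by one block (ceiling division).
    warps_per_block = -(-threads_per_block // warp_size)

    # Each shared resource caps how many blocks can be resident at once.
    limits = [max_blocks, max_warps // warps_per_block]
    if regs_per_thread > 0:
        limits.append(total_regs // (regs_per_thread * threads_per_block))
    if smem_per_block > 0:
        limits.append(total_smem // smem_per_block)

    resident_blocks = min(limits)
    return resident_blocks * warps_per_block / max_warps
```

Under these assumed limits, doubling the registers per thread from 10 to 20 for a 256-thread block drops the number of resident blocks from three to one, which is the kind of effect the occupancy number is meant to expose.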

So there are a lot of numbers being thrown around. I found this [[http://www.gdiamos.net/papers/iiswcOcelot.pdf][paper]] to be helpful in this regard. I'm still having a little trouble getting all these numbers to actually display, so this is still an area I'm working on.

-- Main.jayzhang - 05 Jan 2011

Revision 9 2011-01-24 - jayzhang

Line: 1 to 1

META TOPICPARENT	name="NGSAlignerProject"

01/05/2011

Started reading more in-depth into CUDA. Made some notes (I'll post them later).

Line: 46 to 46

To do:

Start thinking about Smith-Waterman on the GPU?

Added:

>
>

01/20/2011

I managed to implement a basic version of Smith-Waterman on the GPU. This version uses linear gap penalties only (not affine). Currently, I just create a matrix where the reference is on the x axis and the read is on the y axis. I then create one thread per read base and align one sequence in each thread block. The alignment just sweeps through the matrix horizontally with a diagonal vector, then spits out the maximum score. It's also not completely optimized yet (there are some __syncthreads() calls where I don't need them, and I don't do a parallel reduction for the max call), but it works!
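The diagonal sweep can be sketched on the CPU as follows. This is a Python illustration under assumed scoring values (match 1, mismatch -1, gap -1), not the actual kernel: the loop index =i= stands in for a thread (one per read base), and every cell on one anti-diagonal depends only on the two previous anti-diagonals, which is what lets the threads work in parallel. The function names are hypothetical, and a plain row-by-row version is included as the reference to check against.

```python
def sw_score_rowwise(read, ref, match=1, mismatch=-1, gap=-1):
    """Plain row-by-row Smith-Waterman score with linear gap penalties."""
    prev = [0] * (len(ref) + 1)
    best = 0
    for i in range(1, len(read) + 1):
        cur = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            s = match if read[i - 1] == ref[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def sw_score_wavefront(read, ref, match=1, mismatch=-1, gap=-1):
    """Same recurrence, computed one anti-diagonal at a time.

    prev1[i] and prev2[i] hold row i's cell on the previous two
    anti-diagonals; index 0 is the all-zero boundary row.
    """
    m, n = len(read), len(ref)
    prev2 = [0] * (m + 1)
    prev1 = [0] * (m + 1)
    best = 0
    for d in range(2, m + n + 1):  # d = i + j, with 1-based i and j
        cur = [0] * (m + 1)
        # On a GPU, each i in this range would be handled by its own thread.
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            s = match if read[i - 1] == ref[j - 1] else mismatch
            cur[i] = max(0,
                         prev2[i - 1] + s,    # diagonal neighbour (i-1, j-1)
                         prev1[i - 1] + gap,  # neighbour above (i-1, j)
                         prev1[i] + gap)      # neighbour to the left (i, j-1)
            best = max(best, cur[i])
        prev2, prev1 = prev1, cur
    return best
```

Keeping only the last two anti-diagonals is enough for a score-only pass; the running =best= plays the role of the final max over all cells.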

The limitation with doing an alignment this way is that I'm limited by the number of threads I can have. I think the current limit is 512 threads/block (with at most 768 threads resident per multiprocessor), which means I can only align reads of at most 512 bases. Of course, this isn't such a big problem, but Chris suggested a method to align reads so that I don't get as much "waste" in the padding regions, which may require more threads. More on this later, I guess.

To do:

Implement Affine
Optimize!
Figure out the efficiency of my current implementation

-- Main.jayzhang - 05 Jan 2011

Revision 8 2011-01-18 - jayzhang

Line: 1 to 1

META TOPICPARENT	name="NGSAlignerProject"

01/05/2011

Started reading more in-depth into CUDA. Made some notes (I'll post them later).

Line: 40 to 40

To do:

GPU stuff.

Added:

>
>

01/18/2011

I started playing around with CUDA. I've decided to start a new, empty branch and write the CUDA stuff separately from the main NGSA library so I don't have to worry about getting CMake to compile CUDA stuff. So far, I've just been reading up on standard techniques, and I've also implemented a simple Hamming distance "aligner". The aligner pretty much just copies the read and reference strings into constant global memory, then the threads compare the characters at each position (one thread per character) and write a '1' or '0' based on the result of the comparison. The second phase just adds all the comparison results together to get the total number of mismatches. The addition uses a simple parallel reduction algorithm I found on these slides.
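The two phases can be sketched on the CPU like this. It is a Python illustration with hypothetical function names, not the kernel itself, and the tree-shaped reduction shown is one common scheme that may differ from the one on the slides. Each inner-loop iteration stands in for one thread, and on the GPU a barrier would separate the passes.

```python
def mismatch_flags(read, ref):
    """Phase 1: one 'thread' per character writes 1 for a mismatch, else 0."""
    return [1 if a != b else 0 for a, b in zip(read, ref)]


def tree_reduce_sum(values):
    """Phase 2: pairwise tree reduction over a buffer padded to a power of two.

    Each pass halves the number of active 'threads'; the iterations within
    one pass are independent of each other.
    """
    n = 1
    while n < len(values):
        n *= 2
    buf = list(values) + [0] * (n - len(values))
    stride = n // 2
    while stride > 0:
        for tid in range(stride):
            buf[tid] += buf[tid + stride]
        stride //= 2
    return buf[0]


def hamming_distance(read, ref):
    """Total number of mismatches between two equal-length strings."""
    return tree_reduce_sum(mismatch_flags(read, ref))
```

The reduction takes a number of passes logarithmic in the read length, instead of one long serial sum.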

To do:

Start thinking about Smith-Waterman on the GPU?

-- Main.jayzhang - 05 Jan 2011

Revision 7 2011-01-18 - jayzhang

Line: 1 to 1

META TOPICPARENT	name="NGSAlignerProject"

01/05/2011

Started reading more in-depth into CUDA. Made some notes (I'll post them later).

Line: 34 to 34

To do:

Start a very basic GPU implementation of the aligners!

Added:

>
>

01/17/2011

Started a new branch for the GPU stuff. I think the first thing I'm going to implement is an "aligner" that just finds how many straight-up matches there are. I got the emulators and everything working again, so I have my development environment set up; I forgot to write down what I did last time to get everything to compile and run properly, so I ended up having to research it all again, which was a pain. Also started reading through guides and tutorials and taking notes.

To do:

GPU stuff.

-- Main.jayzhang - 05 Jan 2011

Revision 6 2011-01-14 - jayzhang

Line: 1 to 1

META TOPICPARENT	name="NGSAlignerProject"

01/05/2011

Started reading more in-depth into CUDA. Made some notes (I'll post them later).

Line: 28 to 28

To do:

Finish the test driver

Added:

>
>

01/13/2011

Finished the test driver, and it looks like it works, even when simulating low memory conditions in the hash index. I tested using a manually generated test sequence plus some test reads, and also using the E. coli bacterial genome, and both work fine. To make it all work, I decided to add an ExtractString method to the hash index to allow extraction of the string from a position (much like the getPosition method in the FMIndex). I feel like that method could be optimized, but that can be for later.

To do:

Start a very basic GPU implementation of the aligners!

-- Main.jayzhang - 05 Jan 2011

Revision 5 2011-01-13 - jayzhang

Line: 1 to 1

META TOPICPARENT	name="NGSAlignerProject"

01/05/2011

Started reading more in-depth into CUDA. Made some notes (I'll post them later).

Line: 22 to 22

To do:

Create a small test driver

Added:

>
>

01/10/2011

Nearly finished the small test driver with a (very) small test set. I realized there's an issue with the hash index that doesn't actually allow me to get a sequence out for the local aligner, only a position, so I'll have to correct that before finishing. Otherwise, the very basic tests look good and it looks like everything is working properly.

To do:

Finish the test driver

-- Main.jayzhang - 05 Jan 2011

Revision 4 2011-01-10 - jayzhang

Line: 1 to 1

META TOPICPARENT	name="NGSAlignerProject"

01/05/2011

Started reading more in-depth into CUDA. Made some notes (I'll post them later).

Line: 16 to 16

Create a small test driver
Add FullAlignment methods into the vectorized aligners.

Added:

>
>

01/07/2011

Finished adding in full alignment methods into the vectorized aligners. Now all the aligners are completely redone and I just have to do up the test driver.

To do:

Create a small test driver

-- Main.jayzhang - 05 Jan 2011

Revision 3 2011-01-07 - jayzhang

Line: 1 to 1

META TOPICPARENT	name="NGSAlignerProject"

01/05/2011

Started reading more in-depth into CUDA. Made some notes (I'll post them later).

Line: 7 to 7

Finish up the local aligner changes and commit
Create a small test driver that ties together the aligners and the hash index to verify correctness.

Added:

>
>

01/06/2011

Finished redoing the banded local aligner. Now I just have to add FullAlignment methods into the vectorized aligners. Note that only the ScoreOnlyMax methods of the vectorized aligners actually are vectorized. The FullAlignment method is just a straight copy from the corresponding non-vectorized aligners, since FullAlignment is used much less.

Also, I took out all the kmer index/mapper stuff, since it was still using the old aligner and I didn't want to update it. Those files are really old anyway, and really haven't been updated much for a long time...

To do:

Create a small test driver
Add FullAlignment methods into the vectorized aligners.

-- Main.jayzhang - 05 Jan 2011

Revision 2 2011-01-06 - jayzhang

Line: 1 to 1

META TOPICPARENT	name="NGSAlignerProject"

01/05/2011

Changed:

<
<

Starting this journal for development into the GPU stuff...

>
>

Started reading more in-depth into CUDA. Made some notes (I'll post them later).

Things to do:

Finish up the local aligner changes and commit
Create a small test driver that ties together the aligners and the hash index to verify correctness.

-- Main.jayzhang - 05 Jan 2011

Revision 1 2011-01-05 - jayzhang

Line: 1 to 1

Added:

>
>

META TOPICPARENT	name="NGSAlignerProject"

01/05/2011

Starting this journal for development into the GPU stuff...

-- Main.jayzhang - 05 Jan 2011
