Difference: RunningAligners (2 vs. 3)

Revision 32010-05-19 - jujubix

Line: 1 to 1
 
META TOPICPARENT name="NGSAlignerProject"
Assume for all instances below that a single FASTA reference ref.fasta and a single FASTQ read file reads.fastq are in the same directory as the program executable
Line: 48 to 48
  As typing these can be a real pain, I actually automate the process using batch files...
Added:
>
>
Some notes about output... The below three are considered "identical" for our purposes

baligner
SLXA-EAS1_89:1:1:672:654/1   0   gi|170079663|ref|NC_010473.1|   4017473   65   35M   *   0   0   GCTACGGAATAAAACCAGGAACAACAGACCCAGCA   cccccccccccccccccccc]c``cVcZccbSYbY   NM:i:0   MD:Z:35   AS:i:35

bwa
SLXA-EAS1_89:1:1:672:654   0   gi|170079663|ref|NC_010473.1|   4017473   37   35M   *   0   0   GCTACGGAATAAAACCAGGAACAACAGACCCAGCA   cccccccccccccccccccc]c``cVcZccbSYbY   XT:A:U   NM:i:0   X0:i:1   X1:i:0   XM:i:0   XO:i:0   XG:i:0   MD:Z:35

bowtie
SLXA-EAS1_89:1:1:672:654/1   0   gi|170079663|ref|NC_010473.1|   4017473   255   35M   *   0   0   GCTACGGAATAAAACCAGGAACAACAGACCCAGCA   cccccccccccccccccccc]c``cVcZccbSYbY   XA:i:0   MD:Z:35   NM:i:0

Some notes about the columns:

  1. name: may vary slightly only
  2. flag: should be identical
  3. reference: should be identical
  4. position: identical for unique reads only (not for multimap, to see if this is true, check the XT tag at the end bwa output
    • XT:A:U means the read is a single Unique read, which XT:A:R means it's a Randomly selected multimap read
  5. mapping quality: different (our numbers should be nearly identical to MOSAIK, another aligner)
  6. cigar: identical for unique reads only (an encoding of the aligned read)
  7. mate reference: * for single end
  8. mate position: 0 for single end
  9. insert size: 0 for single end
  10. sequence: should be identical
  11. quality: should be identical
  12. everything beyond is a tag, which vary greatly between programs but when present...
    • NM is the number of mismatches, and should be identical
    • MD is an encoding of the aligned reference, and should be identical

A more interesting example:

baligner
SLXA-EAS1_89:1:1:713:81/1   0   gi|170079663|ref|NC_010473.1|   4058875   5   17M1D18M   *   0   0   AAAAGCTGGGTGAGTGGCGATGACGCGCAATAAAA   cccccccccccccccccccccccccccJcc^^bb^   NM:i:1   MD:Z:17^G18   AS:i:32

bwa
SLXA-EAS1_89:1:1:713:81   0   gi|170079663|ref|NC_010473.1|   4058875   37   15M1D20M   *   0   0   AAAAGCTGGGTGAGTGGCGATGACGCGCAATAAAA   cccccccccccccccccccccccccccJcc^^bb^   XT:A:U   NM:i:1   X0:i:1   X1:i:0   XM:i:0   XO:i:1   XG:i:1   MD:Z:15^G20

bowtie
SLXA-EAS1_89:1:1:713:81/1   4   *   0   0   *   *   0   0   AAAAGCTGGGTGAGTGGCGATGACGCGCAATAAAA   cccccccccccccccccccccccccccJcc^^bb^   XM:i:0

In this case, bowtie returns an unmapped hit (it doesn't allow indels by default), while bwa returns an identical unique hit. Note that the CIGAR and MD are slightly different, but upon further investigation, the different is due to where the gap is introduced: we place the gap on the last G in the triplet of G's, while bwa places the gap on the first.

And finally, a multimap example, when allow_multimap is true:

baligner
SLXA-EAS1_89:1:1:721:668/1.1   0   gi|170079663|ref|NC_010473.1|   521562   62   35M   *   0   0   GCTGTAGATCTGGAAATCGCAACGGAGGAAGAAAG   ccccccccccccccccccccccccccccccbbbbb   NM:i:0   MD:Z:35   AS:i:35
SLXA-EAS1_89:1:1:721:668/1.2   0   gi|170079663|ref|NC_010473.1|   634822   62   35M   *   0   0   GCTGTAGATCTGGAAATCGCAACGGAGGAAGAAAG   ccccccccccccccccccccccccccccccbbbbb   NM:i:0   MD:Z:35   AS:i:35

bwa
SLXA-EAS1_89:1:1:721:668   0   gi|170079663|ref|NC_010473.1|   521562   0   35M   *   0   0   GCTGTAGATCTGGAAATCGCAACGGAGGAAGAAAG   ccccccccccccccccccccccccccccccbbbbb   XT:A:R   NM:i:0   X0:i:2   X1:i:2   XM:i:0   XO:i:0   XG:i:0   MD:Z:35   XA:Z:gi|170079663|ref|NC_010473.1|,+634822,35M,0;gi|170079663|ref|NC_010473.1|,+633775,35M,1;gi|170079663|ref|NC_010473.1|,+520515,35M,1;

bowtie
SLXA-EAS1_89:1:1:721:668/1   0   gi|170079663|ref|NC_010473.1|   521562   255   35M   *   0   0   GCTGTAGATCTGGAAATCGCAACGGAGGAAGAAAG   ccccccccccccccccccccccccccccccbbbbb   XA:i:0   MD:Z:35   NM:i:0

Note that our programs finds two exact match hits, and lists them both with unique names.

bwa is slightly interesting:

  • it randomly selects one of the two positions we found
  • XT:A:R means that it's a Random multimap
  • X0:i:2 means that it found 2 exact matches (0 mismatches)
  • X1:i:2 means that it found 2 matches with 1 mismatch
  • XA:Z:gi|170079663|ref|NC_010473.1|,+634822,35M,0;gi|170079663|ref|NC_010473.1|,+633775,35M,1;gi|170079663|ref|NC_010473.1|,+520515,35M,1; lists the three other matches (the 1 other perfect match + the two 1 mismatch hits)

bowtie just picks one, and doesn't tell you that it's a multimap at all... (consequently, bowtie runs twice as fast as bwa)

 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2025 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback