...
Code Block |
---|
|
# look at a subset of field for chrI:1000-2000 reads
# fields: 2=flags, 3=contig, 4=start, 5=mapping quality, 6=CIGAR, 9=insert size
samtools view -F 0x4 yeast_pe.sort.bam chrI:1000-2000 | cut -f 2-6,9 | \
awk 'BEGIN{FS=OFS="\t"}
{$1 = sprintf("0x%x", $1); print}' |
...
Aligners also differ in whether they report alternate alignments for multi-hit reads, and if they do, how many. Some things to keep in mind:
...
- bwa aln (global alignment) and bowtie2 with default parameters (both --local default end-to-end mode) report at most one location for a read that maps
- this will be the location with the best mapping quality and alignment
- if a given read maps equally well to multiple locations, these aligners pick one location at random
- bwa aln will always report a 0 mapping quality for these multi-hit reads
- bowtie2 will report a low mapping quality (< 10), based on the complexity (information content) of the sequence
- bwa mem (local alignment) can always report more than one location for a mapped read
- its definition of a secondary alignment is different (and a bit non-standard)
- if one part of a read maps to one location and another part maps somewhere else (e.g. because of RNA splicing), the longer alignment is marked as primary and the shorter as secondary.
- there is no way to disable reporting of secondary alignments with bwa mem.
- but they can be filtered from the sorted BAM with -F 0x100 (secondary alignment flag = 0).
- bowtie2 can be configured to report more than one location for a mapped read
- the -k N option says to report up to N alignments for a read
- most transcriptome-aware RNA-seq aligners by default report more than one location for a mapped read
- e.g. hisat2, star, tophat2.
- when reads are quantified (counted with respect to genes), multiply-mapped reads can be counted fractionally
- e.g. if a read maps to 5 genes, it can be counted as 1/5 for each of the genes
...
Expand |
---|
|
Code Block |
---|
language | bash |
---|
title | solution |
---|
| samtools view -c yeast_pe.sort.bam
samtools view -c yeast_pe.sort.filt.bam
# or to be really fancy...
echo "`samtools view -c yeast_pe.sort.bam` \
`samtools view -c yeast_pe.sort.filt.bam`" | \
awk '{printf "%d original reads\n", $1;
printf "%d Q20 reads (%d read pairs)\n", $2, $2/2;
printf "%0.2f%% high-quality reads\n", 100*$2/$1}' |
There were 1,184,360 alignment records in the original BAM, and only 456,890 in the quality-filtered BAM, around 38% of our starting reads. Since we have only properly paired reads, the filtered BAM will contain equal numbers of both R1s and R2s. So the number of read pairs is 456890/2 or 228451. |
Exercise: How many primary aligned reads (0x100 = 0) are in the Copy the bwa local alignment files here:
Code Block |
---|
language | bash |
---|
title | solution |
---|
|
cds; cd core_ngs/samtools
cp $CORENGS/catchup/bwa_script/bwa_local.sort.dup.bam |
...
Tip |
---|
* .
cp $CORENGS/catchup/bwa_script/bwa_local.samstats.txt . |
Exercise: How many primary aligned reads (0x100 = 0) are in the bwa_local.sort.dup.bam file?
Tip |
---|
|
samtools view only pays attention to one -F or -f option, so to specify more than one flag value you need to combine them into one number. For example, combining 0x100 and 0x4 yields 0x104. |
...
Expand |
---|
|
Code Block |
---|
language | bash |
---|
title | solution |
---|
| samtools view -F 0x104 -c bwa_local.sort.dup.bam
|
There are 874791 primary alignments. Code Block |
---|
language | bash |
---|
title | solution |
---|
| samtools view -F 0x4 -f 0x100 -c bwa_local.sort.dup.bam
|
And 133209 secondary alignments. |
Look at the statistics for the bwa mem (local) alignment in the bwa_local.samstats.txt file:
Code Block |
---|
language | bash |
---|
title | solution |
---|
|
-----------------------------------------------
Aligner: bwa
Total sequences: 1317569
Total mapped: 1008000 (76.5 %)
Total unmapped: 309569 (23.5 %)
Primary: 874791 (86.8 %)
Secondary: 133209 (13.2 %)
Duplicates: 428418 (42.5 %)
Fwd strand: 502782 (49.9 %)
Rev strand: 505218 (50.1 %)
Unique hit: 947418 (94.0 %)
Multi hit: 60582 (6.0 %)
Soft clip: 543390 (53.9 %)
All match: 328038 (32.5 %)
Indels: 6679 (0.7 %)
Spliced:
-----------------------------------------------
Total PE seqs: 1317569
PE seqs mapped: 1008000 (76.5 %)
Num PE pairs: 658784
F5 1st end mapped: 413159 (62.7 %)
F3 2nd end mapped: 594841 (90.3 %)
PE pairs mapped: 384462 (58.4 %)
PE proper pairs: 282664 (42.9 %)
-----------------------------------------------
|