...
Suppose we want to know how many alignments included insertions or deletions (indels) versus the reference. Looking at the CIGAR field definition table above, we see that insertions are denoted with the character I and deletions with the character D. We'll use this information, along with samtools view, cut and grep, to count the number of indels.
...
The result should be 1.22283 %.In addition to the
rate, converted to a percentage, we also output the literal percent sign ( % ). Notes:
- print 100*6697/547664 calculates the percentage rate
- The double quotes ( " )
...
- surround a literal string in awk
...
- here the percent sign ( % )
- The comma ( , ) between the number and the string says to separate the two output fields (the rate and the literal text) with whatever the default Output Field Separator (OFS) is.
- By default,
...
- awk's Output Field Separator is a single space
- while it's default Input Field Separator (FS) is whitespace (any number of space and Tab characters)
- awk's Output Field Separator is a single space
...
So what if we want to get fancy and do all this in a one-liner command? We can use echo and backtick evaluation to put both numbers in a string, then pipe that string to awk. This time we use awk body code, and refer to the two whitespace-separated fields being passed in: total count ($1) and indel count ($2).
...
Tip | |||||
---|---|---|---|---|---|
awk also has a printf function, which can take the standard formatting commands (see https://en.wikipedia.org/wiki/Printf_format_string#Type_field).
Notes:
|
Filtering by location range
...
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/core_ngs/samtools # count the number of reads mapped to chromosome 2 (chrII) samtools view -c -F 0x4 yeast_pe.sort.bam chrII # count the number of reads mapped to chromosomes 1 or M (chrI, chrM) samtools view -c -F 0x4 yeast_pe.sort.bam chrI chrM # count the number of reads mapped to chromosomes 1 that overlap coordinates 1000-2000 samtools view -c -F 0x4 yeast_pe.sort.bam chrI:1000-2000 # since there are only 20 reads in the chrI:1000-2000 region, examine them individually samtools view -F 0x4 yeast_pe.sort.bam chrI:1000-2000 # look at a subset of field for chrI:1000-2000 reads # fields: 2=flags, 3=contig, 4=start, 5=mapping quality, 6=CIGAR, 9=insert size samtools view -F 0x4 yeast_pe.sort.bam chrI:1000-2000 | cut -f 2-6,9 |
...
- If a command line is continued on more than one line, the line continuation character ( \ ) is used.
- There is a sprintf (string print formatted) function in awk, just as there is in many modern programming languages.
- We use sprintf to re-format the 1st input field (here the cut flags) in hex
- using the format directive (%x)
- also adding a literal "0x" prefix to denote the numeric base is hexadecimal
- The results of sprintf are stored back into the 1st field ($1), replacing the original value.
- awk's print statement is then used with no other arguments to print all its input fields, including the changed $1.
...
Expand | |||||
---|---|---|---|---|---|
| |||||
Here's another way of doing it:
|
About mapping quality
Mapping qualities are quality is a measure of how likely a given sequence alignment to its reported location is correct. If a read's mapping quality is low (especially if it is zero, or mapQ 0 for short) the read maps to multiple locations on the genome (they are multi-hit or multi-mapping reads), and we can't be sure whether the reported location is the correct one. These are called multi-mapping or multi-hit reads.
Aligners also differ in whether they report alternate alignments for multi-hit reads. Some things to keep in mind:
...