Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Suppose we want to know how many alignments included insertions or deletions (indels) versus the reference. Looking at the CIGAR field definition table above, we see that insertions are denoted with the character I and deletions with the character D. We'll use this information, along with samtools view, cut and grep, to count the number of indels.

...

The result should be 1.22283 %.In addition to the

rate, converted to a percentage, we also output the literal percent sign ( % ). Notes:

  • print 100*6697/547664 calculates the percentage rate
  • The double quotes ( " )

...

  • surround a literal string in awk

...

    • here the percent sign ( % )
  • The comma ( , ) between the number and the string says to separate the two output fields (the rate and the literal text) with whatever the default Output Field Separator (OFS) is.
    • By default,

...

    • awk'Output Field Separator is a single space
      • while it's default Input Field Separator (FS) is whitespace (any number of space and Tab characters)

...

So what if we want to get fancy and do all this in a one-liner command? We can use echo and backtick evaluation to put both numbers in a string, then pipe that string to awk. This time we use awk body code, and refer to the two whitespace-separated fields being passed in: total count ($1) and indel count ($2).

...

Tip

awk also has a printf function, which can take the standard formatting commands (see https://en.wikipedia.org/wiki/Printf_format_string#Type_field).

Code Block
languagebash
awk 'BEGIN{ printf("%.2f %%\n", 100*6697/547664) }'

Notes:

  • The printf arguments are enclosed in parentheses since it is a true awk function.
  • The 1st argument is the format string, enclosed in double quotes.
    • The %.2f format specifier says to output a floating point number with 2 digits after the decimal place.
    • The %% format specifier is used to output a single, literal percent sign ( % ).
    • Unlike the standard print statement, the printf function does not automatically append a newline to the output, so \n is added to the format string here.
  • Remaining All arguments to printf after the format string are the values to be substituted for the format specifiers (here our percentage computation).
    • There must be one argument for every item that is being formatted
      • i.e., one for every percent sign ( % ) except the %% literal.

Filtering by location range

...

Code Block
languagebash
cd $SCRATCH/core_ngs/samtools

# count the number of reads mapped to chromosome 2 (chrII)
samtools view -c -F 0x4 yeast_pe.sort.bam chrII

# count the number of reads mapped to chromosomes 1 or M (chrI, chrM)
samtools view -c -F 0x4 yeast_pe.sort.bam chrI chrM

# count the number of reads mapped to chromosomes 1 that overlap coordinates 1000-2000
samtools view -c -F 0x4 yeast_pe.sort.bam chrI:1000-2000

# since there are only 20 reads in the chrI:1000-2000 region, examine them individually
samtools view -F 0x4 yeast_pe.sort.bam chrI:1000-2000

# look at a subset of field for chrI:1000-2000 reads
# fields:  2=flags, 3=contig, 4=start, 5=mapping quality, 6=CIGAR, 9=insert size
samtools view -F 0x4 yeast_pe.sort.bam chrI:1000-2000 | cut -f 2-6,9

...

  • If a command line is continued on more than one line, the line continuation character ( \ ) is used.
  • There is a sprintf (string print formatted) function in awk, just as there is in many modern programming languages.
  • We use sprintf to re-format the 1st input field (here the cut flags) in hex
    • using the format directive (%x)
    • also adding a literal "0x" prefix to denote the numeric base is hexadecimal
  • The results of sprintf are stored back into the 1st field ($1), replacing the original value.
  • awk's print statement is then used with no other arguments to print all its input fields, including the changed $1.

...

Expand
titleAnswer 2 (using SAM flag)

Here's another way of doing it:

Code Block
languagebash
samtools view -c -F 0x4 -f 0x2 yeast_pe.sort.bam chrI:1000-2000


About mapping quality

Mapping qualities are quality is a measure of how likely a given sequence alignment to its reported location is correct. If a read's mapping quality is low (especially if it is zero, or mapQ 0 for short) the read maps to multiple locations on the genome (they are multi-hit or multi-mapping reads), and we can't be sure whether the reported location is the correct one. These are called multi-mapping or multi-hit reads.

Aligners also differ in whether they report alternate alignments for multi-hit reads. Some things to keep in mind:

...