Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.


Tip
titleReservations

Use our summer school reservation (core-ngs-class-0605) for today when submitting batch jobs to get higher priority on the ls6 normal queue.

Code Block
languagebash
titleRequest an interactive (idev) node
# Request a 180 minute interactive node on the normal queue using our reservation
idev -m 120 -N 1 -A OTH21164 -r core-ngs-class-0605

# Request a 120 minute idev node on the development queue 
idev -m 120 -N 1 -A OTH21164 -p development


Code Block
languagebash
titleSubmit a batch job
# Using our reservation
sbatch --reseservation=core-ngs-class-0605 <batch_file>.slurm

Note that today's reservation name (core-ngs-class-0605) is different from the TACC allocation/project for this class, which is OTH21164.

...

ReferenceSpeciesBase LengthContig NumberSourceDownload
hg19Human3.1 Gbp25 (really 93)UCSCUCSC GoldenPath
sacCer3Yeast12.2 Mbp17UCSCUCSC GoldenPath
mirbase v20Human subset160 Kbp1908miRBasemiRBase Downloads
vibCho (ASM836960v1)Vibrio cholerae~4 .2 Mbp3GenBankGenBank Downloads

...

Tip

The BioITeam maintains a set of reference indexes for many common organisms and aligners. They can be found in aligner-specific sub-directories of the /work/projects/BioITeam/ref_genome area. E.g.:

bash
Code Block
language
/work/projects/BioITeam/ref_genome/
   bowtie2/
   bwa/
   hisat2/
   kallisto/
   star/
   tophat/


...

#
Code Block
grep to match contig names in a FASTA file
languagebash
title
Stage
the
FASTA
files
cds
mkdir -p core_ngs/references/fasta
cd core_ngs/references/fasta
cp $CORENGS/references/fasta/*.fa .

...

There are many aligners available, but we will concentrate on two of the most popular general-purpose ones: bwa and bowtie2. The table below outlines the available protocols for them.

alignment typealigner optionspro'scon's
global with bwa 

single end reads:

  • bwa aln <R1>
  • bwa samse

paired end reads:

  • bwa aln <R1>
  • bwa aln <R2>
  • bwa sampe
  • simple to use (take default options)
  • good for basic global alignment
  • multiple steps needed
global with bowtie2bowtie2 
  • extremely configurable
  • can be used for RNAseq alignment (after adapter trimming) because of its many options
  • complex (many options)
local with bwa bwa mem
  • simple to use (take default options)
  • very fast
  • no adapter trimming needed
  • good for simple RNAseq analysis
    • the secondary alignments it reports can provide splice junction information
  • always produces alignments with secondary reads
    • must be filtered if not desired
local with bowtie2bowtie2 --local
  • extremely configurable
  • no adapter trimming needed
  • good for small RNA alignment because of its many options
  • complex – many options


Exercise #1: BWA global alignment – Yeast ChIP-seq

...

Code Block
languagebash
titleStart an idev session
idev -m 180 -N 1 -A OTH21164 -r core-ngs-class-0605
# or
idev -m 120 -N 1 -A OTH21164 -p development  

...

Code Block
languagebash
titlePrepare BWA reference directory for sacCer3
mkdir -p $SCRATCH/core_ngs/references/bwa/sacCer3
cd $SCRATCH/core_ngs/references/bwa/sacCer3

# Create a symbolic link to the sacCer3 FASTA file
# so it can
# be accessed directly from this directory
ln -sf ../../fasta/sacCer3.fa
ls -l

...

Since you have your own private compute node, you can use all its resources. It has 128 cores, so re-run the R2 alignment asking for 60 execution threads.

Code Block
languagebash
bwa aln -t 60 sacCer3/sacCer3.fa \
  fastq/Sample_Yeast_L005_R2.cat.fastq.gz > yeast_pe_R2.sai

...

Expand
titleAnswer

The bwa aln usage doesn't say anything about the program's diagnostic output, so we can assume it will be written to standard error. So it could be redirected like this:

Code Block
bwa aln -t 60 sacCer3/sacCer3.fa \
  fastq/Sample_Yeast_L005_R2.cat.fastq.gz \
  > yeast_pe_R2.sai 2> yeast_pe_R2.log


Next we use the bwa sampe command to pair the reads and output SAM format data. Just type bwa sampe 2>&1 | more to see its usage.

...

Expand
titleSetup (if needed)


Code Block
languagebash
# Copy the FASTA files for building references
mkdir -p $SCRATCH/core_ngs/references
cp $CORENGS/references/fasta/*.fa $SCRATCH/core_ngs/references/fasta/

# Copy a pre-built bwa index for sacCer3
mkdir -p $SCRATCH/core_ngs/references/bwa/sacCer3
cp $CORENGS/references/bwa/sacCer3/*.* \
  $SCRATCH/core_ngs/references/bwa/sacCer3/

# Get the FASTQ to align
mkdir -p $SCRATCH/core_ngs/alignment/fastq
cp $CORENGS/alignment/*fastq.gz $SCRATCH/core_ngs/alignment/fastq/

# Stage the BWA .sai files
mkdir -p $SCRATCH/core_ngs/alignment/yeast_bwa
cd $SCRATCH/core_ngs/alignment/yeast_bwa
ln -sf ../fastq
ln -sf ../../references/bwa/sacCer3
cp $CORENGS/catchup/yeast_bwa/*.sai .


...

Suppose you wanted to look only at field 3 (contig name) values in the SAM file. You can do this with the handy cut command. Below is a simple example where you're asking cut to display the 3rd column value for the last 10 alignment records.

...

Code Block
titleSAMtools suite usage
Program: samtools (Tools for alignments in the SAM format)
Version: 1.920 (using htslib 1.920)

Usage:   samtools <command> [options]

Commands:
 
-- Indexing
 
   dict           create a sequence dictionary file
 
   faidx index/extract FASTA
   fqidx    index/extract FASTAFASTQ
   index index fqidxalignment

-- Editing
   calmd recalculate  indexMD/extractNM FASTQtags and '=' bases
  index fixmate fix mate information
   reheader replace indexBAM alignmentheader
   --targetcut Editingcut fosmid regions (for fosmid pool calmdonly)
   addreplacerg adds or replaces   recalculate MD/NM RG tags
and '=' bases markdup mark duplicates
  fixmate ampliconclip clip oligos from the end of fixreads
mate
information-- File operations
   reheadercollate shuffle and group alignments by name
replace BAM header cat concatenate BAMs
  targetcut consensus produce a consensus Pileup/FASTA/FASTQ
cut fosmid regions (formerge fosmidmerge poolsorted only)alignments
   mpileup multi-way addreplacergpileup
  adds orsort replacessort RGalignment tagsfile
   split splits markdupa file by read group
   markquickcheck duplicatesquickly check   -- File operations
if SAM/BAM/CRAM file appears intact
   fastq collateconverts a BAM to a FASTQ
  shuffle andfasta groupconverts alignmentsa byBAM nameto a FASTA
   catimport Converts FASTA or FASTQ files to SAM/BAM/CRAM
   reference concatenateGenerates BAMsa reference from aligned data
 merge  reset Reverts aligner changes in reads

-- mergeStatistics
sorted alignments  bedcov read depth per mpileupBED region
   coverage alignment depth multi-wayand pileuppercent coverage
   depth sortcompute the depth
   flagstat simple stats
  sort alignmentidxstats fileBAM index stats
   splitcram-size list CRAM Content-ID and Data-Series sizes
   splitsphase aphase fileheterozygotes
by read group stats generate stats (former bamcheck)
quickcheck   ampliconstats generate quicklyamplicon checkspecific ifstats
SAM/BAM/CRAM file appears intact
-- Viewing
   flags explain BAM fastqflags
   head header viewer
   convertstview atext BAMalignment toviewer
a FASTQ  view SAM<->BAM<->CRAM conversion
 fasta  depad convert       converts a padded BAM to aunpadded FASTABAM
   --samples Statisticslist the samples in a set bedcovof SAM/BAM/CRAM files

-- Misc
   readhelp depth[cmd] perdisplay BEDthis regionhelp message or help for [cmd]
coverage   version detailed   alignment depth and percent coverage
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)

  -- Viewing
     flags          explain BAM flags
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM

In this exercise, we will explore five utilities provided by samtools: view, sort, index, flagstat, and idxstats. Each of these is executed in one line for a given SAM/BAM file. In the SAMtools/BEDtools sections tomorrow we will explore samtools capabilities more in depth.

Warning
titleKnow your samtools version!

There are two main "eras" of SAMtools development:

  • "original" samtools
    • v 0.1.19 is the last stable version
  • "modern" samtools
    • v 1.0, 1.1, 1.2 – avoid these (very buggy!)
    • v 1.3+ – finally stable!

Unfortunately, some functions with the same name in both version eras have different options and arguments! So be sure you know which version you're using. (The samtools version is usually reported at the top of its usage listing).

TACC BioContainers also offers the original samtools version: samtools/ctr-0.1.19--3.

samtools view

The samtools view utility provides a way of converting between SAM (text) and BAM (binary, compressed) format. It also provides many, many other functions which we will discuss lster. To get a preview, execute samtools view 2>&1 | more. You should see:

Code Block
titlesamtools view usage
Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]

Options:
  -b       output BAM
  -C       output CRAM (requires -T)
  -1       use fast BAM compression (implies -b)
  -u       uncompressed BAM output (implies -b)
  -h       include header in SAM output
  -H       print SAM header only (no alignments)
  -c       print only the count of matching records
  -o FILE  output file name [stdout]
  -U FILE  output reads not selected by filters to FILE [null]
  -t FILE  FILE listing reference names and lengths (see long help) [null]
  -X       include customized index file
  -L FILE  only include reads overlapping this BED FILE [null]
  -r STR   only include reads in read group STR [null]
  -R FILE  only include reads with read group listed in FILE [null]
  -d STR:STR
  version information

In this exercise, we will explore five utilities provided by samtools: view, sort, index, flagstat, and idxstats. Each of these is executed in one line for a given SAM/BAM file. In the SAMtools/BEDtools sections tomorrow we will explore samtools capabilities more in depth.

Warning
titleKnow your samtools version!

There are two main "eras" of SAMtools development:

  • "original" samtools, controlled by author Heng Li
    • v 0.1.19 is the last stable version
  • "modern" samtools, controlled by a group of programmers
    • v 1.0, 1.1, 1.2 – avoid these (very buggy!)
    • v 1.3+ – finally stable!

Unfortunately, some functions with the same name in both version eras have different options and arguments! So be sure you know which version you're using. (The samtools version is usually reported at the top of its usage listing).

TACC BioContainers also offers the original samtools version: samtools/ctr-0.1.19--3.

samtools view

The samtools view utility provides a way of converting between SAM (text) and BAM (binary, compressed) format. It also provides many, many other functions which we will discuss lster. To get a preview, execute samtools view 2>&1 | more. You should see:

Code Block
titlesamtools view usage
Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]

Options:
  -b       output BAM
  -C       output CRAM (requires -T)
  -1       use fast BAM compression (implies -b)
  -u       uncompressed BAM output (implies -b)
  -h       include header in SAM output
  -H       print SAM header only (no alignments)
  -c       print only the count of matching records
  -o FILE  output file name [stdout]
  -U FILE  output reads not selected by filters to FILE [null]
  -t FILE  FILE listing reference names and lengths (see long help) [null]
  -X       include customized index file
  -L FILE  only include reads overlapping this BED FILE [null]
  -r STR   only include reads in read group STR [null]
  -R FILE  only include reads with read group listed in FILE [null]
  -d STR:STR
           only include reads with tag STR and associated value STR [null]
  -D STR:FILE
           only include reads with tag STR and associated values listed in
           FILE [null]
  -q INT   only include reads with mapping quality >= INT [0]
  -l STR   only include reads in library STR [null]
  -m INT   only include reads with number of CIGAR operations consuming
           query sequence >= INT [0]
  -f INT   only include reads with all  of the FLAGs in INT present [0]
  -F INT   only include reads with none of the FLAGS in INT present [0]
  -G INT   only EXCLUDE reads with all  of the FLAGs in INT present [0]
  -s FLOAT subsample reads (given INT.FRAC option value, 0.FRAC is the
        only include reads withfraction tagof STRtemplates/read andpairs associatedto valuekeep; STRINT [null]part sets seed)
-D STR:FILE -M       use the multi-region iterator only(increases includethe readsspeed, withremoves
tag STR and associated values listed in     duplicates and outputs the reads as they FILEare [null]ordered in the -qfile)
INT  -x onlySTR include reads withread mappingtag qualityto >= INTstrip (repeatable) [0null]
  -lB  STR   only include readscollapse inthe librarybackward STRCIGAR [null]operation
  -m? INT   only include reads withprint numberlong ofhelp, CIGARincluding operationsnote consumingabout region specification
  -S       queryignored sequence(input >=format INT [0]is auto-detected)
  --no-fPG INT do not onlyadd includea readsPG withline
all  of the FLAGs in INT present [0 --input-fmt-option OPT[=VAL]
  -F INT   only include reads with none of the FLAGS in INTSpecify presenta [0]single input file -Gformat INToption in the onlyform
EXCLUDE reads with all  of the FLAGs in INT present [0]   -s FLOATof subsampleOPTION reads (given INT.FRAC option value, 0.FRAC is theor OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
             fraction of templates/readSpecify pairsoutput to keep; INT part sets seed)format (SAM, BAM, CRAM)
   -M   --output-fmt-option OPT[=VAL]
  use the multi-region iterator (increases the speed, removes      Specify a single output file format duplicatesoption and outputsin the readsform
as they are ordered in the file)   -x STR   read tag toof stripOPTION (repeatable) [null]or OPTION=VALUE
  -BT, --reference FILE
    collapse the backward CIGAR operation   -?    Reference sequence FASTA printFILE long[null]
help, including note about region specification-@, --threads INT
   -S       ignored (input format is auto-detected) Number of --no-PGadditional threads doto not add a PG lineuse [0]
      --inputwrite-fmt-option OPT[=VAL]index
               Automatically Specifyindex athe singleoutput inputfiles file[off]
format option in the form  --verbosity INT
            of OPTION or OPTION=VALUESet level  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
  -T, --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]
      --write-index
               Automatically index the output files [off]
      --verbosity INT
               Set level of verbosity

That is a lot to process! For now, we just want to read in a SAM file and output a BAM file. The input format is auto-detected, so we don't need to specify it (although you do in v0.1.19). We just need to tell the tool to output the file in BAM format, and to include the header records.

...

titleSetup (if needed)
Code Block
languagebash
titleGet the alignment exercises files
mkdir -p $SCRATCH/core_ngs/alignment/yeast_bwa
cd $SCRATCH/core_ngs/alignment/yeast_bwa
cp $CORENGS/catchup/yeast_bwa/yeast_pe.sam .
Code Block
languagebash
titleConvert SAM to binary BAM
cd $SCRATCH/core_ngs/alignment/yeast_bwa
samtools view -b yeast_pe.sam > yeast_pe.bam 
  • the -b option tells the tool to output BAM format

The BAM file is a binary file, not a text file, so how do you look at its contents now? Just use samtools view without the -b option. Remember to pipe output to a pager!

Code Block
languagebash
titleView BAM records
samtools view yeast_pe.bam | more

Notice that this does not show us the header record we saw at the start of the SAM file.

Exercise: What samtools view option will include the header records in its output? Which option would show only the header records?

Expand
titleHint

Note that samtools (like bwa) writes its help to standard error, but less and more only accept input on standard input. So the syntax redirecting standard error to standard input must be used before the pipe to less or more.

samtools view 2>&1 | less -I

then search for "header" ( /header )

Expand
titleAnswer

samtools view -h shows both header and alignment records.

samtools view -H shows header records only.

samtools sort

Looking at some of the alignment record information:

Code Block
languagebash
titleView BAM records
samtools view yeast_pe.bam | cut -f 1-4 | more

You will notice that read names appear in adjacent pairs (for the R1 and R2), in the same order they appeared in the original FASTQ file. This means the records are in name order and, searching through the file is very inefficient. samtools sort re-orders entries in the SAM file either by locus (contig name + coordinate position) or by read name.

If you execute samtools sort 2>&1 | more, you see its help page:

Code Block
titlesamtools sort usage
Usage: samtools sort [options...] [in.bam]
Options:
  -l INT     Set compression level, from 0 (uncompressed) to 9 (best)
  -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]
  -nof verbosity

That is a lot to process! For now, we just want to read in a SAM file and output a BAM file. The input format is auto-detected, so we don't need to specify it (although you do in v0.1.19). We just need to tell the tool to output the file in BAM format, and to include the header records.

Expand
titleSetup (if needed)


Code Block
languagebash
titleGet the alignment exercises files
mkdir -p $SCRATCH/core_ngs/alignment/yeast_bwa
cd $SCRATCH/core_ngs/alignment/yeast_bwa
cp $CORENGS/catchup/yeast_bwa/yeast_pe.sam .



Code Block
languagebash
titleConvert SAM to binary BAM
cd $SCRATCH/core_ngs/alignment/yeast_bwa
samtools view -b yeast_pe.sam > yeast_pe.bam 
  • the -b option tells the tool to output BAM format

The BAM file is a binary file, not a text file, so how do you look at its contents now? Just use samtools view without the -b option. Remember to pipe output to a pager!

Code Block
languagebash
titleView BAM records
samtools view yeast_pe.bam | more

Notice that this does not show us the header record we saw at the start of the SAM file.

Exercises:

  • What samtools view option will include the header records in its output?
  • Which option would show only the header records?
Expand
titleHint

Note that samtools (like bwa) writes its help to standard error, but less and more only accept input on standard input. So the syntax redirecting standard error to standard input must be used before the pipe to less or more.

samtools view 2>&1 | less -I

then search for "header" ( /header )


Expand
titleAnswer

samtools view -h shows both header and alignment records.

samtools view -H shows header records only.

samtools sort

Looking at some of the alignment record information:

Code Block
languagebash
titleView BAM records
samtools view yeast_pe.bam | cut -f 1-4 | more

You will notice that read names appear in adjacent pairs (for the R1 and R2), in the same order they appeared in the original FASTQ file. This means the records are in name order and, searching through the file is very inefficient. samtools sort re-orders entries in the SAM file either by locus (contig name + coordinate position) or by read name.

If you execute samtools sort 2>&1 | more, you see its help page:

Code Block
titlesamtools sort usage
Usage: samtools sort [options...] [in.bam]
Options:
  -l INT     Set compression level, from 0 (uncompressed) to 9 (best)
  -u         Output uncompressed data (equivalent to -l 0)
  -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]
  -M         Use minimiser for clustering unaligned/unplaced reads
  -R         Do not use reverse strand (only compatible with -M)
  -K INT     Kmer size to use for minimiser [20]
  -I FILE    Order minimisers by their position in FILE FASTA
  -w INT     Window size for minimiser indexing via -I ref.fa [100]
  -H         Squash homopolymers when computing minimiser
  -n         Sort by read name (natural): cannot be used with samtools index
  -N         Sort by read name (ASCII): cannot be used with samtools index
  -t TAG     Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
  -o FILE    Write final output to FILE rather than standard output
  -T PREFIX  Write temporary files to PREFIX.nnnn.bam
      --no-PG
               Do not add a PG line
      --template-coordinate
          -T PREFIX  Write temporary filesSort to PREFIX.nnnn.bamby template-coordinate
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
              
of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
             
 of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]
      --write-index
               Automatically index the output files [off]
      --verbosity INT
               Set level of verbosity

In most cases you will be sorting a BAM file from name order to locus order. You can use either -o or redirection with > to control the output.

...