| Tip |
|---|
|
Use our summer school reservation (core-ngs-class-0605) for today when submitting batch jobs to get higher priority on the ls6 normal queue. | Code Block |
|---|
| language | bash |
|---|
| title | Request an interactive (idev) node |
|---|
| # Request a 180 minute interactive node on the normal queue using our reservation
idev -m 120 -N 1 -A OTH21164 -r core-ngs-class-0605
# Request a 120 minute idev node on the development queue
idev -m 120 -N 1 -A OTH21164 -p development
|
| Code Block |
|---|
| language | bash |
|---|
| title | Submit a batch job |
|---|
| # Using our reservation
sbatch --reseservation=core-ngs-class-0605 <batch_file>.slurm |
Note that today's reservation name (core-ngs-class-0605) is different from the TACC allocation/project for this class, which is OTH21164.
|
...
...
| Tip |
|---|
The BioITeam maintains a set of reference indexes for many common organisms and aligners. They can be found in aligner-specific sub-directories of the /work/projects/BioITeam/ref_genome area. E.g.: | Code Block |
|---|
language | bash |
|---|
/work/projects/BioITeam/ref_genome/
bowtie2/
bwa/
hisat2/
kallisto/
star/
tophat/ |
|
...
| Code Block |
|---|
| language | bash |
|---|
| title | grep to match contig names in a FASTA file |
|---|
| # | | | |
|
cds
mkdir -p core_ngs/references/fasta
cd core_ngs/references/fasta
cp $CORENGS/references/fasta/*.fa .
|
...
There are many aligners available, but we will concentrate on two of the most popular general-purpose ones: bwa and bowtie2. The table below outlines the available protocols for them.
| alignment type | aligner options | pro's | con's |
|---|
| global with bwa | single end reads: paired end reads: - bwa aln <R1>
- bwa aln <R2>
- bwa sampe
| - simple to use (take default options)
- good for basic global alignment
| |
| global with bowtie2 | bowtie2 | - extremely configurable
- can be used for RNAseq alignment (after adapter trimming) because of its many options
| |
| local with bwa | bwa mem | - simple to use (take default options)
- very fast
- no adapter trimming needed
- good for simple RNAseq analysis
- the secondary alignments it reports can provide splice junction information
| - always produces alignments with secondary reads
- must be filtered if not desired
|
| local with bowtie2 | bowtie2 --local | - extremely configurable
- no adapter trimming needed
- good for small RNA alignment because of its many options
| |
Exercise #1: BWA global alignment – Yeast ChIP-seq
...
| Code Block |
|---|
| language | bash |
|---|
| title | Start an idev session |
|---|
|
idev -m 180 -N 1 -A OTH21164 -r core-ngs-class-0605
# or
idev -m 120 -N 1 -A OTH21164 -p development |
...
| Code Block |
|---|
| language | bash |
|---|
| title | Prepare BWA reference directory for sacCer3 |
|---|
|
mkdir -p $SCRATCH/core_ngs/references/bwa/sacCer3
cd $SCRATCH/core_ngs/references/bwa/sacCer3
# Create a symbolic link to the sacCer3 FASTA file
# so it can
# be accessed directly from this directory
ln -sf ../../fasta/sacCer3.fa
ls -l |
...
Since you have your own private compute node, you can use all its resources. It has 128 cores, so re-run the R2 alignment asking for 60 execution threads.
| Code Block |
|---|
|
bwa aln -t 60 sacCer3/sacCer3.fa \
fastq/Sample_Yeast_L005_R2.cat.fastq.gz > yeast_pe_R2.sai |
...
| Expand |
|---|
|
The bwa aln usage doesn't say anything about the program's diagnostic output, so we can assume it will be written to standard error. So it could be redirected like this: | Code Block |
|---|
bwa aln -t 60 sacCer3/sacCer3.fa \
fastq/Sample_Yeast_L005_R2.cat.fastq.gz \
> yeast_pe_R2.sai 2> yeast_pe_R2.log |
|
Next we use the bwa sampe command to pair the reads and output SAM format data. Just type bwa sampe 2>&1 | more to see its usage.
...
| Expand |
|---|
|
| Code Block |
|---|
| # Copy the FASTA files for building references
mkdir -p $SCRATCH/core_ngs/references
cp $CORENGS/references/fasta/*.fa $SCRATCH/core_ngs/references/fasta/
# Copy a pre-built bwa index for sacCer3
mkdir -p $SCRATCH/core_ngs/references/bwa/sacCer3
cp $CORENGS/references/bwa/sacCer3/*.* \
$SCRATCH/core_ngs/references/bwa/sacCer3/
# Get the FASTQ to align
mkdir -p $SCRATCH/core_ngs/alignment/fastq
cp $CORENGS/alignment/*fastq.gz $SCRATCH/core_ngs/alignment/fastq/
# Stage the BWA .sai files
mkdir -p $SCRATCH/core_ngs/alignment/yeast_bwa
cd $SCRATCH/core_ngs/alignment/yeast_bwa
ln -sf ../fastq
ln -sf ../../references/bwa/sacCer3
cp $CORENGS/catchup/yeast_bwa/*.sai .
|
|
...
Suppose you wanted to look only at field 3 (contig name) values in the SAM file. You can do this with the handy cut command. Below is a simple example where you're asking cut to display the 3rd column value for the last 10 alignment records.
...
| Code Block |
|---|
| title | SAMtools suite usage |
|---|
|
Program: samtools (Tools for alignments in the SAM format)
Version: 1.920 (using htslib 1.920)
Usage: samtools <command> [options]
Commands:
-- Indexing
dict create a sequence dictionary file
faidx index/extract FASTA
fqidx index/extract FASTAFASTQ
index index fqidxalignment
-- Editing
calmd recalculate indexMD/extractNM FASTQtags and '=' bases
index fixmate fix mate information
reheader replace indexBAM alignmentheader
--targetcut Editingcut fosmid regions (for fosmid pool calmdonly)
addreplacerg adds or replaces recalculate MD/NM RG tags
and '=' bases markdup mark duplicates
fixmate ampliconclip clip oligos from the end of fixreads
mate
information-- File operations
reheadercollate shuffle and group alignments by name
replace BAM header cat concatenate BAMs
targetcut consensus produce a consensus Pileup/FASTA/FASTQ
cut fosmid regions (formerge fosmidmerge poolsorted only)alignments
mpileup multi-way addreplacergpileup
adds orsort replacessort RGalignment tagsfile
split splits markdupa file by read group
markquickcheck duplicatesquickly check -- File operations
if SAM/BAM/CRAM file appears intact
fastq collateconverts a BAM to a FASTQ
shuffle andfasta groupconverts alignmentsa byBAM nameto a FASTA
catimport Converts FASTA or FASTQ files to SAM/BAM/CRAM
reference concatenateGenerates BAMsa reference from aligned data
merge reset Reverts aligner changes in reads
-- mergeStatistics
sorted alignments bedcov read depth per mpileupBED region
coverage alignment depth multi-wayand pileuppercent coverage
depth sortcompute the depth
flagstat simple stats
sort alignmentidxstats fileBAM index stats
splitcram-size list CRAM Content-ID and Data-Series sizes
splitsphase aphase fileheterozygotes
by read group stats generate stats (former bamcheck)
quickcheck ampliconstats generate quicklyamplicon checkspecific ifstats
SAM/BAM/CRAM file appears intact
-- Viewing
flags explain BAM fastqflags
head header viewer
convertstview atext BAMalignment toviewer
a FASTQ view SAM<->BAM<->CRAM conversion
fasta depad convert converts a padded BAM to aunpadded FASTABAM
--samples Statisticslist the samples in a set bedcovof SAM/BAM/CRAM files
-- Misc
readhelp depth[cmd] perdisplay BEDthis regionhelp message or help for [cmd]
coverage version detailed alignment depth and percent coverage
depth compute the depth
flagstat simple stats
idxstats BAM index stats
phase phase heterozygotes
stats generate stats (former bamcheck)
-- Viewing
flags explain BAM flags
tview text alignment viewer
view SAM<->BAM<->CRAM conversion
depad convert padded BAM to unpadded BAM |
In this exercise, we will explore five utilities provided by samtools: view, sort, index, flagstat, and idxstats. Each of these is executed in one line for a given SAM/BAM file. In the SAMtools/BEDtools sections tomorrow we will explore samtools capabilities more in depth.
| Warning |
|---|
| title | Know your samtools version! |
|---|
|
There are two main "eras" of SAMtools development: - "original" samtools
- v 0.1.19 is the last stable version
- "modern" samtools
- v 1.0, 1.1, 1.2 – avoid these (very buggy!)
- v 1.3+ – finally stable!
Unfortunately, some functions with the same name in both version eras have different options and arguments! So be sure you know which version you're using. (The samtools version is usually reported at the top of its usage listing). TACC BioContainers also offers the original samtools version: samtools/ctr-0.1.19--3. |
samtools view
The samtools view utility provides a way of converting between SAM (text) and BAM (binary, compressed) format. It also provides many, many other functions which we will discuss lster. To get a preview, execute samtools view 2>&1 | more. You should see:
| Code Block |
|---|
|
Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]
Options:
-b output BAM
-C output CRAM (requires -T)
-1 use fast BAM compression (implies -b)
-u uncompressed BAM output (implies -b)
-h include header in SAM output
-H print SAM header only (no alignments)
-c print only the count of matching records
-o FILE output file name [stdout]
-U FILE output reads not selected by filters to FILE [null]
-t FILE FILE listing reference names and lengths (see long help) [null]
-X include customized index file
-L FILE only include reads overlapping this BED FILE [null]
-r STR only include reads in read group STR [null]
-R FILE only include reads with read group listed in FILE [null]
-d STR:STR
version information |
In this exercise, we will explore five utilities provided by samtools: view, sort, index, flagstat, and idxstats. Each of these is executed in one line for a given SAM/BAM file. In the SAMtools/BEDtools sections tomorrow we will explore samtools capabilities more in depth.
| Warning |
|---|
| title | Know your samtools version! |
|---|
|
There are two main "eras" of SAMtools development: - "original" samtools, controlled by author Heng Li
- v 0.1.19 is the last stable version
- "modern" samtools, controlled by a group of programmers
- v 1.0, 1.1, 1.2 – avoid these (very buggy!)
- v 1.3+ – finally stable!
Unfortunately, some functions with the same name in both version eras have different options and arguments! So be sure you know which version you're using. (The samtools version is usually reported at the top of its usage listing). TACC BioContainers also offers the original samtools version: samtools/ctr-0.1.19--3. |
samtools view
The samtools view utility provides a way of converting between SAM (text) and BAM (binary, compressed) format. It also provides many, many other functions which we will discuss lster. To get a preview, execute samtools view 2>&1 | more. You should see:
| Code Block |
|---|
|
Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]
Options:
-b output BAM
-C output CRAM (requires -T)
-1 use fast BAM compression (implies -b)
-u uncompressed BAM output (implies -b)
-h include header in SAM output
-H print SAM header only (no alignments)
-c print only the count of matching records
-o FILE output file name [stdout]
-U FILE output reads not selected by filters to FILE [null]
-t FILE FILE listing reference names and lengths (see long help) [null]
-X include customized index file
-L FILE only include reads overlapping this BED FILE [null]
-r STR only include reads in read group STR [null]
-R FILE only include reads with read group listed in FILE [null]
-d STR:STR
only include reads with tag STR and associated value STR [null]
-D STR:FILE
only include reads with tag STR and associated values listed in
FILE [null]
-q INT only include reads with mapping quality >= INT [0]
-l STR only include reads in library STR [null]
-m INT only include reads with number of CIGAR operations consuming
query sequence >= INT [0]
-f INT only include reads with all of the FLAGs in INT present [0]
-F INT only include reads with none of the FLAGS in INT present [0]
-G INT only EXCLUDE reads with all of the FLAGs in INT present [0]
-s FLOAT subsample reads (given INT.FRAC option value, 0.FRAC is the
only include reads withfraction tagof STRtemplates/read andpairs associatedto valuekeep; STRINT [null]part sets seed)
-D STR:FILE -M use the multi-region iterator only(increases includethe readsspeed, withremoves
tag STR and associated values listed in duplicates and outputs the reads as they FILEare [null]ordered in the -qfile)
INT -x onlySTR include reads withread mappingtag qualityto >= INTstrip (repeatable) [0null]
-lB STR only include readscollapse inthe librarybackward STRCIGAR [null]operation
-m? INT only include reads withprint numberlong ofhelp, CIGARincluding operationsnote consumingabout region specification
-S queryignored sequence(input >=format INT [0]is auto-detected)
--no-fPG INT do not onlyadd includea readsPG withline
all of the FLAGs in INT present [0 --input-fmt-option OPT[=VAL]
-F INT only include reads with none of the FLAGS in INTSpecify presenta [0]single input file -Gformat INToption in the onlyform
EXCLUDE reads with all of the FLAGs in INT present [0] -s FLOATof subsampleOPTION reads (given INT.FRAC option value, 0.FRAC is theor OPTION=VALUE
-O, --output-fmt FORMAT[,OPT[=VAL]]...
fraction of templates/readSpecify pairsoutput to keep; INT part sets seed)format (SAM, BAM, CRAM)
-M --output-fmt-option OPT[=VAL]
use the multi-region iterator (increases the speed, removes Specify a single output file format duplicatesoption and outputsin the readsform
as they are ordered in the file) -x STR read tag toof stripOPTION (repeatable) [null]or OPTION=VALUE
-BT, --reference FILE
collapse the backward CIGAR operation -? Reference sequence FASTA printFILE long[null]
help, including note about region specification-@, --threads INT
-S ignored (input format is auto-detected) Number of --no-PGadditional threads doto not add a PG lineuse [0]
--inputwrite-fmt-option OPT[=VAL]index
Automatically Specifyindex athe singleoutput inputfiles file[off]
format option in the form --verbosity INT
of OPTION or OPTION=VALUESet level -O, --output-fmt FORMAT[,OPT[=VAL]]...
Specify output format (SAM, BAM, CRAM)
--output-fmt-option OPT[=VAL]
Specify a single output file format option in the form
of OPTION or OPTION=VALUE
-T, --reference FILE
Reference sequence FASTA FILE [null]
-@, --threads INT
Number of additional threads to use [0]
--write-index
Automatically index the output files [off]
--verbosity INT
Set level of verbosity |
That is a lot to process! For now, we just want to read in a SAM file and output a BAM file. The input format is auto-detected, so we don't need to specify it (although you do in v0.1.19). We just need to tell the tool to output the file in BAM format, and to include the header records.
...
| Code Block |
|---|
| language | bash |
|---|
| title | Get the alignment exercises files |
|---|
|
mkdir -p $SCRATCH/core_ngs/alignment/yeast_bwa
cd $SCRATCH/core_ngs/alignment/yeast_bwa
cp $CORENGS/catchup/yeast_bwa/yeast_pe.sam .
|
| Code Block |
|---|
| language | bash |
|---|
| title | Convert SAM to binary BAM |
|---|
|
cd $SCRATCH/core_ngs/alignment/yeast_bwa
samtools view -b yeast_pe.sam > yeast_pe.bam |
- the -b option tells the tool to output BAM format
The BAM file is a binary file, not a text file, so how do you look at its contents now? Just use samtools view without the -b option. Remember to pipe output to a pager!
| Code Block |
|---|
| language | bash |
|---|
| title | View BAM records |
|---|
|
samtools view yeast_pe.bam | more
|
Notice that this does not show us the header record we saw at the start of the SAM file.
Exercise: What samtools view option will include the header records in its output? Which option would show only the header records?
| Expand |
|---|
|
Note that samtools (like bwa) writes its help to standard error, but less and more only accept input on standard input. So the syntax redirecting standard error to standard input must be used before the pipe to less or more. samtools view 2>&1 | less -I
then search for "header" ( /header ) |
| Expand |
|---|
|
samtools view -h shows both header and alignment records. samtools view -H shows header records only. |
samtools sort
Looking at some of the alignment record information:
| Code Block |
|---|
| language | bash |
|---|
| title | View BAM records |
|---|
|
samtools view yeast_pe.bam | cut -f 1-4 | more |
You will notice that read names appear in adjacent pairs (for the R1 and R2), in the same order they appeared in the original FASTQ file. This means the records are in name order and, searching through the file is very inefficient. samtools sort re-orders entries in the SAM file either by locus (contig name + coordinate position) or by read name.
If you execute samtools sort 2>&1 | more, you see its help page:
| Code Block |
|---|
|
Usage: samtools sort [options...] [in.bam]
Options:
-l INT Set compression level, from 0 (uncompressed) to 9 (best)
-m INT Set maximum memory per thread; suffix K/M/G recognized [768M]
-nof verbosity |
That is a lot to process! For now, we just want to read in a SAM file and output a BAM file. The input format is auto-detected, so we don't need to specify it (although you do in v0.1.19). We just need to tell the tool to output the file in BAM format, and to include the header records.
| Expand |
|---|
|
| Code Block |
|---|
| language | bash |
|---|
| title | Get the alignment exercises files |
|---|
| mkdir -p $SCRATCH/core_ngs/alignment/yeast_bwa
cd $SCRATCH/core_ngs/alignment/yeast_bwa
cp $CORENGS/catchup/yeast_bwa/yeast_pe.sam .
|
|
| Code Block |
|---|
| language | bash |
|---|
| title | Convert SAM to binary BAM |
|---|
|
cd $SCRATCH/core_ngs/alignment/yeast_bwa
samtools view -b yeast_pe.sam > yeast_pe.bam |
- the -b option tells the tool to output BAM format
The BAM file is a binary file, not a text file, so how do you look at its contents now? Just use samtools view without the -b option. Remember to pipe output to a pager!
| Code Block |
|---|
| language | bash |
|---|
| title | View BAM records |
|---|
|
samtools view yeast_pe.bam | more
|
Notice that this does not show us the header record we saw at the start of the SAM file.
Exercises:
- What samtools view option will include the header records in its output?
- Which option would show only the header records?
| Expand |
|---|
|
Note that samtools (like bwa) writes its help to standard error, but less and more only accept input on standard input. So the syntax redirecting standard error to standard input must be used before the pipe to less or more. samtools view 2>&1 | less -I
then search for "header" ( /header ) |
| Expand |
|---|
|
samtools view -h shows both header and alignment records. samtools view -H shows header records only. |
samtools sort
Looking at some of the alignment record information:
| Code Block |
|---|
| language | bash |
|---|
| title | View BAM records |
|---|
|
samtools view yeast_pe.bam | cut -f 1-4 | more |
You will notice that read names appear in adjacent pairs (for the R1 and R2), in the same order they appeared in the original FASTQ file. This means the records are in name order and, searching through the file is very inefficient. samtools sort re-orders entries in the SAM file either by locus (contig name + coordinate position) or by read name.
If you execute samtools sort 2>&1 | more, you see its help page:
| Code Block |
|---|
|
Usage: samtools sort [options...] [in.bam]
Options:
-l INT Set compression level, from 0 (uncompressed) to 9 (best)
-u Output uncompressed data (equivalent to -l 0)
-m INT Set maximum memory per thread; suffix K/M/G recognized [768M]
-M Use minimiser for clustering unaligned/unplaced reads
-R Do not use reverse strand (only compatible with -M)
-K INT Kmer size to use for minimiser [20]
-I FILE Order minimisers by their position in FILE FASTA
-w INT Window size for minimiser indexing via -I ref.fa [100]
-H Squash homopolymers when computing minimiser
-n Sort by read name (natural): cannot be used with samtools index
-N Sort by read name (ASCII): cannot be used with samtools index
-t TAG Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
-o FILE Write final output to FILE rather than standard output
-T PREFIX Write temporary files to PREFIX.nnnn.bam
--no-PG
Do not add a PG line
--template-coordinate
-T PREFIX Write temporary filesSort to PREFIX.nnnn.bamby template-coordinate
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
-O, --output-fmt FORMAT[,OPT[=VAL]]...
Specify output format (SAM, BAM, CRAM)
--output-fmt-option OPT[=VAL]
Specify a single output file format option in the form
of OPTION or OPTION=VALUE
--reference FILE
Reference sequence FASTA FILE [null]
-@, --threads INT
Number of additional threads to use [0]
--write-index
Automatically index the output files [off]
--verbosity INT
Set level of verbosity |
In most cases you will be sorting a BAM file from name order to locus order. You can use either -o or redirection with > to control the output.
...