Core NGS Resources
A sampling of available resources chosen specifically for this course; it is not a comprehensive catalog.
- 1 Linux/TACC
- 2 Community Resources
- 3 Sequencing Technologies
- 4 FASTQ analysis/manipulation/QC
- 5 Reference genomes
- 6 Basic alignment and aligners
- 7 Transcriptome-aware aligners
- 8 Alignment analysis
- 9 File formats and conversion
- 10 UCSC Genome Browser
- 11 RNAseq/Transcriptome analysis
- 12 Variant calling
- 13 Genome Annotation
Linux/TACC
Linux fundamentals on this wiki
Wikis for the 3 CBRS Unix/Linux workshops:
Online tutorials:
Ryan's Linux Tutorial: http://ryanstutorials.net/linuxtutorial/
Unix bootcamp for biologists: http://korflab.ucdavis.edu/bootcamp.html
Unix primer (longer version) for biologists:
Community Resources
UCSC Genome Browser - visualize and download NGS data (see more below)
Broad Institute Integrative Genomics Viewer (IGV)
especially good for visualizing BAM file details
Introduction to Sequence analysis in the Amazon EC2 cloud
where you can "rent" Linux machines (useful if you don't have access to TACC or BRCF pods)
Galaxy website for online sequencing data analysis
SEQanswers forum - many NGS sequencing questions answered here
A funny SEQanswers post about biologists starting to analyze NGS data: http://seqanswers.com/forums/showthread.php?t=4589
Sequencing Technologies
Overviews
Technology intros
Illumina (Solexa) – most common "short" (< 300 bp) read sequencing
Newer single molecule sequencing
Single cell sequencing
Older technologies (not common now)
Life Technologies SOLiD (short reads in "colorspace")
Roche/454 – longer reads (up to roughly 1 Kb) often used in assemblies
FASTQ analysis/manipulation/QC
Wikipedia FASTQ format page
Illumina library construction on GSAF user wiki - useful for contaminant detection or adapter removal
FastQC from Babraham Bioinformatics – http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
produces nice quality report for FASTQ files
MultiQC – http://multiqc.info/
A great tool for consolidating multiple QC reports into one HTML page
Anna's Byte Club tutorial on using MultiQC – https://utexas.atlassian.net/wiki/display/bioiteam/Using+MultiQC
cutadapt – https://cutadapt.readthedocs.io/en/stable/
An excellent command line tool for adapter sequence removal
Good support for trimming paired-end datasets
Script that handles the details of paired-end read trimming
/work2/projects/BioITeam/common/script/trim_adapters.sh
trimmomatic – http://www.usadellab.org/cms/?page=trimmomatic
Supports trimming paired-end datasets.
fastx toolkit – http://hannonlab.cshl.edu/fastx_toolkit/
Suite of command line tools for FASTQ and FASTA analysis and manipulation
Good for hard clipping, FASTA file manipulations
Documentation at: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html
seqtk – https://github.com/lh3/seqtk
Suite of command line tools for FASTQ and FASTA analysis and manipulation
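The quality strings that the FASTQ tools above report on and trim by can be decoded by hand. The following is a minimal sketch, not how any of the listed tools is implemented. It assumes unwrapped 4-line records and Phred+33 quality encoding; the names `FastqRecord`, `read_fastq`, `phred_scores` and `mean_quality` are hypothetical, chosen only for illustration.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream (assumes 4 lines per record)."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header.strip())
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header.strip())
        yield FastqRecord(header[1:].rstrip("\n"), sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to integer scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]


def mean_quality(record: FastqRecord) -> float:
    """Average base quality of one read; 0.0 for an empty read."""
    scores = phred_scores(record.quality)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Made-up read, for illustration only.
    example = io.StringIO("@read1\nACGT\n+\nII5!\n")
    for rec in read_fastq(example):
        print(rec.name, phred_scores(rec.quality), mean_quality(rec))
```

For real datasets, FastQC and the toolkits listed above remain the better choice; the sketch only shows what the fourth line of each record encodes.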
Reference genomes
Gencode – https://www.gencodegenes.org/
reference genomes, transcriptomes and high-quality annotations for human and mouse
UCSC downloads – http://hgdownload.cse.ucsc.edu/downloads.html
reference genomes, transcriptomes and high-quality annotations for many eukaryotes
Ensembl downloads – http://ftp.ensembl.org/pub
reference genomes, transcriptomes and high-quality annotations for many eukaryotes
NCBI
RefSeq – https://www.ncbi.nlm.nih.gov/refseq/
well curated genome, transcriptome sequences
GenBank – https://www.ncbi.nlm.nih.gov/genbank/
public repository for sequence data, especially for prokaryotic genomes
not curated
Reference genome vocabulary – https://software.broadinstitute.org/gatk/documentation/article?id=7857
excellent introduction to the types of genome references and the vocabulary used to describe them
aimed at higher eukaryotes but vocabulary useful nonetheless
GATK blog describing ALT contigs in GRCh38 – https://software.broadinstitute.org/gatk/blog?id=8180
Support for mapping to ALT contigs containing variants
bwa mem + bwakit by Heng Li – https://github.com/lh3/bwa/blob/master/README-alt.md
Basic alignment and aligners
File formats
input: FASTQ format
output: the SAM (Sequence Alignment Map) format specification
SAM1.pdf – header fields, body fields, flag definitions
https://github.com/samtools/hts-specs/blob/master/SAMtags.pdf – tag fields
Aligners
bwa (Burrows-Wheeler Aligner) by Heng Li – http://bio-bwa.sourceforge.net/
fast, sensitive and easy to use
bowtie2 – http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
fast, sensitive and extremely configurable
Comparison of different aligners
by Heng Li, developer of bwa, samtools, and many other bioinformatics tools
The BioITeam has some TACC-aware alignment scripts you might find useful:
bwa alignment
/work/projects/BioITeam/common/script/align_bwa_illumina.sh
bowtie2 alignment
/work/projects/BioITeam/common/script/align_bowtie2_illumina.sh
merging sorted BAM files (read-group aware)
/work/projects/BioITeam/common/script/merge_sorted_bams.sh
kallisto pseudo-alignment to annotated transcripts
/work/projects/BioITeam/common/script/run_kallisto.sh
also available on many BRCF pods under /mnt/bioi/script.
many pre-built references also available in /mnt/bioi/ref_genome
email or come talk to Anna if you have questions or problems
Transcriptome-aware aligners
HISAT2 – https://daehwankimlab.github.io/hisat2/
fast, with support for alignment to a single genome or to a "population" of genomes
paper: http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html
STAR (Spliced Transcripts Alignment to a Reference) – ultra-fast RNA-seq aligner
TopHat - http://ccb.jhu.edu/software/tophat/index.shtml
exon-aware sequence alignment (uses bowtie2/bowtie)
kallisto - https://pachterlab.github.io/kallisto/about
ultra-fast RNA-seq pseudoaligner that goes straight from FASTQ to estimated transcript abundances
Alignment analysis
SAM (Sequence Alignment Map) format specification (SAM1.pdf)
Translate SAM file flags web calculator: http://broadinstitute.github.io/picard/explain-flags.html
type in a decimal number to see which flags are set
samtools – by Heng Li
SAM/BAM conversion, flag filtering, sorting, indexing, duplicate filtering
older 0.1.xx versions: http://samtools.sourceforge.net/
newer 1.3+ versions: http://www.htslib.org/
Picard toolkit – http://broadinstitute.github.io/picard/
SAM/BAM utilities that are read-group aware
especially MarkDuplicates for flagging duplicate alignments
bedtools – http://bedtools.readthedocs.org/en/latest/
All sub-commands: http://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html
Swiss army knife for all manner of common BED, BAM, VCF, GFF/GTF file manipulation.
See BEDTools Overview for some common use cases.
Available in the TACC module system
RNA-seq QC, metrics & plotting tools:
RSeQC – http://rseqc.sourceforge.net/
RNA-SeQC (Broad Institute) –
RNA-QC-Chain – http://bioinfo.single-cell.cn/rna-qc-chain.html
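The flag lookup done by the SAM flags web calculator listed above can also be sketched in a few lines. The bit values below follow the SAM specification; the short labels and the function name `decode_flag` are hypothetical and chosen here for readability, not taken from Picard or samtools.

```python
from typing import List

# Bit values from the SAM specification; labels are informal shorthand.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag: int) -> List[str]:
    """Return the labels of all bits set in a decimal SAM FLAG value."""
    if flag < 0 or flag > 0xFFF:
        raise ValueError("SAM flags use 12 bits (0-4095), got %d" % flag)
    return [label for bit, label in SAM_FLAG_BITS if flag & bit]


if __name__ == "__main__":
    for value in (99, 147, 4):
        print(value, decode_flag(value))
```

For example, flag 99 decodes to paired, proper_pair, mate_reverse and first_in_pair, which is a typical value for the forward read of a properly aligned pair.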
File formats and conversion
SAM format specification – http://samtools.github.io/hts-specs/SAMv1.pdf
crucial for performing format conversions, of which ChIP-seq analysis can have many
HTS format specifications – http://samtools.github.io/hts-specs/
clearinghouse page for a number of NGS formats (SAM, CRAM, VCF, BCF, etc.)
Genome browser file formats – http://genome.ucsc.edu/FAQ/FAQformat.html
BED, bedGraph, narrowPeak and many more
SRA (Sequence Read Archive) from NCBI
BioITeam script for converting GTF/GFF3 files to BED format
/work/projects/BioITeam/common/script/gtf_to_bed.pl
UCSC file format conversion scripts - useful for getting to/from WIG and BED to corresponding binary formats
Make sure you download the correct scripts for your operating system!
Also available as a BioContainers module
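A detail that trips up many format conversions is the coordinate system: GTF/GFF intervals are 1-based and end-inclusive, while BED intervals are 0-based and end-exclusive. The sketch below shows that adjustment for a single GTF line. It is an illustration only and is not the BioITeam gtf_to_bed.pl script; the function name `gtf_line_to_bed6` and the choice to use `gene_id` as the BED name are assumptions made here.

```python
import re


def gtf_line_to_bed6(line: str) -> str:
    """Convert one tab-delimited GTF feature line to a BED6 line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GTF columns, got %d" % len(fields))
    chrom, _source, feature, start, end, score, strand, _frame, attributes = fields

    # Assumption: name the BED feature by gene_id; fall back to the feature type.
    match = re.search(r'gene_id "([^"]+)"', attributes)
    name = match.group(1) if match else feature

    bed_start = int(start) - 1  # 1-based inclusive -> 0-based start
    bed_end = int(end)          # inclusive end equals the exclusive end value

    # Simplification: a missing GTF score (".") becomes 0; other values pass through.
    bed_score = "0" if score == "." else score
    return "\t".join([chrom, str(bed_start), str(bed_end), name, bed_score, strand])


if __name__ == "__main__":
    # Made-up feature, for illustration only.
    gtf = 'chr1\texample\texon\t101\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "txA";'
    print(gtf_line_to_bed6(gtf))
```

Note that the feature length is unchanged by the conversion: 200 - 101 + 1 in GTF terms and 200 - 100 in BED terms both give 100 bases.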
UCSC Genome Browser
Main UCSC Genome Browser web site
File formats - BED format especially is widely used
Table browser - Browse and download data in different formats
ENCODE data downloads at UCSC - useful for getting data to work with
Beta Test browser site - most up-to-date datasets and features; can be buggy
RNAseq/Transcriptome analysis
General RNA-seq Differential Gene Expression (DGE) analysis workflow from R's Bioconductor:
Gene quantification from BAM/BED file reads
featureCounts (part of the Subread package) – http://subread.sourceforge.net/
HISAT2, StringTie, Ballgown suite – https://ccb.jhu.edu/software/hisat2/index.shtml
transcriptome-aware alignment & quantification from the Johns Hopkins group who brought you the Tuxedo pipeline – but much faster!
paper: http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html
DESeq2 – R Bioconductor package for DGE
DESeq (version 1) documentation:
https://bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf
while DESeq2 is more sophisticated, reading the original documentation is a better introduction to concepts
DESeq2 documentation:
kallisto – https://pachterlab.github.io/kallisto/
RNA-seq pseudoaligner that goes straight from FASTQ to estimated transcript abundances
blindingly fast – but only to transcriptome
companion differential expression analysis tool is sleuth – http://pachterlab.github.io/sleuth/about
overview presentation – 2015-10-21-Kallisto.Anna.pdf
The Tuxedo pipeline: RNAseq with tophat/cufflinks
one of the first tool suites for transcriptome-aware RNA-seq alignment and quantification
rarely used now, as other tools are much faster & more accurate
TopHat - http://ccb.jhu.edu/software/tophat/index.shtml
exon-aware sequence alignment (uses bowtie2/bowtie)
resource bundles for selected organisms (GFF annotations, pre-built bowtie2 references, etc.)
cuffquant, cuffnorm, cufflinks – http://cole-trapnell-lab.github.io/cufflinks/manual/
transcript quantification, normalization, differential expression
Dhivya Arasappan's Introduction to RNA Seq CBRS 2021 summer school course
Variant calling
Broad Institute GATK (Genome Analysis Toolkit) – https://software.broadinstitute.org/gatk/documentation/
complex but powerful
used by TCGA (The Cancer Genome Atlas), 1000 Genomes
File formats
VCF (Variant Call Format) v4.0 - initially developed by 1000 Genomes project
MAF (Mutation Annotation Format) – developed by The Cancer Genome Atlas (TCGA)