A healthy taste of resources available, specifically for this course - not a comprehensive catalog.
| Table of Contents |
|---|
Sequencing Technologies
Community Resources
...
Linux/TACC
- Linux fundamentals on this wiki
- Wikis for the 3 CBRS Unix/Linux workshops:
Online tutorials:
- Ryan's Linux Tutorial: http://ryanstutorials.net/linuxtutorial/
- Unix bootcamp for biologists: http://korflab.ucdavis.edu/bootcamp.html
- Unix primer (longer version) for biologists:
Community Resources
- UCSC Genome Browser - visualize and download NGS data (see more below)
- Broad Institute Integrated Genomics Viewer (IGV) - especially good for visualizing BAM file details
Getting started with Linux and NGS
- Cheat sheet of useful Unix commands
- Introduction to Sequence analysis in the Amazon EC2 cloud
- where you can "rent" Linux machines (useful if you don't have access to TACC or BRCF pods)
- Galaxy website for online sequencing data analysis
- SEQanswers forum - many NGS sequencing questions answered here
- A funny SEQanswers post about biologists starting to analyze NGS data: http://seqanswers.com/forums/showthread.php?t=4589
...
Sequencing Technologies
- Overviews
Technology intros
- Illumina (Solexa) – most common "short" (< 300 bp) read sequencing
- Newer single molecule sequencing
- Single cell sequencing
- Older technologies (not common now)
- Life Technologies SOLiD (short reads in "colorspace")
- Roche/454 – longer reads (several hundred bp, up to about 1 Kb), often used in assemblies
FASTQ analysis/manipulation/QC
- Wikipedia FASTQ format page
- Illumina library construction on GSAF user wiki - useful for contaminant detection or adapter removal
- FastQC from Babraham Bioinformatics – http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- produces nice quality report for FASTQ files
- MultiQC – http://multiqc.info/
- A great tool for consolidating multiple QC reports into one HTML page
- Anna's Byte Club tutorial on using MultiQC – https://utexas.atlassian.net/wiki/display/bioiteam/Using+MultiQC
- cutadapt – https://cutadapt.readthedocs.io/en/stable/
- An excellent command line tool for adapter sequence removal
- Good support for trimming paired-end datasets
- Script that handles the details of paired-end read trimming
/work2/projects/BioITeam/common/script/trim_adapters.
- FASTX Toolkit - Command line tools for fastq analysis and manipulation
Alignment and aligners
- Jeff Barrick's introduction to NGS presentation
- Comparison of different aligners
- by Heng Li, developer of BWA and MAQ
- by Nils Homer, developer of BFAST
- Aligners
- bowtie (http://bowtie-bio.sourceforge.net/) - very fast, not very sensitive
- BFAST wiki & manual - slow and relatively complicated, but tunable sensitivity
- bwa -
sh
- trimmomatic – http://www.usadellab.org/cms/?page=trimmomatic
- Supports trimming paired-end datasets.
- fastx toolkit – http://hannonlab.cshl.edu/fastx_toolkit/
- Suite of command line tools for FASTQ and FASTA analysis and manipulation
- Good for hard clipping, FASTA file manipulations
- Documentation at: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html
- seqtk – https://github.com/lh3/seqtk
- Suite of command line tools for FASTQ and FASTA analysis and manipulation
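As a rough illustration of what the FASTQ tools above operate on, here is a minimal Python sketch that reads 4-line FASTQ records and decodes the quality string. It assumes the Phred+33 ("Sanger") encoding used by current Illumina output; the function names `read_fastq` and `phred_scores` are made up for this example and do not come from any of the listed tools.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Convert a quality string to integer Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in qual]


# Tiny in-memory example; a real file would be opened with open() or gzip.open().
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
name, seq, qual = records[0]
scores = phred_scores(qual)
mean_quality = sum(scores) / len(scores)
print(name, seq, scores, mean_quality)
```

For real datasets, FastQC or seqtk will be much faster and more thorough; the sketch is only meant to make the record layout and quality encoding concrete.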
Reference genomes
- Gencode – https://www.gencodegenes.org/
- reference genomes, transcriptomes and high-quality annotations for human and mouse
- UCSC downloads – http://hgdownload.cse.ucsc.edu/downloads.html
- reference genomes, transcriptomes and high-quality annotations for many eukaryotes
- Ensembl downloads – http://ftp.ensembl.org/pub
- reference genomes, transcriptomes and high-quality annotations for many eukaryotes
- NCBI
- RefSeq – https://www.ncbi.nlm.nih.gov/refseq/
- well curated genome, transcriptome sequences
- GenBank – https://www.ncbi.nlm.nih.gov/genbank/
- public repository for sequence data, especially for prokaryotic genomes
- not curated
- Reference genome vocabulary – https://software.broadinstitute.org/gatk/documentation/article?id=7857
- excellent introduction to the types of genome references and the vocabulary used to describe them
- aimed at higher eukaryotes but vocabulary useful nonetheless
- GATK blog describing ALT contigs in GRCh38 – https://software.broadinstitute.org/gatk/blog?id=8180
- Support for mapping to ALT contigs containing variants
- bwa mem + bwakit by Heng Li – https://github.com/lh3/bwa/blob/master/README-alt.md
Basic alignment and aligners
- File formats
- input: FASTQ format
- output: the SAM (Sequence Alignment Map) format specification
- SAM1.pdf – header fields, body fields, flag definitions
- https://github.com/samtools/hts-specs/blob/master/SAMtags.pdf – tag fields
- Aligners
- bwa (Burrows-Wheeler Aligner) by Heng Li – http://bio-bwa.sourceforge.net/
- fast, sensitive and easy to use
- bowtie2 – http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
- fast, sensitive and extremely configurable
- Comparison of different aligners
- by Heng Li, developer of bwa, samtools, and many other bioinformatics tools
- The BioITeam has some TACC-aware alignment scripts you might find useful:
- bwa alignment
/work/projects/BioITeam/common/script/align_bwa_illumina.sh
- bowtie2 alignment
/work/projects/BioITeam/common/script/align_bowtie2_illumina.sh
- merging sorted BAM files (read-group aware)
/work/projects/BioITeam/common/script/merge_sorted_bams.sh
- kallisto pseudo-alignment to annotated transcripts
/work/projects/BioITeam/common/script/run_kallisto.sh
- also available on many BRCF pods under /mnt/bioi/script.
- many pre-built references also available in /mnt/bioi/ref_genome
- email or come talk to Anna if you have questions or problems
Transcriptome-aware aligners
- HISAT2 – https://daehwankimlab.github.io/hisat2/
- fast, with support for alignment to single and "population" of genomes
- paper: http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html
- STAR (Spliced Transcripts Alignment to a Reference) – ultra-fast RNA-seq aligner
- TopHat - http://ccb.jhu.edu/software/tophat/index.shtml
- exon-aware sequence alignment (uses bowtie2/bowtie)
- kallisto - https://pachterlab.github.io/kallisto/about
- ultra-fast RNA-seq pseudoaligner that goes straight from FASTQ to estimated transcript abundances
Alignment analysis
- SAM (Sequence Alignment Map) format specification (SAM1.pdf)
- Translate SAM file flags web calculator: http://broadinstitute.github.io/picard/explain-flags.html
- type in a decimal number to see which flags are set
- samtools – by Heng Li
- SAM/BAM conversion, flag filtering, sorting, indexing, duplicate filtering
- older 0.1.xx versions: http://samtools.sourceforge.net/
- newer 1.3+ versions: http://www.htslib.org/
- Picard toolkit – http://broadinstitute.github.io/picard/
- SAM/BAM utilities that are read-group aware
- especially MarkDuplicates for flagging duplicate alignments
- SAMstat - produces detailed graphical statistics for sam/bam files.
- bedtools – http://bedtools.readthedocs.org/en/latest/
- All sub-commands: http://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html
- Swiss army knife for all manner of common BED, BAM, VCF, GFF/GTF file manipulation.
- See BEDTools Overview for some common use cases.
- Available in the TACC module system
- RNA-seq QC, metrics & plotting tools:
- RSeQC – http://rseqc.sourceforge.net/
- RNA-SeQC (Broad Institute)
- RNA-QC-Chain – http://bioinfo.single-cell.cn/rna-qc-chain.html
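The SAM flag calculator listed above takes a decimal number and reports which flags are set. The same decoding can be sketched in a few lines of Python. The bit values follow the SAM format specification; the short descriptions and the function name `decode_sam_flag` are this example's own wording, not part of samtools or Picard.

```python
# Bit values from the SAM specification; descriptions are paraphrased.
SAM_FLAGS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "secondary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_sam_flag(flag):
    """Return the descriptions of all bits set in a decimal SAM flag."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


for value in (99, 4, 1024):
    print(value, decode_sam_flag(value))
```

For example, a flag of 99 (1 + 2 + 32 + 64) is a properly paired first-in-pair read whose mate aligned to the reverse strand.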
File formats and conversion
- SAM format specification – http://samtools.github.io/hts-specs/SAMv1.pdf
- crucial for performing format conversions, of which ChIP-seq analysis can have many
- HTS format specifications – http://samtools.github.io/hts-specs/
- clearinghouse page for a number of NGS formats (SAM, CRAM, VCF, BCF, etc.)
- Genome browser file formats – http://genome.ucsc.edu/FAQ/FAQformat.html
- BED, bedGraph, narrowPeak and many more
- SRA (Sequence Read Archive) from NCBI
- BioITeam script for converting GTF/GFF3 files to BED format
/work/projects/BioITeam/common/script/gtf_to_bed.pl
- UCSC file format conversion scripts - useful for getting to/from WIG and BED to corresponding binary formats
- Make sure you download the correct scripts for your operating system!
- Also available as a BioContainers module
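One detail that trips up many format conversions is the coordinate system: GTF/GFF3 positions are 1-based and inclusive, while BED starts are 0-based and BED ends are exclusive. The Python sketch below shows just that coordinate shift for a single GTF line. It is an illustration only, not the BioITeam gtf_to_bed.pl script; the function name `gtf_line_to_bed6` and the choice to use the feature type as the BED name column are assumptions made for this example.

```python
def gtf_line_to_bed6(line):
    """Convert one tab-delimited GTF line into a BED6 line.

    GTF: 1-based, inclusive start and end.
    BED: 0-based start, exclusive end, so only the start shifts by one.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, _source, feature, start, end, score, strand = fields[:7]
    bed_start = int(start) - 1
    bed_end = int(end)
    # Assumption for this sketch: write a missing GTF score (".") as 0 in BED.
    bed_score = "0" if score == "." else score
    return "\t".join([chrom, str(bed_start), str(bed_end), feature, bed_score, strand])


gtf_line = 'chr1\texample\texon\t101\t200\t.\t+\t.\tgene_id "g1";'
bed_line = gtf_line_to_bed6(gtf_line)
print(bed_line)
```

Note that the interval length is preserved: GTF 101..200 and BED 100..200 both describe 100 bases.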
UCSC Genome Browser
...
- Main UCSC Genome Browser web site
- File formats - BED format especially is widely used
- Table browser - Browse and download data in different formats
- ENCODE data downloads at UCSC - useful for getting data to work with
- Beta Test browser site - most up-to-date datasets and features; can be buggy
Variant calling
- The 1000 Genomes project - catalog of human genetic variants
- Tools
- Broad Institute GATK - complex but powerful; used by 1000 Genomes
- File formats
- VCF (Variant Call Format) v4.0 - developed by 1000 Genomes project
RNAseq/Transcriptome analysis
- General RNA-seq Differential Gene Expression (DGE) analysis workflow from R's Bioconductor:
- Gene quantification from BAM/BED file reads
- featureCounts (part of the Subread package) – http://subread.sourceforge.net/
- HTSeq – https://htseq.readthedocs.io/en/master/
- HISAT2, StringTie, Ballgown suite – https://ccb.jhu.edu/software/hisat2/index.shtml
- transcriptome-aware alignment & quantification from the Johns Hopkins group who brought you the Tuxedo pipeline – but much faster!
- paper: http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html
- DESeq2 – R Bioconductor package for DGE
- DESeq (version 1) documentation:
- https://bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf
- while DESeq2 is more sophisticated, reading the original documentation is a better introduction to concepts
- DESeq2 documentation:
- kallisto – https://pachterlab.github.io/kallisto/
- RNA-seq pseudoaligner that goes straight from FASTQ to estimated transcript abundances
- blindingly fast – but only to transcriptome
- companion quantification tool is sleuth – http://pachterlab.github.io/sleuth/about
- overview presentation – 2015-10-21-Kallisto.Anna.pdf
- The Tuxedo pipeline: RNAseq with tophat/cufflinks
- one of the first tool suites for transcriptome-aware RNA-seq alignment and quantification
- rarely used now, as other tools are much faster & more accurate
- RNAseq analysis protocol article in Nature Protocols
- TopHat – http://ccb.jhu.edu/software/tophat/index.shtml
- exon-aware sequence alignment (uses bowtie2/bowtie)
- resource bundles for selected organisms (GFF annotations, pre-built bowtie2 references, etc.)
- cuffquant, cuffnorm, cufflinks – http://cole-trapnell-lab.github.io/cufflinks/manual/
- transcript quantification, normalization, differential expression
- Dhivya Arasappan's Introduction to RNA Seq CBRS 2021 summer school course
Variant calling
- Broad Institute GATK (Genome Analysis Tool Kit) – https://software.broadinstitute.org/gatk/documentation/
- complex but powerful
- used by TCGA (The Cancer Genome Atlas), 1000 Genomes
Format converters and miscellaneous tools
- SRA (Sequence Read Archive) from NCBI
- overview on this wiki
- SRA search home page
- SRA Toolkit
- Mason program for simulating second-generation sequencing reads.
Other courses with online tutorials
- 2012 Next-Gen Sequence Analysis Workshop (Michigan State University) has similar tutorials to our course, but also includes introductions to using Amazon EC2, where you can "rent" Linux machines (useful if you don't have access to TACC), Python, R, ChIP-Seq, etc.
- File formats
- VCF (Variant Call Format) v4.0 - initially developed by 1000 Genomes project
- MAF (Mutation Annotation Format) – developed by The Cancer Genome Atlas (TCGA)
- The International Genome Sample Resource – follow-on to the 1000 Genomes project
- catalog of human genetic variants
- Dan Deatherage's Genome Variant Analysis CBRS 2021 summer school course
Genome Annotation
- GO – http://geneontology.org/
- The Gene Ontology resource, a large source of information on the functions of genes
- GOrilla – http://cbl-gorilla.cs.technion.ac.il/
- Gene Ontology enRIchment anaLysis and visuaLizAtion tool
- GSEA – https://www.gsea-msigdb.org
- Gene Set Enrichment Analysis
- DAVID – https://david.ncifcrf.gov/
- Functional annotation from user-supplied gene lists
- GREAT – http://bejerano.stanford.edu/great/public/html/splash.php
- Genomic Regions Enrichment of Annotations Tool
- Takes bed files as input and outputs enriched genes, GO-terms, motifs, etc.
- human, mouse, zebrafish
- MEME-suite – http://meme-suite.org/
- A motif identification and discovery tool. Works with most species.
- Takes FASTA files as input
- filter your BAM/BED files to get the regions of interest
- then extract the sequences as FASTA using bedtools getfasta (bedtools bamtofastq converts BAM to FASTQ, not FASTA).
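To make the "regions of interest to FASTA" step concrete, here is a small Python sketch of what pulling sequences for BED intervals amounts to, which is conceptually what a tool such as bedtools getfasta does. Everything here is illustrative: the function name `bed_to_fasta` is invented for this example, the "genome" is a toy in-memory dictionary rather than an indexed reference FASTA, and strand is ignored.

```python
def bed_to_fasta(bed_lines, genome):
    """Return FASTA text for each BED interval.

    bed_lines: iterable of BED lines (0-based start, exclusive end).
    genome: dict mapping chromosome name to its sequence string
            (a toy stand-in for a real reference genome).
    """
    out = []
    for line in bed_lines:
        # Skip blank lines and the comment/track/browser lines BED files may carry.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        out.append(f">{chrom}:{start}-{end}")
        # BED coordinates map directly onto Python slicing.
        out.append(genome[chrom][start:end])
    return "".join(entry + "\n" for entry in out)


toy_genome = {"chr1": "ACGTACGTAC"}
toy_bed = ["# peaks of interest", "chr1\t2\t6\tpeak1"]
fasta_text = bed_to_fasta(toy_bed, toy_genome)
print(fasta_text, end="")
```

On real data, use bedtools with an indexed reference; the sketch only shows how BED coordinates select the bases that end up in the FASTA file given to MEME.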