SSC Intro to NGS Bioinformatics Course

Day 1: Unix/TACC Introduction and Read Mapping

Day 2: Calling Genome Variants

Mapped read data evaluation (SAMtools)
Installing and compiling tools on Unix
Integrated pipeline for microbial genome re-sequencing analysis (breseq)
Calling variants in mixed populations
Calling variants in diploid or multiploid genomes
Practical advice: pitfalls to watch for in short-read re-sequencing data
UCSC genome browser, SRA data downloads - 30 min - Anna
Variant calling with GATK (use their wiki); more detail on the .vcf format; look at human 1000 Genomes VCF files and describe how to access 1000 Genomes data - Anna, Scott
Comparing variants across samples: bedtools/annovar/snpeff/plink/vaast/qiime - Dhivya & Scott; shell/perl/python scripting; candidates for recessive disease

Day 3: RNA-seq

Differential gene expression analysis
Splice variant analysis
<break>
Non-coding RNA analysis: unique mapping (shrimp/grep); miRNA abundance/editing; other: snoRNA, snRNA, lincRNA, piRNA, tRNA, degradome, etc. (not poly-A; not annotated)
Transcriptome assembly & annotation: velvet/oases, TrinityRNAseq; BLAST, GOminer, (ELI?)

Day 4: Assembly and Annotation

Annotating noncoding RNAs

Resources

Misc


Rough Draft

All notes/lessons/etc. go under the BioITeam wiki. Next meeting: next Thursday, May 16th, 11 am.

Survey ahead of time if possible – Aaron to create draft, run by S&J – learn nano – send seqanswers “just got your data” post.

Setup stuff on local machines: IGV; BAM/BAI files from all mappings; R; LAMP stack?? GMOD??

Data pre-loaded on corral:
a) bacterial genome specified by Geoff & BAM files from Day 1
b) 1000 Genomes data (Anna)
c) sample exome (Scott)

Everyone to update Wiki

Agenda:
Day 1: Beginnings
• 45 min Linux refresher – SPHS handout & ppt – autocomplete/pipes/ssh/wget (genome upfront – download data from Corral)
• 20 min Introduction to mapping – launchers at TACC; mapping by 3-4 different mappers – BWA/bowtie/shrimp/SSAHA2 (bfast?) – Geoff – must have outputs ready.
• 30 min Using TACC – dir.struct/module+spider/scp/bbcp/group sharing – Aaron
• 30 min Run variant caller (samtools) on login node at TACC, scp, view output in IGV (requires bacterial genome & gff pre-installed for the ref) – Geoff
<break>
• 15 min Input and output file formats – handout: FASTQ/BAM – Anna; technology-specific output – Scott
• 15 min Building a reference - Anna
• 60 min ADVANCED session: EXERCISE: shell scripting of mapper & variant caller: given a ref & reads, the script will produce a list of mutations – Daechan

Day 2: Mapping & Variants
• 30 min Mapped data evaluation with samtools
• 30 min Installing/compiling tools (aside from “module”): JB’s tool – google code – Jeff
• 45 min View output & compare: false pos/neg comparisons; what’s hard & weird – in IGV (GVF file/VCF file/BAMs) – Jeff. Geoff will check vcftools and also assess freebayes. @GRC: bedtools will be needed to compare.
<break>
• 30 min UCSC genome browser, GEO/SRA data downloads – Anna
• 30 min Variant calling with GATK (use their wiki); more detail on the .vcf format; look at human 1000 Genomes VCF files and describe how to access 1000 Genomes data – Anna, Scott
• 60 min Characterizing & comparing variant files – annovar/snpeff/plink/vaast/qiime – Dhivya & Scott; shell/perl/python scripting; candidates for recessive disease

• ADVANCED: SAM files: parsing; picard tools (read groups; validation is important, check return codes); flagstat; filtering (e.g. only use properly paired reads); insert size distribution; mapping %; mapping bias by read or by genome location (e.g. on) – Anna & Scott (BED tools); calling variants in mixed populations (freebayes); ChIP-seq analysis???
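
Possible starting point for the ADVANCED SAM parsing item: a minimal Python sketch of what "flagstat" and "only use properly paired reads" mean at the level of the SAM FLAG column. The bit values (0x1 paired, 0x2 proper pair, 0x4 unmapped) come from the SAM format; the function names and the example records are made up for illustration and are not course data. It ignores secondary and supplementary alignments, so the counts will not always match samtools flagstat on real files.

```python
# Bit values of the SAM FLAG field (second column of each alignment record).
FLAG_PAIRED = 0x1       # read is part of a pair
FLAG_PROPER_PAIR = 0x2  # both mates aligned as a proper pair
FLAG_UNMAPPED = 0x4     # read itself is unmapped


def is_header(line):
    """SAM header lines start with '@'."""
    return line.startswith("@")


def read_flag(line):
    """Return the integer FLAG of one tab-separated SAM record."""
    return int(line.rstrip("\n").split("\t")[1])


def is_properly_paired(flag):
    """True when both the paired bit and the proper-pair bit are set."""
    return bool(flag & FLAG_PAIRED) and bool(flag & FLAG_PROPER_PAIR)


def summarize(lines):
    """Count total, mapped, and properly paired records (flagstat-like)."""
    counts = {"total": 0, "mapped": 0, "properly_paired": 0}
    for line in lines:
        if not line.strip() or is_header(line):
            continue
        flag = read_flag(line)
        counts["total"] += 1
        if not flag & FLAG_UNMAPPED:
            counts["mapped"] += 1
        if is_properly_paired(flag):
            counts["properly_paired"] += 1
    return counts


def properly_paired_only(lines):
    """Keep header lines and records whose FLAG marks a proper pair."""
    return [
        line for line in lines
        if is_header(line) or is_properly_paired(read_flag(line))
    ]


def example_record(name, flag):
    """Build a made-up 11-column SAM record with the given FLAG."""
    return "\t".join(
        [name, str(flag), "chr1", "100", "60", "4M", "=", "200", "104", "ACGT", "IIII"]
    )


# Hypothetical input: one header line and four records.
EXAMPLE = [
    "@HD\tVN:1.0\tSO:unsorted",
    example_record("r1", 99),   # paired, proper pair, first in pair
    example_record("r1", 147),  # paired, proper pair, second in pair
    example_record("r2", 65),   # paired, but not a proper pair
    example_record("r3", 4),    # unmapped
]

if __name__ == "__main__":
    stats = summarize(EXAMPLE)
    print(stats)
    print("mapping %:", 100.0 * stats["mapped"] / stats["total"])
```

On a real file the same functions could be fed lines from an open SAM file; BAM would first need converting to text (for example with samtools view), which this sketch does not do.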

Day 3: RNA-seq Scott
• 60 min Quantitation & statistics: map & count (Jeff); normalization; tophat/cufflinks(cuffmerge)/cuffdiff – human & E. coli. Maybe a digression into R?
• 30 min Splice variant analysis: continue from tophat
<break>
• 60 min Non-coding RNA analysis: unique mapping (shrimp/grep); miRNA abundance/editing; other: snoRNA, snRNA, lincRNA, piRNA, tRNA, degradome, etc. (not poly-A; not annotated)
• 60 min Transcriptome assembly & annotation: velvet/oases, TrinityRNAseq; BLAST, GOminer, (ELI?)

Day 4: Assembly
• 90 min de novo assembly of E. coli: velvet (Aaron), mira (reference-guided, Aaron), Allpaths(-LG) (Scott, optional); mention: abyss, SOAPdenovo
• 30 min Finding and annotating genes – maker, glimmer; web tools: JCVI, NCBI, psi-blast & CDD; pfam/rfam (Scott & Jeff)
• 45 min Evaluating & visualizing assemblies – Comparing: treat assembler output as a reference genome and proceed with prior tools – challenges: contigs, errors; Visualizing: mauve, circos (may need install help) (Scott); cgview (Jeff).
• 30 min Genome databases: Introduction to GMOD and/or SequenceServer (can we stand up web servers on the class computers?) (Scott)

Notes from 5/17/12:
Conventions decided on: expands for hints, formats for command prompt/code
Aaron to write .sge maker script
All qsubs will run "./commands"
Every example must have a "commands" file.
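
A rough sketch of what the .sge maker could look like, assuming a Python helper that writes the text of an SGE submission script whose only job step is "./commands". The "#$" directives used (-N, -cwd, -V, -j y, -q, -l h_rt, -pe) are standard SGE options; the function name, the job name, the queue name, and the optional parallel environment string are placeholders, not the actual script Aaron will write or TACC's required settings.

```python
def make_sge_script(job_name, queue, hours, parallel_env=None):
    """Return the text of an SGE job script that runs ./commands.

    job_name, queue, and parallel_env are site-specific values supplied by
    the caller; nothing here is checked against a real cluster.
    """
    if hours <= 0:
        raise ValueError("hours must be a positive integer")

    lines = [
        "#!/bin/bash",
        f"#$ -N {job_name}",              # job name
        "#$ -cwd",                        # run from the submission directory
        "#$ -V",                          # export the current environment
        "#$ -j y",                        # merge stderr into stdout
        f"#$ -q {queue}",                 # queue to submit to
        f"#$ -l h_rt={hours:02d}:00:00",  # wall-clock limit
    ]
    if parallel_env:
        # Parallel environment and slot count, e.g. "<pe_name> <slots>".
        lines.append(f"#$ -pe {parallel_env}")

    # By convention every example directory provides a "commands" file.
    lines += ["", "./commands"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values; replace with the queue used for the class.
    print(make_sge_script("map_reads", queue="development", hours=1))
```

Writing the returned text to a file such as launcher.sge (a made-up name) and passing it to qsub would then run whatever is in the "commands" file of that example directory, provided "commands" is executable.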