Day 1: Linux/TACC Introduction and Read Mapping
- Linux refresher
- Using TACC's Lonestar cluster
- Introduction to mapping (bowtie, BWA)
- Introduction to variant calling (SAMtools)
Extras
- Introduction to Bioinformatics Prezi
- Download Presentation on Mappers, etc.
- Workflow diagram of variant calling
- Diagram of Lonestar's directories
- Diagram of running a job on Lonestar
Day 2: Calling Genome Variants
- Using the Integrative Genomics Viewer (IGV)
- Shell Scripting
- Mapped read data evaluation (SAMtools)
- Installing Linux tools
- Identifying mutations in microbial genomes (breseq)
Extras (come early Day 3)
Additional topics
- SRA toolkit, UCSC Genome Browser
- Variant calling with GATK
- Genome variation in mixed samples (FreeBayes, deepSNV)
- Identifying structural variants (SVDetect)
Even More Extras
- Download presentation on Advanced Genome Variant Calling
- Practical advice - short read re-sequencing data
Day 3: RNA-seq
- Differential gene expression analysis
- Differential expression with splice variant analysis
- Transcriptome assembly & annotation
Extras
Day 4: Assembly and Annotation
- Genome Assembly
- Genome Assembly (velvet)
- Genome Annotation (Glimmer3)
- Evaluating & Visualizing assemblies
- Custom Genome Databases
Resources
- Resources: tool list, file formats & more
- Scott's list of Linux one-liners
- Example BWA alignment script
- Exercises
Misc
Link to the GSAF web site
Rough Draft
All notes/lessons/etc. under bioiteam wiki – next meeting next Thursday May 16th, 11 am.
Survey ahead of time if possible – Aaron to create draft, run by S&J – learn nano – send SEQanswers “just got your data” post.
Setup stuff on local machines: IGV; BAM/BAI files from all mappings; R; LAMP stack?? GMOD??
Data pre-loaded on corral:
a) Bacterial genome specified by Geoff & BAM files from Day 1
b) 1000 Genomes data (Anna)
c) sample exome (Scott)
Everyone to update Wiki
Agenda:
Day 1: Beginnings
• 45 min Linux refresher – SPHS handout & ppt – autocompletion/pipes/ssh/wget (genome up front – download data from Corral)
• 20 min Introduction to mapping – launchers at TACC; mapping by 3-4 different mappers – BWA/bowtie/shrimp/SSAHA2 (bfast?) – Geoff – must have outputs ready.
• 30 min Using TACC – directory structure/module + spider/scp/bbcp/group sharing – Aaron
• 30 min Run variant caller (samtools) on login node at TACC, scp, view output in IGV (requires bacterial genome & GFF pre-installed for the ref) – Geoff
<break>
• 15 min Input and output file formats – handout: FASTQ/BAM – Anna; technology-specific output – Scott
• 15 min Building a reference - Anna
• 60 min ADVANCED session: EXERCISE: shell scripting of mapper & variant caller: given ref & reads, will produce mutations; Daechan
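The scripting exercise above (given a reference and reads, produce mutations) could be sketched roughly as below. This is only an illustration: it assumes current releases of BWA, samtools, and bcftools, whose options may differ from the versions used in class, and the file names and the `build_pipeline` helper are hypothetical. It only builds the command lines in order; it does not run them.

```python
import shlex


def build_pipeline(ref, reads, prefix):
    """Return shell commands for a map-then-call workflow, in run order.

    Assumes recent BWA / samtools / bcftools; options are not taken
    from the course materials.
    """
    ref, reads = shlex.quote(ref), shlex.quote(reads)
    sam = shlex.quote(prefix + ".sam")
    bam = shlex.quote(prefix + ".sorted.bam")
    vcf = shlex.quote(prefix + ".vcf")
    return [
        f"bwa index {ref}",                      # build the mapper's index
        f"bwa mem {ref} {reads} > {sam}",        # map reads to the reference
        f"samtools sort -o {bam} {sam}",         # coordinate-sort into BAM
        f"samtools index {bam}",                 # index the sorted BAM
        # pile up reads against the reference and call variant sites
        f"bcftools mpileup -f {ref} {bam} | bcftools call -mv -o {vcf}",
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for command in build_pipeline("ref.fasta", "reads.fastq", "sample"):
        print(command)
```

Printing the commands, one per line, also fits the idea of keeping each example's commands in a single file that a job can run.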
Day 2: Mapping & Variants
• 30 min Mapped data evaluation with samtools
• 30 min Installing/compiling tools: (aside from “module”) JB’s tool – Google Code – Jeff
• 45 min View output & compare: false pos/neg comparisons; what’s hard & weird – in IGV (GVF file/VCF file/BAMs).
<break>
• 30 min UCSC genome browser, GEO/SRA data downloads Anna
• 30 min Variant calling with GATK (use their wiki), more detail on the .vcf format; look at human 1000 Genomes VCF files and describe how to access 1000 Genomes data – Anna & Scott
• 60 min Characterizing & comparing variant files – annovar/snpeff/plink/vaast/qiime – Dhivya & Scott; shell/perl/python scripting – candidates for recessive disease
• ADVANCED: SAM files: parsing, picard tools (read groups, validation is important/check return codes), flagstat, filter (e.g. only use properly paired reads), insert size dist, mapping %, mapping bias by read or by genome location (e.g. on) Anna & Scott (BED tools), calling variants in mixed populations (freebayes), ChIP-seq analysis???
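For the SAM parsing topics above (filtering to properly paired reads, mapping %), a minimal sketch of working with the FLAG field follows. The bit values come from the SAM specification; the function names are hypothetical, and the mapped fraction here counts alignment records rather than distinct reads. The filter selects roughly what `samtools view -f 2` would.

```python
PROPER_PAIR = 0x2  # SAM FLAG bit: each segment properly aligned
UNMAPPED = 0x4     # SAM FLAG bit: segment unmapped


def keep_properly_paired(sam_lines):
    """Yield header lines plus alignments with the proper-pair bit set."""
    for line in sam_lines:
        if line.startswith("@"):          # header lines pass through
            yield line
            continue
        flag = int(line.split("\t", 2)[1])  # FLAG is the second column
        if flag & PROPER_PAIR:
            yield line


def mapped_fraction(sam_lines):
    """Return the fraction of alignment records without the unmapped bit."""
    total = mapped = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue
        total += 1
        flag = int(line.split("\t", 2)[1])
        if not flag & UNMAPPED:
            mapped += 1
    return mapped / total if total else 0.0


if __name__ == "__main__":
    import sys

    # Usage sketch: python filter_sam.py < in.sam > proper_pairs.sam
    for record in keep_properly_paired(sys.stdin):
        sys.stdout.write(record)
```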
Day 3: RNA-seq Scott
• 60 min Quantitation & statistics: map & count – Jeff; normalization; tophat/cufflinks(cuffmerge)/cuffdiff – human & E. coli. Maybe a digression into R?
• 30 min Splice variant analysis: continue from tophat
<break>
• 60 min non-coding RNA analysis: unique mapping (shrimp/grep), miRNA abundance/editing, other: snoRNA, snRNA, lincRNA, piRNA, tRNA, degradome, etc. (not poly-A; not annotated)
• 60 min Transcriptome assembly & annotation: velvet/oases, TrinityRNAseq; BLAST, GOminer, (ELI?)
Day 4: Assembly
• 90 min de novo assembly: E. coli – velvet (Aaron), mira (ref-guided, Aaron), Allpaths(-LG) (Scott – optional), mention: abyss, SOAPdenovo
• 30 min Finding and annotating genes – maker, glimmer; web tools: JCVI, NCBI, psi-blast & CDD; pfam/rfam (Scott & Jeff)
• 45 min Evaluating & visualizing assemblies – Comparing: treat assembler output as a reference genome and proceed with prior tools – challenges: contigs, errors; Visualizing: mauve, circos (may need install help) (Scott); cgview (Jeff).
• 30 min Genome databases: Introduction to GMOD and/or SequenceServer (can we stand up web servers on the class computers?) (Scott)
Notes from 5/17/12:
Conventions decided on - expands for hints, formats for command prompt/code
Aaron to write .sge maker script
All qsub jobs will run "./commands"
All examples have to have a "commands" file.
AB/DC to tacc-ify scripts; SPHS to put up chr20 FASTQs, BAMs, and VCFs for example.
Append GATK to diploid calling.
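As a rough illustration of the ".sge maker" idea and the "./commands" convention noted above, the sketch below writes a minimal Grid Engine job script whose only payload is `./commands`. The `make_sge_script` helper, the job name, the `development` queue, and the runtime are assumptions for the example, not Aaron's script; site-specific directives that a cluster may require (parallel environment, allocation/project) are left out.

```python
def make_sge_script(job_name, queue, runtime, commands_file="./commands"):
    """Return the text of a minimal SGE job script that runs commands_file.

    Only generic Grid Engine directives are used; a real cluster will
    likely need additional site-specific ones.
    """
    lines = [
        "#!/bin/bash",
        f"#$ -N {job_name}",              # job name
        "#$ -cwd",                        # run from the submission directory
        "#$ -V",                          # export the current environment
        "#$ -j y",                        # merge stderr into stdout
        f"#$ -o {job_name}.o$JOB_ID",     # output file, one per job
        f"#$ -q {queue}",                 # queue name (assumed)
        f"#$ -l h_rt={runtime}",          # wall-clock limit, hh:mm:ss
        "",
        commands_file,                    # the example's commands file
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values; write the script next to the commands file.
    with open("launcher.sge", "w") as handle:
        handle.write(make_sge_script("bwa_test", "development", "01:00:00"))
```

The generated file would then be submitted with `qsub launcher.sge` from the directory that holds an executable `commands` file.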
Day 1 followup
Daechan original shell scripting page