Day 1: Unix/TACC Introduction and Read Mapping
- Linux refresher
- Using TACC
- Introduction to mapping
- Variant calling tools
- Using the Integrative Genomics Viewer (IGV)
- Shell Scripting
Day 2: Calling Genome Variants
- Mapped read data evaluation (SAMtools)
- Installing and compiling tools on Unix
- Integrated pipeline for microbial genome re-sequencing analysis (breseq)
- Calling variants in mixed populations
- Calling variants in diploid or multiploid genomes
- Practical advice: pitfalls to watch for in short-read re-sequencing data
- UCSC genome browser, SRA data downloads - 30 min - Anna
- Variant calling with GATK (use their wiki); more detail on the .vcf format; look at human 1000 Genomes VCF files and describe how to access 1000 Genomes data (Anna, Scott)
- Comparing variants across samples: bedtools/annovar/snpeff/plink/vaast/qiime; shell/perl/python scripting; candidates for recessive disease (Dhivya & Scott)
Day 3: RNA-seq
- Differential gene expression analysis
- Splice variant analysis
- Non-coding RNA analysis: unique mapping (shrimp/grep), miRNA abundance/editing; other: snoRNA, snRNA, lincRNA, piRNA, tRNA, degradome, etc. (not poly-A; not annotated)
- Transcriptome assembly & annotation: velvet/oases, TrinityRNAseq; BLAST, GOminer, (ELI?)
Day 4: Assembly and Annotation
Resources
- Tool list, file formats & more
- Scott's list of linux one-liners
- Exercises
Misc
This is code
This is the link to the gsaf web site
Rough Draft
All notes/lessons/etc. under bioiteam wiki – next meeting next Thursday May 16th, 11 am.
Survey ahead of time if possible – Aaron to create draft, run by S&J – learn nano – send seqanswers “just got your data” post.
Setup stuff on local machines: IGV; BAM/BAI files from all mappings; R; LAMP stack?? GMOD??
Data pre-loaded on corral:
a) bacterial genome specified by Geoff & BAM files from Day 1
b) 1000 Genomes data (Anna)
c) sample exome (Scott)
Everyone to update Wiki
Agenda:
Day 1: Beginnings
• 45 min Linux refresher – SPHS handout & ppt – autocomplete/pipes/ssh/wget (genome up front – download data from Corral)
• 20 min Introduction to mapping – launchers at TACC; mapping by 3-4 different mappers – BWA/bowtie/shrimp/SSAHA2 (bfast?) – Geoff – must have outputs ready.
• 30 min Using TACC – directory structure/module+spider/scp/bbcp/group sharing – Aaron
• 30 min Run variant caller (samtools) on login node at TACC, scp, view output in IGV (requires bacterial genome & GFF pre-installed for the reference) – Geoff
<break>
• 15 min Input and output file formats – handout: FASTQ/BAM – Anna; technology-specific output – Scott
• 15 min Building a reference - Anna
• 60 min ADVANCED session: EXERCISE: shell scripting of mapper & variant caller: given ref & reads, will produce mutations; Daechan
Day 2: Mapping & Variants
• 30 min Mapped data evaluation with samtools
• 30 min Installing/compiling tools (aside from “module”): JB’s tool – Google Code – Jeff
• 45 min View output & compare: false pos/neg comparisons; what’s hard & weird, in IGV (GVF file/VCF file/BAMs) – Jeff. Geoff will check vcftools and also assess freebayes. @GRC: bedtools will be needed to compare.
<break>
• 30 min UCSC genome browser, GEO/SRA data downloads Anna
• 30 min Variant calling with GATK (use their wiki); more detail on the .vcf format; look at human 1000 Genomes VCF files and describe how to access 1000 Genomes data – Anna, Scott
• 60 min Characterizing & comparing variant files – annovar/snpeff/plink/vaast/qiime; shell/perl/python scripting; candidates for recessive disease – Dhivya & Scott
• ADVANCED: SAM files: parsing, picard tools (read groups; validation is important, check return codes), flagstat, filter (e.g. only use properly paired reads), insert size distribution, mapping %, mapping bias by read or by genome location (e.g. on) – Anna & Scott (BED tools); calling variants in mixed populations (freebayes); ChIP-seq analysis???
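Possible handout snippet for the "only use properly paired reads" filter in the ADVANCED item: the check comes down to testing bit 0x2 of the SAM FLAG field (column 2). This is a rough sketch, assuming plain-text SAM lines as input; the function names and the example records are invented for illustration. In practice `samtools view -f 2` applies the same test.

```python
PROPER_PAIR = 0x2  # SAM FLAG bit: read mapped in a proper pair


def is_properly_paired(sam_line):
    """Return True if the alignment record has the proper-pair bit set."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2 of a SAM record is the bitwise FLAG
    return bool(flag & PROPER_PAIR)


def filter_proper_pairs(lines):
    """Yield header lines unchanged and only properly paired alignments."""
    for line in lines:
        if line.startswith("@"):      # header lines always pass through
            yield line
        elif is_properly_paired(line):
            yield line


# Invented example records (not real data): FLAG 99 and 147 have bit 0x2 set,
# FLAG 73 (paired, mate unmapped) does not.
example = [
    "@HD\tVN:1.0\tSO:unsorted\n",
    "r1\t99\tchr\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\n",
    "r1\t147\tchr\t200\t60\t4M\t=\t100\t-104\tACGT\tIIII\n",
    "r2\t73\tchr\t300\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
]
kept = list(filter_proper_pairs(example))
```

Here `kept` holds the header plus the two `r1` records; the `r2` record is dropped.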
Day 3: RNA-seq (Scott)
• 60 min Quantitation & statistics: map & count (Jeff); normalization; tophat/cufflinks(cuffmerge)/cuffdiff – human & E. coli. Maybe a digression into R?
• 30 min Splice variant analysis: continue from tophat
<break>
• 60 min Non-coding RNA analysis: unique mapping (shrimp/grep), miRNA abundance/editing; other: snoRNA, snRNA, lincRNA, piRNA, tRNA, degradome, etc. (not poly-A; not annotated)
• 60 min Transcriptome assembly & annotation: velvet/oases, TrinityRNAseq; BLAST, GOminer, (ELI?)
Day 4: Assembly
• 90 min De novo assembly of E. coli: velvet (Aaron), mira (ref-guided, Aaron), Allpaths(-LG) (Scott, optional); mention: abyss, SOAPdenovo
• 30 min Finding and annotating genes – maker, glimmer; web tools: JCVI, NCBI, psi-blast & CDD; pfam/rfam (Scott & Jeff)
• 45 min Evaluating & visualizing assemblies – Comparing: treat assembler output as a reference genome and proceed with prior tools – challenges: contigs, errors; Visualizing: mauve, circos (may need install help) (Scott); cgview (Jeff).
• 30 min Genome databases: Introduction to GMOD and/or SequenceServer (can we stand up web servers on the class computers?) (Scott)
Notes from 5/17/12:
Conventions decided on: expands for hints, formats for command prompt/code
Aaron to write .sge maker script
All qsubs will run "./commands"
All examples have to have a "commands" file.
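Starting point for the .sge maker script: a rough sketch, assuming a standard SGE setup where the job script carries `#$` directives and simply runs "./commands" from the example directory. The function name `make_sge_script` and the queue, parallel-environment, and runtime values are placeholders, not settled choices; they should be checked against the TACC user guide before use.

```python
def make_sge_script(job_name, queue, pe, runtime, commands_file="./commands"):
    """Return the text of an SGE job script that runs the commands file.

    queue, pe, and runtime are passed straight into the -q, -pe, and
    -l h_rt directives; their values are site-specific assumptions.
    """
    lines = [
        "#!/bin/bash",
        f"#$ -N {job_name}",            # job name
        "#$ -cwd",                      # run from the submission directory
        "#$ -V",                        # export the current environment
        "#$ -j y",                      # merge stderr into stdout
        f"#$ -o {job_name}.o$JOB_ID",   # output file; SGE expands $JOB_ID
        f"#$ -q {queue}",               # queue name (placeholder value)
        f"#$ -pe {pe}",                 # parallel environment (placeholder value)
        f"#$ -l h_rt={runtime}",        # wall-clock limit, hh:mm:ss
        "",
        commands_file,                  # per the convention: run ./commands
    ]
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
script_text = make_sge_script("mapping_example", "development", "12way 12", "01:00:00")
```

The result can be written to a file (for example `launcher.sge`, a hypothetical name) and submitted with `qsub launcher.sge`. Because the script calls "./commands" directly, the commands file needs to be executable (`chmod +x commands`).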