Introduction
Your Instructors
Anna Battenhouse, abattenhouse@utexas.edu,
Biomedical Research Computing Facility Manager, and Marcotte lab staff
BA English literature, 1978
Commercial software development 1982 – 2007
Joined Iyer Lab 2007 (“retirement career”)
BS Biochemistry, UT Austin, 2013
Joined the Biomedical Research Computing Facility (BRCF) and Marcotte Lab 2017
Also affiliated with
Matt Bramble, matthew.bramble@austin.utexas.edu,
Associate Research Scientist, Bioinformatics Consulting Group
Master’s degrees from UT Austin in Molecular Biology and Statistics
10 years of experience with R and Python
Recently joined the CBRS Bioinformatics Consulting Group after six years at MD Anderson Cancer Center analyzing a wide range of NGS epigenomics data
Areas of expertise include: Hi-C (chromatin conformation) analysis, mouse somatic variant analysis, and single cell RNAseq analysis
About the Iyer Lab (where Anna learned NGS)
Dr. Vishy Iyer, PI
Main focus is functional genomics
Research methods include
Communication
Asking questions
Feel free to ask questions any time during the instructor's lecture and demonstrations.
Online attendees can also post questions to the Zoom chat. We'll sometimes use breakout rooms to troubleshoot problems you run into; if so, TA Matt Bramble will assign you to one.
Getting help
Since most folks are new to the Linux command line, we expect you to run into problems! Please let us know if you're having difficulties!
Making mistakes and running into problems is key to learning the Linux command line! It is not only expected – it is encouraged.
Conventions
If you see a block of text like this:
Example code block
ls -h
it means: type the command ls -h into a terminal window, press Enter, and see what happens.
We intend this course to offer as much self-learning as possible. Consequently, you'll find many sections like this - click on the triangle to expand them:
and some sections like this:
Course goals
Hands-on, tutorial style – learn by doing
Common bioinformatics tools & file formats
Introduce NGS vocabulary
both high-level view and practice with specific tools
Cover the NGS basics
The first few things you'll do after receiving raw sequences
raw sequence QC and preparation
alignment to reference
basic alignment analysis
Understand and practice required skills
Get you comfortable with Linux and TACC – your best "frenemies"
Make you self-sufficient enough in 5 days to become experts over time
Show some "best practices" for working with NGS data
NGS Challenges
Diverse skill set requirements
Large and growing datasets
NGS methods produce staggering amounts of data!
Typical dataset these days
yeast: 5 – 20 million reads
human: 20 – 250 million reads (~5 - 8 million for TagSeq)
single end (SE) or paired end (PE), length 50 – 300 bases (100 or 150 typical)
The initial FASTQ files are big (100s of MB to GB) – and they're just the start.
Organization and naming conventions are critical.
Your data can get out of hand very quickly!
Progression of Iyer Lab datasets over time:
2008 – Yeast heat shock remodeling of chromatin
2 yeast datasets
less than 2 million sequences
2010 – Allelic bias in CTCF binding
13 CTCF datasets from 3 GM cell lines
~200 million sequences
2012 – Transcription factor data analysis (ENCODE2)
32 ChIP-seq datasets gathered over 3 years (3 TFs across 11 cell lines)
~ 1 billion sequences
2013 – miRNA overexpression effects
42 RNAseq datasets (7 conditions)
~ 2.6 billion sequences
2014 – eQTL analysis of CTCF binding
52 very deeply sequenced CTCF datasets
~ 8 billion sequences
2018 – Functional analysis of glioblastoma tumors and cell lines
nearly 500 datasets in total (ChIP-seq, RNAseq, miRNAseq, 4C, exome/genome sequencing)
> 22 billion sequences
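To connect the read counts above with the FASTQ file sizes mentioned earlier, here is a minimal sketch of the arithmetic. It relies on the fact that each FASTQ record occupies 4 lines, so the number of reads is the line count divided by 4. The temporary file, the variable names, and the 40-character header length are illustrative assumptions, not values from the course materials; real header lengths vary by sequencer and facility, and FASTQ files are usually stored compressed, so actual sizes on disk will be smaller.

```shell
#!/usr/bin/env bash
# Sketch: count reads in a FASTQ file and estimate uncompressed FASTQ size.
# File contents, variable names, and header_len are illustrative assumptions.

# Create a tiny two-read FASTQ file (4 lines per read:
# header, sequence, '+' separator, quality string).
fastq=$(mktemp)
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nTTGGCCAA\n+\nIIIIIIII\n' > "$fastq"

# Each FASTQ record is 4 lines, so reads = lines / 4.
lines=$(wc -l < "$fastq")
reads=$(( lines / 4 ))
echo "reads in sample file: $reads"

# Rough uncompressed size for a dataset of n_reads reads of read_len bases.
n_reads=20000000   # 20 million reads (within the ranges listed above)
read_len=100       # bases per read
header_len=40      # ASSUMED average header length in characters

# Per read: header + newline, sequence + newline, "+" + newline, qualities + newline.
bytes_per_read=$(( (header_len + 1) + (read_len + 1) + 2 + (read_len + 1) ))
est_bytes=$(( n_reads * bytes_per_read ))
est_mb=$(( est_bytes / 1000000 ))
echo "estimated uncompressed size: about ${est_mb} MB"

# Clean up the temporary file.
rm -f "$fastq"
```

For a real file, the same line-count idea applies; for a gzip-compressed FASTQ you would decompress to standard output first and pipe the result into wc -l.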