
Your Instructors

Most of us are members (or alumni) of the functional genomics lab of Vishwanath Iyer, UT Austin.

Anna Battenhouse, abattenhouse@utexas.edu,
Associate Research Scientist, Iyer Lab; Biomedical Research Computing Facility Manager and Marcotte lab staff
- BA English literature, 1978
- Commercial software development 1982 – 2007
- Joined Iyer Lab 2007 (“retirement career”)
- BS Biochemistry, UT Austin, 2013
Amelia Weber Hall, Graduate Student, Iyer Lab, ameliahall@utexas.edu
- 5th year Microbiology graduate student
- Laboratory Technician at UT 2007-2010
- BS Molecular Genetics, 2007
Nathan Abell, Research Assistant, Xhemalce Lab, abell.nathan@gmail.com
- Undergraduate researcher in Iyer Lab 2011-2013
- BS Molecular Biology, UT, 2013
- Research Assistant
Dakota Derryberry, Graduate Student, Wilke Lab, dakotaz@utexas.edu
- ???

...

- Joined the Biomedical Research Computing Facility (BRCF) and Marcotte Lab 2017
- Also affiliated with
  - Bioinformatics Consulting Group (BCG)
  - Genome Sequencing and Analysis Facility (GSAF)
  - Cryo Electron Microscopy core facility (CryoEM)
  - Edward Marcotte lab
Matt Bramble, matthew.bramble@austin.utexas.edu,
Associate Research Scientist, Bioinformatics Consulting Group
- Master’s degrees from UT Austin in Molecular Biology and Statistics
- 10 years of experience with R and Python
- Recently joined the CBRS Bioinformatics Consulting Group after six years at MD Anderson Cancer Center analyzing a wide range of NGS epigenomics data
- Areas of expertise include: Hi-C (chromatin conformation) analysis, mouse somatic variant analysis, and single cell RNAseq analysis

About the Iyer Lab (where Anna learned NGS)

http://iyerlab.org/

Dr. Vishy Iyer, PI

Main focus is functional genomics

- large-scale

...

- transcriptional reprogramming
  in response to diverse stimuli
- ENCODE consortium collaborator

...

works in human and yeast
Research methods include microarrays (Dr. Iyer was co-inventor)
high-throughput sequencing (since 2007) especially ChIP-seq, RNA-seq also

...

- miRNA-seq, RIP-seq, MNase-seq ...

...

- >2,000 NGS datasets

Communication

Post its

Green post-it – I'm good at the moment.

Pink post-it – I need a bit of help.

Asking questions

Feel free to ask questions any time during the instructor's lecture and demonstrations.

For online attendees, you can also post your question to the Zoom chat. We'll sometimes use breakout rooms when troubleshooting problems you run into. If so, TA Matt Bramble will assign you to one.

Getting help

Since most folks are new to the Linux command line, we expect you to run into problems! Please let us know if you're having difficulties!

Making mistakes and running into problems is key to learning the Linux command line! It is not only expected – it is encouraged.

Conventions

If you see a block of text like this:

Code Block

language	bash
title	Example code block

ls -h

it means, "type the command ls -h into a terminal window, hit Enter, and see what happens".

We intend this course to offer as much self-learning as possible. Consequently, you'll find many sections like this - click on the triangle to expand them:

Expand

Hint

title	Hint...

Hint sections will provide you some guidance on what to do next, but will not spell it out.

and some sections like this:

Expand

Solution

title	Solution...

Solution sections will contain the commands so that you can copy-and-paste them if you have to. They should be exactly accurate, but they represent only one method of answering the question; there are often many ways to skin a cat!

Goals and challenges

Course goals

Hands-on, tutorial style – learn by doing
- Common bioinformatics tools & file formats
Introduce NGS vocabulary
- both high-level view and practice with specific tools
Cover the NGS tool basics
- the first few things you'll do after receiving raw sequences
  - raw sequence QC and preparation
  - alignment to reference
  - basic alignment analysis
Understand and practice required skills
- Get you comfortable with Linux and TACC – your best "frenemies"
- Make you self-sufficient enough in 4 - 5 days to become experts over time
- Show some "best practices" for working with NGS data

NGS Challenges

Diverse skill set requirements

Analysis – making sense of raw data
- one part bioinformatics and statistics
- one part scripting / programming
  - Linux command line
  - High Performance Computing (TACC)
  - bash scripting (grep, awk)
  - R, python, perl
Management – making order out of chaos
- one part organization
- one part data wrangling
Adoption of best practices is critical!

Large and growing datasets

NGS methods produce staggering amounts of data!

Typical dataset these days

yeast: 5 – 20 million reads
human: 20 – 250 million reads (~5 - 8 million for TagSeq)
single end (SE) or paired end (PE), length 50 – 300 bases (100 or 150 typical)

The initial FASTQ files are big (100s of MB to GB) – and they're just the start.

Organization and naming conventions are critical.
Your data can get out of hand very quickly!

...

Progression of Iyer Lab datasets over time:

2008 – Yeast heat shock remodeling of chromatin
- 2 yeast datasets
- less than 2 million sequences
2010 – Allelic bias in CTCF binding
- 13 CTCF datasets from 3 GM cell lines
- ~200 million sequences
2012 – Transcription factor data analysis (ENCODE2)
- 32 ChIP-seq datasets gathered over 3 years (3 TFs across 11 cell lines)
- ~ 1 billion sequences
2013 – miRNA overexpression effects
- 42 RNAseq datasets (7 conditions)
- ~ 2.6 billion sequences
2014 – eQTL analysis of CTCF binding
- 52 very deeply sequenced CTCF datasets
- ~ 8 billion sequences
2018 – Functional analysis of glioblastoma tumors and cell lines
- > 300 datasets so far
- > 17 billion reads

Data wrangling best practices summary

keep fastq files compressed

Most sequencing facilities will give you compressed sequencing data files
- gzip format (.gz extension) for individual files
- tar or zip format for directories of files
Even with compression it's easy to run out of storage space!

You may be tempted to un-compress your sequencing files to manipulate them more directly

resist the temptation to gunzip!
nearly all modern bioinformatics tools are able to work on .gz files
there are techniques for working with compressed files without ever un-compressing them
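
As a minimal sketch of the last point, the commands below look at and count the reads in a gzipped FASTQ file without ever writing an un-compressed copy to disk. The file name sample.fastq.gz, its two made-up reads, and the helper name count_reads are assumptions for illustration only. gzip -dc streams the un-compressed text to standard output, so it can be piped into other tools.

```bash
# Create a tiny example file (hypothetical data) so the commands can be tried anywhere.
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip > sample.fastq.gz

# Look at the first read (4 lines) without un-compressing the file on disk.
gzip -dc sample.fastq.gz | head -4

# Count reads: each FASTQ record is 4 lines, so divide the line count by 4.
# A non-integer result suggests a truncated or malformed file.
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}
count_reads sample.fastq.gz
```

The compressed file stays untouched throughout; only the stream passing through the pipe is un-compressed.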

arrange adequate storage space

Obtain an allocation on TACC's corral disk array (initial 5 TB are no-cost)
Stage your active projects on corral
- copy data to $WORK or $SCRATCH for analysis
- copy important analysis products back to corral
Periodically back up corral directories to ranch tape archive
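
The staging cycle above could be sketched roughly as follows. Everything named here is an assumption: CORRAL_DIR stands in for wherever your corral allocation is mounted, my_project is a made-up project name, and the "analysis" is faked with a placeholder file. Temporary directories are used when CORRAL_DIR or SCRATCH are not already set, so the sketch can be tried off TACC.

```bash
# Hypothetical locations: on TACC, CORRAL_DIR would be your project's corral
# directory and $SCRATCH is set for you; temp directories stand in for both here.
CORRAL_DIR=${CORRAL_DIR:-$(mktemp -d)}
SCRATCH=${SCRATCH:-$(mktemp -d)}

# Pretend the project already has raw data staged on corral (placeholder file).
mkdir -p "$CORRAL_DIR/my_project/fastq"
echo "placeholder" > "$CORRAL_DIR/my_project/fastq/sample.fastq.gz"

# 1. Copy the data to scratch for analysis (-a preserves timestamps and permissions).
mkdir -p "$SCRATCH/my_project"
cp -a "$CORRAL_DIR/my_project/fastq" "$SCRATCH/my_project/"

# 2. Run the analysis on scratch; here a result file is simply faked.
mkdir -p "$SCRATCH/my_project/results"
echo "alignment summary" > "$SCRATCH/my_project/results/summary.txt"

# 3. Copy the important analysis products back to corral.
cp -a "$SCRATCH/my_project/results" "$CORRAL_DIR/my_project/"
ls "$CORRAL_DIR/my_project"
```

For large directories a resumable copy tool may be a better choice than cp; the point here is only the direction of each copy.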

backup analysis artifacts regularly

Obtain an allocation on TACC's ranch tape archive system
- 10 TB is a good initial number
- free! and under-utilized
Periodically back up your corral directories to ranch tape archive
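
One possible way to prepare such a backup is to bundle a project directory into a single dated archive before transferring it. This is a sketch under assumptions, not a prescribed procedure: the directory name my_project and the archive naming scheme are made up, and the transfer step is left as a commented placeholder because the host and destination path depend on your allocation.

```bash
# Hypothetical project directory standing in for a corral directory.
PROJECT_DIR=$(mktemp -d)/my_project
mkdir -p "$PROJECT_DIR/results"
echo "alignment summary" > "$PROJECT_DIR/results/summary.txt"

# Bundle the directory into one dated, compressed archive next to it.
ARCHIVE="$(dirname "$PROJECT_DIR")/my_project.$(date +%Y-%m-%d).tar.gz"
tar -czf "$ARCHIVE" -C "$(dirname "$PROJECT_DIR")" my_project

# Check that the archive is readable and list what it contains.
tar -tzf "$ARCHIVE"

# The transfer itself depends on your allocation, e.g. (placeholders, not run here):
#   scp "$ARCHIVE" <username>@<archive-host>:<your-archive-directory>/
```

A single archive per project and date is easier to track on an archive system than many small files, and the listing step catches a corrupt archive before the originals are ever deleted.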

distinguish between types of data

Artifacts from different stages of the analysis will have different archival requirements.

Original sequence data (fastq files)
- must be backed up!
Alignments
- usually larger than original fastqs
- should be backed up once stable
Peak calling artifacts
Downstream analysis artifacts

While a project is active you will want to keep more intermediate artifacts for reference. Many of these can be deleted after publication.

track your analysis steps

Your analyses should be reproducible by others so you need to keep the equivalent of a lab notebook to document your protocols.

Keep "work files" that detail analysis steps performed
- here's an /wiki/spaces/CcbbShortChipSeq/pages/52826834
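
A lightweight way to start such a work file is to record each command as you run it. The sketch below is only one possibility: the helper name log_run and the file name analysis_notes.txt are assumptions, not part of any tool.

```bash
# Hypothetical notes file; any plain-text file kept alongside the analysis works.
WORKLOG=${WORKLOG:-./analysis_notes.txt}

# Append the date, working directory, and command line to the notes file,
# then run the command itself.
log_run() {
    printf '%s\t%s\t%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$PWD" "$*" >> "$WORKLOG"
    "$@"
}

log_run ls -h
cat "$WORKLOG"
```

The helper only sees a single simple command, so pipes and redirections still need to be written into the notes by hand, along with a sentence on why each step was done.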

- nearly 500 datasets in total (ChIP-seq, RNAseq, miRNAseq, 4C, exome/genome sequencing)
- > 22 billion sequences

Version	Old Version 12	New Version Current
Changes made by	Anna Battenhouse	Anna Battenhouse
Saved on	May 21, 2015	Jun 01, 2025
