Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 4.0

Introduction 

The purpose of this guide is to aid current and future Whiteley Lab members and University of Texas microbiologists with bacterial RNA?Seq analysis. Once you have analyzed your data with this pipeline, you will have files identifying differentially expressed genes and files that can be used to identify novel non-coding RNAs, transcriptional start sites, and operons. Throughout this guide I will provide hyperlinks to valuable resources and programs that will help get your analysis off the ground and running.

...

This pipeline is set up to analyze RNA-Seq libraries prepared with the NEBNext Multiplex Small RNA Library Prep Set for Illumina (NEB #E7300S) and is currently limited to Pseudomonas aeruginosa PAO1, P. aeruginosa PA14, Aggregatibacter actinomycetemcomitans D7S-1, Streptococcus gordonii Challis CH1, and Escherichia coli K12 W3110. The pipeline can be easily modified in the future to accommodate new library preparation methods and new bacterial strains. Parts of the pipeline that can be updated will be annotated in this guide.

Methods Overview

A basic RNA-Seq experiment will have two experimental conditions (i.e. treated and untreated, wild type and mutant, control and test, etc.), with two biological replicates for each experimental condition. Obviously, increasing the number of replicates will increase your statistical significance. To begin analysis, you will need your fastq sequencing read files for all of your conditions and replicates. The files are processed by a series of scripts that I have written. These scripts produce files containing the mapped sequencing reads, differentially expressed gene tables, and several graphs from the differential gene expression analyses.

  1. mapRNASeq.sh - This script reads in a fastq file, producing a sorted bam file and indexed bam file that can be read into the Integrative Genomics Viewer and a counts file that contains the number of reads mapping to each gene in your genome.
  2. calcRNASeq.sh - This script reads in your individual counts files, joins them into a table, produces another table with the differential expression of all of your genes, and several graphs that summarize these results.

Getting Started 

Obtain fourierseq and lonestar accounts

...

Code Block
     # Change to your home directory
     cdh

     # Change to your work directory
     cdw

     # Change to your scratch directory
     cds

The Analysis

Download your sequencing data and upload to lonestar

The fastest way to download your files from fourierseq to lonestar is by using the UNIX program scp. This allows you to copy files from a secure location (i.e. fourierseq) to your current workspace (i.e. lonestar). Here is how I would transfer my files from fourierseq.

...

There are several ways to transfer your fastq files from fourierseq to lonestar for your analysis. An alternate easier (but much slower) way to transfer files is to use a program with a graphical user interface to log into fourierseq, transfer your files to your personal computer, and then upload the files to lonestar for analysis. On a mac you can do this with a program called Cyberduck (Cyberduck download); however, FileZilla (Download FileZilla) also works very well and is available for PC. With either of these programs, set up an SFTP (Secure File Transfer Protocol) connection to your account at lonestar.tacc.utexas.edu, and then you can drag and drop files from your computer to the server, or from the server to your computer, just like you would normally in your computer’s file system. Remember you will need to load your files into your scratch directory on lonestar for your analyses.
 

mapRNASeq.sh: mapping your sequencing reads

Basic idea:

USAGE: mapRNASeq.sh in_file out_pfx assembly threads(x)

...

            A. actinomycetemcomitans D7S-1                   AAD7S
            E. coli K12 W3110                                          ECK12W3110
            P. aeruginosa PAO1                                       PAO1
            P. aeruginosa PA14                                        PA14
            S. gordonii Challis CH1                                   SGCH1

How to run it:

To run the script properly on lonestar you have to run the script from a commands file in the directory containing your fastq files. So create a text file called commands containing the following text. This can be done in the Unix terminal using nano.

Code Block
# Open nano and save your file as “commands”. Files are saved in nano with CTRL+o
#    and you can exit nano with CTRL+x. In the example below we have a control
#    condition and a test condition with two replicates each. We are mapping to the
#    PAO1 genome and using 12 processing threads (on lonestar always use 12).
nano


            mapRNASeq.sh Control1.fastq Control1 PAO1 12
            mapRNASeq.sh Control2.fastq Control2 PAO1 12
            mapRNASeq.sh Test1.fastq Test1 PAO1 12
            mapRNASeq.sh Test2.fastq Test2 PAO1 12

...

Test2.trim.fastq
Test2.sam
Test2.sorted.bam
Test2.sorted.bam.bai
Test2.count.txt
Test2.log.txt

calcRNASeq.sh: joining your count files and calculating differential expression

Basic idea:

USAGE: calcRNASeq.sh -o OUT_PFX -c CONTROL_PFX -x [#] -t TEST_PFX -y [#] <file1> <file2> <file3> … <filen>

-o OUT_PFX                substitute “OUT_PFX” with the prefix you want for all of your output file
-c CONTROL_PFX       substitute “CONTROL_PFX” with the prefix for your control condition
-x [#]                            substitute “[#]” with the number of control condition replicates
-t TEST_PFX                substitute “TEST_PFX” with the prefix for your test condition
-y [#]                            substitute “[#]” with the number of test condition replicates
<file1>… <filen>          list your count files, with control conditions listed first followed by test condition files

How to run it:

This is a little simpler to run than the mapRNASeq.sh. You don’t need to create a commands file. You can run it directly from the command line, because it does not require as much computational power. Make sure you use the flags (i.e. “-o”, “-c”, etc).

Code Block
# An example, continuing from mapRNASeq.sh above
calcRNASeq.sh -o Exp1 -c Control -x 2 -t test -y 2 Control1.count.txt Control2.count.txt Test1.count.txt Test2.count.txt

This script takes any number of count files and joins them together in one count table, where the first column is the locus tags for the genome and the following columns contain the read counts for each locus for each condition. Then, once the table is created in the proper format it determines differential expression using the R package DESeq. This takes your joined count file with all of your conditions and replicates, normalizes the total counts for each condition/replicate, and calculates differential gene expression using Fisher’s exact test and a negative binomial distribution.

What do I get out of this?

...

Exp1_dispEsts.png
Exp1_DEplot.png
Exp1_pvals.png

Post-pipeline analyses

Differential expression analysis

...

The first graph shows how well your variance estimate models your data, the second summarizes the differential expression data, and the third graph is a histogram of the distribution of p-values for the differential expression of all of the genes. The graphs are most informative or general trends in the data. 

Identifying novel non-coding RNAs

Finally, if you are interested in looking for novel non-coding RNAs that may not be annotated in your genome, you can download your .sorted.bam and .sorted.bam.bai files to view with the Integrative Genomics Viewer (IGV webpage). You will need the gff annotation and fasta sequence files for your genome that you mapped your reads to. Once these are loaded and your annotated reference genome is visible, you can open your bam file in IGV (make sure that the .sorted.bam.bai file is in the same directory, otherwise it will not work). You can do this for all of your bam files, and each condition will load as a separate “track” in IGV.

How it works

mapRNASeq.sh: the details

This script is a Unix shell script that processes a fastq file with three different programs. First it uses Flexbar (Flexbar webpage) to remove adapter sequence contamination from the read files. The trimmed fastq file is then mapped to the reference genome using bowtie2 (bowtie2 webpage). The file containing the mapped read information is subsequently processed with 1) SAMtools (SAMtools webpage) to prepare the reads to be viewed against the genome and 2) HT-Seq (HT-Seq webpage) to count the number of reads mapping to each gene in the genome. The following list gives the details for each step in the script.

...

(*These are the two files that need to be in the same folder to view the reads mapping to the genome with IGV) 

calcRNASeq.sh: the details 

After mapRNASeq.sh produces your .count.txt file, you need to combine that with your other replicates so that you can calculate differential gene expression using DESeq. 

...

Next, the shell script reads your joined count file as well as several parameters into an R script that runs DESeq (DESeq webpage). DESeq is a Bioconductor R package for RNA-Seq differential gene expression analysis. There are not many tricks to this script; it essentially goes through all of the steps in the vignette, and produces a number of useful tables and graphs that summarize your data.

How to update the pipeline

Adding new reference genomes 

So far I only have this set up on the server to run with P. aeruginosa _PAO1, _Aa _D7S-1, _Sg Challis CH1, P. aeruginosa PA14, and E. coli K12_ _W3110 . If other genomes are desired in the future the following directories and files need to be created (email me pjorth at gmail dot com, and I can help you!):

...

Code Block
# Change into the ref_genome directory
cd /home1/02173/pjorth/ref_genome/

# Make the new directory
mkdir NEWASSEMBLY 

Example:
/home1/02173/pjorth/ref_genome/PAO1

...

Note, sometimes you will want to count reads mapping to genes instead of CDS features, and in these cases you will change the text to FEATURE=”gene”.

Presentation from UT BYTE Club meeting 20 March 2013

This powerpoint goes over some of the basics of BacSeq, including generally how it works and what you will end up getting from the pipeline.

...