In this lab, we'll look at how to use an R package called goseq to identify enriched gene ontology (GO) terms. For this analysis, we'll be using the differential analysis results we generated using DESeq.
(Why are we using the list of differentially expressed genes produced by DESeq and not tophat?)
goseq is an R package that provides functions to look for enriched gene ontology terms in our differentially expressed genes.
WHAT ARE GO TERMS?
GO terms provide a standardized vocabulary to describe genes and gene products from different species. GO terms allow us to assign functionality to genes. The following properties are described for gene products:
WHAT IS GO ENRICHMENT?
WHAT DOES THIS MEAN TO US?
Many genes may be changing, but they may all be linked to similar biological processes. From a list of changing genes -> list of affected biologial processes. We can better elucidate the biological events that are represented by our differential gene finding.
We also reduce the dataset considerably- from large number of genes to a smaller number of functions/processes.
INPUT TO GOSEQ
OTHER NOTES ABOUT GOSEQ
What are your other options?
If you are trying to find enriched GO terms using a list of novel transcripts, BLAST2GO, AmiGO
Web services like DAVID
Get set up
Make sure you are in the right location cd $SCRATCH/my_rnaseq_course/diff_results |
A) First load R and enter R environment
#module load R will not work for this R package because its needs the latest version of R. So, we'll load a local version. $BI/bin/R |
B) Within R environment, install package cummeRbund
source("http://bioconductor.org/biocLite.R")
biocLite("goseq")
#package to pull out annotated information about our genome and genes
biocLite("geneLenDataBase")
#package to load the GO terms specific to drosophilia
biocLite("org.Dm.eg.db") |
C) While that is installing, let's look at how your input files need to look.
FROM DESeq output:
"","id","baseMean","baseMeanA","baseMeanB","foldChange","log2FoldChange","pval","padj"
"131","FBgn0000370",7637.91654540105,4217.77033402576,11058.0627567763,2.62177925326286,1.39054621964443,1.2887282997047e-116,7.22484613473489e-113
"2489","FBgn0025682",6038.35042952997,3300.21617337019,8776.48468568974,2.65936660649935,1.41108267336748,1.36704751839828e-116,7.22484613473489e-113
......
INPUT FILE 1: DEG (contains the 84 genes that meet our fold change an p value cut offs)
FBgn0000370
FBgn0025682
FBgn0086904
.....
INPUT FILE 2: ALL (contains all14869 genes)
FBgn0000370
FBgn0025682
FBgn0086904
.....
C) Load all the libraries.
library("goseq")
library("geneLenDataBase")
library("org.Dm.eg.db") |
D) Read in our two input files
DEG<-read.table("DEG", header = FALSE)
ALL<-read.table("ALL", header = FALSE) |
E) Convert the DEG and ALL data objects to vectors
class(DEG) DEG.vector <- c(t(DEG)) ALL.vector<-c(t(ALL)) |
F) Construct a new vector that adds a 0 next to every gene that is not in our DEG list and a 1 next to every gene that is in our DEG list.
gene.vector=as.integer(ALL.vector%in%DEG.vector) #lets explore this new vector a bit head(gene.vector) tail(gene.vector) |
G) Weigh the gene vector by lengths of our genes. This length information is pulled from a different package and we need to specifying the exact name of our genome and gene IDs in that package.
pwf=nullp(gene.vector,"dm3","ensGene") |
F) Find enriched GO terms
GO.wall=goseq(pwf,"dm3","ensGene") #How many enriched GO terms do we have head(GO.wall) class(GO.wall) nrow(GO.wall) |
G) Find only enriched GO terms that are statistically significant
enriched.GO=GO.wall$category[GO.wall$over_represented_pvalue<.05] #NOTE: They recommend using a more stringent multiple testing corrected p value here #How many GO terms do we have now? head(enriched.GO) class(enriched.GO) length(enriched.GO) |
H) But these are just IDs - Get more biologically meaningful results
library(GO.db)
capture.output(for(go in enriched.GO[1:10]){print(GOTERM[[go]])
cat("--------------------------------------\n")} file="SigGO.txt") |
I) OUTPUT
less SigGo.txt |
GOID: GO:0030336
Term: negative regulation of cell migration
Ontology: BP
Definition: Any process that stops, prevents, or reduces the frequency,
rate or extent of cell migration.
Synonym: down regulation of cell migration
Synonym: down-regulation of cell migration
Synonym: downregulation of cell migration
Synonym: inhibition of cell migration
--------------------------------------
GOID: GO:0040013
Term: negative regulation of locomotion
Ontology: BP
Definition: Any process that stops, prevents, or reduces the frequency,
rate or extent of locomotion of a cell or organism.
Synonym: down regulation of locomotion
Synonym: down-regulation of locomotion
Synonym: downregulation of locomotion
Synonym: inhibition of locomotion
--------------------------------------