Content Comparison

...

(Why are we using the list of differentially expressed genes produced by DESeq and not tophatcuffdiff?)

Introduction

goseq is an R package that provides functions to look for enriched gene ontology terms in (GO) in our differentially expressed genes.

...

cellular component, describes where in a cell a gene acts, what cellular unit the gene is part of
molecular function, describes the function carried out by the gene, such as binding or catalysis;
biological process, a set of molecular functions, with a defined beginning and end, makes up a biological process. This describes biological phenomenon like DNA replication.

All GO terms have an ID that looks like GO:0006260.

All GO terms have alist of genes that belong to that particular term.

WHAT IS GO ENRICHMENT?

For GO enrichment, we take the following things into account:

A. Total number of genes we are looking at.
B. Number of genes of interest, that is, in our DEG list.
C. Total number of genes in the GO term
D. Number of genes from our genes of interest that are also in the GO term.

If the number of genes from our list that belong to GO term GO:0001 (D) is significant compared to the total number of genes in that GO term (C) and the total number of genes in our experiment (A), we consider that GO term to be enriched in our data.

WHAT DOES THIS MEAN TO US?

...

We also reduce the dataset considerably- from large number of genes to a smaller number of functions/processes.

We go from up and downregulated genes between two conditions to up and down regulated processes between two conditions.

INPUT TO GOSEQ

A list of all genes tested.
A list of just the genes of interest, in this case, significantly differentially expressed genes.

...

It needs information about our genome, particularly length of genes. This is available in a particular bioconductor package for many model genomes.
If your genome is not one of the genomes supported by that package, you can attempt to create the requeired required files about your genome using commands mentioned in the goseq manual.
goseq can also be used to identify enriched pathways, like KEGG pathways

...

Code Block

source("http://bioconductor.org/biocLite.R")
biocLite("goseq")

#package to pull out annotated information about our genome and genes  

biocLite("geneLenDataBase")   				

#package to load the GO terms specific to drosophilia
biocLite("org.Dm.eg.db")

...

Code Block
class(DEG) DEG.vector <- c(t(DEG)) ALL.vector<-c(t(ALL))

Data frame:

FBg..
FBg...
FBg...

Transpose:

FBg...	FBg...	FBg...

Vector:

FBg..., FBg..., FBg...

F) Construct a new vector that adds a 0 next to every gene that is not in our DEG list and a 1 next to every gene that is in our DEG list.

Code Block
gene.vector=as.integer(ALL.vector%in%DEG.vector) names(gene.vector)=ALL.vector #lets explore this new vector a bit head(gene.vector) tail(gene.vector)

...

Code Block
GO.wall=goseq(pwf,"dm3","ensGene") #How many enriched GO terms do we have headclass(GO.wall) classhead(GO.wall) nrow(GO.wall)

G) Find only enriched GO terms that are statistically significant

Code Block

enriched.GO=GO.wall$category[GO.wall$over_represented_pvalue<.05]
#NOTE: They recommend using a more stringent multiple testing corrected p value here
 
#How many GO terms do we have now?
headclass(enriched.GO)
classhead(enriched.GO)
length(enriched.GO)

...

Code Block
library(GO.db) capture.output(for(go in enriched.GO[1:10258]) { print(GOTERM[[go]]) cat("--------------------------------------\n") } , file="SigGOSigGo.txt")

I) OUTPUT

Code Block
less SigGo.txt

...

Version	Old Version 4	New Version 5
Changes made by	Dhivya Arasappan	Dhivya Arasappan
Saved on	May 21, 2014	May 22, 2014

Versions Compared

Key

Introduction