ggplot2

ggplot2

The objective of this section is to understand some basic principles of ggplot2.

ggplot2 is an R graphics package that simplifies the tasks involved in rapidly producing high quality visualizations. The power of this package is that plots are built in layers, and the commands are ‘modular’, which makes it possible to produce different types of visualizations, with small changes in code. This makes it easy to reuse parts of the code for very different figures.

 

Because it is part of the tidyverse collection, ggplot2 works best with data organized so that individual observations are in rows and variables are in columns. We will not go into how ggplot2 interacts with other tidyverse commands. We will be constructing a dataframe on our own.

Example Data

The example data we will use for plotting are from a bulk RNA-Seq experiment that is available in the Bioconductor package airway. In this experiment, the authors were comparing transcriptomic differences in primary human ASM cell lines treated with dexamthasone, a common therapy for asthma. Each cell line included a treated and untreated negative control resulting in a total sample size of 8.

Run the following to install packages, load libraries, and perform analysis with a differential expression package DESeq2. We are using DESeq2 as an example of a bioconductor package.

BiocManager::install('airway') library(airway) library(DESeq2) data("airway") DE_airway <- DESeqDataSet(airway, design = ~ cell + dex) dds <- DESeq(DE_airway) counts_norm = counts(dds, normalized=T) str(counts_norm) counts_norm = as.data.frame(counts(dds, normalized=T)) str(counts_norm)

 

To get the raw counts, we can use the following code, which is specific to the SummarizedExperiment object.

counts_raw <- as.data.frame(assays(airway)$counts)

Note what is returned if we do not convert to a dataframe:

counts_raw <- assays(airway)$counts typeof(counts_raw)

 

The count data is the total number of fragments in teh RNAseq library that mapped to a particular gene. We are going to construct a dataframe that will give us information for each sample about the number of genes that have counts != 0 in a sample and the total number of counts of all genes in the sample.

The function colSums() returns a vector containing the total column sum for each column.

num_genes_norm = colSums(counts_raw != 0) num_genes_raw = colSums(counts_raw != 0) total_counts_norm = colSums(counts_norm) total_counts_raw = colSums(counts_raw)

 

Also, we need more information about the samples. In the SummarizedExperiments object, that is commonly in a ‘slot' called colData. Let’s get that dataframe and pull it out of RStudio

col_data = as.data.frame(colData(airway))

 

Now to finalize the dataframe, add the count information to the col_data dataframe. We will just use the vector name as the column name, but you can use any other name, if you like.

col_data$num_genes_norm = num_genes_norm col_data$num_genes_raw = num_genes_raw col_data$total_counts_norm = total_counts_norm col_data$total_counts_raw = total_counts_raw

 

Plot the data

#Run this first (to plot the normalized data) ggplot(data=col_data) + geom_point(aes(x=num_genes_norm, y = total_counts_norm)) #Next run this (to plot the raw data) ggplot(data=col_data) + geom_point(aes(x=num_genes_raw, y = total_counts_raw))

 

With ggplot, plots are built up in ‘layers’ which are added using the '+' symbol. For example, to add trend line to the above plot, add another feature as follows:

ggplot(data=col_data, aes(x=num_genes_raw, y = total_counts_raw)) + geom_point() + geom_smooth(method=lm)

Types of main plots include the following:

  • scatter plots (geom_point()),

  • line plots (geom_line(),geom_path()),

  • bar plots (geom_bar(), geom_col()),

  • line modeled to fitted data (geom_smooth()),

  • heat maps (geom_tile()),

  • geographic maps (geom_polygon())

If appropriate, you can simple substitute one plot type for another.

#line plot ggplot(data=col_data, aes(x=num_genes_raw, y = total_counts_raw)) + geom_line() + geom_smooth(method=lm)

 

Some additional atrributes that can be included to make our plot look better are shown below.

Adding the color or shape arguments to our mapping aesthetic. Note the locations of the attributes relative to the parenthesis.

ggplot(data=col_data) + geom_point(aes(x=num_genes_raw, y = total_counts_raw, color=dex)) ggplot(col_data) + geom_point(aes(x=num_genes_raw, y = total_counts_raw, shape=dex), color="blue")

 

Any plot can be saved as an object, and layers can be added to the object. Here we are specifying the colors we want, and also the labels for the color legend.

scatter_plot <- ggplot(col_data) + geom_point(aes(x=num_genes_raw, y = total_counts_raw, color=dex)) scatter_plot = scatter_plot + scale_color_manual(values=c("red","black"), labels=c('treated','untreated'))

 

There are also packages for color palettes that can be used, along with built-in palettes in R. For example, the package ‘viridis’ provides colorblind friendly palettes.

library(viridis) ggplot(col_data) + geom_point(aes(x=num_genes_raw, y = total_counts_raw, color=dex)) + scale_color_viridis(discrete=TRUE, option="viridis")

 

Let’s combine a number of options and layers to make a final plot that could be used for a presentation.

 

library(gridExtra) p1 = ggplot(col_data) + geom_point(aes(x=num_genes_raw, y = total_counts_raw, fill=dex), shape=21,size=2) + scale_fill_manual(values=c("purple", "yellow"), labels=c('treated','untreated'))+ labs(x="Recovered genes per sample", y="Total sequences per sample", fill="Treatment") + theme_bw() #add a complete theme black / white p2 = ggplot(col_data) + geom_point(aes(x=num_genes_norm, y = total_counts_norm, fill=dex), shape=21,size=2) + scale_fill_manual(values=c("purple", "yellow"), labels=c('treated','untreated'))+ labs(x="Recovered genes per sample", y="Total sequences per sample", fill="Treatment") + theme_bw() #add a complete theme black / white grid.arrange(p1, p2, nrow = 1)

Save a plot to pdf

pdf('saved plot', width = 4, height = 4) print(p1) dev.off()