Tidyverse
Base R vs. tidyverse
The Tidyverse
Until now, we have mostly used ‘base R’ functions, along with functions from a couple of other packages. In this section, we start working with the tidyverse, a collection of packages that is increasingly popular in data science.
Tidyverse functions share a common design. Most are named as verbs, and their first argument is the object to be worked on, which makes the functions easy to chain.
The core tidyverse includes, among others, the following packages:
Package | Description |
|---|---|
dplyr | Grammar for data manipulation |
tidyr | Functions for creating tidy data |
stringr | Functions for working with character strings |
readr | An easy and fast way to import data |
forcats | Tools for working with factors |
ggplot2 | Grammar for creating graphics |
Pipe %>%
Each function can be used on its own, but the pipe operator %>% makes it easy to combine them: it passes the output of one function as the first argument of the next. This lets you chain several steps in sequence, reading from left to right. In the simple example below, we build a character vector, extract the number after the colon from each element, convert it to numeric, and pass the result to the mean() function.
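library(tidyverse)
# Build 100 strings of the form "<index>:<random number between 1000 and 2000>"
vect1 <- str_c(1:100, sample(1000:2000, 100), sep = ":")
# Keep the part after the colon, convert it to numeric, and average it
vect1 %>%
  str_split_i(":", 2) %>%
  as.numeric() %>%
  mean()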
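To make the role of the pipe concrete, the sketch below computes the same value twice: once with nested calls and once with the pipe. The vector `durations` is made up for illustration and is not part of the course data.

```r
library(tidyverse)

# Hypothetical input: durations written as "minutes:seconds"
durations <- c("1:30", "0:45", "2:15")

# Nested calls are read from the inside out
nested <- mean(as.numeric(str_split_i(durations, ":", 2)))

# The pipe is read left to right; x %>% f(y) is equivalent to f(x, y)
piped <- durations %>%
  str_split_i(":", 2) %>%
  as.numeric() %>%
  mean()

# Both approaches give the same result
identical(nested, piped)
```

The piped version lists the steps in the order they are carried out, which tends to be easier to follow once a chain grows beyond two or three functions.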
Table and vector manipulation
The dplyr package provides us with a data manipulation grammar, a set of useful verbs to solve common problems. The most important functions are shown below. Note that they are all named as recognizable verbs.
Function | Description |
|---|---|
mutate() | add new variables or modify existing ones |
select() | select variables (columns) |
filter() | keep rows that satisfy a condition |
summarise() | reduce each group to a single summary row |
arrange() | sort rows |
group_by() | group rows by one or more variables |
rename() | rename columns |
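As a quick tour of these verbs, the sketch below applies each of them once to a small table. The table `calls` and its values are invented for this illustration.

```r
library(dplyr)

# Hypothetical data, invented for this illustration
calls <- tibble(
  animal  = c("A", "A", "B", "B"),
  type    = c("grunt", "whistle", "grunt", "grunt"),
  seconds = c(2, 5, 3, 7)
)

result <- calls %>%
  rename(call_type = type) %>%           # rename a column
  filter(call_type == "grunt") %>%       # keep only the rows that are grunts
  mutate(minutes = seconds / 60) %>%     # add a derived column
  select(animal, seconds, minutes) %>%   # keep only these columns
  group_by(animal) %>%                   # one group per animal
  summarise(
    mean_seconds = mean(seconds),        # one summary row per group
    n_calls = n()
  ) %>%
  arrange(desc(mean_seconds))            # sort, largest mean first

result
```

Each verb takes a table as its first argument and returns a table, which is what allows the steps to be chained with the pipe.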
Contrast between base R and Tidyverse:
Example: compute the mean duration for each combination of name and vocalization type, then create a new data frame containing only the data from Jocko.
dataf <- read.table("walrus_sounds.tsv", header = FALSE, sep = "\t")
colnames(dataf) <- c("name", "vocalization", "min_sec")
# Convert "minutes:seconds" strings to seconds
sec1 <- as.numeric(str_split_i(dataf$min_sec, ":", 1)) * 60
sec2 <- as.numeric(str_split_i(dataf$min_sec, ":", 2))
# Base R
dataf$seconds <- sec1 + sec2
mean_by_group_baseR <- aggregate(seconds ~ name + vocalization, data = dataf, FUN = mean)
mean_by_group_baseR_jocko <- mean_by_group_baseR[mean_by_group_baseR$name == "Jocko", ]
jocko_seconds <- mean_by_group_baseR_jocko$seconds
# dplyr
mean_by_group_dplyr_jocko <- dataf %>%
  mutate(seconds = sec1 + sec2) %>%
  group_by(name, vocalization) %>%
  summarize(Mean_seconds = mean(seconds), .groups = "drop") %>%
  filter(name == "Jocko") %>%
  as.data.frame()
# If you want to extract a column into a vector, use pull()
jocko_seconds <- mean_by_group_dplyr_jocko %>% pull(Mean_seconds)
#option, left merge
We sent the same data frame through a series of base R manipulations on the one hand, and through a dplyr pipe expression on the other. The base R route returns a data frame, whereas the grouped summary produced by dplyr is a tibble, which is why the pipe ends with as.data.frame(). A tibble is the data frame structure used in the tidyverse. It differs from a plain data frame in a few ways, although I won’t get into exactly how; most base R functions accept a tibble, but some behave differently with it. When you need a plain data frame, use as.data.frame(tibble) to recover one.
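As a small illustration of one such difference, the sketch below compares how a data frame and a tibble respond to single-column subsetting. The objects `df` and `tbl` are made up for this example.

```r
library(tibble)

# Hypothetical data, invented for this illustration
df  <- data.frame(name = c("a", "b"), value = c(1, 2))
tbl <- as_tibble(df)

# A tibble is also a data.frame, so most base R functions accept it
class(tbl)
inherits(tbl, "data.frame")

# Selecting one column with [ , j] drops a data.frame to a vector ...
df_col <- df[, "value"]
is.numeric(df_col)

# ... but a tibble stays a one-column tibble
tbl_col <- tbl[, "value"]
is.data.frame(tbl_col)

# Converting back yields a plain data.frame
back <- as.data.frame(tbl)
class(back)
```

Code that expects a vector from `[ , j]` is a common place where a tibble surprises base R users; `pull()` or `[[` returns a vector in both cases.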
Last task: Pivoting a dataframe for plotting
To leverage the power of ggplot2, it is often necessary to convert data from wide to long format, so that each row holds a single value and a separate column records which variable that value belongs to.
Wide format:
Long format:
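As a small illustration of the two layouts, the sketch below builds a made-up wide table, converts it to long format, and converts it back. The table `wide` and its column names are hypothetical.

```r
library(tidyr)

# Wide: one row per subject, one column per measurement (made-up values)
wide <- data.frame(
  subject = c("s1", "s2"),
  height  = c(150, 160),
  weight  = c(50, 60)
)

# Long: one row per subject-measurement pair
long <- pivot_longer(
  wide,
  cols = c(height, weight),
  names_to = "variable",
  values_to = "value"
)

# pivot_wider() reverses the operation
wide_again <- pivot_wider(long, names_from = variable, values_from = value)

long
```

The long table has one row for every value that was spread across the measurement columns of the wide table.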
Tidyverse:
library(tidyverse)
mtcars_sub <- head(mtcars)
mtcars_long <- mtcars_sub %>%
  rownames_to_column(var = "model") %>%  # Convert row names (car models) to a column
  pivot_longer(
    cols = -c(model, mpg),   # Pivot all columns except 'model' and 'mpg'
    names_to = "variable",   # New column for the original column names
    values_to = "value"      # New column for the values from the original columns
  )
data.table:
library(data.table)
mtcars_sub$model <- rownames(mtcars_sub)
mtcars_long <- melt(
  data.table(mtcars_sub),
  id.vars = c("model", "mpg"),   # Columns kept as identifiers
  variable.name = "variable",    # New column for the original column names
  value.name = "value"           # New column for the values; all remaining columns are melted
)
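The melt() call above comes from the data.table package. For comparison, here is a sketch that uses only base R, relying on reshape(). The object names `wide`, `measure_cols`, and `long_base` are chosen for this illustration.

```r
# Base R only: no packages are attached here
wide <- head(mtcars)
wide$model <- rownames(wide)

# Every column except the identifiers is stacked into 'value'
measure_cols <- setdiff(names(wide), c("model", "mpg"))

long_base <- reshape(
  wide,
  direction = "long",
  idvar     = "model",        # identifies each car
  varying   = measure_cols,   # columns to stack
  v.names   = "value",        # name of the stacked values column
  timevar   = "variable",     # column that records the original column name
  times     = measure_cols    # labels written into 'variable'
)

# Drop the composite row names that reshape() creates
rownames(long_base) <- NULL

dim(long_base)
```

reshape() is more verbose than pivot_longer() or melt(), but it avoids any package dependency.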
ggplot2 is very powerful
mtcars_long <- mtcars %>%
  rownames_to_column(var = "model") %>%  # Convert row names (car models) to a column
  pivot_longer(
    cols = -c(mpg, model),   # Select all columns except 'model', 'mpg' to pivot
    names_to = "variable",   # New column for the original column names
    values_to = "value"      # New column for the values from the original columns
  )
ggplot(mtcars_long, aes(x = mpg, y = value, color = variable)) +
  geom_point() +
  facet_wrap(~ variable, scales = "free_y") +
  labs(title = "mpg vs. variables", x = "Miles Per Gallon (mpg)", y = "Value") +
  theme_minimal()
More on joining
Full (outer):
Inner:
Left:
Right:
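The four join types listed above can be sketched with dplyr as follows. The tables `left_tbl` and `right_tbl` are hypothetical and exist only to show which rows each join keeps.

```r
library(dplyr)

# Hypothetical tables sharing the key column 'name'
left_tbl  <- tibble(name = c("a", "b", "c"), x = 1:3)
right_tbl <- tibble(name = c("b", "c", "d"), y = c(20, 30, 40))

full  <- full_join(left_tbl, right_tbl, by = "name")   # all names from both tables
inner <- inner_join(left_tbl, right_tbl, by = "name")  # only names present in both
left  <- left_join(left_tbl, right_tbl, by = "name")   # every row of left_tbl
right <- right_join(left_tbl, right_tbl, by = "name")  # every row of right_tbl

# Cells with no match in the other table are filled with NA
full
```

Here "a" appears only in `left_tbl` and "d" only in `right_tbl`, so the full join has four rows, the inner join two, and the left and right joins three each.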