Tidyverse

Base R vs. tidyverse

The Tidyverse

Until now, we have mostly been coding with base R functions, along with functions from a couple of other packages. In this section, we will start working with the tidyverse, a collection of packages that share a common design and that has become increasingly popular in data science.

Tidyverse functions follow a common grammar. Most functions are named after verbs that describe what they do, and the first argument is the object to be worked on.

The core tidyverse includes the following packages, among others:

dplyr: grammar for data manipulation
tidyr: functions to create tidy data
stringr: functions to work with character strings
readr: an easy and fast way to import data
forcats: tools to work with factors
ggplot2: grammar for creating graphics

Pipe %>%

The functions can be used independently, but they are often combined. The pipe operator %>% passes the output of one function to the next function as its first argument, which lets you chain several steps in sequence. In the simple example below, we build a vector of 'index:value' strings, extract the value after the colon, convert it to a number, and pass the result to the mean() function.

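library(tidyverse)
# Build 100 strings of the form "index:value", with random values between 1000 and 2000
vect1 <- str_c(1:100, rep(':', 100), sample(1000:2000, 100))
# Extract the value after the colon, convert it to numeric, and average
vect1 %>%
  str_split_i(':', 2) %>%
  as.numeric() %>%
  mean()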
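To make the role of the pipe concrete, here is a minimal sketch comparing nested calls with the equivalent piped version. The input vector `ids_values` and its fixed values are hypothetical, chosen only so the result is reproducible.

```r
library(stringr)
library(magrittr)  # provides %>%; the pipe is also available after library(tidyverse)

# Hypothetical input: "id:value" strings with fixed values
ids_values <- str_c(1:3, ":", c(10, 20, 60))

# Nested calls are read from the inside out
nested_mean <- mean(as.numeric(str_split_i(ids_values, ":", 2)))

# The same steps with the pipe are read from top to bottom
piped_mean <- ids_values %>%
  str_split_i(":", 2) %>%
  as.numeric() %>%
  mean()

identical(nested_mean, piped_mean)
```

Both versions run the same three functions in the same order. The pipe only changes how the code is written, because `x %>% f(y)` is evaluated as `f(x, y)`.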

 

Table and vector manipulation

The dplyr package provides us with a data manipulation grammar, a set of useful verbs to solve common problems. The most important functions are shown below. Note that they are all named as recognizable verbs.

Function      Description
mutate()      add new variables or modify existing ones
select()      select variables (columns)
filter()      filter rows by a condition
summarise()   summarize/reduce, also spelled summarize() (summarizes per group when used with group_by())
arrange()     sort rows
group_by()    group rows by one or more variables
rename()      rename columns
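As a rough illustration of how several of these verbs combine, the sketch below chains them on the built-in mtcars data. The column name `kpl` and the conversion factor are illustrative choices, not part of the original material.

```r
library(dplyr)

efficient_cars <- mtcars %>%
  select(mpg, cyl, wt) %>%          # keep three columns
  rename(weight = wt) %>%           # rename a column (new name = old name)
  filter(cyl == 4) %>%              # keep only four-cylinder cars
  mutate(kpl = mpg * 0.4251) %>%    # add a column: km per litre (approximate conversion)
  arrange(desc(mpg))                # sort by mpg, highest first

head(efficient_cars, 3)
```

Each verb receives the data frame produced by the previous step, so the pipeline can be read as a list of instructions.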

A basic example using group_by() and summarize()

library(tidyverse)
data(iris) # may need to change
iris %>%
  group_by(Species) %>%
  summarize(
    sample_size = n(),
    avg_sepal_width = mean(Sepal.Width),
    sd_sepal_width = sd(Sepal.Width)
  )

Contrast between base R and Tidyverse

Example: Find the mean duration of each vocalization type for each walrus, then create a new dataframe containing only the data from Jocko.

image-20251114-184059.png

library(tidyverse)
# Read the tab-separated file (no header row) and name the columns
dataf <- read.table("walrus_sounds.tsv", header = FALSE, sep = "\t")
colnames(dataf) <- c('name', 'vocalization', 'min_sec')
# Convert "minutes:seconds" strings into seconds
sec1 <- as.numeric(str_split_i(dataf$min_sec, ':', 1)) * 60
sec2 <- as.numeric(str_split_i(dataf$min_sec, ':', 2))
# Base R
dataf$seconds <- sec1 + sec2
mean_by_group_baseR <- aggregate(seconds ~ name + vocalization, data = dataf, FUN = mean)
mean_by_group_baseR_jocko <- mean_by_group_baseR[mean_by_group_baseR$name == 'Jocko', ]
jocko_seconds <- mean_by_group_baseR_jocko$seconds
# dplyr
mean_by_group_dplyr_jocko <- dataf %>%
  mutate(seconds = sec1 + sec2) %>%
  group_by(name, vocalization) %>%
  summarize(Mean_seconds = mean(seconds), .groups = "drop") %>%
  filter(name == "Jocko") %>%
  as.data.frame()
# If you want to extract a column into a vector, use pull()
jocko_seconds <- mean_by_group_dplyr_jocko %>% pull(Mean_seconds)
# option, left merge

We sent the same dataframe through a series of base R manipulations on the one hand, and through a dplyr pipe expression on the other. The base R route returns a dataframe, while the dplyr pipe returns a tibble unless it is converted, which is why the pipe above ends with as.data.frame(). A tibble is the dataframe structure used in the tidyverse. It is built on top of a dataframe, but it differs in some behaviors, such as printing and subsetting, although I won’t get into the details here. Because of these differences, a tibble will not always behave like a plain dataframe when you use base R functions on it. To convert back to a dataframe, run the following: as.data.frame(tibble_df).
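One difference that often surprises people is single-column subsetting. The sketch below uses a small made-up table (`df` and its values are hypothetical) to show it, along with the conversion back to a plain dataframe.

```r
library(tibble)

# Hypothetical toy data
df  <- data.frame(name = c("a", "b"), seconds = c(1.5, 2.5))
tbl <- as_tibble(df)

class(df)   # "data.frame"
class(tbl)  # "tbl_df" "tbl" "data.frame"

# With [ , a dataframe drops a single column to a vector; a tibble stays a tibble
is.numeric(df[, "seconds"])       # TRUE
is.data.frame(tbl[, "seconds"])   # TRUE

# Convert back to a plain dataframe
back <- as.data.frame(tbl)
class(back)  # "data.frame"
```

Code that expects a vector from `x[, "col"]` may therefore need `pull()` or `[[` when `x` is a tibble.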

Last task: Pivoting a dataframe for plotting

To prepare data for easier use with ggplot2, it is often helpful to convert the data from wide format to long format.

Wide format:

image-20251114-052235.png

Long format:

image-20251114-052328.png
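As a small text version of the two formats, here is a sketch with a made-up table. The data and column names are hypothetical and are not the contents of the images above.

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one row per animal, one column per measurement
wide <- tibble(
  animal = c("A", "B"),
  length = c(3.1, 2.8),
  weight = c(900, 750)
)

# Long data: one row per animal and measurement
long <- pivot_longer(
  wide,
  cols = c(length, weight),
  names_to = "variable",
  values_to = "value"
)
long
```

The wide table has 2 rows and 3 columns. The long table has 4 rows, with the former column names stored in `variable` and their contents in `value`.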

Tidyverse:

library(tidyverse)
head(mtcars)
mtcars_wide <- mtcars
mtcars_long <- mtcars_wide %>%
  rownames_to_column(var = "model") %>%
  pivot_longer(
    cols = -c(model),      # Select all columns except 'model' to pivot
    names_to = "variable", # New column for the original column names
    values_to = "value"    # New column for the values from the original columns
  )

Without the tidyverse (using melt() from the data.table package):

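library(data.table)
mtcars_sub <- mtcars # assumption: mtcars_sub is not defined above, so the full mtcars table is used here
mtcars_sub$model <- rownames(mtcars_sub)
mtcars_long <- melt(
  data.table(mtcars_sub),
  id.vars = c("model", "mpg"),
  variable.name = 'variable',
  # Pivot every column that is not an id variable
  measure.vars = setdiff(colnames(mtcars_sub), c('model', 'mpg'))
)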

ggplot2 is very powerful

A simple bar plot

ggplot(mtcars, aes(x = cyl)) +
  geom_bar(fill = "steelblue", color = "black") +
  labs(title = "Number of Cars by Cylinder Type", x = "Number of Cylinders", y = "Count") +
  theme_minimal()

With more complicated plots, it’s better to pivot the data

mtcars_long <- mtcars %>%
  rownames_to_column(var = "model") %>% # Convert row names (car models) to a column
  pivot_longer(
    cols = -c(mpg, model), # Select all columns except 'model', 'mpg' to pivot
    names_to = "variable", # New column for the original column names
    values_to = "value"    # New column for the values from the original columns
  )
ggplot(mtcars_long, aes(x = mpg, y = value, color = variable)) +
  geom_point() +
  facet_wrap(~ variable, scales = "free_y") +
  labs(title = "mpg vs. variables", x = "Miles Per Gallon (mpg)", y = "Value") +
  theme_minimal()

Filter before adding labels and plotting

library(ggrepel) # provides geom_text_repel()
mtcars_long %>%
  filter(variable %in% c("wt")) %>%
  ggplot(aes(x = mpg, y = value, color = variable, label = model)) +
  geom_point() +
  facet_wrap(~ variable, scales = "free_y") +
  labs(title = "mpg vs. variables", x = "Miles Per Gallon (mpg)", y = "Value") +
  theme_minimal() +
  geom_text_repel()

Example of using left-join

left_data <- head(iris) %>%
  mutate(id = row_number()) # Create an ID column for merging
right_data <- tibble(
  id = c(1, 2, 99), # 1 and 2 match; 99 does not
  Sunlight_Hours = c(8, 6, 10),
  Soil_Type = c("Loamy", "Clay", "Sandy")
)
# left join
merged_data <- left_join(left_data, right_data, by = "id")

More on joining

Full (outer):

image-20251116-191121.png

Inner:

image-20251116-191037.png

Left:

image-20251116-191221.png

Right:

image-20251116-191252.png

Note an error with this last figure: the second line of the last result table should read: 4 | L4 | R2
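The four join types can also be compared in code. The sketch below uses two made-up tables (`left_tbl` and `right_tbl` are hypothetical and do not reproduce the data in the figures).

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables (not the data from the figures)
left_tbl  <- tibble(id = c(1, 2, 3), left_val  = c("L1", "L2", "L3"))
right_tbl <- tibble(id = c(2, 3, 4), right_val = c("R1", "R2", "R3"))

inner <- inner_join(left_tbl, right_tbl, by = "id")  # only ids present in both tables
left  <- left_join(left_tbl, right_tbl, by = "id")   # all ids from left_tbl
right <- right_join(left_tbl, right_tbl, by = "id")  # all ids from right_tbl
full  <- full_join(left_tbl, right_tbl, by = "id")   # all ids from either table

# Rows without a match are filled with NA
left$right_val[left$id == 1]
```

Here id 1 exists only in the left table and id 4 only in the right table, so the four results have 2, 3, 3, and 4 rows respectively.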