dev_R/Rstudio part II

Base R vs. tidyverse

The Tidyverse

Until now, we have mostly used ‘base R’ functions, along with functions from a couple of other packages. In this section, we start working with the tidyverse, a collection of packages that share a common design and that is increasingly popular in data science.

Tidyverse functions share a common grammar. Most are named as verbs that describe what they do, and they take the object to be worked on (usually a data frame or a vector) as their first argument.
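As a small illustration of the data-first convention, the sketch below applies two dplyr verbs to the built-in mtcars data set. The choice of mtcars, the columns, and the object names are assumptions made only for this example.

```r
library(dplyr)

# Each verb takes the data frame as its first argument and returns a new data frame.
four_cyl <- filter(mtcars, cyl == 4)          # keep rows where cyl equals 4
four_cyl_small <- select(four_cyl, mpg, hp)   # keep only the mpg and hp columns

dim(four_cyl_small)
```

Because the data always comes first, the output of one verb can be handed directly to the next one.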

The core tidyverse includes the following packages, along with a few others such as tibble and purrr:

dplyr	Grammar for data manipulation
tidyr	Set of functions to create tidy data
stringr	Functions for working with character strings
readr	An easy and fast way to import data
forcats	Tools to easily work with factors
ggplot2	Grammar for creating graphics

Pipe %>%

Each function can be used on its own, but the pipe operator %>% makes it easy to combine them: it passes the output of one function as the first argument of the next. This lets you chain several steps in sequence and read them in order. In the simple example below, we split each string at the colon, keep the second piece, convert it to a number, and pass the resulting vector to the mean() function.

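library(tidyverse)
# Build 100 strings of the form "index:value", with values sampled from 1000 to 2000
vect1 <- str_c(1:100, rep(':', 100), sample(1000:2000, 100))
# Keep the part after the colon, convert it to numeric, and take the mean
vect1 %>%
  str_split_i(':', 2) %>%
  as.numeric() %>%
  mean()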
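To see what the pipe buys us, the sketch below writes the same computation twice: once as nested function calls and once as a pipe. The seed value and the names nested_mean and piped_mean are arbitrary choices for this illustration.

```r
library(stringr)
library(magrittr)  # provides %>%; it is also available after library(tidyverse)

set.seed(1)  # arbitrary seed so the example is reproducible
vect1 <- str_c(1:100, ":", sample(1000:2000, 100))

# Nested form: has to be read from the inside out.
nested_mean <- mean(as.numeric(str_split_i(vect1, ":", 2)))

# Piped form: reads from top to bottom, one step per line.
piped_mean <- vect1 %>%
  str_split_i(":", 2) %>%
  as.numeric() %>%
  mean()

identical(nested_mean, piped_mean)
```

Both forms perform the same steps in the same order; the pipe only changes how the code is written and read.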

Table and vector manipulation

The dplyr package provides us with a data manipulation grammar, a set of useful verbs to solve common problems. The most important functions are shown below. Note that they are all named as recognizable verbs.

Function	Description
mutate()	add new variables or modify existing ones
select()	select variables, columns
filter()	keep the rows that match a condition
summarise()	summarize/reduce (Summarizes per group when used with group_by)
arrange()	sort
group_by()	group
rename()	rename columns
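The verbs in the table can be combined in a single pipe. The following is a minimal sketch on the built-in mtcars data set; the derived column kpl (kilometres per litre), the threshold of 15 mpg, and all object names are assumptions chosen only to exercise each verb once.

```r
library(dplyr)

mpg_summary <- mtcars %>%
  mutate(kpl = mpg * 0.425144) %>%     # add a column: miles per gallon to km per litre
  select(cyl, mpg, kpl) %>%            # keep three columns
  filter(mpg > 15) %>%                 # keep rows with mpg above 15
  group_by(cyl) %>%                    # one group per cylinder count
  summarise(mean_kpl = mean(kpl), n = n(), .groups = "drop") %>%  # one row per group
  arrange(desc(mean_kpl)) %>%          # sort from most to least efficient
  rename(cylinders = cyl)              # rename a column

mpg_summary
```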

Contrast between base R and Tidyverse:

Example: compute the mean duration in seconds for each combination of walrus name and vocalization type, then create a new dataframe containing only the data from Jocko.

dataf <- read.table("walrus_sounds.tsv", header = FALSE, sep = "\t")
colnames(dataf) <- c('name', 'vocalization', 'min_sec')
# Convert the min_sec strings to total seconds (str_split_i() comes from stringr)
sec1 <- as.numeric(str_split_i(dataf$min_sec, ':', 1)) * 60
sec2 <- as.numeric(str_split_i(dataf$min_sec, ':', 2))
dataf$seconds <- sec1 + sec2
dataf

mean_by_group_baseR <- aggregate(seconds ~ name + vocalization, data = dataf, FUN = mean)
mean_by_group_baseR_jocko <- mean_by_group_baseR[mean_by_group_baseR$name == 'Jocko',]
jocko_seconds <- mean_by_group_baseR_jocko$seconds


mean_by_group_dplyr_jocko <- dataf %>%
   group_by(name, vocalization) %>%
   summarize(Mean_seconds = mean(seconds), .groups = "drop") %>%  # drop grouping after summarizing
   filter(name == "Jocko") %>%
   as.data.frame()
jocko_seconds <- mean_by_group_dplyr_jocko %>% pull(Mean_seconds)

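We sent the same dataframe through a series of base R manipulations and through a dplyr pipe expression. The base R route returns a dataframe, while group_by() and summarize() return a tibble, which is why the pipe ends with as.data.frame(). A tibble is a special kind of dataframe, but it does not always behave like one: it prints differently, it does not keep row names, and selecting a single column with [ returns a tibble rather than a vector. When base R code depends on ordinary dataframe behavior, use as.data.frame() on the tibble to recover a plain dataframe.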
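The difference between the two object types can be checked directly. The sketch below uses the first rows of the built-in mtcars data set as a stand-in; the object names df and tb are illustrative.

```r
library(tibble)

df <- head(mtcars)
tb <- as_tibble(df, rownames = "model")  # tibbles drop row names, so keep them as a column

is.data.frame(tb)         # a tibble is still a kind of data frame
class(df[, "mpg"])        # data frame: a single column drops to a plain vector
class(tb[, "mpg"])        # tibble: a single column stays a tibble
class(as.data.frame(tb))  # converted back to a plain data frame
```

Code that expects a vector from `x[, "col"]` is one common place where a tibble gives a different result than a dataframe.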

Last task: Pivoting a dataframe for plotting

ggplot2 works best with long-format data, so a wide table usually has to be converted from wide to long before plotting.

Wide format:

Long format:

Tidyverse:

mtcars_sub <- head(mtcars)
library(tidyverse)
mtcars_long <- mtcars_sub %>%
      rownames_to_column(var = "model") %>% # Convert row names (car models) to a column
      pivot_longer(
        cols = -c(mpg, model), # Select all columns except 'mpg, model' to pivot
        names_to = "variable", # New column for the original column names
        values_to = "value" # New column for the values from the original columns
      )

data.table (a separate package, not base R):

library(data.table)
mtcars_sub$model <- rownames(mtcars_sub)
long <- melt(
  data.table(mtcars_sub),
  id.vars = c("model"),
  variable.name = 'desig',
  # melt() takes measure.vars: here, every column except the last one ('model')
  measure.vars = colnames(mtcars_sub)[-length(colnames(mtcars_sub))]
  )
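Note that melt() comes from the data.table package. For comparison, a version that uses only base R can be sketched with reshape(). This sketch mirrors the tidyverse example by keeping model and mpg as identifier columns; the names measure_cols and long_base are illustrative.

```r
mtcars_sub <- head(mtcars)
mtcars_sub$model <- rownames(mtcars_sub)

# Columns to stack into the long format: everything except the identifiers.
measure_cols <- setdiff(colnames(mtcars_sub), c("model", "mpg"))

long_base <- reshape(
  mtcars_sub,
  direction = "long",
  varying = measure_cols,   # wide columns to stack
  v.names = "value",        # new column holding the stacked values
  timevar = "variable",     # new column holding the original column names
  times = measure_cols,     # labels written into the 'variable' column
  idvar = "model"           # column that identifies each car
)
rownames(long_base) <- NULL  # discard the automatically generated row names

head(long_base)
```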

ggplot2 is very powerful

mtcars_long <- mtcars %>%
      rownames_to_column(var = "model") %>% # Convert row names (car models) to a column
      pivot_longer(
        cols = -c(mpg, model), # Select all columns except 'model', 'mpg' to pivot
        names_to = "variable", # New column for the original column names
        values_to = "value" # New column for the values from the original columns
      )
      
ggplot(mtcars_long, aes(x = mpg, y = value, color = variable)) +
      geom_point() +
      facet_wrap(~ variable, scales = "free_y") + 
      labs(title = "mpg vs. variables", x = "Miles Per Gallon (mpg)", y = "Value") +
      theme_minimal()

Pivoting the other way, from long to wide, starting again from the walrus data:

dataf <- read.table("walrus_sounds.tsv", header = FALSE, sep = "\t")
colnames(dataf) <- c('name', 'vocalization', 'min_sec')
sec1 <- as.numeric(str_split_i(dataf$min_sec, ':', 1)) * 60
sec2 <- as.numeric(str_split_i(dataf$min_sec, ':', 2))
dataf$seconds <- sec1 + sec2
dataf

result <- aggregate(seconds ~ name + vocalization, data = dataf, FUN = mean)

df_wide <- reshape(result,
  idvar = "name",            # Column(s) that identify unique observations
  timevar = "vocalization",  # Column whose values will become new column names
  v.names = "seconds",       # Column(s) whose values will populate the new columns
  direction = "wide"         # Specifies the reshaping direction
)

mean_by_group_dplyr <- dataf %>%
   group_by(name, vocalization) %>%
   summarize(Mean_Value = mean(seconds), .groups = "drop")

# Alternatively, use tidyr to pivot to wide format
library(tidyr)
df_wide_tidy <- mean_by_group_dplyr %>% pivot_wider(names_from = vocalization, values_from = Mean_Value)
as.data.frame(df_wide_tidy)

You can combine base R and tidyverse code, but be careful about which type of object you are working with.
For example, reshape() may fail or give unexpected results when run directly on mean_by_group_dplyr, because group_by() and summarize() return a tibble. Convert the tibble with as.data.frame() before passing it to base R functions. To go the other way, use as_tibble().

df_wide <- reshape(as.data.frame(mean_by_group_dplyr),
  idvar = "name",            # Column(s) that identify unique observations
  timevar = "vocalization",  # Column whose values will become new column names
  v.names = "Mean_Value",    # Column(s) whose values will populate the new columns
  direction = "wide"         # Specifies the reshaping direction
)
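Since walrus_sounds.tsv may not be at hand, the same long-to-wide steps can be tried on a small made-up table. The data below are hypothetical and are not taken from the walrus file; only the column layout (name, vocalization, seconds) follows the example above.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with the same column layout as dataf.
toy <- data.frame(
  name = c("A", "A", "A", "B", "B", "B"),
  vocalization = c("grunt", "grunt", "whistle", "grunt", "whistle", "whistle"),
  seconds = c(10, 20, 5, 8, 6, 12)
)

means <- toy %>%
  group_by(name, vocalization) %>%
  summarise(mean_seconds = mean(seconds), .groups = "drop")

# tidyr: one column per vocalization type
wide_tidy <- means %>%
  pivot_wider(names_from = vocalization, values_from = mean_seconds) %>%
  as.data.frame()

# base R: convert the tibble to a data frame before calling reshape()
wide_base <- reshape(
  as.data.frame(means),
  idvar = "name",
  timevar = "vocalization",
  v.names = "mean_seconds",
  direction = "wide"
)

wide_tidy
wide_base
```

The two results hold the same values. pivot_wider() names the new columns after the vocalization types, while reshape() prefixes them with the value column name, for example mean_seconds.grunt.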