Re-entry into R
At this point, we have come into contact with a few object types in R: numeric, character, vector.
We’ve also learned how to interact with a vector by extracting elements, using base R functions, and using logical operators.
Until now, however, we’ve been interacting with these objects without fully appreciating what these types of objects represent in the R language. This was intentional, because gaining some familiarity with objects and how they interact from a more intuitive standpoint is a great background to have for what comes next.
At this point, I would now like to take a more detailed approach and introduce more complex aspects of R: a fuller range of its objects and data types, as well as some useful packages and functions.
[Figure: Objects | Data Structures]
We will start our coding in this section with the vector, and we will encounter all of the basic object types as individual elements of these vectors. We will then discuss lists, and move on to dataframes as quickly as possible, since it is the dataframe where the power of R is best leveraged.
There are four primary types of atomic vectors: logical, integer, double, and character.
Have a go at creating some of them.
Code:
num_1 <- as.integer(c(1:10))
num_2 <- as.double(c(1:10))
char_1 <- 'coding region'
boolean_1 <- c(T,F,F)
Warning: Recycling
We saw how boolean vectors can interact with numeric vectors, for example. Try indexing the above num_1 vector with boolean_1. Figure out what is going on when the boolean vector is shorter than the vector it is indexing.
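As a hedged illustration of recycling, the sketch below uses two hypothetical vectors, `x` and `keep`, that are not part of the exercise above. It suggests that a logical index shorter than the vector it indexes is repeated until it covers the whole vector.

```r
# Hypothetical example vectors (assumed for illustration, not from the text above)
x <- c(10, 20, 30, 40, 50, 60)
keep <- c(TRUE, FALSE)

# keep is recycled to TRUE FALSE TRUE FALSE TRUE FALSE before indexing,
# so every other element is selected
recycled <- x[keep]
recycled
# returns 10 30 50

# Writing the recycled index out in full selects the same elements
full_index <- rep(keep, length.out = length(x))
identical(recycled, x[full_index])
# returns TRUE
```

Note that R does this without any message, which is why the heading above is a warning.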
Coercion
Even though R’s vectors have a specific type, it’s quite easy to convert them to another type. This is called coercion. As a language for data analysis, this flexibility works mostly to our advantage. It’s why we generally don’t stress out over integer versus double in R. It’s why we can compute a proportion as the mean of a logical vector (we exploit automatic coercion to integer in this case). But unexpected coercion is a rich source of programming puzzles, so always consider this possibility when coding.
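The proportion trick mentioned above can be sketched as follows. The vector `lengths_kb` and the threshold of 1 are hypothetical values chosen only for illustration.

```r
# Hypothetical measurements (assumed for illustration)
lengths_kb <- c(1.2, 3.4, 0.8, 5.1)

# A comparison returns a logical vector
is_long <- lengths_kb > 1
is_long
# returns TRUE TRUE FALSE TRUE

# In arithmetic, TRUE is treated as 1 and FALSE as 0
n_long <- sum(is_long)      # how many values exceed the threshold
prop_long <- mean(is_long)  # what proportion of values exceed it
n_long
# returns 3
prop_long
# returns 0.75
```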
For explicit coercion, use the as.*() functions.
v_log <- c(T,F,F,T)
as.integer(v_log)
as.numeric(v_log)
as.character(v_log)
Note: logical vectors can be expressed equally well by the integers 0 for FALSE and 1 for TRUE.
as_logical <- as.logical(c(1,1,0,0,0))
as_logical
Warning: implicit coercion
Coercion can also be triggered by other actions, such as combining an existing vector with an element of a different type.
num_vect <- c(1:10)
num_vect <- c(num_vect, 'string')
num_vect
Our numeric vector was silently coerced to character. Notice that R did this quietly, without warning. Always pay attention to the question: Is this object of the type I think it is? How sure am I about that?
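One way to answer that question is to ask R directly. The sketch below uses hypothetical vectors `x` and `y` and the base R functions typeof() and is.numeric() to check a vector before and after an implicit coercion.

```r
# Hypothetical vectors (assumed for illustration)
x <- 1:3
typeof(x)
# returns "integer"

# Combining with a string coerces every element to character
y <- c(x, "string")
typeof(y)
# returns "character"
is.numeric(y)
# returns FALSE

# Mixed types are coerced toward the more general type:
# logical -> integer -> double -> character
typeof(c(TRUE, 1L))
# returns "integer"
typeof(c(1L, 2.5))
# returns "double"
```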
Some loose ends: NA, NaN, Inf, -Inf
NA is the object that R uses to indicate ‘not available’, or ‘missing’. It will incorporate itself into a vector and take the same data type as the elements around it. Therefore, it is necessary to know for certain if the data contains NAs by testing using the function is.na().
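One reason this matters, shown here as a small sketch with a hypothetical vector `heights`: many summary functions return NA as soon as a single NA is present, unless they are told to drop missing values with `na.rm = TRUE`.

```r
# Hypothetical data with one missing value (assumed for illustration)
heights <- c(1.5, NA, 2.5)

# anyNA() is a quick check for the presence of missing values
has_missing <- anyNA(heights)
has_missing
# returns TRUE

# The missing value propagates into the result
mean_with_na <- mean(heights)
mean_with_na
# returns NA

# Dropping the NA gives the mean of the remaining values
mean_without_na <- mean(heights, na.rm = TRUE)
mean_without_na
# returns 2
```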
a_vect <- c(1,-3,5,NA,7, NA)
typeof(a_vect[4])
is.na(a_vect)
sum(is.na(a_vect))
NaN is the object produced when there is no possible mathematical result for an operation, for example, when dividing 0 by 0, or when taking the square root of a negative number.
a_vect/0
sqrt(a_vect)
is.nan(sqrt(a_vect))
We can see above how Inf is produced when dividing by zero. The common way that -Inf is produced is when taking the logarithm of zero.
log(rep(0, 5))
Lists
Lists are needed for holding objects that violate the constraints imposed by an atomic vector, that is, when one or both of the following are true:
An individual element has length greater than 1.
The individual objects to be collected together are not of the same type.
Lists are a more general form of a vector. Whereas a vector must contain elements that are all the same object type, the elements of a list may be various types. Lists are also commonly created while providing names for each element.
Three common ways to create a list from scratch:
1. Provide the element names together with the elements of the list
list1 <- list(element1 = 1, element2 = c(2,3,4), element3 = T)
2. Create a list and then provide the names as a vector
list1 <- list(1, c(2,3,4), T)
names(list1) <- c("element1", "element2", "element3")
3. Convert some other object to a list
a_vect <- c(11:20)
a_list <- as.list(a_vect)
Indexing a list:
With vectors, we saw that there was one way of indexing elements: using brackets.
With lists, the elements can be indexed in three ways, all of which are important to understand.
1. Single brackets (returns a list of the selected element or elements)
list1 <- list(name1 = 1, name2 = c(2,3,4), name3 = T, name4 = NA)
list1[2]
2. Double brackets (returns the element indexed)
list1 <- list(element1 = 1, element2 = c(2,3,4), element3 = T)
list1[[2]]
3. Using $<element name>
list1$element2
Question: How can we identify a list from what R prints to the console in the three cases above?
Hint: In 2. above, use : to select a range of elements
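Besides reading the console output, the result of each indexing style can be checked programmatically. This is a minimal sketch using a hypothetical list `demo_list` and the base R functions is.list() and length().

```r
# Hypothetical list (assumed for illustration)
demo_list <- list(element1 = 1, element2 = c(2, 3, 4), element3 = TRUE)

# Single brackets keep the list wrapper around the selected element
as_sublist <- demo_list[2]
is.list(as_sublist)
# returns TRUE
length(as_sublist)
# returns 1

# Double brackets return the element itself, here a numeric vector
as_element <- demo_list[[2]]
is.list(as_element)
# returns FALSE
length(as_element)
# returns 3

# The $ form is equivalent to double brackets with the element name
identical(as_element, demo_list$element2)
# returns TRUE
```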
By the way, a vector can also have element names! However, note the format when R prints out the object.
a_named_vect <- c(5:10)
names(a_named_vect) <- letters[1:6]
a_named_vect
Dataframes
R is designed for handling data in the form of large tables, so this section is where we’ll see the full power of R for data analysis. The underlying structure of a dataframe is just a list of vectors, so we already understand much about how to handle them. However, we are going to need a few tools in order to work effectively with dataframes.
stringr: Library containing many functions for string operations.
library(stringr)
str <- c("Hello", "hello there", "hi", "ahoy", "nope")
str_detect(str, 'he') #tests whether elements contain a pattern
str_split_i(str, 'll', 1) #splits each element at a pattern,and selects the first split
str_split_i(str, 'll', 2) #splits each element at a pattern,and selects the second split
str_c("Hello ", "there") # pastes together strings
table(): Quickly count instances of elements occurring in a vector
number_vect <- sample(1:10, 1000, replace = TRUE) # replace = TRUE is required: sample() cannot draw 1000 values from only 10 without replacement
table(number_vect)
rnorm(): sample from a normal distribution
?rnorm
rnorm(5,10,1)
order(): Returns the indices that would sort a vector numerically or alphabetically. Note that it returns positions rather than the sorted values, so use those indices to reorder the vector.
number_vect <- sample(1:10, 10)
order(number_vect, decreasing = T)
number_vect[order(number_vect)]
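Because sample() gives different numbers on every run, a fixed example may make the behavior of order() easier to follow. The vectors `scores` and `labels` below are hypothetical.

```r
# Hypothetical vectors (assumed for illustration)
scores <- c(30, 10, 20)
labels <- c("c", "a", "b")

# order() returns positions: the smallest value is at position 2,
# the next at position 3, and the largest at position 1
idx <- order(scores)
idx
# returns 2 3 1

# Indexing with those positions sorts the vector
scores[idx]
# returns 10 20 30

# The same positions can reorder a second, parallel vector
labels[idx]
# returns "a" "b" "c"

# sort() returns the sorted values directly, without the positions
identical(sort(scores), scores[idx])
# returns TRUE
```

Reordering a parallel vector with the same indices is the reason order() is used so often for sorting the rows of a dataframe.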
%in%: This strange-looking operator returns a boolean vector based on whether the elements of the first vector are found in the second.
vect_1 <- c(1:20)
vect_2 <- c(7:11)
vect_1[vect_1 %in% vect_2]
vect_1[!vect_1 %in% vect_2]
For loops: For loops are a convenient way to run the same operation on a series of objects. In the code below, we will run a mathematical expression on each of a series of numbers.
output = c()
input = c(2,4,6,8)
for (i in input) {
output = c(output, (i/2)^4)
}
output
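For a simple arithmetic expression like this one, the loop can also be written as a single vectorized expression, since R applies arithmetic to every element of a vector. The sketch below compares the two approaches; the names `looped` and `vectorized` are introduced here only for illustration.

```r
input <- c(2, 4, 6, 8)

# Loop version, filling a pre-allocated result vector by position
looped <- numeric(length(input))
for (k in seq_along(input)) {
  looped[k] <- (input[k] / 2)^4
}

# Vectorized version: the expression is applied to every element at once
vectorized <- (input / 2)^4
vectorized
# returns 1 16 81 256

# Both approaches give the same values
all.equal(looped, vectorized)
# returns TRUE
```

Loops remain useful when each step is more involved than a single expression, for example when reading or writing one file per iteration.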
Functions: We won’t spend much time with writing functions, but you will need to be able to create some simple functions of the type shown below.
#A function that takes in two values and returns their sum.
add_two_values <- function(x,y) { return(x+y) }
#A function that takes in a value, and samples a normal distribution
#using that value x as a mean, returning a vector of length y.
sample_normal <- function(x,y) { return( rnorm(y, mean=x) ) }
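Calling a function works the same way as calling any built-in R function. The sketch below uses a hypothetical function, `to_seconds`, which is not part of the text above, to show positional arguments, named arguments, and a default value.

```r
# Hypothetical function (assumed for illustration): converts minutes and
# seconds to total seconds; seconds defaults to 0 when it is not supplied
to_seconds <- function(minutes, seconds = 0) {
  return(minutes * 60 + seconds)
}

# Positional arguments, relying on the default for seconds
to_seconds(2)
# returns 120

# Both arguments supplied by position
to_seconds(2, 30)
# returns 150

# Named arguments can be given in any order
to_seconds(seconds = 30, minutes = 2)
# returns 150

# The arithmetic is vectorized, so vectors work as inputs too
to_seconds(c(1, 2), c(15, 45))
# returns 75 165
```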
lapply: lapply is a convenient way to generate or modify a list of objects. It is similar to writing a for loop, but it takes less code. There are other ‘apply’ functions in R (e.g., apply, sapply, mapply), but we will stick with lapply. It is best to understand one of these apply functions before trying to use them all.
generate_vect <- function(x){ return( rnorm(100, mean = x, sd = 3) ) }
list_of_vectors <- lapply(c(3,7,20), FUN = generate_vect)
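Since rnorm() returns different numbers on each run, a deterministic function may make the shape of lapply's output easier to see. This sketch uses a hypothetical function, `half_fourth`, that performs the same calculation as the for loop shown earlier.

```r
# Hypothetical helper (assumed for illustration): same expression as the loop body
half_fourth <- function(i) {
  return((i / 2)^4)
}

# lapply calls the function once per input element and collects the results in a list
results <- lapply(c(2, 4, 6, 8), FUN = half_fourth)

is.list(results)
# returns TRUE
length(results)
# returns 4

# unlist() flattens the list back into an atomic vector
unlist(results)
# returns 1 16 81 256
```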
Creating a dataframe:
Method 1: from vectors, by column
#generate vectors
vect1 <- c(1:10)
vect2 <- sample(c(20:25), 10, replace=T)
vect3 <- str_c(LETTERS[1:10], letters[1:10])
#create dataframe
df <- as.data.frame(cbind(vect1, vect2, vect3))
df$vect1 <- as.numeric(df$vect1)
df$vect2 <- as.numeric(df$vect2)
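The as.numeric() calls above are needed because cbind() first builds a matrix, and a matrix can hold only one type, so the numbers are coerced to character. As an alternative sketch, the base R function data.frame() builds the dataframe column by column and keeps each column's type. The values below are hypothetical.

```r
# cbind() on mixed types produces a character matrix
typeof(cbind(1:3, c("Aa", "Bb", "Cc")))
# returns "character"

# data.frame() keeps each column's own type
# (hypothetical values, assumed for illustration)
df_direct <- data.frame(
  vect1 = 1:3,
  vect2 = c(20, 25, 22),
  vect3 = c("Aa", "Bb", "Cc"),
  stringsAsFactors = FALSE  # keep strings as character in all R versions
)

class(df_direct$vect1)
# returns "integer"
class(df_direct$vect2)
# returns "numeric"
class(df_direct$vect3)
# returns "character"
dim(df_direct)
# returns 3 3
```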
Method 2: from vectors, by row
vect1 <- c(1:10)
vect2 <- sample(c(20:25), 10, replace=T)
vect3 <- str_c(LETTERS[1:10], letters[1:10])
df <- as.data.frame(rbind(vect1, vect2, vect3))
Method 3: from a named list, by column
#generate list of vectors
generate_vect <- function(x){ return( rnorm(100, mean = x, sd = 3) ) }
list_of_vectors <- lapply(c(3,7,20), FUN = generate_vect)
#provide names for the list
names(list_of_vectors) <- str_c('vect_', 1:3) # one name per list element
#create dataframe
df <- as.data.frame(list_of_vectors)
Slicing Dataframes:
Slicing a dataframe means extracting specific rows and columns. This can be done in a number of ways in base R:
Numerical row and column indices:
dataf <- read.table("data/walrus_sounds.tsv", header = F, sep = "\t")
dataf[3,2]
dataf[,2]
dataf[1,]
dataf[1:2,1:2]
Using column names and row names
#Read in dataframe, define column names and row names
dataf <- read.table("data/walrus_sounds.tsv", header = F, sep = "\t")
colnames(dataf) <- c('name', 'sound_type', 'time_min_sec')
row_names_vect <- str_c(expand.grid(letters, letters)$Var2, expand.grid(letters, letters)$Var1)
rownames(dataf) <- row_names_vect[1:dim(dataf)[1]]
#slice using column and row names
dataf[c('aa','ab'), c('name', 'time_min_sec')]
Code challenge:
Convert the dataframe ‘time_min_sec’ column to total seconds in a new vector called ‘sec’.
Now, add the new vector to our dataframe in a column named ‘total_sec’ as follows:
dataf$total_sec = sec
Let’s try that again, but round to the nearest minute this time. Problem: we can’t simply apply the round() function to the seconds, because round() rounds to decimal places, not to multiples of 60. What to do?
Slicing dataframes using logical vectors
Above, we have been selecting rows and columns using vectors containing the indices of the columns or rows or the names of the columns or rows.
Another and perhaps more important way to manipulate dataframes is to use logical vectors, whereby selection occurs only where TRUE appears in the vector.
Selecting rows using a test on a column vector
dataf_Jocko <- dataf[dataf$name == "Jocko",]
dataf_Jocko_chort <- dataf[dataf$name == "Jocko" & dataf$sound_type == 'chortle',]
mean(dataf$total_sec[dataf$name == "Jocko" & dataf$sound_type == 'chortle'])
mean(dataf$total_sec[dataf$name == "Jocko" & (dataf$sound_type == 'chortle' | dataf$sound_type == 'chortle')])
Question: What is the following code accomplishing in extracting elements of the dataframe?
mean(dataf$total_sec[dataf$name == "Jocko" & (dataf$sound_type == 'chortle' | dataf$sound_type == 'gong')])
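To experiment with logical slicing without the walrus file, a small stand-in dataframe can be used. Everything in this sketch is hypothetical: the dataframe `sounds` and its values are invented, and only the column names follow the text above.

```r
# Hypothetical stand-in data (assumed for illustration)
sounds <- data.frame(
  name = c("Jocko", "Jocko", "Oscar", "Jocko"),
  sound_type = c("chortle", "gong", "chortle", "whistle"),
  total_sec = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Each test returns one TRUE or FALSE per row
is_jocko <- sounds$name == "Jocko"
is_jocko
# returns TRUE TRUE FALSE TRUE

# %in% is a compact way to test against several allowed values
is_wanted <- sounds$sound_type %in% c("chortle", "gong")
is_wanted
# returns TRUE TRUE TRUE FALSE

# The logical vector goes before the comma to select rows
subset_rows <- sounds[is_jocko & is_wanted, ]
nrow(subset_rows)
# returns 2
mean(subset_rows$total_sec)
# returns 15
```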
Writing dataframes to files
It’s important to take analysis results from R and save them in a readable format. It is also important to use functions whenever possible in order to make your code modular.
Let’s look at how to create a function to analyze dataframes, where this function also writes the results to new files. In this example, we will generate separate dataframes for each walrus, and write the dataframes to separate files in a new directory named for each walrus. We will be writing a single function to do all of this, so it will take some time, but it’s a good example of a useful function.
The input to the function will be the walrus name, and the function will automatically do the slicing, the creation of the new column, and the sorting. It will then write the data to a file. We will be using the base R function write.table(), which is the companion to read.table().
separate_and_write = function(w_name, dff = dataf) {
dataf_out <- dff[dff$name == w_name,]
dataf_out$total_sec <- as.numeric(str_split_i(dataf_out$time_min_sec, ":",1))*60 + as.numeric(str_split_i(dataf_out$time_min_sec, ":",2))
dataf_out <- dataf_out[order(dataf_out$total_sec, decreasing = T),]
dir.create(str_c(w_name, "_output"), showWarnings = FALSE) # base R, works on any operating system
write.table(dataf_out, str_c(w_name, "_output/", w_name, '_table.txt' ),
quote = F, sep = '\t', col.names = T, row.names = F)
}
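To see the writing step on its own, here is a minimal round-trip sketch that writes a small dataframe with write.table() and reads it back with read.table(). The dataframe `toy` and its values are hypothetical, and the files go into R's temporary directory (via tempdir()) so that nothing is left in the working directory.

```r
# Hypothetical data (assumed for illustration)
toy <- data.frame(
  name = c("Jocko", "Jocko"),
  total_sec = c(75, 20),
  stringsAsFactors = FALSE
)

# Create an output directory inside the session's temporary directory
out_dir <- file.path(tempdir(), "Jocko_output")
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, "Jocko_table.txt")

# Write a tab-separated file with a header row and no row names
write.table(toy, out_file,
            quote = FALSE, sep = "\t", col.names = TRUE, row.names = FALSE)

# Read the file back; header = TRUE because column names were written
read_back <- read.table(out_file, header = TRUE, sep = "\t",
                        stringsAsFactors = FALSE)

file.exists(out_file)
# returns TRUE
dim(read_back)
# returns 2 2
```

Once the walrus data has been loaded, one possible way to run the function above for every walrus is `lapply(unique(dataf$name), separate_and_write)`.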
Other functions that write dataframes:
fwrite (from data.table package)