Coding proper

Re-Entry into R

At this point, we have come into contact with a few object types and a data structure in R (i.e., numeric, character, boolean, vector).

We’ve also learned how to interact with a vector by extracting elements, using base R functions, and using logical operators.

Until now, however, we’ve been interacting with these objects without fully appreciating what these types of objects represent in the R language. This was intentional, because an intuitive familiarity with objects and how they interact is good background for what comes next.

The bigger picture

Objects

  • numeric (integer, double)

  • character

  • logical

  • factor

  • complex

Data Structures

  • vector

  • list

  • dataframe

  • matrix

  • array

We will start our coding in this section again with the vector, and we will encounter most of the basic object types as individual elements of these vectors. We will then discuss lists, and move on to dataframes as quickly as possible, since it is the dataframe where the power of R is best leveraged.

There are four primary types of atomic vectors: logical, integer, double, and character.

Have a go at creating some of them.

Code:

num_1 <- as.integer(c(1:10))
num_2 <- as.double(c(1:10))
char_1 <- 'coding region'
boolean_1 <- c(T,F,F)

 

Warning: Recycling

We saw how boolean vectors can interact with numeric vectors, for example. Try indexing the above num_1 vector with boolean_1. Figure out what is going on when the boolean vector is shorter than the vector it is indexing.
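A minimal sketch of what happens here, assuming the same num_1 and boolean_1 values as above: R recycles the shorter logical vector until it matches the length of the vector being indexed. The names picked, recycled_index, and picked_explicit are only illustrative.

```r
num_1 <- as.integer(1:10)
boolean_1 <- c(TRUE, FALSE, FALSE)

# Indexing with a shorter logical vector: the index is recycled
picked <- num_1[boolean_1]

# Writing out the recycled index explicitly gives the same selection
recycled_index <- rep(boolean_1, length.out = length(num_1))
picked_explicit <- num_1[recycled_index]

picked            # 1 4 7 10
picked_explicit   # 1 4 7 10
```

Because recycling happens without any message, a logical index of the wrong length can select elements that were never intended.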

  

Coercion

Even though R vectors have a specific type, it’s quite easy to convert them to another type. This is called coercion. This flexibility works mostly to our advantage; for example, we generally don’t worry about whether a vector is an integer versus a double in R, and just consider it to be numeric. But unexpected coercion can cause confusing errors, so always consider this possibility when coding.

For explicit coercion, use the as.*() functions.

v_log <- c(TRUE,FALSE,FALSE,TRUE)
as.integer(v_log)
as.numeric(v_log)
as.character(v_log)
as.logical(as.integer(v_log))

 

Warning: implicit coercion

Coercion can also be triggered by other actions, such as assigning an element of a different type into an existing vector.

 

num_vect <- c(1:10)
num_vect <- c(num_vect, 'string')
num_vect
num_4 <- as.integer(c(1:10, 1.3))

 

Our numeric vector was silently coerced to character in one case, and in the other, the fractional part of the number was dropped (as.integer() truncates rather than rounds). Notice that R did this quietly, without warning. Always pay attention to the questions: Is this object of the type I think it is? How sure am I about that?
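One quick way to answer those questions is to call typeof() before and after an operation. The sketch below repeats the two cases above; mixed_vect is an illustrative name, and the last line is an extra check showing that as.integer() drops the fractional part.

```r
num_vect <- c(1:10)
typeof(num_vect)               # "integer"

mixed_vect <- c(num_vect, 'string')
typeof(mixed_vect)             # "character": every element became a string

num_4 <- as.integer(c(1:10, 1.3))
num_4[11]                      # 1: the fractional part is dropped
as.integer(c(1.9, -1.9))       # 1 -1: truncation toward zero, not rounding
```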

 

Some loose ends: NA, NaN, Inf, -Inf

NA is the object that R uses to indicate ‘not available’, or ‘missing’. It will incorporate itself into a vector and take the same data type as the elements around it. Therefore, it is necessary to know for certain if the data contains NAs by testing using the function is.na().

a_vect <- c(1,-3,5,NA,7, NA)
typeof(a_vect[4])
is.na(a_vect)
sum(is.na(a_vect))

NaN is the object produced when there is no possible mathematical result for an operation, for example, when dividing 0 by 0, or when taking the square root of a negative number.

a_vect/0
sqrt(a_vect)
is.nan(sqrt(a_vect))

We can see above how Inf and -Inf are produced when dividing by zero. Another common way that -Inf is produced is by taking the logarithm of zero.

log(rep(0, 5))
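The four special values can be told apart with base R test functions. A small sketch, using a made-up vector named special_vals:

```r
special_vals <- c(1, NA, 0/0, 1/0, -1/0)   # 0/0 is NaN; 1/0 is Inf; -1/0 is -Inf

is.na(special_vals)         # TRUE for both NA and NaN
is.nan(special_vals)        # TRUE only for NaN
is.finite(special_vals)     # TRUE only for ordinary numbers
is.infinite(special_vals)   # TRUE for Inf and -Inf
```

Note that is.na() also flags NaN, so is.nan() is the test to use when the two need to be distinguished.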

Lists

Lists are needed for holding elements that violate the constraints required of atomic vectors, i.e., if one or both of the following are true.

  • An individual element has length greater than 1.

  • The individual objects to be collected together are not of the same type.

 

Lists are a more general form of a vector. Whereas a vector must contain elements that are all the same object type, the elements of a list may be various types. Lists are also commonly created while providing names for each element.
Common ways to create and extend a list:

  1. Provide the element names together with the elements of the list

list1 <- list(element1 = 1, element2 = c(2,3,4), element3 = T)
  2. Create a list and then provide the names as a vector.

list1 <- list(1, c(2,3,4), T)
names(list1) = c("element1", "element2", "element3")
  3. Convert some other object to a list

a_vect <- c(11:20)
a_list <- as.list(a_vect)

 4. Add to a list

list1$element4 <- 9
list1[['element5']] <- 10
  5. R’s expressions for nested lists

list1 <- list(1, c(2,3,4), T)
names(list1) = c("element1", "element2", "element3")
list1$element4 <- list(1,2)
names(list1$element4) <- c('sub1', 'sub2')
list1[['element5']] <- list(list(3), list(4,5))
#note on spaces in list names
  6. Take note: What is going on with the following two commands? They produce different objects.

#Two methods of list generation
vect1 <- list(c(5:10))
vect2 <- as.list(c(5:10))
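One way to see the difference between these two commands is to compare the lengths and structures of the results. This is only a sketch of how to inspect them:

```r
vect1 <- list(c(5:10))      # one element: the whole vector
vect2 <- as.list(c(5:10))   # six elements: one per number

length(vect1)   # 1
length(vect2)   # 6
str(vect1)      # prints the structure of each list
str(vect2)
```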

 

Indexing a list:
With vectors, we discussed using brackets [] for indexing.
With lists, the elements can be indexed in three ways, all of which are important to understand.

  1. Single brackets (returns a list of the selected element or elements)

list1 <- list(name1 = 1, name2 = c(2,3,4), name3 = T, name4 = NA)
list1[2] #returns same type as 'parent' type
  2. Double brackets (returns the element indexed)

list1 <- list(element1 = 1, element2 = c(2,3,4), element3 = T)
list1[[2]] #returns the element itself
  3. Using $<element name> (also returns the element indexed)

list1$element2 #returns the element itself

 

Question: how can we tell whether the output is a list in each of the above three cases?
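One way to check, sketched with the same list1 as above: class() and is.list() report what each indexing style returns. The names by_single, by_double, and by_dollar are illustrative.

```r
list1 <- list(element1 = 1, element2 = c(2,3,4), element3 = TRUE)

by_single <- list1[2]          # single brackets
by_double <- list1[[2]]        # double brackets
by_dollar <- list1$element2    # $ with the element name

class(by_single)     # "list"
class(by_double)     # "numeric"
is.list(by_dollar)   # FALSE
```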

 

By the way, a vector can also have element names! However, note the format when R prints out the object. Also, unlike a list, a vector’s elements cannot be accessed with the $ operator; use brackets with the name instead, e.g., a_named_vect['a'].

a_named_vect <- c(5:10)
names(a_named_vect) <- letters[1:6]
a_named_vect
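A short sketch of what does and does not work when indexing a named vector, using the same a_named_vect:

```r
a_named_vect <- c(5:10)
names(a_named_vect) <- letters[1:6]

a_named_vect['c']             # bracket indexing by name works
a_named_vect[c('a', 'f')]     # several names at once
unname(a_named_vect['c'])     # drop the name to get the bare value
# a_named_vect$c              # error: $ is not valid for atomic vectors
```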

 

Some additional tools

stringr: Library containing many functions for string operations.

library(stringr)
str1 <- c("Hello", "hello.there", "hi", "ahoy", "nope")
str_detect(str1, 'he') | str_detect(str1, 'op') #tests for a pattern
str_split_i(str1, 'll', 1) #splits each element at a pattern, and selects the first piece
str_split_i(str1, 'll', 2) #splits each element at a pattern, and selects the second piece
str_c("Hello ", "there") # pastes together strings - like paste0()
str_count(str1, "He|[.]") #!

 

table(): Quickly tally instances of elements occurring in a vector

number_vect <- sample(1:10, 1000) # problem here
table(number_vect) # commonly used with hist() function
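The sample() call above fails as written, because by default sample() draws without replacement and cannot take 1000 draws from only 10 values. A minimal sketch of one fix, with an arbitrary seed added so the draw is repeatable:

```r
set.seed(1)   # arbitrary seed so the draw is repeatable

# replace = TRUE lets each value be drawn more than once
number_vect <- sample(1:10, 1000, replace = TRUE)

counts <- table(number_vect)
counts
sum(counts)   # 1000
```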

 

rnorm(): sample from a normal distribution

?rnorm
rnorm(5,10,1) #note default arguments

 

order(): Order a vector numerically or alphabetically. Note: order() returns the indices that would sort the vector, not the sorted values, so use those indices to index the vector.

set.seed(5) #note on seed
number_vect <- sample(1:10, 8)
ordering <- order(number_vect, decreasing = T)
number_vect[ordering]
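To make the distinction between indices and values concrete, here is a small sketch with a made-up three-element vector named scores:

```r
scores <- c(30, 10, 20)

order(scores)                              # 2 3 1: positions, smallest value first
scores[order(scores)]                      # 10 20 30
scores[order(scores, decreasing = TRUE)]   # 30 20 10
sort(scores)                               # 10 20 30: sort() returns the values directly
```

The indices from order() are especially useful later, when the rows of a dataframe are sorted by one of its columns.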

 

%in%: This strange-looking operator returns a boolean vector based on whether the elements of the first vector are found in the second.

vect_1 <- c(1:20)
vect2 <- c(7:11)
vect_1[vect_1 %in% vect2]
vect_1[!vect_1 %in% vect2]

 

For loops and conditionals: For loops are a convenient way to run the same operation on a series of objects. In the code below, we will run a mathematical expression on each of a series of numbers.

output = c()
input = c(2,4,6,8)
for (i in input) {
  output = c(output, (i/2)^4)
}
output

Let’s write a more complicated for loop using if, else (and ifelse).

for (i in 1:10) {
  if (i%%2 == 0) {
    print(paste0(i, ' is even'))
  } else {
    print(paste0(i, ' is odd'))
  }
}
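The same even/odd labeling can be sketched without a loop by using ifelse(), which tests every element of a vector at once. The names nums, parity, and result are illustrative.

```r
nums <- 1:10

# ifelse() tests every element at once and picks a label for each
parity <- ifelse(nums %% 2 == 0, 'is even', 'is odd')
result <- paste(nums, parity)
result
```

Unlike the loop, this version returns the labels as a character vector instead of printing them one by one.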

Functions: We will spend some time here writing functions, because it’s important to be able to write simple functions yourself. Note: positional before keyword, default values.

list_evens <- function(input){ }

More functions:

add_two_values <- function(x,y) {
  return(x+y)
}
sample_normal <- function(x,y) {
  return( rnorm(x, mean=y, sd = 3) )
} #rewrite
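One possible rewrite of sample_normal, sketched to illustrate positional arguments, keyword arguments, and default values. The argument names n, mean_value, and sd_value are my own choices, not part of the original function.

```r
# sd_value has a default, so callers may leave it out
sample_normal <- function(n, mean_value, sd_value = 3) {
  rnorm(n, mean = mean_value, sd = sd_value)
}

set.seed(42)                                            # arbitrary seed
a <- sample_normal(5, 10)                               # positional; sd_value defaults to 3
b <- sample_normal(5, mean_value = 10, sd_value = 1)    # positional first, then keyword
d <- sample_normal(mean_value = 10, n = 5)              # keywords may appear in any order
```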

 

What is the following function doing?

Alter the function to output percentages in the structure of a named list

quality_counts <- function(seq){
  highqual <- str_count(seq, "A|C|G|T")
  lowqual <- str_count(seq, "a|c|g|t")
  output <- c(highqual, lowqual)
  return(output)
}
inSeq <- ('ACGTacTGaaACACGTTGAGTacTGaaACGTacTGaa')
quality_counts(inSeq)

More complex:

Make a function that calls this function on each sequence from a list. Output a list named by the sequences. Use pseudocode.

1. Function input: string(character)
2. Initialize accumulator list
3. For each sequence:
     call highqual
     call lowqual
     put values into a vector
     add vector to accumulator list using sequence as name
4. Function output: accumulator list
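The pseudocode above could be turned into R along the following lines. This is one possible sketch, not the only solution: count_matches and quality_counts_all are hypothetical names, the counting is done with base R (gregexpr and regmatches) instead of str_count so the block does not depend on stringr, and the input sequences are made up.

```r
# Count how many times a pattern occurs in a string, using base R only
count_matches <- function(seq, pattern) {
  lengths(regmatches(seq, gregexpr(pattern, seq)))
}

quality_counts_all <- function(seqs) {
  out <- list()                                  # accumulator list
  for (s in seqs) {
    highqual <- count_matches(s, "[ACGT]")       # uppercase bases
    lowqual  <- count_matches(s, "[acgt]")       # lowercase bases
    out[[s]] <- c(highqual = highqual, lowqual = lowqual)   # named by the sequence
  }
  out
}

seq_list <- list('ACGTac', 'acgtTT', 'GGGG')     # made-up sequences
quality_counts_all(seq_list)
```

Naming the accumulator entries by sequence works here because each sequence is unique; repeated sequences would overwrite one another.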

 

lapply: lapply is a convenient way to generate or modify a list of objects. It is similar to writing a for loop, but it takes less code. There are other ‘apply’ functions in R (e.g., apply, sapply, mapply), but we will stick with lapply. It is best to understand one of these apply functions before trying to use them all.

generate_vect <- function(x){
  return( rnorm(100, mean = x, sd = 3) )
}
list_of_vectors <- lapply(c(3,7,20), FUN = generate_vect)

But if we have time:

sapply()

sapply() will try to simplify the returned object. It will return:

A vector if the function applied to each element of the input returns a single value of the same type for all elements.

A matrix if the function applied to each element returns a vector of the same length for all elements. Each column of the matrix will correspond to the output for one element of the input.

A list if the simplification described above is not possible (e.g., if the function returns values of different types or lengths for different elements).
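The three outcomes described above can be sketched with small anonymous functions. The names as_vector, as_matrix, and as_list are illustrative.

```r
as_vector <- sapply(1:3, function(x) x^2)          # one value each: a vector
as_matrix <- sapply(1:3, function(x) c(x, x^2))    # two values each: a 2 x 3 matrix
as_list   <- sapply(1:3, function(x) seq_len(x))   # lengths differ: a list

as_vector
dim(as_matrix)
class(as_list)
```

Because the return type depends on the results, it is worth checking the output of sapply() with class() or dim() before using it further.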

apply()

apply() is used to perform operations on each row (MARGIN = 1) or each column (MARGIN = 2) of a matrix or dataframe.

#sapply and apply
table_of_vectors <- sapply(c(3,7,20), FUN = generate_vect)
apply(table_of_vectors[1:3,], MARGIN = 1, function(x) {x[1] + x[2]})

 

Dataframes !

R is designed for handling data in the form of large tables, so this section is where we’ll see the full power of R for data analysis. The underlying structure of a dataframe is just a list of vectors, so we already understand much about how to interact with them. However, we are going to need a few tools in order to work effectively with dataframes.
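A small sketch of the claim that a dataframe is a list of vectors underneath. The columns id and label are hypothetical, and data.frame() is used here as the most direct base R constructor.

```r
# data.frame() builds the table directly from equal-length vectors
df <- data.frame(id = 1:3, label = c('a', 'b', 'c'),
                 stringsAsFactors = FALSE)   # keep label as character in older R versions

is.list(df)      # TRUE: a dataframe is a list underneath
length(df)       # 2: one list element per column
df$label         # list-style access returns the column vector
df[['id']]
```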

 

Creating a dataframe (skip, return later, if there is time.):

Method 1: from vectors, by column

#generate vectors
vect1 <- c(1:10)
vect2 <- sample(c(20:25), 10, replace=T)
vect3 <- str_c(LETTERS[1:10], letters[1:10]) #one two-letter label per row
#create dataframe
df <- as.data.frame(cbind(vect1, vect2, vect3)) #cbind() makes a character matrix here
df$vect1 <- as.numeric(df$vect1)
df$vect2 <- as.numeric(df$vect2)

 

Method 2: from vectors, by row

vect1 <- c(1:10)
vect2 <- sample(c(20:25), 10, replace=T)
vect3 <- str_c(LETTERS[1:10], letters[1:10])
df <- as.data.frame(rbind(vect1, vect2, vect3))

 

Method 3: from a named list, by column

#generate list of vectors
generate_vect <- function(x){
  return( rnorm(100, mean = x, sd = 3) )
}
list_of_vectors <- lapply(c(3,7,20), FUN = generate_vect)
#provide names for the list (one name per element, so three names)
names(list_of_vectors) <- str_c('vect_', 1:3)
#create dataframe
df <- as.data.frame(list_of_vectors)

 

Slicing Dataframes:

Slicing a dataframe means extracting specific rows and columns. This can be done in a number of ways in base R:

  1. Using list syntax (only for columns)

dataf <- read.table("walrus_sounds.tsv", header = F, sep = "\t")
colnames(dataf) <- c('name', 'vocalization', 'time_min_sec')
dataf$name
dataf[['vocalization']]
  2. Using numerical row and column indices:

dataf <- read.table("walrus_sounds.tsv", header = F, sep = "\t")
colnames(dataf) <- c('name', 'vocalization', 'time_min_sec')
dataf[3,2]
dataf[,2]
dataf[1,]
dataf[1:2,1:2]
  3. Using column names and row names

dataf <- read.table("walrus_sounds_modified.tsv", header = F, sep = "\t", fill=T, na.strings = "")
colnames(dataf) <- c('name', 'vocalization', 'time_min_sec')
dim(dataf)
row_names_vect <- paste0(rep('row', 200), 1:200)
rownames(dataf) <- row_names_vect
#slice using column and row names
dataf[c('row3','row8'), c('name', 'time_min_sec')]

 

Code challenge:

Convert the dataframe ‘time_min_sec’ column to total seconds, and create a new column called ‘seconds’.

 

Let’s try that again, but round to the nearest minute this time. Problem: we can’t use the round() function directly on minutes and seconds (base 60), because round() only understands decimal numbers. What to do?
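One possible approach to both challenges, sketched with base R only. The small dataframe below is a hypothetical stand-in for the walrus table (only the 'min:sec' format of time_min_sec is taken from the text), and nearest_min is an illustrative column name.

```r
# Hypothetical stand-in for the walrus table; the real file is not shown here
dataf <- data.frame(name = c('Jocko', 'Jocko', 'Oscar'),
                    time_min_sec = c('1:29', '0:45', '2:40'),
                    stringsAsFactors = FALSE)

parts <- strsplit(dataf$time_min_sec, ':', fixed = TRUE)   # list of c(min, sec)
mins  <- as.numeric(sapply(parts, `[`, 1))
secs  <- as.numeric(sapply(parts, `[`, 2))

dataf$seconds <- mins * 60 + secs                # total seconds
dataf$nearest_min <- round(dataf$seconds / 60)   # decimal minutes first, then round
dataf
```

The key step for the rounding problem is converting to a single decimal quantity (seconds divided by 60) before calling round().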

 

Slicing dataframes using logical vectors

Above, we have been selecting rows and columns using vectors containing the indices of the columns or rows or the names of the columns or rows.

Another and perhaps more important way to manipulate dataframes is to use logical vectors, whereby selection occurs only where TRUE appears in the vector.

Selecting rows using a test on a column vector

dataf_Jocko <- dataf[,dataf$name == "Jocko"] #wrong: the logical vector is in the column position; it belongs before the comma
dataf_Jocko_chort <- dataf[dataf$name == "Jocko" & dataf$vocalization == 'chortle',]
mean(dataf$total_sec[dataf$name == "Jocko" & dataf$vocalization == 'chortle'])
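It can help to build the logical vector as its own object and look at it before using it to slice. A sketch with a hypothetical stand-in dataframe (the column names follow the walrus examples; the values are made up):

```r
# Hypothetical stand-in data; column names follow the walrus examples
dataf <- data.frame(name = c('Jocko', 'Jocko', 'Oscar', 'Jocko'),
                    vocalization = c('chortle', 'gong', 'chortle', 'chortle'),
                    total_sec = c(10, 20, 30, 50),
                    stringsAsFactors = FALSE)

keep <- dataf$name == 'Jocko' & dataf$vocalization == 'chortle'
keep                          # one TRUE/FALSE per row

dataf[keep, ]                 # logical vector in the row position, before the comma
mean(dataf$total_sec[keep])   # mean over the selected rows only
```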

 

Question: What is the following code accomplishing in extracting elements of the dataframe?

mean(dataf$total_sec[dataf$name == "Jocko" & (dataf$vocalization == 'chortle' | dataf$vocalization == 'gong')])

 

 

Writing dataframes to files

It’s important to take analysis results from R and save them in a readable format. It is also important to use functions whenever possible in order to make your code modular.

Let’s look at how to create a function to analyze dataframes, where this function also writes the results to new files. In this example, we will generate separate dataframes for each walrus, and write the dataframes to separate files in a new directory named for each walrus.

The input to the function will be the walrus name, and the function will automatically do the slicing, creation of the new column, and sorting as before. We will then add a step that writes the data to a file. We will be using the base R function write.table(), which is the companion to read.table().

 

library(stringr) #for str_split_i()
dataf <- read.table("walrus_sounds.tsv", header = F, sep = "\t")
colnames(dataf) <- c('name', 'vocalization', 'time_min_sec')
separate_and_write = function(w_name, dff = dataf) {
  #keep only the rows for this walrus
  dataf_out <- dff[dff$name == w_name,]
  #convert min:sec to total seconds
  mins <- as.numeric(str_split_i(dataf_out$time_min_sec, ":",1))*60
  secs <- as.numeric(str_split_i(dataf_out$time_min_sec, ":",2))
  dataf_out$total_sec <- mins + secs
  #sort from longest to shortest
  dataf_out <- dataf_out[order(dataf_out$total_sec, decreasing = T),]
  #create the output directory, then write the table into it
  dir.create(paste0(w_name, "_output"))
  write.table(dataf_out, paste0(w_name, "_output/", w_name, '_table.txt' ), quote = F, sep = '\t', col.names = T, row.names = F)
}
write.table(dataf, 'altered_df.tsv', row.names = F, col.names = T, quote=F, sep = '\t')
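The same one-directory-per-walrus idea can be sketched as a loop over every name in the table. This version is self-contained: the dataframe is a hypothetical stand-in, and the output goes under tempdir() (an assumption made here so the sketch does not write into the working directory).

```r
# Hypothetical stand-in data; the real walrus file is not reproduced here
dataf <- data.frame(name = c('Jocko', 'Oscar', 'Jocko'),
                    vocalization = c('chortle', 'gong', 'gong'),
                    time_min_sec = c('1:29', '0:45', '2:40'),
                    stringsAsFactors = FALSE)

out_root <- tempdir()   # write under a temporary directory, not the working directory

for (w_name in unique(dataf$name)) {
  out_dir <- file.path(out_root, paste0(w_name, '_output'))
  dir.create(out_dir, showWarnings = FALSE)   # no warning if it already exists
  write.table(dataf[dataf$name == w_name, ],
              file.path(out_dir, paste0(w_name, '_table.txt')),
              quote = FALSE, sep = '\t', col.names = TRUE, row.names = FALSE)
}

# Read one file back to confirm the round trip
jocko <- read.table(file.path(out_root, 'Jocko_output', 'Jocko_table.txt'),
                    header = TRUE, sep = '\t', stringsAsFactors = FALSE)
jocko
```

Using file.path() instead of pasting "/" by hand keeps the paths portable across operating systems.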

 

Other functions that read/write dataframes:

fwrite (from data.table package)

saveRDS()
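As a brief sketch of saveRDS() and its companion readRDS(): unlike a text table, an .rds file stores the R object itself, so column types survive the round trip. The dataframe and the temporary file path below are made up for illustration.

```r
df <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))   # hypothetical data

rds_path <- tempfile(fileext = '.rds')   # temporary file for the demonstration
saveRDS(df, rds_path)                    # stores the R object itself, types included
df_back <- readRDS(rds_path)

identical(df, df_back)
```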

Be aware of built-in R functions that will save you time:

rowSums(), colSums()

vect <- rnorm(100)
dataframe <- as.data.frame(cbind(vect, vect, vect))
rowSums(dataframe)

Into the Tidyverse (just a bit)