Introduction to R

In-class worksheet #1, solutions

November 3rd, 2021

Computational analyses require methods and notes to be recorded the same way you would record them for wet lab experiments. An excellent way to do this is with R Markdown documents, which combine text, R code, R code output, and figures. They are a great way to produce self-contained and documented statistical analyses.

In this first worksheet, you will learn some basic Markdown editing in addition to the basic use of variables and functions in R. After you have made a change to the document, press “Knit HTML” in RStudio and see what kind of result you get. Note: You may have to disable pop-ups to get this to work; moreover, if the document contains any erroneous code (e.g., typos), you will get an error that prevents the knit. You can get past this by debugging the code or commenting it out with a #.

1. Basic Markdown

Below I have demonstrated some basic R Markdown features, as described here. In your own work, you can use Markdown syntax to organize your coding notebook.

This text is bold.

This text is in italics.

This is a numbered list:

1. Item 1
2. Item 2
3. Item 3

A bulleted list:

- Item 1
- Item 2
- Item 3

A nested list:

- Item 1
    - Item 1.1. Note that 4 spaces are required for the nesting to work properly.
    - Item 1.2
- Item 2

Block quote:

“Science is magic that works.” — Kurt Vonnegut

2. Embedding R code

R code embedded in R chunks will be executed and the output will be shown.

# R code is embedded into this chunk
# when we start a line with '#', that tells R not to interpret the line as code
# this is called commenting code, and documents its purpose

# the code below assigns integers to the variables 'x' and 'y'
x <- 7  
y <- 1029

# we can perform operations with variables
z <- x * y
z

## [1] 7203

# this is a string assigned to the variable 'my_name'
my_name <- "Rachael"

# we can create vectors of integers and strings
nums <- c(4, 8, 3, 6, 9)
fruits <- c("strawberries", "bananas", "apples", "peaches", "mangos")

# and combine them into a table using the data.frame() function
grocery_list <- data.frame(fruits, nums)
grocery_list

##         fruits nums
## 1 strawberries    4
## 2      bananas    8
## 3       apples    3
## 4      peaches    6
## 5       mangos    9

# there are a number of ways to extract specific information from a table
# for instance, selecting the first column:
grocery_list[1]

##         fruits
## 1 strawberries
## 2      bananas
## 3       apples
## 4      peaches
## 5       mangos

grocery_list['fruits']

##         fruits
## 1 strawberries
## 2      bananas
## 3       apples
## 4      peaches
## 5       mangos

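# select() comes from the dplyr package, so this line assumes that
# library(tidyverse) (or library(dplyr)) has already been run
select(grocery_list, fruits)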

##         fruits
## 1 strawberries
## 2      bananas
## 3       apples
## 4      peaches
## 5       mangos

# the following code also targets the first column, but extracts the information in a different way
# can you spot the difference?
grocery_list$fruits

## [1] strawberries bananas      apples       peaches      mangos      
## Levels: apples bananas mangos peaches strawberries
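
To make the difference concrete: single brackets (and select()) return a data frame with one column, while `$` and double brackets return the column's contents as a vector. The sketch below rebuilds the grocery list and checks this with class(); the names as_table and as_vector are made up for illustration. The "Levels:" line in the output above means the column was stored as a factor. Since R 4.0, data.frame() keeps strings as character vectors by default, so stringsAsFactors = TRUE is set here only as an assumption about how that output was produced.

```r
fruits <- c("strawberries", "bananas", "apples", "peaches", "mangos")
nums <- c(4, 8, 3, 6, 9)

# assumption: strings are converted to factors, as in the output above
grocery_list <- data.frame(fruits, nums, stringsAsFactors = TRUE)

# single brackets keep the table structure
as_table <- grocery_list["fruits"]
class(as_table)   # "data.frame"

# the dollar sign pulls out the column itself
as_vector <- grocery_list$fruits
class(as_vector)  # "factor"

# double brackets behave like the dollar sign
identical(as_vector, grocery_list[["fruits"]])  # TRUE
```

Which form you want depends on the next step: functions such as mean() or median() expect a vector, so they are usually paired with `$`.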

Problem Set #1:

1. Assign integers to variables (demonstrated in the above code block).
2. Assign some strings to variables.
3. Make a vector of strings containing your top 5 favorite foods.
4. Make a vector containing 5 numbers.
5. Combine the two vectors you created in the previous steps into one data frame.
6. Call the first column of the data frame that you created.

# assigning integers to variables
fav_num <- 6
second_fav_num <- 13
some_new_num <- second_fav_num / fav_num

# assigning strings to variables
fav_enzyme <- "cyclooxygenase"

# creating a vector of strings
fav_foods <- c("sashimi", "dumplings", "jambalaya", "tacos", "pizza")
fav_foods

## [1] "sashimi"   "dumplings" "jambalaya" "tacos"     "pizza"

# creating a vector of integers
random_nums <- c(6, 13, 21, 51, 63)
random_nums

## [1]  6 13 21 51 63

# combining vectors into a dataframe
new_df <- data.frame(fav_foods, random_nums)
new_df

##   fav_foods random_nums
## 1   sashimi           6
## 2 dumplings          13
## 3 jambalaya          21
## 4     tacos          51
## 5     pizza          63

# calling a column in a dataframe
new_df$fav_foods

## [1] sashimi   dumplings jambalaya tacos     pizza    
## Levels: dumplings jambalaya pizza sashimi tacos

3. Built-in functions and data sets

A function is a statement that is coded internally (i.e., “under the hood”) to perform a specific task. For instance, the head() function displays the first several rows of a data frame or the first several values in a vector.

R comes with many built-in functions and data sets. Type data() in the console to look at a list of all available data sets. Type ?iris in the console for more information about this specific data set. Important: You can ask for help with any built-in data set or function by typing ?<function> in the console; for example, ?head or ?summary.

You can take a glance at the iris data set using the head() function. Run the code chunk below to test this.

# preview the first few rows of a data frame
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
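
By default head() returns six items, but it accepts a second argument that sets how many to return, and it works on vectors as well as data frames. A minimal sketch using only built-in objects (iris and letters); the variable names first_three and last_two are just for illustration:

```r
# ask for the first 3 rows instead of the default 6
first_three <- head(iris, 3)
nrow(first_three)  # 3

# head() also works on a vector; 'letters' is a built-in vector of a to z
head(letters, 4)   # "a" "b" "c" "d"

# tail() is the counterpart that returns the last rows
last_two <- tail(iris, 2)
nrow(last_two)     # 2
```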

You can also use the summary() function to see the summary statistics of a data set at a glance.

# look at summary statistics for the iris data set
summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

You can see the column names of iris from the code output above. We can perform calculations on this data set using a number of functions built into R. See the example below, which calculates the median of the Sepal.Length column.

# calculate median sepal length across all species of iris
median(iris$Sepal.Length)

## [1] 5.8
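
Other built-in functions follow the same pattern of taking a column as input. As a rough sketch, min() and max() reproduce two of the values reported by summary() above, and tapply() (a base R function not otherwise used in this worksheet) applies median() separately to each species:

```r
sepal_length <- iris$Sepal.Length

# these match the Min. and Max. rows of the summary() output
min(sepal_length)  # 4.3
max(sepal_length)  # 7.9

# median sepal length within each species
median_by_species <- tapply(iris$Sepal.Length, iris$Species, median)
median_by_species
```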

Problem Set #2:

1. Calculate the mean of the Petal.Length column using the mean() function.
2. Calculate the range of the Petal.Width column using the range() function.

# calculate the mean of the `Petal.Length` column in the `iris` dataset
mean(iris$Petal.Length)

## [1] 3.758

# calculate the range of the `Petal.Width` column in the `iris` dataset
range(iris$Petal.Width)

## [1] 0.1 2.5
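
Note that range() returns two values, the minimum and the maximum, rather than the single difference between them. If the spread is what you need, one option is to pass the result to diff(); the variable names below are only illustrative:

```r
width_range <- range(iris$Petal.Width)

width_range[1]  # the minimum, 0.1
width_range[2]  # the maximum, 2.5

# difference between the maximum and the minimum (about 2.4)
width_spread <- diff(width_range)
width_spread
```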

4. Reading and writing files

There are several ways to load data into your R environment. We covered one way in Part 2 of the worksheet: manual entry. However, this is clearly not feasible for big data sets; more often, we want to read in a file containing our data. We also tend to modify data frames and save them to a new file.

Problem Set #3:

1. Download the test data set mushrooms_small.csv from the “Test data set” link on the class webpage.
2. Upload it to the RStudio server. Use the “Upload” button in the panel on the right.
3. Use the read_csv() function to read the file, and save it as a data frame called mushrooms. Important: The file name must be given to the function as a string.
4. Use the head() function to preview the first 10 rows of the new data frame. Specify the integer as the second argument of the function.
5. Save the output of the head() function as a new data frame called mushrooms_tiny.
6. Use the write_csv() function to write the data frame mushrooms_tiny to a new .csv file. Important: The file name must be given to the function as a string.

Note: If you are coding on a local installation of R, you will have to specify a path to the location of the file or move the file to the working directory. Local installations of R do not have an “Upload” function. These concepts are covered at the end of this section.

# read a file
# (read_csv() comes from the readr package, which is loaded with the tidyverse)
mushrooms <- read_csv("mushrooms_small.csv")

## Rows: 100 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): class, cap_shape, cap_surface, cap_color, odor, gill_spacing, gill...
## dbl  (1): ring_number
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# look at the first 10 rows of that dataset
head(mushrooms, 10)

## # A tibble: 10 × 17
##    class   cap_s…¹ cap_s…² cap_c…³ odor  gill_…⁴ gill_…⁵ gill_…⁶ stalk…⁷ stalk…⁸
##    <chr>   <chr>   <chr>   <chr>   <chr> <chr>   <chr>   <chr>   <chr>   <chr>  
##  1 poison… convex  scaly   red     spicy close   narrow  buff    taperi… missing
##  2 edible  convex  smooth  red     none  close   broad   white   enlarg… missing
##  3 edible  convex  smooth  gray    none  crowded broad   chocol… taperi… equal  
##  4 edible  flat    scaly   brown   almo… close   broad   pink    enlarg… rooted 
##  5 edible  flat    fibrous brown   none  crowded broad   pink    taperi… equal  
##  6 poison… convex  fibrous yellow  foul  close   broad   chocol… enlarg… bulbous
##  7 poison… convex  smooth  brown   spicy close   narrow  buff    taperi… missing
##  8 edible  bell    scaly   white   almo… close   broad   brown   enlarg… club   
##  9 edible  knobbed smooth  brown   none  close   broad   orange  enlarg… missing
## 10 edible  bell    smooth  white   anise close   broad   black   enlarg… club   
## # … with 7 more variables: veil_type <chr>, veil_color <chr>,
## #   ring_number <dbl>, ring_type <chr>, spore_print_color <chr>,
## #   population <chr>, habitat <chr>, and abbreviated variable names ¹cap_shape,
## #   ²cap_surface, ³cap_color, ⁴gill_spacing, ⁵gill_size, ⁶gill_color,
## #   ⁷stalk_shape, ⁸stalk_root

# save the first 10 rows to a new dataframe
mushrooms_tiny <- head(mushrooms, 10)

# write to file
write_csv(mushrooms_tiny, "mushrooms_tiny.csv")
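
If you want to try the write-then-read cycle without the mushrooms file, the sketch below does the same round trip on a small made-up data frame. It swaps in the base R functions write.csv() and read.csv() so that it runs without any packages, and it writes to a temporary file so nothing is left in your working directory. The data frame toy and its values are hypothetical.

```r
# a small stand-in data frame (made-up values, not the mushrooms data)
toy <- data.frame(
  class = c("edible", "poisonous", "edible"),
  ring_number = c(1, 1, 2)
)

# the file name is given as a string; tempfile() creates a throwaway path
toy_path <- tempfile(fileext = ".csv")

# row.names = FALSE stops base R from writing an extra column of row numbers
write.csv(toy, toy_path, row.names = FALSE)

# read the file back in and confirm it has the same shape and column names
toy_again <- read.csv(toy_path)
dim(toy_again)    # 3 rows, 2 columns
names(toy_again)  # "class" "ring_number"
```

The readr functions used in this worksheet take their arguments in the same order: the file name for reading, and the data frame followed by the file name for writing.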

5. Locating files

For this class, we are using a computer server where each student has a preset working directory associated with their unique student ID number. Type getwd() to see the file path to your working directory. On a local installation, the output of this function might look something like C:/Users/Rachael/Documents.

# output the file path associated with the current working directory
getwd()

## [1] "/stor/home/student20"

This is the directory R will default to for reading and writing files. Ideally, for real-life projects, we keep all the information we need organized into folders (also known as sub-directories). More often than not, we have to tell R which sub-directory we want to read a file from or write a file to. Perform the following steps to familiarize yourself with file paths and with where R looks for files:

1. Use the “New Folder” option in the window on the bottom right to create a new folder called “new_data”.
2. Select mushrooms_tiny.csv by checking the box.
3. Go to “More” > “Move…” and select the new “new_data” folder.
4. Run list.files() to see all the files in the current working directory.
5. Run list.files("new_data") to see the files in the new sub-directory.
6. Run #5 again, but this time specify that full.names = TRUE as the second argument in the function.

# list files in current working directory
list.files()

##  [1] "clean_salary_data"         "mushrooms_small.csv"      
##  [3] "mushrooms_tiny.csv"        "new_data"                 
##  [5] "R"                         "worksheet1_blank.Rmd"     
##  [7] "worksheet1_reference.html" "worksheet1_reference.Rmd" 
##  [9] "worksheet1_solutions.Rmd"  "worksheet1.Rmd"           
## [11] "worksheet2_blank.html"     "worksheet2_blank.Rmd"     
## [13] "worksheet2_reference.html" "worksheet2_reference.Rmd" 
## [15] "worksheet2_solutions.html" "worksheet2_solutions.Rmd" 
## [17] "worksheet2.Rmd"

# list files in the sub-directory called "new_data"
list.files("new_data")

## [1] "mushrooms_tiny.csv"

# list the full path to the files in "new_data"
# (this becomes very useful for reading many sub-directory files at once)
list.files("new_data", full.names = TRUE)

## [1] "new_data/mushrooms_tiny.csv"
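
As a rough illustration of why full paths help when reading many files at once, the sketch below creates a throwaway folder containing two small CSV files, lists their full paths, and reads them all in one step. Everything here is hypothetical (the folder name new_data_demo, the file names, and the data), and base R's read.csv() stands in for read_csv() so the example runs without any packages.

```r
# set up a throwaway folder with two small CSV files (made-up data)
demo_dir <- file.path(tempdir(), "new_data_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2), file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:5), file.path(demo_dir, "b.csv"), row.names = FALSE)

# full.names = TRUE returns paths that can be handed straight to a reader
csv_paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# read every file, producing a list with one data frame per file
all_tables <- lapply(csv_paths, read.csv)
names(all_tables) <- basename(csv_paths)

# stack the data frames into one table
combined <- do.call(rbind, all_tables)
nrow(combined)  # 5
```

Without full.names = TRUE, list.files() would return only "a.csv" and "b.csv", and the reader would look for them in the working directory instead of the sub-directory.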

Clear your global environment (the broom symbol in the top right window). Read the file in the sub-directory “new_data” using read_csv(). The function needs the path that includes the sub-directory, as given by the output from the code chunk above.

# read in data from a sub-directory
read_csv("new_data/mushrooms_tiny.csv")

## Rows: 10 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): class, cap_shape, cap_surface, cap_color, odor, gill_spacing, gill...
## dbl  (1): ring_number
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 10 × 17
##    class   cap_s…¹ cap_s…² cap_c…³ odor  gill_…⁴ gill_…⁵ gill_…⁶ stalk…⁷ stalk…⁸
##    <chr>   <chr>   <chr>   <chr>   <chr> <chr>   <chr>   <chr>   <chr>   <chr>  
##  1 poison… convex  scaly   red     spicy close   narrow  buff    taperi… missing
##  2 edible  convex  smooth  red     none  close   broad   white   enlarg… missing
##  3 edible  convex  smooth  gray    none  crowded broad   chocol… taperi… equal  
##  4 edible  flat    scaly   brown   almo… close   broad   pink    enlarg… rooted 
##  5 edible  flat    fibrous brown   none  crowded broad   pink    taperi… equal  
##  6 poison… convex  fibrous yellow  foul  close   broad   chocol… enlarg… bulbous
##  7 poison… convex  smooth  brown   spicy close   narrow  buff    taperi… missing
##  8 edible  bell    scaly   white   almo… close   broad   brown   enlarg… club   
##  9 edible  knobbed smooth  brown   none  close   broad   orange  enlarg… missing
## 10 edible  bell    smooth  white   anise close   broad   black   enlarg… club   
## # … with 7 more variables: veil_type <chr>, veil_color <chr>,
## #   ring_number <dbl>, ring_type <chr>, spore_print_color <chr>,
## #   population <chr>, habitat <chr>, and abbreviated variable names ¹cap_shape,
## #   ²cap_surface, ³cap_color, ⁴gill_spacing, ⁵gill_size, ⁶gill_color,
## #   ⁷stalk_shape, ⁸stalk_root
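
A path like the one used above can also be assembled with file.path(), and file.exists() can confirm that R sees the file before you try to read it. This is only a sketch: it uses base R's read.csv() in place of read_csv() so that it runs without packages, and the variable name csv_path is made up for illustration.

```r
# build the path from its parts; file.path() joins them with "/"
csv_path <- file.path("new_data", "mushrooms_tiny.csv")
csv_path  # "new_data/mushrooms_tiny.csv"

# only read the file if it exists relative to the current working directory
if (file.exists(csv_path)) {
  mushrooms_tiny <- read.csv(csv_path)
} else {
  message("No file found at ", csv_path, " (relative to ", getwd(), ")")
}
```

Assigning the result to a variable, as done here, keeps the data in your environment instead of only printing it.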