worksheet2_reference.knit

Introduction to R

In-class worksheet #2

November 9th, 2022

There are many external packages that computational scientists rely on for data analysis (e.g., DESeq2 for RNA-seq processing), but perhaps one of the most important is the tidyverse library. The tidyverse library is a collection of R packages made for streamlining interactive data analysis. Most of the core functions are designed to interact with tabular data in a “tidy” format, where there is 1 column per variable and and 1 row per observation (see slides for examples). Though there are many packages within the tidyverse, we’re going to focus on two of the most commonly used: dplyr for data manipulation, and ggplot for data visualization.

1. Data extraction (`select()`, `filter()`)

Given a large, complicated, and/or messy data set, we may only be interested in examining a few variables or conditions. We will use the biotech data set to demonstrate how to subset data.

Run the code chunk below to load and take a glance at the data.

biotech <- read_csv("https://rachaelcox.github.io/classes/datasets/salary_data.csv")

## Rows: 450 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): timestamp, location, biotech_sub_industry, company_name, company_d...
## dbl  (4): years_of_experience, annual_base_salary, sign_on_bonus, sign_on_re...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(biotech)

## # A tibble: 6 × 14
##   timest…¹ locat…² biote…³ compa…⁴ compa…⁵ appro…⁶ title highe…⁷ years…⁸ annua…⁹
##   <chr>    <chr>   <chr>   <chr>   <chr>   <chr>   <chr> <chr>     <dbl>   <dbl>
## 1 5/8/202… Califo… Consul… <NA>    Private 50-200  Seni… Master…       6  160000
## 2 5/8/202… San Di… Pharma… <NA>    Public  50-200  Rese… Bachel…       0   55000
## 3 5/8/202… Midwes… Academ… <NA>    Academ… 5000+   Rese… Bachel…       2   28000
## 4 5/8/202… NYC Me… Pharma… Wuxi B… Public  1000-5… Manu… Bachel…       6   72800
## 5 5/8/202… Boston… Pharma… Abbvie  Public  5000+   Seni… Ph.D o…       0  120000
## 6 5/8/202… Canada  Pharma… <NA>    Start-… 1-50    Soft… Associ…      20  140000
## # … with 4 more variables: annual_target_bonus <chr>, annual_equity <chr>,
## #   sign_on_bonus <dbl>, sign_on_relocation_assistance <dbl>, and abbreviated
## #   variable names ¹timestamp, ²location, ³biotech_sub_industry, ⁴company_name,
## #   ⁵company_details, ⁶approximate_company_size,
## #   ⁷highest_achieved_formal_education, ⁸years_of_experience,
## #   ⁹annual_base_salary

The biotech data set was pulled from a survey distributed on the Reddit forum r/biotech in 2022, where 450 participants answered questions regarding their role, compensation, level of education and location as it relates to their employment in the biotechnology industry. Let’s say we are only interested in salary, education level, and location of each individual. To see all of the column names present in the data frame, we can use the names() function:

names(biotech)

##  [1] "timestamp"                         "location"                         
##  [3] "biotech_sub_industry"              "company_name"                     
##  [5] "company_details"                   "approximate_company_size"         
##  [7] "title"                             "highest_achieved_formal_education"
##  [9] "years_of_experience"               "annual_base_salary"               
## [11] "annual_target_bonus"               "annual_equity"                    
## [13] "sign_on_bonus"                     "sign_on_relocation_assistance"

The columns that correspond to salary, education level, and location are annual_base_salary, highest_achieved_formal_education, and location, respectively. We can extract these columns with the select() function, and then save the result to a new data frame called biotech_subset:

# extract the "location", "annual_base_salary" and "highest_achieved_formal_education" columns
biotech_subset <- biotech %>%
  select(location, annual_base_salary,
         highest_achieved_formal_education)

Then, let’s say, we’re interested in looking at individuals with jobs in the Bay Area. To obtain this information, we need to use filter() to extract the rows where location is equal to “Bay Area”, or location == "Bay Area". Notice that, in this example, I do not assign the result to a new data frame, and thus the result of the function is not “saved” (i.e., only displayed on the console or inline output).

# filter subset biotech data for individuals working in the Bay Area
biotech_subset %>%
  filter(location == "Bay Area")

## # A tibble: 88 × 3
##    location annual_base_salary highest_achieved_formal_education
##    <chr>                 <dbl> <chr>                            
##  1 Bay Area              82000 Bachelor's Degree or Equivalent  
##  2 Bay Area             131200 Bachelor's Degree or Equivalent  
##  3 Bay Area             155000 Ph.D or Equivalent               
##  4 Bay Area             220000 Ph.D or Equivalent               
##  5 Bay Area              95000 Master's Degree or Equivalent    
##  6 Bay Area              75000 Bachelor's Degree or Equivalent  
##  7 Bay Area             100000 Bachelor's Degree or Equivalent  
##  8 Bay Area             170000 Ph.D or Equivalent               
##  9 Bay Area             148000 Ph.D or Equivalent               
## 10 Bay Area             135000 Ph.D or Equivalent               
## # … with 78 more rows

One of the best features of the tidyverse is that it encourages piping, which is a process that executes multiple functions at once on a data set. This feature can save us many, many lines of codes while still maintaining excellent readability. Referring to the above example, we can chain both steps together using %>%, also known as the tidyverse pipe:

biotech_filtered <- biotech %>%
  select(location, annual_base_salary,
         highest_achieved_formal_education) %>%
  filter(location == "Bay Area")

Problem Set #1:

Take the biotech data set and use select() to extract the columns location, highest_achieved_formal_education, and annual_base_salary, biotech_sub_industry, and approximate_company_size. Assign it to a new data frame called my_biotech.
Filter the my_biotech dataframe such that you only keep rows where annual_base_salary is more than $75,000.

# your R code here

Challenge Problem: If that was easy… How many individuals have a salary greater than or equal to $105,000 and work in the “subindustries” of “Academia/Research”?

# your R code here

Eight r/biotech Reddit users work in academia and have a salary >= $105,000. They live in the Bay Area (3), Boston Metro Area (3), San Diego (1) and Seattle (1).

2. Data aggregation (`group_by()` and `summarize()`)

Most of the time, our goal is to learn something about the relationships between the variables in our data set. For instance, the biotech data set samples a diversity of Reddit users working in the biotechnology industry; how much does the average salary vary across level of education? With group_by and summarize, we can do a quick investigation.

# calculate the mean and median salary
# for each level of education
biotech %>%
  group_by(highest_achieved_formal_education) %>%
  summarize(median_salary = median(annual_base_salary),
            mean_salary = mean(annual_base_salary))

## # A tibble: 6 × 3
##   highest_achieved_formal_education median_salary mean_salary
##   <chr>                                     <dbl>       <dbl>
## 1 Associate’s Degree or Equivalent         55000       61817.
## 2 Bachelor's Degree or Equivalent          80293       87063.
## 3 High School or Equivalent                59249.      52124.
## 4 M.D./PharmD/D.D.S. or Equivalent         90000       89333.
## 5 Master's Degree or Equivalent            90000       95648.
## 6 Ph.D or Equivalent                      135000      138826.

Note the way I’ve structured the above code. At the beginning, the data set we’re manipulating is on the first line, and each subsequent function the data passes through gets a new line/indentation. Though not functionally necessary, styling your R/tidyverse code this way greatly improves readability. Readability of your code should ALWAYS be high priority.

We can also probe potential biases in our data set–for example, the sample size for each of the levels of education we probed above.

# add a calculation for the # of individuals per level of education
biotech %>%
  group_by(highest_achieved_formal_education) %>%
  summarize(median_salary = median(annual_base_salary),
            mean_salary = mean(annual_base_salary),
            count = n())

## # A tibble: 6 × 4
##   highest_achieved_formal_education median_salary mean_salary count
##   <chr>                                     <dbl>       <dbl> <int>
## 1 Associate’s Degree or Equivalent         55000       61817.    11
## 2 Bachelor's Degree or Equivalent          80293       87063.   193
## 3 High School or Equivalent                59249.      52124.     4
## 4 M.D./PharmD/D.D.S. or Equivalent         90000       89333.     3
## 5 Master's Degree or Equivalent            90000       95648.    73
## 6 Ph.D or Equivalent                      135000      138826.   166

The number of respondents with a high school eduction and professional school education are very low compared to the other categories.

Problem Set #2:

How does the mean salary vary by location? Using the biotech data set again, use group_by() on the location column and summarize() the mean() of annual_base_salary.

# your R code here

Challenge Problem: If that was easy… For individuals working in the Boston Metro Area, what is the mean salary for the different types biotech_sub_industry and highest_achieved_formal_education? Include a column that calculates the sample size of each group.

Hint: This will require using the filter() function first to just get rows corresponding to the Boston Metro Area, followed by a group_by() of multiple variables, and then finally summarize().

# your R code here

3. Data visualization

Arguably the Tidyverse is the most used and well-known for its data visualization package, ggplot2. Its syntax is similar to but distinct from other tidyverse packages. In our previous examples using dplyr, we chained functions together using the pipe (%>%). When making a ggplot, we layer plot functions using a plus sign (+). Continuing to use the biotech data set, let’s visualize the relationship between years of experience in the biotechnology industry and annual base salary.

# plot annual base salary vs years of experience
ggplot(biotech, aes(x = years_of_experience, 
                    y = annual_base_salary)) +
  geom_point()

It looks like there might be correlation between years of experience and compensation, but it looks pretty noisy. The biggest contributors to variance are likely level of education and location. We can separate this plot into a grid of scatter plots for each category using facet_wrap().

# generate a separate scatter plot for each level of education
ggplot(biotech, aes(x = years_of_experience, 
                    y = annual_base_salary)) +
  geom_point() +
  facet_wrap(~highest_achieved_formal_education)

Another feature of ggplot() is the ability to color, shape, or size points by other variables in the data set. In the previous plot, we were interested in the correlation between base salary and years of experience. Now, we’ll visualize the approximate_company_size variable on the same plot using the color aesthetic. Since the default colors used by ggplot are not colorblind-friendly, we’ll add a layer that changes the point colors to a palette that is.

# change the color of points to reflect size of employer organization
ggplot(biotech, aes(x = years_of_experience, 
                    y = annual_base_salary,
                    color = approximate_company_size)) +
  geom_point() +
  facet_wrap(~highest_achieved_formal_education) +
  scale_color_colorblind()

## Warning: Removed 2 rows containing missing values (geom_point).

One common task I often run into is the re-ordering of legend variables. For ease of interpretation, we want the company size ranges to either ascend or descend in order. This is not automatically implemented by R because the ranges are considered a “character” or “string” data type, so R therefor does not recognize the values as numbers that can be ordered. However, we can manually override this ordering using fct_relevel.

The first step in this process is to create a vector that specifies the desired order:

# specify the categories in the desired order
desired_order <- c("1-50", "50-200", 
                   "200-1000", "1000-5000", "5000+")

Next, we can use mutate() to change the data type of the column to a manually ordered “factor” column and pass that directly to ggplot. Note that the results of the mutate() function pipes into ggplot(), and therefore we do not have to specify the data set as our first argument to the ggplot() function.

# make plot with reordered legend variables
biotech %>%
  mutate(
    approximate_company_size = fct_relevel(
      approximate_company_size, desired_order)
    ) %>%
    ggplot(., aes(x = years_of_experience, 
                      y = annual_base_salary,
                      color = approximate_company_size)) +
    geom_point() +
    facet_wrap(~highest_achieved_formal_education) +
    scale_color_colorblind()

## Warning: Removed 2 rows containing missing values (geom_point).

Finally, we can apply a number of layers to the plot to make it publication ready. The layer labs() will allow us to change the legends, and we can specify point size and degree of point opacity inside geom_point(). Additionally, we can pre-filter NAs from our data set before passing the data to ggplot using the filter() function.

# make it pretty
biotech %>%
  mutate(
    approximate_company_size = fct_relevel(
      approximate_company_size, desired_order)
    ) %>%
  filter(!is.na(approximate_company_size)) %>%
    ggplot(., aes(x = years_of_experience, 
                      y = annual_base_salary,
                      color = approximate_company_size)) +
    geom_point(size = 2, alpha = 0.8) +
    facet_wrap(~highest_achieved_formal_education) +
    scale_color_colorblind() +
    labs(x = "Years of Experience", y = "Base salary",
         color = "Company Size")

Problem Set #3:

Filter the biotech data to contain only the following, and then assign it to a new data frame:
- Individuals working in the Bay Area & the Boston Metro Area (e.g., create a filter for the location column)
- Individuals working for pharmaceutical companies (e.g., create a filter for the biotech_sub_industry column)
Make a scatter plot with years_of_experience on the x-axis and annual_base_salary on the y-axis.
Color the points by the highest_achieved_formal_education column.
Change the color of the points to a colorblind-friendly palette.
Rename the axes labels using labs(). Bonus: Change the shape of the points to reflect the company_details column.

# your R code here

Challenge Problem: If that was easy…

Filter the biotech data to contain only:
- The top 3 locations with the most data points.
- Only individuals with Bachelor’s, Master’s and Doctorate’s degrees.
Make a density plot showing the distribution of salaries for each level of education (hint: assign x = annual_base_salary, fill = highest_achieved_formal_education, and then use geom_density()).
Use facet_wrap() with location to make a grid of 3 density distribution plots.
Change the default colors and reduce the opacity of the density distributions to visualize the overlaps.
Move the legend position to the top of the plot.

# your R code here

Introduction to R

In-class worksheet #2

1. Data extraction (select(), filter())

2. Data aggregation (group_by() and summarize())

3. Data visualization

1. Data extraction (`select()`, `filter()`)

2. Data aggregation (`group_by()` and `summarize()`)