worksheet2_solutions.knit

Introduction to R

In-class worksheet #2, solutions

November 9th, 2022

There are many external packages that computational scientists rely on for data analysis (e.g., DESeq2 for RNA-seq processing), but perhaps one of the most important is the tidyverse library. The tidyverse library is a collection of R packages made for streamlining interactive data analysis. Most of the core functions are designed to interact with tabular data in a “tidy” format, where there is 1 column per variable and and 1 row per observation (see slides for examples). Though there are many packages within the tidyverse, we’re going to focus on two of the most commonly used: dplyr for data manipulation, and ggplot for data visualization.

1. Data extraction (`select()`, `filter()`)

Given a large, complicated, and/or messy data set, we may only be interested in examining a few variables or conditions. We will use the biotech data set to demonstrate how to subset data.

Run the code chunk below to load and take a glance at the data.

biotech <- read_csv("https://rachaelcox.github.io/classes/datasets/salary_data.csv")

## Rows: 450 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): timestamp, location, biotech_sub_industry, company_name, company_d...
## dbl  (4): years_of_experience, annual_base_salary, sign_on_bonus, sign_on_re...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(biotech)

## # A tibble: 6 × 14
##   timest…¹ locat…² biote…³ compa…⁴ compa…⁵ appro…⁶ title highe…⁷ years…⁸ annua…⁹
##   <chr>    <chr>   <chr>   <chr>   <chr>   <chr>   <chr> <chr>     <dbl>   <dbl>
## 1 5/8/202… Califo… Consul… <NA>    Private 50-200  Seni… Master…       6  160000
## 2 5/8/202… San Di… Pharma… <NA>    Public  50-200  Rese… Bachel…       0   55000
## 3 5/8/202… Midwes… Academ… <NA>    Academ… 5000+   Rese… Bachel…       2   28000
## 4 5/8/202… NYC Me… Pharma… Wuxi B… Public  1000-5… Manu… Bachel…       6   72800
## 5 5/8/202… Boston… Pharma… Abbvie  Public  5000+   Seni… Ph.D o…       0  120000
## 6 5/8/202… Canada  Pharma… <NA>    Start-… 1-50    Soft… Associ…      20  140000
## # … with 4 more variables: annual_target_bonus <chr>, annual_equity <chr>,
## #   sign_on_bonus <dbl>, sign_on_relocation_assistance <dbl>, and abbreviated
## #   variable names ¹timestamp, ²location, ³biotech_sub_industry, ⁴company_name,
## #   ⁵company_details, ⁶approximate_company_size,
## #   ⁷highest_achieved_formal_education, ⁸years_of_experience,
## #   ⁹annual_base_salary

The biotech data set was pulled from a survey distributed on the Reddit forum r/biotech in 2022, where 450 participants answered questions regarding their role, compensation, level of education and location as it relates to their employment in the biotechnology industry. Let’s say we are only interested in salary, education level, and location of each individual. To see all of the column names present in the data frame, we can use the names() function:

names(biotech)

##  [1] "timestamp"                         "location"                         
##  [3] "biotech_sub_industry"              "company_name"                     
##  [5] "company_details"                   "approximate_company_size"         
##  [7] "title"                             "highest_achieved_formal_education"
##  [9] "years_of_experience"               "annual_base_salary"               
## [11] "annual_target_bonus"               "annual_equity"                    
## [13] "sign_on_bonus"                     "sign_on_relocation_assistance"

The columns that correspond to salary, education level, and location are annual_base_salary, highest_achieved_formal_education, and location, respectively. We can extract these columns with the select() function, and then save the result to a new data frame called biotech_subset:

# extract the "location", "annual_base_salary" and "highest_achieved_formal_education" columns
biotech_subset <- biotech %>%
  select(location, annual_base_salary,
         highest_achieved_formal_education)

Then, let’s say, we’re interested in looking at individuals with jobs in the Bay Area. To obtain this information, we need to use filter() to extract the rows where location is equal to “Bay Area”, or location == "Bay Area". Notice that, in this example, I do not assign the result to a new data frame, and thus the result of the function is not “saved” (i.e., only displayed on the console or inline output).

# filter subset biotech data for individuals working in the Bay Area
biotech_subset %>%
  filter(location == "Bay Area")

## # A tibble: 88 × 3
##    location annual_base_salary highest_achieved_formal_education
##    <chr>                 <dbl> <chr>                            
##  1 Bay Area              82000 Bachelor's Degree or Equivalent  
##  2 Bay Area             131200 Bachelor's Degree or Equivalent  
##  3 Bay Area             155000 Ph.D or Equivalent               
##  4 Bay Area             220000 Ph.D or Equivalent               
##  5 Bay Area              95000 Master's Degree or Equivalent    
##  6 Bay Area              75000 Bachelor's Degree or Equivalent  
##  7 Bay Area             100000 Bachelor's Degree or Equivalent  
##  8 Bay Area             170000 Ph.D or Equivalent               
##  9 Bay Area             148000 Ph.D or Equivalent               
## 10 Bay Area             135000 Ph.D or Equivalent               
## # … with 78 more rows

One of the best features of the tidyverse is that it encourages piping, which is a process that executes multiple functions at once on a data set. This feature can save us many, many lines of codes while still maintaining excellent readability. Referring to the above example, we can chain both steps together using %>%, also known as the tidyverse pipe:

biotech_filtered <- biotech %>%
  select(location, annual_base_salary,
         highest_achieved_formal_education) %>%
  filter(location == "Bay Area")

Problem Set #1:

Take the biotech data set and use select() to extract the columns location, highest_achieved_formal_education, and annual_base_salary, biotech_sub_industry, and approximate_company_size. Assign it to a new data frame called my_biotech.
Filter the my_biotech dataframe such that you only keep rows where annual_base_salary is more than $75,000.

# extract the "location", "annual_base_salary", "highest_achieved_formal_education", "biotech_sub_industry", and "approximate_company_size" columns 
my_biotech <- biotech %>%
  select(location, annual_base_salary,
         highest_achieved_formal_education,
         biotech_sub_industry, approximate_company_size)

# filter my_biotech so it only contains individuals with base salary 
my_biotech %>%
  filter(annual_base_salary > 75000)

## # A tibble: 327 × 5
##    location           annual_base_salary highest_achieved_form…¹ biote…² appro…³
##    <chr>                           <dbl> <chr>                   <chr>   <chr>  
##  1 California (Other)             160000 Master's Degree or Equ… Consul… 50-200 
##  2 Boston Metro Area              120000 Ph.D or Equivalent      Pharma… 5000+  
##  3 Canada                         140000 Associate’s Degree or … Pharma… 1-50   
##  4 Bay Area                        82000 Bachelor's Degree or E… Pharma… 1000-5…
##  5 Seattle                        140000 Ph.D or Equivalent      Medica… 1000-5…
##  6 Seattle                        140000 Ph.D or Equivalent      Academ… 50-200 
##  7 San Diego                      103000 Ph.D or Equivalent      Medica… 5000+  
##  8 Bay Area                       131200 Bachelor's Degree or E… Pharma… 50-200 
##  9 Boston Metro Area              100000 Master's Degree or Equ… Pharma… 200-10…
## 10 Bay Area                       155000 Ph.D or Equivalent      Academ… 1-50   
## # … with 317 more rows, and abbreviated variable names
## #   ¹highest_achieved_formal_education, ²biotech_sub_industry,
## #   ³approximate_company_size

Challenge Problem: If that was easy… How many individuals have a salary greater than or equal to $105,000 and work in the “subindustries” of “Academia/Research”?

# execute multiple mutually inclusive filters with '&'
rich_profs <- biotech %>%
  select(location, annual_base_salary,
         biotech_sub_industry) %>%
  filter(biotech_sub_industry == "Academia/Research" & annual_base_salary >= 105000)
rich_profs

## # A tibble: 8 × 3
##   location          annual_base_salary biotech_sub_industry
##   <chr>                          <dbl> <chr>               
## 1 Seattle                       140000 Academia/Research   
## 2 Bay Area                      155000 Academia/Research   
## 3 Bay Area                      120000 Academia/Research   
## 4 Boston Metro Area             120000 Academia/Research   
## 5 San Diego                     120000 Academia/Research   
## 6 Boston Metro Area             154000 Academia/Research   
## 7 Bay Area                      115000 Academia/Research   
## 8 Boston Metro Area             160000 Academia/Research

Eight r/biotech Reddit users work in academia and have a salary >= $105,000. They live in the Bay Area (3), Boston Metro Area (3), San Diego (1) and Seattle (1).

2. Data aggregation (`group_by()` and `summarize()`)

Most of the time, our goal is to learn something about the relationships between the variables in our data set. For instance, the biotech data set samples a diversity of Reddit users working in the biotechnology industry; how much does the average salary vary across level of education? With group_by and summarize, we can do a quick investigation.

# calculate the mean and median salary
# for each level of education
biotech %>%
  group_by(highest_achieved_formal_education) %>%
  summarize(median_salary = median(annual_base_salary),
            mean_salary = mean(annual_base_salary))

## # A tibble: 6 × 3
##   highest_achieved_formal_education median_salary mean_salary
##   <chr>                                     <dbl>       <dbl>
## 1 Associate’s Degree or Equivalent         55000       61817.
## 2 Bachelor's Degree or Equivalent          80293       87063.
## 3 High School or Equivalent                59249.      52124.
## 4 M.D./PharmD/D.D.S. or Equivalent         90000       89333.
## 5 Master's Degree or Equivalent            90000       95648.
## 6 Ph.D or Equivalent                      135000      138826.

Note the way I’ve structured the above code. At the beginning, the data set we’re manipulating is on the first line, and each subsequent function the data passes through gets a new line/indentation. Though not functionally necessary, styling your R/tidyverse code this way greatly improves readability. Readability of your code should ALWAYS be high priority.

We can also probe potential biases in our data set–for example, the sample size for each of the levels of education we probed above.

# add a calculation for the # of individuals per level of education
biotech %>%
  group_by(highest_achieved_formal_education) %>%
  summarize(median_salary = median(annual_base_salary),
            mean_salary = mean(annual_base_salary),
            count = n())

## # A tibble: 6 × 4
##   highest_achieved_formal_education median_salary mean_salary count
##   <chr>                                     <dbl>       <dbl> <int>
## 1 Associate’s Degree or Equivalent         55000       61817.    11
## 2 Bachelor's Degree or Equivalent          80293       87063.   193
## 3 High School or Equivalent                59249.      52124.     4
## 4 M.D./PharmD/D.D.S. or Equivalent         90000       89333.     3
## 5 Master's Degree or Equivalent            90000       95648.    73
## 6 Ph.D or Equivalent                      135000      138826.   166

The number of respondents with a high school eduction and professional school education are very low compared to the other categories.

Problem Set #2:

How does the mean salary vary by location? Using the biotech data set again, use group_by() on the location column and summarize() the mean() of annual_base_salary.

# calculate the mean
biotech %>%
  group_by(location) %>%
  summarize(mean_salary = mean(annual_base_salary)) %>%
  arrange(desc(mean_salary))

## # A tibble: 27 × 2
##    location           mean_salary
##    <chr>                    <dbl>
##  1 Bay Area               124725.
##  2 Remote                 124000 
##  3 San Diego              120871.
##  4 Denver Metro Area      119613.
##  5 NYC Metro Area         117021 
##  6 Boston Metro Area      113266.
##  7 D.C. Metro Area        108610 
##  8 Midwest (Other)        107786.
##  9 California (Other)     104144.
## 10 Seattle                100214.
## # … with 17 more rows

Challenge Problem: If that was easy… For individuals working in the Boston Metro Area, what is the mean salary for the different types biotech_sub_industry and highest_achieved_formal_education? Include a column that calculates the sample size of each group.

Hint: This will require using the filter() function first to just get rows corresponding to the Boston Metro Area, followed by a group_by() of multiple variables, and then finally summarize().

biotech %>%
  filter(location == "Boston Metro Area") %>%
  group_by(highest_achieved_formal_education, biotech_sub_industry) %>% 
  summarize(mean_salary = mean(annual_base_salary),
            count = n()) %>%
  arrange(desc(mean_salary))

## `summarise()` has grouped output by 'highest_achieved_formal_education'. You
## can override using the `.groups` argument.

## # A tibble: 19 × 4
## # Groups:   highest_achieved_formal_education [6]
##    highest_achieved_formal_education biotech_sub_industry          mean_…¹ count
##    <chr>                             <chr>                           <dbl> <int>
##  1 Ph.D or Equivalent                Biotechnology                 160000      1
##  2 Ph.D or Equivalent                Pharmaceutical                145869.    45
##  3 Ph.D or Equivalent                Academia/Research             110400      5
##  4 Ph.D or Equivalent                Medical Devices               106200      1
##  5 Ph.D or Equivalent                Cosmetics                     103000      1
##  6 Ph.D or Equivalent                Biosynthesis                  100000      1
##  7 Bachelor's Degree or Equivalent   Pharmaceutical                 99412.    37
##  8 Master's Degree or Equivalent     Pharmaceutical                 99067.    22
##  9 Master's Degree or Equivalent     Biologics                      95000      1
## 10 Bachelor's Degree or Equivalent   Gene Therapy                   90500      1
## 11 M.D./PharmD/D.D.S. or Equivalent  Pharmaceutical                 90000      1
## 12 Ph.D or Equivalent                I’m in flavors and fragrance…  85000      1
## 13 Associate’s Degree or Equivalent  Academia/Research              82000      1
## 14 Bachelor's Degree or Equivalent   diagnostics                    73000      1
## 15 Associate’s Degree or Equivalent  Medical Devices                55000      1
## 16 Bachelor's Degree or Equivalent   Medical Devices                54340      2
## 17 Bachelor's Degree or Equivalent   Academia/Research              36000      1
## 18 Master's Degree or Equivalent     molecular bio                     70      1
## 19 High School or Equivalent         Academia/Research                  0      1
## # … with abbreviated variable name ¹mean_salary

3. Data visualization

Arguably the Tidyverse is the most used and well-known for its data visualization package, ggplot2. Its syntax is similar to but distinct from other tidyverse packages. In our previous examples using dplyr, we chained functions together using the pipe (%>%). When making a ggplot, we layer plot functions using a plus sign (+). Continuing to use the biotech data set, let’s visualize the relationship between years of experience in the biotechnology industry and annual base salary.

# plot annual base salary vs years of experience
ggplot(biotech, aes(x = years_of_experience, 
                    y = annual_base_salary)) +
  geom_point()

It looks like there might be correlation between years of experience and compensation, but it’s pretty noisy. Based on our preliminary analyses, the biggest contributors to variance are likely level of education and location. We can separate this plot into a grid of scatter plots for each category using facet_wrap().

# generate a separate scatter plot for each level of education
ggplot(biotech, aes(x = years_of_experience, 
                    y = annual_base_salary)) +
  geom_point() +
  facet_wrap(~highest_achieved_formal_education)

Another feature of ggplot() is the ability to color, shape, or size points by other variables in the data set. In the previous plot, we were interested in the correlation between base salary and years of experience. Now, we’ll visualize the approximate_company_size variable on the same plot using the color aesthetic. Since the default colors used by ggplot are not colorblind-friendly, we’ll add a layer that changes the point colors to a palette that is.

# change the color of points to reflect size of employer organization
ggplot(biotech, aes(x = years_of_experience, 
                    y = annual_base_salary,
                    color = approximate_company_size)) +
  geom_point() +
  facet_wrap(~highest_achieved_formal_education) +
  scale_color_colorblind()

## Warning: Removed 2 rows containing missing values (geom_point).

One extremely common task I often run into is the re-ordering of legend variables. For ease of interpretation, we want the company size ranges to either ascend or descend in order. This is not automatically implemented by R because the ranges are considered a “character” or “string” data type, so R therefor does not recognize the values as numbers that can be ordered. However, we can manually override this ordering using fct_relevel.

The first step in this process is to create a vector that specifies the desired order:

# specify the categories in the desired order
desired_order <- c("1-50", "50-200", 
                   "200-1000", "1000-5000", "5000+")

Next, we can use mutate() to change the data type of the column to a manually ordered “factor” column and pass that directly to ggplot. Note that the results of the mutate() function pipes into ggplot(), and therefore we do not have to specify the data set as our first argument to the ggplot() function.

# make plot with reordered legend variables
biotech %>%
  mutate(
    approximate_company_size = fct_relevel(
      approximate_company_size, desired_order)
    ) %>%
    ggplot(., aes(x = years_of_experience, 
                      y = annual_base_salary,
                      color = approximate_company_size)) +
    geom_point() +
    facet_wrap(~highest_achieved_formal_education) +
    scale_color_colorblind()

## Warning: Removed 2 rows containing missing values (geom_point).

Finally, we can apply a number of layers to the plot to make it publication ready. The layer labs() will allow us to change the legends, and we can specify point size and degree of point opacity inside geom_point(). Additionally, we can pre-filter NAs from our data set before passing the data to ggplot using the filter() function.

# make it pretty
biotech %>%
  mutate(
    approximate_company_size = fct_relevel(
      approximate_company_size, desired_order)
    ) %>%
  filter(!is.na(approximate_company_size)) %>%
    ggplot(., aes(x = years_of_experience, 
                      y = annual_base_salary,
                      color = approximate_company_size)) +
    geom_point(size = 2, alpha = 0.8) +
    facet_wrap(~highest_achieved_formal_education) +
    scale_color_colorblind() +
    labs(x = "Years of Experience", y = "Base salary",
         color = "Company Size")

Problem Set #3:

Filter the biotech data to contain only the following, and then assign it to a new data frame:
- Individuals working in the Bay Area & the Boston Metro Area (e.g., create a filter for the location column)
- Individuals working for pharmaceutical companies (e.g., create a filter for the biotech_sub_industry column)
Make a scatter plot with years_of_experience on the x-axis and annual_base_salary on the y-axis.
Color the points by the highest_achieved_formal_education column.
Change the color of the points to a colorblind-friendly palette.
Rename the axes labels using labs(). Bonus: Change the shape of the points to reflect the company_details column.

# make filtered data set
filtered_df <- biotech %>%
  filter(location %in% c("Bay Area", "Boston Metro Area")) %>%
  filter(biotech_sub_industry == "Pharmaceutical")

# generate plot for pharma workers
ggplot(filtered_df, aes(x = years_of_experience, 
                        y = annual_base_salary, 
                        color = highest_achieved_formal_education,
                        shape = company_details)) +
  geom_point(size = 3, alpha = 0.8) +
  scale_color_manual(values = c("#E69F00", "#56B4E9", "#009E73",
                     "#0072B2", "#D55E00", "#CC79A7")) +
  labs(x = "Years of Experience", y = "Base salary",
       color = "Education", shape = "Organization\tType")

Challenge Problem: If that was easy…

Filter the biotech data to contain only:
- The top 3 locations with the most data points.
- Only individuals with Bachelor’s, Master’s and Doctorate’s degrees.
Make a density plot showing the distribution of salaries for each level of education (hint: assign x = annual_base_salary, fill = highest_achieved_formal_education, and then use geom_density()).
Use facet_wrap() with location to make a grid of 3 density distribution plots.
Change the default colors and reduce the opacity of the density distributions to visualize the overlaps.
Move the legend position to the top of the plot.

# find the top 3 locations with most data points
top3 <- biotech %>%
  group_by(location) %>%
  tally() %>%
  top_n(3, wt = n)

target_education <- c("Bachelor's Degree or Equivalent",
         "Master's Degree or Equivalent",
         "Ph.D or Equivalent")

# generate plot
biotech %>%
  filter(location %in% top3$location) %>%
  filter(highest_achieved_formal_education %in% target_education) %>%
  mutate(   # make education labels wrap
    edu_pretty = str_wrap(
      highest_achieved_formal_education,
      width = 20)) %>%
  ggplot(., aes(x = annual_base_salary,
                fill = edu_pretty)) +
  geom_density(alpha = 0.75) +
  facet_wrap(~location) +  # make a plot for each location
  scale_fill_manual(values = c("#E69F00", "#56B4E9", "#009E73",
                     "#0072B2", "#D55E00", "#CC79A7")) +
  labs(x = "Base salary", y = "Density",
       fill = "Education") +
  theme(legend.position = "top")  # move legend to top

Introduction to R

In-class worksheet #2, solutions

1. Data extraction (select(), filter())

2. Data aggregation (group_by() and summarize())

3. Data visualization

1. Data extraction (`select()`, `filter()`)

2. Data aggregation (`group_by()` and `summarize()`)