November 9th, 2022
There are many external packages that computational scientists rely
on for data analysis (e.g., DESeq2
for RNA-seq processing),
but perhaps one of the most important is the tidyverse
library. The tidyverse
library is a collection of R
packages made for streamlining interactive data analysis. Most of the
core functions are designed to interact with tabular data in a “tidy”
format, where there is 1 column per variable and and 1 row per
observation (see slides for examples). Though there are many packages
within the tidyverse
, we’re going to focus on two of the
most commonly used: dplyr
for data manipulation, and
ggplot
for data visualization.
select()
,
filter()
)Given a large, complicated, and/or messy data set, we may only be
interested in examining a few variables or conditions. We will use the
biotech
data set to demonstrate how to subset data.
Run the code chunk below to load and take a glance at the data.
biotech <- read_csv("https://rachaelcox.github.io/classes/datasets/salary_data.csv")
## Rows: 450 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): timestamp, location, biotech_sub_industry, company_name, company_d...
## dbl (4): years_of_experience, annual_base_salary, sign_on_bonus, sign_on_re...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(biotech)
## # A tibble: 6 × 14
## timest…¹ locat…² biote…³ compa…⁴ compa…⁵ appro…⁶ title highe…⁷ years…⁸ annua…⁹
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 5/8/202… Califo… Consul… <NA> Private 50-200 Seni… Master… 6 160000
## 2 5/8/202… San Di… Pharma… <NA> Public 50-200 Rese… Bachel… 0 55000
## 3 5/8/202… Midwes… Academ… <NA> Academ… 5000+ Rese… Bachel… 2 28000
## 4 5/8/202… NYC Me… Pharma… Wuxi B… Public 1000-5… Manu… Bachel… 6 72800
## 5 5/8/202… Boston… Pharma… Abbvie Public 5000+ Seni… Ph.D o… 0 120000
## 6 5/8/202… Canada Pharma… <NA> Start-… 1-50 Soft… Associ… 20 140000
## # … with 4 more variables: annual_target_bonus <chr>, annual_equity <chr>,
## # sign_on_bonus <dbl>, sign_on_relocation_assistance <dbl>, and abbreviated
## # variable names ¹timestamp, ²location, ³biotech_sub_industry, ⁴company_name,
## # ⁵company_details, ⁶approximate_company_size,
## # ⁷highest_achieved_formal_education, ⁸years_of_experience,
## # ⁹annual_base_salary
The biotech
data set was pulled from a survey
distributed on the Reddit forum r/biotech in 2022,
where 450 participants answered questions regarding their role,
compensation, level of education and location as it relates to their
employment in the biotechnology industry. Let’s say we are only
interested in salary, education level, and location of each individual.
To see all of the column names present in the data frame, we can use the
names()
function:
names(biotech)
## [1] "timestamp" "location"
## [3] "biotech_sub_industry" "company_name"
## [5] "company_details" "approximate_company_size"
## [7] "title" "highest_achieved_formal_education"
## [9] "years_of_experience" "annual_base_salary"
## [11] "annual_target_bonus" "annual_equity"
## [13] "sign_on_bonus" "sign_on_relocation_assistance"
The columns that correspond to salary, education level, and location
are annual_base_salary
,
highest_achieved_formal_education
, and
location
, respectively. We can extract these columns with
the select()
function, and then save the result to a new
data frame called biotech_subset
:
# extract the "location", "annual_base_salary" and "highest_achieved_formal_education" columns
biotech_subset <- biotech %>%
select(location, annual_base_salary,
highest_achieved_formal_education)
Then, let’s say, we’re interested in looking at individuals with jobs
in the Bay Area. To obtain this information, we need to use
filter()
to extract the rows where location
is
equal to “Bay Area”, or location == "Bay Area"
. Notice
that, in this example, I do not assign the result to a new data frame,
and thus the result of the function is not “saved” (i.e., only displayed
on the console or inline output).
# filter subset biotech data for individuals working in the Bay Area
biotech_subset %>%
filter(location == "Bay Area")
## # A tibble: 88 × 3
## location annual_base_salary highest_achieved_formal_education
## <chr> <dbl> <chr>
## 1 Bay Area 82000 Bachelor's Degree or Equivalent
## 2 Bay Area 131200 Bachelor's Degree or Equivalent
## 3 Bay Area 155000 Ph.D or Equivalent
## 4 Bay Area 220000 Ph.D or Equivalent
## 5 Bay Area 95000 Master's Degree or Equivalent
## 6 Bay Area 75000 Bachelor's Degree or Equivalent
## 7 Bay Area 100000 Bachelor's Degree or Equivalent
## 8 Bay Area 170000 Ph.D or Equivalent
## 9 Bay Area 148000 Ph.D or Equivalent
## 10 Bay Area 135000 Ph.D or Equivalent
## # … with 78 more rows
One of the best features of the tidyverse
is that it
encourages piping, which is a process that executes multiple functions
at once on a data set. This feature can save us many, many lines of
codes while still maintaining excellent readability. Referring to the
above example, we can chain both steps together using
%>%
, also known as the tidyverse
pipe:
biotech_filtered <- biotech %>%
select(location, annual_base_salary,
highest_achieved_formal_education) %>%
filter(location == "Bay Area")
Problem Set #1:
biotech
data set and use select()
to extract the columns location
,
highest_achieved_formal_education
, and
annual_base_salary
, biotech_sub_industry
, and
approximate_company_size
. Assign it to a new data frame
called my_biotech
.my_biotech
dataframe such that you only keep
rows where annual_base_salary
is more than $75,000.# extract the "location", "annual_base_salary", "highest_achieved_formal_education", "biotech_sub_industry", and "approximate_company_size" columns
my_biotech <- biotech %>%
select(location, annual_base_salary,
highest_achieved_formal_education,
biotech_sub_industry, approximate_company_size)
# filter my_biotech so it only contains individuals with base salary
my_biotech %>%
filter(annual_base_salary > 75000)
## # A tibble: 327 × 5
## location annual_base_salary highest_achieved_form…¹ biote…² appro…³
## <chr> <dbl> <chr> <chr> <chr>
## 1 California (Other) 160000 Master's Degree or Equ… Consul… 50-200
## 2 Boston Metro Area 120000 Ph.D or Equivalent Pharma… 5000+
## 3 Canada 140000 Associate’s Degree or … Pharma… 1-50
## 4 Bay Area 82000 Bachelor's Degree or E… Pharma… 1000-5…
## 5 Seattle 140000 Ph.D or Equivalent Medica… 1000-5…
## 6 Seattle 140000 Ph.D or Equivalent Academ… 50-200
## 7 San Diego 103000 Ph.D or Equivalent Medica… 5000+
## 8 Bay Area 131200 Bachelor's Degree or E… Pharma… 50-200
## 9 Boston Metro Area 100000 Master's Degree or Equ… Pharma… 200-10…
## 10 Bay Area 155000 Ph.D or Equivalent Academ… 1-50
## # … with 317 more rows, and abbreviated variable names
## # ¹highest_achieved_formal_education, ²biotech_sub_industry,
## # ³approximate_company_size
Challenge Problem: If that was easy… How many individuals have a salary greater than or equal to $105,000 and work in the “subindustries” of “Academia/Research”?
# execute multiple mutually inclusive filters with '&'
rich_profs <- biotech %>%
select(location, annual_base_salary,
biotech_sub_industry) %>%
filter(biotech_sub_industry == "Academia/Research" & annual_base_salary >= 105000)
rich_profs
## # A tibble: 8 × 3
## location annual_base_salary biotech_sub_industry
## <chr> <dbl> <chr>
## 1 Seattle 140000 Academia/Research
## 2 Bay Area 155000 Academia/Research
## 3 Bay Area 120000 Academia/Research
## 4 Boston Metro Area 120000 Academia/Research
## 5 San Diego 120000 Academia/Research
## 6 Boston Metro Area 154000 Academia/Research
## 7 Bay Area 115000 Academia/Research
## 8 Boston Metro Area 160000 Academia/Research
Eight r/biotech Reddit users work in academia and have a salary >= $105,000. They live in the Bay Area (3), Boston Metro Area (3), San Diego (1) and Seattle (1).
group_by()
and
summarize()
)Most of the time, our goal is to learn something about the
relationships between the variables in our data set. For instance, the
biotech
data set samples a diversity of Reddit users
working in the biotechnology industry; how much does the average salary
vary across level of education? With group_by
and
summarize
, we can do a quick investigation.
# calculate the mean and median salary
# for each level of education
biotech %>%
group_by(highest_achieved_formal_education) %>%
summarize(median_salary = median(annual_base_salary),
mean_salary = mean(annual_base_salary))
## # A tibble: 6 × 3
## highest_achieved_formal_education median_salary mean_salary
## <chr> <dbl> <dbl>
## 1 Associate’s Degree or Equivalent 55000 61817.
## 2 Bachelor's Degree or Equivalent 80293 87063.
## 3 High School or Equivalent 59249. 52124.
## 4 M.D./PharmD/D.D.S. or Equivalent 90000 89333.
## 5 Master's Degree or Equivalent 90000 95648.
## 6 Ph.D or Equivalent 135000 138826.
Note the way I’ve structured the above code. At the beginning, the data set we’re manipulating is on the first line, and each subsequent function the data passes through gets a new line/indentation. Though not functionally necessary, styling your R/tidyverse code this way greatly improves readability. Readability of your code should ALWAYS be high priority.
We can also probe potential biases in our data set–for example, the sample size for each of the levels of education we probed above.
# add a calculation for the # of individuals per level of education
biotech %>%
group_by(highest_achieved_formal_education) %>%
summarize(median_salary = median(annual_base_salary),
mean_salary = mean(annual_base_salary),
count = n())
## # A tibble: 6 × 4
## highest_achieved_formal_education median_salary mean_salary count
## <chr> <dbl> <dbl> <int>
## 1 Associate’s Degree or Equivalent 55000 61817. 11
## 2 Bachelor's Degree or Equivalent 80293 87063. 193
## 3 High School or Equivalent 59249. 52124. 4
## 4 M.D./PharmD/D.D.S. or Equivalent 90000 89333. 3
## 5 Master's Degree or Equivalent 90000 95648. 73
## 6 Ph.D or Equivalent 135000 138826. 166
The number of respondents with a high school eduction and professional school education are very low compared to the other categories.
Problem Set #2:
biotech
data set again, use group_by()
on the
location
column and summarize()
the
mean()
of annual_base_salary
.# calculate the mean
biotech %>%
group_by(location) %>%
summarize(mean_salary = mean(annual_base_salary)) %>%
arrange(desc(mean_salary))
## # A tibble: 27 × 2
## location mean_salary
## <chr> <dbl>
## 1 Bay Area 124725.
## 2 Remote 124000
## 3 San Diego 120871.
## 4 Denver Metro Area 119613.
## 5 NYC Metro Area 117021
## 6 Boston Metro Area 113266.
## 7 D.C. Metro Area 108610
## 8 Midwest (Other) 107786.
## 9 California (Other) 104144.
## 10 Seattle 100214.
## # … with 17 more rows
Challenge Problem: If that was easy… For individuals
working in the Boston Metro Area, what is the mean salary for the
different types biotech_sub_industry
and
highest_achieved_formal_education
? Include a column that
calculates the sample size of each group.
Hint: This will require using the
filter()
function first to just get rows corresponding to
the Boston Metro Area, followed by a group_by()
of multiple
variables, and then finally summarize()
.
biotech %>%
filter(location == "Boston Metro Area") %>%
group_by(highest_achieved_formal_education, biotech_sub_industry) %>%
summarize(mean_salary = mean(annual_base_salary),
count = n()) %>%
arrange(desc(mean_salary))
## `summarise()` has grouped output by 'highest_achieved_formal_education'. You
## can override using the `.groups` argument.
## # A tibble: 19 × 4
## # Groups: highest_achieved_formal_education [6]
## highest_achieved_formal_education biotech_sub_industry mean_…¹ count
## <chr> <chr> <dbl> <int>
## 1 Ph.D or Equivalent Biotechnology 160000 1
## 2 Ph.D or Equivalent Pharmaceutical 145869. 45
## 3 Ph.D or Equivalent Academia/Research 110400 5
## 4 Ph.D or Equivalent Medical Devices 106200 1
## 5 Ph.D or Equivalent Cosmetics 103000 1
## 6 Ph.D or Equivalent Biosynthesis 100000 1
## 7 Bachelor's Degree or Equivalent Pharmaceutical 99412. 37
## 8 Master's Degree or Equivalent Pharmaceutical 99067. 22
## 9 Master's Degree or Equivalent Biologics 95000 1
## 10 Bachelor's Degree or Equivalent Gene Therapy 90500 1
## 11 M.D./PharmD/D.D.S. or Equivalent Pharmaceutical 90000 1
## 12 Ph.D or Equivalent I’m in flavors and fragrance… 85000 1
## 13 Associate’s Degree or Equivalent Academia/Research 82000 1
## 14 Bachelor's Degree or Equivalent diagnostics 73000 1
## 15 Associate’s Degree or Equivalent Medical Devices 55000 1
## 16 Bachelor's Degree or Equivalent Medical Devices 54340 2
## 17 Bachelor's Degree or Equivalent Academia/Research 36000 1
## 18 Master's Degree or Equivalent molecular bio 70 1
## 19 High School or Equivalent Academia/Research 0 1
## # … with abbreviated variable name ¹mean_salary
Arguably the Tidyverse is the most used and well-known for its data
visualization package, ggplot2
. Its syntax is similar to
but distinct from other tidyverse
packages. In our previous
examples using dplyr
, we chained functions together using
the pipe (%>%
). When making a ggplot, we layer
plot functions using a plus sign (+
). Continuing to use the
biotech
data set, let’s visualize the relationship between
years of experience in the biotechnology industry and annual base
salary.
# plot annual base salary vs years of experience
ggplot(biotech, aes(x = years_of_experience,
y = annual_base_salary)) +
geom_point()
It looks like there might be correlation between years of experience
and compensation, but it’s pretty noisy. Based on our preliminary
analyses, the biggest contributors to variance are likely level of
education and location. We can separate this plot into a grid of scatter
plots for each category using facet_wrap()
.
# generate a separate scatter plot for each level of education
ggplot(biotech, aes(x = years_of_experience,
y = annual_base_salary)) +
geom_point() +
facet_wrap(~highest_achieved_formal_education)
Another feature of ggplot()
is the ability to color,
shape, or size points by other variables in the data set. In the
previous plot, we were interested in the correlation between base salary
and years of experience. Now, we’ll visualize the
approximate_company_size
variable on the same plot using
the color
aesthetic. Since the default colors used by
ggplot
are not colorblind-friendly, we’ll add a layer that
changes the point colors to a palette that is.
# change the color of points to reflect size of employer organization
ggplot(biotech, aes(x = years_of_experience,
y = annual_base_salary,
color = approximate_company_size)) +
geom_point() +
facet_wrap(~highest_achieved_formal_education) +
scale_color_colorblind()
## Warning: Removed 2 rows containing missing values (geom_point).
One extremely common task I often run into is the re-ordering of
legend variables. For ease of interpretation, we want the company size
ranges to either ascend or descend in order. This is not automatically
implemented by R because the ranges are considered a “character” or
“string” data type, so R therefor does not recognize the values as
numbers that can be ordered. However, we can manually override this
ordering using fct_relevel
.
The first step in this process is to create a vector that specifies the desired order:
# specify the categories in the desired order
desired_order <- c("1-50", "50-200",
"200-1000", "1000-5000", "5000+")
Next, we can use mutate()
to change the data type of the
column to a manually ordered “factor” column and pass that directly to
ggplot. Note that the results of the mutate()
function
pipes into ggplot()
, and therefore we do not have to
specify the data set as our first argument to the ggplot()
function.
# make plot with reordered legend variables
biotech %>%
mutate(
approximate_company_size = fct_relevel(
approximate_company_size, desired_order)
) %>%
ggplot(., aes(x = years_of_experience,
y = annual_base_salary,
color = approximate_company_size)) +
geom_point() +
facet_wrap(~highest_achieved_formal_education) +
scale_color_colorblind()
## Warning: Removed 2 rows containing missing values (geom_point).
Finally, we can apply a number of layers to the plot to make it
publication ready. The layer labs()
will allow us to change
the legends, and we can specify point size and degree of point opacity
inside geom_point()
. Additionally, we can pre-filter NAs
from our data set before passing the data to ggplot
using
the filter()
function.
# make it pretty
biotech %>%
mutate(
approximate_company_size = fct_relevel(
approximate_company_size, desired_order)
) %>%
filter(!is.na(approximate_company_size)) %>%
ggplot(., aes(x = years_of_experience,
y = annual_base_salary,
color = approximate_company_size)) +
geom_point(size = 2, alpha = 0.8) +
facet_wrap(~highest_achieved_formal_education) +
scale_color_colorblind() +
labs(x = "Years of Experience", y = "Base salary",
color = "Company Size")
Problem Set #3:
biotech
data to contain only the following,
and then assign it to a new data frame:
location
column)biotech_sub_industry
column)years_of_experience
on the
x-axis and annual_base_salary
on the y-axis.highest_achieved_formal_education
column.labs()
.
Bonus: Change the shape of the points to reflect the
company_details
column.# make filtered data set
filtered_df <- biotech %>%
filter(location %in% c("Bay Area", "Boston Metro Area")) %>%
filter(biotech_sub_industry == "Pharmaceutical")
# generate plot for pharma workers
ggplot(filtered_df, aes(x = years_of_experience,
y = annual_base_salary,
color = highest_achieved_formal_education,
shape = company_details)) +
geom_point(size = 3, alpha = 0.8) +
scale_color_manual(values = c("#E69F00", "#56B4E9", "#009E73",
"#0072B2", "#D55E00", "#CC79A7")) +
labs(x = "Years of Experience", y = "Base salary",
color = "Education", shape = "Organization\tType")
Challenge Problem: If that was easy…
biotech
data to contain only:
x = annual_base_salary
,
fill = highest_achieved_formal_education
, and then use
geom_density()
).facet_wrap()
with location to make a grid of 3
density distribution plots.# find the top 3 locations with most data points
top3 <- biotech %>%
group_by(location) %>%
tally() %>%
top_n(3, wt = n)
target_education <- c("Bachelor's Degree or Equivalent",
"Master's Degree or Equivalent",
"Ph.D or Equivalent")
# generate plot
biotech %>%
filter(location %in% top3$location) %>%
filter(highest_achieved_formal_education %in% target_education) %>%
mutate( # make education labels wrap
edu_pretty = str_wrap(
highest_achieved_formal_education,
width = 20)) %>%
ggplot(., aes(x = annual_base_salary,
fill = edu_pretty)) +
geom_density(alpha = 0.75) +
facet_wrap(~location) + # make a plot for each location
scale_fill_manual(values = c("#E69F00", "#56B4E9", "#009E73",
"#0072B2", "#D55E00", "#CC79A7")) +
labs(x = "Base salary", y = "Density",
fill = "Education") +
theme(legend.position = "top") # move legend to top