library(readr)
# here the first argument is a path
<- read_csv("data/cms_hospital_patient_satisfaction.csv")
cms_data # convert column names to R standard
library(janitor) # remember to install janitor: install.packages("janitor")
<- cms_data |> clean_names() cms_data
Step 3: Transforming Data
The dplyr
package (part of the tidyverse), aims to simplify the process of manipulating and transforming data in a straightforward and user-friendly manner. This is the second step in the tidyverse workflow.
A central principle of tidyverse packages, is the emphasis on minimizing the number of keystrokes and characters needed to attain desired results. In dplyr
, the use of quotation marks for column names in data frames is often unnecessary. Another noteworthy aspect is that both the input to and output from all functions are in the form of data frames.
dplyr
package offers a set of key functions, referred to as ‘verbs’, which can be combined to achieve specific and targeted outcomes. Users familiar with functions in Microsoft Excel may recognize similarities in the functionality provided by dplyr
.
Before we delve into these functions in detail, let’s first explore filtering or subsetting data frames using base R functions.
Subsetting Data Frames
A frequently encountered task in data manipulation is filtering or subsetting data to a more focused and potentially relevant subset of values. Data frames (or tibbles) can be subset using base R functions.
Let’s start by reading the cms_hospital_patient_satisfaction_2016_sampled.csv into a data frame using the read_csv()
function in readr
package:
Subset by position
Here we use the [row, col]
syntax to subset data frames.
- To display a single value:
# display the value in 4th row and 2nd column
4, 2] cms_data[
Output
facility_name |
---|
MERCY HOSPITAL FORT SMITH |
- To display a single row: Here, the column value is omitted, thereby retrieving the entire column.
# display the 4th row
4,] cms_data[
Output
id | facility_name | county | hospital_type | star_rating | no_of_surveys | response_rate | overall_rating |
---|---|---|---|---|---|---|---|
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 |
- To display a single column: (Note: columns can be given by name as well) Here, the row value is omitted, thereby retrieving the entire column.
# display the 3rd column
3] # is same as cms_data[3]
cms_data[, # display the 3rd column (or County column)
"county"] # is same as cms_data["County"] cms_data[,
Output
county |
---|
SAN DIEGO |
COOK |
LAKE |
SEBASTIAN |
SHELBY |
BRAZOS |
GREENE |
MONTEZUMA |
VIRGINIA BEACH |
FAYETTE |
LOS ANGELES |
MIAMI-DADE |
SUMNER |
ISLAND |
LOS ANGELES |
county |
---|
SAN DIEGO |
COOK |
LAKE |
SEBASTIAN |
SHELBY |
BRAZOS |
GREENE |
MONTEZUMA |
VIRGINIA BEACH |
FAYETTE |
LOS ANGELES |
MIAMI-DADE |
SUMNER |
ISLAND |
LOS ANGELES |
- To display a range of rows:
# using a vector of indexes
c(3, 5, 1), ]
cms_data[# subsetting
2:4,] cms_data[
Output
id | facility_name | county | hospital_type | star_rating | no_of_surveys | response_rate | overall_rating |
---|---|---|---|---|---|---|---|
100051 | SOUTH LAKE HOSPITAL | LAKE | Acute Care Hospital | 2 | 1382 | 20 | 2 |
440048 | BAPTIST MEMORIAL HOSPITAL | SHELBY | Acute Care Hospital | 2 | 1799 | 18 | 2 |
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 |
id | facility_name | county | hospital_type | star_rating | no_of_surveys | response_rate | overall_rating |
---|---|---|---|---|---|---|---|
140103 | ST BERNARD HOSPITAL | COOK | Acute Care Hospital | 1 | 264 | 6 | 2 |
100051 | SOUTH LAKE HOSPITAL | LAKE | Acute Care Hospital | 2 | 1382 | 20 | 2 |
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 |
- To display a range of columns:
# using a vector of indexes
c(1, 3, 6), ]
cms_data[, # subsetting
4:6]
cms_data[, # using a vector of column names
c("hospital_type", "overall_rating")] cms_data[,
Output
id | county | no_of_surveys |
---|---|---|
050424 | SAN DIEGO | 3110 |
140103 | COOK | 264 |
100051 | LAKE | 1382 |
040062 | SEBASTIAN | 2506 |
440048 | SHELBY | 1799 |
450011 | BRAZOS | 1379 |
151317 | GREENE | 114 |
061327 | MONTEZUMA | 247 |
490057 | VIRGINIA BEACH | 619 |
110215 | FAYETTE | 1714 |
050704 | LOS ANGELES | 241 |
100296 | MIAMI-DADE | 393 |
440003 | SUMNER | 680 |
501339 | ISLAND | 389 |
050116 | LOS ANGELES | 1110 |
hospital_type | star_rating | no_of_surveys |
---|---|---|
Acute Care Hospital | 4 | 3110 |
Acute Care Hospital | 1 | 264 |
Acute Care Hospital | 2 | 1382 |
Acute Care Hospital | 3 | 2506 |
Acute Care Hospital | 2 | 1799 |
Acute Care Hospital | 3 | 1379 |
Critical Access Hospital | 3 | 114 |
Critical Access Hospital | 4 | 247 |
Acute Care Hospital | 4 | 619 |
Acute Care Hospital | 2 | 1714 |
Acute Care Hospital | 3 | 241 |
Acute Care Hospital | 4 | 393 |
Acute Care Hospital | 4 | 680 |
Critical Access Hospital | 3 | 389 |
Acute Care Hospital | 3 | 1110 |
hospital_type | overall_rating |
---|---|
Acute Care Hospital | 5 |
Acute Care Hospital | 2 |
Acute Care Hospital | 2 |
Acute Care Hospital | 3 |
Acute Care Hospital | 2 |
Acute Care Hospital | 3 |
Critical Access Hospital | 3 |
Critical Access Hospital | 3 |
Acute Care Hospital | 3 |
Acute Care Hospital | 2 |
Acute Care Hospital | 3 |
Acute Care Hospital | 3 |
Acute Care Hospital | 2 |
Critical Access Hospital | 3 |
Acute Care Hospital | 2 |
- To display multiple rows and columns:
2:6, c("hospital_type", "no_of_surveys", "response_rate")] cms_data[
Output
hospital_type | no_of_surveys | response_rate |
---|---|---|
Acute Care Hospital | 264 | 6 |
Acute Care Hospital | 1382 | 20 |
Acute Care Hospital | 2506 | 35 |
Acute Care Hospital | 1799 | 18 |
Acute Care Hospital | 1379 | 24 |
- To exclude a column (use -):
# display the data frame without Star_rating column
-5]
cms_data[# display the data frame, include only the hospital information and location
c(-5, -6, -7, -8)] # or cms_data[c(1, 2, 3, 4)] cms_data[
Output
id | facility_name | county | hospital_type | no_of_surveys | response_rate | overall_rating |
---|---|---|---|---|---|---|
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 3110 | 41 | 5 |
140103 | ST BERNARD HOSPITAL | COOK | Acute Care Hospital | 264 | 6 | 2 |
100051 | SOUTH LAKE HOSPITAL | LAKE | Acute Care Hospital | 1382 | 20 | 2 |
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 2506 | 35 | 3 |
440048 | BAPTIST MEMORIAL HOSPITAL | SHELBY | Acute Care Hospital | 1799 | 18 | 2 |
450011 | ST JOSEPH REGIONAL HEALTH CENTER | BRAZOS | Acute Care Hospital | 1379 | 24 | 3 |
151317 | GREENE COUNTY GENERAL HOSPITAL | GREENE | Critical Access Hospital | 114 | 22 | 3 |
061327 | SOUTHWEST MEMORIAL HOSPITAL | MONTEZUMA | Critical Access Hospital | 247 | 34 | 3 |
490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital | 619 | 32 | 3 |
110215 | PIEDMONT FAYETTE HOSPITAL | FAYETTE | Acute Care Hospital | 1714 | 21 | 2 |
050704 | MISSION COMMUNITY HOSPITAL | LOS ANGELES | Acute Care Hospital | 241 | 14 | 3 |
100296 | DOCTORS HOSPITAL | MIAMI-DADE | Acute Care Hospital | 393 | 24 | 3 |
440003 | SUMNER REGIONAL MEDICAL CENTER | SUMNER | Acute Care Hospital | 680 | 35 | 2 |
501339 | WHIDBEY GENERAL HOSPITAL | ISLAND | Critical Access Hospital | 389 | 29 | 3 |
050116 | NORTHRIDGE MEDICAL CENTER | LOS ANGELES | Acute Care Hospital | 1110 | 20 | 2 |
id | facility_name | county | hospital_type |
---|---|---|---|
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital |
140103 | ST BERNARD HOSPITAL | COOK | Acute Care Hospital |
100051 | SOUTH LAKE HOSPITAL | LAKE | Acute Care Hospital |
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital |
440048 | BAPTIST MEMORIAL HOSPITAL | SHELBY | Acute Care Hospital |
450011 | ST JOSEPH REGIONAL HEALTH CENTER | BRAZOS | Acute Care Hospital |
151317 | GREENE COUNTY GENERAL HOSPITAL | GREENE | Critical Access Hospital |
061327 | SOUTHWEST MEMORIAL HOSPITAL | MONTEZUMA | Critical Access Hospital |
490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital |
110215 | PIEDMONT FAYETTE HOSPITAL | FAYETTE | Acute Care Hospital |
050704 | MISSION COMMUNITY HOSPITAL | LOS ANGELES | Acute Care Hospital |
100296 | DOCTORS HOSPITAL | MIAMI-DADE | Acute Care Hospital |
440003 | SUMNER REGIONAL MEDICAL CENTER | SUMNER | Acute Care Hospital |
501339 | WHIDBEY GENERAL HOSPITAL | ISLAND | Critical Access Hospital |
050116 | NORTHRIDGE MEDICAL CENTER | LOS ANGELES | Acute Care Hospital |
Subset by condition
Apart from subsetting a vector based on the position of values, we can also ask questions about the set of values to R, and it will respond with TRUE or FALSE answers. This is often done using logical expressions to filter the data.
- Which cells contain the value “LOS ANGELES”?
== "LOS ANGELES" cms_data
Output
id facility_name county hospital_type star_rating no_of_surveys
[1,] FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE FALSE FALSE
[11,] FALSE FALSE TRUE FALSE FALSE FALSE
[12,] FALSE FALSE FALSE FALSE FALSE FALSE
[13,] FALSE FALSE FALSE FALSE FALSE FALSE
[14,] FALSE FALSE FALSE FALSE FALSE FALSE
[15,] FALSE FALSE TRUE FALSE FALSE FALSE
response_rate overall_rating
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] FALSE FALSE
[5,] FALSE FALSE
[6,] FALSE FALSE
[7,] FALSE FALSE
[8,] FALSE FALSE
[9,] FALSE FALSE
[10,] FALSE FALSE
[11,] FALSE FALSE
[12,] FALSE FALSE
[13,] FALSE FALSE
[14,] FALSE FALSE
[15,] FALSE FALSE
R returns TRUE
for values that satisfy the condition, and FALSE
for those that don’t.
How many cells contain the word “LOS ANGELES”?
We can use the sum()
function to compute the number of occurrences of “LOS ANGELES” in the data frame as it treats TRUE
as 1 and FALSE
as 0.
sum(cms_data == "LOS ANGELES")
Output
[1] 2
- Which cells contain the value 3?
== 3 cms_data
Output
id facility_name county hospital_type star_rating no_of_surveys
[1,] FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE TRUE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE TRUE FALSE
[7,] FALSE FALSE FALSE FALSE TRUE FALSE
[8,] FALSE FALSE FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE FALSE FALSE
[11,] FALSE FALSE FALSE FALSE TRUE FALSE
[12,] FALSE FALSE FALSE FALSE FALSE FALSE
[13,] FALSE FALSE FALSE FALSE FALSE FALSE
[14,] FALSE FALSE FALSE FALSE TRUE FALSE
[15,] FALSE FALSE FALSE FALSE TRUE FALSE
response_rate overall_rating
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] FALSE TRUE
[5,] FALSE FALSE
[6,] FALSE TRUE
[7,] FALSE TRUE
[8,] FALSE TRUE
[9,] FALSE TRUE
[10,] FALSE FALSE
[11,] FALSE TRUE
[12,] FALSE TRUE
[13,] FALSE FALSE
[14,] FALSE TRUE
[15,] FALSE FALSE
We will use the comparison operators (see Section: Comparison Operators and Expressions) and logical operators (see Section: Logical Operators and Expressions) we explored in the subsequent sections.
- Find all the facilities with a star rating above 3.
<- cms_data$star_rating > 3
sr_above_3 cms_data[sr_above_3, ]
Output
id | facility_name | county | hospital_type | star_rating | no_of_surveys | response_rate | overall_rating |
---|---|---|---|---|---|---|---|
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 |
061327 | SOUTHWEST MEMORIAL HOSPITAL | MONTEZUMA | Critical Access Hospital | 4 | 247 | 34 | 3 |
490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital | 4 | 619 | 32 | 3 |
100296 | DOCTORS HOSPITAL | MIAMI-DADE | Acute Care Hospital | 4 | 393 | 24 | 3 |
440003 | SUMNER REGIONAL MEDICAL CENTER | SUMNER | Acute Care Hospital | 4 | 680 | 35 | 2 |
- Find all the facility names with a star rating above 2.
2] cms_data[sr_above_3,
Output
facility_name |
---|
SCRIPPS GREEN HOSPITAL |
SOUTHWEST MEMORIAL HOSPITAL |
SENTARA GENERAL HOSPITAL |
DOCTORS HOSPITAL |
SUMNER REGIONAL MEDICAL CENTER |
- Find all the facilities with an overall rating of at least 3 and the response rate is above 30%.
# overall rating of at least 3
<- cms_data["overall_rating"] >= 3
orate_aleast_3 # response rate above 30\%
<- cms_data[7] > 30
rrate_above_30 # both conditions has to satisfy. hence and (&) operator
& rrate_above_30), ] cms_data[(orate_aleast_3
Output
id | facility_name | county | hospital_type | star_rating | no_of_surveys | response_rate | overall_rating |
---|---|---|---|---|---|---|---|
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 |
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 |
061327 | SOUTHWEST MEMORIAL HOSPITAL | MONTEZUMA | Critical Access Hospital | 4 | 247 | 34 | 3 |
490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital | 4 | 619 | 32 | 3 |
# the above 3 commands can be combined into a single-line command
"overall_rating"] >= 3) & (cms_data[7] > 30), ] cms_data[(cms_data[
Output
id | facility_name | county | hospital_type | star_rating | no_of_surveys | response_rate | overall_rating |
---|---|---|---|---|---|---|---|
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 |
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 |
061327 | SOUTHWEST MEMORIAL HOSPITAL | MONTEZUMA | Critical Access Hospital | 4 | 247 | 34 | 3 |
490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital | 4 | 619 | 32 | 3 |
Operators in R have a specific precedence, and they are executed either from left \(\rightarrow\) right or right \(\rightarrow\) left. This can impact the expected result if you are not mindful of operator precedence. It is advisable to use parentheses to explicitly define the order of operations since parentheses have the highest precedence. Refer to this image for a operator precedence table.
For example: 5 + 3 * 2 is not the same as (5 + 3) * 2. The multiplication operation takes precedence over addition, potentially leading to unexpected results. Using parentheses ensures that the addition operation is performed first, providing the desired outcome.
- Find the county of hospitals with any rating greater than or equal to 3.
# there are two ratings:
# 1. star_rating >= 3
<- cms_data$star_rating >= 3
srate_3 # 2. overall_rating >= 3
<- cms_data["overall_rating"] >= 3
orate_3 # any means at least one has to be >= 3 -> or operator
| orate_3, ]
cms_data[srate_3 # we only needs the county names
| orate_3, "county"] cms_data[srate_3
Output
id | facility_name | county | hospital_type | star_rating | no_of_surveys | response_rate | overall_rating |
---|---|---|---|---|---|---|---|
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 |
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 |
450011 | ST JOSEPH REGIONAL HEALTH CENTER | BRAZOS | Acute Care Hospital | 3 | 1379 | 24 | 3 |
151317 | GREENE COUNTY GENERAL HOSPITAL | GREENE | Critical Access Hospital | 3 | 114 | 22 | 3 |
061327 | SOUTHWEST MEMORIAL HOSPITAL | MONTEZUMA | Critical Access Hospital | 4 | 247 | 34 | 3 |
490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital | 4 | 619 | 32 | 3 |
050704 | MISSION COMMUNITY HOSPITAL | LOS ANGELES | Acute Care Hospital | 3 | 241 | 14 | 3 |
100296 | DOCTORS HOSPITAL | MIAMI-DADE | Acute Care Hospital | 4 | 393 | 24 | 3 |
440003 | SUMNER REGIONAL MEDICAL CENTER | SUMNER | Acute Care Hospital | 4 | 680 | 35 | 2 |
501339 | WHIDBEY GENERAL HOSPITAL | ISLAND | Critical Access Hospital | 3 | 389 | 29 | 3 |
050116 | NORTHRIDGE MEDICAL CENTER | LOS ANGELES | Acute Care Hospital | 3 | 1110 | 20 | 2 |
county |
---|
SAN DIEGO |
SEBASTIAN |
BRAZOS |
GREENE |
MONTEZUMA |
VIRGINIA BEACH |
LOS ANGELES |
MIAMI-DADE |
SUMNER |
ISLAND |
LOS ANGELES |
- How many hospitals are categorized as Acute Care Hospital?
# all the hospitals categorized as Acute Care Hospitals
<- cms_data$hospital_type == "Acute Care Hospital"
hosp_acute # summing the vector of TRUE (1) and FALSE (0) values
sum(hosp_acute)
Output
[1] 12
- Find the summary statistics number of surveys conducted at Critical Access Hospital.
# all the hospitals categorized as Critical Access Hospitals
<- cms_data$hospital_type == "Critical Access Hospital"
hosp_crit # number of surveys conducted in Critical Access Hospitals
<- cms_data[hosp_crit, "no_of_surveys"]
nsurv_crit # summary statistics of the number of surveys
summary(nsurv_crit)
Output
no_of_surveys
Min. :114.0
1st Qu.:180.5
Median :247.0
Mean :250.0
3rd Qu.:318.0
Max. :389.0
This approach can be somewhat cumbersome. It requires repeated referencing of the data frame name, leading to a multiple use of punctuation that needs careful management. In the following section, we will leverage the dplyr
package and its functions to craft more concise code, and efficient data manipulations.
Data manipulation with `dplyr’ functions
You’ll primarily use six key dplyr
functions for data manipulations:
filter()
: pick observations based on their values.select()
: pick variables by their names.mutate()
: create new variables using functions applied to existing variables.summarise()
: collapse multiple values into a single summary.group_by()
: group the rows based on specified criteria.arrange()
: reorder the rows based on specified criteria.
If you’ve already installed the tidyverse package (if not, you can do so by running the command: install.packages("tidyverse")
), let’s proceed to load it into our R session first:
library(tidyverse)
filter()
The filter()
function takes logical expressions and returns the rows for which all are TRUE
.
Example 1: Filter the cms_data data frame to find all the facilities with an overall_rating of 3.
|> filter(overall_rating == 3) cms_data
Output
id | facility_name | county | hospital_type | star_rating | no_of_surveys | response_rate | overall_rating |
---|---|---|---|---|---|---|---|
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 |
450011 | ST JOSEPH REGIONAL HEALTH CENTER | BRAZOS | Acute Care Hospital | 3 | 1379 | 24 | 3 |
151317 | GREENE COUNTY GENERAL HOSPITAL | GREENE | Critical Access Hospital | 3 | 114 | 22 | 3 |
061327 | SOUTHWEST MEMORIAL HOSPITAL | MONTEZUMA | Critical Access Hospital | 4 | 247 | 34 | 3 |
490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital | 4 | 619 | 32 | 3 |
050704 | MISSION COMMUNITY HOSPITAL | LOS ANGELES | Acute Care Hospital | 3 | 241 | 14 | 3 |
100296 | DOCTORS HOSPITAL | MIAMI-DADE | Acute Care Hospital | 4 | 393 | 24 | 3 |
501339 | WHIDBEY GENERAL HOSPITAL | ISLAND | Critical Access Hospital | 3 | 389 | 29 | 3 |
Here we are sending the cms_data data frame into the function filter()
which tests each value in overall_rating column for the value 3 and returns the rows where this condition is TRUE.
You can check the dimension (number of rows and number of columns) of the resulting data frame by sending into the dim()
function as follows:
|> filter(overall_rating == 3) |> dim() cms_data
Output
[1] 8 8
Example 2: Find all the facilities categorized as “Acute Care Hospital”. Here we filter on character data.
|> filter(hospital_type == "Acute Care Hospital") cms_data
Output
id | facility_name | county | hospital_type | star_rating | no_of_surveys | response_rate | overall_rating |
---|---|---|---|---|---|---|---|
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 |
140103 | ST BERNARD HOSPITAL | COOK | Acute Care Hospital | 1 | 264 | 6 | 2 |
100051 | SOUTH LAKE HOSPITAL | LAKE | Acute Care Hospital | 2 | 1382 | 20 | 2 |
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 |
440048 | BAPTIST MEMORIAL HOSPITAL | SHELBY | Acute Care Hospital | 2 | 1799 | 18 | 2 |
450011 | ST JOSEPH REGIONAL HEALTH CENTER | BRAZOS | Acute Care Hospital | 3 | 1379 | 24 | 3 |
490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital | 4 | 619 | 32 | 3 |
110215 | PIEDMONT FAYETTE HOSPITAL | FAYETTE | Acute Care Hospital | 2 | 1714 | 21 | 2 |
050704 | MISSION COMMUNITY HOSPITAL | LOS ANGELES | Acute Care Hospital | 3 | 241 | 14 | 3 |
100296 | DOCTORS HOSPITAL | MIAMI-DADE | Acute Care Hospital | 4 | 393 | 24 | 3 |
440003 | SUMNER REGIONAL MEDICAL CENTER | SUMNER | Acute Care Hospital | 4 | 680 | 35 | 2 |
050116 | NORTHRIDGE MEDICAL CENTER | LOS ANGELES | Acute Care Hospital | 3 | 1110 | 20 | 2 |
We can use logical operators introduced before to combine multiple conditions as follows.
Example 3: Find all the facilities categorized as “Acute Care Hospital” and has a overall rating of above 3.
|> filter(hospital_type == "Acute Care Hospital" & overall_rating > 3) cms_data
Output
id | facility_name | county | hospital_type | star_rating | no_of_surveys | response_rate | overall_rating |
---|---|---|---|---|---|---|---|
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 |
Example 4: Find the facilities with any rating greater than or equal to 3.
|> filter(star_rating >= 3 | overall_rating >= 3) cms_data
Output
id | facility_name | county | hospital_type | star_rating | no_of_surveys | response_rate | overall_rating |
---|---|---|---|---|---|---|---|
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 |
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 |
450011 | ST JOSEPH REGIONAL HEALTH CENTER | BRAZOS | Acute Care Hospital | 3 | 1379 | 24 | 3 |
151317 | GREENE COUNTY GENERAL HOSPITAL | GREENE | Critical Access Hospital | 3 | 114 | 22 | 3 |
061327 | SOUTHWEST MEMORIAL HOSPITAL | MONTEZUMA | Critical Access Hospital | 4 | 247 | 34 | 3 |
490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital | 4 | 619 | 32 | 3 |
050704 | MISSION COMMUNITY HOSPITAL | LOS ANGELES | Acute Care Hospital | 3 | 241 | 14 | 3 |
100296 | DOCTORS HOSPITAL | MIAMI-DADE | Acute Care Hospital | 4 | 393 | 24 | 3 |
440003 | SUMNER REGIONAL MEDICAL CENTER | SUMNER | Acute Care Hospital | 4 | 680 | 35 | 2 |
501339 | WHIDBEY GENERAL HOSPITAL | ISLAND | Critical Access Hospital | 3 | 389 | 29 | 3 |
050116 | NORTHRIDGE MEDICAL CENTER | LOS ANGELES | Acute Care Hospital | 3 | 1110 | 20 | 2 |
Example 5: Find the facilitites with any rating greater than or equal to 3 and the response rate is above 30.
|> filter(star_rating >= 3 | overall_rating >= 3 & response_rate > 30) cms_data
Output
id | facility_name | county | hospital_type | star_rating | no_of_surveys | response_rate | overall_rating |
---|---|---|---|---|---|---|---|
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 |
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 |
450011 | ST JOSEPH REGIONAL HEALTH CENTER | BRAZOS | Acute Care Hospital | 3 | 1379 | 24 | 3 |
151317 | GREENE COUNTY GENERAL HOSPITAL | GREENE | Critical Access Hospital | 3 | 114 | 22 | 3 |
061327 | SOUTHWEST MEMORIAL HOSPITAL | MONTEZUMA | Critical Access Hospital | 4 | 247 | 34 | 3 |
490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital | 4 | 619 | 32 | 3 |
050704 | MISSION COMMUNITY HOSPITAL | LOS ANGELES | Acute Care Hospital | 3 | 241 | 14 | 3 |
100296 | DOCTORS HOSPITAL | MIAMI-DADE | Acute Care Hospital | 4 | 393 | 24 | 3 |
440003 | SUMNER REGIONAL MEDICAL CENTER | SUMNER | Acute Care Hospital | 4 | 680 | 35 | 2 |
501339 | WHIDBEY GENERAL HOSPITAL | ISLAND | Critical Access Hospital | 3 | 389 | 29 | 3 |
050116 | NORTHRIDGE MEDICAL CENTER | LOS ANGELES | Acute Care Hospital | 3 | 1110 | 20 | 2 |
The output of the above command is incorrect. Recall R operator precedence where & operator precedes | operator. Therefore, the command overall_rating >= 3 & response_rate > 30
is evaluated first. This can be verified by adding brackets around this command as follows: To fix the issue add brackets as follows:
|> filter(star_rating >= 3 | (overall_rating >= 3 & response_rate > 30)) cms_data
Output
id | facility_name | county | hospital_type | star_rating | no_of_surveys | response_rate | overall_rating |
---|---|---|---|---|---|---|---|
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 |
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 |
450011 | ST JOSEPH REGIONAL HEALTH CENTER | BRAZOS | Acute Care Hospital | 3 | 1379 | 24 | 3 |
151317 | GREENE COUNTY GENERAL HOSPITAL | GREENE | Critical Access Hospital | 3 | 114 | 22 | 3 |
061327 | SOUTHWEST MEMORIAL HOSPITAL | MONTEZUMA | Critical Access Hospital | 4 | 247 | 34 | 3 |
490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital | 4 | 619 | 32 | 3 |
050704 | MISSION COMMUNITY HOSPITAL | LOS ANGELES | Acute Care Hospital | 3 | 241 | 14 | 3 |
100296 | DOCTORS HOSPITAL | MIAMI-DADE | Acute Care Hospital | 4 | 393 | 24 | 3 |
440003 | SUMNER REGIONAL MEDICAL CENTER | SUMNER | Acute Care Hospital | 4 | 680 | 35 | 2 |
501339 | WHIDBEY GENERAL HOSPITAL | ISLAND | Critical Access Hospital | 3 | 389 | 29 | 3 |
050116 | NORTHRIDGE MEDICAL CENTER | LOS ANGELES | Acute Care Hospital | 3 | 1110 | 20 | 2 |
To fix the issue add brackets as follows:
|> filter((star_rating >= 3 | overall_rating >= 3) & response_rate > 30) cms_data
Output
id | facility_name | county | hospital_type | star_rating | no_of_surveys | response_rate | overall_rating |
---|---|---|---|---|---|---|---|
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 |
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 |
061327 | SOUTHWEST MEMORIAL HOSPITAL | MONTEZUMA | Critical Access Hospital | 4 | 247 | 34 | 3 |
490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital | 4 | 619 | 32 | 3 |
440003 | SUMNER REGIONAL MEDICAL CENTER | SUMNER | Acute Care Hospital | 4 | 680 | 35 | 2 |
This results in the correct output.
%in%
helper
The %in%
function is used to determine whether elements of one vector are present in another vector. It returns a logical vector indicating whether each element of the first vector is found in the second vector.
When we want to filter a subset of rows that may contain multiple different values, it’s more efficient to provide a vector of the values of interest instead of combining multiple OR commands.
Example 6: Retrieve a subset of facilities that have an odd number of overall rating.
|> filter(overall_rating %in% c(1, 3, 5)) cms_data
Output
id | facility_name | county | hospital_type | star_rating | no_of_surveys | response_rate | overall_rating |
---|---|---|---|---|---|---|---|
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 |
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 |
450011 | ST JOSEPH REGIONAL HEALTH CENTER | BRAZOS | Acute Care Hospital | 3 | 1379 | 24 | 3 |
151317 | GREENE COUNTY GENERAL HOSPITAL | GREENE | Critical Access Hospital | 3 | 114 | 22 | 3 |
061327 | SOUTHWEST MEMORIAL HOSPITAL | MONTEZUMA | Critical Access Hospital | 4 | 247 | 34 | 3 |
490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital | 4 | 619 | 32 | 3 |
050704 | MISSION COMMUNITY HOSPITAL | LOS ANGELES | Acute Care Hospital | 3 | 241 | 14 | 3 |
100296 | DOCTORS HOSPITAL | MIAMI-DADE | Acute Care Hospital | 4 | 393 | 24 | 3 |
501339 | WHIDBEY GENERAL HOSPITAL | ISLAND | Critical Access Hospital | 3 | 389 | 29 | 3 |
str_detect
function
The str_detect()
function is part of the stringr
package and is used for pattern matching within strings. It allows you to search for a specific pattern or regular expression (discussed later) within a character vector or string.
Example 1: Find all the facilities that contains GENERAL in their name from the cms_data data frame.
|> filter(
cms_data str_detect(facility_name, 'GENERAL')
)
Output
id | facility_name | county | hospital_type | star_rating | no_of_surveys | response_rate | overall_rating |
---|---|---|---|---|---|---|---|
151317 | GREENE COUNTY GENERAL HOSPITAL | GREENE | Critical Access Hospital | 3 | 114 | 22 | 3 |
490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital | 4 | 619 | 32 | 3 |
501339 | WHIDBEY GENERAL HOSPITAL | ISLAND | Critical Access Hospital | 3 | 389 | 29 | 3 |
select()
The select()
function returns a subset of the variables or columns.
This function can accept column names (even without quotation marks) or the column position number starting from the left. Unlike in base R (we explore before), commands within the brackets in select()
do not need to be concatenated using c()
.
Example 1: Extract the facility name, hospital type and overall rating columns from cms_data data frame.
|> select(facility_name, hospital_type, overall_rating) cms_data
Output
facility_name | hospital_type | overall_rating |
---|---|---|
SCRIPPS GREEN HOSPITAL | Acute Care Hospital | 5 |
ST BERNARD HOSPITAL | Acute Care Hospital | 2 |
SOUTH LAKE HOSPITAL | Acute Care Hospital | 2 |
MERCY HOSPITAL FORT SMITH | Acute Care Hospital | 3 |
BAPTIST MEMORIAL HOSPITAL | Acute Care Hospital | 2 |
ST JOSEPH REGIONAL HEALTH CENTER | Acute Care Hospital | 3 |
GREENE COUNTY GENERAL HOSPITAL | Critical Access Hospital | 3 |
SOUTHWEST MEMORIAL HOSPITAL | Critical Access Hospital | 3 |
SENTARA GENERAL HOSPITAL | Acute Care Hospital | 3 |
PIEDMONT FAYETTE HOSPITAL | Acute Care Hospital | 2 |
MISSION COMMUNITY HOSPITAL | Acute Care Hospital | 3 |
DOCTORS HOSPITAL | Acute Care Hospital | 3 |
SUMNER REGIONAL MEDICAL CENTER | Acute Care Hospital | 2 |
WHIDBEY GENERAL HOSPITAL | Critical Access Hospital | 3 |
NORTHRIDGE MEDICAL CENTER | Acute Care Hospital | 2 |
Using column positions:
|> select(2, 4, 8) cms_data
Output
facility_name | hospital_type | overall_rating |
---|---|---|
SCRIPPS GREEN HOSPITAL | Acute Care Hospital | 5 |
ST BERNARD HOSPITAL | Acute Care Hospital | 2 |
SOUTH LAKE HOSPITAL | Acute Care Hospital | 2 |
MERCY HOSPITAL FORT SMITH | Acute Care Hospital | 3 |
BAPTIST MEMORIAL HOSPITAL | Acute Care Hospital | 2 |
ST JOSEPH REGIONAL HEALTH CENTER | Acute Care Hospital | 3 |
GREENE COUNTY GENERAL HOSPITAL | Critical Access Hospital | 3 |
SOUTHWEST MEMORIAL HOSPITAL | Critical Access Hospital | 3 |
SENTARA GENERAL HOSPITAL | Acute Care Hospital | 3 |
PIEDMONT FAYETTE HOSPITAL | Acute Care Hospital | 2 |
MISSION COMMUNITY HOSPITAL | Acute Care Hospital | 3 |
DOCTORS HOSPITAL | Acute Care Hospital | 3 |
SUMNER REGIONAL MEDICAL CENTER | Acute Care Hospital | 2 |
WHIDBEY GENERAL HOSPITAL | Critical Access Hospital | 3 |
NORTHRIDGE MEDICAL CENTER | Acute Care Hospital | 2 |
We can use the ‘-’ symbol to extract all columns except for specific ones:
|> dplyr::select(-id, -county_name, -star_rating, -no_of_surveys, -response_rate) cms_data
Output
facility_name | hospital_type | overall_rating |
---|---|---|
SCRIPPS GREEN HOSPITAL | Acute Care Hospital | 5 |
ST BERNARD HOSPITAL | Acute Care Hospital | 2 |
SOUTH LAKE HOSPITAL | Acute Care Hospital | 2 |
MERCY HOSPITAL FORT SMITH | Acute Care Hospital | 3 |
BAPTIST MEMORIAL HOSPITAL | Acute Care Hospital | 2 |
ST JOSEPH REGIONAL HEALTH CENTER | Acute Care Hospital | 3 |
GREENE COUNTY GENERAL HOSPITAL | Critical Access Hospital | 3 |
SOUTHWEST MEMORIAL HOSPITAL | Critical Access Hospital | 3 |
SENTARA GENERAL HOSPITAL | Acute Care Hospital | 3 |
PIEDMONT FAYETTE HOSPITAL | Acute Care Hospital | 2 |
MISSION COMMUNITY HOSPITAL | Acute Care Hospital | 3 |
DOCTORS HOSPITAL | Acute Care Hospital | 3 |
SUMNER REGIONAL MEDICAL CENTER | Acute Care Hospital | 2 |
WHIDBEY GENERAL HOSPITAL | Critical Access Hospital | 3 |
NORTHRIDGE MEDICAL CENTER | Acute Care Hospital | 2 |
Or use a combination of column names and positions:
|> select(2, 4, overall_rating) cms_data
Output
facility_name | hospital_type | overall_rating |
---|---|---|
SCRIPPS GREEN HOSPITAL | Acute Care Hospital | 5 |
ST BERNARD HOSPITAL | Acute Care Hospital | 2 |
SOUTH LAKE HOSPITAL | Acute Care Hospital | 2 |
MERCY HOSPITAL FORT SMITH | Acute Care Hospital | 3 |
BAPTIST MEMORIAL HOSPITAL | Acute Care Hospital | 2 |
ST JOSEPH REGIONAL HEALTH CENTER | Acute Care Hospital | 3 |
GREENE COUNTY GENERAL HOSPITAL | Critical Access Hospital | 3 |
SOUTHWEST MEMORIAL HOSPITAL | Critical Access Hospital | 3 |
SENTARA GENERAL HOSPITAL | Acute Care Hospital | 3 |
PIEDMONT FAYETTE HOSPITAL | Acute Care Hospital | 2 |
MISSION COMMUNITY HOSPITAL | Acute Care Hospital | 3 |
DOCTORS HOSPITAL | Acute Care Hospital | 3 |
SUMNER REGIONAL MEDICAL CENTER | Acute Care Hospital | 2 |
WHIDBEY GENERAL HOSPITAL | Critical Access Hospital | 3 |
NORTHRIDGE MEDICAL CENTER | Acute Care Hospital | 2 |
Useful helper functions
The select helper functions (check ?select_helpers
) are a set of convenience functions provided by the dplyr
package. These functions offer shortcuts for selecting columns based on specific criteria or patterns, making it easier to work with data frames.
Some commonly used select helper functions include:
starts_with()
: selects columns that start with a specified prefix.
|> select(starts_with('s')) cms_data
Output
star_rating |
---|
4 |
1 |
2 |
3 |
2 |
3 |
3 |
4 |
4 |
2 |
3 |
4 |
4 |
3 |
3 |
ends_with()
: selects columns that end with a specified suffix.
|> select(ends_with('g')) cms_data
Output
star_rating | overall_rating |
---|---|
4 | 5 |
1 | 2 |
2 | 2 |
3 | 3 |
2 | 2 |
3 | 3 |
3 | 3 |
4 | 3 |
4 | 3 |
2 | 2 |
3 | 3 |
4 | 3 |
4 | 2 |
3 | 3 |
3 | 2 |
contains()
: selects columns that contain a specified substring.
|> select(contains('name')) cms_data
Output
facility_name |
---|
SCRIPPS GREEN HOSPITAL |
ST BERNARD HOSPITAL |
SOUTH LAKE HOSPITAL |
MERCY HOSPITAL FORT SMITH |
BAPTIST MEMORIAL HOSPITAL |
ST JOSEPH REGIONAL HEALTH CENTER |
GREENE COUNTY GENERAL HOSPITAL |
SOUTHWEST MEMORIAL HOSPITAL |
SENTARA GENERAL HOSPITAL |
PIEDMONT FAYETTE HOSPITAL |
MISSION COMMUNITY HOSPITAL |
DOCTORS HOSPITAL |
SUMNER REGIONAL MEDICAL CENTER |
WHIDBEY GENERAL HOSPITAL |
NORTHRIDGE MEDICAL CENTER |
|> select(contains('f')) cms_data
Output
facility_name | no_of_surveys |
---|---|
SCRIPPS GREEN HOSPITAL | 3110 |
ST BERNARD HOSPITAL | 264 |
SOUTH LAKE HOSPITAL | 1382 |
MERCY HOSPITAL FORT SMITH | 2506 |
BAPTIST MEMORIAL HOSPITAL | 1799 |
ST JOSEPH REGIONAL HEALTH CENTER | 1379 |
GREENE COUNTY GENERAL HOSPITAL | 114 |
SOUTHWEST MEMORIAL HOSPITAL | 247 |
SENTARA GENERAL HOSPITAL | 619 |
PIEDMONT FAYETTE HOSPITAL | 1714 |
MISSION COMMUNITY HOSPITAL | 241 |
DOCTORS HOSPITAL | 393 |
SUMNER REGIONAL MEDICAL CENTER | 680 |
WHIDBEY GENERAL HOSPITAL | 389 |
NORTHRIDGE MEDICAL CENTER | 1110 |
matches()
: selects columns that match a specified regular expression pattern.
|> select(
cms_data matches('[a-z]_[a-z]{4}$')
)
Output
facility_name | hospital_type | response_rate |
---|---|---|
SCRIPPS GREEN HOSPITAL | Acute Care Hospital | 41 |
ST BERNARD HOSPITAL | Acute Care Hospital | 6 |
SOUTH LAKE HOSPITAL | Acute Care Hospital | 20 |
MERCY HOSPITAL FORT SMITH | Acute Care Hospital | 35 |
BAPTIST MEMORIAL HOSPITAL | Acute Care Hospital | 18 |
ST JOSEPH REGIONAL HEALTH CENTER | Acute Care Hospital | 24 |
GREENE COUNTY GENERAL HOSPITAL | Critical Access Hospital | 22 |
SOUTHWEST MEMORIAL HOSPITAL | Critical Access Hospital | 34 |
SENTARA GENERAL HOSPITAL | Acute Care Hospital | 32 |
PIEDMONT FAYETTE HOSPITAL | Acute Care Hospital | 21 |
MISSION COMMUNITY HOSPITAL | Acute Care Hospital | 14 |
DOCTORS HOSPITAL | Acute Care Hospital | 24 |
SUMNER REGIONAL MEDICAL CENTER | Acute Care Hospital | 35 |
WHIDBEY GENERAL HOSPITAL | Critical Access Hospital | 29 |
NORTHRIDGE MEDICAL CENTER | Acute Care Hospital | 20 |
Here, the regular expression [a-z]_[a-z]{4}$
can be broken down into smaller chunks for better understanding:
[a-z]
matches a set of lowercase characters from ‘a’ to ‘z’._
matches an underscore.[a-z]{4}
matches any four lowercase characters from ‘a’ to ‘z’.
Putting this together, the expression selects column names that have four characters after an underscore. Thus, it should match column names: facility_name, county_name, hospital_type, and response_rate.
If you’re unfamiliar with regular expressions, you can skip this section for now. However, interested readers can find many online resources to learn about regular expressions. One of my favorite online tools for building and testing regular expressions is https://regexr.com. You can use this tool to test the correctness of a regular expression.
num_range()
: selects columns based on a numeric range.
Let’s use the count data frame for this example. First read the csv file: GSE60450_normalized_data.csv.
<- read_csv("data/GSE60450_normalized_data.csv")
counts colnames(counts)
Output
Rows: 20 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): X, gene_symbol
dbl (12): GSM1480291, GSM1480292, GSM1480293, GSM1480294, GSM1480295, GSM148...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1] "X" "gene_symbol" "GSM1480291" "GSM1480292" "GSM1480293"
[6] "GSM1480294" "GSM1480295" "GSM1480296" "GSM1480297" "GSM1480298"
[11] "GSM1480299" "GSM1480300" "GSM1480301" "GSM1480302"
To select the samples from GSM1480297 to GSM1480300:
|> select(
counts num_range(prefix = "GSM1480", 297:300)
)
Output
GSM1480297 | GSM1480298 | GSM1480299 | GSM1480300 |
---|---|---|---|
10.59006 | 14.88337 | 7.57182 | 7.05763 |
95.67017 | 100.73912 | 78.07470 | 59.35009 |
0.00000 | 0.00000 | 0.00000 | 0.00000 |
125.00263 | 102.99289 | 110.64124 | 97.98130 |
0.00000 | 0.08505 | 0.07726 | 0.08255 |
33.84822 | 36.91075 | 27.46715 | 31.36722 |
0.00000 | 0.00000 | 0.00000 | 0.00000 |
281.97525 | 241.83343 | 227.84988 | 232.81909 |
62.62120 | 61.82975 | 49.98944 | 42.01557 |
17.98312 | 13.69270 | 9.00119 | 9.69908 |
51.79137 | 49.49782 | 52.84819 | 47.00956 |
50.07299 | 50.94364 | 45.04458 | 36.89776 |
147.58143 | 128.20957 | 168.08661 | 133.35197 |
52.83040 | 84.66509 | 323.54064 | 485.86178 |
0.07992 | 0.00000 | 0.00000 | 0.00000 |
0.00000 | 0.00000 | 0.03863 | 0.00000 |
102.26398 | 81.05056 | 115.31568 | 138.38724 |
0.03996 | 0.00000 | 0.00000 | 0.00000 |
0.00000 | 0.00000 | 0.00000 | 0.00000 |
14.02683 | 11.56650 | 12.13036 | 12.21671 |
all_of()
: selects columns specified by character vector.
|> select(
cms_data all_of(c("star_rating", "no_of_surveys", "response_rate", "no_column_by_this_name"))
)
Error in `all_of()`:
! Can't subset columns that don't exist.
✖ Column `no_column_by_this_name` doesn't exist.
When using all_of()
, all the names provided by the vector must be present in the data frame. Otherwise, it will result in an error, as shown above.
|> select(
cms_data all_of(c("star_rating", "no_of_surveys", "response_rate"))
)
Output
star_rating | no_of_surveys | response_rate |
---|---|---|
4 | 3110 | 41 |
1 | 264 | 6 |
2 | 1382 | 20 |
3 | 2506 | 35 |
2 | 1799 | 18 |
3 | 1379 | 24 |
3 | 114 | 22 |
4 | 247 | 34 |
4 | 619 | 32 |
2 | 1714 | 21 |
3 | 241 | 14 |
4 | 393 | 24 |
4 | 680 | 35 |
3 | 389 | 29 |
3 | 1110 | 20 |
any_of()
: selects columns specified by character vector, allowing any of them to be present.
|> select(
cms_data any_of(c("star_rating", "no_of_surveys", "response_rate", "no_column_by_this_name"))
)
Output
star_rating | no_of_surveys | response_rate |
---|---|---|
4 | 3110 | 41 |
1 | 264 | 6 |
2 | 1382 | 20 |
3 | 2506 | 35 |
2 | 1799 | 18 |
3 | 1379 | 24 |
3 | 114 | 22 |
4 | 247 | 34 |
4 | 619 | 32 |
2 | 1714 | 21 |
3 | 241 | 14 |
4 | 393 | 24 |
4 | 680 | 35 |
3 | 389 | 29 |
3 | 1110 | 20 |
everything()
: Selects all columns.
This function returns all column names that have not been specified. It is often used when reordering all columns in a dataframe:
|> select(5, 8, 2, everything()) cms_data
Output
star_rating | overall_rating | facility_name | id | county | hospital_type | no_of_surveys | response_rate |
---|---|---|---|---|---|---|---|
4 | 5 | SCRIPPS GREEN HOSPITAL | 050424 | SAN DIEGO | Acute Care Hospital | 3110 | 41 |
1 | 2 | ST BERNARD HOSPITAL | 140103 | COOK | Acute Care Hospital | 264 | 6 |
2 | 2 | SOUTH LAKE HOSPITAL | 100051 | LAKE | Acute Care Hospital | 1382 | 20 |
3 | 3 | MERCY HOSPITAL FORT SMITH | 040062 | SEBASTIAN | Acute Care Hospital | 2506 | 35 |
2 | 2 | BAPTIST MEMORIAL HOSPITAL | 440048 | SHELBY | Acute Care Hospital | 1799 | 18 |
3 | 3 | ST JOSEPH REGIONAL HEALTH CENTER | 450011 | BRAZOS | Acute Care Hospital | 1379 | 24 |
3 | 3 | GREENE COUNTY GENERAL HOSPITAL | 151317 | GREENE | Critical Access Hospital | 114 | 22 |
4 | 3 | SOUTHWEST MEMORIAL HOSPITAL | 061327 | MONTEZUMA | Critical Access Hospital | 247 | 34 |
4 | 3 | SENTARA GENERAL HOSPITAL | 490057 | VIRGINIA BEACH | Acute Care Hospital | 619 | 32 |
2 | 2 | PIEDMONT FAYETTE HOSPITAL | 110215 | FAYETTE | Acute Care Hospital | 1714 | 21 |
3 | 3 | MISSION COMMUNITY HOSPITAL | 050704 | LOS ANGELES | Acute Care Hospital | 241 | 14 |
4 | 3 | DOCTORS HOSPITAL | 100296 | MIAMI-DADE | Acute Care Hospital | 393 | 24 |
4 | 2 | SUMNER REGIONAL MEDICAL CENTER | 440003 | SUMNER | Acute Care Hospital | 680 | 35 |
3 | 3 | WHIDBEY GENERAL HOSPITAL | 501339 | ISLAND | Critical Access Hospital | 389 | 29 |
3 | 2 | NORTHRIDGE MEDICAL CENTER | 050116 | LOS ANGELES | Acute Care Hospital | 1110 | 20 |
Here the dimensions of the dataframe is not changed, merely the column order.
You can combine multiple helper functions to create more complex selection criteria. Additionally, you can use the ‘-’ symbol in front of the helper function to exclude the matched columns.
For example try the following examples:
|> select(starts_with('i'), contains('rating'))
cms_data |> select(ends_with("type"), everything(), -1, -3) cms_data
mutate()
The mutate()
function adds new columns of data, thus ‘mutating’ the contents and dimensions of the input data frame.
Example 1: Calculate the total number of patients or visitors who responded to the survey in each facility (i.e, \(\text{response rate } = \frac{\text{number of responses}}{\text{total number of surveys}} \times 100\)).
Here we use the round()
function to round off the result to the closest integer or numeric value as number of responses cannot contain decimal values.
|>
cms_data mutate(no_of_responses = round(no_of_surveys * response_rate / 100))
Output
id | facility_name | county | hospital_type | star_rating | no_of_surveys | response_rate | overall_rating | no_of_responses |
---|---|---|---|---|---|---|---|---|
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 | 127510 |
140103 | ST BERNARD HOSPITAL | COOK | Acute Care Hospital | 1 | 264 | 6 | 2 | 1584 |
100051 | SOUTH LAKE HOSPITAL | LAKE | Acute Care Hospital | 2 | 1382 | 20 | 2 | 27640 |
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 | 87710 |
440048 | BAPTIST MEMORIAL HOSPITAL | SHELBY | Acute Care Hospital | 2 | 1799 | 18 | 2 | 32382 |
450011 | ST JOSEPH REGIONAL HEALTH CENTER | BRAZOS | Acute Care Hospital | 3 | 1379 | 24 | 3 | 33096 |
151317 | GREENE COUNTY GENERAL HOSPITAL | GREENE | Critical Access Hospital | 3 | 114 | 22 | 3 | 2508 |
061327 | SOUTHWEST MEMORIAL HOSPITAL | MONTEZUMA | Critical Access Hospital | 4 | 247 | 34 | 3 | 8398 |
490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital | 4 | 619 | 32 | 3 | 19808 |
110215 | PIEDMONT FAYETTE HOSPITAL | FAYETTE | Acute Care Hospital | 2 | 1714 | 21 | 2 | 35994 |
050704 | MISSION COMMUNITY HOSPITAL | LOS ANGELES | Acute Care Hospital | 3 | 241 | 14 | 3 | 3374 |
100296 | DOCTORS HOSPITAL | MIAMI-DADE | Acute Care Hospital | 4 | 393 | 24 | 3 | 9432 |
440003 | SUMNER REGIONAL MEDICAL CENTER | SUMNER | Acute Care Hospital | 4 | 680 | 35 | 2 | 23800 |
501339 | WHIDBEY GENERAL HOSPITAL | ISLAND | Critical Access Hospital | 3 | 389 | 29 | 3 | 11281 |
050116 | NORTHRIDGE MEDICAL CENTER | LOS ANGELES | Acute Care Hospital | 3 | 1110 | 20 | 2 | 22200 |
This creates a new column at the end of the data frame named ‘no_of_responses’ and computes the total number of responses. Because the number of columns is expanding, we can reduce the number of columns displayed using the select()
function.
To do this, we need to use chaining which is discussed below.
Chaining functions
R chaining allows you to streamline your data analysis workflow by sequentially applying multiple operations to your data using the pipe operator |>
. We often need to perform several data manipulation or analysis operations in a sequence. Chaining allows you to apply these operations one after the other in a clear and concise manner.
Here’s a basic template for chaining operations using the pipe operator |>
:
<- data |>
result operation1(...) |>
operation2(...) |>
operation3(...) |>
...operationN(...)
In this template:
data
represents the input data frame or object.operation1
,operation2
, …,operationN
represent the functions or operations you want to apply sequentially to the data.For example:select()
,filter()
ormutate()
functions....
represents any additional arguments or parameters that may be passed to each operation.
Each operation takes the output of the previous operation as its input, making it easy to chain multiple operations together. This improves the readability of your code by organizing operations in a left-to-right fashion and it avoids creating intermediate variables to store the results of each operation.
mutate()
continued
Let’s use chaining to combine both select()
and mutate()
operations for the previos example:
|>
cms_data select(facility_name, no_of_surveys, response_rate) |>
mutate(no_of_responses = round(no_of_surveys * response_rate / 100))
Output
facility_name | no_of_surveys | response_rate | no_of_responses |
---|---|---|---|
SCRIPPS GREEN HOSPITAL | 3110 | 41 | 127510 |
ST BERNARD HOSPITAL | 264 | 6 | 1584 |
SOUTH LAKE HOSPITAL | 1382 | 20 | 27640 |
MERCY HOSPITAL FORT SMITH | 2506 | 35 | 87710 |
BAPTIST MEMORIAL HOSPITAL | 1799 | 18 | 32382 |
ST JOSEPH REGIONAL HEALTH CENTER | 1379 | 24 | 33096 |
GREENE COUNTY GENERAL HOSPITAL | 114 | 22 | 2508 |
SOUTHWEST MEMORIAL HOSPITAL | 247 | 34 | 8398 |
SENTARA GENERAL HOSPITAL | 619 | 32 | 19808 |
PIEDMONT FAYETTE HOSPITAL | 1714 | 21 | 35994 |
MISSION COMMUNITY HOSPITAL | 241 | 14 | 3374 |
DOCTORS HOSPITAL | 393 | 24 | 9432 |
SUMNER REGIONAL MEDICAL CENTER | 680 | 35 | 23800 |
WHIDBEY GENERAL HOSPITAL | 389 | 29 | 11281 |
NORTHRIDGE MEDICAL CENTER | 1110 | 20 | 22200 |
summarise()
The summarise()
function creates individual summary statistics from larger data sets.
The output of summarise()/summarize()
differs qualitatively from the input. It results in a smaller dataframe with a reduced representation of the original data. While not strictly necessary, it’s advisable to assign new column names for the summary statistics generated by this function. This practice enhances clarity and organization in your data analysis workflow.
Example 1: Calculate the mean number of surveys.
|>
cms_data summarise(mean_no_of_surveys = mean(no_of_surveys))
Output
mean_no_of_surveys |
---|
1063.133 |
This results in a data frame of size 1 row \(\times\) 1 col. We can create additional summary statistics by adding them in a comma-separated sequence as follows:
|>
cms_data summarise(mean_no_of_surveys = mean(no_of_surveys),
min_no_of_surveys = min(no_of_surveys),
max_no_of_surveys = max(no_of_surveys),
tot_no_of_surveys = sum(no_of_surveys))
Output
mean_no_of_surveys | min_no_of_surveys | max_no_of_surveys | tot_no_of_surveys |
---|---|---|---|
1063.133 | 114 | 3110 | 15947 |
n()
helper function
This function counts the number of observations in a dataset. It does not take any arguments, but simply counts the rows.
|>
cms_data summarise(mean_no_of_surveys = mean(no_of_surveys),
min_no_of_surveys = min(no_of_surveys),
max_no_of_surveys = max(no_of_surveys),
tot_no_of_surveys = sum(no_of_surveys),
n_rows = n())
Output
mean_no_of_surveys | min_no_of_surveys | max_no_of_surveys | tot_no_of_surveys | n_rows |
---|---|---|---|---|
1063.133 | 114 | 3110 | 15947 | 15 |
arrange()
The arrange()
function orders rows based on the values in a given column.
Example 1: Order the facilities based on the county.
|>
cms_data arrange(county)
Output
id | facility_name | county | hospital_type | star_rating | no_of_surveys | response_rate | overall_rating |
---|---|---|---|---|---|---|---|
450011 | ST JOSEPH REGIONAL HEALTH CENTER | BRAZOS | Acute Care Hospital | 3 | 1379 | 24 | 3 |
140103 | ST BERNARD HOSPITAL | COOK | Acute Care Hospital | 1 | 264 | 6 | 2 |
110215 | PIEDMONT FAYETTE HOSPITAL | FAYETTE | Acute Care Hospital | 2 | 1714 | 21 | 2 |
151317 | GREENE COUNTY GENERAL HOSPITAL | GREENE | Critical Access Hospital | 3 | 114 | 22 | 3 |
501339 | WHIDBEY GENERAL HOSPITAL | ISLAND | Critical Access Hospital | 3 | 389 | 29 | 3 |
100051 | SOUTH LAKE HOSPITAL | LAKE | Acute Care Hospital | 2 | 1382 | 20 | 2 |
050704 | MISSION COMMUNITY HOSPITAL | LOS ANGELES | Acute Care Hospital | 3 | 241 | 14 | 3 |
050116 | NORTHRIDGE MEDICAL CENTER | LOS ANGELES | Acute Care Hospital | 3 | 1110 | 20 | 2 |
100296 | DOCTORS HOSPITAL | MIAMI-DADE | Acute Care Hospital | 4 | 393 | 24 | 3 |
061327 | SOUTHWEST MEMORIAL HOSPITAL | MONTEZUMA | Critical Access Hospital | 4 | 247 | 34 | 3 |
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 |
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 |
440048 | BAPTIST MEMORIAL HOSPITAL | SHELBY | Acute Care Hospital | 2 | 1799 | 18 | 2 |
440003 | SUMNER REGIONAL MEDICAL CENTER | SUMNER | Acute Care Hospital | 4 | 680 | 35 | 2 |
490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital | 4 | 619 | 32 | 3 |
Example 2: Sort the facilities based on the overall rating first and then by response rate.
|>
cms_data arrange(overall_rating, response_rate)
Output
id | facility_name | county | hospital_type | star_rating | no_of_surveys | response_rate | overall_rating |
---|---|---|---|---|---|---|---|
140103 | ST BERNARD HOSPITAL | COOK | Acute Care Hospital | 1 | 264 | 6 | 2 |
440048 | BAPTIST MEMORIAL HOSPITAL | SHELBY | Acute Care Hospital | 2 | 1799 | 18 | 2 |
100051 | SOUTH LAKE HOSPITAL | LAKE | Acute Care Hospital | 2 | 1382 | 20 | 2 |
050116 | NORTHRIDGE MEDICAL CENTER | LOS ANGELES | Acute Care Hospital | 3 | 1110 | 20 | 2 |
110215 | PIEDMONT FAYETTE HOSPITAL | FAYETTE | Acute Care Hospital | 2 | 1714 | 21 | 2 |
440003 | SUMNER REGIONAL MEDICAL CENTER | SUMNER | Acute Care Hospital | 4 | 680 | 35 | 2 |
050704 | MISSION COMMUNITY HOSPITAL | LOS ANGELES | Acute Care Hospital | 3 | 241 | 14 | 3 |
151317 | GREENE COUNTY GENERAL HOSPITAL | GREENE | Critical Access Hospital | 3 | 114 | 22 | 3 |
450011 | ST JOSEPH REGIONAL HEALTH CENTER | BRAZOS | Acute Care Hospital | 3 | 1379 | 24 | 3 |
100296 | DOCTORS HOSPITAL | MIAMI-DADE | Acute Care Hospital | 4 | 393 | 24 | 3 |
501339 | WHIDBEY GENERAL HOSPITAL | ISLAND | Critical Access Hospital | 3 | 389 | 29 | 3 |
490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital | 4 | 619 | 32 | 3 |
061327 | SOUTHWEST MEMORIAL HOSPITAL | MONTEZUMA | Critical Access Hospital | 4 | 247 | 34 | 3 |
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 |
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 |
desc()
helper function
This function is used to sort data in descending order.
Example 3: Sort the facilities in descending order based on the number of surveys
|>
cms_data arrange(desc(no_of_surveys))
Output
id | facility_name | county | hospital_type | star_rating | no_of_surveys | response_rate | overall_rating |
---|---|---|---|---|---|---|---|
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 |
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 |
440048 | BAPTIST MEMORIAL HOSPITAL | SHELBY | Acute Care Hospital | 2 | 1799 | 18 | 2 |
110215 | PIEDMONT FAYETTE HOSPITAL | FAYETTE | Acute Care Hospital | 2 | 1714 | 21 | 2 |
100051 | SOUTH LAKE HOSPITAL | LAKE | Acute Care Hospital | 2 | 1382 | 20 | 2 |
450011 | ST JOSEPH REGIONAL HEALTH CENTER | BRAZOS | Acute Care Hospital | 3 | 1379 | 24 | 3 |
050116 | NORTHRIDGE MEDICAL CENTER | LOS ANGELES | Acute Care Hospital | 3 | 1110 | 20 | 2 |
440003 | SUMNER REGIONAL MEDICAL CENTER | SUMNER | Acute Care Hospital | 4 | 680 | 35 | 2 |
490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital | 4 | 619 | 32 | 3 |
100296 | DOCTORS HOSPITAL | MIAMI-DADE | Acute Care Hospital | 4 | 393 | 24 | 3 |
501339 | WHIDBEY GENERAL HOSPITAL | ISLAND | Critical Access Hospital | 3 | 389 | 29 | 3 |
140103 | ST BERNARD HOSPITAL | COOK | Acute Care Hospital | 1 | 264 | 6 | 2 |
061327 | SOUTHWEST MEMORIAL HOSPITAL | MONTEZUMA | Critical Access Hospital | 4 | 247 | 34 | 3 |
050704 | MISSION COMMUNITY HOSPITAL | LOS ANGELES | Acute Care Hospital | 3 | 241 | 14 | 3 |
151317 | GREENE COUNTY GENERAL HOSPITAL | GREENE | Critical Access Hospital | 3 | 114 | 22 | 3 |
group_by()
The group_by()
function groups data by one or more variables. It allows us to create sub groups based on labels in a particular column and to run subsequent functions or operations on all sub groups.
The group_by()
function essentially partitions the data into separate subsets, each corresponding to a distinct category in a specified column. To observe this in action, inspect the structure using str()
of the cms_data dataset before and after grouping:
|> str() cms_data
Output
spc_tbl_ [15 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ id : chr [1:15] "050424" "140103" "100051" "040062" ...
$ facility_name : chr [1:15] "SCRIPPS GREEN HOSPITAL" "ST BERNARD HOSPITAL" "SOUTH LAKE HOSPITAL" "MERCY HOSPITAL FORT SMITH" ...
$ county : chr [1:15] "SAN DIEGO" "COOK" "LAKE" "SEBASTIAN" ...
$ hospital_type : chr [1:15] "Acute Care Hospital" "Acute Care Hospital" "Acute Care Hospital" "Acute Care Hospital" ...
$ star_rating : num [1:15] 4 1 2 3 2 3 3 4 4 2 ...
$ no_of_surveys : num [1:15] 3110 264 1382 2506 1799 ...
$ response_rate : num [1:15] 41 6 20 35 18 24 22 34 32 21 ...
$ overall_rating: num [1:15] 5 2 2 3 2 3 3 3 3 2 ...
- attr(*, "spec")=
.. cols(
.. ID = col_character(),
.. `Facility Name` = col_character(),
.. County = col_character(),
.. `Hospital Type` = col_character(),
.. `Star Rating` = col_double(),
.. `No of Surveys` = col_double(),
.. `Response Rate` = col_double(),
.. `Overall Rating` = col_double()
.. )
- attr(*, "problems")=<externalptr>
|> group_by(hospital_type) |> str() cms_data
Output
gropd_df [15 × 8] (S3: grouped_df/tbl_df/tbl/data.frame)
$ id : chr [1:15] "050424" "140103" "100051" "040062" ...
$ facility_name : chr [1:15] "SCRIPPS GREEN HOSPITAL" "ST BERNARD HOSPITAL" "SOUTH LAKE HOSPITAL" "MERCY HOSPITAL FORT SMITH" ...
$ county : chr [1:15] "SAN DIEGO" "COOK" "LAKE" "SEBASTIAN" ...
$ hospital_type : chr [1:15] "Acute Care Hospital" "Acute Care Hospital" "Acute Care Hospital" "Acute Care Hospital" ...
$ star_rating : num [1:15] 4 1 2 3 2 3 3 4 4 2 ...
$ no_of_surveys : num [1:15] 3110 264 1382 2506 1799 ...
$ response_rate : num [1:15] 41 6 20 35 18 24 22 34 32 21 ...
$ overall_rating: num [1:15] 5 2 2 3 2 3 3 3 3 2 ...
- attr(*, "spec")=
.. cols(
.. ID = col_character(),
.. `Facility Name` = col_character(),
.. County = col_character(),
.. `Hospital Type` = col_character(),
.. `Star Rating` = col_double(),
.. `No of Surveys` = col_double(),
.. `Response Rate` = col_double(),
.. `Overall Rating` = col_double()
.. )
- attr(*, "problems")=<externalptr>
- attr(*, "groups")= tibble [2 × 2] (S3: tbl_df/tbl/data.frame)
..$ hospital_type: chr [1:2] "Acute Care Hospital" "Critical Access Hospital"
..$ .rows : list<int> [1:2]
.. ..$ : int [1:12] 1 2 3 4 5 6 9 10 11 12 ...
.. ..$ : int [1:3] 7 8 14
.. ..@ ptype: int(0)
..- attr(*, ".drop")= logi TRUE
The result of applying group_by()
is a ‘grouped_df’ (grouped data frame) and all subsequent functions are executed independently on each subgroup of the data.
ungroup()
helper
The ungroup()
function is used to remove grouping from a data frame or a grouped data frame created using the group_by()
function.
When you apply group_by()
to a data frame, it creates a grouped data frame where operations like summarization or manipulation are performed within each group defined by the grouping variables. However, in some cases, you may want to remove the grouping structure and return to the original ungrouped data frame. This is where the ungroup()
function comes into play.
Combining multiple dplyr
functions
In this section, we will be using the Australian_Cancer_Incidence_and_Mortality.csv dataset.
<- read_csv("data/Australian_Cancer_Incidence_and_Mortality.csv") cancer_mort
First, let’s examine the dimensions of this dataset.
dim(cancer_mort)
Output
[1] 119862 6
Next, let’s take a look at the top few rows of this data frame.
head(cancer_mort)
Output
Year | Sex | Type | Cancer_Type | Age | Count |
---|---|---|---|---|---|
1982 | Male | Incidence | Acute lymphoblastic leukaemia | 0-4 | 42 |
1982 | Male | Incidence | Acute lymphoblastic leukaemia | 5-9 | 25 |
1982 | Male | Incidence | Acute lymphoblastic leukaemia | 10-14 | 14 |
1982 | Male | Incidence | Acute lymphoblastic leukaemia | 15-19 | 14 |
1982 | Male | Incidence | Acute lymphoblastic leukaemia | 20-24 | 5 |
1982 | Male | Incidence | Acute lymphoblastic leukaemia | 25-29 | 2 |
count()
helper
The count()
function is used to count the number of occurrences of unique values in one or more variables within a data frame. This function is particularly useful for summarizing data and understanding the distribution of values within a dataset.
It is a convenient function that combines group_by()
and summarize()
in one step, particularly useful for counting occurrences of character data.
Example 1: Count the number of cancers observed in each cancer type.
|> count(Cancer_Type) cancer_mort
Output
Cancer_Type | n |
---|---|
Acute lymphoblastic leukaemia | 3942 |
Acute myeloid leukaemia | 3942 |
Anal cancer | 3942 |
Bladder cancer | 3942 |
Bowel cancer | 3942 |
Brain cancer | 3942 |
Breast cancer | 3942 |
Cervical cancer | 1314 |
Chronic lymphocytic leukaemia | 3942 |
Chronic myeloid leukaemia | 3942 |
Colon cancer | 3942 |
Head and neck excluding lip | 3942 |
Head and neck including lip | 3942 |
Hodgkin lymphoma | 3942 |
Kidney cancer | 3942 |
Laryngeal cancer | 3942 |
Liver cancer | 3942 |
Lung cancer | 3942 |
Melanoma of the skin | 3942 |
Mesothelioma | 3942 |
Myeloma | 3942 |
Non-Hodgkin lymphoma | 3942 |
Non-melanoma skin cancer, all types | 2376 |
Non-melanoma skin cancer, rare types | 540 |
Oesophageal cancer | 3942 |
Ovarian cancer | 1314 |
Pancreatic cancer | 3942 |
Prostate cancer | 1314 |
Rectal cancer | 3942 |
Stomach cancer | 3942 |
Testicular cancer | 1314 |
Thyroid cancer | 3942 |
Tongue cancer | 3942 |
Unknown primary site | 3942 |
Uterine cancer | 1314 |
In the count summary output column is typically denoted as ‘n’. The same output can be observed by combining group_by()
and summarise()
functions as follows.
|>
cancer_mort group_by(Cancer_Type) |>
summarise(n = n())
Output
Cancer_Type | n |
---|---|
Acute lymphoblastic leukaemia | 3942 |
Acute myeloid leukaemia | 3942 |
Anal cancer | 3942 |
Bladder cancer | 3942 |
Bowel cancer | 3942 |
Brain cancer | 3942 |
Breast cancer | 3942 |
Cervical cancer | 1314 |
Chronic lymphocytic leukaemia | 3942 |
Chronic myeloid leukaemia | 3942 |
Colon cancer | 3942 |
Head and neck excluding lip | 3942 |
Head and neck including lip | 3942 |
Hodgkin lymphoma | 3942 |
Kidney cancer | 3942 |
Laryngeal cancer | 3942 |
Liver cancer | 3942 |
Lung cancer | 3942 |
Melanoma of the skin | 3942 |
Mesothelioma | 3942 |
Myeloma | 3942 |
Non-Hodgkin lymphoma | 3942 |
Non-melanoma skin cancer, all types | 2376 |
Non-melanoma skin cancer, rare types | 540 |
Oesophageal cancer | 3942 |
Ovarian cancer | 1314 |
Pancreatic cancer | 3942 |
Prostate cancer | 1314 |
Rectal cancer | 3942 |
Stomach cancer | 3942 |
Testicular cancer | 1314 |
Thyroid cancer | 3942 |
Tongue cancer | 3942 |
Unknown primary site | 3942 |
Uterine cancer | 1314 |
Example 2: Count the number of cancers observed in each cancer type and age group.
|> count(Cancer_Type, Age) cancer_mort
Output
Cancer_Type | Age | n |
---|---|---|
Acute lymphoblastic leukaemia | 0-4 | 219 |
Acute lymphoblastic leukaemia | 10-14 | 219 |
Acute lymphoblastic leukaemia | 15-19 | 219 |
Acute lymphoblastic leukaemia | 20-24 | 219 |
Acute lymphoblastic leukaemia | 25-29 | 219 |
Acute lymphoblastic leukaemia | 30-34 | 219 |
Acute lymphoblastic leukaemia | 35-39 | 219 |
Acute lymphoblastic leukaemia | 40-44 | 219 |
Acute lymphoblastic leukaemia | 45-49 | 219 |
Acute lymphoblastic leukaemia | 5-9 | 219 |
Acute lymphoblastic leukaemia | 50-54 | 219 |
Acute lymphoblastic leukaemia | 55-59 | 219 |
Acute lymphoblastic leukaemia | 60-64 | 219 |
Acute lymphoblastic leukaemia | 65-69 | 219 |
Acute lymphoblastic leukaemia | 70-74 | 219 |
Acute lymphoblastic leukaemia | 75-79 | 219 |
Acute lymphoblastic leukaemia | 80-84 | 219 |
Acute lymphoblastic leukaemia | 85+ | 219 |
Acute myeloid leukaemia | 0-4 | 219 |
Acute myeloid leukaemia | 10-14 | 219 |
Acute myeloid leukaemia | 15-19 | 219 |
Acute myeloid leukaemia | 20-24 | 219 |
Acute myeloid leukaemia | 25-29 | 219 |
Acute myeloid leukaemia | 30-34 | 219 |
Acute myeloid leukaemia | 35-39 | 219 |
Acute myeloid leukaemia | 40-44 | 219 |
Acute myeloid leukaemia | 45-49 | 219 |
Acute myeloid leukaemia | 5-9 | 219 |
Acute myeloid leukaemia | 50-54 | 219 |
Acute myeloid leukaemia | 55-59 | 219 |
Acute myeloid leukaemia | 60-64 | 219 |
Acute myeloid leukaemia | 65-69 | 219 |
Acute myeloid leukaemia | 70-74 | 219 |
Acute myeloid leukaemia | 75-79 | 219 |
Acute myeloid leukaemia | 80-84 | 219 |
Acute myeloid leukaemia | 85+ | 219 |
Anal cancer | 0-4 | 219 |
Anal cancer | 10-14 | 219 |
Anal cancer | 15-19 | 219 |
Anal cancer | 20-24 | 219 |
Anal cancer | 25-29 | 219 |
Anal cancer | 30-34 | 219 |
Anal cancer | 35-39 | 219 |
Anal cancer | 40-44 | 219 |
Anal cancer | 45-49 | 219 |
Anal cancer | 5-9 | 219 |
Anal cancer | 50-54 | 219 |
Anal cancer | 55-59 | 219 |
Anal cancer | 60-64 | 219 |
Anal cancer | 65-69 | 219 |
Anal cancer | 70-74 | 219 |
Anal cancer | 75-79 | 219 |
Anal cancer | 80-84 | 219 |
Anal cancer | 85+ | 219 |
Bladder cancer | 0-4 | 219 |
Bladder cancer | 10-14 | 219 |
Bladder cancer | 15-19 | 219 |
Bladder cancer | 20-24 | 219 |
Bladder cancer | 25-29 | 219 |
Bladder cancer | 30-34 | 219 |
Bladder cancer | 35-39 | 219 |
Bladder cancer | 40-44 | 219 |
Bladder cancer | 45-49 | 219 |
Bladder cancer | 5-9 | 219 |
Bladder cancer | 50-54 | 219 |
Bladder cancer | 55-59 | 219 |
Bladder cancer | 60-64 | 219 |
Bladder cancer | 65-69 | 219 |
Bladder cancer | 70-74 | 219 |
Bladder cancer | 75-79 | 219 |
Bladder cancer | 80-84 | 219 |
Bladder cancer | 85+ | 219 |
Bowel cancer | 0-4 | 219 |
Bowel cancer | 10-14 | 219 |
Bowel cancer | 15-19 | 219 |
Bowel cancer | 20-24 | 219 |
Bowel cancer | 25-29 | 219 |
Bowel cancer | 30-34 | 219 |
Bowel cancer | 35-39 | 219 |
Bowel cancer | 40-44 | 219 |
Bowel cancer | 45-49 | 219 |
Bowel cancer | 5-9 | 219 |
Bowel cancer | 50-54 | 219 |
Bowel cancer | 55-59 | 219 |
Bowel cancer | 60-64 | 219 |
Bowel cancer | 65-69 | 219 |
Bowel cancer | 70-74 | 219 |
Bowel cancer | 75-79 | 219 |
Bowel cancer | 80-84 | 219 |
Bowel cancer | 85+ | 219 |
Brain cancer | 0-4 | 219 |
Brain cancer | 10-14 | 219 |
Brain cancer | 15-19 | 219 |
Brain cancer | 20-24 | 219 |
Brain cancer | 25-29 | 219 |
Brain cancer | 30-34 | 219 |
Brain cancer | 35-39 | 219 |
Brain cancer | 40-44 | 219 |
Brain cancer | 45-49 | 219 |
Brain cancer | 5-9 | 219 |
Brain cancer | 50-54 | 219 |
Brain cancer | 55-59 | 219 |
Brain cancer | 60-64 | 219 |
Brain cancer | 65-69 | 219 |
Brain cancer | 70-74 | 219 |
Brain cancer | 75-79 | 219 |
Brain cancer | 80-84 | 219 |
Brain cancer | 85+ | 219 |
Breast cancer | 0-4 | 219 |
Breast cancer | 10-14 | 219 |
Breast cancer | 15-19 | 219 |
Breast cancer | 20-24 | 219 |
Breast cancer | 25-29 | 219 |
Breast cancer | 30-34 | 219 |
Breast cancer | 35-39 | 219 |
Breast cancer | 40-44 | 219 |
Breast cancer | 45-49 | 219 |
Breast cancer | 5-9 | 219 |
Breast cancer | 50-54 | 219 |
Breast cancer | 55-59 | 219 |
Breast cancer | 60-64 | 219 |
Breast cancer | 65-69 | 219 |
Breast cancer | 70-74 | 219 |
Breast cancer | 75-79 | 219 |
Breast cancer | 80-84 | 219 |
Breast cancer | 85+ | 219 |
Cervical cancer | 0-4 | 73 |
Cervical cancer | 10-14 | 73 |
Cervical cancer | 15-19 | 73 |
Cervical cancer | 20-24 | 73 |
Cervical cancer | 25-29 | 73 |
Cervical cancer | 30-34 | 73 |
Cervical cancer | 35-39 | 73 |
Cervical cancer | 40-44 | 73 |
Cervical cancer | 45-49 | 73 |
Cervical cancer | 5-9 | 73 |
Cervical cancer | 50-54 | 73 |
Cervical cancer | 55-59 | 73 |
Cervical cancer | 60-64 | 73 |
Cervical cancer | 65-69 | 73 |
Cervical cancer | 70-74 | 73 |
Cervical cancer | 75-79 | 73 |
Cervical cancer | 80-84 | 73 |
Cervical cancer | 85+ | 73 |
Chronic lymphocytic leukaemia | 0-4 | 219 |
Chronic lymphocytic leukaemia | 10-14 | 219 |
Chronic lymphocytic leukaemia | 15-19 | 219 |
Chronic lymphocytic leukaemia | 20-24 | 219 |
Chronic lymphocytic leukaemia | 25-29 | 219 |
Chronic lymphocytic leukaemia | 30-34 | 219 |
Chronic lymphocytic leukaemia | 35-39 | 219 |
Chronic lymphocytic leukaemia | 40-44 | 219 |
Chronic lymphocytic leukaemia | 45-49 | 219 |
Chronic lymphocytic leukaemia | 5-9 | 219 |
Chronic lymphocytic leukaemia | 50-54 | 219 |
Chronic lymphocytic leukaemia | 55-59 | 219 |
Chronic lymphocytic leukaemia | 60-64 | 219 |
Chronic lymphocytic leukaemia | 65-69 | 219 |
Chronic lymphocytic leukaemia | 70-74 | 219 |
Chronic lymphocytic leukaemia | 75-79 | 219 |
Chronic lymphocytic leukaemia | 80-84 | 219 |
Chronic lymphocytic leukaemia | 85+ | 219 |
Chronic myeloid leukaemia | 0-4 | 219 |
Chronic myeloid leukaemia | 10-14 | 219 |
Chronic myeloid leukaemia | 15-19 | 219 |
Chronic myeloid leukaemia | 20-24 | 219 |
Chronic myeloid leukaemia | 25-29 | 219 |
Chronic myeloid leukaemia | 30-34 | 219 |
Chronic myeloid leukaemia | 35-39 | 219 |
Chronic myeloid leukaemia | 40-44 | 219 |
Chronic myeloid leukaemia | 45-49 | 219 |
Chronic myeloid leukaemia | 5-9 | 219 |
Chronic myeloid leukaemia | 50-54 | 219 |
Chronic myeloid leukaemia | 55-59 | 219 |
Chronic myeloid leukaemia | 60-64 | 219 |
Chronic myeloid leukaemia | 65-69 | 219 |
Chronic myeloid leukaemia | 70-74 | 219 |
Chronic myeloid leukaemia | 75-79 | 219 |
Chronic myeloid leukaemia | 80-84 | 219 |
Chronic myeloid leukaemia | 85+ | 219 |
Colon cancer | 0-4 | 219 |
Colon cancer | 10-14 | 219 |
Colon cancer | 15-19 | 219 |
Colon cancer | 20-24 | 219 |
Colon cancer | 25-29 | 219 |
Colon cancer | 30-34 | 219 |
Colon cancer | 35-39 | 219 |
Colon cancer | 40-44 | 219 |
Colon cancer | 45-49 | 219 |
Colon cancer | 5-9 | 219 |
Colon cancer | 50-54 | 219 |
Colon cancer | 55-59 | 219 |
Colon cancer | 60-64 | 219 |
Colon cancer | 65-69 | 219 |
Colon cancer | 70-74 | 219 |
Colon cancer | 75-79 | 219 |
Colon cancer | 80-84 | 219 |
Colon cancer | 85+ | 219 |
Head and neck excluding lip | 0-4 | 219 |
Head and neck excluding lip | 10-14 | 219 |
Head and neck excluding lip | 15-19 | 219 |
Head and neck excluding lip | 20-24 | 219 |
Head and neck excluding lip | 25-29 | 219 |
Head and neck excluding lip | 30-34 | 219 |
Head and neck excluding lip | 35-39 | 219 |
Head and neck excluding lip | 40-44 | 219 |
Head and neck excluding lip | 45-49 | 219 |
Head and neck excluding lip | 5-9 | 219 |
Head and neck excluding lip | 50-54 | 219 |
Head and neck excluding lip | 55-59 | 219 |
Head and neck excluding lip | 60-64 | 219 |
Head and neck excluding lip | 65-69 | 219 |
Head and neck excluding lip | 70-74 | 219 |
Head and neck excluding lip | 75-79 | 219 |
Head and neck excluding lip | 80-84 | 219 |
Head and neck excluding lip | 85+ | 219 |
Head and neck including lip | 0-4 | 219 |
Head and neck including lip | 10-14 | 219 |
Head and neck including lip | 15-19 | 219 |
Head and neck including lip | 20-24 | 219 |
Head and neck including lip | 25-29 | 219 |
Head and neck including lip | 30-34 | 219 |
Head and neck including lip | 35-39 | 219 |
Head and neck including lip | 40-44 | 219 |
Head and neck including lip | 45-49 | 219 |
Head and neck including lip | 5-9 | 219 |
Head and neck including lip | 50-54 | 219 |
Head and neck including lip | 55-59 | 219 |
Head and neck including lip | 60-64 | 219 |
Head and neck including lip | 65-69 | 219 |
Head and neck including lip | 70-74 | 219 |
Head and neck including lip | 75-79 | 219 |
Head and neck including lip | 80-84 | 219 |
Head and neck including lip | 85+ | 219 |
Hodgkin lymphoma | 0-4 | 219 |
Hodgkin lymphoma | 10-14 | 219 |
Hodgkin lymphoma | 15-19 | 219 |
Hodgkin lymphoma | 20-24 | 219 |
Hodgkin lymphoma | 25-29 | 219 |
Hodgkin lymphoma | 30-34 | 219 |
Hodgkin lymphoma | 35-39 | 219 |
Hodgkin lymphoma | 40-44 | 219 |
Hodgkin lymphoma | 45-49 | 219 |
Hodgkin lymphoma | 5-9 | 219 |
Hodgkin lymphoma | 50-54 | 219 |
Hodgkin lymphoma | 55-59 | 219 |
Hodgkin lymphoma | 60-64 | 219 |
Hodgkin lymphoma | 65-69 | 219 |
Hodgkin lymphoma | 70-74 | 219 |
Hodgkin lymphoma | 75-79 | 219 |
Hodgkin lymphoma | 80-84 | 219 |
Hodgkin lymphoma | 85+ | 219 |
Kidney cancer | 0-4 | 219 |
Kidney cancer | 10-14 | 219 |
Kidney cancer | 15-19 | 219 |
Kidney cancer | 20-24 | 219 |
Kidney cancer | 25-29 | 219 |
Kidney cancer | 30-34 | 219 |
Kidney cancer | 35-39 | 219 |
Kidney cancer | 40-44 | 219 |
Kidney cancer | 45-49 | 219 |
Kidney cancer | 5-9 | 219 |
Kidney cancer | 50-54 | 219 |
Kidney cancer | 55-59 | 219 |
Kidney cancer | 60-64 | 219 |
Kidney cancer | 65-69 | 219 |
Kidney cancer | 70-74 | 219 |
Kidney cancer | 75-79 | 219 |
Kidney cancer | 80-84 | 219 |
Kidney cancer | 85+ | 219 |
Laryngeal cancer | 0-4 | 219 |
Laryngeal cancer | 10-14 | 219 |
Laryngeal cancer | 15-19 | 219 |
Laryngeal cancer | 20-24 | 219 |
Laryngeal cancer | 25-29 | 219 |
Laryngeal cancer | 30-34 | 219 |
Laryngeal cancer | 35-39 | 219 |
Laryngeal cancer | 40-44 | 219 |
Laryngeal cancer | 45-49 | 219 |
Laryngeal cancer | 5-9 | 219 |
Laryngeal cancer | 50-54 | 219 |
Laryngeal cancer | 55-59 | 219 |
Laryngeal cancer | 60-64 | 219 |
Laryngeal cancer | 65-69 | 219 |
Laryngeal cancer | 70-74 | 219 |
Laryngeal cancer | 75-79 | 219 |
Laryngeal cancer | 80-84 | 219 |
Laryngeal cancer | 85+ | 219 |
Liver cancer | 0-4 | 219 |
Liver cancer | 10-14 | 219 |
Liver cancer | 15-19 | 219 |
Liver cancer | 20-24 | 219 |
Liver cancer | 25-29 | 219 |
Liver cancer | 30-34 | 219 |
Liver cancer | 35-39 | 219 |
Liver cancer | 40-44 | 219 |
Liver cancer | 45-49 | 219 |
Liver cancer | 5-9 | 219 |
Liver cancer | 50-54 | 219 |
Liver cancer | 55-59 | 219 |
Liver cancer | 60-64 | 219 |
Liver cancer | 65-69 | 219 |
Liver cancer | 70-74 | 219 |
Liver cancer | 75-79 | 219 |
Liver cancer | 80-84 | 219 |
Liver cancer | 85+ | 219 |
Lung cancer | 0-4 | 219 |
Lung cancer | 10-14 | 219 |
Lung cancer | 15-19 | 219 |
Lung cancer | 20-24 | 219 |
Lung cancer | 25-29 | 219 |
Lung cancer | 30-34 | 219 |
Lung cancer | 35-39 | 219 |
Lung cancer | 40-44 | 219 |
Lung cancer | 45-49 | 219 |
Lung cancer | 5-9 | 219 |
Lung cancer | 50-54 | 219 |
Lung cancer | 55-59 | 219 |
Lung cancer | 60-64 | 219 |
Lung cancer | 65-69 | 219 |
Lung cancer | 70-74 | 219 |
Lung cancer | 75-79 | 219 |
Lung cancer | 80-84 | 219 |
Lung cancer | 85+ | 219 |
Melanoma of the skin | 0-4 | 219 |
Melanoma of the skin | 10-14 | 219 |
Melanoma of the skin | 15-19 | 219 |
Melanoma of the skin | 20-24 | 219 |
Melanoma of the skin | 25-29 | 219 |
Melanoma of the skin | 30-34 | 219 |
Melanoma of the skin | 35-39 | 219 |
Melanoma of the skin | 40-44 | 219 |
Melanoma of the skin | 45-49 | 219 |
Melanoma of the skin | 5-9 | 219 |
Melanoma of the skin | 50-54 | 219 |
Melanoma of the skin | 55-59 | 219 |
Melanoma of the skin | 60-64 | 219 |
Melanoma of the skin | 65-69 | 219 |
Melanoma of the skin | 70-74 | 219 |
Melanoma of the skin | 75-79 | 219 |
Melanoma of the skin | 80-84 | 219 |
Melanoma of the skin | 85+ | 219 |
Mesothelioma | 0-4 | 219 |
Mesothelioma | 10-14 | 219 |
Mesothelioma | 15-19 | 219 |
Mesothelioma | 20-24 | 219 |
Mesothelioma | 25-29 | 219 |
Mesothelioma | 30-34 | 219 |
Mesothelioma | 35-39 | 219 |
Mesothelioma | 40-44 | 219 |
Mesothelioma | 45-49 | 219 |
Mesothelioma | 5-9 | 219 |
Mesothelioma | 50-54 | 219 |
Mesothelioma | 55-59 | 219 |
Mesothelioma | 60-64 | 219 |
Mesothelioma | 65-69 | 219 |
Mesothelioma | 70-74 | 219 |
Mesothelioma | 75-79 | 219 |
Mesothelioma | 80-84 | 219 |
Mesothelioma | 85+ | 219 |
Myeloma | 0-4 | 219 |
Myeloma | 10-14 | 219 |
Myeloma | 15-19 | 219 |
Myeloma | 20-24 | 219 |
Myeloma | 25-29 | 219 |
Myeloma | 30-34 | 219 |
Myeloma | 35-39 | 219 |
Myeloma | 40-44 | 219 |
Myeloma | 45-49 | 219 |
Myeloma | 5-9 | 219 |
Myeloma | 50-54 | 219 |
Myeloma | 55-59 | 219 |
Myeloma | 60-64 | 219 |
Myeloma | 65-69 | 219 |
Myeloma | 70-74 | 219 |
Myeloma | 75-79 | 219 |
Myeloma | 80-84 | 219 |
Myeloma | 85+ | 219 |
Non-Hodgkin lymphoma | 0-4 | 219 |
Non-Hodgkin lymphoma | 10-14 | 219 |
Non-Hodgkin lymphoma | 15-19 | 219 |
Non-Hodgkin lymphoma | 20-24 | 219 |
Non-Hodgkin lymphoma | 25-29 | 219 |
Non-Hodgkin lymphoma | 30-34 | 219 |
Non-Hodgkin lymphoma | 35-39 | 219 |
Non-Hodgkin lymphoma | 40-44 | 219 |
Non-Hodgkin lymphoma | 45-49 | 219 |
Non-Hodgkin lymphoma | 5-9 | 219 |
Non-Hodgkin lymphoma | 50-54 | 219 |
Non-Hodgkin lymphoma | 55-59 | 219 |
Non-Hodgkin lymphoma | 60-64 | 219 |
Non-Hodgkin lymphoma | 65-69 | 219 |
Non-Hodgkin lymphoma | 70-74 | 219 |
Non-Hodgkin lymphoma | 75-79 | 219 |
Non-Hodgkin lymphoma | 80-84 | 219 |
Non-Hodgkin lymphoma | 85+ | 219 |
Non-melanoma skin cancer, all types | 0-4 | 132 |
Non-melanoma skin cancer, all types | 10-14 | 132 |
Non-melanoma skin cancer, all types | 15-19 | 132 |
Non-melanoma skin cancer, all types | 20-24 | 132 |
Non-melanoma skin cancer, all types | 25-29 | 132 |
Non-melanoma skin cancer, all types | 30-34 | 132 |
Non-melanoma skin cancer, all types | 35-39 | 132 |
Non-melanoma skin cancer, all types | 40-44 | 132 |
Non-melanoma skin cancer, all types | 45-49 | 132 |
Non-melanoma skin cancer, all types | 5-9 | 132 |
Non-melanoma skin cancer, all types | 50-54 | 132 |
Non-melanoma skin cancer, all types | 55-59 | 132 |
Non-melanoma skin cancer, all types | 60-64 | 132 |
Non-melanoma skin cancer, all types | 65-69 | 132 |
Non-melanoma skin cancer, all types | 70-74 | 132 |
Non-melanoma skin cancer, all types | 75-79 | 132 |
Non-melanoma skin cancer, all types | 80-84 | 132 |
Non-melanoma skin cancer, all types | 85+ | 132 |
Non-melanoma skin cancer, rare types | 0-4 | 30 |
Non-melanoma skin cancer, rare types | 10-14 | 30 |
Non-melanoma skin cancer, rare types | 15-19 | 30 |
Non-melanoma skin cancer, rare types | 20-24 | 30 |
Non-melanoma skin cancer, rare types | 25-29 | 30 |
Non-melanoma skin cancer, rare types | 30-34 | 30 |
Non-melanoma skin cancer, rare types | 35-39 | 30 |
Non-melanoma skin cancer, rare types | 40-44 | 30 |
Non-melanoma skin cancer, rare types | 45-49 | 30 |
Non-melanoma skin cancer, rare types | 5-9 | 30 |
Non-melanoma skin cancer, rare types | 50-54 | 30 |
Non-melanoma skin cancer, rare types | 55-59 | 30 |
Non-melanoma skin cancer, rare types | 60-64 | 30 |
Non-melanoma skin cancer, rare types | 65-69 | 30 |
Non-melanoma skin cancer, rare types | 70-74 | 30 |
Non-melanoma skin cancer, rare types | 75-79 | 30 |
Non-melanoma skin cancer, rare types | 80-84 | 30 |
Non-melanoma skin cancer, rare types | 85+ | 30 |
Oesophageal cancer | 0-4 | 219 |
Oesophageal cancer | 10-14 | 219 |
Oesophageal cancer | 15-19 | 219 |
Oesophageal cancer | 20-24 | 219 |
Oesophageal cancer | 25-29 | 219 |
Oesophageal cancer | 30-34 | 219 |
Oesophageal cancer | 35-39 | 219 |
Oesophageal cancer | 40-44 | 219 |
Oesophageal cancer | 45-49 | 219 |
Oesophageal cancer | 5-9 | 219 |
Oesophageal cancer | 50-54 | 219 |
Oesophageal cancer | 55-59 | 219 |
Oesophageal cancer | 60-64 | 219 |
Oesophageal cancer | 65-69 | 219 |
Oesophageal cancer | 70-74 | 219 |
Oesophageal cancer | 75-79 | 219 |
Oesophageal cancer | 80-84 | 219 |
Oesophageal cancer | 85+ | 219 |
Ovarian cancer | 0-4 | 73 |
Ovarian cancer | 10-14 | 73 |
Ovarian cancer | 15-19 | 73 |
Ovarian cancer | 20-24 | 73 |
Ovarian cancer | 25-29 | 73 |
Ovarian cancer | 30-34 | 73 |
Ovarian cancer | 35-39 | 73 |
Ovarian cancer | 40-44 | 73 |
Ovarian cancer | 45-49 | 73 |
Ovarian cancer | 5-9 | 73 |
Ovarian cancer | 50-54 | 73 |
Ovarian cancer | 55-59 | 73 |
Ovarian cancer | 60-64 | 73 |
Ovarian cancer | 65-69 | 73 |
Ovarian cancer | 70-74 | 73 |
Ovarian cancer | 75-79 | 73 |
Ovarian cancer | 80-84 | 73 |
Ovarian cancer | 85+ | 73 |
Pancreatic cancer | 0-4 | 219 |
Pancreatic cancer | 10-14 | 219 |
Pancreatic cancer | 15-19 | 219 |
Pancreatic cancer | 20-24 | 219 |
Pancreatic cancer | 25-29 | 219 |
Pancreatic cancer | 30-34 | 219 |
Pancreatic cancer | 35-39 | 219 |
Pancreatic cancer | 40-44 | 219 |
Pancreatic cancer | 45-49 | 219 |
Pancreatic cancer | 5-9 | 219 |
Pancreatic cancer | 50-54 | 219 |
Pancreatic cancer | 55-59 | 219 |
Pancreatic cancer | 60-64 | 219 |
Pancreatic cancer | 65-69 | 219 |
Pancreatic cancer | 70-74 | 219 |
Pancreatic cancer | 75-79 | 219 |
Pancreatic cancer | 80-84 | 219 |
Pancreatic cancer | 85+ | 219 |
Prostate cancer | 0-4 | 73 |
Prostate cancer | 10-14 | 73 |
Prostate cancer | 15-19 | 73 |
Prostate cancer | 20-24 | 73 |
Prostate cancer | 25-29 | 73 |
Prostate cancer | 30-34 | 73 |
Prostate cancer | 35-39 | 73 |
Prostate cancer | 40-44 | 73 |
Prostate cancer | 45-49 | 73 |
Prostate cancer | 5-9 | 73 |
Prostate cancer | 50-54 | 73 |
Prostate cancer | 55-59 | 73 |
Prostate cancer | 60-64 | 73 |
Prostate cancer | 65-69 | 73 |
Prostate cancer | 70-74 | 73 |
Prostate cancer | 75-79 | 73 |
Prostate cancer | 80-84 | 73 |
Prostate cancer | 85+ | 73 |
Rectal cancer | 0-4 | 219 |
Rectal cancer | 10-14 | 219 |
Rectal cancer | 15-19 | 219 |
Rectal cancer | 20-24 | 219 |
Rectal cancer | 25-29 | 219 |
Rectal cancer | 30-34 | 219 |
Rectal cancer | 35-39 | 219 |
Rectal cancer | 40-44 | 219 |
Rectal cancer | 45-49 | 219 |
Rectal cancer | 5-9 | 219 |
Rectal cancer | 50-54 | 219 |
Rectal cancer | 55-59 | 219 |
Rectal cancer | 60-64 | 219 |
Rectal cancer | 65-69 | 219 |
Rectal cancer | 70-74 | 219 |
Rectal cancer | 75-79 | 219 |
Rectal cancer | 80-84 | 219 |
Rectal cancer | 85+ | 219 |
Stomach cancer | 0-4 | 219 |
Stomach cancer | 10-14 | 219 |
Stomach cancer | 15-19 | 219 |
Stomach cancer | 20-24 | 219 |
Stomach cancer | 25-29 | 219 |
Stomach cancer | 30-34 | 219 |
Stomach cancer | 35-39 | 219 |
Stomach cancer | 40-44 | 219 |
Stomach cancer | 45-49 | 219 |
Stomach cancer | 5-9 | 219 |
Stomach cancer | 50-54 | 219 |
Stomach cancer | 55-59 | 219 |
Stomach cancer | 60-64 | 219 |
Stomach cancer | 65-69 | 219 |
Stomach cancer | 70-74 | 219 |
Stomach cancer | 75-79 | 219 |
Stomach cancer | 80-84 | 219 |
Stomach cancer | 85+ | 219 |
Testicular cancer | 0-4 | 73 |
Testicular cancer | 10-14 | 73 |
Testicular cancer | 15-19 | 73 |
Testicular cancer | 20-24 | 73 |
Testicular cancer | 25-29 | 73 |
Testicular cancer | 30-34 | 73 |
Testicular cancer | 35-39 | 73 |
Testicular cancer | 40-44 | 73 |
Testicular cancer | 45-49 | 73 |
Testicular cancer | 5-9 | 73 |
Testicular cancer | 50-54 | 73 |
Testicular cancer | 55-59 | 73 |
Testicular cancer | 60-64 | 73 |
Testicular cancer | 65-69 | 73 |
Testicular cancer | 70-74 | 73 |
Testicular cancer | 75-79 | 73 |
Testicular cancer | 80-84 | 73 |
Testicular cancer | 85+ | 73 |
Thyroid cancer | 0-4 | 219 |
Thyroid cancer | 10-14 | 219 |
Thyroid cancer | 15-19 | 219 |
Thyroid cancer | 20-24 | 219 |
Thyroid cancer | 25-29 | 219 |
Thyroid cancer | 30-34 | 219 |
Thyroid cancer | 35-39 | 219 |
Thyroid cancer | 40-44 | 219 |
Thyroid cancer | 45-49 | 219 |
Thyroid cancer | 5-9 | 219 |
Thyroid cancer | 50-54 | 219 |
Thyroid cancer | 55-59 | 219 |
Thyroid cancer | 60-64 | 219 |
Thyroid cancer | 65-69 | 219 |
Thyroid cancer | 70-74 | 219 |
Thyroid cancer | 75-79 | 219 |
Thyroid cancer | 80-84 | 219 |
Thyroid cancer | 85+ | 219 |
Tongue cancer | 0-4 | 219 |
Tongue cancer | 10-14 | 219 |
Tongue cancer | 15-19 | 219 |
Tongue cancer | 20-24 | 219 |
Tongue cancer | 25-29 | 219 |
Tongue cancer | 30-34 | 219 |
Tongue cancer | 35-39 | 219 |
Tongue cancer | 40-44 | 219 |
Tongue cancer | 45-49 | 219 |
Tongue cancer | 5-9 | 219 |
Tongue cancer | 50-54 | 219 |
Tongue cancer | 55-59 | 219 |
Tongue cancer | 60-64 | 219 |
Tongue cancer | 65-69 | 219 |
Tongue cancer | 70-74 | 219 |
Tongue cancer | 75-79 | 219 |
Tongue cancer | 80-84 | 219 |
Tongue cancer | 85+ | 219 |
Unknown primary site | 0-4 | 219 |
Unknown primary site | 10-14 | 219 |
Unknown primary site | 15-19 | 219 |
Unknown primary site | 20-24 | 219 |
Unknown primary site | 25-29 | 219 |
Unknown primary site | 30-34 | 219 |
Unknown primary site | 35-39 | 219 |
Unknown primary site | 40-44 | 219 |
Unknown primary site | 45-49 | 219 |
Unknown primary site | 5-9 | 219 |
Unknown primary site | 50-54 | 219 |
Unknown primary site | 55-59 | 219 |
Unknown primary site | 60-64 | 219 |
Unknown primary site | 65-69 | 219 |
Unknown primary site | 70-74 | 219 |
Unknown primary site | 75-79 | 219 |
Unknown primary site | 80-84 | 219 |
Unknown primary site | 85+ | 219 |
Uterine cancer | 0-4 | 73 |
Uterine cancer | 10-14 | 73 |
Uterine cancer | 15-19 | 73 |
Uterine cancer | 20-24 | 73 |
Uterine cancer | 25-29 | 73 |
Uterine cancer | 30-34 | 73 |
Uterine cancer | 35-39 | 73 |
Uterine cancer | 40-44 | 73 |
Uterine cancer | 45-49 | 73 |
Uterine cancer | 5-9 | 73 |
Uterine cancer | 50-54 | 73 |
Uterine cancer | 55-59 | 73 |
Uterine cancer | 60-64 | 73 |
Uterine cancer | 65-69 | 73 |
Uterine cancer | 70-74 | 73 |
Uterine cancer | 75-79 | 73 |
Uterine cancer | 80-84 | 73 |
Uterine cancer | 85+ | 73 |
sample_n()
helper
The sample_n()
function is used to randomly select a specified number of rows from a data frame.
Example 1: Sample 10 rows from the cancer_mort dataset.
|> sample_n(10) cancer_mort
Output
Year | Sex | Type | Cancer_Type | Age | Count |
---|---|---|---|---|---|
2009 | Male | Incidence | Laryngeal cancer | 55-59 | 64 |
1974 | Female | Mortality | Stomach cancer | 70-74 | 102 |
1985 | Persons | Incidence | Bowel cancer | 85+ | 531 |
1997 | Female | Mortality | Breast cancer | 65-69 | 289 |
1971 | Female | Mortality | Chronic myeloid leukaemia | 65-69 | 6 |
2002 | Male | Incidence | Anal cancer | 55-59 | 2 |
1970 | Female | Mortality | Bladder cancer | 70-74 | 16 |
2002 | Persons | Incidence | Chronic myeloid leukaemia | 40-44 | 17 |
1985 | Male | Mortality | Chronic myeloid leukaemia | 25-29 | 4 |
1996 | Male | Incidence | Liver cancer | 60-64 | 45 |
Example 2: Sample 3 rows randomly from each cancer type.
|>
cancer_mort group_by(Cancer_Type) |>
sample_n(3)
Output
Year | Sex | Type | Cancer_Type | Age | Count |
---|---|---|---|---|---|
2004 | Male | Mortality | Acute lymphoblastic leukaemia | 75-79 | 7 |
1993 | Male | Incidence | Acute lymphoblastic leukaemia | 30-34 | 4 |
1995 | Male | Mortality | Acute lymphoblastic leukaemia | 35-39 | 5 |
1972 | Male | Mortality | Acute myeloid leukaemia | 55-59 | 15 |
1976 | Female | Mortality | Acute myeloid leukaemia | 85+ | 5 |
1983 | Male | Incidence | Acute myeloid leukaemia | 50-54 | 14 |
1983 | Persons | Mortality | Anal cancer | 10-14 | 0 |
1987 | Male | Incidence | Anal cancer | 15-19 | 0 |
1978 | Male | Mortality | Anal cancer | 70-74 | 0 |
1990 | Persons | Mortality | Bladder cancer | 50-54 | 17 |
2001 | Persons | Incidence | Bladder cancer | 55-59 | 128 |
1990 | Male | Mortality | Bladder cancer | 40-44 | 1 |
1993 | Persons | Mortality | Bowel cancer | 50-54 | 198 |
1990 | Persons | Mortality | Bowel cancer | 70-74 | 630 |
1985 | Female | Mortality | Bowel cancer | 80-84 | 263 |
1997 | Persons | Incidence | Brain cancer | 85+ | 36 |
2002 | Male | Mortality | Brain cancer | 25-29 | 10 |
1982 | Persons | Mortality | Brain cancer | 65-69 | 103 |
2001 | Female | Mortality | Breast cancer | 70-74 | 309 |
1970 | Persons | Mortality | Breast cancer | 0-4 | 0 |
1995 | Female | Mortality | Breast cancer | 45-49 | 214 |
1971 | Female | Mortality | Cervical cancer | 30-34 | 8 |
2001 | Female | Incidence | Cervical cancer | 80-84 | 41 |
1997 | Female | Mortality | Cervical cancer | 50-54 | 19 |
1969 | Male | Mortality | Chronic lymphocytic leukaemia | 75-79 | 12 |
1986 | Male | Mortality | Chronic lymphocytic leukaemia | 15-19 | 0 |
2008 | Female | Mortality | Chronic lymphocytic leukaemia | 70-74 | 12 |
1988 | Male | Incidence | Chronic myeloid leukaemia | 45-49 | 2 |
1982 | Persons | Incidence | Chronic myeloid leukaemia | 25-29 | 6 |
1978 | Persons | Mortality | Chronic myeloid leukaemia | 50-54 | 8 |
1995 | Persons | Incidence | Colon cancer | 40-44 | 125 |
1985 | Male | Mortality | Colon cancer | 5-9 | 0 |
1969 | Male | Mortality | Colon cancer | 30-34 | 6 |
1997 | Persons | Mortality | Head and neck excluding lip | 20-24 | 1 |
2003 | Female | Mortality | Head and neck excluding lip | 25-29 | 0 |
1972 | Male | Mortality | Head and neck excluding lip | 45-49 | 27 |
1991 | Persons | Incidence | Head and neck including lip | 10-14 | 1 |
1971 | Male | Mortality | Head and neck including lip | 85+ | 18 |
1970 | Female | Mortality | Head and neck including lip | 35-39 | 3 |
2003 | Female | Mortality | Hodgkin lymphoma | 65-69 | 3 |
1983 | Male | Mortality | Hodgkin lymphoma | 60-64 | 5 |
1970 | Female | Mortality | Hodgkin lymphoma | 85+ | 3 |
1989 | Female | Mortality | Kidney cancer | 40-44 | 2 |
1994 | Male | Incidence | Kidney cancer | 50-54 | 71 |
1989 | Persons | Mortality | Kidney cancer | 25-29 | 1 |
1986 | Male | Incidence | Laryngeal cancer | 15-19 | 0 |
2006 | Male | Mortality | Laryngeal cancer | 25-29 | 0 |
2007 | Persons | Incidence | Laryngeal cancer | 25-29 | 1 |
1997 | Male | Mortality | Liver cancer | 30-34 | 1 |
1995 | Female | Mortality | Liver cancer | 15-19 | 1 |
1990 | Female | Incidence | Liver cancer | 35-39 | 4 |
1981 | Persons | Mortality | Lung cancer | 20-24 | 0 |
1985 | Male | Incidence | Lung cancer | 55-59 | 567 |
1986 | Male | Mortality | Lung cancer | 45-49 | 86 |
2010 | Persons | Incidence | Melanoma of the skin | 85+ | 803 |
2002 | Female | Incidence | Melanoma of the skin | 20-24 | 93 |
2002 | Persons | Incidence | Melanoma of the skin | 55-59 | 1013 |
1991 | Male | Incidence | Mesothelioma | 65-69 | 42 |
1992 | Female | Mortality | Mesothelioma | 60-64 | 0 |
1974 | Male | Mortality | Mesothelioma | 80-84 | 0 |
1973 | Persons | Mortality | Myeloma | 60-64 | 36 |
1969 | Male | Mortality | Myeloma | 35-39 | 1 |
1982 | Persons | Mortality | Myeloma | 15-19 | 1 |
1982 | Persons | Mortality | Non-Hodgkin lymphoma | 40-44 | 14 |
1995 | Male | Incidence | Non-Hodgkin lymphoma | 40-44 | 77 |
2002 | Persons | Incidence | Non-Hodgkin lymphoma | 10-14 | 17 |
1998 | Male | Mortality | Non-melanoma skin cancer, all types | 75-79 | 46 |
1994 | Female | Mortality | Non-melanoma skin cancer, all types | 0-4 | 0 |
1984 | Female | Mortality | Non-melanoma skin cancer, all types | 85+ | 29 |
2009 | Female | Incidence | Non-melanoma skin cancer, rare types | 45-49 | 7 |
2005 | Male | Incidence | Non-melanoma skin cancer, rare types | 0-4 | 1 |
2001 | Persons | Incidence | Non-melanoma skin cancer, rare types | 75-79 | 95 |
1994 | Male | Mortality | Oesophageal cancer | 15-19 | 0 |
2003 | Male | Incidence | Oesophageal cancer | 5-9 | 0 |
1985 | Persons | Mortality | Oesophageal cancer | 85+ | 55 |
1993 | Female | Mortality | Ovarian cancer | 15-19 | 2 |
1983 | Female | Mortality | Ovarian cancer | 75-79 | 60 |
2007 | Female | Mortality | Ovarian cancer | 10-14 | 0 |
1984 | Male | Incidence | Pancreatic cancer | 60-64 | 95 |
1989 | Persons | Incidence | Pancreatic cancer | 85+ | 113 |
2002 | Persons | Incidence | Pancreatic cancer | 50-54 | 92 |
1968 | Male | Mortality | Prostate cancer | 25-29 | 0 |
1995 | Male | Mortality | Prostate cancer | 15-19 | 0 |
1979 | Male | Mortality | Prostate cancer | 35-39 | 0 |
1987 | Persons | Incidence | Rectal cancer | 50-54 | 194 |
1995 | Female | Mortality | Rectal cancer | 35-39 | 0 |
1974 | Male | Mortality | Rectal cancer | 5-9 | 0 |
1990 | Persons | Mortality | Stomach cancer | 25-29 | 2 |
1987 | Female | Mortality | Stomach cancer | 70-74 | 72 |
1968 | Male | Mortality | Stomach cancer | 50-54 | 55 |
2002 | Male | Mortality | Testicular cancer | 65-69 | 1 |
2005 | Male | Incidence | Testicular cancer | 80-84 | 1 |
2000 | Male | Mortality | Testicular cancer | 30-34 | 2 |
1986 | Persons | Mortality | Thyroid cancer | 75-79 | 10 |
2009 | Male | Mortality | Thyroid cancer | 30-34 | 0 |
2009 | Female | Incidence | Thyroid cancer | 35-39 | 147 |
1984 | Persons | Incidence | Tongue cancer | 75-79 | 21 |
1970 | Persons | Mortality | Tongue cancer | 25-29 | 0 |
1971 | Persons | Mortality | Tongue cancer | 20-24 | 0 |
1990 | Male | Incidence | Unknown primary site | 70-74 | 232 |
1998 | Female | Incidence | Unknown primary site | 70-74 | 200 |
2011 | Male | Mortality | Unknown primary site | 85+ | 244 |
2009 | Female | Mortality | Uterine cancer | 30-34 | 1 |
1989 | Female | Mortality | Uterine cancer | 0-4 | 0 |
1987 | Female | Incidence | Uterine cancer | 35-39 | 15 |
Try out the following examples by yourself first.
- Find the total number of male deaths in each year.
To find the total number of male deaths in each year, we begin by filtering out the rows where the Sex column contains “Male” and the type column conatins “Mortality”, as we are only interested in male deaths.
|> filter(Sex == "Male" & Type == "Mortality") cancer_mort
Since we need to compute the total number of deaths in each year, we group this filtered data frame by year. This will create a grouping for each year. This grouping allows us to compute the total number of deaths (i.e., Counts) for each year.
|>
cancer_mort filter(Sex == "Male" & Type == "Mortality") |>
group_by(Year)
Putting these operations/functions together, we obtain the final answer:
Check the Answer
|>
cancer_mort filter(Sex == "Male" & Type == "Mortality") |>
group_by(Year) |>
summarise(tot_male_deaths = sum(Count))
- Find the top three cancer types and the age group with the highest average cancer incidences reported across all years.
Hints:
- Since we are concerned only about cancer incidences, first filter the dataset using
filter()
function to include only the rows with cancer incidences. - Next, compute the average of cancer incidences reported across all years. If we use
summarise()
function next, we will have a single average value for the whole data frame. However, this question asks us to find the top three cancer types and age group, which means we need to compute the average after grouping the rows based on cancer type and age group. This can be achieved using thegroup_by()
function with multiple columns as arguments (e.g.,group_by(col1, col2)
). - Once grouped, use
summarise()
function to compute the mean across these groupings. - Since we’re interested in only the top three highest values, chain the result of the above functions to the
arrange()
function to sort the averages in descending order.
Check the Answer
|>
cancer_mort filter(Type == "Incidence") |>
group_by(Cancer_Type, Age) |>
summarise(average = mean(Count)) |>
arrange(desc(average))
- Find the year, cancer type and age group with the highest cancer count observed among age groups above 29.
Hints:
- Start by selecting the columns of interest using the
select()
function. Remember you need to include four columns. - Next, filter the rows to include only those with age groups above 29. We can use the
%in%
operator with a vector of age groups to filter rows using thefilter()
function. For example:filter(Age %in% c("30-34", "35-39", "40-44", "45-49", ..., "85+")
. This is time consuming and laborious as we need to type 10+ age groups. There are two ways to make this step easier:- Using the
unique()
function: This function returns a vector/data frame with duplicate elements removed. In other words, it returns the unique elements. Find the unique elements of the age column (i.e.,unique(cancer_mort$Age)
) and then create a char vector manually by copying and pasting the age groups of interest. - Using
!
operator: The easiest way is to select all the age groups that is less than or equal to 29 and then use the!
(NOT) operator to negate the logical vector. For example:filter(!Age %in% c('0-4', '5-9', '10-14','15-19', '20-24', '25-29'))
This will select the rows that does not contain age groups in the given vector.
- Using the
- Finnaly, sort the resulting data frame to find the highest cancer count.
Check the Answer
|>
cancer_mort filter(!Age %in% c('0-4', '5-9', '10-14','15-19', '20-24', '25-29')) |>
arrange(desc(Count))
- Find the minimum, maximum, number of observations and quartile statistics for each cancer type among women, men and other genders.
Check the Answer
|>
cancer_mort group_by(Cancer_Type, Sex) |>
summarise(minimum_count = min(Count),
quartile_1 = quantile(Count, probs = 0.25),
quartile_2 = median(Count),
quartile_3 = quantile(Count, probs = 0.75),
maximum_count = max(Count),
n_count = n())
- Find the percentage of deaths attributed to each type of cancer as a function of the total number of deaths.
Check the Answer
|>
cancer_mort filter(Type == "Incidence") |>
group_by(Cancer_Type) |>
summarise(tot_deaths = sum(Count)) |>
mutate(percent_deaths = (tot_deaths * 100)/sum(tot_deaths))
Cancer_Type | tot_deaths | percent_deaths |
---|---|---|
Acute lymphoblastic leukaemia | 15858 | 0.3325140 |
Acute myeloid leukaemia | 38964 | 0.8170057 |
Anal cancer | 3124 | 0.0655047 |
Bladder cancer | 125949 | 2.6409264 |
Bowel cancer | 628013 | 13.1683148 |
Brain cancer | 70161 | 1.4711513 |
Breast cancer | 566083 | 11.8697530 |
Cervical cancer | 26235 | 0.5501013 |
Chronic lymphocytic leukaemia | 45438 | 0.9527540 |
Chronic myeloid leukaemia | 15706 | 0.3293269 |
Colon cancer | 413957 | 8.6799415 |
Head and neck excluding lip | 137843 | 2.8903224 |
Head and neck including lip | 190334 | 3.9909652 |
Hodgkin lymphoma | 23429 | 0.4912644 |
Kidney cancer | 99109 | 2.0781393 |
Laryngeal cancer | 33924 | 0.7113259 |
Liver cancer | 39281 | 0.8236527 |
Lung cancer | 460393 | 9.6536218 |
Melanoma of the skin | 444228 | 9.3146705 |
Mesothelioma | 24668 | 0.5172441 |
Myeloma | 55763 | 1.1692509 |
Non-Hodgkin lymphoma | 173280 | 3.6333732 |
Non-melanoma skin cancer, rare types | 13417 | 0.2813306 |
Oesophageal cancer | 56338 | 1.1813076 |
Ovarian cancer | 31461 | 0.6596812 |
Pancreatic cancer | 102146 | 2.1418198 |
Prostate cancer | 314027 | 6.5845873 |
Rectal cancer | 214052 | 4.4882895 |
Stomach cancer | 109302 | 2.2918684 |
Testicular cancer | 15211 | 0.3189476 |
Thyroid cancer | 57611 | 1.2080001 |
Tongue cancer | 24497 | 0.5136585 |
Unknown primary site | 157731 | 3.3073383 |
Uterine cancer | 41589 | 0.8720473 |
Additional dplyr
functions
In this section, we’ll explore several additional functions from the dplyr
package. We’ll demonstrate these functions using both the cms_data and cancer_mort datasets.
across()
function
This function allows you to apply a transformation or calculation across multiple columns of a data frame. It is often used with functions like mutate()
or summarise()
, enabling you to perform the same operation on multiple columns at once.
# find the mean and standard deviation of star rating and overall rating for each hospital type.
|>
cms_data group_by(hospital_type) |>
summarise(across(
c(star_rating, overall_rating),
list(mean = mean, sd = sd)))
distinct
function
This function is used to select unique rows from a data frame, removing any duplicate rows.
# find the unique or distinct cancer types
|> distinct(Cancer_Type) cancer_mort
slice()
function
This function extracts specific rows from a data frame based on their position. You can specify the row numbers or a range of row numbers to extract.
# select a range of rows
|> slice(2:5)
cancer_mort
# drop rows with negative indices
|> slice(-(10:n())) cancer_mort
slice_head()/slice_tail()
functions
The slice_head()
function extracts the first few rows (specified by a number) and the slice_tail()
function extracts the last few rows (specified by a number) from a data frame.
# similar to head(3)
|> slice_head(n = 3)
cancer_mort
# similar to tail(4)
|> slice_tail(n = 4) cancer_mort
slice_min()/slice_max()
function
The slice_min()
function extracts the rows with the minimum values of a specified variable and the slice_max()
function extracts the rows with the maximum values of a specified variable from a data frame.
# 5 rows with the minimum overall rating
|> slice_min(overall_rating, n = 5)
cms_data
# 4 rows with the maximum 5 overall rating
|> slice_max(overall_rating, n = 6) cms_data
slice_sample()
function
This function randomly samples a specified number of rows from a data frame.
# randomly select 5 rows
|> slice_sample(n = 5)
cancer_mort
# randomly select 3 rows from each groupings
|> group_by(Age) |> slice_sample(n = 3) cancer_mort
add_row()
function
This function adds one or more rows to a data frame. You can specify the values for each column of the new rows.
# add a row before the 3rd row. Missing values are entered as NA
|> add_row(Year = 2024, Type = "Mortality", Count = 100, .before = 3) cancer_mort
relocate()
function
This function allows you to change the position of columns within a data frame. You can specify the target position where you want to move the column to.
# move the Cancer type column to be positioned after the Year column
|> relocate(Cancer_Type, .after = Year) cancer_mort
Joining Data Frames
Often, data originates from various sources or files, and the need arises to consolidate them for analysis. These datasets, when merged, are often referred to as relational data due to the inherent relationships between them that we aim to leverage. Within the tidyverse framework, this process of merging related data is termed joining. Here, we combine data from multiple datasets based on a common variable or set of variables.
For illustration purposes, let’s use the following two data frames in the subsequent examples.
= data.frame(
left key1 = c("K0", "K1", "K2", "K3"),
A = c("A0", "A1", "A2", "A3"),
B = c("B0", "B1", "B2", "B3"))
= data.frame(
right key1 = c("K0", "K1", "K1", "K4"),
C = c("C0", "C1", "C2", "C3"),
D = c("D0", "D1", "D2", "D3"))
There are several types of joins commonly used in R outlined below.
Left Join
A left join returns all rows from the left data frame (the first data frame specified) and matching rows from the right data frame (the second data frame specified). Non-matching rows in the right data frame have NULL (or NA) values in the result.
left
key1 | A | B |
---|---|---|
K0 | A0 | B0 |
K1 | A1 | B1 |
K2 | A2 | B2 |
K3 | A3 | B3 |
right
key1 | C | D |
---|---|---|
K0 | C0 | D0 |
K1 | C1 | D1 |
K1 | C2 | D2 |
K4 | C3 | D3 |
left_join(left, right, by = "key1")
key1 | A | B | C | D |
---|---|---|---|---|
K0 | A0 | B0 | C0 | D0 |
K1 | A1 | B1 | C1 | D1 |
K1 | A1 | B1 | C2 | D2 |
K2 | A2 | B2 | NA | NA |
K3 | A3 | B3 | NA | NA |
Right join
A right join is similar to a left join but returns all rows from the right data frame and matching rows from the left data frame. Non-matching rows in the left data frame have NULL (or NA) values in the result.
left
key1 | A | B |
---|---|---|
K0 | A0 | B0 |
K1 | A1 | B1 |
K2 | A2 | B2 |
K3 | A3 | B3 |
right
key1 | C | D |
---|---|---|
K0 | C0 | D0 |
K1 | C1 | D1 |
K1 | C2 | D2 |
K4 | C3 | D3 |
right_join(left, right, by = "key1")
key1 | A | B | C | D |
---|---|---|---|---|
K0 | A0 | B0 | C0 | D0 |
K1 | A1 | B1 | C1 | D1 |
K1 | A1 | B1 | C2 | D2 |
K4 | NA | NA | C3 | D3 |
Inner join
An inner join returns only the rows with matching values in the specified columns (the common key). It combines data from two or more tables or DataFrames based on the intersection of keys, excluding rows that do not have corresponding matches in both tables.
left
key1 | A | B |
---|---|---|
K0 | A0 | B0 |
K1 | A1 | B1 |
K2 | A2 | B2 |
K3 | A3 | B3 |
right
key1 | C | D |
---|---|---|
K0 | C0 | D0 |
K1 | C1 | D1 |
K1 | C2 | D2 |
K4 | C3 | D3 |
inner_join(left, right, by = "key1")
key1 | A | B | C | D |
---|---|---|---|---|
K0 | A0 | B0 | C0 | D0 |
K1 | A1 | B1 | C1 | D1 |
K1 | A1 | B1 | C2 | D2 |
Full Join
A full join returns all rows from both datasets, filling in NA values for non-matching rows.
left
key1 | A | B |
---|---|---|
K0 | A0 | B0 |
K1 | A1 | B1 |
K2 | A2 | B2 |
K3 | A3 | B3 |
right
key1 | C | D |
---|---|---|
K0 | C0 | D0 |
K1 | C1 | D1 |
K1 | C2 | D2 |
K4 | C3 | D3 |
full_join(left, right, by = "key1")
key1 | A | B | C | D |
---|---|---|---|---|
K0 | A0 | B0 | C0 | D0 |
K1 | A1 | B1 | C1 | D1 |
K1 | A1 | B1 | C2 | D2 |
K2 | A2 | B2 | NA | NA |
K3 | A3 | B3 | NA | NA |
K4 | NA | NA | C3 | D3 |
Binding multiple data frames
The dplyr
package provides convenient functions forbinding data frames by row or column to combine one or more data frames into one data frame.
For illustration purposes, let’s use the following data frames in the subsequent examples.
= data.frame(
df1 ID = 1:3,
name = c("Eva", "Charlie", "John"),
occupation = c("Doctor", "Nurse", "Manager"))
= data.frame(
df2 ID = 4:6,
name = c("Peter", "Jane", "Alice"),
occupation = c("Radiographer", "Therapist", "Consultant"))
= data.frame(
df3 location = c("Parkville", "Box Hill", "East Melbourne"),
grade = c("A", "A", "C"))
- Binding by Row:
- To bind data frames by row, you can use the
bind_rows()
function. - This function stacks data frames on top of each other, matching columns by name.
- To bind data frames by row, you can use the
df1
ID | name | occupation |
---|---|---|
1 | Eva | Doctor |
2 | Charlie | Nurse |
3 | John | Manager |
df2
ID | name | occupation |
---|---|---|
4 | Peter | Radiographer |
5 | Jane | Therapist |
6 | Alice | Consultant |
bind_rows(df1, df2)
ID | name | occupation |
---|---|---|
1 | Eva | Doctor |
2 | Charlie | Nurse |
3 | John | Manager |
4 | Peter | Radiographer |
5 | Jane | Therapist |
6 | Alice | Consultant |
- Binding by Column:
- To bind data frames by column, you can use the
bind_cols()
function fromdplyr
. - This function appends columns from one data frame to another, matching rows by position.
- To bind data frames by column, you can use the
df1
ID | name | occupation |
---|---|---|
1 | Eva | Doctor |
2 | Charlie | Nurse |
3 | John | Manager |
df3
location | grade |
---|---|
Parkville | A |
Box Hill | A |
East Melbourne | C |
bind_cols(df1, df3)
ID | name | occupation | location | grade |
---|---|---|---|---|
1 | Eva | Doctor | Parkville | A |
2 | Charlie | Nurse | Box Hill | A |
3 | John | Manager | East Melbourne | C |
That wraps up our exploration of data transformation using the dplyr
package. In the next section, we’ll delve into visualization with the ggplot2
package from the tidyverse suite. We’ll apply the functions we’ve learned in this section to generate various plots. Stay tuned for more!