Into the Tidyverse

Common tasks in working with data include actions like removing rows or columns, performing calculations, or adding new columns. This sort of operations is known as data manipulation. It is the process of cleaning, organizing, and transforming raw data into a more structured and usable format for analysis.

In this workshop, we’ll guide you through the process of data manipulation in R, starting with the tidyverse. The tidyverse is a collection of packages that align with a data science philosophy developed by Hadley Wickham and the RStudio team. Many users find it to be a more intuitive way to grasp R concepts. Although some tasks may be more straightforward in base-R, we’ll also highlight those aspects.

If you’ve already installed the tidyverse package (if not, you can do so by running the command: install.packages("tidyverse")), let’s proceed to load it into our R session:

library(tidyverse)

The following are key techniques outlined in Hadley Wickham and Garrett Grolemund’s book, R for Data Science:

This workshop focuses on:

This workflow is of utmost importance. Instead of constructing analyses based on the unconventional format of your data, take measures to tidy up your data. Tidy data enables the use of number of analytical and visualization tools. It eliminates the need to develop ad-hoc methods to accommodate your data. This not only saves time but also enhances the clarity and comprehensibility of your work, benefiting both your collaborators and, most importantly, your future self.

Step 1: Importing data

First we need to import data into our R session. This can be achieved either by using datasets bundled with R packages or by importing external data into our workspace for data manipulation. This is the first step in the tidyverse workflow.

Preloaded Data in R Packages

A great way to learn data science tools is using the data provided by R packages. The data() function is a convenient way to explore and import pre-loaded datasets that come bundled with the R environment.

library(tidyverse)
data()
View Output

To import pre-loaded datasets bundled with a particular package (e.g., tidyr):

data(package = "tidyr")
View Output

When you load a package the pre-loaded datasets automatically imported into you R environment. Therefore you can access them directly as follows.

household
Output
family dob_child1 dob_child2 name_child1 name_child2
1 1998-11-26 2000-01-29 Susan Jose
2 1996-06-22 NA Mark NA
3 2002-07-11 2004-04-05 Sam Seth
4 2004-10-10 2009-08-27 Craig Khai
5 2000-12-05 2005-02-28 Parker Gracie
?household
View Output
<!DOCTYPE html> R: Household data
household R Documentation

Household data

Description

This dataset is based on an example in vignette(“datatable-reshape”, package = “data.table”)

Usage

household

Format

A data frame with 5 rows and 5 columns:

family

Family identifier

dob_child1

Date of birth of first child

dob_child2

Date of birth of second child

name_child1

Name of first child

?

name_child2

Name of second child

Reading the Data

At some point, you want to apply what you’ve learned to your own data. In this section, you’ll learn the basics of reading data files into R using the readr package. The goal of readr is to provide a fast and friendly way to read rectangular data from delimited files, such as comma-separated values (CSV) and tab-separated values (TSV). It is designed to parse many types of data.

We will use the read_csv() function from readr package to import a dataset. (See also read.csv() in base R.) CSV short for Comma Separated Values, is a text format commonly used to store tabular data. Conventionally the first line contains column headings.

The first argument of the read_csv() function takes the path to the file (or a web link). The following code will work if the cms_hospital_patient_satisfaction_2016_sampled.csv file is in the data/patient_satisfaction path.

# here the first argument is a path
cms_data <- read_csv("data/cms_hospital_patient_satisfaction.csv")
Output
Rows: 15 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): ID, Facility Name, County, Hospital Type
dbl (4): Star Rating, No of Surveys, Response Rate, Overall Rating

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# here the first argument is a web link
mtvcars <- read_csv("https://github.com/tidyverse/readr/raw/main/inst/extdata/mtcars.csv")
Output
Rows: 32 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

This command prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains).

Other file types

Once you’ve mastered read_csv(), using other functions in the readr package is straightforward. It’s just a matter of knowing which function to use:

  • read_csv2() reads semicolon-separated files. These use ; instead of , to separate fields and are common in countries that use , as the decimal marker or thousands seperator.
  • read_tsv()reads tab-delimited files.
  • read_delim() reads in files with any delimiter, attempting to automatically guess the delimiter if you don’t specify it.

Exploring the Data

In the previous section we imported a dataset that is bundled with the tidyr package, into a dataframe named household. In the next section we read in a CSV file and created a data frame named cms_data. This section demonstrates different ways to get to know these two data objects.

The class() function is used to identify the data type or data structure of an object (or variable):

class(household)
class(cms_data)
Output
[1] "tbl_df"     "tbl"        "data.frame"
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 

When the name of the object (data frame) is typed, the first few lines along with some information, such as the number of rows are displayed:

cms_data
Output
ID Facility Name County Hospital Type Star Rating No of Surveys Response Rate Overall Rating
050424 SCRIPPS GREEN HOSPITAL SAN DIEGO Acute Care Hospital 4 3110 41 5
140103 ST BERNARD HOSPITAL COOK Acute Care Hospital 1 264 6 2
100051 SOUTH LAKE HOSPITAL LAKE Acute Care Hospital 2 1382 20 2
040062 MERCY HOSPITAL FORT SMITH SEBASTIAN Acute Care Hospital 3 2506 35 3
440048 BAPTIST MEMORIAL HOSPITAL SHELBY Acute Care Hospital 2 1799 18 2
450011 ST JOSEPH REGIONAL HEALTH CENTER BRAZOS Acute Care Hospital 3 1379 24 3
151317 GREENE COUNTY GENERAL HOSPITAL GREENE Critical Access Hospital 3 114 22 3
061327 SOUTHWEST MEMORIAL HOSPITAL MONTEZUMA Critical Access Hospital 4 247 34 3
490057 SENTARA GENERAL HOSPITAL VIRGINIA BEACH Acute Care Hospital 4 619 32 3
110215 PIEDMONT FAYETTE HOSPITAL FAYETTE Acute Care Hospital 2 1714 21 2
050704 MISSION COMMUNITY HOSPITAL LOS ANGELES Acute Care Hospital 3 241 14 3
100296 DOCTORS HOSPITAL MIAMI-DADE Acute Care Hospital 4 393 24 3
440003 SUMNER REGIONAL MEDICAL CENTER SUMNER Acute Care Hospital 4 680 35 2
501339 WHIDBEY GENERAL HOSPITAL ISLAND Critical Access Hospital 3 389 29 3
050116 NORTHRIDGE MEDICAL CENTER LOS ANGELES Acute Care Hospital 3 1110 20 2

The dim() function prints the dimensions (rows x columns) of the data frame:

dim(cms_data)
Output
[1] 15  8

This information is available at the environment pane in the top right panel as the number of observations (rows) and variables (columns).

The nrow() function prints the number of rows while ncol() prints the number of columns:

nrow(cms_data)
ncol(cms_data)
Output
[1] 15
[1] 8

The View() function gives a spreadsheet-like view of the data frame:

View(cms_data)

By clicking the object on the environment tab also gives a spreadsheet-like view of the object:

The glimpse()function (dplyr package) displays a compact summary of the data frame, showing you key details such as the data types of each column, the first few values, and the total number of observations.

glimpse(cms_data)
Output
Rows: 15
Columns: 8
$ ID               <chr> "050424", "140103", "100051", "040062", "440048", "45…
$ `Facility Name`  <chr> "SCRIPPS GREEN HOSPITAL", "ST BERNARD HOSPITAL", "SOU…
$ County           <chr> "SAN DIEGO", "COOK", "LAKE", "SEBASTIAN", "SHELBY", "…
$ `Hospital Type`  <chr> "Acute Care Hospital", "Acute Care Hospital", "Acute …
$ `Star Rating`    <dbl> 4, 1, 2, 3, 2, 3, 3, 4, 4, 2, 3, 4, 4, 3, 3
$ `No of Surveys`  <dbl> 3110, 264, 1382, 2506, 1799, 1379, 114, 247, 619, 171…
$ `Response Rate`  <dbl> 41, 6, 20, 35, 18, 24, 22, 34, 32, 21, 14, 24, 35, 29…
$ `Overall Rating` <dbl> 5, 2, 2, 3, 2, 3, 3, 3, 3, 2, 3, 3, 2, 3, 2

The head() function prints the top 6 rows of a data frame:

head(cms_data)
Output
ID Facility Name County Hospital Type Star Rating No of Surveys Response Rate Overall Rating
050424 SCRIPPS GREEN HOSPITAL SAN DIEGO Acute Care Hospital 4 3110 41 5
140103 ST BERNARD HOSPITAL COOK Acute Care Hospital 1 264 6 2
100051 SOUTH LAKE HOSPITAL LAKE Acute Care Hospital 2 1382 20 2
040062 MERCY HOSPITAL FORT SMITH SEBASTIAN Acute Care Hospital 3 2506 35 3
440048 BAPTIST MEMORIAL HOSPITAL SHELBY Acute Care Hospital 2 1799 18 2
450011 ST JOSEPH REGIONAL HEALTH CENTER BRAZOS Acute Care Hospital 3 1379 24 3

Similarly, the tail() function prints the bottom 6 rows of the data frame:

tail(cms_data)
Output
ID Facility Name County Hospital Type Star Rating No of Surveys Response Rate Overall Rating
110215 PIEDMONT FAYETTE HOSPITAL FAYETTE Acute Care Hospital 2 1714 21 2
050704 MISSION COMMUNITY HOSPITAL LOS ANGELES Acute Care Hospital 3 241 14 3
100296 DOCTORS HOSPITAL MIAMI-DADE Acute Care Hospital 4 393 24 3
440003 SUMNER REGIONAL MEDICAL CENTER SUMNER Acute Care Hospital 4 680 35 2
501339 WHIDBEY GENERAL HOSPITAL ISLAND Critical Access Hospital 3 389 29 3
050116 NORTHRIDGE MEDICAL CENTER LOS ANGELES Acute Care Hospital 3 1110 20 2

The colnames() function displays all the column names:

colnames(cms_data)
[1] "ID"             "Facility Name"  "County"         "Hospital Type" 
[5] "Star Rating"    "No of Surveys"  "Response Rate"  "Overall Rating"

The $ symbol allows access to individual columns. To display ‘Hospital Type’ column:

cms_data$Hospital Type
Error: <text>:1:19: unexpected symbol
1: cms_data$Hospital Type
                      ^

Since the column names contain spaces, they need to be enclosed within `` (backticks) for R to interpret it as a single variable:

cms_data$`Hospital Type` # or cms_data$"Hospital Type"

It is a good practice to rename all the columns with spaces into a format that R can interpret. Conventionally, _ (underscore) is used to separate words in column names and variables instead of (space). We can rename a single column using the rename() function:

cms_data <- rename(cms_data, Hospital_Type = "Hospital Type")
head(cms_data)
Output
ID Facility Name County Hospital_Type Star Rating No of Surveys Response Rate Overall Rating
050424 SCRIPPS GREEN HOSPITAL SAN DIEGO Acute Care Hospital 4 3110 41 5
140103 ST BERNARD HOSPITAL COOK Acute Care Hospital 1 264 6 2
100051 SOUTH LAKE HOSPITAL LAKE Acute Care Hospital 2 1382 20 2
040062 MERCY HOSPITAL FORT SMITH SEBASTIAN Acute Care Hospital 3 2506 35 3
440048 BAPTIST MEMORIAL HOSPITAL SHELBY Acute Care Hospital 2 1799 18 2
450011 ST JOSEPH REGIONAL HEALTH CENTER BRAZOS Acute Care Hospital 3 1379 24 3

Note: To modify the original cms_data object, it is essential to assign the renamed object to cms_data. Otherwise, a copy of the cms_data object is changed.

Or by directly assigning a new column name:

colnames(cms_data)[2] <- "Facility_Name"
head(cms_data)
Output
ID Facility_Name County Hospital_Type Star Rating No of Surveys Response Rate Overall Rating
050424 SCRIPPS GREEN HOSPITAL SAN DIEGO Acute Care Hospital 4 3110 41 5
140103 ST BERNARD HOSPITAL COOK Acute Care Hospital 1 264 6 2
100051 SOUTH LAKE HOSPITAL LAKE Acute Care Hospital 2 1382 20 2
040062 MERCY HOSPITAL FORT SMITH SEBASTIAN Acute Care Hospital 3 2506 35 3
440048 BAPTIST MEMORIAL HOSPITAL SHELBY Acute Care Hospital 2 1799 18 2
450011 ST JOSEPH REGIONAL HEALTH CENTER BRAZOS Acute Care Hospital 3 1379 24 3

Using the above methods to rename multiple columns just to replace space with underscore can be laborious. A quick way to replace all spaces in the column names with underscore is shown below.

Option 1:

# by string substituting _ in place of ' '(space)
colnames(cms_data) <- gsub(" ", "_", colnames(cms_data))
colnames(cms_data)
Output
[1] "ID"             "Facility_Name"  "County"         "Hospital_Type" 
[5] "Star_Rating"    "No_of_Surveys"  "Response_Rate"  "Overall_Rating"

Here the gsub() function replace space with underscores in the column names of the cms_data. This new object needs to be assigned (<-) to colnames(cms_data) in order for the changes to be saved in the cms_data object.

Option 2:

Here we use the janitor package which contains functions to clean data. You first need to install and load the library before using it.

# using janitor package
library(janitor) # remember to install janitor: install.packages("janitor")
cms_data <- cms_data |> clean_names()
cms_data
Output
id facility_name county hospital_type star_rating no_of_surveys response_rate overall_rating
050424 SCRIPPS GREEN HOSPITAL SAN DIEGO Acute Care Hospital 4 3110 41 5
140103 ST BERNARD HOSPITAL COOK Acute Care Hospital 1 264 6 2
100051 SOUTH LAKE HOSPITAL LAKE Acute Care Hospital 2 1382 20 2
040062 MERCY HOSPITAL FORT SMITH SEBASTIAN Acute Care Hospital 3 2506 35 3
440048 BAPTIST MEMORIAL HOSPITAL SHELBY Acute Care Hospital 2 1799 18 2
450011 ST JOSEPH REGIONAL HEALTH CENTER BRAZOS Acute Care Hospital 3 1379 24 3
151317 GREENE COUNTY GENERAL HOSPITAL GREENE Critical Access Hospital 3 114 22 3
061327 SOUTHWEST MEMORIAL HOSPITAL MONTEZUMA Critical Access Hospital 4 247 34 3
490057 SENTARA GENERAL HOSPITAL VIRGINIA BEACH Acute Care Hospital 4 619 32 3
110215 PIEDMONT FAYETTE HOSPITAL FAYETTE Acute Care Hospital 2 1714 21 2
050704 MISSION COMMUNITY HOSPITAL LOS ANGELES Acute Care Hospital 3 241 14 3
100296 DOCTORS HOSPITAL MIAMI-DADE Acute Care Hospital 4 393 24 3
440003 SUMNER REGIONAL MEDICAL CENTER SUMNER Acute Care Hospital 4 680 35 2
501339 WHIDBEY GENERAL HOSPITAL ISLAND Critical Access Hospital 3 389 29 3
050116 NORTHRIDGE MEDICAL CENTER LOS ANGELES Acute Care Hospital 3 1110 20 2

Here we are sending the cms_data data frame into the function clean_names() which replaces replaces spaces with underscore in column names. Note that it also convert all characters in column names to lower case.

The default format is snake case styled as snake_case. You can specify other formats like camel case, title case.

Option 3:

Additionally, we can use the make.names() function in base R which replaces ’ ’ (spaces) with . (dots).

# using make.names() function
colnames(cms_data) <-  make.names(colnames(cms_data))
colnames(cms_data) 
Output
[1] "ID"             "Facility_Name"  "County"         "Hospital_Type" 
[5] "Star.Rating"    "No.of.Surveys"  "Response.Rate"  "Overall.Rating"

The str() function shows the structure of the data:

str(cms_data)
Output
spc_tbl_ [15 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ID            : chr [1:15] "050424" "140103" "100051" "040062" ...
 $ Facility_Name : chr [1:15] "SCRIPPS GREEN HOSPITAL" "ST BERNARD HOSPITAL" "SOUTH LAKE HOSPITAL" "MERCY HOSPITAL FORT SMITH" ...
 $ County        : chr [1:15] "SAN DIEGO" "COOK" "LAKE" "SEBASTIAN" ...
 $ Hospital_Type : chr [1:15] "Acute Care Hospital" "Acute Care Hospital" "Acute Care Hospital" "Acute Care Hospital" ...
 $ Star Rating   : num [1:15] 4 1 2 3 2 3 3 4 4 2 ...
 $ No of Surveys : num [1:15] 3110 264 1382 2506 1799 ...
 $ Response Rate : num [1:15] 41 6 20 35 18 24 22 34 32 21 ...
 $ Overall Rating: num [1:15] 5 2 2 3 2 3 3 3 3 2 ...
 - attr(*, "spec")=
  .. cols(
  ..   ID = col_character(),
  ..   `Facility Name` = col_character(),
  ..   County = col_character(),
  ..   `Hospital Type` = col_character(),
  ..   `Star Rating` = col_double(),
  ..   `No of Surveys` = col_double(),
  ..   `Response Rate` = col_double(),
  ..   `Overall Rating` = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

The summary() function generates summary statistics:

summary(cms_data)
Output
      ID            Facility_Name         County          Hospital_Type     
 Length:15          Length:15          Length:15          Length:15         
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
  Star Rating  No of Surveys    Response Rate Overall Rating 
 Min.   :1.0   Min.   : 114.0   Min.   : 6    Min.   :2.000  
 1st Qu.:2.5   1st Qu.: 326.5   1st Qu.:20    1st Qu.:2.000  
 Median :3.0   Median : 680.0   Median :24    Median :3.000  
 Mean   :3.0   Mean   :1063.1   Mean   :25    Mean   :2.733  
 3rd Qu.:4.0   3rd Qu.:1548.0   3rd Qu.:33    3rd Qu.:3.000  
 Max.   :4.0   Max.   :3110.0   Max.   :41    Max.   :5.000  

A statitical overview can be obtained using the skim() function in skimr package:

library(skimr)
skim(cms_data)
Output
Data summary
Name cms_data
Number of rows 15
Number of columns 8
_______________________
Column type frequency:
character 4
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ID 0 1 6 6 0 15 0
Facility_Name 0 1 16 32 0 15 0
County 0 1 4 14 0 14 0
Hospital_Type 0 1 19 24 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Star Rating 0 1 3.00 0.93 1 2.5 3 4 4 ▁▃▁▇▇
No of Surveys 0 1 1063.13 909.05 114 326.5 680 1548 3110 ▇▁▃▁▁
Response Rate 0 1 25.00 9.30 6 20.0 24 33 41 ▂▇▇▆▆
Overall Rating 0 1 2.73 0.80 2 2.0 3 3 5 ▆▇▁▁▁

Writing Data to a File

Writing data to a file is a fundamental operation in programming and data analysis. It involves taking data from within a program or environment and storing it in a file on a disk for later use or sharing. This section explains the basics of writing a data file using the readr package.

The write_csv() and write_tsv() functions are part of the readr package, which is designed for writing delimited files like CSV (comma-separated values) and TSV (tab-separated values). These functions are used to write data frames into CSV and TSV files, respectively.

We first provide the variable name of the data frame followed by the file name (ideally including the full folder location).

To write a CSV file:

# on Mac:
write_csv(cms_data, "~/Desktop/cms_data.csv")

# on Windows
write_csv(cms_data, "C:/Users/srajapaksa/Desktop/cms_data.csv")

To write a TSV file:

# on Mac:
write_tsv(cms_data, "~/Desktop/cms_data.csv")

# on Windows
write_tsv(cms_data, "C:/Users/srajapaksa/Desktop/cms_data.csv")

Step 2: Tidy Data

Tidy data is a structured and organized format for presenting data that follows a simple convention: variables are placed in columns, observations are placed in rows and values are placed in cells. This standardized arrangement makes it easy to work with and analyze data efficiently. The principles of tidy data, popularized by Hadley Wickham, are designed to promote consistency and ease of use in data analysis.

This is the second step in the tidyverse workflow.

Let’s take a look at some examples.

Data is often entered in a wide format, where each row typically represents a site, subject, or patient, and there are multiple observation variables containing the same type of data.

For instance, consider the AirPassengers dataset. It contains information on monthly airline passenger numbers from 1949 to 1960. In this dataset, each row corresponds to a single year, and the columns represent each month from January to December.

AirPassengers
Output
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432

Wide format is intuitive for data entry. But it is less so for data analysis. Consider calculating the monthly mean; where would you place it? Would it be another row?

Data needs to be reshaped to conform to the tidy data structure. It involves using four primary verbs (or pairs of opposites):

  • Convert columns into rows (pivot_longer()).
  • Convert rows into columns (pivot_wider()).
  • Convert a character column into multiple columns (separate_wider_delim() and separate_wider_position()).
  • Combine multiple character columns into a single column (unite()).

First, load the tidyr package. Since you have already installed the tidyverse, you should be able to load it directly as follows (otherwise install it using the command install.packages("tidyverse") if necessary):

library(tidyverse)

Converting data from wide to long format

First read the counts file called GSE60450_normalized_data.csv that is in a folder called data (i.e. the path to the file should be data/GSE60450_normalized_data.csv).

counts <- read_csv("data/GSE60450_normalized_data.csv")
head(counts)
Output
X gene_symbol GSM1480291 GSM1480292 GSM1480293 GSM1480294 GSM1480295 GSM1480296 GSM1480297 GSM1480298 GSM1480299 GSM1480300 GSM1480301 GSM1480302
ENSMUSG00000031805 Jak3 82.22869 81.18604 36.13808 36.62751 12.35039 11.85235 10.59006 14.88337 7.57182 7.05763 3.16982 4.25451
ENSMUSG00000033909 Usp36 88.22360 94.59098 76.11531 63.73863 27.10872 24.87050 95.67017 100.73912 78.07470 59.35009 36.36491 40.37057
ENSMUSG00000065524 Mir135a-2 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
ENSMUSG00000030374 Strn4 94.26324 85.05204 59.54855 54.30684 24.42892 24.90936 125.00263 102.99289 110.64124 97.98130 77.96883 105.70093
ENSMUSG00000098547 Mir6921 0.08948 0.08404 0.00000 0.08459 0.03884 0.07772 0.00000 0.08505 0.07726 0.08255 0.04403 0.04727
ENSMUSG00000035629 Rubcn 37.84846 37.73554 21.32397 27.70326 30.13807 36.33424 33.84822 36.91075 27.46715 31.36722 56.00020 64.33764

To transform this table from a wide format to a long format, we use the pivot_longer() function. It’s important to note that this function does not create tidy data as it duplicates rows. However, the output in ‘long format’ from pivot_longer() is often necessary for ggplot, where each aesthetic or facet category must be a single column of values and for left_join(), which will be introduced later.

This operation will convert multiple columns with counts for each sample into a single column containing all the expression values, as illustrated in the image below.

The pivot_longer() function takes three arguments:

  1. cols = : a vector indicating the names of the columns to be converted into labels in long form.
  2. names_to = : a name or vector of names for the new column(s) containing the labels from the specified columns.
  3. **values_to =* *: a name for the new column containing the values corresponding to the specified columns.

It’s important to note that when using pivot_wider(), the new column names need to be enclosed in quotes.

seqdata <- counts |> 
  pivot_longer(cols = starts_with("GSM"), 
               names_to = "Sample", 
               values_to = "Count")

The cols = starts_with("GSM") command returns a vector of columns whose names starts with “GSM”. pivot_longer() will then transform the those columns into two new columns, denoted as “Sample” and “Count.” The parameter names_to = "Sample" indicates that the new column containing the specified columns (defined by cols) should be named “Sample,” while values_to = "Count" specifies that the new column containing the values should be named “Count.”

seqdata
Output
X gene_symbol Sample Count
ENSMUSG00000031805 Jak3 GSM1480291 82.22869
ENSMUSG00000031805 Jak3 GSM1480292 81.18604
ENSMUSG00000031805 Jak3 GSM1480293 36.13808
ENSMUSG00000031805 Jak3 GSM1480294 36.62751
ENSMUSG00000031805 Jak3 GSM1480295 12.35039
ENSMUSG00000031805 Jak3 GSM1480296 11.85235
ENSMUSG00000031805 Jak3 GSM1480297 10.59006
ENSMUSG00000031805 Jak3 GSM1480298 14.88337
ENSMUSG00000031805 Jak3 GSM1480299 7.57182
ENSMUSG00000031805 Jak3 GSM1480300 7.05763
ENSMUSG00000031805 Jak3 GSM1480301 3.16982
ENSMUSG00000031805 Jak3 GSM1480302 4.25451
ENSMUSG00000033909 Usp36 GSM1480291 88.22360
ENSMUSG00000033909 Usp36 GSM1480292 94.59098
ENSMUSG00000033909 Usp36 GSM1480293 76.11531
ENSMUSG00000033909 Usp36 GSM1480294 63.73863
ENSMUSG00000033909 Usp36 GSM1480295 27.10872
ENSMUSG00000033909 Usp36 GSM1480296 24.87050
ENSMUSG00000033909 Usp36 GSM1480297 95.67017
ENSMUSG00000033909 Usp36 GSM1480298 100.73912
ENSMUSG00000033909 Usp36 GSM1480299 78.07470
ENSMUSG00000033909 Usp36 GSM1480300 59.35009
ENSMUSG00000033909 Usp36 GSM1480301 36.36491
ENSMUSG00000033909 Usp36 GSM1480302 40.37057
ENSMUSG00000065524 Mir135a-2 GSM1480291 0.00000
ENSMUSG00000065524 Mir135a-2 GSM1480292 0.00000
ENSMUSG00000065524 Mir135a-2 GSM1480293 0.00000
ENSMUSG00000065524 Mir135a-2 GSM1480294 0.00000
ENSMUSG00000065524 Mir135a-2 GSM1480295 0.00000
ENSMUSG00000065524 Mir135a-2 GSM1480296 0.00000
ENSMUSG00000065524 Mir135a-2 GSM1480297 0.00000
ENSMUSG00000065524 Mir135a-2 GSM1480298 0.00000
ENSMUSG00000065524 Mir135a-2 GSM1480299 0.00000
ENSMUSG00000065524 Mir135a-2 GSM1480300 0.00000
ENSMUSG00000065524 Mir135a-2 GSM1480301 0.00000
ENSMUSG00000065524 Mir135a-2 GSM1480302 0.00000
ENSMUSG00000030374 Strn4 GSM1480291 94.26324
ENSMUSG00000030374 Strn4 GSM1480292 85.05204
ENSMUSG00000030374 Strn4 GSM1480293 59.54855
ENSMUSG00000030374 Strn4 GSM1480294 54.30684
ENSMUSG00000030374 Strn4 GSM1480295 24.42892
ENSMUSG00000030374 Strn4 GSM1480296 24.90936
ENSMUSG00000030374 Strn4 GSM1480297 125.00263
ENSMUSG00000030374 Strn4 GSM1480298 102.99289
ENSMUSG00000030374 Strn4 GSM1480299 110.64124
ENSMUSG00000030374 Strn4 GSM1480300 97.98130
ENSMUSG00000030374 Strn4 GSM1480301 77.96883
ENSMUSG00000030374 Strn4 GSM1480302 105.70093
ENSMUSG00000098547 Mir6921 GSM1480291 0.08948
ENSMUSG00000098547 Mir6921 GSM1480292 0.08404
ENSMUSG00000098547 Mir6921 GSM1480293 0.00000
ENSMUSG00000098547 Mir6921 GSM1480294 0.08459
ENSMUSG00000098547 Mir6921 GSM1480295 0.03884
ENSMUSG00000098547 Mir6921 GSM1480296 0.07772
ENSMUSG00000098547 Mir6921 GSM1480297 0.00000
ENSMUSG00000098547 Mir6921 GSM1480298 0.08505
ENSMUSG00000098547 Mir6921 GSM1480299 0.07726
ENSMUSG00000098547 Mir6921 GSM1480300 0.08255
ENSMUSG00000098547 Mir6921 GSM1480301 0.04403
ENSMUSG00000098547 Mir6921 GSM1480302 0.04727
ENSMUSG00000035629 Rubcn GSM1480291 37.84846
ENSMUSG00000035629 Rubcn GSM1480292 37.73554
ENSMUSG00000035629 Rubcn GSM1480293 21.32397
ENSMUSG00000035629 Rubcn GSM1480294 27.70326
ENSMUSG00000035629 Rubcn GSM1480295 30.13807
ENSMUSG00000035629 Rubcn GSM1480296 36.33424
ENSMUSG00000035629 Rubcn GSM1480297 33.84822
ENSMUSG00000035629 Rubcn GSM1480298 36.91075
ENSMUSG00000035629 Rubcn GSM1480299 27.46715
ENSMUSG00000035629 Rubcn GSM1480300 31.36722
ENSMUSG00000035629 Rubcn GSM1480301 56.00020
ENSMUSG00000035629 Rubcn GSM1480302 64.33764
ENSMUSG00000094053 Scgb2b7 GSM1480291 0.00000
ENSMUSG00000094053 Scgb2b7 GSM1480292 0.00000
ENSMUSG00000094053 Scgb2b7 GSM1480293 0.00000
ENSMUSG00000094053 Scgb2b7 GSM1480294 0.00000
ENSMUSG00000094053 Scgb2b7 GSM1480295 0.00000
ENSMUSG00000094053 Scgb2b7 GSM1480296 0.00000
ENSMUSG00000094053 Scgb2b7 GSM1480297 0.00000
ENSMUSG00000094053 Scgb2b7 GSM1480298 0.00000
ENSMUSG00000094053 Scgb2b7 GSM1480299 0.00000
ENSMUSG00000094053 Scgb2b7 GSM1480300 0.00000
ENSMUSG00000094053 Scgb2b7 GSM1480301 0.00000
ENSMUSG00000094053 Scgb2b7 GSM1480302 0.00000
ENSMUSG00000055491 Pprc1 GSM1480291 194.87930
ENSMUSG00000055491 Pprc1 GSM1480292 165.14551
ENSMUSG00000055491 Pprc1 GSM1480293 159.82546
ENSMUSG00000055491 Pprc1 GSM1480294 167.99592
ENSMUSG00000055491 Pprc1 GSM1480295 75.69470
ENSMUSG00000055491 Pprc1 GSM1480296 82.42238
ENSMUSG00000055491 Pprc1 GSM1480297 281.97525
ENSMUSG00000055491 Pprc1 GSM1480298 241.83343
ENSMUSG00000055491 Pprc1 GSM1480299 227.84988
ENSMUSG00000055491 Pprc1 GSM1480300 232.81909
ENSMUSG00000055491 Pprc1 GSM1480301 166.85593
ENSMUSG00000055491 Pprc1 GSM1480302 150.56237
ENSMUSG00000053080 2700081O15Rik GSM1480291 65.94401
ENSMUSG00000053080 2700081O15Rik GSM1480292 78.91686
ENSMUSG00000053080 2700081O15Rik GSM1480293 44.77617
ENSMUSG00000053080 2700081O15Rik GSM1480294 35.69702
ENSMUSG00000053080 2700081O15Rik GSM1480295 8.66081
ENSMUSG00000053080 2700081O15Rik GSM1480296 9.98706
ENSMUSG00000053080 2700081O15Rik GSM1480297 62.62120
ENSMUSG00000053080 2700081O15Rik GSM1480298 61.82975
ENSMUSG00000053080 2700081O15Rik GSM1480299 49.98944
ENSMUSG00000053080 2700081O15Rik GSM1480300 42.01557
ENSMUSG00000053080 2700081O15Rik GSM1480301 37.24541
ENSMUSG00000053080 2700081O15Rik GSM1480302 40.08694
ENSMUSG00000039715 Wdr34 GSM1480291 18.20842
ENSMUSG00000039715 Wdr34 GSM1480292 18.95181
ENSMUSG00000039715 Wdr34 GSM1480293 11.60091
ENSMUSG00000039715 Wdr34 GSM1480294 8.41671
ENSMUSG00000039715 Wdr34 GSM1480295 3.72842
ENSMUSG00000039715 Wdr34 GSM1480296 3.69171
ENSMUSG00000039715 Wdr34 GSM1480297 17.98312
ENSMUSG00000039715 Wdr34 GSM1480298 13.69270
ENSMUSG00000039715 Wdr34 GSM1480299 9.00119
ENSMUSG00000039715 Wdr34 GSM1480300 9.69908
ENSMUSG00000039715 Wdr34 GSM1480301 10.38997
ENSMUSG00000039715 Wdr34 GSM1480302 11.10900
ENSMUSG00000033475 Tomm6 GSM1480291 53.46207
ENSMUSG00000033475 Tomm6 GSM1480292 49.03939
ENSMUSG00000033475 Tomm6 GSM1480293 50.82700
ENSMUSG00000033475 Tomm6 GSM1480294 49.95045
ENSMUSG00000033475 Tomm6 GSM1480295 47.49852
ENSMUSG00000033475 Tomm6 GSM1480296 46.04928
ENSMUSG00000033475 Tomm6 GSM1480297 51.79137
ENSMUSG00000033475 Tomm6 GSM1480298 49.49782
ENSMUSG00000033475 Tomm6 GSM1480299 52.84819
ENSMUSG00000033475 Tomm6 GSM1480300 47.00956
ENSMUSG00000033475 Tomm6 GSM1480301 37.11334
ENSMUSG00000033475 Tomm6 GSM1480302 35.97424
ENSMUSG00000026283 Ing5 GSM1480291 42.45649
ENSMUSG00000026283 Ing5 GSM1480292 45.46754
ENSMUSG00000026283 Ing5 GSM1480293 22.03338
ENSMUSG00000026283 Ing5 GSM1480294 22.45867
ENSMUSG00000026283 Ing5 GSM1480295 8.73849
ENSMUSG00000026283 Ing5 GSM1480296 11.11400
ENSMUSG00000026283 Ing5 GSM1480297 50.07299
ENSMUSG00000026283 Ing5 GSM1480298 50.94364
ENSMUSG00000026283 Ing5 GSM1480299 45.04458
ENSMUSG00000026283 Ing5 GSM1480300 36.89776
ENSMUSG00000026283 Ing5 GSM1480301 44.28946
ENSMUSG00000026283 Ing5 GSM1480302 46.70506
ENSMUSG00000037331 Larp1 GSM1480291 260.82331
ENSMUSG00000037331 Larp1 GSM1480292 246.03739
ENSMUSG00000037331 Larp1 GSM1480293 230.05686
ENSMUSG00000037331 Larp1 GSM1480294 220.23030
ENSMUSG00000037331 Larp1 GSM1480295 128.98005
ENSMUSG00000037331 Larp1 GSM1480296 120.11673
ENSMUSG00000037331 Larp1 GSM1480297 147.58143
ENSMUSG00000037331 Larp1 GSM1480298 128.20957
ENSMUSG00000037331 Larp1 GSM1480299 168.08661
ENSMUSG00000037331 Larp1 GSM1480300 133.35197
ENSMUSG00000037331 Larp1 GSM1480301 100.02551
ENSMUSG00000037331 Larp1 GSM1480302 98.32645
ENSMUSG00000074489 Bglap3 GSM1480291 185.84220
ENSMUSG00000074489 Bglap3 GSM1480292 188.25747
ENSMUSG00000074489 Bglap3 GSM1480293 148.76704
ENSMUSG00000074489 Bglap3 GSM1480294 163.80871
ENSMUSG00000074489 Bglap3 GSM1480295 51.34346
ENSMUSG00000074489 Bglap3 GSM1480296 48.84721
ENSMUSG00000074489 Bglap3 GSM1480297 52.83040
ENSMUSG00000074489 Bglap3 GSM1480298 84.66509
ENSMUSG00000074489 Bglap3 GSM1480299 323.54064
ENSMUSG00000074489 Bglap3 GSM1480300 485.86178
ENSMUSG00000074489 Bglap3 GSM1480301 197.27742
ENSMUSG00000074489 Bglap3 GSM1480302 163.89317
ENSMUSG00000038246 Fam50b GSM1480291 0.04474
ENSMUSG00000038246 Fam50b GSM1480292 0.00000
ENSMUSG00000038246 Fam50b GSM1480293 0.00000
ENSMUSG00000038246 Fam50b GSM1480294 0.00000
ENSMUSG00000038246 Fam50b GSM1480295 0.03884
ENSMUSG00000038246 Fam50b GSM1480296 0.03886
ENSMUSG00000038246 Fam50b GSM1480297 0.07992
ENSMUSG00000038246 Fam50b GSM1480298 0.00000
ENSMUSG00000038246 Fam50b GSM1480299 0.00000
ENSMUSG00000038246 Fam50b GSM1480300 0.00000
ENSMUSG00000038246 Fam50b GSM1480301 0.00000
ENSMUSG00000038246 Fam50b GSM1480302 0.09454
ENSMUSG00000066189 Cacng3 GSM1480291 0.00000
ENSMUSG00000066189 Cacng3 GSM1480292 0.00000
ENSMUSG00000066189 Cacng3 GSM1480293 0.00000
ENSMUSG00000066189 Cacng3 GSM1480294 0.04230
ENSMUSG00000066189 Cacng3 GSM1480295 0.00000
ENSMUSG00000066189 Cacng3 GSM1480296 0.00000
ENSMUSG00000066189 Cacng3 GSM1480297 0.00000
ENSMUSG00000066189 Cacng3 GSM1480298 0.00000
ENSMUSG00000066189 Cacng3 GSM1480299 0.03863
ENSMUSG00000066189 Cacng3 GSM1480300 0.00000
ENSMUSG00000066189 Cacng3 GSM1480301 0.00000
ENSMUSG00000066189 Cacng3 GSM1480302 0.00000
ENSMUSG00000005611 Mrvi1 GSM1480291 0.67107
ENSMUSG00000005611 Mrvi1 GSM1480292 0.92448
ENSMUSG00000005611 Mrvi1 GSM1480293 0.29211
ENSMUSG00000005611 Mrvi1 GSM1480294 0.21148
ENSMUSG00000005611 Mrvi1 GSM1480295 0.38838
ENSMUSG00000005611 Mrvi1 GSM1480296 0.03886
ENSMUSG00000005611 Mrvi1 GSM1480297 102.26398
ENSMUSG00000005611 Mrvi1 GSM1480298 81.05056
ENSMUSG00000005611 Mrvi1 GSM1480299 115.31568
ENSMUSG00000005611 Mrvi1 GSM1480300 138.38724
ENSMUSG00000005611 Mrvi1 GSM1480301 177.99434
ENSMUSG00000005611 Mrvi1 GSM1480302 185.16572
ENSMUSG00000064299 4921528I07Rik GSM1480291 0.00000
ENSMUSG00000064299 4921528I07Rik GSM1480292 0.00000
ENSMUSG00000064299 4921528I07Rik GSM1480293 0.00000
ENSMUSG00000064299 4921528I07Rik GSM1480294 0.00000
ENSMUSG00000064299 4921528I07Rik GSM1480295 0.00000
ENSMUSG00000064299 4921528I07Rik GSM1480296 0.00000
ENSMUSG00000064299 4921528I07Rik GSM1480297 0.03996
ENSMUSG00000064299 4921528I07Rik GSM1480298 0.00000
ENSMUSG00000064299 4921528I07Rik GSM1480299 0.00000
ENSMUSG00000064299 4921528I07Rik GSM1480300 0.00000
ENSMUSG00000064299 4921528I07Rik GSM1480301 0.00000
ENSMUSG00000064299 4921528I07Rik GSM1480302 0.00000
ENSMUSG00000028174 Rpe65 GSM1480291 0.00000
ENSMUSG00000028174 Rpe65 GSM1480292 0.00000
ENSMUSG00000028174 Rpe65 GSM1480293 0.00000
ENSMUSG00000028174 Rpe65 GSM1480294 0.00000
ENSMUSG00000028174 Rpe65 GSM1480295 0.00000
ENSMUSG00000028174 Rpe65 GSM1480296 0.00000
ENSMUSG00000028174 Rpe65 GSM1480297 0.00000
ENSMUSG00000028174 Rpe65 GSM1480298 0.00000
ENSMUSG00000028174 Rpe65 GSM1480299 0.00000
ENSMUSG00000028174 Rpe65 GSM1480300 0.00000
ENSMUSG00000028174 Rpe65 GSM1480301 0.00000
ENSMUSG00000028174 Rpe65 GSM1480302 0.00000
ENSMUSG00000024902 Mrpl11 GSM1480291 27.60343
ENSMUSG00000024902 Mrpl11 GSM1480292 23.91038
ENSMUSG00000024902 Mrpl11 GSM1480293 21.86646
ENSMUSG00000024902 Mrpl11 GSM1480294 18.56753
ENSMUSG00000024902 Mrpl11 GSM1480295 11.84550
ENSMUSG00000024902 Mrpl11 GSM1480296 11.30830
ENSMUSG00000024902 Mrpl11 GSM1480297 14.02683
ENSMUSG00000024902 Mrpl11 GSM1480298 11.56650
ENSMUSG00000024902 Mrpl11 GSM1480299 12.13036
ENSMUSG00000024902 Mrpl11 GSM1480300 12.21671
ENSMUSG00000024902 Mrpl11 GSM1480301 10.21387
ENSMUSG00000024902 Mrpl11 GSM1480302 10.06901

Alternatively, we could achieve the same outcome by specifying a column range using the following command:

seqdata <- counts |> 
  pivot_longer(cols = GSM1480291:GSM1480302, 
               names_to = "Sample", 
               values_to = "Count")

We can also specify the columns we don’t want to reformat, and pivot_longer() will then reformat all the columns except those. To achieve this, we place a minus sign (“-”) in front of the column names that we wish to exclude. This is a commonly used approach with pivot_longer(), as it can be more convenient to exclude columns we don’t need rather than explicitly include the ones we want.

seqdata <- counts |> 
  pivot_longer(cols = -c(X, gene_symbol), 
               names_to = "Sample", 
               values_to = "Count")

Converting data from long to wide format

First, read the annotation file called GSE60450_annotation.csv (the path to the file should be data/GSE60450_annotation.csv).

annot <- read_csv("data/GSE60450_annotation.csv")
head(annot)
Output
ENSEMBL Type Annotation
ENSMUSG00000031805 SYMBOL Jak3
ENSMUSG00000031805 GENENAME Janus kinase 3
ENSMUSG00000033909 SYMBOL Usp36
ENSMUSG00000033909 GENENAME ubiquitin specific peptidase 36
ENSMUSG00000065524 SYMBOL Mir135a-2
ENSMUSG00000065524 GENENAME microRNA 135a-2

To transform this table so that it conforms to the tidy principles, we use the pivot_wider() function.

This operation will convert multiple rows with type and annotation into columns containing the Symbol and Gene_name, as illustrated in the image below.

The pivot_wider() function takes two arguments:

  1. names_from = : a name or a vector of names of column(s) containing the labels that will be transformed into the new column names.
  2. values_from = : a name or a vector of names of column(s) containing the values that will fill the new columns.

In our scenario, to reshape the annot data frame, we will use the column names Type and Annotation:

annot_tidy <- annot |> 
  pivot_wider(names_from = Type, 
              values_from = Annotation)

The above operation changes the ‘shape’ of the dataframe from a longer format (more rows) to a wider format (more columns). While the original table consists of 40 rows, using pivot_wider() results in only 20 rows. This reduction is due to the de-duplication of rows during the creation of new columns.

annot_tidy
Output
ENSEMBL SYMBOL GENENAME
ENSMUSG00000031805 Jak3 Janus kinase 3
ENSMUSG00000033909 Usp36 ubiquitin specific peptidase 36
ENSMUSG00000065524 Mir135a-2 microRNA 135a-2
ENSMUSG00000030374 Strn4 striatin, calmodulin binding protein 4
ENSMUSG00000098547 Mir6921 microRNA 6921
ENSMUSG00000035629 Rubcn RUN domain and cysteine-rich domain containing, Beclin 1-interacting protein
ENSMUSG00000094053 Scgb2b7 secretoglobin, family 2B, member 7
ENSMUSG00000055491 Pprc1 peroxisome proliferative activated receptor, gamma, coactivator-related 1
ENSMUSG00000053080 Zfta zinc finger translocation associated
ENSMUSG00000039715 Dync2i2 dynein 2 intermediate chain 2
ENSMUSG00000033475 Tomm6 translocase of outer mitochondrial membrane 6
ENSMUSG00000026283 Ing5 inhibitor of growth family, member 5
ENSMUSG00000037331 Larp1 La ribonucleoprotein 1, translational regulator
ENSMUSG00000074489 Bglap3 bone gamma-carboxyglutamate protein 3
ENSMUSG00000038246 Fam50b family with sequence similarity 50, member B
ENSMUSG00000066189 Cacng3 calcium channel, voltage-dependent, gamma subunit 3
ENSMUSG00000005611 Irag1 inositol 1,4,5-triphosphate receptor associated 1
ENSMUSG00000064299 4921528I07Rik RIKEN cDNA 4921528I07 gene
ENSMUSG00000028174 Rpe65 retinal pigment epithelium 65
ENSMUSG00000024902 Mrpl11 mitochondrial ribosomal protein L11

It’s important to note that since we only have two distinct labels in the Type column, we are essentially replacing the existing two columns with just two new columns. Consequently, the shape of the output doesn’t technically become wider than the input data frame. However, when there are more than two unique labels in the names_from column, the output will indeed become wider compared to the input.

Separating Columns

First, read the metadata file called GSE60450_metadata.csv (the path to the file should be data/GSE60450_metadata.csv).

metadata <- read_csv("data/GSE60450_metadata.csv")
head(metadata)
Output
gene_id characteristics
GSM1480291 mammary gland;luminal cells;virgin
GSM1480292 mammary gland;luminal cells;virgin
GSM1480293 mammary gland;luminal cells;18.5 day pregnancy
GSM1480294 mammary gland;luminal cells;18.5 day pregnancy
GSM1480295 mammary gland;luminal cells;2 day lactation
GSM1480296 mammary gland;luminal cells;2 day lactation

To transform this table so that it conforms to the tidy principles, we use the separate_wider_position()/separate_wider_delim() function. This operation will separate characteristic column into 3 separate columns containing the tissue_type, immunophenotype and development_stage, as illustrated in the image below.

The separate_wider_delim() function takes three arguments:

  1. cols = : a name or a vector of names of the column(s) that requires separation into multiple columns.
  2. delim = : delimeter (or separator) between values. This is same as the delim = in read_delim().
  3. names = : a vector containing column names for the the new columns.

To separate characteristic column in the metadata data frame into three separate columns based on the delimeter ; (semi colon), we can use the separate_wider_delim() function:

metadata_lform <- metadata |> 
  separate_wider_delim(cols = characteristics, 
                       delim =";",
                       names = c("tissue_type", "immunophenotype", "development_stage"))
metadata_lform
Output
gene_id tissue_type immunophenotype development_stage
GSM1480291 mammary gland luminal cells virgin
GSM1480292 mammary gland luminal cells virgin
GSM1480293 mammary gland luminal cells 18.5 day pregnancy
GSM1480294 mammary gland luminal cells 18.5 day pregnancy
GSM1480295 mammary gland luminal cells 2 day lactation
GSM1480296 mammary gland luminal cells 2 day lactation
GSM1480297 mammary gland basal cells virgin
GSM1480298 mammary gland basal cells virgin
GSM1480299 mammary gland basal cells 18.5 day pregnancy
GSM1480300 mammary gland basal cells 18.5 day pregnancy
GSM1480301 mammary gland basal cells 2 day lactation
GSM1480302 mammary gland basal cells 2 day lactation

The separate_wider_position() function splits at fixed widths and takes two arguments:

  1. cols = : a name or a vector of names of the column(s) that requires separation into multiple columns.
  2. widths = : a named vector containing numbers where the names become the new column names and values specify the column widths.

For instance, we can divide the gene_id column into three separate columns to evaluate the functionality of this operation (this is provided purely as an example):

metadata_lform |> 
  separate_wider_position(cols = gene_id, 
                          widths = c(code = 3, prefix = 4, id = 3))
Output
code prefix id tissue_type immunophenotype development_stage
GSM 1480 291 mammary gland luminal cells virgin
GSM 1480 292 mammary gland luminal cells virgin
GSM 1480 293 mammary gland luminal cells 18.5 day pregnancy
GSM 1480 294 mammary gland luminal cells 18.5 day pregnancy
GSM 1480 295 mammary gland luminal cells 2 day lactation
GSM 1480 296 mammary gland luminal cells 2 day lactation
GSM 1480 297 mammary gland basal cells virgin
GSM 1480 298 mammary gland basal cells virgin
GSM 1480 299 mammary gland basal cells 18.5 day pregnancy
GSM 1480 300 mammary gland basal cells 18.5 day pregnancy
GSM 1480 301 mammary gland basal cells 2 day lactation
GSM 1480 302 mammary gland basal cells 2 day lactation

Uniting Columns

The unite() function is the complement of separate(). Therefore, let’s revert what we did in the previous section to combine multiple columns to a single column as illustrated in the image below.

The unite() function takes three arguments:

  1. col = : name of the new column that will contain the united values.
  2. … = : a vector containing column names to unite.
  3. sep = : delimeter (or separator) this is same as the delim = in read_delim(). If we don’t specify a separator to insert between the combined values, they will be separated by _ (underscores).

To separate characteristic column in the metadata data frame into three separate columns:

metadata_lform |> 
  unite(col = characteristics, 
        tissue_type, immunophenotype, development_stage,
        sep = ",")
Output
gene_id characteristics
GSM1480291 mammary gland,luminal cells,virgin
GSM1480292 mammary gland,luminal cells,virgin
GSM1480293 mammary gland,luminal cells,18.5 day pregnancy
GSM1480294 mammary gland,luminal cells,18.5 day pregnancy
GSM1480295 mammary gland,luminal cells,2 day lactation
GSM1480296 mammary gland,luminal cells,2 day lactation
GSM1480297 mammary gland,basal cells,virgin
GSM1480298 mammary gland,basal cells,virgin
GSM1480299 mammary gland,basal cells,18.5 day pregnancy
GSM1480300 mammary gland,basal cells,18.5 day pregnancy
GSM1480301 mammary gland,basal cells,2 day lactation
GSM1480302 mammary gland,basal cells,2 day lactation

Missing Values

A value can be missing in one of two possible ways:

  • Explicitly, meaning it is flagged with NA.
  • Implicitly, implying that it is just not present in the data.

Let’s illustrate this idea with a very simple data frame:

covid_vac <- data.frame(
    year = c(2020, 2020, 2021, 2021, 2021,  2023, 2023,
             2023, 2024, 2024), 
    vaccine_type = c("Pfizer", "Moderna", "Pfizer", "Moderna", "Novavax", 
                   "Pfizer", "Moderna", 
                  "Novavax", "Moderna", NA),
    count = c(0, 3, 63, 88, 51,
               38, 19,
              5, 9, 7)
)
covid_vac
Output
year vaccine_type count
2020 Pfizer 0
2020 Moderna 3
2021 Pfizer 63
2021 Moderna 88
2021 Novavax 51
2023 Pfizer 38
2023 Moderna 19
2023 Novavax 5
2024 Moderna 9
2024 NA 7

In this dataset, we identify two occurrences of missing values:

  1. The vaccine_type in 2024 with a count of 7 is explicitly missing, denoted by the presence of NA in the cell where its value should be.
  2. The counts for the Novavax vaccine in 2020 and Pfizer, Novavax vaccines in 2024, are implicitly missing, as they do not appear in the dataset at all.

is.na()

To identify missing values we can use is.na() function which returns a logical vector with TRUE in the element locations that contain missing values represented by NA.

is.na(covid_vac)
Output
       year vaccine_type count
 [1,] FALSE        FALSE FALSE
 [2,] FALSE        FALSE FALSE
 [3,] FALSE        FALSE FALSE
 [4,] FALSE        FALSE FALSE
 [5,] FALSE        FALSE FALSE
 [6,] FALSE        FALSE FALSE
 [7,] FALSE        FALSE FALSE
 [8,] FALSE        FALSE FALSE
 [9,] FALSE        FALSE FALSE
[10,] FALSE         TRUE FALSE
is.na(covid_vac$vaccine_type)
Output
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

To identify the location or the number of NAs we can use the which() and sum() functions:

which(is.na(covid_vac))
sum(is.na(covid_vac))
Output
[1] 20
[1] 1

na.omit()

To omit all rows containing missing values, we can use na.omit() function in base R:

na.omit(covid_vac)
Output
year vaccine_type count
2020 Pfizer 0
2020 Moderna 3
2021 Pfizer 63
2021 Moderna 88
2021 Novavax 51
2023 Pfizer 38
2023 Moderna 19
2023 Novavax 5
2024 Moderna 9

complete()

We can use the complete() function to make our dataset more complete or to make missing values explicit in tidy data:

covid_vac |> complete(year, vaccine_type)
Output
year vaccine_type count
2020 Moderna 3
2020 Novavax NA
2020 Pfizer 0
2020 NA NA
2021 Moderna 88
2021 Novavax 51
2021 Pfizer 63
2021 NA NA
2023 Moderna 19
2023 Novavax 5
2023 Pfizer 38
2023 NA NA
2024 Moderna 9
2024 Novavax NA
2024 Pfizer NA
2024 NA 7

This function add missing values for potential combinations of year and vaccine_type. One problem is that R assumes NA in status as one of the combinations. To fix this, we can specify the labels of status to be considered as follows:

covid_vac |> complete(year, vaccine_type = c("Pfizer", "Moderna", "Novavax"))
Output
year vaccine_type count
2020 Moderna 3
2020 Novavax NA
2020 Pfizer 0
2021 Moderna 88
2021 Novavax 51
2021 Pfizer 63
2023 Moderna 19
2023 Novavax 5
2023 Pfizer 38
2024 Moderna 9
2024 Novavax NA
2024 Pfizer NA
2024 NA 7

We can use the fill argument to assign the fill value:

covid_vac |> complete(year, 
                      vaccine_type = c("Pfizer", "Moderna", "Novavax"),
                       fill = list(count = 0))
Output
year vaccine_type count
2020 Moderna 3
2020 Novavax 0
2020 Pfizer 0
2021 Moderna 88
2021 Novavax 51
2021 Pfizer 63
2023 Moderna 19
2023 Novavax 5
2023 Pfizer 38
2024 Moderna 9
2024 Novavax 0
2024 Pfizer 0
2024 NA 7

We can use the full_seq() function from tidyr to fill out the data frame with all years from 2020 to 2024 and assign vaccination types and count values of 0 to those years and for which there was no observation.

covid_vac |> complete(year = full_seq(year, period = 1), 
                      vaccine_type = c("Pfizer", "Moderna", "Novavax"),
                       fill = list(count = 0))
Output
year vaccine_type count
2020 Moderna 3
2020 Novavax 0
2020 Pfizer 0
2021 Moderna 88
2021 Novavax 51
2021 Pfizer 63
2022 Moderna 0
2022 Novavax 0
2022 Pfizer 0
2023 Moderna 19
2023 Novavax 5
2023 Pfizer 38
2024 Moderna 9
2024 Novavax 0
2024 Pfizer 0
2024 NA 7

fill()

The fill() function is used to fill missing values in a data frame, particularly within columns.

Let’s first make missing values in the covid_vac dataset explicit and assign it to a data frame named covid_vac_comp.

covid_vac_comp <-  covid_vac |> 
  complete(year = full_seq(year, period = 1), 
           vaccine_type = c("Pfizer", "Moderna", "Novavax"))

We can specify the direction to fill the missing values using the argument .direction. Remember to specify the list of columns to fill.

covid_vac_comp |> fill(count, .direction = "down")
Output
year vaccine_type count
2020 Moderna 3
2020 Novavax 3
2020 Pfizer 0
2021 Moderna 88
2021 Novavax 51
2021 Pfizer 63
2022 Moderna 63
2022 Novavax 63
2022 Pfizer 63
2023 Moderna 19
2023 Novavax 5
2023 Pfizer 38
2024 Moderna 9
2024 Novavax 9
2024 Pfizer 9
2024 NA 7

Similarly, we can fill upwards as follows:

covid_vac_comp |> fill(count, .direction = "up")
Output
year vaccine_type count
2020 Moderna 3
2020 Novavax 0
2020 Pfizer 0
2021 Moderna 88
2021 Novavax 51
2021 Pfizer 63
2022 Moderna 19
2022 Novavax 19
2022 Pfizer 19
2023 Moderna 19
2023 Novavax 5
2023 Pfizer 38
2024 Moderna 9
2024 Novavax 7
2024 Pfizer 7
2024 NA 7

Once the data is structured and organized according to tidy principles, we can begin manipulating and transforming it. The next section illustrates how this can be accomplished using the dplyr package from the tidyverse package suit.