Into the Tidyverse

Common tasks in working with data include actions like removing rows or columns, performing calculations, or adding new columns. This sort of operations is known as data manipulation. It is the process of cleaning, organizing, and transforming raw data into a more structured and usable format for analysis.

In this workshop, we’ll guide you through the process of data manipulation in R, starting with the tidyverse. The tidyverse is a collection of packages that align with a data science philosophy developed by Hadley Wickham and the RStudio team. Many users find it to be a more intuitive way to grasp R concepts. Although some tasks may be more straightforward in base-R, we’ll also highlight those aspects.

If you’ve already installed the tidyverse package (if not, you can do so by running the command: install.packages("tidyverse")), let’s proceed to load it into our R session:

library(tidyverse)

The following are key techniques outlined in Hadley Wickham and Garrett Grolemund’s book, R for Data Science:

This workshop focuses on:

Import: readr to import data.
Tidy: tidyr to organize rows of data into unique values.
Transform: dplyr to perform data manipulation, involving tasks such as subsetting by rows or columns, sorting, and joining.
Visualize: ggplot2 to create static plots, applying the principles of the grammar of graphics.

This workflow is of utmost importance. Instead of constructing analyses based on the unconventional format of your data, take measures to tidy up your data. Tidy data enables the use of number of analytical and visualization tools. It eliminates the need to develop ad-hoc methods to accommodate your data. This not only saves time but also enhances the clarity and comprehensibility of your work, benefiting both your collaborators and, most importantly, your future self.

Step 1: Importing data

First we need to import data into our R session. This can be achieved either by using datasets bundled with R packages or by importing external data into our workspace for data manipulation. This is the first step in the tidyverse workflow.

Preloaded Data in R Packages

A great way to learn data science tools is using the data provided by R packages. The data() function is a convenient way to explore and import pre-loaded datasets that come bundled with the R environment.

library(tidyverse)
data()

View Output

To import pre-loaded datasets bundled with a particular package (e.g., tidyr):

data(package = "tidyr")

View Output

When you load a package the pre-loaded datasets automatically imported into you R environment. Therefore you can access them directly as follows.

household

Output

family	dob_child1	dob_child2	name_child1	name_child2
1	1998-11-26	2000-01-29	Susan	Jose
2	1996-06-22	NA	Mark	NA
3	2002-07-11	2004-04-05	Sam	Seth
4	2004-10-10	2009-08-27	Craig	Khai
5	2000-12-05	2005-02-28	Parker	Gracie

?household

View Output

<!DOCTYPE html> R: Household data

household

R Documentation

Household data

Description

This dataset is based on an example in vignette(“datatable-reshape”, package = “data.table”)

Usage

household

Format

A data frame with 5 rows and 5 columns:

family: Family identifier
dob_child1: Date of birth of first child
dob_child2: Date of birth of second child
name_child1: Name of first child

name_child2: Name of second child

Reading the Data

At some point, you want to apply what you’ve learned to your own data. In this section, you’ll learn the basics of reading data files into R using the readr package. The goal of readr is to provide a fast and friendly way to read rectangular data from delimited files, such as comma-separated values (CSV) and tab-separated values (TSV). It is designed to parse many types of data.

We will use the read_csv() function from readr package to import a dataset. (See also read.csv() in base R.) CSV short for Comma Separated Values, is a text format commonly used to store tabular data. Conventionally the first line contains column headings.

The first argument of the read_csv() function takes the path to the file (or a web link). The following code will work if the cms_hospital_patient_satisfaction_2016_sampled.csv file is in the data/patient_satisfaction path.

# here the first argument is a path
cms_data <- read_csv("data/cms_hospital_patient_satisfaction.csv")

Output

Rows: 15 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): ID, Facility Name, County, Hospital Type
dbl (4): Star Rating, No of Surveys, Response Rate, Overall Rating

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# here the first argument is a web link
mtvcars <- read_csv("https://github.com/tidyverse/readr/raw/main/inst/extdata/mtcars.csv")

Output

Rows: 32 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

This command prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains).

Other file types

Once you’ve mastered read_csv(), using other functions in the readr package is straightforward. It’s just a matter of knowing which function to use:

read_csv2() reads semicolon-separated files. These use ; instead of , to separate fields and are common in countries that use , as the decimal marker or thousands seperator.
read_tsv()reads tab-delimited files.
read_delim() reads in files with any delimiter, attempting to automatically guess the delimiter if you don’t specify it.

Exploring the Data

In the previous section we imported a dataset that is bundled with the tidyr package, into a dataframe named household. In the next section we read in a CSV file and created a data frame named cms_data. This section demonstrates different ways to get to know these two data objects.

The class() function is used to identify the data type or data structure of an object (or variable):

class(household)
class(cms_data)

Output

[1] "tbl_df"     "tbl"        "data.frame"

[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

When the name of the object (data frame) is typed, the first few lines along with some information, such as the number of rows are displayed:

cms_data

Output

ID	Facility Name	County	Hospital Type	Star Rating	No of Surveys	Response Rate	Overall Rating
050424	SCRIPPS GREEN HOSPITAL	SAN DIEGO	Acute Care Hospital	4	3110	41	5
140103	ST BERNARD HOSPITAL	COOK	Acute Care Hospital	1	264	6	2
100051	SOUTH LAKE HOSPITAL	LAKE	Acute Care Hospital	2	1382	20	2
040062	MERCY HOSPITAL FORT SMITH	SEBASTIAN	Acute Care Hospital	3	2506	35	3
440048	BAPTIST MEMORIAL HOSPITAL	SHELBY	Acute Care Hospital	2	1799	18	2
450011	ST JOSEPH REGIONAL HEALTH CENTER	BRAZOS	Acute Care Hospital	3	1379	24	3
151317	GREENE COUNTY GENERAL HOSPITAL	GREENE	Critical Access Hospital	3	114	22	3
061327	SOUTHWEST MEMORIAL HOSPITAL	MONTEZUMA	Critical Access Hospital	4	247	34	3
490057	SENTARA GENERAL HOSPITAL	VIRGINIA BEACH	Acute Care Hospital	4	619	32	3
110215	PIEDMONT FAYETTE HOSPITAL	FAYETTE	Acute Care Hospital	2	1714	21	2
050704	MISSION COMMUNITY HOSPITAL	LOS ANGELES	Acute Care Hospital	3	241	14	3
100296	DOCTORS HOSPITAL	MIAMI-DADE	Acute Care Hospital	4	393	24	3
440003	SUMNER REGIONAL MEDICAL CENTER	SUMNER	Acute Care Hospital	4	680	35	2
501339	WHIDBEY GENERAL HOSPITAL	ISLAND	Critical Access Hospital	3	389	29	3
050116	NORTHRIDGE MEDICAL CENTER	LOS ANGELES	Acute Care Hospital	3	1110	20	2

The dim() function prints the dimensions (rows x columns) of the data frame:

dim(cms_data)

Output

[1] 15  8

This information is available at the environment pane in the top right panel as the number of observations (rows) and variables (columns).

The nrow() function prints the number of rows while ncol() prints the number of columns:

nrow(cms_data)
ncol(cms_data)

Output

[1] 15

[1] 8

The View() function gives a spreadsheet-like view of the data frame:

View(cms_data)

By clicking the object on the environment tab also gives a spreadsheet-like view of the object:

The glimpse()function (dplyr package) displays a compact summary of the data frame, showing you key details such as the data types of each column, the first few values, and the total number of observations.

glimpse(cms_data)

Output

Rows: 15
Columns: 8
$ ID               <chr> "050424", "140103", "100051", "040062", "440048", "45…
$ `Facility Name`  <chr> "SCRIPPS GREEN HOSPITAL", "ST BERNARD HOSPITAL", "SOU…
$ County           <chr> "SAN DIEGO", "COOK", "LAKE", "SEBASTIAN", "SHELBY", "…
$ `Hospital Type`  <chr> "Acute Care Hospital", "Acute Care Hospital", "Acute …
$ `Star Rating`    <dbl> 4, 1, 2, 3, 2, 3, 3, 4, 4, 2, 3, 4, 4, 3, 3
$ `No of Surveys`  <dbl> 3110, 264, 1382, 2506, 1799, 1379, 114, 247, 619, 171…
$ `Response Rate`  <dbl> 41, 6, 20, 35, 18, 24, 22, 34, 32, 21, 14, 24, 35, 29…
$ `Overall Rating` <dbl> 5, 2, 2, 3, 2, 3, 3, 3, 3, 2, 3, 3, 2, 3, 2

The head() function prints the top 6 rows of a data frame:

head(cms_data)

Output

ID	Facility Name	County	Hospital Type	Star Rating	No of Surveys	Response Rate	Overall Rating
050424	SCRIPPS GREEN HOSPITAL	SAN DIEGO	Acute Care Hospital	4	3110	41	5
140103	ST BERNARD HOSPITAL	COOK	Acute Care Hospital	1	264	6	2
100051	SOUTH LAKE HOSPITAL	LAKE	Acute Care Hospital	2	1382	20	2
040062	MERCY HOSPITAL FORT SMITH	SEBASTIAN	Acute Care Hospital	3	2506	35	3
440048	BAPTIST MEMORIAL HOSPITAL	SHELBY	Acute Care Hospital	2	1799	18	2
450011	ST JOSEPH REGIONAL HEALTH CENTER	BRAZOS	Acute Care Hospital	3	1379	24	3

Similarly, the tail() function prints the bottom 6 rows of the data frame:

tail(cms_data)

Output

ID	Facility Name	County	Hospital Type	Star Rating	No of Surveys	Response Rate	Overall Rating
110215	PIEDMONT FAYETTE HOSPITAL	FAYETTE	Acute Care Hospital	2	1714	21	2
050704	MISSION COMMUNITY HOSPITAL	LOS ANGELES	Acute Care Hospital	3	241	14	3
100296	DOCTORS HOSPITAL	MIAMI-DADE	Acute Care Hospital	4	393	24	3
440003	SUMNER REGIONAL MEDICAL CENTER	SUMNER	Acute Care Hospital	4	680	35	2
501339	WHIDBEY GENERAL HOSPITAL	ISLAND	Critical Access Hospital	3	389	29	3
050116	NORTHRIDGE MEDICAL CENTER	LOS ANGELES	Acute Care Hospital	3	1110	20	2

The colnames() function displays all the column names:

colnames(cms_data)

[1] "ID"             "Facility Name"  "County"         "Hospital Type" 
[5] "Star Rating"    "No of Surveys"  "Response Rate"  "Overall Rating"

The $ symbol allows access to individual columns. To display ‘Hospital Type’ column:

cms_data$Hospital Type

Error: <text>:1:19: unexpected symbol
1: cms_data$Hospital Type
                      ^

Since the column names contain spaces, they need to be enclosed within `` (backticks) for R to interpret it as a single variable:

cms_data$`Hospital Type` # or cms_data$"Hospital Type"

It is a good practice to rename all the columns with spaces into a format that R can interpret. Conventionally, _ (underscore) is used to separate words in column names and variables instead of (space). We can rename a single column using the rename() function:

cms_data <- rename(cms_data, Hospital_Type = "Hospital Type")
head(cms_data)

Output

ID	Facility Name	County	Hospital_Type	Star Rating	No of Surveys	Response Rate	Overall Rating
050424	SCRIPPS GREEN HOSPITAL	SAN DIEGO	Acute Care Hospital	4	3110	41	5
140103	ST BERNARD HOSPITAL	COOK	Acute Care Hospital	1	264	6	2
100051	SOUTH LAKE HOSPITAL	LAKE	Acute Care Hospital	2	1382	20	2
040062	MERCY HOSPITAL FORT SMITH	SEBASTIAN	Acute Care Hospital	3	2506	35	3
440048	BAPTIST MEMORIAL HOSPITAL	SHELBY	Acute Care Hospital	2	1799	18	2
450011	ST JOSEPH REGIONAL HEALTH CENTER	BRAZOS	Acute Care Hospital	3	1379	24	3

Note: To modify the original cms_data object, it is essential to assign the renamed object to cms_data. Otherwise, a copy of the cms_data object is changed.

Or by directly assigning a new column name:

colnames(cms_data)[2] <- "Facility_Name"
head(cms_data)

Output

ID	Facility_Name	County	Hospital_Type	Star Rating	No of Surveys	Response Rate	Overall Rating
050424	SCRIPPS GREEN HOSPITAL	SAN DIEGO	Acute Care Hospital	4	3110	41	5
140103	ST BERNARD HOSPITAL	COOK	Acute Care Hospital	1	264	6	2
100051	SOUTH LAKE HOSPITAL	LAKE	Acute Care Hospital	2	1382	20	2
040062	MERCY HOSPITAL FORT SMITH	SEBASTIAN	Acute Care Hospital	3	2506	35	3
440048	BAPTIST MEMORIAL HOSPITAL	SHELBY	Acute Care Hospital	2	1799	18	2
450011	ST JOSEPH REGIONAL HEALTH CENTER	BRAZOS	Acute Care Hospital	3	1379	24	3

Using the above methods to rename multiple columns just to replace space with underscore can be laborious. A quick way to replace all spaces in the column names with underscore is shown below.

Option 1:

# by string substituting _ in place of ' '(space)
colnames(cms_data) <- gsub(" ", "_", colnames(cms_data))
colnames(cms_data)

Output

[1] "ID"             "Facility_Name"  "County"         "Hospital_Type" 
[5] "Star_Rating"    "No_of_Surveys"  "Response_Rate"  "Overall_Rating"

Here the gsub() function replace space with underscores in the column names of the cms_data. This new object needs to be assigned (<-) to colnames(cms_data) in order for the changes to be saved in the cms_data object.

Option 2:

Here we use the janitor package which contains functions to clean data. You first need to install and load the library before using it.

# using janitor package
library(janitor) # remember to install janitor: install.packages("janitor")
cms_data <- cms_data |> clean_names()
cms_data

Output

id	facility_name	county	hospital_type	star_rating	no_of_surveys	response_rate	overall_rating
050424	SCRIPPS GREEN HOSPITAL	SAN DIEGO	Acute Care Hospital	4	3110	41	5
140103	ST BERNARD HOSPITAL	COOK	Acute Care Hospital	1	264	6	2
100051	SOUTH LAKE HOSPITAL	LAKE	Acute Care Hospital	2	1382	20	2
040062	MERCY HOSPITAL FORT SMITH	SEBASTIAN	Acute Care Hospital	3	2506	35	3
440048	BAPTIST MEMORIAL HOSPITAL	SHELBY	Acute Care Hospital	2	1799	18	2
450011	ST JOSEPH REGIONAL HEALTH CENTER	BRAZOS	Acute Care Hospital	3	1379	24	3
151317	GREENE COUNTY GENERAL HOSPITAL	GREENE	Critical Access Hospital	3	114	22	3
061327	SOUTHWEST MEMORIAL HOSPITAL	MONTEZUMA	Critical Access Hospital	4	247	34	3
490057	SENTARA GENERAL HOSPITAL	VIRGINIA BEACH	Acute Care Hospital	4	619	32	3
110215	PIEDMONT FAYETTE HOSPITAL	FAYETTE	Acute Care Hospital	2	1714	21	2
050704	MISSION COMMUNITY HOSPITAL	LOS ANGELES	Acute Care Hospital	3	241	14	3
100296	DOCTORS HOSPITAL	MIAMI-DADE	Acute Care Hospital	4	393	24	3
440003	SUMNER REGIONAL MEDICAL CENTER	SUMNER	Acute Care Hospital	4	680	35	2
501339	WHIDBEY GENERAL HOSPITAL	ISLAND	Critical Access Hospital	3	389	29	3
050116	NORTHRIDGE MEDICAL CENTER	LOS ANGELES	Acute Care Hospital	3	1110	20	2

Here we are sending the cms_data data frame into the function clean_names() which replaces replaces spaces with underscore in column names. Note that it also convert all characters in column names to lower case.

The default format is snake case styled as snake_case. You can specify other formats like camel case, title case.

Option 3:

Additionally, we can use the make.names() function in base R which replaces ’ ’ (spaces) with . (dots).

# using make.names() function
colnames(cms_data) <-  make.names(colnames(cms_data))
colnames(cms_data)

Output

[1] "ID"             "Facility_Name"  "County"         "Hospital_Type" 
[5] "Star.Rating"    "No.of.Surveys"  "Response.Rate"  "Overall.Rating"

The str() function shows the structure of the data:

str(cms_data)

Output

spc_tbl_ [15 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ID            : chr [1:15] "050424" "140103" "100051" "040062" ...
 $ Facility_Name : chr [1:15] "SCRIPPS GREEN HOSPITAL" "ST BERNARD HOSPITAL" "SOUTH LAKE HOSPITAL" "MERCY HOSPITAL FORT SMITH" ...
 $ County        : chr [1:15] "SAN DIEGO" "COOK" "LAKE" "SEBASTIAN" ...
 $ Hospital_Type : chr [1:15] "Acute Care Hospital" "Acute Care Hospital" "Acute Care Hospital" "Acute Care Hospital" ...
 $ Star Rating   : num [1:15] 4 1 2 3 2 3 3 4 4 2 ...
 $ No of Surveys : num [1:15] 3110 264 1382 2506 1799 ...
 $ Response Rate : num [1:15] 41 6 20 35 18 24 22 34 32 21 ...
 $ Overall Rating: num [1:15] 5 2 2 3 2 3 3 3 3 2 ...
 - attr(*, "spec")=
  .. cols(
  ..   ID = col_character(),
  ..   `Facility Name` = col_character(),
  ..   County = col_character(),
  ..   `Hospital Type` = col_character(),
  ..   `Star Rating` = col_double(),
  ..   `No of Surveys` = col_double(),
  ..   `Response Rate` = col_double(),
  ..   `Overall Rating` = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

The summary() function generates summary statistics:

summary(cms_data)

Output

      ID            Facility_Name         County          Hospital_Type     
 Length:15          Length:15          Length:15          Length:15         
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
  Star Rating  No of Surveys    Response Rate Overall Rating 
 Min.   :1.0   Min.   : 114.0   Min.   : 6    Min.   :2.000  
 1st Qu.:2.5   1st Qu.: 326.5   1st Qu.:20    1st Qu.:2.000  
 Median :3.0   Median : 680.0   Median :24    Median :3.000  
 Mean   :3.0   Mean   :1063.1   Mean   :25    Mean   :2.733  
 3rd Qu.:4.0   3rd Qu.:1548.0   3rd Qu.:33    3rd Qu.:3.000  
 Max.   :4.0   Max.   :3110.0   Max.   :41    Max.   :5.000

A statitical overview can be obtained using the skim() function in skimr package:

library(skimr)
skim(cms_data)

Output

Data summary
Name	cms_data
Number of rows	15
Number of columns	8
_______________________
Column type frequency:
character	4
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
ID	1	6	6	15
Facility_Name	1	16	32	15
County	1	4	14	14
Hospital_Type	1	19	24	2

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Star Rating	1	3.00	0.93	1	2.5	3	4	4	▁▃▁▇▇
No of Surveys	1	1063.13	909.05	114	326.5	680	1548	3110	▇▁▃▁▁
Response Rate	1	25.00	9.30	6	20.0	24	33	41	▂▇▇▆▆
Overall Rating	1	2.73	0.80	2	2.0	3	3	5	▆▇▁▁▁

Writing Data to a File

Writing data to a file is a fundamental operation in programming and data analysis. It involves taking data from within a program or environment and storing it in a file on a disk for later use or sharing. This section explains the basics of writing a data file using the readr package.

The write_csv() and write_tsv() functions are part of the readr package, which is designed for writing delimited files like CSV (comma-separated values) and TSV (tab-separated values). These functions are used to write data frames into CSV and TSV files, respectively.

We first provide the variable name of the data frame followed by the file name (ideally including the full folder location).

To write a CSV file:

# on Mac:
write_csv(cms_data, "~/Desktop/cms_data.csv")

# on Windows
write_csv(cms_data, "C:/Users/srajapaksa/Desktop/cms_data.csv")

To write a TSV file:

# on Mac:
write_tsv(cms_data, "~/Desktop/cms_data.csv")

# on Windows
write_tsv(cms_data, "C:/Users/srajapaksa/Desktop/cms_data.csv")

Step 2: Tidy Data

Tidy data is a structured and organized format for presenting data that follows a simple convention: variables are placed in columns, observations are placed in rows and values are placed in cells. This standardized arrangement makes it easy to work with and analyze data efficiently. The principles of tidy data, popularized by Hadley Wickham, are designed to promote consistency and ease of use in data analysis.

This is the second step in the tidyverse workflow.

Let’s take a look at some examples.

Data is often entered in a wide format, where each row typically represents a site, subject, or patient, and there are multiple observation variables containing the same type of data.

For instance, consider the AirPassengers dataset. It contains information on monthly airline passenger numbers from 1949 to 1960. In this dataset, each row corresponds to a single year, and the columns represent each month from January to December.

AirPassengers

Output

     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432

Wide format is intuitive for data entry. But it is less so for data analysis. Consider calculating the monthly mean; where would you place it? Would it be another row?

Data needs to be reshaped to conform to the tidy data structure. It involves using four primary verbs (or pairs of opposites):

Convert columns into rows (pivot_longer()).
Convert rows into columns (pivot_wider()).
Convert a character column into multiple columns (separate_wider_delim() and separate_wider_position()).
Combine multiple character columns into a single column (unite()).

First, load the tidyr package. Since you have already installed the tidyverse, you should be able to load it directly as follows (otherwise install it using the command install.packages("tidyverse") if necessary):

library(tidyverse)

Converting data from wide to long format

First read the counts file called GSE60450_normalized_data.csv that is in a folder called data (i.e. the path to the file should be data/GSE60450_normalized_data.csv).

counts <- read_csv("data/GSE60450_normalized_data.csv")
head(counts)

Output

X	gene_symbol	GSM1480291	GSM1480292	GSM1480293	GSM1480294	GSM1480295	GSM1480296	GSM1480297	GSM1480298	GSM1480299	GSM1480300	GSM1480301	GSM1480302
ENSMUSG00000031805	Jak3	82.22869	81.18604	36.13808	36.62751	12.35039	11.85235	10.59006	14.88337	7.57182	7.05763	3.16982	4.25451
ENSMUSG00000033909	Usp36	88.22360	94.59098	76.11531	63.73863	27.10872	24.87050	95.67017	100.73912	78.07470	59.35009	36.36491	40.37057
ENSMUSG00000065524	Mir135a-2	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000
ENSMUSG00000030374	Strn4	94.26324	85.05204	59.54855	54.30684	24.42892	24.90936	125.00263	102.99289	110.64124	97.98130	77.96883	105.70093
ENSMUSG00000098547	Mir6921	0.08948	0.08404	0.00000	0.08459	0.03884	0.07772	0.00000	0.08505	0.07726	0.08255	0.04403	0.04727
ENSMUSG00000035629	Rubcn	37.84846	37.73554	21.32397	27.70326	30.13807	36.33424	33.84822	36.91075	27.46715	31.36722	56.00020	64.33764

To transform this table from a wide format to a long format, we use the pivot_longer() function. It’s important to note that this function does not create tidy data as it duplicates rows. However, the output in ‘long format’ from pivot_longer() is often necessary for ggplot, where each aesthetic or facet category must be a single column of values and for left_join(), which will be introduced later.

This operation will convert multiple columns with counts for each sample into a single column containing all the expression values, as illustrated in the image below.

The pivot_longer() function takes three arguments:

cols = : a vector indicating the names of the columns to be converted into labels in long form.
names_to = : a name or vector of names for the new column(s) containing the labels from the specified columns.
**values_to =* *: a name for the new column containing the values corresponding to the specified columns.

It’s important to note that when using pivot_wider(), the new column names need to be enclosed in quotes.

seqdata <- counts |> 
  pivot_longer(cols = starts_with("GSM"), 
               names_to = "Sample", 
               values_to = "Count")

The cols = starts_with("GSM") command returns a vector of columns whose names starts with “GSM”. pivot_longer() will then transform the those columns into two new columns, denoted as “Sample” and “Count.” The parameter names_to = "Sample" indicates that the new column containing the specified columns (defined by cols) should be named “Sample,” while values_to = "Count" specifies that the new column containing the values should be named “Count.”

seqdata

Output

X	gene_symbol	Sample	Count
ENSMUSG00000031805	Jak3	GSM1480291	82.22869
ENSMUSG00000031805	Jak3	GSM1480292	81.18604
ENSMUSG00000031805	Jak3	GSM1480293	36.13808
ENSMUSG00000031805	Jak3	GSM1480294	36.62751
ENSMUSG00000031805	Jak3	GSM1480295	12.35039
ENSMUSG00000031805	Jak3	GSM1480296	11.85235
ENSMUSG00000031805	Jak3	GSM1480297	10.59006
ENSMUSG00000031805	Jak3	GSM1480298	14.88337
ENSMUSG00000031805	Jak3	GSM1480299	7.57182
ENSMUSG00000031805	Jak3	GSM1480300	7.05763
ENSMUSG00000031805	Jak3	GSM1480301	3.16982
ENSMUSG00000031805	Jak3	GSM1480302	4.25451
ENSMUSG00000033909	Usp36	GSM1480291	88.22360
ENSMUSG00000033909	Usp36	GSM1480292	94.59098
ENSMUSG00000033909	Usp36	GSM1480293	76.11531
ENSMUSG00000033909	Usp36	GSM1480294	63.73863
ENSMUSG00000033909	Usp36	GSM1480295	27.10872
ENSMUSG00000033909	Usp36	GSM1480296	24.87050
ENSMUSG00000033909	Usp36	GSM1480297	95.67017
ENSMUSG00000033909	Usp36	GSM1480298	100.73912
ENSMUSG00000033909	Usp36	GSM1480299	78.07470
ENSMUSG00000033909	Usp36	GSM1480300	59.35009
ENSMUSG00000033909	Usp36	GSM1480301	36.36491
ENSMUSG00000033909	Usp36	GSM1480302	40.37057
ENSMUSG00000065524	Mir135a-2	GSM1480291	0.00000
ENSMUSG00000065524	Mir135a-2	GSM1480292	0.00000
ENSMUSG00000065524	Mir135a-2	GSM1480293	0.00000
ENSMUSG00000065524	Mir135a-2	GSM1480294	0.00000
ENSMUSG00000065524	Mir135a-2	GSM1480295	0.00000
ENSMUSG00000065524	Mir135a-2	GSM1480296	0.00000
ENSMUSG00000065524	Mir135a-2	GSM1480297	0.00000
ENSMUSG00000065524	Mir135a-2	GSM1480298	0.00000
ENSMUSG00000065524	Mir135a-2	GSM1480299	0.00000
ENSMUSG00000065524	Mir135a-2	GSM1480300	0.00000
ENSMUSG00000065524	Mir135a-2	GSM1480301	0.00000
ENSMUSG00000065524	Mir135a-2	GSM1480302	0.00000
ENSMUSG00000030374	Strn4	GSM1480291	94.26324
ENSMUSG00000030374	Strn4	GSM1480292	85.05204
ENSMUSG00000030374	Strn4	GSM1480293	59.54855
ENSMUSG00000030374	Strn4	GSM1480294	54.30684
ENSMUSG00000030374	Strn4	GSM1480295	24.42892
ENSMUSG00000030374	Strn4	GSM1480296	24.90936
ENSMUSG00000030374	Strn4	GSM1480297	125.00263
ENSMUSG00000030374	Strn4	GSM1480298	102.99289
ENSMUSG00000030374	Strn4	GSM1480299	110.64124
ENSMUSG00000030374	Strn4	GSM1480300	97.98130
ENSMUSG00000030374	Strn4	GSM1480301	77.96883
ENSMUSG00000030374	Strn4	GSM1480302	105.70093
ENSMUSG00000098547	Mir6921	GSM1480291	0.08948
ENSMUSG00000098547	Mir6921	GSM1480292	0.08404
ENSMUSG00000098547	Mir6921	GSM1480293	0.00000
ENSMUSG00000098547	Mir6921	GSM1480294	0.08459
ENSMUSG00000098547	Mir6921	GSM1480295	0.03884
ENSMUSG00000098547	Mir6921	GSM1480296	0.07772
ENSMUSG00000098547	Mir6921	GSM1480297	0.00000
ENSMUSG00000098547	Mir6921	GSM1480298	0.08505
ENSMUSG00000098547	Mir6921	GSM1480299	0.07726
ENSMUSG00000098547	Mir6921	GSM1480300	0.08255
ENSMUSG00000098547	Mir6921	GSM1480301	0.04403
ENSMUSG00000098547	Mir6921	GSM1480302	0.04727
ENSMUSG00000035629	Rubcn	GSM1480291	37.84846
ENSMUSG00000035629	Rubcn	GSM1480292	37.73554
ENSMUSG00000035629	Rubcn	GSM1480293	21.32397
ENSMUSG00000035629	Rubcn	GSM1480294	27.70326
ENSMUSG00000035629	Rubcn	GSM1480295	30.13807
ENSMUSG00000035629	Rubcn	GSM1480296	36.33424
ENSMUSG00000035629	Rubcn	GSM1480297	33.84822
ENSMUSG00000035629	Rubcn	GSM1480298	36.91075
ENSMUSG00000035629	Rubcn	GSM1480299	27.46715
ENSMUSG00000035629	Rubcn	GSM1480300	31.36722
ENSMUSG00000035629	Rubcn	GSM1480301	56.00020
ENSMUSG00000035629	Rubcn	GSM1480302	64.33764
ENSMUSG00000094053	Scgb2b7	GSM1480291	0.00000
ENSMUSG00000094053	Scgb2b7	GSM1480292	0.00000
ENSMUSG00000094053	Scgb2b7	GSM1480293	0.00000
ENSMUSG00000094053	Scgb2b7	GSM1480294	0.00000
ENSMUSG00000094053	Scgb2b7	GSM1480295	0.00000
ENSMUSG00000094053	Scgb2b7	GSM1480296	0.00000
ENSMUSG00000094053	Scgb2b7	GSM1480297	0.00000
ENSMUSG00000094053	Scgb2b7	GSM1480298	0.00000
ENSMUSG00000094053	Scgb2b7	GSM1480299	0.00000
ENSMUSG00000094053	Scgb2b7	GSM1480300	0.00000
ENSMUSG00000094053	Scgb2b7	GSM1480301	0.00000
ENSMUSG00000094053	Scgb2b7	GSM1480302	0.00000
ENSMUSG00000055491	Pprc1	GSM1480291	194.87930
ENSMUSG00000055491	Pprc1	GSM1480292	165.14551
ENSMUSG00000055491	Pprc1	GSM1480293	159.82546
ENSMUSG00000055491	Pprc1	GSM1480294	167.99592
ENSMUSG00000055491	Pprc1	GSM1480295	75.69470
ENSMUSG00000055491	Pprc1	GSM1480296	82.42238
ENSMUSG00000055491	Pprc1	GSM1480297	281.97525
ENSMUSG00000055491	Pprc1	GSM1480298	241.83343
ENSMUSG00000055491	Pprc1	GSM1480299	227.84988
ENSMUSG00000055491	Pprc1	GSM1480300	232.81909
ENSMUSG00000055491	Pprc1	GSM1480301	166.85593
ENSMUSG00000055491	Pprc1	GSM1480302	150.56237
ENSMUSG00000053080	2700081O15Rik	GSM1480291	65.94401
ENSMUSG00000053080	2700081O15Rik	GSM1480292	78.91686
ENSMUSG00000053080	2700081O15Rik	GSM1480293	44.77617
ENSMUSG00000053080	2700081O15Rik	GSM1480294	35.69702
ENSMUSG00000053080	2700081O15Rik	GSM1480295	8.66081
ENSMUSG00000053080	2700081O15Rik	GSM1480296	9.98706
ENSMUSG00000053080	2700081O15Rik	GSM1480297	62.62120
ENSMUSG00000053080	2700081O15Rik	GSM1480298	61.82975
ENSMUSG00000053080	2700081O15Rik	GSM1480299	49.98944
ENSMUSG00000053080	2700081O15Rik	GSM1480300	42.01557
ENSMUSG00000053080	2700081O15Rik	GSM1480301	37.24541
ENSMUSG00000053080	2700081O15Rik	GSM1480302	40.08694
ENSMUSG00000039715	Wdr34	GSM1480291	18.20842
ENSMUSG00000039715	Wdr34	GSM1480292	18.95181
ENSMUSG00000039715	Wdr34	GSM1480293	11.60091
ENSMUSG00000039715	Wdr34	GSM1480294	8.41671
ENSMUSG00000039715	Wdr34	GSM1480295	3.72842
ENSMUSG00000039715	Wdr34	GSM1480296	3.69171
ENSMUSG00000039715	Wdr34	GSM1480297	17.98312
ENSMUSG00000039715	Wdr34	GSM1480298	13.69270
ENSMUSG00000039715	Wdr34	GSM1480299	9.00119
ENSMUSG00000039715	Wdr34	GSM1480300	9.69908
ENSMUSG00000039715	Wdr34	GSM1480301	10.38997
ENSMUSG00000039715	Wdr34	GSM1480302	11.10900
ENSMUSG00000033475	Tomm6	GSM1480291	53.46207
ENSMUSG00000033475	Tomm6	GSM1480292	49.03939
ENSMUSG00000033475	Tomm6	GSM1480293	50.82700
ENSMUSG00000033475	Tomm6	GSM1480294	49.95045
ENSMUSG00000033475	Tomm6	GSM1480295	47.49852
ENSMUSG00000033475	Tomm6	GSM1480296	46.04928
ENSMUSG00000033475	Tomm6	GSM1480297	51.79137
ENSMUSG00000033475	Tomm6	GSM1480298	49.49782
ENSMUSG00000033475	Tomm6	GSM1480299	52.84819
ENSMUSG00000033475	Tomm6	GSM1480300	47.00956
ENSMUSG00000033475	Tomm6	GSM1480301	37.11334
ENSMUSG00000033475	Tomm6	GSM1480302	35.97424
ENSMUSG00000026283	Ing5	GSM1480291	42.45649
ENSMUSG00000026283	Ing5	GSM1480292	45.46754
ENSMUSG00000026283	Ing5	GSM1480293	22.03338
ENSMUSG00000026283	Ing5	GSM1480294	22.45867
ENSMUSG00000026283	Ing5	GSM1480295	8.73849
ENSMUSG00000026283	Ing5	GSM1480296	11.11400
ENSMUSG00000026283	Ing5	GSM1480297	50.07299
ENSMUSG00000026283	Ing5	GSM1480298	50.94364
ENSMUSG00000026283	Ing5	GSM1480299	45.04458
ENSMUSG00000026283	Ing5	GSM1480300	36.89776
ENSMUSG00000026283	Ing5	GSM1480301	44.28946
ENSMUSG00000026283	Ing5	GSM1480302	46.70506
ENSMUSG00000037331	Larp1	GSM1480291	260.82331
ENSMUSG00000037331	Larp1	GSM1480292	246.03739
ENSMUSG00000037331	Larp1	GSM1480293	230.05686
ENSMUSG00000037331	Larp1	GSM1480294	220.23030
ENSMUSG00000037331	Larp1	GSM1480295	128.98005
ENSMUSG00000037331	Larp1	GSM1480296	120.11673
ENSMUSG00000037331	Larp1	GSM1480297	147.58143
ENSMUSG00000037331	Larp1	GSM1480298	128.20957
ENSMUSG00000037331	Larp1	GSM1480299	168.08661
ENSMUSG00000037331	Larp1	GSM1480300	133.35197
ENSMUSG00000037331	Larp1	GSM1480301	100.02551
ENSMUSG00000037331	Larp1	GSM1480302	98.32645
ENSMUSG00000074489	Bglap3	GSM1480291	185.84220
ENSMUSG00000074489	Bglap3	GSM1480292	188.25747
ENSMUSG00000074489	Bglap3	GSM1480293	148.76704
ENSMUSG00000074489	Bglap3	GSM1480294	163.80871
ENSMUSG00000074489	Bglap3	GSM1480295	51.34346
ENSMUSG00000074489	Bglap3	GSM1480296	48.84721
ENSMUSG00000074489	Bglap3	GSM1480297	52.83040
ENSMUSG00000074489	Bglap3	GSM1480298	84.66509
ENSMUSG00000074489	Bglap3	GSM1480299	323.54064
ENSMUSG00000074489	Bglap3	GSM1480300	485.86178
ENSMUSG00000074489	Bglap3	GSM1480301	197.27742
ENSMUSG00000074489	Bglap3	GSM1480302	163.89317
ENSMUSG00000038246	Fam50b	GSM1480291	0.04474
ENSMUSG00000038246	Fam50b	GSM1480292	0.00000
ENSMUSG00000038246	Fam50b	GSM1480293	0.00000
ENSMUSG00000038246	Fam50b	GSM1480294	0.00000
ENSMUSG00000038246	Fam50b	GSM1480295	0.03884
ENSMUSG00000038246	Fam50b	GSM1480296	0.03886
ENSMUSG00000038246	Fam50b	GSM1480297	0.07992
ENSMUSG00000038246	Fam50b	GSM1480298	0.00000
ENSMUSG00000038246	Fam50b	GSM1480299	0.00000
ENSMUSG00000038246	Fam50b	GSM1480300	0.00000
ENSMUSG00000038246	Fam50b	GSM1480301	0.00000
ENSMUSG00000038246	Fam50b	GSM1480302	0.09454
ENSMUSG00000066189	Cacng3	GSM1480291	0.00000
ENSMUSG00000066189	Cacng3	GSM1480292	0.00000
ENSMUSG00000066189	Cacng3	GSM1480293	0.00000
ENSMUSG00000066189	Cacng3	GSM1480294	0.04230
ENSMUSG00000066189	Cacng3	GSM1480295	0.00000
ENSMUSG00000066189	Cacng3	GSM1480296	0.00000
ENSMUSG00000066189	Cacng3	GSM1480297	0.00000
ENSMUSG00000066189	Cacng3	GSM1480298	0.00000
ENSMUSG00000066189	Cacng3	GSM1480299	0.03863
ENSMUSG00000066189	Cacng3	GSM1480300	0.00000
ENSMUSG00000066189	Cacng3	GSM1480301	0.00000
ENSMUSG00000066189	Cacng3	GSM1480302	0.00000
ENSMUSG00000005611	Mrvi1	GSM1480291	0.67107
ENSMUSG00000005611	Mrvi1	GSM1480292	0.92448
ENSMUSG00000005611	Mrvi1	GSM1480293	0.29211
ENSMUSG00000005611	Mrvi1	GSM1480294	0.21148
ENSMUSG00000005611	Mrvi1	GSM1480295	0.38838
ENSMUSG00000005611	Mrvi1	GSM1480296	0.03886
ENSMUSG00000005611	Mrvi1	GSM1480297	102.26398
ENSMUSG00000005611	Mrvi1	GSM1480298	81.05056
ENSMUSG00000005611	Mrvi1	GSM1480299	115.31568
ENSMUSG00000005611	Mrvi1	GSM1480300	138.38724
ENSMUSG00000005611	Mrvi1	GSM1480301	177.99434
ENSMUSG00000005611	Mrvi1	GSM1480302	185.16572
ENSMUSG00000064299	4921528I07Rik	GSM1480291	0.00000
ENSMUSG00000064299	4921528I07Rik	GSM1480292	0.00000
ENSMUSG00000064299	4921528I07Rik	GSM1480293	0.00000
ENSMUSG00000064299	4921528I07Rik	GSM1480294	0.00000
ENSMUSG00000064299	4921528I07Rik	GSM1480295	0.00000
ENSMUSG00000064299	4921528I07Rik	GSM1480296	0.00000
ENSMUSG00000064299	4921528I07Rik	GSM1480297	0.03996
ENSMUSG00000064299	4921528I07Rik	GSM1480298	0.00000
ENSMUSG00000064299	4921528I07Rik	GSM1480299	0.00000
ENSMUSG00000064299	4921528I07Rik	GSM1480300	0.00000
ENSMUSG00000064299	4921528I07Rik	GSM1480301	0.00000
ENSMUSG00000064299	4921528I07Rik	GSM1480302	0.00000
ENSMUSG00000028174	Rpe65	GSM1480291	0.00000
ENSMUSG00000028174	Rpe65	GSM1480292	0.00000
ENSMUSG00000028174	Rpe65	GSM1480293	0.00000
ENSMUSG00000028174	Rpe65	GSM1480294	0.00000
ENSMUSG00000028174	Rpe65	GSM1480295	0.00000
ENSMUSG00000028174	Rpe65	GSM1480296	0.00000
ENSMUSG00000028174	Rpe65	GSM1480297	0.00000
ENSMUSG00000028174	Rpe65	GSM1480298	0.00000
ENSMUSG00000028174	Rpe65	GSM1480299	0.00000
ENSMUSG00000028174	Rpe65	GSM1480300	0.00000
ENSMUSG00000028174	Rpe65	GSM1480301	0.00000
ENSMUSG00000028174	Rpe65	GSM1480302	0.00000
ENSMUSG00000024902	Mrpl11	GSM1480291	27.60343
ENSMUSG00000024902	Mrpl11	GSM1480292	23.91038
ENSMUSG00000024902	Mrpl11	GSM1480293	21.86646
ENSMUSG00000024902	Mrpl11	GSM1480294	18.56753
ENSMUSG00000024902	Mrpl11	GSM1480295	11.84550
ENSMUSG00000024902	Mrpl11	GSM1480296	11.30830
ENSMUSG00000024902	Mrpl11	GSM1480297	14.02683
ENSMUSG00000024902	Mrpl11	GSM1480298	11.56650
ENSMUSG00000024902	Mrpl11	GSM1480299	12.13036
ENSMUSG00000024902	Mrpl11	GSM1480300	12.21671
ENSMUSG00000024902	Mrpl11	GSM1480301	10.21387
ENSMUSG00000024902	Mrpl11	GSM1480302	10.06901

Alternatively, we could achieve the same outcome by specifying a column range using the following command:

seqdata <- counts |> 
  pivot_longer(cols = GSM1480291:GSM1480302, 
               names_to = "Sample", 
               values_to = "Count")

We can also specify the columns we don’t want to reformat, and pivot_longer() will then reformat all the columns except those. To achieve this, we place a minus sign (“-”) in front of the column names that we wish to exclude. This is a commonly used approach with pivot_longer(), as it can be more convenient to exclude columns we don’t need rather than explicitly include the ones we want.

seqdata <- counts |> 
  pivot_longer(cols = -c(X, gene_symbol), 
               names_to = "Sample", 
               values_to = "Count")

Converting data from long to wide format

First, read the annotation file called GSE60450_annotation.csv (the path to the file should be data/GSE60450_annotation.csv).

annot <- read_csv("data/GSE60450_annotation.csv")
head(annot)

Output

ENSEMBL	Type	Annotation
ENSMUSG00000031805	SYMBOL	Jak3
ENSMUSG00000031805	GENENAME	Janus kinase 3
ENSMUSG00000033909	SYMBOL	Usp36
ENSMUSG00000033909	GENENAME	ubiquitin specific peptidase 36
ENSMUSG00000065524	SYMBOL	Mir135a-2
ENSMUSG00000065524	GENENAME	microRNA 135a-2

To transform this table so that it conforms to the tidy principles, we use the pivot_wider() function.

This operation will convert multiple rows with type and annotation into columns containing the Symbol and Gene_name, as illustrated in the image below.

The pivot_wider() function takes two arguments:

names_from = : a name or a vector of names of column(s) containing the labels that will be transformed into the new column names.
values_from = : a name or a vector of names of column(s) containing the values that will fill the new columns.

In our scenario, to reshape the annot data frame, we will use the column names Type and Annotation:

annot_tidy <- annot |> 
  pivot_wider(names_from = Type, 
              values_from = Annotation)

The above operation changes the ‘shape’ of the dataframe from a longer format (more rows) to a wider format (more columns). While the original table consists of 40 rows, using pivot_wider() results in only 20 rows. This reduction is due to the de-duplication of rows during the creation of new columns.

annot_tidy

Output

ENSEMBL	SYMBOL	GENENAME
ENSMUSG00000031805	Jak3	Janus kinase 3
ENSMUSG00000033909	Usp36	ubiquitin specific peptidase 36
ENSMUSG00000065524	Mir135a-2	microRNA 135a-2
ENSMUSG00000030374	Strn4	striatin, calmodulin binding protein 4
ENSMUSG00000098547	Mir6921	microRNA 6921
ENSMUSG00000035629	Rubcn	RUN domain and cysteine-rich domain containing, Beclin 1-interacting protein
ENSMUSG00000094053	Scgb2b7	secretoglobin, family 2B, member 7
ENSMUSG00000055491	Pprc1	peroxisome proliferative activated receptor, gamma, coactivator-related 1
ENSMUSG00000053080	Zfta	zinc finger translocation associated
ENSMUSG00000039715	Dync2i2	dynein 2 intermediate chain 2
ENSMUSG00000033475	Tomm6	translocase of outer mitochondrial membrane 6
ENSMUSG00000026283	Ing5	inhibitor of growth family, member 5
ENSMUSG00000037331	Larp1	La ribonucleoprotein 1, translational regulator
ENSMUSG00000074489	Bglap3	bone gamma-carboxyglutamate protein 3
ENSMUSG00000038246	Fam50b	family with sequence similarity 50, member B
ENSMUSG00000066189	Cacng3	calcium channel, voltage-dependent, gamma subunit 3
ENSMUSG00000005611	Irag1	inositol 1,4,5-triphosphate receptor associated 1
ENSMUSG00000064299	4921528I07Rik	RIKEN cDNA 4921528I07 gene
ENSMUSG00000028174	Rpe65	retinal pigment epithelium 65
ENSMUSG00000024902	Mrpl11	mitochondrial ribosomal protein L11

It’s important to note that since we only have two distinct labels in the Type column, we are essentially replacing the existing two columns with just two new columns. Consequently, the shape of the output doesn’t technically become wider than the input data frame. However, when there are more than two unique labels in the names_from column, the output will indeed become wider compared to the input.

Separating Columns

First, read the metadata file called GSE60450_metadata.csv (the path to the file should be data/GSE60450_metadata.csv).

metadata <- read_csv("data/GSE60450_metadata.csv")
head(metadata)

Output

gene_id	characteristics
GSM1480291	mammary gland;luminal cells;virgin
GSM1480292	mammary gland;luminal cells;virgin
GSM1480293	mammary gland;luminal cells;18.5 day pregnancy
GSM1480294	mammary gland;luminal cells;18.5 day pregnancy
GSM1480295	mammary gland;luminal cells;2 day lactation
GSM1480296	mammary gland;luminal cells;2 day lactation

To transform this table so that it conforms to the tidy principles, we use the separate_wider_position()/separate_wider_delim() function. This operation will separate characteristic column into 3 separate columns containing the tissue_type, immunophenotype and development_stage, as illustrated in the image below.

The separate_wider_delim() function takes three arguments:

cols = : a name or a vector of names of the column(s) that requires separation into multiple columns.
delim = : delimeter (or separator) between values. This is same as the delim = in read_delim().
names = : a vector containing column names for the the new columns.

To separate characteristic column in the metadata data frame into three separate columns based on the delimeter ; (semi colon), we can use the separate_wider_delim() function:

metadata_lform <- metadata |> 
  separate_wider_delim(cols = characteristics, 
                       delim =";",
                       names = c("tissue_type", "immunophenotype", "development_stage"))
metadata_lform

Output

gene_id	tissue_type	immunophenotype	development_stage
GSM1480291	mammary gland	luminal cells	virgin
GSM1480292	mammary gland	luminal cells	virgin
GSM1480293	mammary gland	luminal cells	18.5 day pregnancy
GSM1480294	mammary gland	luminal cells	18.5 day pregnancy
GSM1480295	mammary gland	luminal cells	2 day lactation
GSM1480296	mammary gland	luminal cells	2 day lactation
GSM1480297	mammary gland	basal cells	virgin
GSM1480298	mammary gland	basal cells	virgin
GSM1480299	mammary gland	basal cells	18.5 day pregnancy
GSM1480300	mammary gland	basal cells	18.5 day pregnancy
GSM1480301	mammary gland	basal cells	2 day lactation
GSM1480302	mammary gland	basal cells	2 day lactation

The separate_wider_position() function splits at fixed widths and takes two arguments:

cols = : a name or a vector of names of the column(s) that requires separation into multiple columns.
widths = : a named vector containing numbers where the names become the new column names and values specify the column widths.

For instance, we can divide the gene_id column into three separate columns to evaluate the functionality of this operation (this is provided purely as an example):

metadata_lform |> 
  separate_wider_position(cols = gene_id, 
                          widths = c(code = 3, prefix = 4, id = 3))

Output

code	prefix	id	tissue_type	immunophenotype	development_stage
GSM	1480	291	mammary gland	luminal cells	virgin
GSM	1480	292	mammary gland	luminal cells	virgin
GSM	1480	293	mammary gland	luminal cells	18.5 day pregnancy
GSM	1480	294	mammary gland	luminal cells	18.5 day pregnancy
GSM	1480	295	mammary gland	luminal cells	2 day lactation
GSM	1480	296	mammary gland	luminal cells	2 day lactation
GSM	1480	297	mammary gland	basal cells	virgin
GSM	1480	298	mammary gland	basal cells	virgin
GSM	1480	299	mammary gland	basal cells	18.5 day pregnancy
GSM	1480	300	mammary gland	basal cells	18.5 day pregnancy
GSM	1480	301	mammary gland	basal cells	2 day lactation
GSM	1480	302	mammary gland	basal cells	2 day lactation

Uniting Columns

The unite() function is the complement of separate(). Therefore, let’s revert what we did in the previous section to combine multiple columns to a single column as illustrated in the image below.

The unite() function takes three arguments:

col = : name of the new column that will contain the united values.
… = : a vector containing column names to unite.
sep = : delimeter (or separator) this is same as the delim = in read_delim(). If we don’t specify a separator to insert between the combined values, they will be separated by _ (underscores).

To separate characteristic column in the metadata data frame into three separate columns:

metadata_lform |> 
  unite(col = characteristics, 
        tissue_type, immunophenotype, development_stage,
        sep = ",")

Output

gene_id	characteristics
GSM1480291	mammary gland,luminal cells,virgin
GSM1480292	mammary gland,luminal cells,virgin
GSM1480293	mammary gland,luminal cells,18.5 day pregnancy
GSM1480294	mammary gland,luminal cells,18.5 day pregnancy
GSM1480295	mammary gland,luminal cells,2 day lactation
GSM1480296	mammary gland,luminal cells,2 day lactation
GSM1480297	mammary gland,basal cells,virgin
GSM1480298	mammary gland,basal cells,virgin
GSM1480299	mammary gland,basal cells,18.5 day pregnancy
GSM1480300	mammary gland,basal cells,18.5 day pregnancy
GSM1480301	mammary gland,basal cells,2 day lactation
GSM1480302	mammary gland,basal cells,2 day lactation

Missing Values

A value can be missing in one of two possible ways:

Explicitly, meaning it is flagged with NA.
Implicitly, implying that it is just not present in the data.

Let’s illustrate this idea with a very simple data frame:

covid_vac <- data.frame(
    year = c(2020, 2020, 2021, 2021, 2021,  2023, 2023,
             2023, 2024, 2024), 
    vaccine_type = c("Pfizer", "Moderna", "Pfizer", "Moderna", "Novavax", 
                   "Pfizer", "Moderna", 
                  "Novavax", "Moderna", NA),
    count = c(0, 3, 63, 88, 51,
               38, 19,
              5, 9, 7)
)
covid_vac

Output

year	vaccine_type	count
2020	Pfizer	0
2020	Moderna	3
2021	Pfizer	63
2021	Moderna	88
2021	Novavax	51
2023	Pfizer	38
2023	Moderna	19
2023	Novavax	5
2024	Moderna	9
2024	NA	7

In this dataset, we identify two occurrences of missing values:

The vaccine_type in 2024 with a count of 7 is explicitly missing, denoted by the presence of NA in the cell where its value should be.
The counts for the Novavax vaccine in 2020 and Pfizer, Novavax vaccines in 2024, are implicitly missing, as they do not appear in the dataset at all.

`is.na()`

To identify missing values we can use is.na() function which returns a logical vector with TRUE in the element locations that contain missing values represented by NA.

is.na(covid_vac)

Output

       year vaccine_type count
 [1,] FALSE        FALSE FALSE
 [2,] FALSE        FALSE FALSE
 [3,] FALSE        FALSE FALSE
 [4,] FALSE        FALSE FALSE
 [5,] FALSE        FALSE FALSE
 [6,] FALSE        FALSE FALSE
 [7,] FALSE        FALSE FALSE
 [8,] FALSE        FALSE FALSE
 [9,] FALSE        FALSE FALSE
[10,] FALSE         TRUE FALSE

is.na(covid_vac$vaccine_type)

Output

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

To identify the location or the number of NAs we can use the which() and sum() functions:

which(is.na(covid_vac))
sum(is.na(covid_vac))

Output

[1] 20

[1] 1

`na.omit()`

To omit all rows containing missing values, we can use na.omit() function in base R:

na.omit(covid_vac)

Output

year	vaccine_type	count
2020	Pfizer	0
2020	Moderna	3
2021	Pfizer	63
2021	Moderna	88
2021	Novavax	51
2023	Pfizer	38
2023	Moderna	19
2023	Novavax	5
2024	Moderna	9

`complete()`

We can use the complete() function to make our dataset more complete or to make missing values explicit in tidy data:

covid_vac |> complete(year, vaccine_type)

Output

year	vaccine_type	count
2020	Moderna	3
2020	Novavax	NA
2020	Pfizer	0
2020	NA	NA
2021	Moderna	88
2021	Novavax	51
2021	Pfizer	63
2021	NA	NA
2023	Moderna	19
2023	Novavax	5
2023	Pfizer	38
2023	NA	NA
2024	Moderna	9
2024	Novavax	NA
2024	Pfizer	NA
2024	NA	7

This function add missing values for potential combinations of year and vaccine_type. One problem is that R assumes NA in status as one of the combinations. To fix this, we can specify the labels of status to be considered as follows:

covid_vac |> complete(year, vaccine_type = c("Pfizer", "Moderna", "Novavax"))

Output

year	vaccine_type	count
2020	Moderna	3
2020	Novavax	NA
2020	Pfizer	0
2021	Moderna	88
2021	Novavax	51
2021	Pfizer	63
2023	Moderna	19
2023	Novavax	5
2023	Pfizer	38
2024	Moderna	9
2024	Novavax	NA
2024	Pfizer	NA
2024	NA	7

We can use the fill argument to assign the fill value:

covid_vac |> complete(year, 
                      vaccine_type = c("Pfizer", "Moderna", "Novavax"),
                       fill = list(count = 0))

Output

year	vaccine_type	count
2020	Moderna	3
2020	Novavax	0
2020	Pfizer	0
2021	Moderna	88
2021	Novavax	51
2021	Pfizer	63
2023	Moderna	19
2023	Novavax	5
2023	Pfizer	38
2024	Moderna	9
2024	Novavax	0
2024	Pfizer	0
2024	NA	7

We can use the full_seq() function from tidyr to fill out the data frame with all years from 2020 to 2024 and assign vaccination types and count values of 0 to those years and for which there was no observation.

covid_vac |> complete(year = full_seq(year, period = 1), 
                      vaccine_type = c("Pfizer", "Moderna", "Novavax"),
                       fill = list(count = 0))

Output

year	vaccine_type	count
2020	Moderna	3
2020	Novavax	0
2020	Pfizer	0
2021	Moderna	88
2021	Novavax	51
2021	Pfizer	63
2022	Moderna	0
2022	Novavax	0
2022	Pfizer	0
2023	Moderna	19
2023	Novavax	5
2023	Pfizer	38
2024	Moderna	9
2024	Novavax	0
2024	Pfizer	0
2024	NA	7

`fill()`

The fill() function is used to fill missing values in a data frame, particularly within columns.

Let’s first make missing values in the covid_vac dataset explicit and assign it to a data frame named covid_vac_comp.

covid_vac_comp <-  covid_vac |> 
  complete(year = full_seq(year, period = 1), 
           vaccine_type = c("Pfizer", "Moderna", "Novavax"))

We can specify the direction to fill the missing values using the argument .direction. Remember to specify the list of columns to fill.

covid_vac_comp |> fill(count, .direction = "down")

Output

year	vaccine_type	count
2020	Moderna	3
2020	Novavax	3
2020	Pfizer	0
2021	Moderna	88
2021	Novavax	51
2021	Pfizer	63
2022	Moderna	63
2022	Novavax	63
2022	Pfizer	63
2023	Moderna	19
2023	Novavax	5
2023	Pfizer	38
2024	Moderna	9
2024	Novavax	9
2024	Pfizer	9
2024	NA	7

Similarly, we can fill upwards as follows:

covid_vac_comp |> fill(count, .direction = "up")

Output

year	vaccine_type	count
2020	Moderna	3
2020	Novavax	0
2020	Pfizer	0
2021	Moderna	88
2021	Novavax	51
2021	Pfizer	63
2022	Moderna	19
2022	Novavax	19
2022	Pfizer	19
2023	Moderna	19
2023	Novavax	5
2023	Pfizer	38
2024	Moderna	9
2024	Novavax	7
2024	Pfizer	7
2024	NA	7

Once the data is structured and organized according to tidy principles, we can begin manipulating and transforming it. The next section illustrates how this can be accomplished using the dplyr package from the tidyverse package suit.