library(tidyverse)
Into the Tidyverse
Common tasks in working with data include actions like removing rows or columns, performing calculations, or adding new columns. This sort of operations is known as data manipulation. It is the process of cleaning, organizing, and transforming raw data into a more structured and usable format for analysis.
In this workshop, we’ll guide you through the process of data manipulation in R, starting with the tidyverse. The tidyverse is a collection of packages that align with a data science philosophy developed by Hadley Wickham and the RStudio team. Many users find it to be a more intuitive way to grasp R concepts. Although some tasks may be more straightforward in base-R, we’ll also highlight those aspects.
If you’ve already installed the tidyverse package (if not, you can do so by running the command: install.packages("tidyverse")
), let’s proceed to load it into our R session:
The following are key techniques outlined in Hadley Wickham and Garrett Grolemund’s book, R for Data Science:
This workshop focuses on:
- Import:
readr
to import data. - Tidy:
tidyr
to organize rows of data into unique values. - Transform:
dplyr
to perform data manipulation, involving tasks such as subsetting by rows or columns, sorting, and joining. - Visualize:
ggplot2
to create static plots, applying the principles of the grammar of graphics.
This workflow is of utmost importance. Instead of constructing analyses based on the unconventional format of your data, take measures to tidy up your data. Tidy data enables the use of number of analytical and visualization tools. It eliminates the need to develop ad-hoc methods to accommodate your data. This not only saves time but also enhances the clarity and comprehensibility of your work, benefiting both your collaborators and, most importantly, your future self.
Step 1: Importing data
First we need to import data into our R session. This can be achieved either by using datasets bundled with R packages or by importing external data into our workspace for data manipulation. This is the first step in the tidyverse workflow.
Preloaded Data in R Packages
A great way to learn data science tools is using the data provided by R packages. The data()
function is a convenient way to explore and import pre-loaded datasets that come bundled with the R environment.
library(tidyverse)
data()
View Output
To import pre-loaded datasets bundled with a particular package (e.g., tidyr):
data(package = "tidyr")
View Output
When you load a package the pre-loaded datasets automatically imported into you R environment. Therefore you can access them directly as follows.
household
Output
family | dob_child1 | dob_child2 | name_child1 | name_child2 |
---|---|---|---|---|
1 | 1998-11-26 | 2000-01-29 | Susan | Jose |
2 | 1996-06-22 | NA | Mark | NA |
3 | 2002-07-11 | 2004-04-05 | Sam | Seth |
4 | 2004-10-10 | 2009-08-27 | Craig | Khai |
5 | 2000-12-05 | 2005-02-28 | Parker | Gracie |
?household
View Output
household | R Documentation |
Household data
Description
This dataset is based on an example in vignette(“datatable-reshape”, package = “data.table”)
Usage
household
Format
A data frame with 5 rows and 5 columns:
- family
-
Family identifier
- dob_child1
-
Date of birth of first child
- dob_child2
-
Date of birth of second child
- name_child1
-
Name of first child
?
- name_child2
-
Name of second child
Reading the Data
At some point, you want to apply what you’ve learned to your own data. In this section, you’ll learn the basics of reading data files into R using the readr package. The goal of readr is to provide a fast and friendly way to read rectangular data from delimited files, such as comma-separated values (CSV) and tab-separated values (TSV). It is designed to parse many types of data.
We will use the read_csv()
function from readr package to import a dataset. (See also read.csv()
in base R.) CSV short for Comma Separated Values, is a text format commonly used to store tabular data. Conventionally the first line contains column headings.
The first argument of the read_csv()
function takes the path to the file (or a web link). The following code will work if the cms_hospital_patient_satisfaction_2016_sampled.csv file is in the data/patient_satisfaction path.
# here the first argument is a path
<- read_csv("data/cms_hospital_patient_satisfaction.csv") cms_data
Output
Rows: 15 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): ID, Facility Name, County, Hospital Type
dbl (4): Star Rating, No of Surveys, Response Rate, Overall Rating
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# here the first argument is a web link
<- read_csv("https://github.com/tidyverse/readr/raw/main/inst/extdata/mtcars.csv") mtvcars
Output
Rows: 32 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This command prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains).
Other file types
Once you’ve mastered read_csv()
, using other functions in the readr package is straightforward. It’s just a matter of knowing which function to use:
read_csv2()
reads semicolon-separated files. These use ; instead of , to separate fields and are common in countries that use , as the decimal marker or thousands seperator.read_tsv()
reads tab-delimited files.read_delim()
reads in files with any delimiter, attempting to automatically guess the delimiter if you don’t specify it.
Exploring the Data
In the previous section we imported a dataset that is bundled with the tidyr package, into a dataframe named household. In the next section we read in a CSV file and created a data frame named cms_data. This section demonstrates different ways to get to know these two data objects.
The class()
function is used to identify the data type or data structure of an object (or variable):
class(household)
class(cms_data)
Output
[1] "tbl_df" "tbl" "data.frame"
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
When the name of the object (data frame) is typed, the first few lines along with some information, such as the number of rows are displayed:
cms_data
Output
ID | Facility Name | County | Hospital Type | Star Rating | No of Surveys | Response Rate | Overall Rating |
---|---|---|---|---|---|---|---|
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 |
140103 | ST BERNARD HOSPITAL | COOK | Acute Care Hospital | 1 | 264 | 6 | 2 |
100051 | SOUTH LAKE HOSPITAL | LAKE | Acute Care Hospital | 2 | 1382 | 20 | 2 |
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 |
440048 | BAPTIST MEMORIAL HOSPITAL | SHELBY | Acute Care Hospital | 2 | 1799 | 18 | 2 |
450011 | ST JOSEPH REGIONAL HEALTH CENTER | BRAZOS | Acute Care Hospital | 3 | 1379 | 24 | 3 |
151317 | GREENE COUNTY GENERAL HOSPITAL | GREENE | Critical Access Hospital | 3 | 114 | 22 | 3 |
061327 | SOUTHWEST MEMORIAL HOSPITAL | MONTEZUMA | Critical Access Hospital | 4 | 247 | 34 | 3 |
490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital | 4 | 619 | 32 | 3 |
110215 | PIEDMONT FAYETTE HOSPITAL | FAYETTE | Acute Care Hospital | 2 | 1714 | 21 | 2 |
050704 | MISSION COMMUNITY HOSPITAL | LOS ANGELES | Acute Care Hospital | 3 | 241 | 14 | 3 |
100296 | DOCTORS HOSPITAL | MIAMI-DADE | Acute Care Hospital | 4 | 393 | 24 | 3 |
440003 | SUMNER REGIONAL MEDICAL CENTER | SUMNER | Acute Care Hospital | 4 | 680 | 35 | 2 |
501339 | WHIDBEY GENERAL HOSPITAL | ISLAND | Critical Access Hospital | 3 | 389 | 29 | 3 |
050116 | NORTHRIDGE MEDICAL CENTER | LOS ANGELES | Acute Care Hospital | 3 | 1110 | 20 | 2 |
The dim()
function prints the dimensions (rows x columns) of the data frame:
dim(cms_data)
Output
[1] 15 8
This information is available at the environment pane in the top right panel as the number of observations (rows) and variables (columns).
The nrow()
function prints the number of rows while ncol()
prints the number of columns:
nrow(cms_data)
ncol(cms_data)
Output
[1] 15
[1] 8
The View()
function gives a spreadsheet-like view of the data frame:
View(cms_data)
By clicking the object on the environment tab also gives a spreadsheet-like view of the object:
The glimpse()
function (dplyr package) displays a compact summary of the data frame, showing you key details such as the data types of each column, the first few values, and the total number of observations.
glimpse(cms_data)
Output
Rows: 15
Columns: 8
$ ID <chr> "050424", "140103", "100051", "040062", "440048", "45…
$ `Facility Name` <chr> "SCRIPPS GREEN HOSPITAL", "ST BERNARD HOSPITAL", "SOU…
$ County <chr> "SAN DIEGO", "COOK", "LAKE", "SEBASTIAN", "SHELBY", "…
$ `Hospital Type` <chr> "Acute Care Hospital", "Acute Care Hospital", "Acute …
$ `Star Rating` <dbl> 4, 1, 2, 3, 2, 3, 3, 4, 4, 2, 3, 4, 4, 3, 3
$ `No of Surveys` <dbl> 3110, 264, 1382, 2506, 1799, 1379, 114, 247, 619, 171…
$ `Response Rate` <dbl> 41, 6, 20, 35, 18, 24, 22, 34, 32, 21, 14, 24, 35, 29…
$ `Overall Rating` <dbl> 5, 2, 2, 3, 2, 3, 3, 3, 3, 2, 3, 3, 2, 3, 2
The head()
function prints the top 6 rows of a data frame:
head(cms_data)
Output
ID | Facility Name | County | Hospital Type | Star Rating | No of Surveys | Response Rate | Overall Rating |
---|---|---|---|---|---|---|---|
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 |
140103 | ST BERNARD HOSPITAL | COOK | Acute Care Hospital | 1 | 264 | 6 | 2 |
100051 | SOUTH LAKE HOSPITAL | LAKE | Acute Care Hospital | 2 | 1382 | 20 | 2 |
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 |
440048 | BAPTIST MEMORIAL HOSPITAL | SHELBY | Acute Care Hospital | 2 | 1799 | 18 | 2 |
450011 | ST JOSEPH REGIONAL HEALTH CENTER | BRAZOS | Acute Care Hospital | 3 | 1379 | 24 | 3 |
Similarly, the tail()
function prints the bottom 6 rows of the data frame:
tail(cms_data)
Output
ID | Facility Name | County | Hospital Type | Star Rating | No of Surveys | Response Rate | Overall Rating |
---|---|---|---|---|---|---|---|
110215 | PIEDMONT FAYETTE HOSPITAL | FAYETTE | Acute Care Hospital | 2 | 1714 | 21 | 2 |
050704 | MISSION COMMUNITY HOSPITAL | LOS ANGELES | Acute Care Hospital | 3 | 241 | 14 | 3 |
100296 | DOCTORS HOSPITAL | MIAMI-DADE | Acute Care Hospital | 4 | 393 | 24 | 3 |
440003 | SUMNER REGIONAL MEDICAL CENTER | SUMNER | Acute Care Hospital | 4 | 680 | 35 | 2 |
501339 | WHIDBEY GENERAL HOSPITAL | ISLAND | Critical Access Hospital | 3 | 389 | 29 | 3 |
050116 | NORTHRIDGE MEDICAL CENTER | LOS ANGELES | Acute Care Hospital | 3 | 1110 | 20 | 2 |
The colnames()
function displays all the column names:
colnames(cms_data)
[1] "ID" "Facility Name" "County" "Hospital Type"
[5] "Star Rating" "No of Surveys" "Response Rate" "Overall Rating"
The $
symbol allows access to individual columns. To display ‘Hospital Type’ column:
$Hospital Type cms_data
Error: <text>:1:19: unexpected symbol
1: cms_data$Hospital Type
^
Since the column names contain spaces, they need to be enclosed within `` (backticks) for R to interpret it as a single variable:
$`Hospital Type` # or cms_data$"Hospital Type" cms_data
It is a good practice to rename all the columns with spaces into a format that R can interpret. Conventionally, _ (underscore) is used to separate words in column names and variables instead of (space). We can rename a single column using the rename()
function:
<- rename(cms_data, Hospital_Type = "Hospital Type")
cms_data head(cms_data)
Output
ID | Facility Name | County | Hospital_Type | Star Rating | No of Surveys | Response Rate | Overall Rating |
---|---|---|---|---|---|---|---|
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 |
140103 | ST BERNARD HOSPITAL | COOK | Acute Care Hospital | 1 | 264 | 6 | 2 |
100051 | SOUTH LAKE HOSPITAL | LAKE | Acute Care Hospital | 2 | 1382 | 20 | 2 |
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 |
440048 | BAPTIST MEMORIAL HOSPITAL | SHELBY | Acute Care Hospital | 2 | 1799 | 18 | 2 |
450011 | ST JOSEPH REGIONAL HEALTH CENTER | BRAZOS | Acute Care Hospital | 3 | 1379 | 24 | 3 |
Note: To modify the original cms_data object, it is essential to assign the renamed object to cms_data. Otherwise, a copy of the cms_data object is changed.
Or by directly assigning a new column name:
colnames(cms_data)[2] <- "Facility_Name"
head(cms_data)
Output
ID | Facility_Name | County | Hospital_Type | Star Rating | No of Surveys | Response Rate | Overall Rating |
---|---|---|---|---|---|---|---|
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 |
140103 | ST BERNARD HOSPITAL | COOK | Acute Care Hospital | 1 | 264 | 6 | 2 |
100051 | SOUTH LAKE HOSPITAL | LAKE | Acute Care Hospital | 2 | 1382 | 20 | 2 |
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 |
440048 | BAPTIST MEMORIAL HOSPITAL | SHELBY | Acute Care Hospital | 2 | 1799 | 18 | 2 |
450011 | ST JOSEPH REGIONAL HEALTH CENTER | BRAZOS | Acute Care Hospital | 3 | 1379 | 24 | 3 |
Using the above methods to rename multiple columns just to replace space with underscore can be laborious. A quick way to replace all spaces in the column names with underscore is shown below.
Option 1:
# by string substituting _ in place of ' '(space)
colnames(cms_data) <- gsub(" ", "_", colnames(cms_data))
colnames(cms_data)
Output
[1] "ID" "Facility_Name" "County" "Hospital_Type"
[5] "Star_Rating" "No_of_Surveys" "Response_Rate" "Overall_Rating"
Here the gsub()
function replace space with underscores in the column names of the cms_data. This new object needs to be assigned (<-
) to colnames(cms_data)
in order for the changes to be saved in the cms_data object.
Option 2:
Here we use the janitor package which contains functions to clean data. You first need to install and load the library before using it.
# using janitor package
library(janitor) # remember to install janitor: install.packages("janitor")
<- cms_data |> clean_names()
cms_data cms_data
Output
id | facility_name | county | hospital_type | star_rating | no_of_surveys | response_rate | overall_rating |
---|---|---|---|---|---|---|---|
050424 | SCRIPPS GREEN HOSPITAL | SAN DIEGO | Acute Care Hospital | 4 | 3110 | 41 | 5 |
140103 | ST BERNARD HOSPITAL | COOK | Acute Care Hospital | 1 | 264 | 6 | 2 |
100051 | SOUTH LAKE HOSPITAL | LAKE | Acute Care Hospital | 2 | 1382 | 20 | 2 |
040062 | MERCY HOSPITAL FORT SMITH | SEBASTIAN | Acute Care Hospital | 3 | 2506 | 35 | 3 |
440048 | BAPTIST MEMORIAL HOSPITAL | SHELBY | Acute Care Hospital | 2 | 1799 | 18 | 2 |
450011 | ST JOSEPH REGIONAL HEALTH CENTER | BRAZOS | Acute Care Hospital | 3 | 1379 | 24 | 3 |
151317 | GREENE COUNTY GENERAL HOSPITAL | GREENE | Critical Access Hospital | 3 | 114 | 22 | 3 |
061327 | SOUTHWEST MEMORIAL HOSPITAL | MONTEZUMA | Critical Access Hospital | 4 | 247 | 34 | 3 |
490057 | SENTARA GENERAL HOSPITAL | VIRGINIA BEACH | Acute Care Hospital | 4 | 619 | 32 | 3 |
110215 | PIEDMONT FAYETTE HOSPITAL | FAYETTE | Acute Care Hospital | 2 | 1714 | 21 | 2 |
050704 | MISSION COMMUNITY HOSPITAL | LOS ANGELES | Acute Care Hospital | 3 | 241 | 14 | 3 |
100296 | DOCTORS HOSPITAL | MIAMI-DADE | Acute Care Hospital | 4 | 393 | 24 | 3 |
440003 | SUMNER REGIONAL MEDICAL CENTER | SUMNER | Acute Care Hospital | 4 | 680 | 35 | 2 |
501339 | WHIDBEY GENERAL HOSPITAL | ISLAND | Critical Access Hospital | 3 | 389 | 29 | 3 |
050116 | NORTHRIDGE MEDICAL CENTER | LOS ANGELES | Acute Care Hospital | 3 | 1110 | 20 | 2 |
Here we are sending the cms_data data frame into the function clean_names()
which replaces replaces spaces with underscore in column names. Note that it also convert all characters in column names to lower case.
The default format is snake case styled as snake_case. You can specify other formats like camel case, title case.
Option 3:
Additionally, we can use the make.names()
function in base R which replaces ’ ’ (spaces) with . (dots).
# using make.names() function
colnames(cms_data) <- make.names(colnames(cms_data))
colnames(cms_data)
Output
[1] "ID" "Facility_Name" "County" "Hospital_Type"
[5] "Star.Rating" "No.of.Surveys" "Response.Rate" "Overall.Rating"
The str()
function shows the structure of the data:
str(cms_data)
Output
spc_tbl_ [15 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ ID : chr [1:15] "050424" "140103" "100051" "040062" ...
$ Facility_Name : chr [1:15] "SCRIPPS GREEN HOSPITAL" "ST BERNARD HOSPITAL" "SOUTH LAKE HOSPITAL" "MERCY HOSPITAL FORT SMITH" ...
$ County : chr [1:15] "SAN DIEGO" "COOK" "LAKE" "SEBASTIAN" ...
$ Hospital_Type : chr [1:15] "Acute Care Hospital" "Acute Care Hospital" "Acute Care Hospital" "Acute Care Hospital" ...
$ Star Rating : num [1:15] 4 1 2 3 2 3 3 4 4 2 ...
$ No of Surveys : num [1:15] 3110 264 1382 2506 1799 ...
$ Response Rate : num [1:15] 41 6 20 35 18 24 22 34 32 21 ...
$ Overall Rating: num [1:15] 5 2 2 3 2 3 3 3 3 2 ...
- attr(*, "spec")=
.. cols(
.. ID = col_character(),
.. `Facility Name` = col_character(),
.. County = col_character(),
.. `Hospital Type` = col_character(),
.. `Star Rating` = col_double(),
.. `No of Surveys` = col_double(),
.. `Response Rate` = col_double(),
.. `Overall Rating` = col_double()
.. )
- attr(*, "problems")=<externalptr>
The summary()
function generates summary statistics:
summary(cms_data)
Output
ID Facility_Name County Hospital_Type
Length:15 Length:15 Length:15 Length:15
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Star Rating No of Surveys Response Rate Overall Rating
Min. :1.0 Min. : 114.0 Min. : 6 Min. :2.000
1st Qu.:2.5 1st Qu.: 326.5 1st Qu.:20 1st Qu.:2.000
Median :3.0 Median : 680.0 Median :24 Median :3.000
Mean :3.0 Mean :1063.1 Mean :25 Mean :2.733
3rd Qu.:4.0 3rd Qu.:1548.0 3rd Qu.:33 3rd Qu.:3.000
Max. :4.0 Max. :3110.0 Max. :41 Max. :5.000
A statitical overview can be obtained using the skim()
function in skimr package:
library(skimr)
skim(cms_data)
Output
Name | cms_data |
Number of rows | 15 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
character | 4 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
ID | 0 | 1 | 6 | 6 | 0 | 15 | 0 |
Facility_Name | 0 | 1 | 16 | 32 | 0 | 15 | 0 |
County | 0 | 1 | 4 | 14 | 0 | 14 | 0 |
Hospital_Type | 0 | 1 | 19 | 24 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Star Rating | 0 | 1 | 3.00 | 0.93 | 1 | 2.5 | 3 | 4 | 4 | ▁▃▁▇▇ |
No of Surveys | 0 | 1 | 1063.13 | 909.05 | 114 | 326.5 | 680 | 1548 | 3110 | ▇▁▃▁▁ |
Response Rate | 0 | 1 | 25.00 | 9.30 | 6 | 20.0 | 24 | 33 | 41 | ▂▇▇▆▆ |
Overall Rating | 0 | 1 | 2.73 | 0.80 | 2 | 2.0 | 3 | 3 | 5 | ▆▇▁▁▁ |
Writing Data to a File
Writing data to a file is a fundamental operation in programming and data analysis. It involves taking data from within a program or environment and storing it in a file on a disk for later use or sharing. This section explains the basics of writing a data file using the readr package.
The write_csv()
and write_tsv()
functions are part of the readr package, which is designed for writing delimited files like CSV (comma-separated values) and TSV (tab-separated values). These functions are used to write data frames into CSV and TSV files, respectively.
We first provide the variable name of the data frame followed by the file name (ideally including the full folder location).
To write a CSV file:
# on Mac:
write_csv(cms_data, "~/Desktop/cms_data.csv")
# on Windows
write_csv(cms_data, "C:/Users/srajapaksa/Desktop/cms_data.csv")
To write a TSV file:
# on Mac:
write_tsv(cms_data, "~/Desktop/cms_data.csv")
# on Windows
write_tsv(cms_data, "C:/Users/srajapaksa/Desktop/cms_data.csv")
Step 2: Tidy Data
Tidy data is a structured and organized format for presenting data that follows a simple convention: variables are placed in columns, observations are placed in rows and values are placed in cells. This standardized arrangement makes it easy to work with and analyze data efficiently. The principles of tidy data, popularized by Hadley Wickham, are designed to promote consistency and ease of use in data analysis.
This is the second step in the tidyverse workflow.
Let’s take a look at some examples.
Data is often entered in a wide format, where each row typically represents a site, subject, or patient, and there are multiple observation variables containing the same type of data.
For instance, consider the AirPassengers dataset. It contains information on monthly airline passenger numbers from 1949 to 1960. In this dataset, each row corresponds to a single year, and the columns represent each month from January to December.
AirPassengers
Output
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
Wide format is intuitive for data entry. But it is less so for data analysis. Consider calculating the monthly mean; where would you place it? Would it be another row?
Data needs to be reshaped to conform to the tidy data structure. It involves using four primary verbs (or pairs of opposites):
- Convert columns into rows (
pivot_longer()
). - Convert rows into columns (
pivot_wider()
). - Convert a character column into multiple columns (
separate_wider_delim()
andseparate_wider_position()
). - Combine multiple character columns into a single column (
unite()
).
First, load the tidyr
package. Since you have already installed the tidyverse, you should be able to load it directly as follows (otherwise install it using the command install.packages("tidyverse")
if necessary):
library(tidyverse)
Converting data from wide to long format
First read the counts file called GSE60450_normalized_data.csv that is in a folder called data (i.e. the path to the file should be data/GSE60450_normalized_data.csv).
<- read_csv("data/GSE60450_normalized_data.csv")
counts head(counts)
Output
X | gene_symbol | GSM1480291 | GSM1480292 | GSM1480293 | GSM1480294 | GSM1480295 | GSM1480296 | GSM1480297 | GSM1480298 | GSM1480299 | GSM1480300 | GSM1480301 | GSM1480302 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ENSMUSG00000031805 | Jak3 | 82.22869 | 81.18604 | 36.13808 | 36.62751 | 12.35039 | 11.85235 | 10.59006 | 14.88337 | 7.57182 | 7.05763 | 3.16982 | 4.25451 |
ENSMUSG00000033909 | Usp36 | 88.22360 | 94.59098 | 76.11531 | 63.73863 | 27.10872 | 24.87050 | 95.67017 | 100.73912 | 78.07470 | 59.35009 | 36.36491 | 40.37057 |
ENSMUSG00000065524 | Mir135a-2 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
ENSMUSG00000030374 | Strn4 | 94.26324 | 85.05204 | 59.54855 | 54.30684 | 24.42892 | 24.90936 | 125.00263 | 102.99289 | 110.64124 | 97.98130 | 77.96883 | 105.70093 |
ENSMUSG00000098547 | Mir6921 | 0.08948 | 0.08404 | 0.00000 | 0.08459 | 0.03884 | 0.07772 | 0.00000 | 0.08505 | 0.07726 | 0.08255 | 0.04403 | 0.04727 |
ENSMUSG00000035629 | Rubcn | 37.84846 | 37.73554 | 21.32397 | 27.70326 | 30.13807 | 36.33424 | 33.84822 | 36.91075 | 27.46715 | 31.36722 | 56.00020 | 64.33764 |
To transform this table from a wide format to a long format, we use the pivot_longer()
function. It’s important to note that this function does not create tidy data as it duplicates rows. However, the output in ‘long format’ from pivot_longer()
is often necessary for ggplot, where each aesthetic or facet category must be a single column of values and for left_join()
, which will be introduced later.
This operation will convert multiple columns with counts for each sample into a single column containing all the expression values, as illustrated in the image below.
The pivot_longer()
function takes three arguments:
- cols = : a vector indicating the names of the columns to be converted into labels in long form.
- names_to = : a name or vector of names for the new column(s) containing the labels from the specified columns.
- **values_to =* *: a name for the new column containing the values corresponding to the specified columns.
It’s important to note that when using pivot_wider()
, the new column names need to be enclosed in quotes.
<- counts |>
seqdata pivot_longer(cols = starts_with("GSM"),
names_to = "Sample",
values_to = "Count")
The cols = starts_with("GSM")
command returns a vector of columns whose names starts with “GSM”. pivot_longer()
will then transform the those columns into two new columns, denoted as “Sample” and “Count.” The parameter names_to = "Sample"
indicates that the new column containing the specified columns (defined by cols) should be named “Sample,” while values_to = "Count"
specifies that the new column containing the values should be named “Count.”
seqdata
Output
X | gene_symbol | Sample | Count |
---|---|---|---|
ENSMUSG00000031805 | Jak3 | GSM1480291 | 82.22869 |
ENSMUSG00000031805 | Jak3 | GSM1480292 | 81.18604 |
ENSMUSG00000031805 | Jak3 | GSM1480293 | 36.13808 |
ENSMUSG00000031805 | Jak3 | GSM1480294 | 36.62751 |
ENSMUSG00000031805 | Jak3 | GSM1480295 | 12.35039 |
ENSMUSG00000031805 | Jak3 | GSM1480296 | 11.85235 |
ENSMUSG00000031805 | Jak3 | GSM1480297 | 10.59006 |
ENSMUSG00000031805 | Jak3 | GSM1480298 | 14.88337 |
ENSMUSG00000031805 | Jak3 | GSM1480299 | 7.57182 |
ENSMUSG00000031805 | Jak3 | GSM1480300 | 7.05763 |
ENSMUSG00000031805 | Jak3 | GSM1480301 | 3.16982 |
ENSMUSG00000031805 | Jak3 | GSM1480302 | 4.25451 |
ENSMUSG00000033909 | Usp36 | GSM1480291 | 88.22360 |
ENSMUSG00000033909 | Usp36 | GSM1480292 | 94.59098 |
ENSMUSG00000033909 | Usp36 | GSM1480293 | 76.11531 |
ENSMUSG00000033909 | Usp36 | GSM1480294 | 63.73863 |
ENSMUSG00000033909 | Usp36 | GSM1480295 | 27.10872 |
ENSMUSG00000033909 | Usp36 | GSM1480296 | 24.87050 |
ENSMUSG00000033909 | Usp36 | GSM1480297 | 95.67017 |
ENSMUSG00000033909 | Usp36 | GSM1480298 | 100.73912 |
ENSMUSG00000033909 | Usp36 | GSM1480299 | 78.07470 |
ENSMUSG00000033909 | Usp36 | GSM1480300 | 59.35009 |
ENSMUSG00000033909 | Usp36 | GSM1480301 | 36.36491 |
ENSMUSG00000033909 | Usp36 | GSM1480302 | 40.37057 |
ENSMUSG00000065524 | Mir135a-2 | GSM1480291 | 0.00000 |
ENSMUSG00000065524 | Mir135a-2 | GSM1480292 | 0.00000 |
ENSMUSG00000065524 | Mir135a-2 | GSM1480293 | 0.00000 |
ENSMUSG00000065524 | Mir135a-2 | GSM1480294 | 0.00000 |
ENSMUSG00000065524 | Mir135a-2 | GSM1480295 | 0.00000 |
ENSMUSG00000065524 | Mir135a-2 | GSM1480296 | 0.00000 |
ENSMUSG00000065524 | Mir135a-2 | GSM1480297 | 0.00000 |
ENSMUSG00000065524 | Mir135a-2 | GSM1480298 | 0.00000 |
ENSMUSG00000065524 | Mir135a-2 | GSM1480299 | 0.00000 |
ENSMUSG00000065524 | Mir135a-2 | GSM1480300 | 0.00000 |
ENSMUSG00000065524 | Mir135a-2 | GSM1480301 | 0.00000 |
ENSMUSG00000065524 | Mir135a-2 | GSM1480302 | 0.00000 |
ENSMUSG00000030374 | Strn4 | GSM1480291 | 94.26324 |
ENSMUSG00000030374 | Strn4 | GSM1480292 | 85.05204 |
ENSMUSG00000030374 | Strn4 | GSM1480293 | 59.54855 |
ENSMUSG00000030374 | Strn4 | GSM1480294 | 54.30684 |
ENSMUSG00000030374 | Strn4 | GSM1480295 | 24.42892 |
ENSMUSG00000030374 | Strn4 | GSM1480296 | 24.90936 |
ENSMUSG00000030374 | Strn4 | GSM1480297 | 125.00263 |
ENSMUSG00000030374 | Strn4 | GSM1480298 | 102.99289 |
ENSMUSG00000030374 | Strn4 | GSM1480299 | 110.64124 |
ENSMUSG00000030374 | Strn4 | GSM1480300 | 97.98130 |
ENSMUSG00000030374 | Strn4 | GSM1480301 | 77.96883 |
ENSMUSG00000030374 | Strn4 | GSM1480302 | 105.70093 |
ENSMUSG00000098547 | Mir6921 | GSM1480291 | 0.08948 |
ENSMUSG00000098547 | Mir6921 | GSM1480292 | 0.08404 |
ENSMUSG00000098547 | Mir6921 | GSM1480293 | 0.00000 |
ENSMUSG00000098547 | Mir6921 | GSM1480294 | 0.08459 |
ENSMUSG00000098547 | Mir6921 | GSM1480295 | 0.03884 |
ENSMUSG00000098547 | Mir6921 | GSM1480296 | 0.07772 |
ENSMUSG00000098547 | Mir6921 | GSM1480297 | 0.00000 |
ENSMUSG00000098547 | Mir6921 | GSM1480298 | 0.08505 |
ENSMUSG00000098547 | Mir6921 | GSM1480299 | 0.07726 |
ENSMUSG00000098547 | Mir6921 | GSM1480300 | 0.08255 |
ENSMUSG00000098547 | Mir6921 | GSM1480301 | 0.04403 |
ENSMUSG00000098547 | Mir6921 | GSM1480302 | 0.04727 |
ENSMUSG00000035629 | Rubcn | GSM1480291 | 37.84846 |
ENSMUSG00000035629 | Rubcn | GSM1480292 | 37.73554 |
ENSMUSG00000035629 | Rubcn | GSM1480293 | 21.32397 |
ENSMUSG00000035629 | Rubcn | GSM1480294 | 27.70326 |
ENSMUSG00000035629 | Rubcn | GSM1480295 | 30.13807 |
ENSMUSG00000035629 | Rubcn | GSM1480296 | 36.33424 |
ENSMUSG00000035629 | Rubcn | GSM1480297 | 33.84822 |
ENSMUSG00000035629 | Rubcn | GSM1480298 | 36.91075 |
ENSMUSG00000035629 | Rubcn | GSM1480299 | 27.46715 |
ENSMUSG00000035629 | Rubcn | GSM1480300 | 31.36722 |
ENSMUSG00000035629 | Rubcn | GSM1480301 | 56.00020 |
ENSMUSG00000035629 | Rubcn | GSM1480302 | 64.33764 |
ENSMUSG00000094053 | Scgb2b7 | GSM1480291 | 0.00000 |
ENSMUSG00000094053 | Scgb2b7 | GSM1480292 | 0.00000 |
ENSMUSG00000094053 | Scgb2b7 | GSM1480293 | 0.00000 |
ENSMUSG00000094053 | Scgb2b7 | GSM1480294 | 0.00000 |
ENSMUSG00000094053 | Scgb2b7 | GSM1480295 | 0.00000 |
ENSMUSG00000094053 | Scgb2b7 | GSM1480296 | 0.00000 |
ENSMUSG00000094053 | Scgb2b7 | GSM1480297 | 0.00000 |
ENSMUSG00000094053 | Scgb2b7 | GSM1480298 | 0.00000 |
ENSMUSG00000094053 | Scgb2b7 | GSM1480299 | 0.00000 |
ENSMUSG00000094053 | Scgb2b7 | GSM1480300 | 0.00000 |
ENSMUSG00000094053 | Scgb2b7 | GSM1480301 | 0.00000 |
ENSMUSG00000094053 | Scgb2b7 | GSM1480302 | 0.00000 |
ENSMUSG00000055491 | Pprc1 | GSM1480291 | 194.87930 |
ENSMUSG00000055491 | Pprc1 | GSM1480292 | 165.14551 |
ENSMUSG00000055491 | Pprc1 | GSM1480293 | 159.82546 |
ENSMUSG00000055491 | Pprc1 | GSM1480294 | 167.99592 |
ENSMUSG00000055491 | Pprc1 | GSM1480295 | 75.69470 |
ENSMUSG00000055491 | Pprc1 | GSM1480296 | 82.42238 |
ENSMUSG00000055491 | Pprc1 | GSM1480297 | 281.97525 |
ENSMUSG00000055491 | Pprc1 | GSM1480298 | 241.83343 |
ENSMUSG00000055491 | Pprc1 | GSM1480299 | 227.84988 |
ENSMUSG00000055491 | Pprc1 | GSM1480300 | 232.81909 |
ENSMUSG00000055491 | Pprc1 | GSM1480301 | 166.85593 |
ENSMUSG00000055491 | Pprc1 | GSM1480302 | 150.56237 |
ENSMUSG00000053080 | 2700081O15Rik | GSM1480291 | 65.94401 |
ENSMUSG00000053080 | 2700081O15Rik | GSM1480292 | 78.91686 |
ENSMUSG00000053080 | 2700081O15Rik | GSM1480293 | 44.77617 |
ENSMUSG00000053080 | 2700081O15Rik | GSM1480294 | 35.69702 |
ENSMUSG00000053080 | 2700081O15Rik | GSM1480295 | 8.66081 |
ENSMUSG00000053080 | 2700081O15Rik | GSM1480296 | 9.98706 |
ENSMUSG00000053080 | 2700081O15Rik | GSM1480297 | 62.62120 |
ENSMUSG00000053080 | 2700081O15Rik | GSM1480298 | 61.82975 |
ENSMUSG00000053080 | 2700081O15Rik | GSM1480299 | 49.98944 |
ENSMUSG00000053080 | 2700081O15Rik | GSM1480300 | 42.01557 |
ENSMUSG00000053080 | 2700081O15Rik | GSM1480301 | 37.24541 |
ENSMUSG00000053080 | 2700081O15Rik | GSM1480302 | 40.08694 |
ENSMUSG00000039715 | Wdr34 | GSM1480291 | 18.20842 |
ENSMUSG00000039715 | Wdr34 | GSM1480292 | 18.95181 |
ENSMUSG00000039715 | Wdr34 | GSM1480293 | 11.60091 |
ENSMUSG00000039715 | Wdr34 | GSM1480294 | 8.41671 |
ENSMUSG00000039715 | Wdr34 | GSM1480295 | 3.72842 |
ENSMUSG00000039715 | Wdr34 | GSM1480296 | 3.69171 |
ENSMUSG00000039715 | Wdr34 | GSM1480297 | 17.98312 |
ENSMUSG00000039715 | Wdr34 | GSM1480298 | 13.69270 |
ENSMUSG00000039715 | Wdr34 | GSM1480299 | 9.00119 |
ENSMUSG00000039715 | Wdr34 | GSM1480300 | 9.69908 |
ENSMUSG00000039715 | Wdr34 | GSM1480301 | 10.38997 |
ENSMUSG00000039715 | Wdr34 | GSM1480302 | 11.10900 |
ENSMUSG00000033475 | Tomm6 | GSM1480291 | 53.46207 |
ENSMUSG00000033475 | Tomm6 | GSM1480292 | 49.03939 |
ENSMUSG00000033475 | Tomm6 | GSM1480293 | 50.82700 |
ENSMUSG00000033475 | Tomm6 | GSM1480294 | 49.95045 |
ENSMUSG00000033475 | Tomm6 | GSM1480295 | 47.49852 |
ENSMUSG00000033475 | Tomm6 | GSM1480296 | 46.04928 |
ENSMUSG00000033475 | Tomm6 | GSM1480297 | 51.79137 |
ENSMUSG00000033475 | Tomm6 | GSM1480298 | 49.49782 |
ENSMUSG00000033475 | Tomm6 | GSM1480299 | 52.84819 |
ENSMUSG00000033475 | Tomm6 | GSM1480300 | 47.00956 |
ENSMUSG00000033475 | Tomm6 | GSM1480301 | 37.11334 |
ENSMUSG00000033475 | Tomm6 | GSM1480302 | 35.97424 |
ENSMUSG00000026283 | Ing5 | GSM1480291 | 42.45649 |
ENSMUSG00000026283 | Ing5 | GSM1480292 | 45.46754 |
ENSMUSG00000026283 | Ing5 | GSM1480293 | 22.03338 |
ENSMUSG00000026283 | Ing5 | GSM1480294 | 22.45867 |
ENSMUSG00000026283 | Ing5 | GSM1480295 | 8.73849 |
ENSMUSG00000026283 | Ing5 | GSM1480296 | 11.11400 |
ENSMUSG00000026283 | Ing5 | GSM1480297 | 50.07299 |
ENSMUSG00000026283 | Ing5 | GSM1480298 | 50.94364 |
ENSMUSG00000026283 | Ing5 | GSM1480299 | 45.04458 |
ENSMUSG00000026283 | Ing5 | GSM1480300 | 36.89776 |
ENSMUSG00000026283 | Ing5 | GSM1480301 | 44.28946 |
ENSMUSG00000026283 | Ing5 | GSM1480302 | 46.70506 |
ENSMUSG00000037331 | Larp1 | GSM1480291 | 260.82331 |
ENSMUSG00000037331 | Larp1 | GSM1480292 | 246.03739 |
ENSMUSG00000037331 | Larp1 | GSM1480293 | 230.05686 |
ENSMUSG00000037331 | Larp1 | GSM1480294 | 220.23030 |
ENSMUSG00000037331 | Larp1 | GSM1480295 | 128.98005 |
ENSMUSG00000037331 | Larp1 | GSM1480296 | 120.11673 |
ENSMUSG00000037331 | Larp1 | GSM1480297 | 147.58143 |
ENSMUSG00000037331 | Larp1 | GSM1480298 | 128.20957 |
ENSMUSG00000037331 | Larp1 | GSM1480299 | 168.08661 |
ENSMUSG00000037331 | Larp1 | GSM1480300 | 133.35197 |
ENSMUSG00000037331 | Larp1 | GSM1480301 | 100.02551 |
ENSMUSG00000037331 | Larp1 | GSM1480302 | 98.32645 |
ENSMUSG00000074489 | Bglap3 | GSM1480291 | 185.84220 |
ENSMUSG00000074489 | Bglap3 | GSM1480292 | 188.25747 |
ENSMUSG00000074489 | Bglap3 | GSM1480293 | 148.76704 |
ENSMUSG00000074489 | Bglap3 | GSM1480294 | 163.80871 |
ENSMUSG00000074489 | Bglap3 | GSM1480295 | 51.34346 |
ENSMUSG00000074489 | Bglap3 | GSM1480296 | 48.84721 |
ENSMUSG00000074489 | Bglap3 | GSM1480297 | 52.83040 |
ENSMUSG00000074489 | Bglap3 | GSM1480298 | 84.66509 |
ENSMUSG00000074489 | Bglap3 | GSM1480299 | 323.54064 |
ENSMUSG00000074489 | Bglap3 | GSM1480300 | 485.86178 |
ENSMUSG00000074489 | Bglap3 | GSM1480301 | 197.27742 |
ENSMUSG00000074489 | Bglap3 | GSM1480302 | 163.89317 |
ENSMUSG00000038246 | Fam50b | GSM1480291 | 0.04474 |
ENSMUSG00000038246 | Fam50b | GSM1480292 | 0.00000 |
ENSMUSG00000038246 | Fam50b | GSM1480293 | 0.00000 |
ENSMUSG00000038246 | Fam50b | GSM1480294 | 0.00000 |
ENSMUSG00000038246 | Fam50b | GSM1480295 | 0.03884 |
ENSMUSG00000038246 | Fam50b | GSM1480296 | 0.03886 |
ENSMUSG00000038246 | Fam50b | GSM1480297 | 0.07992 |
ENSMUSG00000038246 | Fam50b | GSM1480298 | 0.00000 |
ENSMUSG00000038246 | Fam50b | GSM1480299 | 0.00000 |
ENSMUSG00000038246 | Fam50b | GSM1480300 | 0.00000 |
ENSMUSG00000038246 | Fam50b | GSM1480301 | 0.00000 |
ENSMUSG00000038246 | Fam50b | GSM1480302 | 0.09454 |
ENSMUSG00000066189 | Cacng3 | GSM1480291 | 0.00000 |
ENSMUSG00000066189 | Cacng3 | GSM1480292 | 0.00000 |
ENSMUSG00000066189 | Cacng3 | GSM1480293 | 0.00000 |
ENSMUSG00000066189 | Cacng3 | GSM1480294 | 0.04230 |
ENSMUSG00000066189 | Cacng3 | GSM1480295 | 0.00000 |
ENSMUSG00000066189 | Cacng3 | GSM1480296 | 0.00000 |
ENSMUSG00000066189 | Cacng3 | GSM1480297 | 0.00000 |
ENSMUSG00000066189 | Cacng3 | GSM1480298 | 0.00000 |
ENSMUSG00000066189 | Cacng3 | GSM1480299 | 0.03863 |
ENSMUSG00000066189 | Cacng3 | GSM1480300 | 0.00000 |
ENSMUSG00000066189 | Cacng3 | GSM1480301 | 0.00000 |
ENSMUSG00000066189 | Cacng3 | GSM1480302 | 0.00000 |
ENSMUSG00000005611 | Mrvi1 | GSM1480291 | 0.67107 |
ENSMUSG00000005611 | Mrvi1 | GSM1480292 | 0.92448 |
ENSMUSG00000005611 | Mrvi1 | GSM1480293 | 0.29211 |
ENSMUSG00000005611 | Mrvi1 | GSM1480294 | 0.21148 |
ENSMUSG00000005611 | Mrvi1 | GSM1480295 | 0.38838 |
ENSMUSG00000005611 | Mrvi1 | GSM1480296 | 0.03886 |
ENSMUSG00000005611 | Mrvi1 | GSM1480297 | 102.26398 |
ENSMUSG00000005611 | Mrvi1 | GSM1480298 | 81.05056 |
ENSMUSG00000005611 | Mrvi1 | GSM1480299 | 115.31568 |
ENSMUSG00000005611 | Mrvi1 | GSM1480300 | 138.38724 |
ENSMUSG00000005611 | Mrvi1 | GSM1480301 | 177.99434 |
ENSMUSG00000005611 | Mrvi1 | GSM1480302 | 185.16572 |
ENSMUSG00000064299 | 4921528I07Rik | GSM1480291 | 0.00000 |
ENSMUSG00000064299 | 4921528I07Rik | GSM1480292 | 0.00000 |
ENSMUSG00000064299 | 4921528I07Rik | GSM1480293 | 0.00000 |
ENSMUSG00000064299 | 4921528I07Rik | GSM1480294 | 0.00000 |
ENSMUSG00000064299 | 4921528I07Rik | GSM1480295 | 0.00000 |
ENSMUSG00000064299 | 4921528I07Rik | GSM1480296 | 0.00000 |
ENSMUSG00000064299 | 4921528I07Rik | GSM1480297 | 0.03996 |
ENSMUSG00000064299 | 4921528I07Rik | GSM1480298 | 0.00000 |
ENSMUSG00000064299 | 4921528I07Rik | GSM1480299 | 0.00000 |
ENSMUSG00000064299 | 4921528I07Rik | GSM1480300 | 0.00000 |
ENSMUSG00000064299 | 4921528I07Rik | GSM1480301 | 0.00000 |
ENSMUSG00000064299 | 4921528I07Rik | GSM1480302 | 0.00000 |
ENSMUSG00000028174 | Rpe65 | GSM1480291 | 0.00000 |
ENSMUSG00000028174 | Rpe65 | GSM1480292 | 0.00000 |
ENSMUSG00000028174 | Rpe65 | GSM1480293 | 0.00000 |
ENSMUSG00000028174 | Rpe65 | GSM1480294 | 0.00000 |
ENSMUSG00000028174 | Rpe65 | GSM1480295 | 0.00000 |
ENSMUSG00000028174 | Rpe65 | GSM1480296 | 0.00000 |
ENSMUSG00000028174 | Rpe65 | GSM1480297 | 0.00000 |
ENSMUSG00000028174 | Rpe65 | GSM1480298 | 0.00000 |
ENSMUSG00000028174 | Rpe65 | GSM1480299 | 0.00000 |
ENSMUSG00000028174 | Rpe65 | GSM1480300 | 0.00000 |
ENSMUSG00000028174 | Rpe65 | GSM1480301 | 0.00000 |
ENSMUSG00000028174 | Rpe65 | GSM1480302 | 0.00000 |
ENSMUSG00000024902 | Mrpl11 | GSM1480291 | 27.60343 |
ENSMUSG00000024902 | Mrpl11 | GSM1480292 | 23.91038 |
ENSMUSG00000024902 | Mrpl11 | GSM1480293 | 21.86646 |
ENSMUSG00000024902 | Mrpl11 | GSM1480294 | 18.56753 |
ENSMUSG00000024902 | Mrpl11 | GSM1480295 | 11.84550 |
ENSMUSG00000024902 | Mrpl11 | GSM1480296 | 11.30830 |
ENSMUSG00000024902 | Mrpl11 | GSM1480297 | 14.02683 |
ENSMUSG00000024902 | Mrpl11 | GSM1480298 | 11.56650 |
ENSMUSG00000024902 | Mrpl11 | GSM1480299 | 12.13036 |
ENSMUSG00000024902 | Mrpl11 | GSM1480300 | 12.21671 |
ENSMUSG00000024902 | Mrpl11 | GSM1480301 | 10.21387 |
ENSMUSG00000024902 | Mrpl11 | GSM1480302 | 10.06901 |
Alternatively, we could achieve the same outcome by specifying a column range using the following command:
<- counts |>
seqdata pivot_longer(cols = GSM1480291:GSM1480302,
names_to = "Sample",
values_to = "Count")
We can also specify the columns we don’t want to reformat, and pivot_longer()
will then reformat all the columns except those. To achieve this, we place a minus sign (“-”) in front of the column names that we wish to exclude. This is a commonly used approach with pivot_longer()
, as it can be more convenient to exclude columns we don’t need rather than explicitly include the ones we want.
<- counts |>
seqdata pivot_longer(cols = -c(X, gene_symbol),
names_to = "Sample",
values_to = "Count")
Converting data from long to wide format
First, read the annotation file called GSE60450_annotation.csv (the path to the file should be data/GSE60450_annotation.csv).
<- read_csv("data/GSE60450_annotation.csv")
annot head(annot)
Output
ENSEMBL | Type | Annotation |
---|---|---|
ENSMUSG00000031805 | SYMBOL | Jak3 |
ENSMUSG00000031805 | GENENAME | Janus kinase 3 |
ENSMUSG00000033909 | SYMBOL | Usp36 |
ENSMUSG00000033909 | GENENAME | ubiquitin specific peptidase 36 |
ENSMUSG00000065524 | SYMBOL | Mir135a-2 |
ENSMUSG00000065524 | GENENAME | microRNA 135a-2 |
To transform this table so that it conforms to the tidy principles, we use the pivot_wider()
function.
This operation will convert multiple rows with type and annotation into columns containing the Symbol and Gene_name, as illustrated in the image below.
The pivot_wider()
function takes two arguments:
- names_from = : a name or a vector of names of column(s) containing the labels that will be transformed into the new column names.
- values_from = : a name or a vector of names of column(s) containing the values that will fill the new columns.
In our scenario, to reshape the annot data frame, we will use the column names Type and Annotation:
<- annot |>
annot_tidy pivot_wider(names_from = Type,
values_from = Annotation)
The above operation changes the ‘shape’ of the dataframe from a longer format (more rows) to a wider format (more columns). While the original table consists of 40 rows, using pivot_wider()
results in only 20 rows. This reduction is due to the de-duplication of rows during the creation of new columns.
annot_tidy
Output
ENSEMBL | SYMBOL | GENENAME |
---|---|---|
ENSMUSG00000031805 | Jak3 | Janus kinase 3 |
ENSMUSG00000033909 | Usp36 | ubiquitin specific peptidase 36 |
ENSMUSG00000065524 | Mir135a-2 | microRNA 135a-2 |
ENSMUSG00000030374 | Strn4 | striatin, calmodulin binding protein 4 |
ENSMUSG00000098547 | Mir6921 | microRNA 6921 |
ENSMUSG00000035629 | Rubcn | RUN domain and cysteine-rich domain containing, Beclin 1-interacting protein |
ENSMUSG00000094053 | Scgb2b7 | secretoglobin, family 2B, member 7 |
ENSMUSG00000055491 | Pprc1 | peroxisome proliferative activated receptor, gamma, coactivator-related 1 |
ENSMUSG00000053080 | Zfta | zinc finger translocation associated |
ENSMUSG00000039715 | Dync2i2 | dynein 2 intermediate chain 2 |
ENSMUSG00000033475 | Tomm6 | translocase of outer mitochondrial membrane 6 |
ENSMUSG00000026283 | Ing5 | inhibitor of growth family, member 5 |
ENSMUSG00000037331 | Larp1 | La ribonucleoprotein 1, translational regulator |
ENSMUSG00000074489 | Bglap3 | bone gamma-carboxyglutamate protein 3 |
ENSMUSG00000038246 | Fam50b | family with sequence similarity 50, member B |
ENSMUSG00000066189 | Cacng3 | calcium channel, voltage-dependent, gamma subunit 3 |
ENSMUSG00000005611 | Irag1 | inositol 1,4,5-triphosphate receptor associated 1 |
ENSMUSG00000064299 | 4921528I07Rik | RIKEN cDNA 4921528I07 gene |
ENSMUSG00000028174 | Rpe65 | retinal pigment epithelium 65 |
ENSMUSG00000024902 | Mrpl11 | mitochondrial ribosomal protein L11 |
It’s important to note that since we only have two distinct labels in the Type column, we are essentially replacing the existing two columns with just two new columns. Consequently, the shape of the output doesn’t technically become wider than the input data frame. However, when there are more than two unique labels in the names_from column, the output will indeed become wider compared to the input.
Separating Columns
First, read the metadata file called GSE60450_metadata.csv (the path to the file should be data/GSE60450_metadata.csv).
<- read_csv("data/GSE60450_metadata.csv")
metadata head(metadata)
Output
gene_id | characteristics |
---|---|
GSM1480291 | mammary gland;luminal cells;virgin |
GSM1480292 | mammary gland;luminal cells;virgin |
GSM1480293 | mammary gland;luminal cells;18.5 day pregnancy |
GSM1480294 | mammary gland;luminal cells;18.5 day pregnancy |
GSM1480295 | mammary gland;luminal cells;2 day lactation |
GSM1480296 | mammary gland;luminal cells;2 day lactation |
To transform this table so that it conforms to the tidy principles, we use the separate_wider_position()
/separate_wider_delim()
function. This operation will separate characteristic column into 3 separate columns containing the tissue_type, immunophenotype and development_stage, as illustrated in the image below.
The separate_wider_delim()
function takes three arguments:
- cols = : a name or a vector of names of the column(s) that requires separation into multiple columns.
- delim = : delimeter (or separator) between values. This is same as the
delim =
inread_delim()
. - names = : a vector containing column names for the the new columns.
To separate characteristic column in the metadata data frame into three separate columns based on the delimeter ; (semi colon), we can use the separate_wider_delim()
function:
<- metadata |>
metadata_lform separate_wider_delim(cols = characteristics,
delim =";",
names = c("tissue_type", "immunophenotype", "development_stage"))
metadata_lform
Output
gene_id | tissue_type | immunophenotype | development_stage |
---|---|---|---|
GSM1480291 | mammary gland | luminal cells | virgin |
GSM1480292 | mammary gland | luminal cells | virgin |
GSM1480293 | mammary gland | luminal cells | 18.5 day pregnancy |
GSM1480294 | mammary gland | luminal cells | 18.5 day pregnancy |
GSM1480295 | mammary gland | luminal cells | 2 day lactation |
GSM1480296 | mammary gland | luminal cells | 2 day lactation |
GSM1480297 | mammary gland | basal cells | virgin |
GSM1480298 | mammary gland | basal cells | virgin |
GSM1480299 | mammary gland | basal cells | 18.5 day pregnancy |
GSM1480300 | mammary gland | basal cells | 18.5 day pregnancy |
GSM1480301 | mammary gland | basal cells | 2 day lactation |
GSM1480302 | mammary gland | basal cells | 2 day lactation |
The separate_wider_position()
function splits at fixed widths and takes two arguments:
- cols = : a name or a vector of names of the column(s) that requires separation into multiple columns.
- widths = : a named vector containing numbers where the names become the new column names and values specify the column widths.
For instance, we can divide the gene_id column into three separate columns to evaluate the functionality of this operation (this is provided purely as an example):
|>
metadata_lform separate_wider_position(cols = gene_id,
widths = c(code = 3, prefix = 4, id = 3))
Output
code | prefix | id | tissue_type | immunophenotype | development_stage |
---|---|---|---|---|---|
GSM | 1480 | 291 | mammary gland | luminal cells | virgin |
GSM | 1480 | 292 | mammary gland | luminal cells | virgin |
GSM | 1480 | 293 | mammary gland | luminal cells | 18.5 day pregnancy |
GSM | 1480 | 294 | mammary gland | luminal cells | 18.5 day pregnancy |
GSM | 1480 | 295 | mammary gland | luminal cells | 2 day lactation |
GSM | 1480 | 296 | mammary gland | luminal cells | 2 day lactation |
GSM | 1480 | 297 | mammary gland | basal cells | virgin |
GSM | 1480 | 298 | mammary gland | basal cells | virgin |
GSM | 1480 | 299 | mammary gland | basal cells | 18.5 day pregnancy |
GSM | 1480 | 300 | mammary gland | basal cells | 18.5 day pregnancy |
GSM | 1480 | 301 | mammary gland | basal cells | 2 day lactation |
GSM | 1480 | 302 | mammary gland | basal cells | 2 day lactation |
Uniting Columns
The unite()
function is the complement of separate()
. Therefore, let’s revert what we did in the previous section to combine multiple columns to a single column as illustrated in the image below.
The unite()
function takes three arguments:
- col = : name of the new column that will contain the united values.
- … = : a vector containing column names to unite.
- sep = : delimeter (or separator) this is same as the
delim =
inread_delim()
. If we don’t specify a separator to insert between the combined values, they will be separated by _ (underscores).
To separate characteristic column in the metadata data frame into three separate columns:
|>
metadata_lform unite(col = characteristics,
tissue_type, immunophenotype, development_stage,sep = ",")
Output
gene_id | characteristics |
---|---|
GSM1480291 | mammary gland,luminal cells,virgin |
GSM1480292 | mammary gland,luminal cells,virgin |
GSM1480293 | mammary gland,luminal cells,18.5 day pregnancy |
GSM1480294 | mammary gland,luminal cells,18.5 day pregnancy |
GSM1480295 | mammary gland,luminal cells,2 day lactation |
GSM1480296 | mammary gland,luminal cells,2 day lactation |
GSM1480297 | mammary gland,basal cells,virgin |
GSM1480298 | mammary gland,basal cells,virgin |
GSM1480299 | mammary gland,basal cells,18.5 day pregnancy |
GSM1480300 | mammary gland,basal cells,18.5 day pregnancy |
GSM1480301 | mammary gland,basal cells,2 day lactation |
GSM1480302 | mammary gland,basal cells,2 day lactation |
Missing Values
A value can be missing in one of two possible ways:
- Explicitly, meaning it is flagged with NA.
- Implicitly, implying that it is just not present in the data.
Let’s illustrate this idea with a very simple data frame:
<- data.frame(
covid_vac year = c(2020, 2020, 2021, 2021, 2021, 2023, 2023,
2023, 2024, 2024),
vaccine_type = c("Pfizer", "Moderna", "Pfizer", "Moderna", "Novavax",
"Pfizer", "Moderna",
"Novavax", "Moderna", NA),
count = c(0, 3, 63, 88, 51,
38, 19,
5, 9, 7)
) covid_vac
Output
year | vaccine_type | count |
---|---|---|
2020 | Pfizer | 0 |
2020 | Moderna | 3 |
2021 | Pfizer | 63 |
2021 | Moderna | 88 |
2021 | Novavax | 51 |
2023 | Pfizer | 38 |
2023 | Moderna | 19 |
2023 | Novavax | 5 |
2024 | Moderna | 9 |
2024 | NA | 7 |
In this dataset, we identify two occurrences of missing values:
- The vaccine_type in 2024 with a count of 7 is explicitly missing, denoted by the presence of NA in the cell where its value should be.
- The counts for the Novavax vaccine in 2020 and Pfizer, Novavax vaccines in 2024, are implicitly missing, as they do not appear in the dataset at all.
is.na()
To identify missing values we can use is.na()
function which returns a logical vector with TRUE in the element locations that contain missing values represented by NA.
is.na(covid_vac)
Output
year vaccine_type count
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE FALSE
[4,] FALSE FALSE FALSE
[5,] FALSE FALSE FALSE
[6,] FALSE FALSE FALSE
[7,] FALSE FALSE FALSE
[8,] FALSE FALSE FALSE
[9,] FALSE FALSE FALSE
[10,] FALSE TRUE FALSE
is.na(covid_vac$vaccine_type)
Output
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
To identify the location or the number of NAs we can use the which()
and sum()
functions:
which(is.na(covid_vac))
sum(is.na(covid_vac))
Output
[1] 20
[1] 1
na.omit()
To omit all rows containing missing values, we can use na.omit()
function in base R:
na.omit(covid_vac)
Output
year | vaccine_type | count |
---|---|---|
2020 | Pfizer | 0 |
2020 | Moderna | 3 |
2021 | Pfizer | 63 |
2021 | Moderna | 88 |
2021 | Novavax | 51 |
2023 | Pfizer | 38 |
2023 | Moderna | 19 |
2023 | Novavax | 5 |
2024 | Moderna | 9 |
complete()
We can use the complete()
function to make our dataset more complete or to make missing values explicit in tidy data:
|> complete(year, vaccine_type) covid_vac
Output
year | vaccine_type | count |
---|---|---|
2020 | Moderna | 3 |
2020 | Novavax | NA |
2020 | Pfizer | 0 |
2020 | NA | NA |
2021 | Moderna | 88 |
2021 | Novavax | 51 |
2021 | Pfizer | 63 |
2021 | NA | NA |
2023 | Moderna | 19 |
2023 | Novavax | 5 |
2023 | Pfizer | 38 |
2023 | NA | NA |
2024 | Moderna | 9 |
2024 | Novavax | NA |
2024 | Pfizer | NA |
2024 | NA | 7 |
This function add missing values for potential combinations of year and vaccine_type. One problem is that R assumes NA in status as one of the combinations. To fix this, we can specify the labels of status to be considered as follows:
|> complete(year, vaccine_type = c("Pfizer", "Moderna", "Novavax")) covid_vac
Output
year | vaccine_type | count |
---|---|---|
2020 | Moderna | 3 |
2020 | Novavax | NA |
2020 | Pfizer | 0 |
2021 | Moderna | 88 |
2021 | Novavax | 51 |
2021 | Pfizer | 63 |
2023 | Moderna | 19 |
2023 | Novavax | 5 |
2023 | Pfizer | 38 |
2024 | Moderna | 9 |
2024 | Novavax | NA |
2024 | Pfizer | NA |
2024 | NA | 7 |
We can use the fill argument to assign the fill value:
|> complete(year,
covid_vac vaccine_type = c("Pfizer", "Moderna", "Novavax"),
fill = list(count = 0))
Output
year | vaccine_type | count |
---|---|---|
2020 | Moderna | 3 |
2020 | Novavax | 0 |
2020 | Pfizer | 0 |
2021 | Moderna | 88 |
2021 | Novavax | 51 |
2021 | Pfizer | 63 |
2023 | Moderna | 19 |
2023 | Novavax | 5 |
2023 | Pfizer | 38 |
2024 | Moderna | 9 |
2024 | Novavax | 0 |
2024 | Pfizer | 0 |
2024 | NA | 7 |
We can use the full_seq()
function from tidyr
to fill out the data frame with all years from 2020 to 2024 and assign vaccination types and count values of 0 to those years and for which there was no observation.
|> complete(year = full_seq(year, period = 1),
covid_vac vaccine_type = c("Pfizer", "Moderna", "Novavax"),
fill = list(count = 0))
Output
year | vaccine_type | count |
---|---|---|
2020 | Moderna | 3 |
2020 | Novavax | 0 |
2020 | Pfizer | 0 |
2021 | Moderna | 88 |
2021 | Novavax | 51 |
2021 | Pfizer | 63 |
2022 | Moderna | 0 |
2022 | Novavax | 0 |
2022 | Pfizer | 0 |
2023 | Moderna | 19 |
2023 | Novavax | 5 |
2023 | Pfizer | 38 |
2024 | Moderna | 9 |
2024 | Novavax | 0 |
2024 | Pfizer | 0 |
2024 | NA | 7 |
fill()
The fill()
function is used to fill missing values in a data frame, particularly within columns.
Let’s first make missing values in the covid_vac dataset explicit and assign it to a data frame named covid_vac_comp.
<- covid_vac |>
covid_vac_comp complete(year = full_seq(year, period = 1),
vaccine_type = c("Pfizer", "Moderna", "Novavax"))
We can specify the direction to fill the missing values using the argument .direction
. Remember to specify the list of columns to fill.
|> fill(count, .direction = "down") covid_vac_comp
Output
year | vaccine_type | count |
---|---|---|
2020 | Moderna | 3 |
2020 | Novavax | 3 |
2020 | Pfizer | 0 |
2021 | Moderna | 88 |
2021 | Novavax | 51 |
2021 | Pfizer | 63 |
2022 | Moderna | 63 |
2022 | Novavax | 63 |
2022 | Pfizer | 63 |
2023 | Moderna | 19 |
2023 | Novavax | 5 |
2023 | Pfizer | 38 |
2024 | Moderna | 9 |
2024 | Novavax | 9 |
2024 | Pfizer | 9 |
2024 | NA | 7 |
Similarly, we can fill upwards as follows:
|> fill(count, .direction = "up") covid_vac_comp
Output
year | vaccine_type | count |
---|---|---|
2020 | Moderna | 3 |
2020 | Novavax | 0 |
2020 | Pfizer | 0 |
2021 | Moderna | 88 |
2021 | Novavax | 51 |
2021 | Pfizer | 63 |
2022 | Moderna | 19 |
2022 | Novavax | 19 |
2022 | Pfizer | 19 |
2023 | Moderna | 19 |
2023 | Novavax | 5 |
2023 | Pfizer | 38 |
2024 | Moderna | 9 |
2024 | Novavax | 7 |
2024 | Pfizer | 7 |
2024 | NA | 7 |
Once the data is structured and organized according to tidy principles, we can begin manipulating and transforming it. The next section illustrates how this can be accomplished using the dplyr
package from the tidyverse package suit.