Exploring the Metabric Dataset

This week’s session is structured as a tutorial, allowing you to actively engage with the tasks outlined below. We will spend time exploring these tasks during the session, followed by a collaborative discussion to address any challenges encountered. The concepts we’ve covered thus far will be revisited and applied extensively.

We will be utilizing the metabric dataset, which should now be somewhat familiar to everyone. Our workflow will involve downloading the original data from cBioPortal, reading it into our R session, and then proceeding to preprocess and transform the data to better suit our analytical needs. This results in a data frame named metabric.

The following table illustartes the column names and descriptions of the metabric data frame we will be using for subsequent analysis.

Description of column names in the metabric dataset
Column Name	Description
Patient_Id	#Identifier to uniquely specify a patient.
Lymph_Nodes_Examined_Positive	Number of lymphnodes positive
Npi	Nottingham prognostic index
Cellularity	Tumor Content
Chemotherapy	Chemotherapy.
Cohort	Cohort
Er_Ihc	ER status measured by IHC
Her2_Snp6	HER2 status measured by SNP6
Intclust	Integrative Cluster
Age_At_Diagnosis	Age at Diagnosis
Os_Months	Overall survival in months since initial diagonosis.
Os_Status	Overall patient survival status.
Claudin_Subtype	Pam50 + Claudin-low subtype.
Threegene	3-Gene classifier subtype
Vital_Status	The survival state of the person.
Radio_Therapy	Radio Therapy
Cancer_Type	Cancer Type
Cancer_Type_Detailed	Cancer Type Detailed
Er_Status	ER Status
Her2_Status	HER2 Status
Grade	Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness.
Oncotree_Code	Oncotree Code
Pr_Status	PR Status
Sample_Type	The type of sample (i.e., normal, primary, met, recurrence).
Tumor_Size	Tumor size in mm.
Tumor_Stage	Tumor stage.
Tmb_Nonsynonymous	TMB (nonsynonymous)
FOXA1	FOXA1 Expression data
MLPH	MLPH Expression data
ESR1	ESR1 Expression data
ERBB2	ERBB2 Expression data
TP53	TP53 Expression data
PIK3CA	PIK3CA Expression data
GATA3	GATA3 Expression data
PGR	PGR Expression data

Finally, we will conduct various statistical analyses and create visualizations to effectively interpret our data.

Download the Data

Follow this link to download the metabric dataset and copy the downloaded folder into the data folder in your working directory (i.e., IntroR folder).

Task 1: Patient and Sample Data

1. Import the clinical patient data file into a data frame named patient_info. If the import is successful, the expected output for the head() command would be as follows.

PATIENT_ID	LYMPH_NODES_EXAMINED_POSITIVE	NPI	CELLULARITY	CHEMOTHERAPY	COHORT	ER_IHC	HER2_SNP6	HORMONE_THERAPY	INFERRED_MENOPAUSAL_STATE	SEX	INTCLUST	AGE_AT_DIAGNOSIS	OS_MONTHS	OS_STATUS	CLAUDIN_SUBTYPE	THREEGENE	VITAL_STATUS	LATERALITY	RADIO_THERAPY	HISTOLOGICAL_SUBTYPE	BREAST_SURGERY	RFS_STATUS	RFS_MONTHS
MB-0000	10	6.044	NA	NO	1	Positve	NEUTRAL	YES	Post	Female	4ER+	75.65	140.50000	0:LIVING	claudin-low	ER-/HER2-	Living	Right	YES	Ductal/NST	MASTECTOMY	0:Not Recurred	138.65
MB-0002	0	4.020	High	NO	1	Positve	NEUTRAL	YES	Pre	Female	4ER+	43.19	84.63333	0:LIVING	LumA	ER+/HER2- High Prolif	Living	Right	YES	Ductal/NST	BREAST CONSERVING	0:Not Recurred	83.52
MB-0005	1	4.030	High	YES	1	Positve	NEUTRAL	YES	Pre	Female	3	48.87	163.70000	1:DECEASED	LumB	NA	Died of Disease	Right	NO	Ductal/NST	MASTECTOMY	1:Recurred	151.28
MB-0006	3	4.050	Moderate	YES	1	Positve	NEUTRAL	YES	Pre	Female	9	47.68	164.93333	0:LIVING	LumB	NA	Living	Right	YES	Mixed	MASTECTOMY	0:Not Recurred	162.76
MB-0008	8	6.080	High	YES	1	Positve	NEUTRAL	YES	Post	Female	9	76.97	41.36667	1:DECEASED	LumB	ER+/HER2- High Prolif	Died of Disease	Right	YES	Mixed	MASTECTOMY	1:Recurred	18.55
MB-0010	0	4.062	Moderate	NO	1	Positve	NEUTRAL	YES	Post	Female	7	78.77	7.80000	1:DECEASED	LumB	ER+/HER2- High Prolif	Died of Disease	Left	YES	Ductal/NST	MASTECTOMY	1:Recurred	2.89

2. Filter the patient_info dataframe to exclude data from a different study where patient ID starts with “MTS”, keeping only the data where patient ID has the form “MB-xxxx”.

The expected output for dimensions of the data frame:

[1] 1985   24

3. Remove all columns with indices 9, 10, 11, 19, 21, 22, 23, and 24 from the dataset, then organize the remaining data in ascending order based on patient IDs.

The expected column names of the data frame:

 [1] "PATIENT_ID"                    "LYMPH_NODES_EXAMINED_POSITIVE"
 [3] "NPI"                           "CELLULARITY"                  
 [5] "CHEMOTHERAPY"                  "COHORT"                       
 [7] "ER_IHC"                        "HER2_SNP6"                    
 [9] "INTCLUST"                      "AGE_AT_DIAGNOSIS"             
[11] "OS_MONTHS"                     "OS_STATUS"                    
[13] "CLAUDIN_SUBTYPE"               "THREEGENE"                    
[15] "VITAL_STATUS"                  "RADIO_THERAPY"

4. Convert the column names of the above dataframe to Snake_Case.

The expected column names and the dimensions of the patient_info dataframe:

 [1] "Patient_Id"                    "Lymph_Nodes_Examined_Positive"
 [3] "Npi"                           "Cellularity"                  
 [5] "Chemotherapy"                  "Cohort"                       
 [7] "Er_Ihc"                        "Her2_Snp6"                    
 [9] "Intclust"                      "Age_At_Diagnosis"             
[11] "Os_Months"                     "Os_Status"                    
[13] "Claudin_Subtype"               "Threegene"                    
[15] "Vital_Status"                  "Radio_Therapy"

[1] 1985   16

5. Import the clinical sample data file into a data frame named sample_info and convert the column names to Snake_Case.

The expected output for the head() command:

Patient_Id	Sample_Id	Cancer_Type	Cancer_Type_Detailed	Er_Status	Her2_Status	Grade	Oncotree_Code	Pr_Status	Sample_Type	Tumor_Size	Tumor_Stage	Tmb_Nonsynonymous
MB-0000	MB-0000	Breast Cancer	Breast Invasive Ductal Carcinoma	Positive	Negative	3	IDC	Negative	Primary	22	2	0.000000
MB-0002	MB-0002	Breast Cancer	Breast Invasive Ductal Carcinoma	Positive	Negative	3	IDC	Positive	Primary	10	1	2.615035
MB-0005	MB-0005	Breast Cancer	Breast Invasive Ductal Carcinoma	Positive	Negative	2	IDC	Positive	Primary	15	2	2.615035
MB-0006	MB-0006	Breast Cancer	Breast Mixed Ductal and Lobular Carcinoma	Positive	Negative	2	MDLC	Positive	Primary	25	2	1.307518
MB-0008	MB-0008	Breast Cancer	Breast Mixed Ductal and Lobular Carcinoma	Positive	Negative	3	MDLC	Positive	Primary	40	2	2.615035
MB-0010	MB-0010	Breast Cancer	Breast Invasive Ductal Carcinoma	Positive	Negative	3	IDC	Positive	Primary	31	4	5.230071

6. Join the two data frames to create a data frame named patient_sample_data and remove columns that contain identical values.

The expected output for the head() command:

Patient_Id	Lymph_Nodes_Examined_Positive	Npi	Cellularity	Chemotherapy	Cohort	Er_Ihc	Her2_Snp6	Intclust	Age_At_Diagnosis	Os_Months	Os_Status	Claudin_Subtype	Threegene	Vital_Status	Radio_Therapy	Cancer_Type	Cancer_Type_Detailed	Er_Status	Her2_Status	Grade	Oncotree_Code	Pr_Status	Sample_Type	Tumor_Size	Tumor_Stage	Tmb_Nonsynonymous
MB-0000	10	6.044	NA	NO	1	Positve	NEUTRAL	4ER+	75.65	140.50000	0:LIVING	claudin-low	ER-/HER2-	Living	YES	Breast Cancer	Breast Invasive Ductal Carcinoma	Positive	Negative	3	IDC	Negative	Primary	22	2	0.000000
MB-0002	0	4.020	High	NO	1	Positve	NEUTRAL	4ER+	43.19	84.63333	0:LIVING	LumA	ER+/HER2- High Prolif	Living	YES	Breast Cancer	Breast Invasive Ductal Carcinoma	Positive	Negative	3	IDC	Positive	Primary	10	1	2.615035
MB-0005	1	4.030	High	YES	1	Positve	NEUTRAL	3	48.87	163.70000	1:DECEASED	LumB	NA	Died of Disease	NO	Breast Cancer	Breast Invasive Ductal Carcinoma	Positive	Negative	2	IDC	Positive	Primary	15	2	2.615035
MB-0006	3	4.050	Moderate	YES	1	Positve	NEUTRAL	9	47.68	164.93333	0:LIVING	LumB	NA	Living	YES	Breast Cancer	Breast Mixed Ductal and Lobular Carcinoma	Positive	Negative	2	MDLC	Positive	Primary	25	2	1.307518
MB-0008	8	6.080	High	YES	1	Positve	NEUTRAL	9	76.97	41.36667	1:DECEASED	LumB	ER+/HER2- High Prolif	Died of Disease	YES	Breast Cancer	Breast Mixed Ductal and Lobular Carcinoma	Positive	Negative	3	MDLC	Positive	Primary	40	2	2.615035
MB-0010	0	4.062	Moderate	NO	1	Positve	NEUTRAL	7	78.77	7.80000	1:DECEASED	LumB	ER+/HER2- High Prolif	Died of Disease	YES	Breast Cancer	Breast Invasive Ductal Carcinoma	Positive	Negative	3	IDC	Positive	Primary	31	4	5.230071

Task 2: Expression data

1. Read the microarray data into data frame named mrna. Keep only the expression data for the specified genes: “ESR1”, “ERBB2”, “PGR”, “TP53”, “PIK3CA”, “GATA3”, “FOXA1”, “MLPH”.

2. Remove columns from the other study, specifically those with sample IDs in the form “MTS-XXXX” and the Entrez ID column.

The expected dimensions of the data frame:

[1]    8 1981

3. To facilitate the join of patient_sample_data and the above data frame in the next task, transform the above data frame to the following format.

The top 6 rows of the resulting data frame:

Patient_Id	FOXA1	MLPH	ESR1	ERBB2	TP53	PIK3CA	GATA3	PGR
MB-0362	11.87394	12.17512	9.082492	10.872596	6.270846	5.323543	10.746820	6.120169
MB-0346	10.82992	11.69353	6.017776	13.836037	6.814312	5.214869	9.497442	5.291619
MB-0386	11.41743	11.67884	10.193813	9.906015	6.280620	5.920523	9.909838	7.416926
MB-0574	11.26630	11.47865	12.412887	10.779680	6.918135	5.716273	10.077291	7.313249
MB-0185	11.55534	10.36040	9.046294	9.871677	6.439265	5.505726	10.213885	7.616044
MB-0503	11.04414	11.96047	11.167178	11.289801	6.165499	5.605856	10.709767	7.450691

4. Join the above data frame and the data frame patient_sample_data created in Task 2 to create clinical_and_expression_data dataframe.

The top 6 rows, column names and the dimensions of the final data frame:

Patient_Id	Lymph_Nodes_Examined_Positive	Npi	Cellularity	Chemotherapy	Cohort	Er_Ihc	Her2_Snp6	Intclust	Age_At_Diagnosis	Os_Months	Os_Status	Claudin_Subtype	Threegene	Vital_Status	Radio_Therapy	Cancer_Type	Cancer_Type_Detailed	Er_Status	Her2_Status	Grade	Oncotree_Code	Pr_Status	Sample_Type	Tumor_Size	Tumor_Stage	Tmb_Nonsynonymous	FOXA1	MLPH	ESR1	ERBB2	TP53	PIK3CA	GATA3	PGR
MB-0000	10	6.044	NA	NO	1	Positve	NEUTRAL	4ER+	75.65	140.50000	0:LIVING	claudin-low	ER-/HER2-	Living	YES	Breast Cancer	Breast Invasive Ductal Carcinoma	Positive	Negative	3	IDC	Negative	Primary	22	2	0.000000	7.953794	9.729728	8.929817	9.333972	6.338739	5.704157	6.932146	5.680501
MB-0002	0	4.020	High	NO	1	Positve	NEUTRAL	4ER+	43.19	84.63333	0:LIVING	LumA	ER+/HER2- High Prolif	Living	YES	Breast Cancer	Breast Invasive Ductal Carcinoma	Positive	Negative	3	IDC	Positive	Primary	10	1	2.615035	11.843989	12.536570	10.047059	9.729606	6.192507	5.757727	11.251197	7.505424
MB-0005	1	4.030	High	YES	1	Positve	NEUTRAL	3	48.87	163.70000	1:DECEASED	LumB	NA	Died of Disease	NO	Breast Cancer	Breast Invasive Ductal Carcinoma	Positive	Negative	2	IDC	Positive	Primary	15	2	2.615035	11.698169	10.306115	10.041281	9.725825	6.404516	6.751566	9.289758	7.376123
MB-0006	3	4.050	Moderate	YES	1	Positve	NEUTRAL	9	47.68	164.93333	0:LIVING	LumB	NA	Living	YES	Breast Cancer	Breast Mixed Ductal and Lobular Carcinoma	Positive	Negative	2	MDLC	Positive	Primary	25	2	1.307518	11.863379	10.472181	10.404685	10.334979	6.869241	7.219187	8.667723	6.815637
MB-0008	8	6.080	High	YES	1	Positve	NEUTRAL	9	76.97	41.36667	1:DECEASED	LumB	ER+/HER2- High Prolif	Died of Disease	YES	Breast Cancer	Breast Mixed Ductal and Lobular Carcinoma	Positive	Negative	3	MDLC	Positive	Primary	40	2	2.615035	11.625006	12.161961	11.276581	9.956267	6.337951	5.817818	9.719781	7.331223
MB-0010	0	4.062	Moderate	NO	1	Positve	NEUTRAL	7	78.77	7.80000	1:DECEASED	LumB	ER+/HER2- High Prolif	Died of Disease	YES	Breast Cancer	Breast Invasive Ductal Carcinoma	Positive	Negative	3	IDC	Positive	Primary	31	4	5.230071	12.142178	11.433164	11.239750	9.739996	5.419711	6.123056	9.787085	5.954311

 [1] "Patient_Id"                    "Lymph_Nodes_Examined_Positive"
 [3] "Npi"                           "Cellularity"                  
 [5] "Chemotherapy"                  "Cohort"                       
 [7] "Er_Ihc"                        "Her2_Snp6"                    
 [9] "Intclust"                      "Age_At_Diagnosis"             
[11] "Os_Months"                     "Os_Status"                    
[13] "Claudin_Subtype"               "Threegene"                    
[15] "Vital_Status"                  "Radio_Therapy"                
[17] "Cancer_Type"                   "Cancer_Type_Detailed"         
[19] "Er_Status"                     "Her2_Status"                  
[21] "Grade"                         "Oncotree_Code"                
[23] "Pr_Status"                     "Sample_Type"                  
[25] "Tumor_Size"                    "Tumor_Stage"                  
[27] "Tmb_Nonsynonymous"             "FOXA1"                        
[29] "MLPH"                          "ESR1"                         
[31] "ERBB2"                         "TP53"                         
[33] "PIK3CA"                        "GATA3"                        
[35] "PGR"

[1] 1985   35

5. Notice that one of the columns contains key value pairs in the form key:value. For example: 0:LIVING, 1:DECEASED. Update column to retain only the values.

6. Rename the data frame to metabric for all subsequent analysis.

Task 3: Basic Analysis

1. Find the total number of patients who were still alive at the time of the study and had survived for more than 10 years (120 months).

2. Display only the columns of interest in the decreasing order of overall survivival time.

3. How many samples in the METABRIC dataset have high cellularity but have no recorded classification with the 3-gene classifier?

4. Round all columns containing expression data to two decimal places and display only the relevant columns alongside the patient IDs.

Hint: see help page of mutate_at(). The round() function is useful for rounding numerical values to a specified number of decimal places.

Task 4: Nottingham Prognostic Index

1. Create a new column for corrected NPI using the following equation (try to do it yourself for practice).

\(\text{NPI} = (0.2 \times \text{Tumor size in centimetres}) + \text{Node status} + \text{Tumour grade}\)

\(\text{Node status}\):
- 0 Lymph_Nodes_Examined_Positive \(\Rightarrow\) Node status = 1
- 1-3 Lymph_Nodes_Examined_Positive \(\Rightarrow\) Node status = 2
- \(>3\) Lymph_Nodes_Examined_Positive \(\Rightarrow\) Node status = 3
\(\text{Tumor grade}\):
- Grade I \(\Rightarrow\) 1
- Grade II \(\Rightarrow\) 2
- Grade III \(\Rightarrow\) 3

2. Round both NPI computations to two decimal places and display the results in decreasing order of the newly calculated NPI column.

3. Visualize both NPI compuations against age at diagnosis in the same plot to facilitate comparison.

Task 5: Expression Z-Scores

Some genes are generally expressed at higher levels than others. This can make comparisons of changes between groups for a set of genes somewhat difficult, particularly if the expression for those genes are on very different scales. The expression values in our METABRIC are on a log2 scale which helps to reduce the range of values but another method for representing expression measurements is to standardize these to produce z-scores.

Standardization of a set of measurements involves subtracting the mean from each and dividing by the standard deviation. This will produce values with a mean of 0 and a standard deviation of 1.

1. Create a modified version of the metabric data frame containing a new column with the standardized expression values (z-scores) for the ERBB2 gene.

2. Verify the computation by calculating the mean and standard deviation of the new z-score variable.

3. Add another column to the modified metabric data frame containing a z-score for GATA3 and then create a scatter plot of the z-scores of GATA3 against ERBB2. Modify the plot to facet by the Claudin_Subtype classification.

4. Standardize the expression values for all genes in a single operation using an anonymous function, overwriting their original values, and round the resulting values to 3 significant figures.

Hint: refer mutate_at()

5. Verify the computation by calculating the mean and standard deviation for each column.

Hint: check summarise_at()

6. Create a plot comparing the distribution of standardized expression values for TP53 against a normal distribution.

Hint: use stat_function()

Task 6: PIK3CA Mutations across Integrative Clusters

Figure 5a from the METABRIC 2016 paper compares the percentage of samples that have non-silent mutations for various genes within each of the Integrative Clusters. In this exercise, we’ll reproduce a version of one of these bar charts for the PIK3CA gene.

1. Read the mutations that were detected in each patient tumour sample and exclude data from a different study where sample barcode starts with “MTS”, keeping only the data where barcode has the form “MB-xxxx”.

2. Use the following list to arrange and select the desdired columns for subsequent analysis.

Patient_Id = Tumor_Sample_Barcode, Chromosome, Start_Position, End_Position, Strand, Reference_Allele, contains("Tumor_Seq"), starts_with("Variant"), Codon = Protein_position, Gene = Hugo_Symbol, starts_with("HGVS")

3. Find the most frequently mutated codons in TP53.

These frequently mutated loci are known as hotspots.

4. What type of mutations are found at the most prevalent hotspot in TP53, codon 248? What changes are occurring at the DNA and protein level?

5. Create a data frame containing the number of non-silent PIK3CA mutations in each patient sample. It should have two columns, Patient_ID and PIK3CA_mutation_count.

6. Join the above table and the metabric data frame to include PIK3CA mutations

7. Create a new column named Has_PIK3CA_mutation containing logical values (i.e., TRUE or FALSE).

Hint: use is.na() function

8. Compute the proportion of patient samples within each Integrative Cluster with a PIK3CA mutation. Exclude those that have not been classified into one of the Integrative Cluster subtypes.

9. Plot the above percentages as a bar chart.

Hint: Look at the help page for geom_bar and geom_col to see which function to use.

10. Customize your plot in the following ways:

add a title
hide legend
change the axis labels so they don’t contain underscores
set the limits on the y axis so that the entire 0 - 100% range is displayed
set breaks on the x axis to be at 20% intervals
reduce the amount of space between the bars and the tick marks for each integrative cluster
apply the Viridis colour scheme for the bars (scale_fill_viridis_d)
choose the black and white theme by appending + theme_bw() to the plot
remove the vertical grid lines by appending + theme(panel.grid.major.x = element_blank())

Compare the plot with the equivalent plot in Figure 5a from the metabric mutation paper.

11. Create a similar plot for GATA3 and TP53 and compare those given in Figure 5a from the metabric mutation paper

12. Save each plot as a PDF using ggsave()

Task 7: Relate Copy Number and Expression Data

1. Read the copy number data into a data frame named copy_number_states. Remove columns from the other study, specifically those with sample IDs in the form “MTS-XXXX” and the Entrez ID column. Keep only the expression data for the specified genes: “ESR1”, “ERBB2”, “PGR”, “TP53”, “PIK3CA”, “GATA3”, “FOXA1”, “MLPH”.

2. Convert it into a tidy format with a row-per-sample-per-gene where the values are the copy number states.

3. Remove the Entrez_Gene_id column and convert the copy number state variable into a factor using the following names for each state.

-2 deletion
-1 loss
0 neutral
1 gain
2 amplification

4. Count the total number of occurrences of each copy number state.

5. Create a combined data set containing the expression value and copy number state for each patient and gene pairing. The data frame should contain the columns: Sample, Gene, Expression_Date and Copy_Number_State.

6. Create a series of box plots for each of the genes with a box-and-whiskers showing the range of expression values for each copy number state.

7. Customize this plot by changing the labels, scales, colours and theme as you like – be creative! Save the plot as a PDF using ggsave()