Exploring the Metabric Dataset

This week’s session is structured as a tutorial, allowing you to actively engage with the tasks outlined below. We will spend time exploring these tasks during the session, followed by a collaborative discussion to address any challenges encountered. The concepts we’ve covered thus far will be revisited and applied extensively.

We will be utilizing the metabric dataset, which should now be somewhat familiar to everyone. Our workflow will involve downloading the original data from cBioPortal, reading it into our R session, and then proceeding to preprocess and transform the data to better suit our analytical needs. This results in a data frame named metabric.

The following table illustartes the column names and descriptions of the metabric data frame we will be using for subsequent analysis.

Description of column names in the metabric dataset
Column Name Description
Patient_Id #Identifier to uniquely specify a patient.
Lymph_Nodes_Examined_Positive Number of lymphnodes positive
Npi Nottingham prognostic index
Cellularity Tumor Content
Chemotherapy Chemotherapy.
Cohort Cohort
Er_Ihc ER status measured by IHC
Her2_Snp6 HER2 status measured by SNP6
Intclust Integrative Cluster
Age_At_Diagnosis Age at Diagnosis
Os_Months Overall survival in months since initial diagonosis.
Os_Status Overall patient survival status.
Claudin_Subtype Pam50 + Claudin-low subtype.
Threegene 3-Gene classifier subtype
Vital_Status The survival state of the person.
Radio_Therapy Radio Therapy
Cancer_Type Cancer Type
Cancer_Type_Detailed Cancer Type Detailed
Er_Status ER Status
Her2_Status HER2 Status
Grade Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness.
Oncotree_Code Oncotree Code
Pr_Status PR Status
Sample_Type The type of sample (i.e., normal, primary, met, recurrence).
Tumor_Size Tumor size in mm.
Tumor_Stage Tumor stage.
Tmb_Nonsynonymous TMB (nonsynonymous)
FOXA1 FOXA1 Expression data
MLPH MLPH Expression data
ESR1 ESR1 Expression data
ERBB2 ERBB2 Expression data
TP53 TP53 Expression data
PIK3CA PIK3CA Expression data
GATA3 GATA3 Expression data
PGR PGR Expression data

Finally, we will conduct various statistical analyses and create visualizations to effectively interpret our data.

Download the Data

Follow this link to download the metabric dataset and copy the downloaded folder into the data folder in your working directory (i.e., IntroR folder).

Task 1: Patient and Sample Data

1. Import the clinical patient data file into a data frame named patient_info. If the import is successful, the expected output for the head() command would be as follows.

PATIENT_ID LYMPH_NODES_EXAMINED_POSITIVE NPI CELLULARITY CHEMOTHERAPY COHORT ER_IHC HER2_SNP6 HORMONE_THERAPY INFERRED_MENOPAUSAL_STATE SEX INTCLUST AGE_AT_DIAGNOSIS OS_MONTHS OS_STATUS CLAUDIN_SUBTYPE THREEGENE VITAL_STATUS LATERALITY RADIO_THERAPY HISTOLOGICAL_SUBTYPE BREAST_SURGERY RFS_STATUS RFS_MONTHS
MB-0000 10 6.044 NA NO 1 Positve NEUTRAL YES Post Female 4ER+ 75.65 140.50000 0:LIVING claudin-low ER-/HER2- Living Right YES Ductal/NST MASTECTOMY 0:Not Recurred 138.65
MB-0002 0 4.020 High NO 1 Positve NEUTRAL YES Pre Female 4ER+ 43.19 84.63333 0:LIVING LumA ER+/HER2- High Prolif Living Right YES Ductal/NST BREAST CONSERVING 0:Not Recurred 83.52
MB-0005 1 4.030 High YES 1 Positve NEUTRAL YES Pre Female 3 48.87 163.70000 1:DECEASED LumB NA Died of Disease Right NO Ductal/NST MASTECTOMY 1:Recurred 151.28
MB-0006 3 4.050 Moderate YES 1 Positve NEUTRAL YES Pre Female 9 47.68 164.93333 0:LIVING LumB NA Living Right YES Mixed MASTECTOMY 0:Not Recurred 162.76
MB-0008 8 6.080 High YES 1 Positve NEUTRAL YES Post Female 9 76.97 41.36667 1:DECEASED LumB ER+/HER2- High Prolif Died of Disease Right YES Mixed MASTECTOMY 1:Recurred 18.55
MB-0010 0 4.062 Moderate NO 1 Positve NEUTRAL YES Post Female 7 78.77 7.80000 1:DECEASED LumB ER+/HER2- High Prolif Died of Disease Left YES Ductal/NST MASTECTOMY 1:Recurred 2.89

2. Filter the patient_info dataframe to exclude data from a different study where patient ID starts with “MTS”, keeping only the data where patient ID has the form “MB-xxxx”.

The expected output for dimensions of the data frame:

[1] 1985   24

3. Remove all columns with indices 9, 10, 11, 19, 21, 22, 23, and 24 from the dataset, then organize the remaining data in ascending order based on patient IDs.

The expected column names of the data frame:

 [1] "PATIENT_ID"                    "LYMPH_NODES_EXAMINED_POSITIVE"
 [3] "NPI"                           "CELLULARITY"                  
 [5] "CHEMOTHERAPY"                  "COHORT"                       
 [7] "ER_IHC"                        "HER2_SNP6"                    
 [9] "INTCLUST"                      "AGE_AT_DIAGNOSIS"             
[11] "OS_MONTHS"                     "OS_STATUS"                    
[13] "CLAUDIN_SUBTYPE"               "THREEGENE"                    
[15] "VITAL_STATUS"                  "RADIO_THERAPY"                

4. Convert the column names of the above dataframe to Snake_Case.

The expected column names and the dimensions of the patient_info dataframe:

 [1] "Patient_Id"                    "Lymph_Nodes_Examined_Positive"
 [3] "Npi"                           "Cellularity"                  
 [5] "Chemotherapy"                  "Cohort"                       
 [7] "Er_Ihc"                        "Her2_Snp6"                    
 [9] "Intclust"                      "Age_At_Diagnosis"             
[11] "Os_Months"                     "Os_Status"                    
[13] "Claudin_Subtype"               "Threegene"                    
[15] "Vital_Status"                  "Radio_Therapy"                
[1] 1985   16

5. Import the clinical sample data file into a data frame named sample_info and convert the column names to Snake_Case.

The expected output for the head() command:

Patient_Id Sample_Id Cancer_Type Cancer_Type_Detailed Er_Status Her2_Status Grade Oncotree_Code Pr_Status Sample_Type Tumor_Size Tumor_Stage Tmb_Nonsynonymous
MB-0000 MB-0000 Breast Cancer Breast Invasive Ductal Carcinoma Positive Negative 3 IDC Negative Primary 22 2 0.000000
MB-0002 MB-0002 Breast Cancer Breast Invasive Ductal Carcinoma Positive Negative 3 IDC Positive Primary 10 1 2.615035
MB-0005 MB-0005 Breast Cancer Breast Invasive Ductal Carcinoma Positive Negative 2 IDC Positive Primary 15 2 2.615035
MB-0006 MB-0006 Breast Cancer Breast Mixed Ductal and Lobular Carcinoma Positive Negative 2 MDLC Positive Primary 25 2 1.307518
MB-0008 MB-0008 Breast Cancer Breast Mixed Ductal and Lobular Carcinoma Positive Negative 3 MDLC Positive Primary 40 2 2.615035
MB-0010 MB-0010 Breast Cancer Breast Invasive Ductal Carcinoma Positive Negative 3 IDC Positive Primary 31 4 5.230071

6. Join the two data frames to create a data frame named patient_sample_data and remove columns that contain identical values.

The expected output for the head() command:

Patient_Id Lymph_Nodes_Examined_Positive Npi Cellularity Chemotherapy Cohort Er_Ihc Her2_Snp6 Intclust Age_At_Diagnosis Os_Months Os_Status Claudin_Subtype Threegene Vital_Status Radio_Therapy Cancer_Type Cancer_Type_Detailed Er_Status Her2_Status Grade Oncotree_Code Pr_Status Sample_Type Tumor_Size Tumor_Stage Tmb_Nonsynonymous
MB-0000 10 6.044 NA NO 1 Positve NEUTRAL 4ER+ 75.65 140.50000 0:LIVING claudin-low ER-/HER2- Living YES Breast Cancer Breast Invasive Ductal Carcinoma Positive Negative 3 IDC Negative Primary 22 2 0.000000
MB-0002 0 4.020 High NO 1 Positve NEUTRAL 4ER+ 43.19 84.63333 0:LIVING LumA ER+/HER2- High Prolif Living YES Breast Cancer Breast Invasive Ductal Carcinoma Positive Negative 3 IDC Positive Primary 10 1 2.615035
MB-0005 1 4.030 High YES 1 Positve NEUTRAL 3 48.87 163.70000 1:DECEASED LumB NA Died of Disease NO Breast Cancer Breast Invasive Ductal Carcinoma Positive Negative 2 IDC Positive Primary 15 2 2.615035
MB-0006 3 4.050 Moderate YES 1 Positve NEUTRAL 9 47.68 164.93333 0:LIVING LumB NA Living YES Breast Cancer Breast Mixed Ductal and Lobular Carcinoma Positive Negative 2 MDLC Positive Primary 25 2 1.307518
MB-0008 8 6.080 High YES 1 Positve NEUTRAL 9 76.97 41.36667 1:DECEASED LumB ER+/HER2- High Prolif Died of Disease YES Breast Cancer Breast Mixed Ductal and Lobular Carcinoma Positive Negative 3 MDLC Positive Primary 40 2 2.615035
MB-0010 0 4.062 Moderate NO 1 Positve NEUTRAL 7 78.77 7.80000 1:DECEASED LumB ER+/HER2- High Prolif Died of Disease YES Breast Cancer Breast Invasive Ductal Carcinoma Positive Negative 3 IDC Positive Primary 31 4 5.230071

Task 2: Expression data

1. Read the microarray data into data frame named mrna. Keep only the expression data for the specified genes: “ESR1”, “ERBB2”, “PGR”, “TP53”, “PIK3CA”, “GATA3”, “FOXA1”, “MLPH”.

2. Remove columns from the other study, specifically those with sample IDs in the form “MTS-XXXX” and the Entrez ID column.

The expected dimensions of the data frame:

[1]    8 1981

3. To facilitate the join of patient_sample_data and the above data frame in the next task, transform the above data frame to the following format.

The top 6 rows of the resulting data frame:

Patient_Id FOXA1 MLPH ESR1 ERBB2 TP53 PIK3CA GATA3 PGR
MB-0362 11.87394 12.17512 9.082492 10.872596 6.270846 5.323543 10.746820 6.120169
MB-0346 10.82992 11.69353 6.017776 13.836037 6.814312 5.214869 9.497442 5.291619
MB-0386 11.41743 11.67884 10.193813 9.906015 6.280620 5.920523 9.909838 7.416926
MB-0574 11.26630 11.47865 12.412887 10.779680 6.918135 5.716273 10.077291 7.313249
MB-0185 11.55534 10.36040 9.046294 9.871677 6.439265 5.505726 10.213885 7.616044
MB-0503 11.04414 11.96047 11.167178 11.289801 6.165499 5.605856 10.709767 7.450691

4. Join the above data frame and the data frame patient_sample_data created in Task 2 to create clinical_and_expression_data dataframe.

The top 6 rows, column names and the dimensions of the final data frame:

Patient_Id Lymph_Nodes_Examined_Positive Npi Cellularity Chemotherapy Cohort Er_Ihc Her2_Snp6 Intclust Age_At_Diagnosis Os_Months Os_Status Claudin_Subtype Threegene Vital_Status Radio_Therapy Cancer_Type Cancer_Type_Detailed Er_Status Her2_Status Grade Oncotree_Code Pr_Status Sample_Type Tumor_Size Tumor_Stage Tmb_Nonsynonymous FOXA1 MLPH ESR1 ERBB2 TP53 PIK3CA GATA3 PGR
MB-0000 10 6.044 NA NO 1 Positve NEUTRAL 4ER+ 75.65 140.50000 0:LIVING claudin-low ER-/HER2- Living YES Breast Cancer Breast Invasive Ductal Carcinoma Positive Negative 3 IDC Negative Primary 22 2 0.000000 7.953794 9.729728 8.929817 9.333972 6.338739 5.704157 6.932146 5.680501
MB-0002 0 4.020 High NO 1 Positve NEUTRAL 4ER+ 43.19 84.63333 0:LIVING LumA ER+/HER2- High Prolif Living YES Breast Cancer Breast Invasive Ductal Carcinoma Positive Negative 3 IDC Positive Primary 10 1 2.615035 11.843989 12.536570 10.047059 9.729606 6.192507 5.757727 11.251197 7.505424
MB-0005 1 4.030 High YES 1 Positve NEUTRAL 3 48.87 163.70000 1:DECEASED LumB NA Died of Disease NO Breast Cancer Breast Invasive Ductal Carcinoma Positive Negative 2 IDC Positive Primary 15 2 2.615035 11.698169 10.306115 10.041281 9.725825 6.404516 6.751566 9.289758 7.376123
MB-0006 3 4.050 Moderate YES 1 Positve NEUTRAL 9 47.68 164.93333 0:LIVING LumB NA Living YES Breast Cancer Breast Mixed Ductal and Lobular Carcinoma Positive Negative 2 MDLC Positive Primary 25 2 1.307518 11.863379 10.472181 10.404685 10.334979 6.869241 7.219187 8.667723 6.815637
MB-0008 8 6.080 High YES 1 Positve NEUTRAL 9 76.97 41.36667 1:DECEASED LumB ER+/HER2- High Prolif Died of Disease YES Breast Cancer Breast Mixed Ductal and Lobular Carcinoma Positive Negative 3 MDLC Positive Primary 40 2 2.615035 11.625006 12.161961 11.276581 9.956267 6.337951 5.817818 9.719781 7.331223
MB-0010 0 4.062 Moderate NO 1 Positve NEUTRAL 7 78.77 7.80000 1:DECEASED LumB ER+/HER2- High Prolif Died of Disease YES Breast Cancer Breast Invasive Ductal Carcinoma Positive Negative 3 IDC Positive Primary 31 4 5.230071 12.142178 11.433164 11.239750 9.739996 5.419711 6.123056 9.787085 5.954311
 [1] "Patient_Id"                    "Lymph_Nodes_Examined_Positive"
 [3] "Npi"                           "Cellularity"                  
 [5] "Chemotherapy"                  "Cohort"                       
 [7] "Er_Ihc"                        "Her2_Snp6"                    
 [9] "Intclust"                      "Age_At_Diagnosis"             
[11] "Os_Months"                     "Os_Status"                    
[13] "Claudin_Subtype"               "Threegene"                    
[15] "Vital_Status"                  "Radio_Therapy"                
[17] "Cancer_Type"                   "Cancer_Type_Detailed"         
[19] "Er_Status"                     "Her2_Status"                  
[21] "Grade"                         "Oncotree_Code"                
[23] "Pr_Status"                     "Sample_Type"                  
[25] "Tumor_Size"                    "Tumor_Stage"                  
[27] "Tmb_Nonsynonymous"             "FOXA1"                        
[29] "MLPH"                          "ESR1"                         
[31] "ERBB2"                         "TP53"                         
[33] "PIK3CA"                        "GATA3"                        
[35] "PGR"                          
[1] 1985   35

5. Notice that one of the columns contains key value pairs in the form key:value. For example: 0:LIVING, 1:DECEASED. Update column to retain only the values.

6. Rename the data frame to metabric for all subsequent analysis.

Task 3: Basic Analysis

1. Find the total number of patients who were still alive at the time of the study and had survived for more than 10 years (120 months).

2. Display only the columns of interest in the decreasing order of overall survivival time.

3. How many samples in the METABRIC dataset have high cellularity but have no recorded classification with the 3-gene classifier?

4. Round all columns containing expression data to two decimal places and display only the relevant columns alongside the patient IDs.

Hint: see help page of mutate_at(). The round() function is useful for rounding numerical values to a specified number of decimal places.

Task 4: Nottingham Prognostic Index

1. Create a new column for corrected NPI using the following equation (try to do it yourself for practice).

\(\text{NPI} = (0.2 \times \text{Tumor size in centimetres}) + \text{Node status} + \text{Tumour grade}\)

  • \(\text{Node status}\):
    • 0 Lymph_Nodes_Examined_Positive \(\Rightarrow\) Node status = 1
    • 1-3 Lymph_Nodes_Examined_Positive \(\Rightarrow\) Node status = 2
    • \(>3\) Lymph_Nodes_Examined_Positive \(\Rightarrow\) Node status = 3
  • \(\text{Tumor grade}\):
    • Grade I \(\Rightarrow\) 1
    • Grade II \(\Rightarrow\) 2
    • Grade III \(\Rightarrow\) 3

2. Round both NPI computations to two decimal places and display the results in decreasing order of the newly calculated NPI column.

3. Visualize both NPI compuations against age at diagnosis in the same plot to facilitate comparison.

Task 5: Expression Z-Scores

Some genes are generally expressed at higher levels than others. This can make comparisons of changes between groups for a set of genes somewhat difficult, particularly if the expression for those genes are on very different scales. The expression values in our METABRIC are on a log2 scale which helps to reduce the range of values but another method for representing expression measurements is to standardize these to produce z-scores.

Standardization of a set of measurements involves subtracting the mean from each and dividing by the standard deviation. This will produce values with a mean of 0 and a standard deviation of 1.

1. Create a modified version of the metabric data frame containing a new column with the standardized expression values (z-scores) for the ERBB2 gene.

2. Verify the computation by calculating the mean and standard deviation of the new z-score variable.

3. Add another column to the modified metabric data frame containing a z-score for GATA3 and then create a scatter plot of the z-scores of GATA3 against ERBB2. Modify the plot to facet by the Claudin_Subtype classification.

4. Standardize the expression values for all genes in a single operation using an anonymous function, overwriting their original values, and round the resulting values to 3 significant figures.

Hint: refer mutate_at()

5. Verify the computation by calculating the mean and standard deviation for each column.

Hint: check summarise_at()

6. Create a plot comparing the distribution of standardized expression values for TP53 against a normal distribution.

Hint: use stat_function()

Task 6: PIK3CA Mutations across Integrative Clusters

Figure 5a from the METABRIC 2016 paper compares the percentage of samples that have non-silent mutations for various genes within each of the Integrative Clusters. In this exercise, we’ll reproduce a version of one of these bar charts for the PIK3CA gene.

1. Read the mutations that were detected in each patient tumour sample and exclude data from a different study where sample barcode starts with “MTS”, keeping only the data where barcode has the form “MB-xxxx”.

2. Use the following list to arrange and select the desdired columns for subsequent analysis.

Patient_Id = Tumor_Sample_Barcode, Chromosome, Start_Position, End_Position, Strand, Reference_Allele, contains("Tumor_Seq"), starts_with("Variant"), Codon = Protein_position, Gene = Hugo_Symbol, starts_with("HGVS")

3. Find the most frequently mutated codons in TP53.

These frequently mutated loci are known as hotspots.

4. What type of mutations are found at the most prevalent hotspot in TP53, codon 248? What changes are occurring at the DNA and protein level?

5. Create a data frame containing the number of non-silent PIK3CA mutations in each patient sample. It should have two columns, Patient_ID and PIK3CA_mutation_count.

6. Join the above table and the metabric data frame to include PIK3CA mutations

7. Create a new column named Has_PIK3CA_mutation containing logical values (i.e., TRUE or FALSE).

Hint: use is.na() function

8. Compute the proportion of patient samples within each Integrative Cluster with a PIK3CA mutation. Exclude those that have not been classified into one of the Integrative Cluster subtypes.

9. Plot the above percentages as a bar chart.

Hint: Look at the help page for geom_bar and geom_col to see which function to use.

10. Customize your plot in the following ways:

  • add a title
  • hide legend
  • change the axis labels so they don’t contain underscores
  • set the limits on the y axis so that the entire 0 - 100% range is displayed
  • set breaks on the x axis to be at 20% intervals
  • reduce the amount of space between the bars and the tick marks for each integrative cluster
  • apply the Viridis colour scheme for the bars (scale_fill_viridis_d)
  • choose the black and white theme by appending + theme_bw() to the plot
  • remove the vertical grid lines by appending + theme(panel.grid.major.x = element_blank())

Compare the plot with the equivalent plot in Figure 5a from the metabric mutation paper.

11. Create a similar plot for GATA3 and TP53 and compare those given in Figure 5a from the metabric mutation paper

12. Save each plot as a PDF using ggsave()

Task 7: Relate Copy Number and Expression Data

1. Read the copy number data into a data frame named copy_number_states. Remove columns from the other study, specifically those with sample IDs in the form “MTS-XXXX” and the Entrez ID column. Keep only the expression data for the specified genes: “ESR1”, “ERBB2”, “PGR”, “TP53”, “PIK3CA”, “GATA3”, “FOXA1”, “MLPH”.

2. Convert it into a tidy format with a row-per-sample-per-gene where the values are the copy number states.

3. Remove the Entrez_Gene_id column and convert the copy number state variable into a factor using the following names for each state.

  • -2 deletion
  • -1 loss
  • 0 neutral
  • 1 gain
  • 2 amplification

4. Count the total number of occurrences of each copy number state.

5. Create a combined data set containing the expression value and copy number state for each patient and gene pairing. The data frame should contain the columns: Sample, Gene, Expression_Date and Copy_Number_State.

6. Create a series of box plots for each of the genes with a box-and-whiskers showing the range of expression values for each copy number state.

7. Customize this plot by changing the labels, scales, colours and theme as you like – be creative! Save the plot as a PDF using ggsave()