[1] 1985 24
Exploring the Metabric Dataset
This week’s session is structured as a tutorial, allowing you to actively engage with the tasks outlined below. We will spend time exploring these tasks during the session, followed by a collaborative discussion to address any challenges encountered. The concepts we’ve covered thus far will be revisited and applied extensively.
We will be utilizing the metabric dataset, which should now be somewhat familiar to everyone. Our workflow will involve downloading the original data from cBioPortal, reading it into our R session, and then proceeding to preprocess and transform the data to better suit our analytical needs. This results in a data frame named metabric.
The following table illustartes the column names and descriptions of the metabric data frame we will be using for subsequent analysis.
Column Name | Description |
---|---|
Patient_Id | #Identifier to uniquely specify a patient. |
Lymph_Nodes_Examined_Positive | Number of lymphnodes positive |
Npi | Nottingham prognostic index |
Cellularity | Tumor Content |
Chemotherapy | Chemotherapy. |
Cohort | Cohort |
Er_Ihc | ER status measured by IHC |
Her2_Snp6 | HER2 status measured by SNP6 |
Intclust | Integrative Cluster |
Age_At_Diagnosis | Age at Diagnosis |
Os_Months | Overall survival in months since initial diagonosis. |
Os_Status | Overall patient survival status. |
Claudin_Subtype | Pam50 + Claudin-low subtype. |
Threegene | 3-Gene classifier subtype |
Vital_Status | The survival state of the person. |
Radio_Therapy | Radio Therapy |
Cancer_Type | Cancer Type |
Cancer_Type_Detailed | Cancer Type Detailed |
Er_Status | ER Status |
Her2_Status | HER2 Status |
Grade | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness. |
Oncotree_Code | Oncotree Code |
Pr_Status | PR Status |
Sample_Type | The type of sample (i.e., normal, primary, met, recurrence). |
Tumor_Size | Tumor size in mm. |
Tumor_Stage | Tumor stage. |
Tmb_Nonsynonymous | TMB (nonsynonymous) |
FOXA1 | FOXA1 Expression data |
MLPH | MLPH Expression data |
ESR1 | ESR1 Expression data |
ERBB2 | ERBB2 Expression data |
TP53 | TP53 Expression data |
PIK3CA | PIK3CA Expression data |
GATA3 | GATA3 Expression data |
PGR | PGR Expression data |
Finally, we will conduct various statistical analyses and create visualizations to effectively interpret our data.
Download the Data
Follow this link to download the metabric dataset and copy the downloaded folder into the data folder in your working directory (i.e., IntroR folder).
Task 1: Patient and Sample Data
1. Import the clinical patient data file into a data frame named patient_info. If the import is successful, the expected output for the head()
command would be as follows.
PATIENT_ID | LYMPH_NODES_EXAMINED_POSITIVE | NPI | CELLULARITY | CHEMOTHERAPY | COHORT | ER_IHC | HER2_SNP6 | HORMONE_THERAPY | INFERRED_MENOPAUSAL_STATE | SEX | INTCLUST | AGE_AT_DIAGNOSIS | OS_MONTHS | OS_STATUS | CLAUDIN_SUBTYPE | THREEGENE | VITAL_STATUS | LATERALITY | RADIO_THERAPY | HISTOLOGICAL_SUBTYPE | BREAST_SURGERY | RFS_STATUS | RFS_MONTHS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MB-0000 | 10 | 6.044 | NA | NO | 1 | Positve | NEUTRAL | YES | Post | Female | 4ER+ | 75.65 | 140.50000 | 0:LIVING | claudin-low | ER-/HER2- | Living | Right | YES | Ductal/NST | MASTECTOMY | 0:Not Recurred | 138.65 |
MB-0002 | 0 | 4.020 | High | NO | 1 | Positve | NEUTRAL | YES | Pre | Female | 4ER+ | 43.19 | 84.63333 | 0:LIVING | LumA | ER+/HER2- High Prolif | Living | Right | YES | Ductal/NST | BREAST CONSERVING | 0:Not Recurred | 83.52 |
MB-0005 | 1 | 4.030 | High | YES | 1 | Positve | NEUTRAL | YES | Pre | Female | 3 | 48.87 | 163.70000 | 1:DECEASED | LumB | NA | Died of Disease | Right | NO | Ductal/NST | MASTECTOMY | 1:Recurred | 151.28 |
MB-0006 | 3 | 4.050 | Moderate | YES | 1 | Positve | NEUTRAL | YES | Pre | Female | 9 | 47.68 | 164.93333 | 0:LIVING | LumB | NA | Living | Right | YES | Mixed | MASTECTOMY | 0:Not Recurred | 162.76 |
MB-0008 | 8 | 6.080 | High | YES | 1 | Positve | NEUTRAL | YES | Post | Female | 9 | 76.97 | 41.36667 | 1:DECEASED | LumB | ER+/HER2- High Prolif | Died of Disease | Right | YES | Mixed | MASTECTOMY | 1:Recurred | 18.55 |
MB-0010 | 0 | 4.062 | Moderate | NO | 1 | Positve | NEUTRAL | YES | Post | Female | 7 | 78.77 | 7.80000 | 1:DECEASED | LumB | ER+/HER2- High Prolif | Died of Disease | Left | YES | Ductal/NST | MASTECTOMY | 1:Recurred | 2.89 |
2. Filter the patient_info
dataframe to exclude data from a different study where patient ID starts with “MTS”, keeping only the data where patient ID has the form “MB-xxxx”.
The expected output for dimensions of the data frame:
3. Remove all columns with indices 9, 10, 11, 19, 21, 22, 23, and 24 from the dataset, then organize the remaining data in ascending order based on patient IDs.
The expected column names of the data frame:
[1] "PATIENT_ID" "LYMPH_NODES_EXAMINED_POSITIVE"
[3] "NPI" "CELLULARITY"
[5] "CHEMOTHERAPY" "COHORT"
[7] "ER_IHC" "HER2_SNP6"
[9] "INTCLUST" "AGE_AT_DIAGNOSIS"
[11] "OS_MONTHS" "OS_STATUS"
[13] "CLAUDIN_SUBTYPE" "THREEGENE"
[15] "VITAL_STATUS" "RADIO_THERAPY"
4. Convert the column names of the above dataframe to Snake_Case.
The expected column names and the dimensions of the patient_info dataframe:
[1] "Patient_Id" "Lymph_Nodes_Examined_Positive"
[3] "Npi" "Cellularity"
[5] "Chemotherapy" "Cohort"
[7] "Er_Ihc" "Her2_Snp6"
[9] "Intclust" "Age_At_Diagnosis"
[11] "Os_Months" "Os_Status"
[13] "Claudin_Subtype" "Threegene"
[15] "Vital_Status" "Radio_Therapy"
[1] 1985 16
5. Import the clinical sample data file into a data frame named sample_info and convert the column names to Snake_Case.
The expected output for the head()
command:
Patient_Id | Sample_Id | Cancer_Type | Cancer_Type_Detailed | Er_Status | Her2_Status | Grade | Oncotree_Code | Pr_Status | Sample_Type | Tumor_Size | Tumor_Stage | Tmb_Nonsynonymous |
---|---|---|---|---|---|---|---|---|---|---|---|---|
MB-0000 | MB-0000 | Breast Cancer | Breast Invasive Ductal Carcinoma | Positive | Negative | 3 | IDC | Negative | Primary | 22 | 2 | 0.000000 |
MB-0002 | MB-0002 | Breast Cancer | Breast Invasive Ductal Carcinoma | Positive | Negative | 3 | IDC | Positive | Primary | 10 | 1 | 2.615035 |
MB-0005 | MB-0005 | Breast Cancer | Breast Invasive Ductal Carcinoma | Positive | Negative | 2 | IDC | Positive | Primary | 15 | 2 | 2.615035 |
MB-0006 | MB-0006 | Breast Cancer | Breast Mixed Ductal and Lobular Carcinoma | Positive | Negative | 2 | MDLC | Positive | Primary | 25 | 2 | 1.307518 |
MB-0008 | MB-0008 | Breast Cancer | Breast Mixed Ductal and Lobular Carcinoma | Positive | Negative | 3 | MDLC | Positive | Primary | 40 | 2 | 2.615035 |
MB-0010 | MB-0010 | Breast Cancer | Breast Invasive Ductal Carcinoma | Positive | Negative | 3 | IDC | Positive | Primary | 31 | 4 | 5.230071 |
6. Join the two data frames to create a data frame named patient_sample_data and remove columns that contain identical values.
The expected output for the head()
command:
Patient_Id | Lymph_Nodes_Examined_Positive | Npi | Cellularity | Chemotherapy | Cohort | Er_Ihc | Her2_Snp6 | Intclust | Age_At_Diagnosis | Os_Months | Os_Status | Claudin_Subtype | Threegene | Vital_Status | Radio_Therapy | Cancer_Type | Cancer_Type_Detailed | Er_Status | Her2_Status | Grade | Oncotree_Code | Pr_Status | Sample_Type | Tumor_Size | Tumor_Stage | Tmb_Nonsynonymous |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MB-0000 | 10 | 6.044 | NA | NO | 1 | Positve | NEUTRAL | 4ER+ | 75.65 | 140.50000 | 0:LIVING | claudin-low | ER-/HER2- | Living | YES | Breast Cancer | Breast Invasive Ductal Carcinoma | Positive | Negative | 3 | IDC | Negative | Primary | 22 | 2 | 0.000000 |
MB-0002 | 0 | 4.020 | High | NO | 1 | Positve | NEUTRAL | 4ER+ | 43.19 | 84.63333 | 0:LIVING | LumA | ER+/HER2- High Prolif | Living | YES | Breast Cancer | Breast Invasive Ductal Carcinoma | Positive | Negative | 3 | IDC | Positive | Primary | 10 | 1 | 2.615035 |
MB-0005 | 1 | 4.030 | High | YES | 1 | Positve | NEUTRAL | 3 | 48.87 | 163.70000 | 1:DECEASED | LumB | NA | Died of Disease | NO | Breast Cancer | Breast Invasive Ductal Carcinoma | Positive | Negative | 2 | IDC | Positive | Primary | 15 | 2 | 2.615035 |
MB-0006 | 3 | 4.050 | Moderate | YES | 1 | Positve | NEUTRAL | 9 | 47.68 | 164.93333 | 0:LIVING | LumB | NA | Living | YES | Breast Cancer | Breast Mixed Ductal and Lobular Carcinoma | Positive | Negative | 2 | MDLC | Positive | Primary | 25 | 2 | 1.307518 |
MB-0008 | 8 | 6.080 | High | YES | 1 | Positve | NEUTRAL | 9 | 76.97 | 41.36667 | 1:DECEASED | LumB | ER+/HER2- High Prolif | Died of Disease | YES | Breast Cancer | Breast Mixed Ductal and Lobular Carcinoma | Positive | Negative | 3 | MDLC | Positive | Primary | 40 | 2 | 2.615035 |
MB-0010 | 0 | 4.062 | Moderate | NO | 1 | Positve | NEUTRAL | 7 | 78.77 | 7.80000 | 1:DECEASED | LumB | ER+/HER2- High Prolif | Died of Disease | YES | Breast Cancer | Breast Invasive Ductal Carcinoma | Positive | Negative | 3 | IDC | Positive | Primary | 31 | 4 | 5.230071 |
Task 2: Expression data
1. Read the microarray data into data frame named mrna. Keep only the expression data for the specified genes: “ESR1”, “ERBB2”, “PGR”, “TP53”, “PIK3CA”, “GATA3”, “FOXA1”, “MLPH”.
2. Remove columns from the other study, specifically those with sample IDs in the form “MTS-XXXX” and the Entrez ID column.
The expected dimensions of the data frame:
[1] 8 1981
3. To facilitate the join of patient_sample_data and the above data frame in the next task, transform the above data frame to the following format.
The top 6 rows of the resulting data frame:
Patient_Id | FOXA1 | MLPH | ESR1 | ERBB2 | TP53 | PIK3CA | GATA3 | PGR |
---|---|---|---|---|---|---|---|---|
MB-0362 | 11.87394 | 12.17512 | 9.082492 | 10.872596 | 6.270846 | 5.323543 | 10.746820 | 6.120169 |
MB-0346 | 10.82992 | 11.69353 | 6.017776 | 13.836037 | 6.814312 | 5.214869 | 9.497442 | 5.291619 |
MB-0386 | 11.41743 | 11.67884 | 10.193813 | 9.906015 | 6.280620 | 5.920523 | 9.909838 | 7.416926 |
MB-0574 | 11.26630 | 11.47865 | 12.412887 | 10.779680 | 6.918135 | 5.716273 | 10.077291 | 7.313249 |
MB-0185 | 11.55534 | 10.36040 | 9.046294 | 9.871677 | 6.439265 | 5.505726 | 10.213885 | 7.616044 |
MB-0503 | 11.04414 | 11.96047 | 11.167178 | 11.289801 | 6.165499 | 5.605856 | 10.709767 | 7.450691 |
4. Join the above data frame and the data frame patient_sample_data created in Task 2 to create clinical_and_expression_data dataframe.
The top 6 rows, column names and the dimensions of the final data frame:
Patient_Id | Lymph_Nodes_Examined_Positive | Npi | Cellularity | Chemotherapy | Cohort | Er_Ihc | Her2_Snp6 | Intclust | Age_At_Diagnosis | Os_Months | Os_Status | Claudin_Subtype | Threegene | Vital_Status | Radio_Therapy | Cancer_Type | Cancer_Type_Detailed | Er_Status | Her2_Status | Grade | Oncotree_Code | Pr_Status | Sample_Type | Tumor_Size | Tumor_Stage | Tmb_Nonsynonymous | FOXA1 | MLPH | ESR1 | ERBB2 | TP53 | PIK3CA | GATA3 | PGR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MB-0000 | 10 | 6.044 | NA | NO | 1 | Positve | NEUTRAL | 4ER+ | 75.65 | 140.50000 | 0:LIVING | claudin-low | ER-/HER2- | Living | YES | Breast Cancer | Breast Invasive Ductal Carcinoma | Positive | Negative | 3 | IDC | Negative | Primary | 22 | 2 | 0.000000 | 7.953794 | 9.729728 | 8.929817 | 9.333972 | 6.338739 | 5.704157 | 6.932146 | 5.680501 |
MB-0002 | 0 | 4.020 | High | NO | 1 | Positve | NEUTRAL | 4ER+ | 43.19 | 84.63333 | 0:LIVING | LumA | ER+/HER2- High Prolif | Living | YES | Breast Cancer | Breast Invasive Ductal Carcinoma | Positive | Negative | 3 | IDC | Positive | Primary | 10 | 1 | 2.615035 | 11.843989 | 12.536570 | 10.047059 | 9.729606 | 6.192507 | 5.757727 | 11.251197 | 7.505424 |
MB-0005 | 1 | 4.030 | High | YES | 1 | Positve | NEUTRAL | 3 | 48.87 | 163.70000 | 1:DECEASED | LumB | NA | Died of Disease | NO | Breast Cancer | Breast Invasive Ductal Carcinoma | Positive | Negative | 2 | IDC | Positive | Primary | 15 | 2 | 2.615035 | 11.698169 | 10.306115 | 10.041281 | 9.725825 | 6.404516 | 6.751566 | 9.289758 | 7.376123 |
MB-0006 | 3 | 4.050 | Moderate | YES | 1 | Positve | NEUTRAL | 9 | 47.68 | 164.93333 | 0:LIVING | LumB | NA | Living | YES | Breast Cancer | Breast Mixed Ductal and Lobular Carcinoma | Positive | Negative | 2 | MDLC | Positive | Primary | 25 | 2 | 1.307518 | 11.863379 | 10.472181 | 10.404685 | 10.334979 | 6.869241 | 7.219187 | 8.667723 | 6.815637 |
MB-0008 | 8 | 6.080 | High | YES | 1 | Positve | NEUTRAL | 9 | 76.97 | 41.36667 | 1:DECEASED | LumB | ER+/HER2- High Prolif | Died of Disease | YES | Breast Cancer | Breast Mixed Ductal and Lobular Carcinoma | Positive | Negative | 3 | MDLC | Positive | Primary | 40 | 2 | 2.615035 | 11.625006 | 12.161961 | 11.276581 | 9.956267 | 6.337951 | 5.817818 | 9.719781 | 7.331223 |
MB-0010 | 0 | 4.062 | Moderate | NO | 1 | Positve | NEUTRAL | 7 | 78.77 | 7.80000 | 1:DECEASED | LumB | ER+/HER2- High Prolif | Died of Disease | YES | Breast Cancer | Breast Invasive Ductal Carcinoma | Positive | Negative | 3 | IDC | Positive | Primary | 31 | 4 | 5.230071 | 12.142178 | 11.433164 | 11.239750 | 9.739996 | 5.419711 | 6.123056 | 9.787085 | 5.954311 |
[1] "Patient_Id" "Lymph_Nodes_Examined_Positive"
[3] "Npi" "Cellularity"
[5] "Chemotherapy" "Cohort"
[7] "Er_Ihc" "Her2_Snp6"
[9] "Intclust" "Age_At_Diagnosis"
[11] "Os_Months" "Os_Status"
[13] "Claudin_Subtype" "Threegene"
[15] "Vital_Status" "Radio_Therapy"
[17] "Cancer_Type" "Cancer_Type_Detailed"
[19] "Er_Status" "Her2_Status"
[21] "Grade" "Oncotree_Code"
[23] "Pr_Status" "Sample_Type"
[25] "Tumor_Size" "Tumor_Stage"
[27] "Tmb_Nonsynonymous" "FOXA1"
[29] "MLPH" "ESR1"
[31] "ERBB2" "TP53"
[33] "PIK3CA" "GATA3"
[35] "PGR"
[1] 1985 35
5. Notice that one of the columns contains key value pairs in the form key:value. For example: 0:LIVING, 1:DECEASED. Update column to retain only the values.
6. Rename the data frame to metabric for all subsequent analysis.
Task 3: Basic Analysis
1. Find the total number of patients who were still alive at the time of the study and had survived for more than 10 years (120 months).
2. Display only the columns of interest in the decreasing order of overall survivival time.
3. How many samples in the METABRIC dataset have high cellularity but have no recorded classification with the 3-gene classifier?
4. Round all columns containing expression data to two decimal places and display only the relevant columns alongside the patient IDs.
Hint: see help page of mutate_at()
. The round()
function is useful for rounding numerical values to a specified number of decimal places.
Task 4: Nottingham Prognostic Index
1. Create a new column for corrected NPI using the following equation (try to do it yourself for practice).
\(\text{NPI} = (0.2 \times \text{Tumor size in centimetres}) + \text{Node status} + \text{Tumour grade}\)
- \(\text{Node status}\):
- 0 Lymph_Nodes_Examined_Positive \(\Rightarrow\) Node status = 1
- 1-3 Lymph_Nodes_Examined_Positive \(\Rightarrow\) Node status = 2
- \(>3\) Lymph_Nodes_Examined_Positive \(\Rightarrow\) Node status = 3
- \(\text{Tumor grade}\):
- Grade I \(\Rightarrow\) 1
- Grade II \(\Rightarrow\) 2
- Grade III \(\Rightarrow\) 3
2. Round both NPI computations to two decimal places and display the results in decreasing order of the newly calculated NPI column.
3. Visualize both NPI compuations against age at diagnosis in the same plot to facilitate comparison.
Task 5: Expression Z-Scores
Some genes are generally expressed at higher levels than others. This can make comparisons of changes between groups for a set of genes somewhat difficult, particularly if the expression for those genes are on very different scales. The expression values in our METABRIC are on a log2 scale which helps to reduce the range of values but another method for representing expression measurements is to standardize these to produce z-scores.
Standardization of a set of measurements involves subtracting the mean from each and dividing by the standard deviation. This will produce values with a mean of 0 and a standard deviation of 1.
1. Create a modified version of the metabric data frame containing a new column with the standardized expression values (z-scores) for the ERBB2 gene.
2. Verify the computation by calculating the mean and standard deviation of the new z-score variable.
3. Add another column to the modified metabric data frame containing a z-score for GATA3 and then create a scatter plot of the z-scores of GATA3 against ERBB2. Modify the plot to facet by the Claudin_Subtype classification.
4. Standardize the expression values for all genes in a single operation using an anonymous function, overwriting their original values, and round the resulting values to 3 significant figures.
Hint: refer mutate_at()
5. Verify the computation by calculating the mean and standard deviation for each column.
Hint: check summarise_at()
6. Create a plot comparing the distribution of standardized expression values for TP53 against a normal distribution.
Hint: use stat_function()
Task 6: PIK3CA Mutations across Integrative Clusters
Figure 5a from the METABRIC 2016 paper compares the percentage of samples that have non-silent mutations for various genes within each of the Integrative Clusters. In this exercise, we’ll reproduce a version of one of these bar charts for the PIK3CA gene.
1. Read the mutations that were detected in each patient tumour sample and exclude data from a different study where sample barcode starts with “MTS”, keeping only the data where barcode has the form “MB-xxxx”.
2. Use the following list to arrange and select the desdired columns for subsequent analysis.
Patient_Id = Tumor_Sample_Barcode, Chromosome, Start_Position, End_Position, Strand, Reference_Allele, contains("Tumor_Seq"), starts_with("Variant"), Codon = Protein_position, Gene = Hugo_Symbol, starts_with("HGVS")
3. Find the most frequently mutated codons in TP53.
These frequently mutated loci are known as hotspots.
4. What type of mutations are found at the most prevalent hotspot in TP53, codon 248? What changes are occurring at the DNA and protein level?
5. Create a data frame containing the number of non-silent PIK3CA mutations in each patient sample. It should have two columns, Patient_ID and PIK3CA_mutation_count.
6. Join the above table and the metabric data frame to include PIK3CA mutations
7. Create a new column named Has_PIK3CA_mutation containing logical values (i.e., TRUE or FALSE).
Hint: use is.na()
function
8. Compute the proportion of patient samples within each Integrative Cluster with a PIK3CA mutation. Exclude those that have not been classified into one of the Integrative Cluster subtypes.
9. Plot the above percentages as a bar chart.
Hint: Look at the help page for geom_bar
and geom_col
to see which function to use.
10. Customize your plot in the following ways:
- add a title
- hide legend
- change the axis labels so they don’t contain underscores
- set the limits on the y axis so that the entire 0 - 100% range is displayed
- set breaks on the x axis to be at 20% intervals
- reduce the amount of space between the bars and the tick marks for each integrative cluster
- apply the Viridis colour scheme for the bars (
scale_fill_viridis_d
) - choose the black and white theme by appending
+ theme_bw()
to the plot - remove the vertical grid lines by appending
+ theme(panel.grid.major.x = element_blank())
Compare the plot with the equivalent plot in Figure 5a from the metabric mutation paper.
11. Create a similar plot for GATA3 and TP53 and compare those given in Figure 5a from the metabric mutation paper
12. Save each plot as a PDF using ggsave()
Task 7: Relate Copy Number and Expression Data
1. Read the copy number data into a data frame named copy_number_states. Remove columns from the other study, specifically those with sample IDs in the form “MTS-XXXX” and the Entrez ID column. Keep only the expression data for the specified genes: “ESR1”, “ERBB2”, “PGR”, “TP53”, “PIK3CA”, “GATA3”, “FOXA1”, “MLPH”.
2. Convert it into a tidy format with a row-per-sample-per-gene where the values are the copy number states.
3. Remove the Entrez_Gene_id
column and convert the copy number state variable into a factor using the following names for each state.
- -2 deletion
- -1 loss
- 0 neutral
- 1 gain
- 2 amplification
4. Count the total number of occurrences of each copy number state.
5. Create a combined data set containing the expression value and copy number state for each patient and gene pairing. The data frame should contain the columns: Sample
, Gene
, Expression_Date
and Copy_Number_State
.
6. Create a series of box plots for each of the genes with a box-and-whiskers showing the range of expression values for each copy number state.
7. Customize this plot by changing the labels, scales, colours and theme as you like – be creative! Save the plot as a PDF using ggsave()