Analysing Patient Data

Words tell a story; statistics help us check whether that story fits the data. Once the Metabric data are loaded and labelled, the next step is to turn questions into short, reproducible operations: filter the rows you care about, pick the variables that answer the question, and apply the right summary or comparison.

Before we can analyse patient data, we need to load it into Python in a way that makes it easy to work with. Let’s load the Metabric patient data and set the patient ID as the index, so rows are labelled meaningfully. If you already have metabric_patients in memory from a previous episode, you can reuse it; otherwise, run the following to (re)load it:

import pandas as pd

metabric_patients = pd.read_csv(
    'https://zenodo.org/record/6450144/files/metabric_clinical_and_expression_data.csv',
    index_col='Patient_ID'
)
Statistics on Data
Before jumping into complex models, we start by describing the data. Descriptive statistics answer simple questions like “what is typical?” and “how much do values vary?” These summaries help you build intuition and spot problems early (e.g., unexpected scales or missing values).
The pandas library provides a wide range of statistical functions and methods for computing summary statistics. The sections below cover some of the key measures you can compute with it.
Mean
Mean (average) of a single column (one variable):
metabric_patients.loc[:, 'Survival_time'].mean()

125.12132352924738
This answers “on average, what is the value of this variable?” Here we use the Survival_time column.
Mean of all numeric columns (column-wise):
metabric_patients.mean(numeric_only=True)

Cohort                               2.643908
Age_at_diagnosis 61.087054
Survival_time 125.121324
Tumour_size 26.238726
Tumour_stage 1.750535
Neoplasm_histologic_grade 2.415939
Lymph_nodes_examined_positive 2.002101
Lymph_node_status 1.639706
Nottingham_prognostic_index 4.033019
Mutation_count 5.697687
ESR1 9.607824
ERBB2 10.765364
PGR 6.237203
TP53 6.197967
PIK3CA 5.970097
GATA3 9.502910
FOXA1 10.800526
MLPH 11.362384
dtype: float64
This computes the mean for each numeric column independently, giving you a quick overview of typical values across the dataset.
Why numeric_only=True?
Many DataFrames mix numeric columns (like Tumour_size) with non-numeric columns (like Patient_ID or Cohort labels). Adding numeric_only=True tells pandas to ignore non-numeric columns when computing a statistic across the whole DataFrame. This avoids errors or confusing results. For a single numeric column (e.g., metabric_patients.loc[:, 'Survival_time']), you don’t need this argument; it’s most helpful for DataFrame-wide operations.
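A small toy example (not the Metabric data) shows the effect: the text column is simply skipped when numeric_only=True is set.

```python
import pandas as pd

# Toy DataFrame mixing a numeric and a text column (made-up values)
df = pd.DataFrame({
    'Tumour_size': [12.0, 30.0, 15.0],
    'Cancer_type': ['Ductal', 'Lobular', 'Ductal'],
})

# With numeric_only=True, non-numeric columns are ignored;
# without it, recent pandas versions raise a TypeError on mixed frames.
means = df.mean(numeric_only=True)
print(means)  # only Tumour_size appears, with mean 19.0
```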
Mean across each row:
metabric_patients.mean(axis=1, numeric_only=True)

Patient_ID
MB-0000 19.046874
MB-0002 12.483745
MB-0005 17.399631
MB-0006 18.072797
MB-0008 14.313514
...
MB-7295 20.963895
MB-7296 11.924471
MB-7297 23.663816
MB-7298 17.058346
MB-7299 21.721870
Length: 1904, dtype: float64
This treats each row as a small collection of values and averages across columns. It only makes sense when the columns you include are on a comparable scale (for example, a set of related measurements). The axis=1 argument tells Pandas to compute across columns for each row.
The axis= argument

axis=0 (the default) means “by column”: compute one result per column across rows. axis=1 means “by row”: compute one result per row across columns.
Most reducers (sum, mean, min, max, etc.) accept axis= and follow the same pattern.
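A minimal sketch of the two directions on a toy DataFrame (made-up values):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]}, index=['r1', 'r2'])

col_means = df.mean(axis=0)  # one result per column: a -> 1.5, b -> 15.0
row_means = df.mean(axis=1)  # one result per row: r1 -> 5.5, r2 -> 11.0

print(col_means)
print(row_means)
```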
Median
The median (robust “typical” value) is the middle value when data are sorted. It is not affected by extreme values (outliers) and is often used to describe the central tendency of data.
metabric_patients.median(numeric_only=True)

Cohort                               3.000000
Age_at_diagnosis 61.770000
Survival_time 115.616667
Tumour_size 23.000000
Tumour_stage 2.000000
Neoplasm_histologic_grade 3.000000
Lymph_nodes_examined_positive 0.000000
Lymph_node_status 1.000000
Nottingham_prognostic_index 4.042000
Mutation_count 5.000000
ESR1 10.252166
ERBB2 10.530301
PGR 5.877591
TP53 6.185873
PIK3CA 5.938094
GATA3 9.911805
FOXA1 11.367947
MLPH 11.873967
dtype: float64
Visualising mean vs median: the mean is pulled toward extreme values more than the median, so in a skewed distribution a line drawn at each of the two summaries will sit apart.
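A tiny example makes the difference concrete: a single extreme value shifts the mean but leaves the median unchanged (toy numbers, not Metabric data).

```python
import pandas as pd

# Four typical values plus one outlier
values = pd.Series([1, 2, 3, 4, 100])

m = values.mean()     # (1+2+3+4+100)/5 = 22.0, dragged up by the outlier
md = values.median()  # 3.0, the middle of the sorted values

print(m, md)
```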
Standard Deviation
The standard deviation (spread around the mean) measures the amount of variation or dispersion in a dataset. A lower standard deviation indicates that data points are close to the mean, while a higher standard deviation indicates greater variability.
metabric_patients.std(numeric_only=True)

Cohort                               1.228615
Age_at_diagnosis 12.978711
Survival_time 76.334148
Tumour_size 15.160976
Tumour_stage 0.628999
Neoplasm_histologic_grade 0.650612
Lymph_nodes_examined_positive 4.079993
Lymph_node_status 0.743809
Nottingham_prognostic_index 1.144492
Mutation_count 4.058778
ESR1 2.133827
ERBB2 1.357359
PGR 1.020871
TP53 0.401864
PIK3CA 0.352549
GATA3 1.502636
FOXA1 1.754282
MLPH 1.687555
dtype: float64
Variance
Variance (spread in squared units) is the square of the standard deviation. It quantifies how much individual data points deviate from the mean. It’s useful mathematically, but note its units are squared (e.g., cm² if the original variable is in cm).
metabric_patients.var(numeric_only=True)

Cohort                               1.509495
Age_at_diagnosis 168.446952
Survival_time 5826.902195
Tumour_size 229.855200
Tumour_stage 0.395640
Neoplasm_histologic_grade 0.423296
Lymph_nodes_examined_positive 16.646343
Lymph_node_status 0.553252
Nottingham_prognostic_index 1.309861
Mutation_count 16.473681
ESR1 4.553217
ERBB2 1.842424
PGR 1.042178
TP53 0.161494
PIK3CA 0.124291
GATA3 2.257914
FOXA1 3.077506
MLPH 2.847843
dtype: float64
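Because the variance is the square of the standard deviation, the two summaries can be cross-checked on a small sample (toy values, not a Metabric column):

```python
import pandas as pd

# Toy tumour sizes (made-up values)
sizes = pd.Series([12.0, 30.0, 15.0, 23.0])

std = sizes.std()  # sample standard deviation (Bessel-corrected, ddof=1)
var = sizes.var()  # sample variance, equal to std squared

print(std ** 2, var)  # the two agree
```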
Sum
Use .sum() to add up values.
Total for each numeric column (column-wise):
metabric_patients.sum(numeric_only=True)

Cohort                            5034.000000
Age_at_diagnosis 116309.750000
Survival_time 238231.000000
Tumour_size 49433.760000
Tumour_stage 2456.000000
Neoplasm_histologic_grade 4426.000000
Lymph_nodes_examined_positive 3812.000000
Lymph_node_status 3122.000000
Nottingham_prognostic_index 7678.867520
Mutation_count 10592.000000
ESR1 18293.296572
ERBB2 20497.253069
PGR 11875.634913
TP53 11800.928408
PIK3CA 11367.064843
GATA3 18093.541447
FOXA1 20564.200657
MLPH 21633.979019
dtype: float64
Total for a single column (e.g., Mutation_count across all patients):
metabric_patients.loc[:, 'Mutation_count'].sum()

10592.0
If you ever sum across rows (axis=1), be sure the columns are on comparable scales.
Row-wise sum of selected numeric columns (use with care):
metabric_patients.loc[:, 'ESR1':'MLPH'].sum(axis=1, numeric_only=True)

Patient_ID
MB-0000 60.602853
MB-0002 74.864079
MB-0005 71.593354
MB-0006 72.647012
MB-0008 74.226588
...
MB-7295 72.369544
MB-7296 74.062673
MB-7297 73.368202
MB-7298 76.548554
MB-7299 69.311795
Length: 1904, dtype: float64
What the output shows

- DataFrame sum (no axis= given): a Series (one total per numeric column).
- Single-column sum: a single number.
- Row-wise sum (axis=1): a Series indexed by row labels (one total per patient).
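These three shapes can be checked on a tiny toy DataFrame (made-up values):

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0]}, index=['p1', 'p2'])

col_totals = df.sum()        # Series: one total per column
single = df['x'].sum()       # a single number
row_totals = df.sum(axis=1)  # Series indexed by row labels

print(col_totals)
print(single)
print(row_totals)
```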
Count
Use .count() to count non-missing values. This is helpful to see how complete each column is.
Count of non-missing values for each column:
metabric_patients.count()

Cohort                           1904
Age_at_diagnosis 1904
Survival_time 1904
Survival_status 1904
Vital_status 1903
Chemotherapy 1904
Radiotherapy 1904
Tumour_size 1884
Tumour_stage 1403
Neoplasm_histologic_grade 1832
Lymph_nodes_examined_positive 1904
Lymph_node_status 1904
Cancer_type 1889
ER_status 1904
PR_status 1904
HER2_status 1904
HER2_status_measured_by_SNP6 1904
PAM50 1904
3-gene_classifier 1700
Nottingham_prognostic_index 1904
Cellularity 1850
Integrative_cluster 1904
Mutation_count 1859
ESR1 1904
ERBB2 1904
PGR 1904
TP53 1904
PIK3CA 1904
GATA3 1904
FOXA1 1904
MLPH 1904
dtype: int64
This returns the number of non-missing entries for each column in the DataFrame.
Count of non-missing values for a single column:
metabric_patients.loc[:, 'Survival_status'].count()

1904

This returns a single number: the count of non-missing entries in the Survival_status column.
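The complement of .count() is counting what is missing; pandas also offers .isna().sum(), which returns the number of NaN entries per column. A sketch on toy data:

```python
import pandas as pd
import numpy as np

# Toy frame with one missing value (made-up data)
df = pd.DataFrame({'Tumour_size': [12.0, np.nan, 15.0],
                   'Cohort': [1, 2, 3]})

non_missing = df.count()   # non-NaN entries per column
missing = df.isna().sum()  # NaN entries per column; the two sum to len(df)

print(non_missing)
print(missing)
```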
Here is a quick reference summary table of common useful functions.
| Function | Description |
|---|---|
| count | Number of non-NA observations |
| sum | Sum of values |
| mean | Mean of values |
| median | Arithmetic median of values |
| min | Minimum |
| max | Maximum |
| mode | Mode |
| abs | Absolute value |
| prod | Product of values |
| std | Bessel-corrected sample standard deviation |
| var | Unbiased variance |
| sem | Standard error of the mean |
| skew | Sample skewness (3rd moment) |
| kurt | Sample kurtosis (4th moment) |
| quantile | Sample quantile (value at %) |
| cumsum | Cumulative sum |
| cumprod | Cumulative product |
| cummax | Cumulative maximum |
| cummin | Cumulative minimum |
In summary, descriptive statistics provide a powerful first look at your data, helping you understand its central tendencies, variability, and completeness. By using functions like mean, median, std, and value_counts, you can quickly identify patterns, spot anomalies, and prepare your dataset for deeper analysis. Mastering these basic tools in pandas will make your data exploration more efficient and insightful.
Recall from the previous episode that you can select specific rows using boolean conditions (masks). See the section on boolean masks. We’ll apply the above statistics to filtered subsets. All the statistical operators that work on entire DataFrames work the same way on slices.
Below are a few examples showing how to compute min and max over columns and rows, and how to find which patient or column produced that value.
Let’s say we want to analyze patients from cohort 1, whose IDs range from MB-0000 to MB-0906. For example, to find the maximum tumour size among these patients, we can use the following code:
cohort_mask = metabric_patients.loc[:, "Cohort"] == 1
print('Maximum for Tumour_size column for cohort 1 patients:')
print(metabric_patients.loc[cohort_mask, "Tumour_size"].max())

Maximum for Tumour_size column for cohort 1 patients:
180.0
For the same set of patients, you can check the minimum mutation count and identify which patient has it. For example:
print('\nMinimum for Mutation_count column for cohort 1 patients:')
print(metabric_patients.loc[cohort_mask, "Mutation_count"].min())

Minimum for Mutation_count column for cohort 1 patients:
1.0
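To also report which patient produced an extreme value, pandas Series provide .idxmin() and .idxmax(), which return the index label at the minimum or maximum. A sketch on made-up patient IDs and mutation counts:

```python
import pandas as pd

# Toy mutation counts indexed by (made-up) patient IDs
counts = pd.Series([5.0, 1.0, 9.0],
                   index=['MB-0000', 'MB-0002', 'MB-0005'],
                   name='Mutation_count')

minv = counts.min()        # 1.0 -- the smallest value
who_min = counts.idxmin()  # 'MB-0002' -- the patient with the fewest mutations
who_max = counts.idxmax()  # 'MB-0005' -- the patient with the most

print(minv, who_min, who_max)
```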
Useful Helper Functions
Before plotting or building more complex summaries, a few small helpers make your life easier. These tools don’t change the meaning of your data—they help you clean rows, control order, and ask focused questions. After each command, check what changed: the values, the order, the labels, or the row count.
Missing Values
It’s important to make sure your data doesn’t contain missing values (represented as NaN, which stands for “Not a Number”). Missing values can cause errors or misleading results in your analysis. For example, calculating the mean of a column with missing values will return NaN, and plotting functions might skip or misrepresent those data points.
To handle this, you can use the dropna() method. This method removes any rows (or columns) that contain missing values. By default, dropna() removes rows where any column has a missing value.
Drop rows where either ‘3-gene_classifier’ or ‘GATA3’ is missing:
subset = metabric_patients.loc[:, ['3-gene_classifier', 'GATA3']].dropna()
print('Rows before:', len(metabric_patients))
print('Rows after dropna:', len(subset))

Rows before: 1904
Rows after dropna: 1700
You’ll see the original row count and a smaller count after the dropna() call. The subset DataFrame keeps only rows where both columns are present. By default, a row is dropped if any column contains a missing value. To drop columns instead of rows, use axis=1; to restrict which columns are checked without selecting them first, pass subset=[...].
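The different dropna() behaviours can be sketched on a small toy frame (made-up values):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 2.0, 3.0],
                   'c': [1.0, 2.0, 3.0]})

all_cols = df.dropna()            # drop rows with NaN in ANY column -> 1 row left
only_a = df.dropna(subset=['a'])  # check only column 'a' -> 2 rows left
cols = df.dropna(axis=1)          # drop COLUMNS containing NaN -> only 'c' survives

print(len(all_cols), len(only_a), list(cols.columns))
```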
Unique Values
When you’re exploring categorical columns, a first sanity check is to list which labels actually appear. This helps you spot typos (“Lum a” vs “LumA”), unexpected categories, or missing values before you count or plot. Pandas provides .unique() for this: it returns the distinct values observed in a column.
unique() returns an array of the distinct values. If the column contains missing values (NaN), they will be included in the result. Use dropna() first if you want to exclude missing values from the listing.
metabric_patients.loc[:, '3-gene_classifier'].unique()

array(['ER-/HER2-', 'ER+/HER2- High Prolif', nan, 'ER+/HER2- Low Prolif',
       'HER2+'], dtype=object)
# Distinct labels, excluding missing entries
metabric_patients.loc[:, '3-gene_classifier'].dropna().unique()

array(['ER-/HER2-', 'ER+/HER2- High Prolif', 'ER+/HER2- Low Prolif',
       'HER2+'], dtype=object)
The order of the output is based on first appearance in the column; it is not alphabetical. If you want, you can sort the categories as follows:
labels = metabric_patients.loc[:, '3-gene_classifier'].dropna().unique()
sorted_labels = sorted(labels)
sorted_labels

['ER+/HER2- High Prolif', 'ER+/HER2- Low Prolif', 'ER-/HER2-', 'HER2+']
Listing the set of categories is often followed by asking “how many of each?”. To get a frequency table see the Frequency of Values section below.
Frequency of Values
To inspect how often each distinct value appears in a column, use value_counts(). This is particularly helpful for categorical variables (for example, labels) to check balance, spot rare categories, and reveal unexpected or missing labels. By default it returns counts sorted by frequency.
metabric_patients.value_counts()

Cohort Age_at_diagnosis Survival_time Survival_status Vital_status Chemotherapy Radiotherapy Tumour_size Tumour_stage Neoplasm_histologic_grade Lymph_nodes_examined_positive Lymph_node_status Cancer_type ER_status PR_status HER2_status HER2_status_measured_by_SNP6 PAM50 3-gene_classifier Nottingham_prognostic_index Cellularity Integrative_cluster Mutation_count ESR1 ERBB2 PGR TP53 PIK3CA GATA3 FOXA1 MLPH
1 26.36 32.633333 DECEASED Died of Disease NO YES 12.0 1.0 3.0 0 1 Breast Invasive Ductal Carcinoma Positive Positive Negative NEUTRAL LumA ER+/HER2- High Prolif 4.024 High 6 1.0 9.588608 9.490468 7.928097 6.356356 5.680179 9.411429 10.718919 10.956553 1
3 55.53 62.900000 DECEASED Died of Disease NO NO 30.0 2.0 3.0 0 1 Breast Invasive Ductal Carcinoma Negative Negative Negative NEUTRAL Basal ER-/HER2- 4.060 High 10 4.0 6.253704 8.004007 5.490949 5.905707 6.398763 7.535037 6.905833 7.705025 1
55.72 79.300000 LIVING Living NO YES 15.0 1.0 3.0 0 1 Breast Invasive Ductal Carcinoma Negative Negative Negative NEUTRAL LumA ER+/HER2- Low Prolif 4.030 Moderate 4ER- 7.0 6.216032 11.759620 5.237508 6.098317 5.764115 6.892657 11.372491 12.298825 1
55.65 186.633333 DECEASED Died of Other Causes NO YES 20.0 2.0 3.0 9 3 Breast Mixed Ductal and Lobular Carcinoma Positive Positive Negative NEUTRAL LumA ER+/HER2- High Prolif 6.040 Moderate 3 8.0 11.589441 10.791974 6.134051 6.716117 6.035893 10.242622 11.648412 12.421318 1
55.62 174.833333 LIVING Living NO YES 40.0 2.0 2.0 2 2 Breast Invasive Ductal Carcinoma Positive Positive Negative NEUTRAL LumA ER+/HER2- Low Prolif 4.080 Moderate 8 7.0 11.665731 11.480590 6.425127 6.465125 6.238929 9.793889 10.949489 12.199768 1
..
2 43.51 28.566667 DECEASED Died of Disease YES YES 20.0 3.0 3.0 6 3 Breast Invasive Ductal Carcinoma Negative Negative Negative NEUTRAL Basal ER-/HER2- 6.040 High 10 13.0 5.539984 10.292837 5.407988 5.518323 6.311530 7.297214 5.636049 7.157132 1
43.56 225.500000 LIVING Living NO YES 23.0 1.0 3.0 0 1 Breast Invasive Ductal Carcinoma Positive Positive Positive GAIN LumA ER+/HER2- High Prolif 4.046 High 4ER+ 4.0 9.691953 12.735390 9.109610 6.620104 6.473224 9.526202 11.764086 11.211695 1
43.58 237.133333 LIVING Living NO YES 27.0 1.0 3.0 0 1 Breast Invasive Ductal Carcinoma Positive Negative Negative GAIN LumB ER+/HER2- High Prolif 4.054 High 1 6.0 10.267879 10.644271 5.668715 5.994670 5.793478 11.428410 12.125305 13.132729 1
43.67 188.733333 LIVING Living NO NO 17.0 2.0 2.0 1 2 Breast Mixed Ductal and Lobular Carcinoma Positive Positive Negative NEUTRAL LumA ER+/HER2- Low Prolif 4.034 Moderate 8 16.0 10.031496 10.181827 6.612446 6.322806 6.528300 9.847294 11.458224 12.150357 1
5 86.61 63.566667 DECEASED Died of Other Causes NO NO 16.0 1.0 1.0 0 1 Breast Invasive Lobular Carcinoma Positive Positive Negative NEUTRAL LumA ER+/HER2- Low Prolif 2.032 Moderate 3 10.0 10.899206 9.560221 7.459093 6.164949 5.869367 10.422461 11.099896 12.124662 1
Name: count, Length: 1121, dtype: int64
When you call value_counts() on an entire DataFrame, each index represents a unique combination of values across all columns, and the count shows how many rows share that exact combination. Since each row in our dataset is unique, all counts will be 1—this is a quick way to check for duplicate rows.
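If duplicate detection is the actual goal, .duplicated() is a more direct check than scanning value_counts() output; it flags rows identical to an earlier row. A sketch on a toy frame with one deliberate duplicate:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2],
                   'b': ['x', 'x', 'y']})  # the second row repeats the first

n_duplicates = df.duplicated().sum()  # number of rows identical to an earlier row
print(n_duplicates)  # 1
```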
For a specific column (e.g., 3-gene_classifier):
metabric_patients.loc[:, '3-gene_classifier'].value_counts()

3-gene_classifier
ER+/HER2- Low Prolif 619
ER+/HER2- High Prolif 603
ER-/HER2- 290
HER2+ 188
Name: count, dtype: int64
When used on a single column, value_counts() returns a Series where the index lists each unique value (such as category names) and the corresponding values show how many times each occurs. In the example above, you’ll see four categories and their respective counts.
Set normalize=True to display proportions (fractions that sum to 1) rather than raw counts.
metabric_patients.loc[:, '3-gene_classifier'].value_counts(normalize=True)

3-gene_classifier
ER+/HER2- Low Prolif 0.364118
ER+/HER2- High Prolif 0.354706
ER-/HER2- 0.170588
HER2+ 0.110588
Name: proportion, dtype: float64
To display percentages, multiply the result by 100:
metabric_patients.loc[:, '3-gene_classifier'].value_counts(normalize=True) * 100

3-gene_classifier
ER+/HER2- Low Prolif 36.411765
ER+/HER2- High Prolif 35.470588
ER-/HER2- 17.058824
HER2+ 11.058824
Name: proportion, dtype: float64
By default, missing values (or NaN) are excluded from the result. This means that only rows with actual, non-missing values are counted in the frequency table. If you want to include missing values as their own category in the output, you can pass the argument dropna=False to value_counts(). This will show how many times NaN appears in the column, which is useful for quickly assessing the amount of missing data in a categorical variable.
metabric_patients.loc[:, '3-gene_classifier'].value_counts(dropna=False)

3-gene_classifier
ER+/HER2- Low Prolif 619
ER+/HER2- High Prolif 603
ER-/HER2- 290
NaN 204
HER2+ 188
Name: count, dtype: int64
Use dropna=False to see how many entries are missing. If your analysis should exclude missing rows entirely, clean first with .dropna() on the relevant columns (see the Missing Values section above) and then run value_counts().
Counts returned by value_counts() are automatically sorted from most to least frequent. If you want to sort the counts by the actual labels (in alphabetical order), you can use the sort_index() method. This is useful when you need to view your data categories in a predictable, ordered way rather than by their frequency.
metabric_patients.loc[:, 'Integrative_cluster'].value_counts().sort_index()

Integrative_cluster
1 132
10 219
2 72
3 282
4ER+ 244
4ER- 74
5 184
6 84
7 182
8 289
9 142
Name: count, dtype: int64
To display categories in a custom order (such as a clinically meaningful sequence), use the reindex() method. This lets you specify the exact order of labels you want to see in your output, rather than relying on alphabetical or frequency-based sorting. For example, if you have a desired order of cluster labels, you can filter to only those present in your data and then reindex the counts:
counts = metabric_patients.loc[:, 'Integrative_cluster'].value_counts()

# custom order (keep only labels that exist)
desired = ["1", "2", "3", "4ER-", "4ER+", "5", "6", "7", "8", "9", "10"]
present = [lab for lab in desired if lab in counts.index]
counts_custom = counts.reindex(present)
counts_custom

Integrative_cluster
1 132
2 72
3 282
4ER- 74
4ER+ 244
5 184
6 84
7 182
8 289
9 142
10 219
Name: count, dtype: int64
This approach first uses value_counts() to tally each unique value in the Integrative_cluster column, producing a pandas Series with cluster labels as the index and their counts as values. The desired list defines the preferred order for these labels.
The code then filters desired to include only labels that are actually present in the data. Using reindex(present) rearranges the counts Series to match this custom order, ensuring the output is organized as intended.
If a label from desired is missing in the data, it is simply omitted from the result—no NaNs are introduced. This method is especially helpful for displaying or plotting results in a meaningful sequence.
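The contrast with reindexing by the unfiltered desired list can be sketched on toy counts (made-up labels and values):

```python
import pandas as pd

counts = pd.Series({'1': 132, '2': 72, '10': 219}, name='count')

desired = ['1', '2', '3', '10']  # '3' does not occur in the toy data
present = [lab for lab in desired if lab in counts.index]

ordered = counts.reindex(present)    # filtered first: no NaN introduced
with_nan = counts.reindex(desired)   # unfiltered: '3' appears with a NaN count

print(ordered)
print(with_nan)
```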
- Compute common summaries (mean, median, std, var, sum, count, value_counts).
- Switch between column-wise and row-wise operations with axis=, and use numeric_only for mixed dtypes.
- Filter rows with boolean masks and summarize subsets.
- Handle missing values with dropna() when appropriate.
- Explore categories with unique().