Visualising Patient Data
Visualising Data
Words tell a story; figures let us see it. In the same way that Pandas helps us wrangle tables, Matplotlib helps us turn numbers into pictures so that patterns and outliers are easier to spot.
Matplotlib is the most widely used plotting library in Python. It provides a flexible framework for creating a wide variety of static, animated, and interactive visualisations, from simple line plots and scatter plots to complex heatmaps and 3D charts. Matplotlib is highly customizable, allowing you to control every aspect of your figures, including colors, labels, legends, and more.
Visualisation deserves an entire lecture of its own, but we can explore a few features of Python’s Matplotlib library here.
First, we will import the pyplot module from matplotlib.
import matplotlib.pyplot as plt
Loading Data for Visualisation
We’ll use the same Metabric patient data we loaded earlier. Let’s load it and set the patient ID as the index, so rows are labelled meaningfully. If you already have metabric_patients in memory from a previous episode, you can reuse it; otherwise, run the following to (re)load it:
import pandas as pd
metabric_patients = pd.read_csv(
    "https://zenodo.org/record/6450144/files/metabric_clinical_and_expression_data.csv",
    index_col="Patient_ID"
)
Before plotting, it’s useful to check which variables are numeric and which are categorical, since that affects the choice of plot:
metabric_patients.dtypes
Cohort int64
Age_at_diagnosis float64
Survival_time float64
Survival_status object
Vital_status object
Chemotherapy object
Radiotherapy object
Tumour_size float64
Tumour_stage float64
Neoplasm_histologic_grade float64
Lymph_nodes_examined_positive int64
Lymph_node_status int64
Cancer_type object
ER_status object
PR_status object
HER2_status object
HER2_status_measured_by_SNP6 object
PAM50 object
3-gene_classifier object
Nottingham_prognostic_index float64
Cellularity object
Integrative_cluster object
Mutation_count float64
ESR1 float64
ERBB2 float64
PGR float64
TP53 float64
PIK3CA float64
GATA3 float64
FOXA1 float64
MLPH float64
dtype: object
Common numeric columns in this dataset include ages and expression levels; common categorical columns include receptor status and subtype labels. We’ll use these to drive our visualisations in the next steps.
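One way to make this check programmatic is pandas' select_dtypes. The sketch below uses a small made-up table (the column names mirror the Metabric data, but the values are illustrative only) and treats every non-numeric column as categorical, which is an assumption that may not hold for every dataset.

```python
import pandas as pd

# Hypothetical miniature table; column names mirror the Metabric data, values are made up
toy = pd.DataFrame({
    "Age_at_diagnosis": [75.6, 43.2, 48.9],
    "ER_status": ["Positive", "Positive", "Negative"],
    "GATA3": [6.9, 11.3, 9.7],
})

# Numeric columns suit scatter plots and box plots
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else is treated here as categorical (an assumption for this sketch)
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

The same two calls should work on metabric_patients once it is loaded.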
Your First Plot
Let’s explore the relationship between two important genes in breast cancer: the transcription factor GATA3 and the estrogen receptor ESR1. We’ll plot GATA3 expression on the x-axis and ESR1 expression on the y-axis.
To get started, we’ll extract these columns from the metabric_patients DataFrame and assign them to variables for easy plotting.
gata3 = metabric_patients.loc[:, "GATA3"]
esr1 = metabric_patients.loc[:, "ESR1"]
If you see a KeyError: 'GATA3' (or similar), the exact column name may differ. Check available columns:
list(metabric_patients.columns)
['Cohort', 'Age_at_diagnosis', 'Survival_time', 'Survival_status', 'Vital_status', 'Chemotherapy', 'Radiotherapy', 'Tumour_size', 'Tumour_stage', 'Neoplasm_histologic_grade', 'Lymph_nodes_examined_positive', 'Lymph_node_status', 'Cancer_type', 'ER_status', 'PR_status', 'HER2_status', 'HER2_status_measured_by_SNP6', 'PAM50', '3-gene_classifier', 'Nottingham_prognostic_index', 'Cellularity', 'Integrative_cluster', 'Mutation_count', 'ESR1', 'ERBB2', 'PGR', 'TP53', 'PIK3CA', 'GATA3', 'FOXA1', 'MLPH']
If the dataset uses a different case (e.g., gata3), adjust the code accordingly.
Next, we are going to create our first scatter plot with plt.scatter.
plt.scatter(gata3, esr1)
A scatter plot visualises the relationship between two numeric variables by placing one on the x-axis and the other on the y-axis. Each point corresponds to a single patient. If the points show an upward trend, it suggests the variables increase together; a downward trend indicates that as one increases, the other tends to decrease.
In our Jupyter Notebook example, running the cell should generate the figure directly below the code. The figure is also included in the Notebook document for future viewing. However, other Python environments like an interactive Python session started from a terminal or a Python script executed via the command line require an additional command to display the figure.
Instruct matplotlib to show a figure:
plt.show()
This command can also be used within a Notebook, for instance to display multiple figures if several are created by a single cell.
Customising Plots
Matplotlib gives you fine control over the appearance of your plots, making it possible to create figures suitable for presentations and publications. This section explains some useful features to improve your scatter plots.
Add Axis Labels
Our scatter plot is informative, but it’s not yet self-explanatory. To make it clear which gene is represented on each axis, let’s add axis labels and a descriptive title.
plt.scatter(gata3, esr1)
plt.xlabel("GATA3 expression")
plt.ylabel("ESR1 expression")
plt.title("GATA3 vs ESR1 (Metabric patients)")
Labels and a title turn a picture into a figure that others (and future you) can understand without extra context. Each command modifies the current figure and axes created by plt.scatter(...). Here, plt.xlabel(...) and plt.ylabel(...) set the text under the x‑axis and to the left of the y‑axis, and plt.title(...) adds a descriptive title above the plot area.
You can also control the look of these texts, for example the font size:
plt.scatter(gata3, esr1)
plt.xlabel("GATA3 expression", fontsize=11)
plt.ylabel("ESR1 expression", fontsize=11)
plt.title("GATA3 vs ESR1 (Metabric patients)", fontsize=12)
Change Point Shape
You can change the shape of the points using the marker argument in plt.scatter. Common marker shapes include:
- 'o': circle (default)
- 's': square
- '^': triangle up
- 'v': triangle down
- 'D': diamond
- 'x': x
- '+': plus
For example, to use x:
plt.scatter(gata3, esr1, marker='x')
plt.xlabel("GATA3 expression")
plt.ylabel("ESR1 expression")
plt.title("GATA3 vs ESR1 (x markers)")
Try different marker styles to see which best fits your data and audience.
Adding Color, Size and Transparency
The above plot can be made clearer by adjusting the point color, size and transparency. This helps reduce overplotting and makes patterns easier to see.
plt.scatter(gata3, esr1, alpha=0.6, s=14, color="steelblue")
plt.xlabel("GATA3 expression")
plt.ylabel("ESR1 expression")
plt.title("GATA3 vs ESR1 (Metabric patients)")
Here, alpha controls transparency from 0 (fully transparent) to 1 (fully opaque). Values around 0.4–0.7 help with dense clouds of points by reducing overplotting. The s argument sets the marker area in points squared (pt²). Typical values for scatter plots range from 8 to 30; use smaller values for many points, and larger values for sparse data or when exporting small figures.
The color argument (or c) sets the marker colour. Matplotlib offers a wide range of options for customizing the color of your plot markers. You can specify colors using named strings (like "red" or "steelblue"), hexadecimal codes (such as "#1f77b4"), or even RGB tuples. For a full list of available color names and formats, see the Matplotlib color documentation. Experimenting with different colors can help highlight important patterns or make your plots more visually appealing.
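As a small illustration of these formats, the sketch below writes the same colour three ways and converts each to a hex string with matplotlib.colors.to_hex. The RGB tuple was chosen by hand to match "steelblue"; it is not something the lesson data provides.

```python
from matplotlib import colors

# Three ways to write the same colour
named = "steelblue"
hex_code = "#4682b4"
rgb_tuple = (70 / 255, 130 / 255, 180 / 255)  # components between 0 and 1

# to_hex converts any accepted colour specification to a hex string
as_hex = [colors.to_hex(c) for c in (named, hex_code, rgb_tuple)]
print(as_hex)
```

Any of the three forms can be passed to the color argument of plt.scatter.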
You can also add a subtle outline to each point to improve separation where points overlap, especially when markers are light on a light background:
plt.scatter(
    gata3, esr1,
    s=16,
    alpha=0.6,
    color="steelblue",
    edgecolor="blue",  # thin blue outline
    linewidths=0.3     # width of the outline
)
plt.xlabel("GATA3 expression")
plt.ylabel("ESR1 expression")
plt.title("GATA3 vs ESR1 (Metabric patients)")
Colour by a third variable
If you have another numeric column (for example, ESR1 itself), you can map values to colour with a colormap:
# Example: colour points by ESR1 value using the 'viridis' colormap
plt.scatter(gata3, esr1, c=esr1, cmap="viridis", s=14, alpha=0.6)
plt.xlabel("GATA3 expression")
plt.ylabel("ESR1 expression")
plt.title("GATA3 vs ESR1 coloured by ESR1")
plt.colorbar(label="ESR1")  # add a colourbar legend
We pass the esr1 column to c= and choose the colormap with the cmap argument. Matplotlib maps low values to one end of the colormap and high values to the other, and plt.colorbar(...) adds a colourbar as a key. Use a sequential colormap (e.g., viridis, plasma, inferno) for magnitudes, and a diverging colormap (e.g., coolwarm, RdBu_r) for variables with a meaningful center (such as zero or a reference value), like log2 fold-change or differences from a baseline.
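For the diverging case, one possible approach is to make the colour limits symmetric so that the reference value sits at the midpoint of the colormap. The sketch below uses randomly generated numbers rather than the Metabric data; the variable names (x, y, diff, limit) are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)   # made-up x values
y = rng.normal(size=200)   # made-up y values
diff = y - x               # made-up quantity centred near zero

# Symmetric limits so that zero falls at the midpoint of the colormap
limit = float(np.abs(diff).max())
points = plt.scatter(x, y, c=diff, cmap="coolwarm", vmin=-limit, vmax=limit, s=14)
plt.colorbar(points, label="difference (y - x)")
```

Without vmin and vmax, the colormap would stretch between the smallest and largest values, and zero would not necessarily land on the neutral colour.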
Control Axis Limits
By default, Matplotlib “autoscales” to fit your data. Manually setting limits is helpful when you:
- Compare multiple figures or panels and want the same scale on each.
- Focus on a region of interest (e.g., crop away extreme outliers for clarity).
- Make slopes and relative differences easier to interpret across plots.
You can set the limits explicitly with plt.xlim(...) and plt.ylim(...), for example to a fixed range that comfortably contains the data:
plt.scatter(gata3, esr1, alpha=0.6, s=14, color="steelblue")
plt.xlabel("GATA3 expression")
plt.ylabel("ESR1 expression")
plt.title("GATA3 vs ESR1 (Metabric patients)")
plt.xlim(0, 20)
(0.0, 20.0)
plt.ylim(0, 20)
(0.0, 20.0)
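Limits can also be derived from the data range with a small padding instead of being fixed by hand. The helper below, padded_limits, is a hypothetical function written for this sketch (it is not part of Matplotlib), and the values are made up.

```python
import pandas as pd
import matplotlib.pyplot as plt

def padded_limits(values, fraction=0.05):
    """Return (low, high) limits that extend the data range by a fraction on each side."""
    low, high = values.min(), values.max()
    pad = (high - low) * fraction
    return low - pad, high + pad

# Made-up expression values standing in for a real column
x = pd.Series([0.0, 2.5, 7.5, 10.0])
limits = padded_limits(x)

plt.scatter(x, x)
plt.xlim(*limits)
print(limits)
```

Applying the same helper to several columns is one way to keep a consistent margin across panels.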
If your variables are on the same scale and you want geometry to be visually comparable, set an equal aspect ratio:
plt.scatter(gata3, esr1, alpha=0.6, s=14, color="steelblue")
plt.xlabel("GATA3 expression")
plt.ylabel("ESR1 expression")
plt.title("GATA3 vs ESR1 (Metabric patients)")
plt.axis('equal')  # 1 unit on x equals 1 unit on y
(4.90100352505, 13.18880009595, 4.81484042185, 13.66758091515)
For data spanning orders of magnitude, log scales can reveal structure:
plt.scatter(gata3, esr1, alpha=0.6, s=14, color="steelblue")
plt.xlabel("GATA3 expression")
plt.ylabel("ESR1 expression")
plt.title("GATA3 vs ESR1 (Metabric patients)")
plt.xscale('log')
plt.yscale('log')
Note: log scales require strictly positive values. Filter or shift data if needed before applying.
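Filtering can be done with a boolean mask before plotting. This is a minimal sketch with made-up numbers; whether dropping or shifting values is appropriate depends on the data.

```python
import pandas as pd

# Made-up values including a zero and a negative number
x = pd.Series([0.0, 0.5, 10.0, 200.0, -1.0])
y = pd.Series([1.0, 2.0, 0.0, 40.0, 5.0])

# Keep only the pairs where both coordinates are strictly positive
positive = (x > 0) & (y > 0)
x_pos, y_pos = x[positive], y[positive]

print(x_pos.tolist(), y_pos.tolist())
```

The filtered x_pos and y_pos can then be passed to plt.scatter before switching to log scales.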
Add Grid Lines
Grid lines make it easier to read values across from the axes, especially in dense scatter plots and categorical charts. You can toggle and style them with plt.grid(...).
Basic usage on our scatter plot:
plt.scatter(gata3, esr1, alpha=0.6, s=14, color="steelblue")
plt.xlabel("GATA3 expression")
plt.ylabel("ESR1 expression")
plt.title("GATA3 vs ESR1 with grid")
# Turn on light dashed grid lines
plt.grid(True, which="major", axis="both", linestyle="--", linewidth=0.5, alpha=0.4)
Tips and variants:
- Only y-axis grid (great for bar/box plots):
plt.grid(True, axis="y", linestyle="--", alpha=0.5)
- Add minor ticks and a faint minor grid:
plt.minorticks_on()
plt.grid(True, which="minor", linestyle=":", linewidth=0.4, alpha=0.2)
Best practice: keep grid lines subtle (low alpha, thin line, neutral colour) so they guide the eye without competing with the data. Some styles like 'seaborn-v0_8-whitegrid' enable a tasteful grid automatically.
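Another way to keep the grid from competing with the data is to draw it underneath the markers. As far as I understand, Matplotlib's default settings can place grid lines on top of scatter markers, and set_axisbelow(True) on the current axes moves ticks and grid lines below everything else. The points in this sketch are made up.

```python
import matplotlib.pyplot as plt

plt.scatter([1, 2, 3], [2, 1, 3], s=200, color="steelblue")  # made-up points
plt.grid(True, linestyle="--", linewidth=0.5, alpha=0.4)

# Get the axes that plt.scatter drew on and push the grid below other elements
ax = plt.gca()
ax.set_axisbelow(True)
```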
Adding Style
Styles control the overall appearance of the plot, including colours, gridlines, fonts, background color and text styles. Matplotlib supports several style options and you can also create custom themes to match your preferences or the requirements of your publication.
plt.style.use('seaborn-v0_8-whitegrid')
plt.scatter(gata3, esr1, alpha=0.6, s=14, color="steelblue")
plt.xlabel("GATA3 expression")
plt.ylabel("ESR1 expression")
plt.title("GATA3 vs ESR1 (Metabric patients)")
Other popular styles: 'ggplot', 'classic', 'bmh', 'fivethirtyeight'. See all available styles:
print(plt.style.available)
['Solarize_Light2', '_classic_test_patch', '_mpl-gallery', '_mpl-gallery-nogrid', 'bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn-v0_8', 'seaborn-v0_8-bright', 'seaborn-v0_8-colorblind', 'seaborn-v0_8-dark', 'seaborn-v0_8-dark-palette', 'seaborn-v0_8-darkgrid', 'seaborn-v0_8-deep', 'seaborn-v0_8-muted', 'seaborn-v0_8-notebook', 'seaborn-v0_8-paper', 'seaborn-v0_8-pastel', 'seaborn-v0_8-poster', 'seaborn-v0_8-talk', 'seaborn-v0_8-ticks', 'seaborn-v0_8-white', 'seaborn-v0_8-whitegrid', 'tableau-colorblind10']
You can apply a style globally with plt.style.use(...), or only for a few lines using a context manager so the style doesn’t “leak” into later figures:
with plt.style.context('ggplot'):
    plt.scatter(gata3, esr1, s=14, alpha=0.6)
    plt.xlabel("GATA3 expression")
    plt.ylabel("ESR1 expression")
    plt.title("GATA3 vs ESR1 (Metabric patients)")
To go back to Matplotlib defaults after changing styles globally, run:
plt.style.use('default')
Experiment with these features to make your plots clear, attractive, and ready for publication. For more options, see the Matplotlib gallery.
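To check that a style applied through the context manager does not leak, you can compare a setting before, inside, and after the with block. The sketch below inspects plt.rcParams["axes.facecolor"]; picking that particular setting is just an illustrative choice.

```python
import matplotlib.pyplot as plt

before = plt.rcParams["axes.facecolor"]

with plt.style.context("ggplot"):
    inside = plt.rcParams["axes.facecolor"]  # value while the ggplot style is active

after = plt.rcParams["axes.facecolor"]
print(before, inside, after)
```

The value inside the block differs from the one outside, and the original setting is back once the block ends.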
Save Figures to File
Saving your figure creates a file you can share or include in reports and manuscripts.
To save your plot as a PNG or PDF for publication:
plt.savefig("scatter_gata3_esr1.png", dpi=300)
The code above uses plt.savefig() from Matplotlib to save the current figure to a file named "scatter_gata3_esr1.png" with a resolution of 300 dots per inch (dpi), which is suitable for high-quality publications. You can change the file extension to save in different formats. For example, to save the same figure as a PDF, use:
plt.savefig("scatter_gata3_esr1.pdf")
Bar Charts
Bar charts are useful for comparing values across categories.
The Metabric study redefined how we think about breast cancer by identifying and characterizing several new subtypes, referred to as integrative clusters. Let’s create a bar chart of the number of patients whose cancers fall within each subtype in the Metabric cohort.
Start with the simplest plot: compute counts and draw bars.
# Step 1: select the categorical column
int_clust = metabric_patients.loc[:, "Integrative_cluster"]
# Step 2: count how many patients fall into each category
counts = int_clust.value_counts()
# Step 3: draw a basic bar chart (no labels yet)
plt.bar(counts.index, counts.values)
We used value_counts() to get the number of rows in each category. Then, we passed category names to the x‑axis and counts to the bar heights.
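To see what value_counts() returns, here is a small sketch on a made-up Series of cluster labels (seven hypothetical patients, not the real cohort). It also shows the normalize=True option, which gives proportions instead of counts.

```python
import pandas as pd

# Made-up cluster labels for seven patients
labels = pd.Series(["3", "8", "3", "10", "8", "3", "4ER+"])

counts = labels.value_counts()                   # absolute counts, most frequent first
fractions = labels.value_counts(normalize=True)  # proportions that sum to 1

print(counts)
print(fractions)
```

The index of the result holds the category names and the values hold the counts, which is why counts.index and counts.values can be passed straight to plt.bar.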
We can then make the figure self-explanatory by adding axis labels, a title, and readable tick labels.
plt.bar(counts.index, counts.values)
plt.xlabel("Integrative cluster")
plt.ylabel("Number of patients")
plt.title("Patient counts by integrative cluster")
plt.xticks(rotation=30, ha='right')  # rotate long labels so they don’t overlap
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [Text(0, 0, '8'), Text(1, 0, '3'), Text(2, 0, '4ER+'), Text(3, 0, '10'), Text(4, 0, '5'), Text(5, 0, '7'), Text(6, 0, '9'), Text(7, 0, '1'), Text(8, 0, '6'), Text(9, 0, '4ER-'), Text(10, 0, '2')])
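By default, value_counts() orders the categories by frequency (highest first), so the bars appear in that order. You may prefer alphabetical order (easier to look up a category) or a custom, biologically meaningful order. Note that the cluster labels are strings, so sort_index() sorts them as text and "10" comes directly after "1".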
counts_alpha = counts.sort_index()
plt.bar(counts_alpha.index, counts_alpha.values)
plt.xlabel("Integrative cluster")
plt.ylabel("Number of patients")
plt.title("Patient counts by cluster (alphabetical)")
plt.xticks(rotation=30, ha='right')
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [Text(0, 0, '1'), Text(1, 0, '10'), Text(2, 0, '2'), Text(3, 0, '3'), Text(4, 0, '4ER+'), Text(5, 0, '4ER-'), Text(6, 0, '5'), Text(7, 0, '6'), Text(8, 0, '7'), Text(9, 0, '8'), Text(10, 0, '9')])
plt.tight_layout()
Custom order (only the labels present will be plotted, in this order):
desired = [
"1", "2", "3", "4ER-", "4ER+", "5", "6", "7", "8", "9", "10"
]
# Keep only categories that exist in our data, and reorder
present = [lab for lab in desired if lab in counts.index]
counts_custom = counts.reindex(present)
plt.bar(counts_custom.index, counts_custom.values)
plt.xlabel("Integrative cluster")
plt.ylabel("Number of patients")
plt.title("Patient counts by cluster (custom order)")
plt.xticks(rotation=30, ha='right')
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [Text(0, 0, '1'), Text(1, 0, '2'), Text(2, 0, '3'), Text(3, 0, '4ER-'), Text(4, 0, '4ER+'), Text(5, 0, '5'), Text(6, 0, '6'), Text(7, 0, '7'), Text(8, 0, '8'), Text(9, 0, '9'), Text(10, 0, '10')])
plt.tight_layout()
Box Plot
Box plots (or box & whisker plots) are a particular favourite seen in many seminars and papers. Box plots summarize the distribution of a set of values by displaying the median (i.e. the middle-ranked value) and the range of the middle 50% of values (the inter-quartile range, IQR, from Q1 to Q3). By default in Matplotlib, the whiskers extend above and below the box to the most extreme data points that lie within Q3 + (1.5 x IQR) and Q1 - (1.5 x IQR) respectively; points beyond the whiskers are drawn individually as outliers.
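The quantities behind a box plot can be worked out by hand. The sketch below does this with NumPy for a made-up set of nine values containing one extreme point; it follows the 1.5 x IQR rule, which to my understanding is what plt.boxplot uses unless told otherwise.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 50])  # made-up data with one extreme value

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
# Anything outside the fences is drawn as an individual outlier point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, q3, whisker_low, whisker_high, outliers)
```

Here the upper whisker ends at 8 rather than at the upper fence of 13, because 8 is the largest observation inside the fence, and 50 is shown as an outlier.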
Box plots are great for comparing distributions across categories. Here we’ll compare GATA3 expression between ER‑negative and ER‑positive patients.
# Select all ER status positive patients
mask_pos = metabric_patients.loc[:, "ER_status"] == "Positive"
pos = metabric_patients.loc[mask_pos, "GATA3"]
# Select all ER status negative patients
mask_neg = metabric_patients.loc[:, "ER_status"] == "Negative"
neg = metabric_patients.loc[mask_neg, "GATA3"]
# tick_labels was called labels before Matplotlib 3.9
plt.boxplot([neg, pos], tick_labels=["Negative", "Positive"])
plt.xlabel("ER status")
plt.ylabel("GATA3 expression")
plt.title("GATA3 by ER status")
plt.tight_layout()
Here, the first argument to plt.boxplot(...) is a list of arrays or sequences, each containing the values for one group (e.g., ER-negative and ER-positive patients). The tick_labels argument assigns a label to each group on the x-axis. In Matplotlib versions before 3.9 this argument is called labels; the old name is deprecated from 3.9 and support for it will be dropped in 3.11.
A list is a way to store multiple items together in a single variable. In Python, you create a list by placing items inside square brackets, like this: [item1, item2, item3]. For example, [1, 2, 3] is a list containing the numbers 1, 2, and 3.
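When there are more than two categories, the list of groups can be built in a loop instead of one mask per category. This is a sketch on a made-up table with the same column names as the Metabric data; it labels the boxes with plt.xticks so that it does not depend on the name of the labelling argument of plt.boxplot.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up table with the same column names as the Metabric data
toy = pd.DataFrame({
    "ER_status": ["Positive", "Negative", "Positive", "Negative", "Positive", "Negative"],
    "GATA3": [10.2, 6.1, 9.8, 5.7, 11.0, 6.5],
})

# Build a list of category names and a matching list of value arrays
names = []
groups = []
for name, values in toy.groupby("ER_status")["GATA3"]:
    names.append(name)
    groups.append(values.to_numpy())

result = plt.boxplot(groups)
# Boxes sit at positions 1, 2, ...; put the category names there
plt.xticks(range(1, len(names) + 1), names)
```

groupby returns the categories in sorted order, so the boxes appear as Negative then Positive here.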
Line Plot
Line plots connect points in order and are ideal for trends over a continuous or ordered x‑axis. Here we’ll use Age at diagnosis.
# Select and sort ages (ascending) and plot cumulative count vs age
ages = metabric_patients.loc[:, "Age_at_diagnosis"]
ages_sorted = ages.sort_values().reset_index(drop=True)
# x = age, y = cumulative count (patient index after sorting)
plt.plot(ages_sorted.values, range(1, len(ages_sorted) + 1))
plt.xlabel("Age at diagnosis")
plt.ylabel("Patient index (sorted by age)")
plt.title("Age at diagnosis (sorted)")
plt.tight_layout()
Here, plt.plot(...) draws each age (on the x-axis) against its rank in the sorted list (on the y-axis) and joins the points with a line. range(1, len(ages_sorted) + 1) creates the sequence of integers from 1 up to the number of ages, which represents the rank of each age. The result is a cumulative frequency plot: the height of the line at a given age is the number of patients diagnosed at or below that age.
To order values from low to high, use sort_values(). After sorting, the original row labels stay attached; use reset_index(drop=True) to create a clean 0..N-1 index.
If you want the y-axis to show percentages, you’d need to normalize the ranks.
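One way to normalize is to divide each rank by the number of patients and multiply by 100. The ages in this sketch are made up; with the real column you may also need to handle missing values first, which is not shown here.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up ages standing in for the Age_at_diagnosis column
ages = pd.Series([61.2, 43.5, 75.0, 52.8, 68.1])
ages_sorted = ages.sort_values().reset_index(drop=True)

# Rank divided by the number of patients, expressed as a percentage
n = len(ages_sorted)
cumulative_pct = np.arange(1, n + 1) / n * 100

plt.plot(ages_sorted.values, cumulative_pct)
plt.xlabel("Age at diagnosis")
plt.ylabel("Cumulative percentage of patients")
```

The last point of the line is at 100%, since every patient is at or below the maximum age.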
To add colour and style:
plt.plot(ages_sorted.values, range(1, len(ages_sorted) + 1), color="green", linewidth=1.8, linestyle="-")
plt.xlabel("Age at diagnosis")
plt.ylabel("Patient index (sorted by age)")
plt.title("Age at diagnosis (styled line)")
plt.grid(linestyle=":", alpha=0.4)
We covered fundamental plots you’ll use often, but Matplotlib supports many more: histograms, kernel‑density curves, violin/strip plots, heatmaps, contour maps, time‑series with date axes, 3D plots, annotations, and much more. The best way to learn is to browse examples and adapt them to your data.
With these tools, you can iterate from a quick exploratory figure to a polished, publication‑ready visualisation that communicates your story clearly.
- Matplotlib plus Pandas cover most everyday plotting tasks.
- Choose plots based on whether your data is numeric or categorical.
- Small tweaks (labels, colours, grids, limits) make plots much easier to read.
- Save high‑quality figures and explore more examples in the Matplotlib docs.