library(tidyverse)
metabric <- read_csv("data/metabric/clinical_and_expression_data.csv")
Visualizing Data
Overview
The ggplot2 package simplifies the creation of plots from data frames. It is the next step in the tidyverse workflow.
This package offers a streamlined interface for defining variables to plot, configuring their display, and adjusting visual attributes. Consequently, adapting to changes in the data or transitioning between plot types requires only minimal modifications. This feature facilitates the creation of high-quality plots suitable for publication with minimal manual adjustments.
ggplot prefers data in the “long” format, where each dimension occupies a column and each observation corresponds to a row. Structuring data in this manner (discussed previously) makes it more efficient to generate figures with ggplot.
We will be using an extended version of the Metabric data set (from the assignment) in which columns have been added for the mRNA expression values for selected genes, including estrogen receptor alpha (ESR1), progesterone receptor (PGR), GATA3 and FOXA1.
Building a Basic Plot
The construction of ggplot graphics is incremental, allowing for the addition of new elements in layers. This approach grants users extensive flexibility and customization options, enabling the creation of tailored plots to suit specific needs.
To build a ggplot, any of the following basic templates can be used for different types of plots. My preferred choice is the one highlighted in pink, which will be consistently used in subsequent examples.
Three things are required for a ggplot:
1. The data
We first specify the data frame that contains the relevant data to create a plot. Here we are sending the metabric dataset to the ggplot() function.
# render plot background
metabric |> ggplot()
This command results in an empty gray panel. We must specify how various columns of the data frame should be depicted in the plot.
2. Aesthetics aes()
Next, we specify the columns in the data that we want to map to visual properties (called aesthetics, or aes in ggplot2), e.g. the columns for x values, y values and colours.
Since we are interested in generating a scatter plot, each point will have an x and a y coordinate. Therefore, we need to specify that the x-axis represents GATA3 expression and the y-axis represents ESR1 expression.
metabric |> ggplot(aes(x = GATA3, y = ESR1))
This results in a plot which includes the grid lines, the variables and the scales for the x and y axes. However, the plot is still empty: it contains no data points.
3. Geometric Representation geom_()
Finally, we specify the type of plot (the geom). There are many different geoms, for example geom_point() for scatter plots, geom_bar() for bar charts, geom_boxplot() for box plots and geom_histogram() for histograms.
The range of geoms available in ggplot2 can be obtained by navigating to the ggplot2 package in the Packages tab pane in RStudio (bottom right-hand corner) and scrolling down the alphabetically sorted list of functions to the geom_... functions.
Since we are interested in creating a scatter plot, the geometric representation of the data will be in point form. Therefore we use the geom_point() function.
To plot the expression of estrogen receptor alpha (ESR1) against that of the transcription factor, GATA3:
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point()
Notice that we use the + sign to add a layer of points to the plot. This concept resembles Adobe Photoshop, where layers of images can be rearranged and edited independently. In ggplot, each layer is drawn over the plot in the order in which it is added in the code using the + sign.
|> and +
The ggplot2 package was developed before the pipe operator was introduced. In ggplot2, the + sign functions analogously to the pipe operator in other tidyverse functions, enabling code to be written from left to right.
Customizing Plots
Adding Colour
The above plot could be made more informative. For instance, the additional information regarding the ER status (i.e., the ER_IHC column) could be incorporated into the plot. To do this, we can use aes() and specify which column in the metabric data frame should be represented as the colour of the points.
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = ER_IHC))
Notice that we specify the colour = ER_IHC argument in the aes() mapping inside the geom_point() function instead of the ggplot() function. Aesthetic mappings can be set both in ggplot() and in individual geom layers, and we will discuss the difference in the Section: Adding Layers.
To colour points based on a continuous variable, for example the Nottingham prognostic index (NPI):
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = NPI))
In ggplot2, a continuous colour scale (a gradient) is used for continuous variables, while discrete or categorical values are represented using distinct colours.
Note that some patient samples lack expression values, so ggplot2 removes the points with missing values for ESR1 or GATA3.
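To see how many points would be dropped for this reason, the rows with a missing value in either plotted column can be counted beforehand. The sketch below is only an illustration: it uses a small made-up data frame (`toy`, with invented values) rather than the Metabric data.

```r
# hypothetical expression values; NA marks a missing measurement
toy <- data.frame(
  GATA3 = c(6.9, NA, 9.3, 8.7),
  ESR1  = c(8.9, 10.0, NA, 10.4)
)

# rows that are incomplete in the two plotted columns cannot be drawn
n_removed <- sum(!complete.cases(toy[, c("GATA3", "ESR1")]))
n_removed
```

The same idea applied to `metabric[, c("GATA3", "ESR1")]` should give the number of rows a scatter plot of those two columns cannot display.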
Adding Shape
Let’s add shape to points.
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(shape = THREEGENE))
Warning: Removed 209 rows containing missing values or values outside the scale range
(`geom_point()`).
Note that some patient samples have not been classified and ggplot has removed those points with missing values for the three-gene classifier.
Some aesthetics like shape can only be used with categorical variables:
metabric |> ggplot() +
  geom_point(aes(x = GATA3, y = ESR1, shape = SURVIVAL_TIME))
Error in `geom_point()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error in `scale_f()`:
! A continuous variable cannot be mapped to the shape aesthetic.
ℹ Choose a different aesthetic or use `scale_shape_binned()`.
The shape argument allows you to customize the appearance of all data points by assigning an integer associated with predefined shapes shown below:
To use asterisks instead of points in the plot:
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(shape = 8)
It would be useful to be able to change the shape of all the points. We can do so by setting the shape to a single value rather than mapping it to one of the variables in the data set. This has to be done outside the aesthetic mappings (i.e. outside the aes() bit), as above.
Instead of mapping an aesthetic property to a variable, you can set it to a single value by specifying it in the layer parameters (outside aes()). We map an aesthetic to a variable (e.g., aes(shape = THREEGENE)) or set it to a constant (e.g., shape = 8). If you want appearance to be governed by a variable in your data frame, put the specification inside aes(); if you want to override the default size or colour, put the value outside of aes().
# shape outside aes()
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(shape = 8)
# shape inside aes()
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(shape = THREEGENE))
Warning: Removed 209 rows containing missing values or values outside the scale range
(`geom_point()`).
The above plots are created with similar code, but have rather different outputs. The first plot sets the shape to a single value and the second plot maps (not sets) the shape to the three-gene classifier variable.
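The difference between setting and mapping can also be checked on a small example. The sketch below is illustrative only: it uses a made-up data frame (`toy`, with hypothetical columns `x`, `y` and `group`) instead of the Metabric data, and inspects the shapes ggplot2 computed for each point with layer_data().

```r
library(ggplot2)

# hypothetical toy data, not part of the Metabric data set
toy <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(2, 4, 1, 3),
  group = c("a", "a", "b", "b")
)

# shape set to a constant: every point is drawn with shape 8
p_set <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(shape = 8)

# shape mapped to a variable: one shape per category
p_map <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(shape = group))

# distinct shapes actually used in each plot
shapes_set <- unique(layer_data(p_set)$shape)
shapes_map <- unique(layer_data(p_map)$shape)
```

Here `shapes_set` holds the single value that was set, while `shapes_map` holds one shape per category of `group`, chosen by the default shape scale.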
It is usually preferable to use colours to distinguish between different categories but sometimes colour and shape are used together when we want to show which group a data point belongs to in two different categorical variables.
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = CLAUDIN_SUBTYPE, shape = THREEGENE))
Warning: Removed 209 rows containing missing values or values outside the scale range
(`geom_point()`).
Adding Size and Transparency
We can adjust the size and/or transparency of the points.
Let’s first increase the size of points.
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = CLAUDIN_SUBTYPE), size = 2)
Note that here we add the size argument outside of the aesthetic mapping.
Size is not usually a good aesthetic to map to a discrete variable, and ggplot2 warns against doing so.
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = CLAUDIN_SUBTYPE, size = ER_IHC))
Warning: Using size for a discrete variable is not advised.
Because this value is discrete, the default size scale uses evenly spaced sizes for points categorized on ER status.
Transparency can be useful when we have a large number of points, as we can more easily tell when points are overlaid. Like size, however, it is not usually mapped to a variable and sits outside the aes().
Let’s change the transparency of points.
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = THREEGENE), alpha = 0.5)
Adding Layers
We can add another layer to this plot using a different geometric representation (or geom_ function) from those we discussed previously.
Let’s add a trend line to this plot using the geom_smooth() function, which provides a summary of the data.
metabric |> ggplot() +
  geom_point(aes(x = GATA3, y = ESR1)) +
  geom_smooth(aes(x = GATA3, y = ESR1))
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Note that the shaded area surrounding the blue line represents the confidence interval around the fitted curve (95% by default).
There is some annoying duplication of code used to create this plot. We’ve repeated the exact same aesthetic mapping for both geoms. We can avoid this by putting the mappings in the ggplot() function instead.
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point() +
  geom_smooth()
Geom layers specified earlier in the command are drawn first, preceding subsequent geom layers. The sequence of geom layers specified in the command determines their order of appearance in the plot.
If you switch the order of the geom_point() and geom_smooth() functions above, the fitted line itself does not change, but it will now be drawn underneath the points.
Let’s make the plot look a bit prettier by reducing the size of the points and making them transparent. We’re not mapping size or alpha to any variables, just setting them to constant values, and we only want these settings to apply to the points, so we set them inside geom_point().
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Aesthetic mappings can be provided either in the initial ggplot() call, in individual layers, or through a combination of both approaches. When there’s only one layer in the plot, the method used to specify aesthetics doesn’t impact the result.
# colour argument inside ggplot()
metabric |> ggplot(aes(x = GATA3, y = ESR1, colour = ER_IHC)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth()
# colour argument inside geom_point()
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = ER_IHC), size = 0.5, alpha = 0.5) +
  geom_smooth()
In the left plot, since we specified the colour (i.e., colour = ER_IHC) inside the ggplot() function, the geom_smooth() function fits a separate regression line for each type of ER status and colours the lines accordingly, as shown above. This is because aesthetic mappings defined in ggplot(), at the global level, are passed down to each of the subsequent geom layers of the plot.
If we want to add colour only to the points and fit a single regression line across all points, we can specify the colour inside the geom_point() function (i.e., right plot).
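The effect of global versus layer-level mappings can be verified by counting how many separate trend lines the smoothing layer fits. This is a sketch under a few assumptions: it uses R's built-in mtcars data rather than Metabric, and a straight-line fit (method = "lm") instead of the default smoother so that it runs quickly on a small data set.

```r
library(ggplot2)

# colour mapped globally: inherited by both geom_point() and geom_smooth()
p_global <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# colour mapped only in geom_point(): geom_smooth() sees one group
p_local <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl))) +
  geom_smooth(method = "lm", formula = y ~ x)

# number of separate lines fitted by the second layer of each plot
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_local <- length(unique(layer_data(p_local, 2)$group))
```

With the global mapping there is one line per cylinder category; with the layer-level mapping a single line is fitted across all points.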
Suppose you’ve spent a bit of time getting your scatter plot just right and decide to add another layer, but you’re a bit worried about interfering with the code you so lovingly crafted. In that case, you can set the inherit.aes option to FALSE and set the aesthetic mappings explicitly for your new layer.
metabric |> ggplot(aes(x = GATA3, y = ESR1, colour = ER_IHC)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(aes(x = GATA3, y = ESR1), inherit.aes = FALSE)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Coordinate Space
ggplot automatically selects the scale and type of coordinate space for each axis. The majority of plots use Cartesian coordinate space, characterized by linear x and y scales.
We can change the axes limits as follows:
# assign the plot to a variable
gata_esrp <- metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = ER_IHC), size = 0.5, alpha = 0.5) +
  geom_smooth()
# change both x and y axes
gata_esrp + lims(x = c(0, 13), y = c(0, 14))
# change x axis
gata_esrp + xlim(0, NA)
# change y axis
gata_esrp + ylim(0, 13)
When modifying the x-axis limit above, we assigned the upper limit as NA. You can leave one value as NA if you wish to calculate the corresponding limit from the range of the data.
Notice that we assigned our plot to a variable named gata_esrp and then modified it by adding axis limits. In ggplot, you have the flexibility to assign a plot to a variable and then modify it by adding further layers or settings. This approach allows you to progressively build up your visualization, incorporating various elements to convey the desired information effectively.
lims()/xlim()/ylim() vs. coord_cartesian()
When you set the limits using any of the lims()/xlim()/ylim() functions, all data points outside the specified range are discarded. Consequently, the regression line is computed from the remaining data points only. In contrast, coord_cartesian() adjusts the limits without discarding any data, thus offering a purely visual zoom effect.
gata_esrp + ylim(7, 10)
gata_esrp + coord_cartesian(ylim = c(7, 10))
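The difference can be made concrete by counting how many usable y values each approach keeps. The following is a minimal sketch on a made-up data frame (`toy`, ten hypothetical points), not on the Metabric data; it inspects the data ggplot2 prepares for the point layer with layer_data().

```r
library(ggplot2)

# hypothetical toy data: ten points on a straight line
toy <- data.frame(x = 1:10, y = 1:10)
p <- ggplot(toy, aes(x = x, y = y)) + geom_point()

# scale limits: values outside the range are discarded before plotting
p_lim <- p + ylim(3, 8)

# coordinate limits: all values are kept, only the visible window changes
p_zoom <- p + coord_cartesian(ylim = c(3, 8))

# number of points that still carry a usable y value
n_kept_lim <- sum(is.finite(layer_data(p_lim)$y))
n_kept_zoom <- sum(is.finite(layer_data(p_zoom)$y))
```

Only the points with y between 3 and 8 survive the scale limits, whereas all ten remain available under coord_cartesian(), which is why a smoother fitted after ylim() can differ from one fitted after zooming.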
Axis Labels
By default, ggplot uses the column names specified inside aes() as the axis labels. We can change this using the xlab() and ylab() functions.
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = ER_IHC), size = 0.5, alpha = 0.5) +
  geom_smooth() +
  xlab("GATA3 Expression") +
  ylab("ESR1 Expression")
Customizing Plots
You can customize plots to include a title, a subtitle, a caption or a tag.
To add a title and/or subtitle:
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = ER_IHC), size = 0.5, alpha = 0.5) +
  geom_smooth() +
  ggtitle(
    label = "Expression of estrogen receptor alpha against the transcription factor",
    subtitle = "ESR1 vs GATA3")
We can use the labs() function to add a title and additional information.
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = ER_IHC), size = 0.5, alpha = 0.5) +
  geom_smooth() +
  labs(
    title = "Expression of estrogen receptor alpha against the transcription factor",
    subtitle = "ESR1 vs GATA3",
    caption = "This is a caption",
    tag = "Figure 1",
    y = "ESR1 Expression")
Themes
Themes control the overall appearance of the plot, including background color, grid lines, axis labels, and text styles. ggplot offers several built-in themes, and you can also create custom themes to match your preferences or the requirements of your publication. The default theme has a grey background.
gata_esrp <- metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = ER_IHC), size = 0.5, alpha = 0.5) +
  geom_smooth()
gata_esrp + theme_bw()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Try these themes yourselves: theme_classic(), theme_dark(), theme_grey() (default), theme_light(), theme_linedraw(), theme_minimal(), theme_void() and theme_test().
Facets
To enhance readability and clarity, we can break the above plot into sub-plots, a technique called faceting. Facets are commonly used to split a plot into multiple panels based on the values of one or more variables. This can be useful for exploring relationships in the data across different subsets or categories.
To do this, we use the tilde symbol ~ to specify the column name that will form each facet.
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = PR_STATUS), size = 0.5, alpha = 0.5) +
  geom_smooth() +
  facet_wrap(~ PR_STATUS)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Note that the aesthetics and geoms including the regression line that were specified for the original plot, are applied to each of the facets.
Alternatively, the variable(s) used for faceting can be specified using vars().
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = PR_STATUS), size = 0.5, alpha = 0.5) +
  facet_wrap(vars(PR_STATUS))
Faceting is usually better than displaying groups using different colours when there are more than two or three groups, as it can then be difficult to tell which points belong to each group. A case in point is the three-gene classification in the GATA3 vs ESR1 scatter plot we created above. Let’s create a faceted version of that plot.
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = THREEGENE), size = 0.5, alpha = 0.5) +
  facet_wrap(vars(THREEGENE))
This helps explain why the function is called facet_wrap(). When it has too many subplots to fit across the page, it wraps around to another row. We can control how many rows or columns to use with the nrow and ncol arguments.
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = THREEGENE), size = 0.5, alpha = 0.5) +
  facet_wrap(vars(THREEGENE), nrow = 1)
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = THREEGENE), size = 0.5, alpha = 0.5) +
  facet_wrap(vars(THREEGENE), ncol = 2)
We can combine faceting on one variable with a colour aesthetic for another variable. For example, let’s show the tumour grade (neoplasm histologic grade) using faceting and the HER2 status using colours.
metabric |> ggplot(aes(x = GATA3, y = ESR1, colour = HER2_STATUS)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_wrap(vars(GRADE))
Instead of this, we could facet on more than one variable.
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_wrap(vars(GRADE, HER2_STATUS))
Faceting on two variables is usually better done using the other faceting function, facet_grid(). Note the change in how the faceting variables are written: the row and column variables are given as two separate arguments.
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_grid(vars(GRADE), vars(HER2_STATUS))
Again we can use colour aesthetics alongside faceting to add further information to our visualization.
metabric |> ggplot(aes(x = GATA3, y = ESR1, colour = CLAUDIN_SUBTYPE)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_grid(vars(GRADE), vars(HER2_STATUS))
Finally, we can use a labeller to change the labels for each of the categorical values so that these are more meaningful in the context of this plot.
grade_labels <- c("1" = "Grade I", "2" = "Grade II", "3" = "Grade III")
her2_status_labels <- c("Positive" = "HER2 positive", "Negative" = "HER2 negative")
metabric |> ggplot(aes(x = GATA3, y = ESR1, colour = CLAUDIN_SUBTYPE)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_grid(
    vars(GRADE),
    vars(HER2_STATUS),
    labeller = labeller(
      GRADE = grade_labels,
      HER2_STATUS = her2_status_labels
    )
  )
This would certainly be necessary if we were to use ER and HER2 status on one side of the grid.
er_status_labels <- c("Positive" = "ER positive", "Negative" = "ER negative")
metabric |> ggplot(aes(x = GATA3, y = ESR1, colour = CLAUDIN_SUBTYPE)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_grid(
    vars(GRADE),
    vars(ER_IHC, HER2_STATUS),
    labeller = labeller(
      GRADE = grade_labels,
      ER_IHC = er_status_labels,
      HER2_STATUS = her2_status_labels
    )
  )
Bar chart
The metabric study redefined how we think about breast cancer by identifying and characterizing several new subtypes, referred to as integrative clusters. Let’s create a bar chart of the number of patients whose cancers fall within each subtype in the metabric cohort.
geom_bar() is the geom used to plot bar charts. It requires a single aesthetic mapping of the categorical variable of interest to x.
metabric |> ggplot() +
  geom_bar(aes(x = INTCLUST))
The dark grey bars are a bit ugly. What if we want each bar to be a different colour?
metabric |> ggplot() +
  geom_bar(aes(x = INTCLUST, colour = INTCLUST))
Colouring the edges wasn’t quite what we had in mind. Look at the help for geom_bar to see what other aesthetic we should have used.
metabric |> ggplot() +
  geom_bar(aes(x = INTCLUST, fill = INTCLUST))
What happens if we colour (fill) with something other than the integrative cluster?
metabric |> ggplot() +
  geom_bar(aes(x = INTCLUST, fill = ER_IHC))
We get a stacked bar plot.
Note the similarity in what we did here to what we did with the scatter plot - there is a common grammar.
Let’s try another stacked bar plot, this time with a categorical variable with more than two categories.
metabric |> ggplot() +
  geom_bar(aes(x = INTCLUST, fill = THREEGENE))
We can rearrange the three gene groups into adjacent (dodged) bars by specifying a different position within geom_bar():
metabric |> ggplot() +
  geom_bar(aes(x = INTCLUST, fill = THREEGENE), position = 'dodge')
What if we want all the bars to be the same colour but not dark grey, e.g. blue?
metabric |> ggplot() +
  geom_bar(aes(x = INTCLUST, fill = "blue"))
That doesn’t look right - why not?
You can set the aesthetics to a fixed value but this needs to be outside the mapping, just like we did before for size and transparency in the scatter plots.
metabric |> ggplot() +
  geom_bar(aes(x = INTCLUST), fill = "blue")
Setting this inside the aes() mapping told ggplot2 to map the fill aesthetic to some variable in the data frame, one that doesn’t really exist but which is created on-the-fly with a value of “blue” for every observation.
You may have noticed that ggplot2 didn’t just plot values from our data set but had to do some calculation first for the bar chart, i.e. it had to count the number of observations in each category.
Each geom has a statistical transformation. In the case of the scatter plot, geom_point uses the “identity” transformation, which means just use the values as they are (i.e. not really a transformation at all). The statistical transformation for geom_bar is “count”, which means it will count the number of observations for each category in the variable mapped to the x aesthetic.
You can see which statistical transformation is being used by a geom by looking at the stat argument in the help page for that geom.
There are some circumstances where you’d want to change the stat, for example if we already had count values in our table.
# the previous plot
metabric |> ggplot() +
  geom_bar(aes(x = INTCLUST))
# same plot after computing counts and using the identity stat
counts <- metabric |> count(INTCLUST)
counts |> ggplot() +
  geom_bar(aes(x = INTCLUST, y = n), stat = "identity")
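As a quick check that the two routes give the same bars, the computed bar heights can be compared. This is only a sketch on a made-up data frame (`toy`, with a hypothetical `cluster` column); it also uses geom_col(), which is the ggplot2 geom for bars whose heights are already in the data and so plays the same role as geom_bar(stat = "identity").

```r
library(ggplot2)

# hypothetical categorical data
toy <- data.frame(cluster = c("a", "a", "a", "b", "b", "c"))

# geom_bar() with its default "count" stat counts the rows per category
p_count <- ggplot(toy) +
  geom_bar(aes(x = cluster))

# pre-computed counts, plotted as they are with geom_col()
counts <- as.data.frame(table(cluster = toy$cluster))
p_col <- ggplot(counts) +
  geom_col(aes(x = cluster, y = Freq))

# bar heights computed by each approach
heights_count <- layer_data(p_count)$y
heights_col <- layer_data(p_col)$y
```

Both vectors contain the same heights, one per category, so the two plots show identical bars.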
Box plot
Box plots (or box & whisker plots) are a particular favourite seen in many seminars and papers. Box plots summarize the distribution of a set of values by displaying the median (i.e. middle-ranked value) and the range of the middle 50% of values (the inter-quartile range, IQR, from Q1 to Q3). In ggplot2, the whiskers extend above and below the IQR box to the most extreme values that lie within Q3 + (1.5 x IQR) and Q1 - (1.5 x IQR) respectively; values beyond the whiskers are drawn as individual points.
To create a box plot from the Metabric dataset:
metabric |> ggplot(aes(x = ER_IHC, y = GATA3)) +
  geom_boxplot()
See the geom_boxplot help page for an explanation of how the box and whiskers are constructed and how it decides which points are outliers and should be displayed as points.
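The quantities behind a box plot can be worked out by hand. The sketch below uses base R on a small made-up vector and the default quantile() method; it is meant to illustrate the 1.5 x IQR rule rather than to reproduce every detail of what geom_boxplot() computes.

```r
# hypothetical values with one obvious outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# box: first and third quartiles
q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# fences: the furthest the whiskers are allowed to reach
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# whiskers stop at the most extreme observations inside the fences
lower_whisker <- min(x[x >= lower_fence])
upper_whisker <- max(x[x <= upper_fence])

# anything beyond the fences is shown as an individual point
outliers <- x[x < lower_fence | x > upper_fence]
```

In this example the value 30 lies beyond the upper fence, so it would be drawn as a separate point while the upper whisker stops at 9.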
How about adding another layer to display all the points?
metabric |> ggplot(aes(x = ER_IHC, y = GATA3)) +
  geom_boxplot() +
  geom_point()
Ideally, we’d like these points to be spread out a bit. The help page of the geom_point function points to geom_jitter as more suitable when one of the variables is categorical.
metabric |> ggplot(aes(x = ER_IHC, y = GATA3)) +
  geom_boxplot() +
  geom_jitter()
Well, that’s a bit of a mess. We can bring the geom_boxplot() layer forward:
metabric |> ggplot(aes(x = ER_IHC, y = GATA3)) +
  geom_jitter() +
  geom_boxplot(alpha = 0.5)
Still not the best plot. We can reduce the spread or jitter and make the points smaller and transparent:
metabric |> ggplot(aes(x = ER_IHC, y = GATA3)) +
  geom_boxplot() +
  geom_jitter(width = 0.3, size = 0.5, alpha = 0.25)
Displaying points in this way makes much more sense when we only have a few observations and where the box plot masks the fact, perhaps giving the false impression that the sample size is larger than it actually is. Here it makes less sense as we have very many observations.
Let’s try a colour aesthetic to also look at how GATA3 expression differs between HER2 positive and negative tumours.
metabric |> ggplot(aes(x = ER_IHC, y = GATA3, colour = HER2_STATUS)) +
  geom_boxplot()
Violin plot
A violin plot is used to visualize the distribution of a numeric variable across different categories. It combines aspects of a box plot and a kernel density plot.
The width of the violin at any given point represents the density of data at that point. Wider sections indicate a higher density of data points, while narrower sections indicate lower density. By default, violin plots are symmetric.
metabric |> ggplot(aes(y = GATA3, x = ER_IHC, colour = HER2_STATUS)) +
  geom_violin()
Inside each violin plot, a box plot is often included, showing additional summary statistics such as the median, quartiles, and potential outliers. This helps provide a quick overview of the central tendency and spread of the data within each category.
metabric |> ggplot(aes(y = GATA3, x = ER_IHC, colour = HER2_STATUS)) +
  geom_violin() +
  geom_boxplot(width = 0.8, alpha = 0.4)
In the above plot, the violin plots and box plots are misaligned. You can read the cause of this here.
To align them, we can use the position_dodge() function to manually adjust the horizontal position as follows.
metabric |> ggplot(aes(y = GATA3, x = ER_IHC, colour = HER2_STATUS)) +
  geom_violin(position = position_dodge(0.8)) +
  geom_boxplot(width = 0.8, alpha = 0.4)
Histogram
The geom for creating histograms is, rather unsurprisingly, geom_histogram().
metabric |> ggplot() +
  geom_histogram(aes(x = AGE_AT_DIAGNOSIS))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The message hints at picking a better bin size by specifying the binwidth argument.
metabric |> ggplot() +
  geom_histogram(aes(x = AGE_AT_DIAGNOSIS), binwidth = 5)
Or we can set the number of bins.
metabric |> ggplot() +
  geom_histogram(aes(x = AGE_AT_DIAGNOSIS), bins = 20)
These histograms are not very pleasing, aesthetically speaking - how about some better aesthetics?
metabric |> ggplot() +
  geom_histogram(
    aes(x = AGE_AT_DIAGNOSIS),
    bins = 20,
    colour = "darkblue",
    fill = "grey")
Density plot
Density plots are used to visualize the distribution of a continuous variable in a dataset. These are essentially smoothed histograms, where the area under the curve for each sub-group will sum to 1. This allows us to compare sub-groups of different size.
metabric |> ggplot() +
  geom_density(aes(x = AGE_AT_DIAGNOSIS, colour = INTCLUST))
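The claim that each density curve encloses an area of 1, whatever the group size, can be checked numerically. This is a rough sketch using base R's density() on simulated values (the groups and their sizes are invented), with the area approximated by summing curve height times grid spacing.

```r
set.seed(1)

# two hypothetical groups of very different size
small_group <- rnorm(50, mean = 60, sd = 10)
large_group <- rnorm(500, mean = 60, sd = 10)

area_under_curve <- function(values) {
  d <- density(values)
  # approximate the integral: sum of heights times the grid spacing
  sum(d$y) * diff(d$x[1:2])
}

area_small <- area_under_curve(small_group)
area_large <- area_under_curve(large_group)
```

Both areas come out very close to 1, which is what makes density curves for groups of different size directly comparable.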
Categorical variables – factors
Several of the variables in the Metabric data set are categorical. Some of these have been read into R as character types (e.g. the three gene classifier), others as numerical values (e.g. tumour stage). We also have some binary variables that are essentially categorical variables but with only 2 possible values (e.g. ER status).
In many of the plots given above, ggplot2 has treated character variables as categorical in situations where a categorical variable is expected. For example, when we displayed points on a scatter plot using different colours for each three gene classification, or when we created separate box plots in the same graph for ER positive and negative patients.
But what about when our categorical variable has been read into R as a continuous variable, e.g. TUMOR_STAGE, which is read in as a double type?
metabric |> ggplot() +
  geom_point(aes(x = GATA3, y = ESR1, colour = TUMOR_STAGE))
table(metabric$TUMOR_STAGE)
0 1 2 3 4
4 490 818 118 10
Tumour stage has only 5 discrete states but ggplot2 doesn’t know these are supposed to be a restricted set of values and has used a colour scale to show them as if they were continuous. We need to tell R that these are categorical (or factors).
Let’s convert our tumour stage variable to a factor using the as.factor() function.
metabric$TUMOR_STAGE <- as.factor(metabric$TUMOR_STAGE)
metabric |> select(PATIENT_ID, TUMOR_STAGE) |> head()
PATIENT_ID | TUMOR_STAGE |
---|---|
MB-0000 | 2 |
MB-0002 | 1 |
MB-0005 | 2 |
MB-0006 | 2 |
MB-0008 | 2 |
MB-0010 | 4 |
R actually stores categorical variables as integers but with some additional metadata about which of the integer values, or ‘levels’, corresponds to each category.
typeof(metabric$TUMOR_STAGE)
[1] "integer"
class(metabric$TUMOR_STAGE)
[1] "factor"
levels(metabric$TUMOR_STAGE)
[1] "0" "1" "2" "3" "4"
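The relationship between the integer codes and the levels is easy to see on a short vector. The values below are made up for illustration; the last line shows the usual way to recover the original numbers, since converting a factor straight to a number returns the codes instead.

```r
# hypothetical stage values stored as text
stage <- c("2", "1", "4", "2")
stage_factor <- as.factor(stage)

# the distinct categories, in sorted order
stage_levels <- levels(stage_factor)

# integer codes: the position of each value within the levels
codes <- as.integer(stage_factor)

# to get the original numbers back, convert via character first
values <- as.numeric(as.character(stage_factor))
```

Here the value "4" is stored as code 3 because it is the third level, which is why `codes` and `values` differ.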
metabric |> ggplot() +
  geom_point(aes(x = GATA3, y = ESR1, colour = TUMOR_STAGE))
In this case the order of the levels makes sense but for other variables you may wish for more control over the ordering. Take the integrative cluster variable for example. We created a bar plot of the numbers of patients in the Metabric cohort within each integrative cluster. Did you notice the ordering of the clusters? 10 came just after 1 and before 2. That looked a bit odd as we’d have naturally expected it to come last of all. R, on the other hand, is treating this vector as a character vector (mainly because of the ‘ER-’ and ‘ER+’ subtypes of cluster 4), and sorts the values into alphanumerical order.
metabric$INTCLUST <- as.factor(metabric$INTCLUST)
levels(metabric$INTCLUST)
[1] "1" "10" "2" "3" "4ER-" "4ER+" "5" "6" "7" "8"
[11] "9"
As discussed in Section: Factors, we can create a factor using the factor() function and specify the levels using the levels argument.
metabric$INTCLUST <- factor(metabric$INTCLUST, levels = c("1", "2", "3", "4ER-", "4ER+", "5", "6", "7", "8", "9", "10"))
levels(metabric$INTCLUST)
[1] "1" "2" "3" "4ER-" "4ER+" "5" "6" "7" "8" "9"
[11] "10"
metabric |> ggplot() +
  geom_bar(aes(x = INTCLUST, fill = INTCLUST))
Line plot
A line plot is used to display the trend or pattern in data over a continuous range of values, typically along the x-axis (horizontal axis).
Before we create a line plot, let’s start by reading a subset of the cancer_mort dataset using the read_csv() function:
library(tidyverse)
# first read the dataset
cancer_mort_full <- read_csv("data/Australian_Cancer_Incidence_and_Mortality.csv")
# let's consider only the rows with cancer types that start with the letter B
# this is done for illustration purposes
cancer_mort <- cancer_mort_full |> filter(str_detect(Cancer_Type, '^B[a-z]+'))
Next, we filter the cancer_mort data frame to plot only the counts for female patients in the age group 55-59 that are categorized as mortality cases.
# define a new subset from cancer_mort dataset
cancer_mort_55 <- cancer_mort |>
  filter(Age == '55-59' & Type == "Mortality", Sex == 'Female')
cancer_mort_55 |> ggplot(aes(x = Year, y = Count)) +
  geom_line(aes(colour = Cancer_Type))
Another aesthetic available for geom_line is linetype.
cancer_mort_55 |> ggplot(aes(x = Year, y = Count)) +
  geom_line(aes(linetype = Cancer_Type))
Saving plot images
Use ggsave() to save the last plot you displayed.
ggsave("integrative_cluster.png")
You can alter the width and height of the plot and can change the image file type.
ggsave("integrative_cluster.pdf", width = 20, height = 12, units = "cm")
You can also pass in a plot object you have created instead of using the last plot displayed. See the help page (?ggsave) for more details.
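Saving a specific plot object might look like the following. This is a minimal sketch: the plot is built from R's built-in mtcars data purely for illustration, and it is written to a temporary PDF file (the `out_file` name is hypothetical) so that nothing is left behind in the working directory.

```r
library(ggplot2)

# a small plot object (built-in mtcars data, for illustration only)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# write to a temporary file rather than the working directory
out_file <- tempfile(fileext = ".pdf")

# pass the plot explicitly instead of relying on the last plot displayed
ggsave(out_file, plot = p, width = 20, height = 12, units = "cm")

file.exists(out_file)
```

Passing the plot explicitly is useful in scripts, where several plots may be created before any of them is saved.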
The ggplot object – a peek under the hood
Let’s build a ggplot2 plot up in stages to understand what’s really going on.
plot <- metabric |> ggplot()
What is the plot object we’ve just created?
typeof(plot)
[1] "list"
length(plot)
[1] 11
The plot object is a list containing 11 elements. It’s actually a special type of list.
class(plot)
[1] "gg" "ggplot"
What are the 11 things in the list?
names(plot)
[1] "data" "layers" "scales" "guides" "mapping"
[6] "theme" "coordinates" "facet" "plot_env" "layout"
[11] "labels"
The data
element is just the metabric tibble that we provided to the ggplot()
function.
plot$data |> head()
PATIENT_ID | LYMPH_NODES_EXAMINED_POSITIVE | NPI | CELLULARITY | CHEMOTHERAPY | COHORT | ER_IHC | HER2_SNP6 | INTCLUST | AGE_AT_DIAGNOSIS | SURVIVAL_TIME | SURVIVAL_STATUS | CLAUDIN_SUBTYPE | THREEGENE | VITAL_STATUS | RADIO_THERAPY | CANCER_TYPE_DETAILED | HER2_STATUS | PR_STATUS | GRADE | TUMOR_SIZE | TUMOR_STAGE | ERBB2 | ESR1 | FOXA1 | GATA3 | MLPH | PGR | PIK3CA | TP53 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MB-0000 | 10 | 6.044 | NA | NO | 1 | Positve | NEUTRAL | 4ER+ | 75.65 | 140.50000 | LIVING | claudin-low | ER-/HER2- | Living | YES | Breast Invasive Ductal Carcinoma | Negative | Negative | 3 | 22 | 2 | 9.333972 | 8.929817 | 7.953794 | 6.932146 | 9.729728 | 5.680501 | 5.704157 | 6.338739 |
MB-0002 | 0 | 4.020 | High | NO | 1 | Positve | NEUTRAL | 4ER+ | 43.19 | 84.63333 | LIVING | LumA | ER+/HER2- High Prolif | Living | YES | Breast Invasive Ductal Carcinoma | Negative | Positive | 3 | 10 | 1 | 9.729606 | 10.047059 | 11.843989 | 11.251197 | 12.536570 | 7.505424 | 5.757727 | 6.192507 |
MB-0005 | 1 | 4.030 | High | YES | 1 | Positve | NEUTRAL | 3 | 48.87 | 163.70000 | DECEASED | LumB | NA | Died of Disease | NO | Breast Invasive Ductal Carcinoma | Negative | Positive | 2 | 15 | 2 | 9.725825 | 10.041281 | 11.698169 | 9.289758 | 10.306115 | 7.376123 | 6.751566 | 6.404516 |
MB-0006 | 3 | 4.050 | Moderate | YES | 1 | Positve | NEUTRAL | 9 | 47.68 | 164.93333 | LIVING | LumB | NA | Living | YES | Breast Mixed Ductal and Lobular Carcinoma | Negative | Positive | 2 | 25 | 2 | 10.334979 | 10.404685 | 11.863379 | 8.667723 | 10.472181 | 6.815637 | 7.219187 | 6.869241 |
MB-0008 | 8 | 6.080 | High | YES | 1 | Positve | NEUTRAL | 9 | 76.97 | 41.36667 | DECEASED | LumB | ER+/HER2- High Prolif | Died of Disease | YES | Breast Mixed Ductal and Lobular Carcinoma | Negative | Positive | 3 | 40 | 2 | 9.956267 | 11.276581 | 11.625006 | 9.719781 | 12.161961 | 7.331223 | 5.817818 | 6.337951 |
MB-0010 | 0 | 4.062 | Moderate | NO | 1 | Positve | NEUTRAL | 7 | 78.77 | 7.80000 | DECEASED | LumB | ER+/HER2- High Prolif | Died of Disease | YES | Breast Invasive Ductal Carcinoma | Negative | Positive | 3 | 31 | 4 | 9.739996 | 11.239750 | 12.142178 | 9.787085 | 11.433164 | 5.954311 | 6.123056 | 5.419711 |
This plot
list object is really only a specification for how ggplot2 should create the plot. ggplot2 renders the plot only when we print it, either by calling print(plot)
or simply just plot
.
We haven’t added any layers yet so what does our plot look like at this point?
plot
ggplot2
doesn’t know what to plot yet as we haven’t added a layer (geom).
plot$mapping
Aesthetic mapping:
<empty>
plot$layers
list()
Let’s add an aesthetic mapping.
plot <- metabric |> ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC))
plot
ggplot2
has automatically added scales for x and y based on the ranges of values for the Nottingham prognostic index and ESR1 expression. Still nothing has been plotted as we haven’t yet specified the type of plot (geom) to add as a layer.
plot$mapping
Aesthetic mapping:
* `x` -> `NPI`
* `y` -> `ESR1`
* `colour` -> `ER_IHC`
plot$layers
list()
Finally, let’s add geom_point
and geom_smooth
layers to create a scatter plot.
plot <- plot +
geom_point(size = 0.6, alpha = 0.5) +
geom_smooth(method = "lm")
plot
`geom_smooth()` using formula = 'y ~ x'
plot$layers
[[1]]
geom_point: na.rm = FALSE
stat_identity: na.rm = FALSE
position_identity
[[2]]
geom_smooth: na.rm = FALSE, orientation = NA, se = TRUE
stat_smooth: na.rm = FALSE, orientation = NA, se = TRUE, method = lm
position_identity
We touched on statistical transformations earlier. The stat
associated with a geom_point
is stat_identity
which leaves values unchanged. In the case of a scatter plot, we already have the x and y values – they don’t need to be transformed, just plotted on the x and y axes. On the other hand, the stat
associated with a geom_smooth
is stat_smooth
which fits a smoothing model to the data and computes fitted values; here it is a linear model because we set method = "lm".
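To see what each stat hands to its geom, one option is ggplot2's layer_data() function, which returns the data computed for a given layer. The sketch below uses the built-in mtcars data rather than the METABRIC data, purely as a stand-in.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Layer 1 (geom_point, stat_identity): one row per observation, values unchanged
points_data <- layer_data(p, 1)
nrow(points_data) == nrow(mtcars)
all(sort(points_data$y) == sort(mtcars$mpg))

# Layer 2 (geom_smooth, stat_smooth): fitted values on a grid of x values,
# with ymin and ymax giving the confidence band
smooth_data <- layer_data(p, 2)
head(smooth_data[, c("x", "y", "ymin", "ymax")])
```

The point layer carries the original values through, while the smooth layer contains new rows that did not exist in the input data.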
Customizing Plots - continued
Scales
One of the components of the plot is called scales
. ggplot2
automatically adds default scales behind the scenes, equivalent to the following:
plot <- metabric |>
ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
geom_point(size = 0.6, alpha = 0.5) +
geom_smooth(method = "lm") +
scale_x_continuous() +
scale_y_continuous() +
scale_colour_discrete()
Note that we have three aesthetics and ggplot2 adds a scale for each.
plot$mapping
Aesthetic mapping:
* `x` -> `NPI`
* `y` -> `ESR1`
* `colour` -> `ER_IHC`
The x and y variables (NPI
and ESR1
) are continuous so ggplot2
adds a continuous scale for each. ER_IHC
is a discrete variable in this case so ggplot2
adds a discrete scale for colour.
Generalizing, the scales that are required follow the naming scheme:
scale_<NAME_OF_AESTHETIC>_<NAME_OF_SCALE>
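As a quick illustration of the naming scheme, the sketch below composes a few scale function names from an aesthetic and a scale type and checks that ggplot2 exports a function with each name. The particular combinations are chosen for illustration only.

```r
library(ggplot2)

# Compose scale function names as scale_<aesthetic>_<scale type>
scale_names <- c(
  paste("scale", "x", "continuous", sep = "_"),
  paste("scale", "y", "log10", sep = "_"),
  paste("scale", "colour", "discrete", sep = "_"),
  paste("scale", "fill", "manual", sep = "_"),
  paste("scale", "size", "continuous", sep = "_")
)

# Check that each composed name is a function exported by ggplot2
is_exported <- sapply(
  scale_names,
  function(nm) is.function(getExportedValue("ggplot2", nm))
)
is_exported
```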
Look at the help page for scale_y_continuous
to see what we can change about the y-axis scale.
First we’ll change the breaks, i.e. where ggplot2
puts ticks and numeric labels, on the y axis.
metabric |>
ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
geom_point(size = 0.6, alpha = 0.5) +
geom_smooth(method = "lm") +
scale_y_continuous(breaks = seq(5, 15, by = 2.5))
`geom_smooth()` using formula = 'y ~ x'
seq()
is a useful function for generating regular sequences of numbers. In this case we wanted numbers from 5 to 15 going up in steps of 2.5.
seq(5, 15, by = 2.5)
[1] 5.0 7.5 10.0 12.5 15.0
We could do the same thing for the x axis using scale_x_continuous()
.
We can also adjust the extents of the x or y axis.
metabric |>
ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
geom_point(size = 0.6, alpha = 0.5) +
geom_smooth(method = "lm") +
scale_y_continuous(breaks = seq(5, 15, by = 2.5), limits = c(4, 12))
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 163 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 163 rows containing missing values or values outside the scale range
(`geom_point()`).
Here, just for demonstration purposes, we set the upper limit to be less than the largest values of ESR1 expression and ggplot2
warned us that some rows have been removed from the plot.
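The number reported in the warning should correspond to the rows whose values fall outside the limits. A minimal sketch of that bookkeeping, using the built-in mtcars data as a stand-in and ggplot2's layer_data() to inspect what the point layer receives:

```r
library(ggplot2)

y_limits <- c(15, 25)

# Rows whose y value lies outside the limits; these are the ones that get dropped
n_outside <- sum(mtcars$mpg < y_limits[1] | mtcars$mpg > y_limits[2])

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  scale_y_continuous(limits = y_limits)

# Inspect the data for the point layer: out-of-range values are no longer usable
built <- layer_data(p, 1)
n_kept <- sum(is.finite(built$y))

c(outside = n_outside, kept = n_kept, total = nrow(mtcars))
```

Because the out-of-range values are discarded before any statistics are computed, a layer such as geom_smooth() is fitted only to the values that remain.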
We can change the minor breaks, e.g. to add more lines that act as guides. These are shown as thin white lines when using the default theme.
metabric |>
ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
geom_point(size = 0.6, alpha = 0.5) +
geom_smooth(method = "lm") +
scale_y_continuous(
breaks = seq(5, 12.5, by = 2.5),
minor_breaks = seq(5, 13.5, 0.5),
limits = c(5, 13.5))
`geom_smooth()` using formula = 'y ~ x'
Or we can remove the minor breaks entirely.
metabric |>
ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
geom_point(size = 0.6, alpha = 0.5) +
geom_smooth(method = "lm") +
scale_y_continuous(
breaks = seq(6, 14, by = 2),
minor_breaks = NULL,
limits = c(5, 13.5))
`geom_smooth()` using formula = 'y ~ x'
Similarly we could remove all breaks entirely.
metabric |>
ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
geom_point(size = 0.6, alpha = 0.5) +
geom_smooth(method = "lm") +
scale_y_continuous(breaks = NULL)
`geom_smooth()` using formula = 'y ~ x'
A more typical scenario would be to keep the breaks, because we want to display the ticks and their labels, but remove the grid lines. Somewhat confusingly, the position of grid lines is controlled by a scale, but preventing them from being displayed requires changing the theme. The theme controls the way in which non-data components are displayed; we’ll look at how these can be customized later. For now, though, here’s an example of turning off the display of all grid lines for major and minor breaks for both axes.
metabric |>
ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
geom_point(size = 0.6, alpha = 0.5) +
geom_smooth(method = "lm") +
scale_y_continuous(breaks = seq(4, 14, by = 2), limits = c(4, 14)) +
theme(panel.grid = element_blank())
`geom_smooth()` using formula = 'y ~ x'
By default, the scales are expanded by 5% of the range on either side. We can add or reduce the space as follows.
metabric |>
ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
geom_point(size = 0.6, alpha = 0.5) +
geom_smooth(method = "lm") +
scale_x_continuous(expand = expansion(mult = 0.01)) +
scale_y_continuous(expand = expansion(mult = 0.25))
`geom_smooth()` using formula = 'y ~ x'
Here we only added 1% (0.01) of the range of NPI values on either side along the x axis but we added 25% (0.25) of the range of ESR1 expression on either side along the y axis.
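The expansion() helper also accepts a pair of values, so the lower and upper ends of an axis can be padded differently, and an add argument for padding in data units. The sketch below uses the built-in mtcars data as a stand-in; a common use is letting bars sit directly on the x axis while keeping some room above them.

```r
library(ggplot2)

# expansion() returns the padding as c(lower mult, lower add, upper mult, upper add)
expansion(mult = 0.01)        # 1% of the range on both sides
expansion(mult = c(0, 0.1))   # nothing below, 10% of the range above
expansion(add = c(0, 2))      # nothing below, 2 data units above

# Bars start exactly at the axis, with 10% headroom at the top
p <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1)))
p
```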
We can move the axis to the other side of the plot. It is not obvious why you’d want to do this, but with ggplot2
just about anything is possible.
metabric |>
ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
geom_point(size = 0.6, alpha = 0.5) +
geom_smooth(method = "lm") +
scale_x_continuous(position = "top")
`geom_smooth()` using formula = 'y ~ x'
Colours
The colour aesthetic is used with a categorical variable, ER_IHC
, in the scatter plots we’ve been customizing. The default colour scale used by ggplot2
for categorical variables is scale_colour_discrete
. We can manually set the colours we wish to use using scale_colour_manual
instead.
metabric |>
ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
geom_point(size = 0.6, alpha = 0.5) +
geom_smooth(method = "lm") +
scale_colour_manual(values = c("dodgerblue2", "firebrick2"))
`geom_smooth()` using formula = 'y ~ x'
Setting colours manually is ok when we only have two or three categories but when we have a larger number it would be handy to be able to choose from a selection of carefully-constructed colour palettes. Helpfully, ggplot2
provides access to the ColorBrewer palettes through the functions scale_colour_brewer()
and scale_fill_brewer()
.
metabric |>
ggplot(mapping = aes(x = NPI, y = ESR1, colour = THREEGENE)) +
geom_point(size = 0.6, alpha = 0.5, na.rm = TRUE) +
scale_colour_brewer(palette = "Set1")
Look at the help page for scale_colour_brewer
to see what other colour palettes are available and visit the ColorBrewer website to see what these look like.
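The palettes can also be inspected from within R. The sketch below assumes the scales and RColorBrewer packages are available, which is normally the case because they are installed as dependencies of ggplot2.

```r
library(ggplot2)

# brewer_pal() from the scales package returns a function that
# generates n colours from the named palette
set1 <- scales::brewer_pal(palette = "Set1")
set1(4)

# Names, sizes and categories of the available ColorBrewer palettes
head(RColorBrewer::brewer.pal.info)
```

The maxcolors column of brewer.pal.info is worth checking before mapping a variable with many categories, since each palette only provides a limited number of distinct colours.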
Interestingly, you can set attributes other than just the colours at the same time.
metabric |>
ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
geom_point(size = 0.6, alpha = 0.5) +
geom_smooth(method = "lm") +
scale_colour_manual(
values = c("dodgerblue2", "firebrick2"),
labels = c("ER-negative", "ER-positive")) +
labs(colour = NULL) # remove legend title for colour now that the labels are self-explanatory
`geom_smooth()` using formula = 'y ~ x'
We have applied our own set of mappings from levels in the data to aesthetic values.
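When values are given as an unnamed vector they are matched to the levels by position, which is easy to get wrong. A sketch of the named-vector alternative follows, using a small made-up data frame (toy) with hypothetical level names rather than the METABRIC columns.

```r
library(ggplot2)

# Made-up data with two categories of unequal size
toy <- data.frame(
  x = 1:5,
  y = c(2, 4, 3, 5, 1),
  status = c("Positive", "Negative", "Positive", "Positive", "Negative")
)

p <- ggplot(toy, aes(x = x, y = y, colour = status)) +
  geom_point() +
  scale_colour_manual(
    # the names tie each colour and label to a specific level
    values = c(Negative = "dodgerblue2", Positive = "firebrick2"),
    labels = c(Negative = "ER-negative", Positive = "ER-positive")
  )

# Colours actually assigned to the points
table(layer_data(p, 1)$colour)
```

With names in place, the mapping no longer depends on the order in which the levels happen to be sorted.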
For continuous variables we may wish to be able to change the colours used in the colour gradient. To demonstrate this we’ll correct the Nottingham prognostic index (NPI) values and use this to colour points in the scatter plot of ESR1 vs GATA3 expression on a continuous scale.
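# Nottingham_prognostic_index is incorrectly calculated in the data downloaded from cBioPortal
# NPI = 0.2 x tumour size (cm) + lymph node stage + grade; TUMOR_SIZE is in mm, hence 0.02
# lymph node stage: 1 = no positive nodes, 2 = 1 to 3 positive nodes, 3 = more than 3
metabric <- metabric |>
mutate(NODE_STAGE = case_when(LYMPH_NODES_EXAMINED_POSITIVE == 0 ~ 1, LYMPH_NODES_EXAMINED_POSITIVE <= 3 ~ 2, LYMPH_NODES_EXAMINED_POSITIVE > 3 ~ 3)) |>
mutate(NPI = 0.02 * TUMOR_SIZE + NODE_STAGE + GRADE)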
metabric |>
filter(!is.na(NPI)) |>
ggplot(mapping = aes(x = GATA3, y = ESR1, colour = NPI)) +
geom_point(size = 0.5)
Higher NPI scores correspond to worse prognosis and lower chance of 5 year survival. We’ll emphasize those points on the scatter plot by adjusting our colour scale.
metabric |>
filter(!is.na(NPI)) |>
ggplot(mapping = aes(x = GATA3, y = ESR1, colour = NPI)) +
geom_point(size = 0.75) +
scale_colour_gradient(low = "white", high = "firebrick2")
In some cases it might make sense to specify two colour gradients either side of a mid-point.
metabric |>
filter(!is.na(NPI)) |>
ggplot(mapping = aes(x = GATA3, y = ESR1, colour = NPI)) +
geom_point(size = 0.75) +
scale_colour_gradient2(
low = "dodgerblue1",
mid = "grey90",
high = "firebrick1",
midpoint = 4.5)
As before we can override the default labels and other aspects of the colour scale within the scale function.
metabric |>
filter(!is.na(NPI)) |>
ggplot(mapping = aes(x = GATA3, y = ESR1, colour = NPI)) +
geom_point(size = 0.5) +
scale_colour_gradient(
low = "lightblue", high = "darkblue",
name = "NPI Values",
breaks = 2:6,
limits = c(1.5, 6.5)
)
Themes
Themes can be used to customize non-data components of a plot. Let’s create a plot showing the expression of estrogen receptor alpha (ESR1) for each of the Integrative cluster breast cancer subtypes.
# starting from the METABRIC data, convert the INTCLUST variable into a
# categorical variable with the levels in the correct order, and select just
# the columns and rows we're going to use
metabric <- metabric |>
mutate(INTCLUST =
factor(INTCLUST,
levels = c("1", "2", "3", "4ER-", "4ER+", "5", "6", "7", "8", "9", "10"))) |>
mutate(THREEGENE =
replace_na(THREEGENE, "Unclassified")) |>
select(PATIENT_ID, ER_IHC, PR_STATUS, THREEGENE, INTCLUST, ESR1) |>
filter(!is.na(INTCLUST), !is.na(ESR1))
# plot the ESR1 expression for each integrative cluster
plot <- metabric |> ggplot() +
geom_boxplot(aes(x = INTCLUST, y = ESR1, fill = INTCLUST)) +
labs(x = "Integrative cluster", y = "ESR1 expression")
plot
The default theme has the characteristic grey background which isn’t particularly suitable for printing on paper. We can change to one of a number of alternative themes available in the ggplot2 package, e.g. the black and white theme.
plot + theme_bw()
Each of these themes is really just a collection of attributes relating to how various non-data elements of the plot will be displayed. We can override any of these individual settings using the theme()
function. A look at the help page (?theme
) shows that there are a very large number of settings that you can change. The following example demonstrates a few of these.
plot +
theme_bw() +
theme(
panel.grid.major.x = element_blank(),
axis.ticks.x = element_blank(),
legend.position = "none"
)
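If the same overrides are needed for several plots, one option is to store the combined theme in a variable and add it to each plot. This is a sketch using the built-in mtcars data; the name theme_clean is a hypothetical choice.

```r
library(ggplot2)

# Bundle a complete theme with our overrides into one reusable object
theme_clean <- theme_bw() +
  theme(
    panel.grid.major.x = element_blank(),
    axis.ticks.x = element_blank(),
    legend.position = "none"
  )

# Apply it like any other theme
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot() +
  theme_clean
p
```

Keeping the settings in one place means a later change of style only has to be made once.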
Here’s another example that also involves customizing the labels, scales and colours.
metabric |> ggplot() +
geom_bar(aes(x = THREEGENE, fill = ER_IHC)) +
scale_y_continuous(
limits = c(0, 700),
breaks = seq(0, 700, 100),
expand = expansion(mult = 0)) +
scale_fill_manual(values = c("firebrick2", "dodgerblue2")) +
labs(x = NULL, y = "samples", fill = "ER status") +
theme_bw() +
theme(
panel.border = element_blank(),
panel.grid = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
axis.line.y = element_line(),
axis.ticks.length.y = unit(0.2, "cm"),
legend.position = "bottom"
)
The ggthemes package contains some extra themes and might be fun to check out. Here’s an example of a plot that uses the theme_gdocs
theme that resembles the default look of charts in Google Docs.
library(ggthemes)
metabric |>
filter(THREEGENE == "HER2+") |>
ggplot(aes(x = PR_STATUS, y = ESR1)) +
geom_boxplot() +
geom_jitter(
aes(colour = PR_STATUS),
width = 0.25,
alpha = 0.4,
show.legend = FALSE) +
scale_colour_brewer(palette = "Set1") +
labs(x = "PR status", y = "ESR1 expression") +
theme_gdocs()
Position adjustments
All geoms in ggplot2
have a position adjustment that can be set using the position
argument. This has different effects for different types of plot but essentially this resolves how overlapping geoms are displayed.
For example, let’s consider the stacked bar plot we created earlier showing the numbers of patients in each of the 3-gene classifications subdivided by ER status. The default position value for geom_bar()
is “stack” which is why the plot is shown as a stacked bar chart. An alternative way of representing these data would be to show separate bars for each ER status side-by-side by setting position = "dodge"
.
metabric |> ggplot() +
geom_bar(aes(x = THREEGENE, fill = ER_IHC), position = "dodge") +
scale_y_continuous(limits = c(0, 700), breaks = seq(0, 700, 100), expand = expansion(mult = 0)) +
scale_fill_manual(values = c("firebrick2", "dodgerblue2")) +
labs(x = NULL, y = "samples", fill = "ER status") +
theme_bw() +
theme(
panel.border = element_blank(),
panel.grid = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
axis.line.y = element_line(),
axis.ticks.length.y = unit(0.2, "cm")
)
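The effect of the position argument can be seen in the coordinates ggplot2 computes for each bar. The sketch below uses a tiny made-up data frame (toy) and layer_data() to compare the two adjustments.

```r
library(ggplot2)

# Made-up data: group A has 3 samples, group B has 2
toy <- data.frame(
  group = c("A", "A", "A", "B", "B"),
  status = c("neg", "pos", "pos", "neg", "pos")
)

base <- ggplot(toy, aes(x = group, fill = status))

stacked <- layer_data(base + geom_bar(position = "stack"), 1)
dodged  <- layer_data(base + geom_bar(position = "dodge"), 1)

# Stacked: bars in the same x category sit on top of each other
stacked[, c("x", "fill", "count", "ymin", "ymax")]

# Dodged: every bar starts at zero and is shifted sideways instead
dodged[, c("x", "fill", "count", "ymin", "ymax", "xmin", "xmax")]
```

In the stacked version the top of the highest bar in each category equals the category total, whereas in the dodged version each bar's height is just its own count.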
Another position adjustment we’ve come across is geom_jitter()
, which is just a convenient shortcut for geom_point(position = "jitter")
. A variation on this, position_jitterdodge()
, comes in handy when we are overlaying points on top of a box plot. To see why, here is an example of such a plot in which we first use position = "jitter"
.
metabric |>
ggplot(aes(x = THREEGENE, y = ESR1, colour = PR_STATUS)) +
geom_boxplot() +
geom_point(position = "jitter", size = 0.5, alpha = 0.3) +
labs(x = "3-gene classification", y = "ESR1 expression", colour = "PR status") +
scale_color_brewer(palette = "Set1") +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
axis.ticks.x = element_blank()
)
The PR-negative and PR-positive points have distinct colours but are overlapping in a way that is aesthetically displeasing. What we want is for the points to have both jitter and to be dodged in the same way as the boxes. With position_jitterdodge()
we get a better looking plot.
metabric |>
ggplot(aes(x = THREEGENE, y = ESR1, colour = PR_STATUS)) +
geom_boxplot() +
geom_point(position = position_jitterdodge(), size = 0.5, alpha = 0.3) +
labs(x = "3-gene classification", y = "ESR1 expression", colour = "PR status") +
scale_color_brewer(palette = "Set1") +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
axis.ticks.x = element_blank()
)
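position_jitterdodge() takes arguments for tuning the result. The sketch below, on made-up data (toy), sets the jitter and dodge widths explicitly and fixes the random seed so that the points land in the same place each time the plot is built; the specific values are illustrative assumptions.

```r
library(ggplot2)

# Made-up expression values for two subtypes and two status groups
toy <- data.frame(
  subtype = rep(c("A", "B"), each = 20),
  status = rep(c("neg", "pos"), times = 20),
  expr = c(seq(5, 9, length.out = 20), seq(7, 12, length.out = 20))
)

make_plot <- function() {
  ggplot(toy, aes(x = subtype, y = expr, colour = status)) +
    geom_boxplot(outlier.shape = NA) +  # hide outliers so they are not drawn twice
    geom_point(
      position = position_jitterdodge(
        jitter.width = 0.2,  # horizontal spread of the points
        dodge.width = 0.75,  # the function's default, written out for clarity
        seed = 42            # fixed seed so the jitter is reproducible
      ),
      size = 0.5, alpha = 0.5
    )
}

# Two builds give identical point positions because the seed is fixed
identical(layer_data(make_plot(), 2)$x, layer_data(make_plot(), 2)$x)
```

Hiding the box plot outliers is a design choice here: since every observation is already drawn as a jittered point, showing outliers again would plot those samples twice.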