Visualizing Data

Overview

The ggplot2 package simplifies the creation of plots from data frames. Plotting is the next step in the tidyverse workflow.

This package offers a streamlined interface for defining variables to plot, configuring their display, and adjusting visual attributes. Consequently, adapting to changes in the data or transitioning between plot types requires only minimal modifications. This feature facilitates the creation of high-quality plots suitable for publication with minimal manual adjustments.

ggplot2 prefers data in the “long” format, where each variable occupies a column and each observation corresponds to a row. Structuring data in this manner (discussed previously) makes it easier to generate figures with ggplot2.

We will be using an extended version of the Metabric data set (from the assignment) in which columns have been added for the mRNA expression values for selected genes, including estrogen receptor alpha (ESR1), progesterone receptor (PGR), GATA3 and FOXA1.

library(tidyverse)
metabric <- read_csv("data/metabric/clinical_and_expression_data.csv")

Building a Basic Plot

The construction of ggplot graphics is incremental, allowing for the addition of new elements in layers. This approach grants users extensive flexibility and customization options, enabling the creation of tailored plots to suit specific needs.

To build a ggplot, any of the following basic templates can be used for different types of plots. My preferred choice is the one highlighted in pink, which will be consistently used in subsequent examples.

Three things are required for a ggplot:

1. The data

We first specify the data frame that contains the relevant data to create a plot. Here we are sending the metabric dataset to the ggplot() function.

# render plot background
metabric |> ggplot()

This command results in an empty gray panel. We must specify how various columns of the data frame should be depicted in the plot.

2. Aesthetics aes()

Next, we specify the columns in the data we want to map to visual properties (called aesthetics or aes in ggplot2), for example the columns for x values, y values and colours.

Since we are interested in generating a scatter plot, each point will have an x and a y coordinate. Therefore, we map GATA3 expression to the x-axis and ESR1 expression to the y-axis.

metabric |> ggplot(aes(x = GATA3, y = ESR1))

This results in a plot which includes the grid lines, the variable names and the scales for the x and y axes. However, the plot does not yet contain any data points.

3. Geometric Representation geom_()

Finally, we specify the type of plot (the geom). There are different types of geoms:

geom_blank() draws an empty plot.

geom_segment() draws a straight line. geom_vline() draws a vertical line and geom_hline() draws a horizontal line.

geom_curve() draws a curved line.

geom_line()/geom_path() makes a line plot. geom_line() connects points from left to right and geom_path() connects points in the order they appear in the data.


geom_point() produces a scatterplot.

geom_jitter() adds a small amount of random noise to the points in a scatter plot.

geom_dotplot() produces a dot plot.

geom_smooth() adds a smooth trend line to a plot.

geom_quantile() draws fitted quantile regression lines (typically over a scatter plot).

geom_density() creates a density plot.


geom_histogram() produces a histogram.

geom_bar() makes a bar chart. Height of the bar is proportional to the number of cases in each group.

geom_col() makes a bar chart. Height of the bar is proportional to the values in data.


geom_boxplot() produces a box plot.

geom_violin() creates a violin plot.


geom_ribbon() produces a ribbon (y interval defined line).

geom_area() draws an area plot, which is a line plot filled down to y = 0 (filled lines).

geom_rect(), geom_tile() and geom_raster() draw rectangles.

geom_polygon() draws polygons, which are filled paths.


geom_text() adds text to a plot.

geom_label() adds text labels to a plot; unlike geom_text(), it draws a rectangle behind the text.

The range of geoms available in ggplot2 can be obtained by navigating to the ggplot2 package in the Packages tab pane in RStudio (bottom right-hand corner) and scrolling down the list of functions sorted alphabetically to the geom_... functions.

Since we are interested in creating a scatter plot, the geometric representation of the data will be in point form. Therefore we use the geom_point() function.

To plot the expression of estrogen receptor alpha (ESR1) against that of the transcription factor, GATA3:

metabric |> ggplot(aes(x = GATA3, y = ESR1)) + geom_point() 

Notice that we use the + sign to add a layer of points to the plot. This concept bears resemblance to Adobe Photoshop, where layers of images can be rearranged and edited independently. In ggplot, each layer is added over the plot in accordance with its position in the code using the + sign.

A note about |> and +

The ggplot2 package was developed prior to the introduction of the pipe operator. In ggplot2, the + sign plays a role analogous to the pipe operator in other tidyverse functions, enabling code to be written from left to right. Layers must be added with +, not with |>.

Customizing Plots

Adding Colour

The above plot could be made more informative. For instance, the additional information regarding the ER status (i.e., ER_IHC column) could be incorporated into the plot. To do this, we can utilize aes() and specify which column in the metabric data frame should be represented as the color of the points.

metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = ER_IHC)) 

Notice that we specify the colour = ER_IHC argument in the aes() mapping inside the geom_point() function instead of the ggplot() function. Aesthetic mappings can be set in both ggplot() and individual geom_*() layers; we will discuss the difference in the section Adding Layers.

To colour points based on a continuous variable, for example: Nottingham prognostic index (NPI):

metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = NPI)) 

In ggplot2, a color scale is used for continuous variables, while discrete or categorical values are represented using discrete colors.

Note that some patient samples lack expression values, leading ggplot2 to remove those points with missing values for ESR1 and GATA3.
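As a rough check on how many points will be dropped, the sketch below counts rows with a missing x or y value in a small invented data frame (the values are not from the Metabric data). It also uses the na.rm argument of geom_point(), which removes the same rows without printing the warning.

```r
library(ggplot2)

# Toy data standing in for the expression columns (values are invented)
expr <- data.frame(
  GATA3 = c(9.1, NA, 7.4, 8.8, 6.0),
  ESR1  = c(10.2, 9.9, NA, 9.5, 6.3)
)

# Rows that cannot be drawn because x or y is missing
n_missing <- sum(is.na(expr$GATA3) | is.na(expr$ESR1))
n_missing

# na.rm = TRUE drops the same rows but silences the warning
p <- expr |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(na.rm = TRUE)
```

Silencing the warning is a convenience; it is usually worth knowing how many observations are missing before doing so.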

Adding Shape

Let’s add shape to points.

metabric |> ggplot(aes(x = GATA3, y = ESR1)) + 
  geom_point(aes(shape = THREEGENE))
Warning: Removed 209 rows containing missing values or values outside the scale range
(`geom_point()`).

Note that some patient samples have not been classified and ggplot has removed those points with missing values for the three-gene classifier.

Some aesthetics like shape can only be used with categorical variables:

metabric |> ggplot() +
  geom_point(aes(x = GATA3, y = ESR1, shape = SURVIVAL_TIME))
Error in `geom_point()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error in `scale_f()`:
! A continuous variable cannot be mapped to the shape aesthetic.
ℹ Choose a different aesthetic or use `scale_shape_binned()`.

The shape argument allows you to customize the appearance of all data points by assigning an integer associated with predefined shapes shown below:

To use asterisks instead of points in the plot:

metabric |> ggplot(aes(x = GATA3, y = ESR1)) + 
  geom_point(shape = 8)

It is often useful to change the shape of all the points at once. As above, we do this by setting shape to a single value rather than mapping it to one of the variables in the data set; this has to be done outside the aesthetic mappings (i.e. outside the aes() bit).

Aesthetic Setting vs. Mapping

Instead of mapping an aesthetic property to a variable, you can set it to a single value by specifying it in the layer parameters (outside aes()). We map an aesthetic to a variable (e.g., aes(shape = THREEGENE)) or set it to a constant (e.g., shape = 8). If you want appearance to be governed by a variable in your data frame, put the specification inside aes(); if you want to override the default size or colour, put the value outside of aes().

# shape outside aes()
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(shape = 8)
# shape inside aes()
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(shape = THREEGENE))
Warning: Removed 209 rows containing missing values or values outside the scale range
(`geom_point()`).

The above plots are created with similar code, but have rather different outputs. The first plot sets the shape to a constant value, whereas the second plot maps (not sets) the shape to the three-gene classifier variable.
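One way to see the difference between setting and mapping is to inspect the values ggplot2 computes for each layer with layer_data(). The following is a minimal sketch on an invented data frame; the column names mirror the tutorial but the values and the categories "A", "B" and "C" are hypothetical.

```r
library(ggplot2)

# Toy data; column names mirror the tutorial but the values are invented
toy <- data.frame(
  GATA3 = c(6, 7, 8, 9, 10, 11),
  ESR1 = c(6.5, 7.2, 8.1, 9.3, 9.9, 11.2),
  THREEGENE = c("A", "A", "B", "B", "C", "C")
)

# Setting: every point gets the constant shape 8
p_set <- toy |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(shape = 8)

# Mapping: one shape per category of THREEGENE
p_map <- toy |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(shape = THREEGENE))

# layer_data() returns the values ggplot2 computed for a layer
shapes_set <- unique(layer_data(p_set)$shape)
shapes_map <- unique(layer_data(p_map)$shape)
```

Here shapes_set holds the single value that was set, while shapes_map holds one shape code per category, chosen by the shape scale.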

It is usually preferable to use colours to distinguish between different categories but sometimes colour and shape are used together when we want to show which group a data point belongs to in two different categorical variables.

metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = CLAUDIN_SUBTYPE, shape = THREEGENE))
Warning: Removed 209 rows containing missing values or values outside the scale range
(`geom_point()`).

Adding Size and Transparency

We can adjust the size and/or transparency of the points.

Let’s first increase the size of points.

metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = CLAUDIN_SUBTYPE), size = 2)

Note that here we add the size argument outside of the aesthetic mapping.

Size is not usually a good aesthetic to map to a discrete variable, and ggplot2 warns against doing so.

metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = CLAUDIN_SUBTYPE, size = ER_IHC))
Warning: Using size for a discrete variable is not advised.

Because this value is discrete, the default size scale uses evenly spaced sizes for points categorized on ER status.

Transparency can be useful when we have a large number of points as we can more easily tell when points are overlaid, but like size, it is not usually mapped to a variable and sits outside the aes().

Let’s change the transparency of points.

metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = THREEGENE), alpha = 0.5) 

Adding Layers

We can add another layer to this plot using a different geometric representation (or geom_ function) we discussed previously.

Let’s add a trend line to this plot using the geom_smooth() function, which provides a summary of the data.

metabric |> ggplot() +
  geom_point(aes(x = GATA3, y = ESR1)) +
  geom_smooth(aes(x = GATA3, y = ESR1))
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Note that the shaded area surrounding the blue line represents the confidence interval around the fitted curve (95% by default).

There is some annoying duplication of code used to create this plot. We’ve repeated the exact same aesthetic mapping for both geoms. We can avoid this by putting the mappings in the ggplot() function instead.

metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point() +
  geom_smooth()

Geom layers specified earlier in the command are drawn first, preceding subsequent geom layers. The sequence of geom layers specified in the command determines their order of appearance in the plot.

If you switch the order of the geom_point() and geom_smooth() functions above, you’ll notice a change in the regression line. Specifically, the regression line will now be plotted underneath the points.

Let’s make the plot look a bit prettier by reducing the size of the points and making them transparent. We’re not mapping size or alpha to any variables, just setting them to constant values, and we only want these settings to apply to the points, so we set them inside geom_point().

metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth() 
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Aesthetic Specifications in Plot vs. Layers

Aesthetic mappings can be provided either in the initial ggplot() call, in individual layers, or through a combination of both approaches. When there’s only one layer in the plot, the method used to specify aesthetics doesn’t impact the result.

# colour argument inside ggplot()
metabric |> ggplot(aes(x = GATA3, y = ESR1, colour = ER_IHC)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth() 
# colour argument inside geom_point()
metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = ER_IHC), size = 0.5, alpha = 0.5) +
  geom_smooth() 

In the first plot, since we specified the colour (i.e., colour = ER_IHC) inside the ggplot() function, the geom_smooth() function fits a separate, coloured regression line for each ER status. This is because aesthetic mappings defined in ggplot(), at the global level, are passed down to each of the subsequent geom layers of the plot.

If we want to add colour only to the points and fit a single regression line across all points, we can specify the colour inside the geom_point() function instead (i.e., the second plot).

Suppose you’ve spent a bit of time getting your scatter plot just right and decide to add another layer, but you’re a bit worried about interfering with the code you so lovingly crafted. In that case, you can set the inherit.aes option to FALSE and set the aesthetic mappings explicitly for your new layer.

metabric |> ggplot(aes(x = GATA3, y = ESR1, colour = ER_IHC)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(aes(x = GATA3, y = ESR1), inherit.aes = FALSE)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
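The effect of global versus layer-level mappings can be checked by counting how many separate lines the smoothing layer fits. The sketch below uses invented data and, to keep it small and free of extra dependencies, a straight-line fit (method = "lm") rather than the GAM smoother reported in the messages above; that substitution is an assumption of the example, not part of the tutorial's plots.

```r
library(ggplot2)

# Invented data: two ER groups, ten points each
set.seed(1)
toy <- data.frame(
  GATA3 = rep(seq(6, 10.5, by = 0.5), times = 2),
  ER_IHC = rep(c("Negative", "Positive"), each = 10)
)
toy$ESR1 <- toy$GATA3 + rnorm(20, sd = 0.3)

# colour mapped globally: inherited by both layers
p_global <- toy |> ggplot(aes(x = GATA3, y = ESR1, colour = ER_IHC)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# colour mapped only in the point layer
p_local <- toy |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = ER_IHC)) +
  geom_smooth(method = "lm", formula = y ~ x)

# Number of separately fitted lines in the smooth layer (layer 2)
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_local  <- length(unique(layer_data(p_local, 2)$group))
```

With the global mapping the smooth layer is split into one group per ER status; with the layer-level mapping it fits a single line through all points.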

Coordinate Space

ggplot automatically selects the scale and type of coordinate space for each axis. The majority of plots utilize Cartesian coordinate space, characterized by linear x and y scales.

We can change the axes limits as follows:

# assign the plot to a variable
gata_esrp <- metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = ER_IHC), size = 0.5, alpha = 0.5) +
  geom_smooth() 

# change both x and y axes
gata_esrp + lims(x = c(0, 13), y = c(0, 14))
# change x axis
gata_esrp + xlim(0, NA)  
# change y axis
gata_esrp + ylim(0, 13)

When modifying the x-axis limit above, we assigned the upper limit as NA. You can leave one value as NA if you wish to calculate the corresponding limit from the range of the data.

Notice that we assigned our plot to a variable named gata_esrp and then modified it by adding axis limits. In ggplot2, you can assign a plot to a variable and then modify it by adding further layers or settings. This approach allows you to build up your visualization progressively.

lims()/xlim()/ylim() vs. coord_cartesian()

When you set the limits using any of the lims()/xlim()/ylim() functions, all data points outside the specified range are discarded. Consequently, the regression line is computed from the remaining data points only. In contrast, coord_cartesian() adjusts the limits without discarding any data, thus offering a visual zoom effect.

gata_esrp + ylim(7, 10)
gata_esrp + coord_cartesian(ylim = c(7, 10))
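The difference can be verified numerically. The sketch below is an illustration on invented values; it uses the median of a box plot rather than a regression line because the median is easy to check by hand. The same reasoning applies to any statistic ggplot2 computes for a layer.

```r
library(ggplot2)

# Invented values with one extreme observation
toy <- data.frame(group = "a", value = c(1, 2, 3, 4, 100))
p <- toy |> ggplot(aes(x = group, y = value)) + geom_boxplot()

# ylim() removes the value 100 before the box plot statistics are computed
# (ggplot2 warns about the removed row, which is suppressed here)
median_lim <- suppressWarnings(layer_data(p + ylim(0, 10))$middle)

# coord_cartesian() only zooms; the statistics still use all five values
median_zoom <- layer_data(p + coord_cartesian(ylim = c(0, 10)))$middle
```

The first median is computed from the four remaining values, the second from all five, even though both plots show the same visible range.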

Axis Labels

By default, ggplot2 uses the column names specified inside aes() as the axis labels. We can change this using the xlab() and ylab() functions.

metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = ER_IHC), size = 0.5, alpha = 0.5) +
  geom_smooth() +
  xlab("GATA3 Expression") +
  ylab("ESR1 Expression")

Customizing Plots

You can customize plots to include a title, a subtitle, a caption or a tag.

To add a title and/or subtitle:

metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = ER_IHC), size = 0.5, alpha = 0.5) +
  geom_smooth() +
  ggtitle(
    label = "Expression of estrogen receptor alpha against the transcription factor",
    subtitle = "ESR1 vs GATA3")

We can use the labs() function to add a title and additional information.

metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = ER_IHC), size = 0.5, alpha = 0.5) +
  geom_smooth() +
  labs(
    title = "Expression of estrogen receptor alpha against the transcription factor",
    subtitle = "ESR1 vs GATA3",
    caption = "This is a caption",
    tag = "Figure 1",
    y = "ESR1 Expression")

Themes

Themes control the overall appearance of the plot, including background color, grid lines, axis labels, and text styles. ggplot offers several built-in themes, and you can also create custom themes to match your preferences or the requirements of your publication. The default theme has a grey background.

gata_esrp <- metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = ER_IHC), size = 0.5, alpha = 0.5) +
  geom_smooth() 

gata_esrp + theme_bw()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Try these themes yourselves: theme_classic(), theme_dark(), theme_grey() (default), theme_light(), theme_linedraw(), theme_minimal(), theme_void() and theme_test().

Facets

To enhance readability and clarity, we can break the above plot into sub-plots, a technique called faceting. Facets are commonly used to split a plot into multiple panels based on the values of one or more variables. This can be useful for exploring relationships in the data across different subsets or categories.

To do this, we use the tilde symbol ~ to specify the column name that will form each facet.

metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = PR_STATUS), size = 0.5, alpha = 0.5) +
  geom_smooth() +
  facet_wrap(~ PR_STATUS)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Note that the aesthetics and geoms that were specified for the original plot, including the regression line, are applied to each of the facets.

Alternatively, the variable(s) used for faceting can be specified using vars().

metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = PR_STATUS), size = 0.5, alpha = 0.5) +
  facet_wrap(vars(PR_STATUS))

Faceting is usually better than displaying groups using different colours when there are more than two or three groups, as it can then be difficult to tell which points belong to which group. A case in point is the three-gene classification in the GATA3 vs ESR1 scatter plot we created above. Let’s create a faceted version of that plot.

metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = THREEGENE), size = 0.5, alpha = 0.5) +
  facet_wrap(vars(THREEGENE))

This helps explain why the function is called facet_wrap(). When it has too many subplots to fit across the page, it wraps around to another row. We can control how many rows or columns to use with the nrow and ncol arguments.

metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = THREEGENE), size = 0.5, alpha = 0.5) +
  facet_wrap(vars(THREEGENE), nrow = 1)

metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = THREEGENE), size = 0.5, alpha = 0.5) +
  facet_wrap(vars(THREEGENE), ncol = 2)

We can combine faceting on one variable with a colour aesthetic for another variable. For example, let’s show the tumour grade (neoplasm histologic grade) using faceting and the HER2 status using colours.

metabric |> ggplot(aes(x = GATA3, y = ESR1, colour = HER2_STATUS)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_wrap(vars(GRADE))

Instead of this, we could facet on more than one variable.

metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_wrap(vars(GRADE, HER2_STATUS))

Faceting on two variables is usually better done using the other faceting function, facet_grid(). Note that the row and column variables are now given as two separate vars() arguments.

metabric |> ggplot(aes(x = GATA3, y = ESR1)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_grid(vars(GRADE), vars(HER2_STATUS))

Again we can use colour aesthetics alongside faceting to add further information to our visualization.

metabric |> ggplot(aes(x = GATA3, y = ESR1, colour = CLAUDIN_SUBTYPE)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_grid(vars(GRADE), vars(HER2_STATUS))

Finally, we can use a labeller to change the labels for each of the categorical values so that these are more meaningful in the context of this plot.

grade_labels <- c("1" = "Grade I", "2" = "Grade II", "3" = "Grade III")
her2_status_labels <- c("Positive" = "HER2 positive", "Negative" = "HER2 negative")
#
metabric |> ggplot(aes(x = GATA3, y = ESR1, colour = CLAUDIN_SUBTYPE)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_grid(vars(GRADE),
             vars(HER2_STATUS),
             labeller = labeller(
               GRADE = grade_labels,
               HER2_STATUS = her2_status_labels
              )
            )

This would certainly be necessary if we were to use ER and HER2 status on one side of the grid.

er_status_labels <- c("Positive" = "ER positive", "Negative" = "ER negative")
#
metabric |> ggplot(aes(x = GATA3, y = ESR1, colour = CLAUDIN_SUBTYPE)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_grid(vars(GRADE),
             vars(ER_IHC, HER2_STATUS),
             labeller = labeller(
               GRADE = grade_labels,
               ER_IHC = er_status_labels,
               HER2_STATUS = her2_status_labels
              )
            )

Bar chart

The metabric study redefined how we think about breast cancer by identifying and characterizing several new subtypes, referred to as integrative clusters. Let’s create a bar chart of the number of patients whose cancers fall within each subtype in the metabric cohort.

geom_bar() is the geom used to plot bar charts. It requires a single aesthetic mapping of the categorical variable of interest to x.

metabric |> ggplot() +
  geom_bar(aes(x = INTCLUST))

The dark grey bars are a bit ugly - what if we want each bar to be a different colour?

metabric |> ggplot() +
  geom_bar(aes(x = INTCLUST, colour = INTCLUST))

Colouring the edges wasn’t quite what we had in mind. Look at the help for geom_bar to see what other aesthetic we should have used.

metabric |> ggplot() +
  geom_bar(aes(x = INTCLUST, fill = INTCLUST))

What happens if we colour (fill) with something other than the integrative cluster?

metabric |> ggplot() +
  geom_bar(aes(x = INTCLUST, fill = ER_IHC))

We get a stacked bar plot.

Note the similarity in what we did here to what we did with the scatter plot - there is a common grammar.

Let’s try another stacked bar plot, this time with a categorical variable with more than two categories.

metabric |> ggplot() +
  geom_bar(aes(x = INTCLUST, fill = THREEGENE))

We can rearrange the three gene groups into adjacent (dodged) bars by specifying a different position within geom_bar():

metabric |> ggplot() +
  geom_bar(aes(x = INTCLUST, fill = THREEGENE), position = 'dodge')

What if we want all the bars to be the same colour but not dark grey, e.g. blue?

metabric |> ggplot() +
  geom_bar(aes(x = INTCLUST, fill = "blue"))

That doesn’t look right - why not?

You can set the aesthetics to a fixed value but this needs to be outside the mapping, just like we did before for size and transparency in the scatter plots.

metabric |> ggplot() +
  geom_bar(aes(x = INTCLUST), fill = "blue")

Setting this inside the aes() mapping told ggplot2 to map the fill aesthetic to some variable in the data frame, one that doesn’t really exist but which is created on-the-fly with a value of “blue” for every observation.
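This behaviour can be confirmed by looking at the fill values ggplot2 actually assigns to the bars. The sketch below uses a small invented data frame; only the column name is borrowed from the tutorial.

```r
library(ggplot2)

toy <- data.frame(INTCLUST = c("1", "1", "2", "3", "3", "3"))  # invented values

# "blue" inside aes() is treated as a one-level categorical variable
p_inside <- toy |> ggplot() + geom_bar(aes(x = INTCLUST, fill = "blue"))

# "blue" outside aes() is used directly as the fill colour
p_outside <- toy |> ggplot() + geom_bar(aes(x = INTCLUST), fill = "blue")

fill_inside  <- unique(layer_data(p_inside)$fill)   # a colour picked by the fill scale
fill_outside <- unique(layer_data(p_outside)$fill)  # the literal value "blue"
```

In the first plot the fill scale assigns its own first palette colour to the single category called "blue", which is why the bars are not blue and a legend appears.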

You may have noticed that ggplot2 didn’t just plot values from our data set but had to do some calculation first for the bar chart, i.e. it had to count the number of observations in each category.

Each geom has a statistical transformation. In the case of the scatter plot, geom_point uses the “identity” transformation which means just use the values as they are (i.e. not really a transformation at all). The statistical transformation for geom_bar is “count”, which means it will count the number of observations for each category in the variable mapped to the x aesthetic.

You can see which statistical transformation is being used by a geom by looking at the stat argument in the help page for that geom.

There are some circumstances where you’d want to change the stat, for example if we already had count values in our table.

# the previous plot
metabric |> ggplot() +
  geom_bar(aes(x = INTCLUST))
# same plot after computing counts and using the identity stat
counts <- metabric |> count(INTCLUST) 
counts |> ggplot() +
  geom_bar(aes(x = INTCLUST, y = n), stat = "identity")
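An equivalent way to plot pre-computed values is geom_col(), which uses the identity stat by default. The sketch below compares the bar heights from both approaches on invented data; it counts with base R's table() instead of count() only so that the example depends on ggplot2 alone.

```r
library(ggplot2)

toy <- data.frame(INTCLUST = c("1", "1", "1", "2", "3", "3"))  # invented values

# geom_bar() counts rows per category (stat = "count")
p_bar <- toy |> ggplot() + geom_bar(aes(x = INTCLUST))

# Pre-computed counts, plotted as they are with geom_col()
counts <- as.data.frame(table(INTCLUST = toy$INTCLUST), responseName = "n")
p_col <- counts |> ggplot() + geom_col(aes(x = INTCLUST, y = n))

heights_bar <- layer_data(p_bar)$count
heights_col <- layer_data(p_col)$y
```

Both plots show the same bar heights; geom_col() simply makes the intent of plotting existing values explicit.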

Box plot

Box plots (or box & whisker plots) are a particular favourite seen in many seminars and papers. Box plots summarize the distribution of a set of values by displaying the median (i.e. middle-ranked value), the range of the middle 50% of values (the inter-quartile range, IQR) and the spread of the remaining values. The whiskers extend above and below the IQR box to the most extreme values that lie within Q3 + (1.5 x IQR) and Q1 - (1.5 x IQR) respectively; values beyond the whiskers are drawn as individual points.

To create a box plot from the Metabric dataset:

metabric |> ggplot(aes(x = ER_IHC, y = GATA3)) +
  geom_boxplot()

See the geom_boxplot() help page for an explanation of how the box and whiskers are constructed and how it decides which points are outliers to be displayed as individual points.
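The whisker rule can be worked through by hand on a small example. The sketch below uses ten invented values, computes the upper fence Q3 + 1.5 x IQR with base R, and compares the result with the statistics geom_boxplot() computes for the same values.

```r
library(ggplot2)

values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)  # invented; 30 is far from the rest

q <- quantile(values, c(0.25, 0.75))
iqr <- unname(q[2] - q[1])
upper_fence <- unname(q[2]) + 1.5 * iqr      # Q3 + 1.5 x IQR

# The whisker stops at the largest observation inside the fence
upper_whisker <- max(values[values <= upper_fence])

# Compare with what geom_boxplot() computes for the same values
toy <- data.frame(group = "a", value = values)
p <- toy |> ggplot(aes(x = group, y = value)) + geom_boxplot()
box <- layer_data(p)
box$ymax       # end of the upper whisker
box$outliers   # values drawn as individual points
```

The fence itself is not drawn; the whisker ends at the last observation inside it, and anything beyond is shown as a separate point.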

How about adding another layer to display all the points?

metabric |> ggplot(aes(x = ER_IHC, y = GATA3)) +
  geom_boxplot() +
  geom_point()

Ideally, we’d like these points to be spread out a bit. The help page of the geom_point() function points to geom_jitter() as more suitable when one of the variables is categorical.

metabric |> ggplot(aes(x = ER_IHC, y = GATA3)) +
  geom_boxplot() +
  geom_jitter()

Well, that’s a bit of a mess. We can bring the geom_boxplot() layer forward:

metabric |> ggplot(aes(x = ER_IHC, y = GATA3)) +
  geom_jitter() +
  geom_boxplot(alpha = 0.5) 

Still not the best plot. We can reduce the spread or jitter and make the points smaller and transparent:

metabric |> ggplot(aes(x = ER_IHC, y = GATA3)) +
  geom_boxplot() +
  geom_jitter(width = 0.3, size = 0.5, alpha = 0.25)

Displaying points in this way makes much more sense when we only have a few observations and where the box plot masks the fact, perhaps giving the false impression that the sample size is larger than it actually is. Here it makes less sense as we have very many observations.

Let’s try a colour aesthetic to also look at how GATA3 expression differs between HER2 positive and negative tumours.

metabric |> ggplot(aes(x = ER_IHC, y = GATA3, colour = HER2_STATUS)) +
  geom_boxplot() 

Violin plot

A violin plot is used to visualize the distribution of a numeric variable across different categories. It combines aspects of a box plot and a kernel density plot.

The width of the violin at any given point represents the density of data at that point. Wider sections indicate a higher density of data points, while narrower sections indicate lower density. By default, violin plots are symmetric.

metabric |> ggplot(aes(y = GATA3, x = ER_IHC, colour = HER2_STATUS)) + 
  geom_violin()

Inside each violin plot, a box plot is often included, showing additional summary statistics such as the median, quartiles, and potential outliers. This helps provide a quick overview of the central tendency and spread of the data within each category.

metabric |> ggplot(aes(y = GATA3, x = ER_IHC, colour = HER2_STATUS)) + 
  geom_violin() + 
  geom_boxplot(width = 0.8, alpha = 0.4)

In the above plot, the violin plots and box plots are misaligned. You can read the cause of this here.

To align them, we can use the position_dodge() function to manually adjust the horizontal position as follows.

metabric |> ggplot(aes(y = GATA3, x = ER_IHC, colour = HER2_STATUS)) + 
  geom_violin(position = position_dodge(0.8)) + 
  geom_boxplot(width = 0.8, alpha = 0.4)

Histogram

The geom for creating histograms is, rather unsurprisingly, geom_histogram().

metabric |> ggplot() +
  geom_histogram(aes(x = AGE_AT_DIAGNOSIS))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The message hints at picking a better number of bins by specifying the binwidth argument.

metabric |> ggplot() +
  geom_histogram(aes(x = AGE_AT_DIAGNOSIS), binwidth = 5)

Or we can set the number of bins.

metabric |> ggplot() +
  geom_histogram(aes(x = AGE_AT_DIAGNOSIS), bins = 20)

These histograms are not very pleasing, aesthetically speaking - how about some better aesthetics?

metabric |> ggplot() +
  geom_histogram(
    aes(x = AGE_AT_DIAGNOSIS), 
    bins = 20, 
    colour = "darkblue", 
    fill = "grey")

Density plot

Density plots are used to visualize the distribution of a continuous variable in a dataset. These are essentially smoothed histograms, where the area under the curve for each sub-group is 1. This allows us to compare sub-groups of different sizes.

metabric |> ggplot() + 
  geom_density(aes(x = AGE_AT_DIAGNOSIS, colour = INTCLUST))
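The claim that the area under a density curve is 1 can be checked with base R's density() function, which performs the same kind of kernel density estimate. This is a rough sketch on simulated ages (the numbers are invented), approximating the area with a simple Riemann sum.

```r
set.seed(42)
age <- rnorm(200, mean = 60, sd = 12)   # invented ages, not the Metabric values

d <- density(age)                       # kernel density estimate on an evenly spaced grid
step <- d$x[2] - d$x[1]                 # spacing between grid points
area <- sum(d$y) * step                 # Riemann-sum approximation of the area
area
```

The result is very close to 1 regardless of how many observations are supplied, which is why density curves for groups of different sizes can be compared directly.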

Categorical variables – factors

Several of the variables in the Metabric data set are categorical. Some of these have been read into R as character types (e.g. the three gene classifier), others as numerical values (e.g. tumour stage). We also have some binary variables that are essentially categorical variables but with only 2 possible values (e.g. ER status).

In many of the plots given above, ggplot2 has treated character variables as categorical in situations where a categorical variable is expected. For example, when we displayed points on a scatter plot using different colours for each three gene classification, or when we created separate box plots in the same graph for ER positive and negative patients.

But what about when our categorical variable has been read into R as a continuous variable, e.g. TUMOR_STAGE, which is read in as a double type?

metabric |> ggplot() +
  geom_point(aes(x = GATA3, y = ESR1, colour = TUMOR_STAGE))

table(metabric$TUMOR_STAGE)

  0   1   2   3   4 
  4 490 818 118  10 

Tumour stage has only 5 discrete states but ggplot2 doesn’t know these are supposed to be a restricted set of values and has used a colour scale to show them as if they were continuous. We need to tell R that these are categorical (or factors).

Let’s convert our tumour stage variable to a factor using the as.factor() function.

metabric$TUMOR_STAGE <- as.factor(metabric$TUMOR_STAGE)
metabric |> select(PATIENT_ID, TUMOR_STAGE) |> head()
  PATIENT_ID TUMOR_STAGE
  MB-0000    2
  MB-0002    1
  MB-0005    2
  MB-0006    2
  MB-0008    2
  MB-0010    4

R actually stores categorical variables as integers but with some additional metadata about which of the integer values, or ‘levels’, corresponds to each category.

typeof(metabric$TUMOR_STAGE)
[1] "integer"
class(metabric$TUMOR_STAGE)
[1] "factor"
levels(metabric$TUMOR_STAGE)
[1] "0" "1" "2" "3" "4"
metabric |> ggplot() +
  geom_point(aes(x = GATA3, y = ESR1, colour = TUMOR_STAGE))
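Because a factor stores integer codes rather than the original values, converting it back to numbers needs some care. The short base R sketch below uses an invented vector of stage values to show the difference between the stored codes and the labels.

```r
stage <- factor(c("2", "1", "2", "4"))   # invented values

levels(stage)                            # the distinct categories, sorted

# as.integer() returns the position of each value within levels(stage)
codes <- as.integer(stage)

# Going via character recovers the numbers the labels represent
values <- as.numeric(as.character(stage))
```

Here the value "4" has code 3 because it is the third level, so calling as.integer() or as.numeric() directly on a factor does not return the original numbers.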

In this case the order of the levels makes sense but for other variables you may wish for more control over the ordering. Take the integrative cluster variable for example. We created a bar plot of the numbers of patients in the Metabric cohort within each integrative cluster. Did you notice the ordering of the clusters? 10 came just after 1 and before 2. That looked a bit odd as we’d have naturally expected it to come last of all. R, on the other hand, is treating this vector as a character vector (mainly because of the ‘ER-’ and ‘ER+’ subtypes of cluster 4) and sorts the values into alphanumerical order.

metabric$INTCLUST <- as.factor(metabric$INTCLUST)
levels(metabric$INTCLUST)
 [1] "1"    "10"   "2"    "3"    "4ER-" "4ER+" "5"    "6"    "7"    "8"   
[11] "9"   

As discussed in the section on factors, we can create a factor using the factor() function and specify the order of the levels using the levels argument.

metabric$INTCLUST <- factor(metabric$INTCLUST, levels = c("1", "2", "3", "4ER-", "4ER+", "5", "6", "7", "8", "9", "10"))
levels(metabric$INTCLUST)
 [1] "1"    "2"    "3"    "4ER-" "4ER+" "5"    "6"    "7"    "8"    "9"   
[11] "10"  
metabric |> ggplot() +
  geom_bar(aes(x = INTCLUST, fill = INTCLUST))

Line plot

A line plot is used to display the trend or pattern in data over a continuous range of values, typically along the x-axis (horizontal axis).

Before we create a line plot, let’s read in the cancer_mort dataset using the read_csv() function and take a subset of it:

library(tidyverse)
# first read the dataset
cancer_mort_full <- read_csv("data/Australian_Cancer_Incidence_and_Mortality.csv")  
# let's consider only the rows with cancer types that start with the letter B
# this is done for illustration purposes
cancer_mort <- cancer_mort_full |> filter(str_detect(Cancer_Type, '^B[a-z]+'))

Next, we filter the cancer_mort data frame to keep only the mortality counts for female patients in the 55-59 age group.

# define a new subset from cancer_mort dataset
cancer_mort_55 <- cancer_mort |> 
  filter(Age == '55-59' & Type == "Mortality", Sex == 'Female')
cancer_mort_55 |> ggplot(aes(x = Year, y = Count)) + 
  geom_line(aes(colour = Cancer_Type)) 

Another aesthetic available for geom_line is linetype.

cancer_mort_55 |> ggplot(aes(x = Year, y = Count)) + 
  geom_line(aes(linetype = Cancer_Type)) 
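
The two aesthetics can also be combined, and further layers can be added on top of the lines. The following sketch uses made-up counts for two hypothetical cancer types (the data frame and its values are assumptions for illustration, not taken from the cancer_mort dataset).

```r
library(ggplot2)

# made-up counts for two hypothetical cancer types, for illustration only
toy_counts <- data.frame(
  Year = rep(2000:2004, times = 2),
  Count = c(12, 15, 14, 18, 20, 5, 7, 6, 9, 8),
  Cancer_Type = rep(c("Type A", "Type B"), each = 5)
)

line_plot <- ggplot(toy_counts, aes(x = Year, y = Count)) +
  # the same variable is mapped to both colour and linetype
  geom_line(aes(colour = Cancer_Type, linetype = Cancer_Type)) +
  # mark the individual yearly observations on top of each line
  geom_point(aes(colour = Cancer_Type), size = 1)
line_plot
```

Using both colour and linetype for the same variable can help when a figure may be printed in greyscale.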

Saving plot images

Use ggsave() to save the last plot you displayed.

ggsave("integrative_cluster.png")

You can alter the width and height of the plot and can change the image file type.

ggsave("integrative_cluster.pdf", width = 20, height = 12, units = "cm")

You can also pass in a plot object you have created instead of using the last plot displayed. See the help page (?ggsave) for more details.
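
As a minimal sketch of passing a plot object, the example below saves a small plot built from the mtcars data set that ships with R. The object name and the temporary file are choices made for this illustration only.

```r
library(ggplot2)

# a small plot built from the mtcars data set that ships with R
scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# write to a temporary file so the example does not clutter the working directory
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = scatter, width = 12, height = 8, units = "cm")

file.exists(out_file)
```

Passing the plot explicitly avoids surprises when another plot has been displayed in the meantime.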

The ggplot object – a peek under the hood

Let’s build a ggplot2 plot up in stages to understand what’s really going on.

plot <- metabric |> ggplot()

What is the plot object we’ve just created?

typeof(plot)
[1] "list"
length(plot)
[1] 11

The plot object is a list containing 11 elements. It’s actually a special type of list.

class(plot)
[1] "gg"     "ggplot"

What are the 11 things in the list?

names(plot)
 [1] "data"        "layers"      "scales"      "guides"      "mapping"    
 [6] "theme"       "coordinates" "facet"       "plot_env"    "layout"     
[11] "labels"     

The data element is just the metabric tibble that we provided to the ggplot() function.

plot$data |> head()
PATIENT_ID LYMPH_NODES_EXAMINED_POSITIVE NPI CELLULARITY CHEMOTHERAPY COHORT ER_IHC HER2_SNP6 INTCLUST AGE_AT_DIAGNOSIS SURVIVAL_TIME SURVIVAL_STATUS CLAUDIN_SUBTYPE THREEGENE VITAL_STATUS RADIO_THERAPY CANCER_TYPE_DETAILED HER2_STATUS PR_STATUS GRADE TUMOR_SIZE TUMOR_STAGE ERBB2 ESR1 FOXA1 GATA3 MLPH PGR PIK3CA TP53
MB-0000 10 6.044 NA NO 1 Positve NEUTRAL 4ER+ 75.65 140.50000 LIVING claudin-low ER-/HER2- Living YES Breast Invasive Ductal Carcinoma Negative Negative 3 22 2 9.333972 8.929817 7.953794 6.932146 9.729728 5.680501 5.704157 6.338739
MB-0002 0 4.020 High NO 1 Positve NEUTRAL 4ER+ 43.19 84.63333 LIVING LumA ER+/HER2- High Prolif Living YES Breast Invasive Ductal Carcinoma Negative Positive 3 10 1 9.729606 10.047059 11.843989 11.251197 12.536570 7.505424 5.757727 6.192507
MB-0005 1 4.030 High YES 1 Positve NEUTRAL 3 48.87 163.70000 DECEASED LumB NA Died of Disease NO Breast Invasive Ductal Carcinoma Negative Positive 2 15 2 9.725825 10.041281 11.698169 9.289758 10.306115 7.376123 6.751566 6.404516
MB-0006 3 4.050 Moderate YES 1 Positve NEUTRAL 9 47.68 164.93333 LIVING LumB NA Living YES Breast Mixed Ductal and Lobular Carcinoma Negative Positive 2 25 2 10.334979 10.404685 11.863379 8.667723 10.472181 6.815637 7.219187 6.869241
MB-0008 8 6.080 High YES 1 Positve NEUTRAL 9 76.97 41.36667 DECEASED LumB ER+/HER2- High Prolif Died of Disease YES Breast Mixed Ductal and Lobular Carcinoma Negative Positive 3 40 2 9.956267 11.276581 11.625006 9.719781 12.161961 7.331223 5.817818 6.337951
MB-0010 0 4.062 Moderate NO 1 Positve NEUTRAL 7 78.77 7.80000 DECEASED LumB ER+/HER2- High Prolif Died of Disease YES Breast Invasive Ductal Carcinoma Negative Positive 3 31 4 9.739996 11.239750 12.142178 9.787085 11.433164 5.954311 6.123056 5.419711

This plot list object is really only a specification for how ggplot2 should create the plot. ggplot2 only renders the plot when we ask it to, either by calling print(plot) or simply by typing plot.

We haven’t added any layers yet so what does our plot look like at this point?

plot

ggplot2 doesn’t know what to plot yet as we haven’t added a layer (geom).

plot$mapping
Aesthetic mapping: 
<empty>
plot$layers
list()

Let’s add an aesthetic mapping.

plot <- metabric |> ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC))
plot

ggplot2 has automatically added scales for x and y based on the ranges of values for the Nottingham prognostic index and ESR1 expression. Still nothing has been plotted as we haven’t yet specified the type of plot (geom) to add as a layer.

plot$mapping
Aesthetic mapping: 
* `x`      -> `NPI`
* `y`      -> `ESR1`
* `colour` -> `ER_IHC`
plot$layers
list()

Finally, let’s add geom_point and geom_smooth layers to create a scatter plot.

plot <- plot + 
  geom_point(size = 0.6, alpha = 0.5) +
  geom_smooth(method = "lm") 
plot
`geom_smooth()` using formula = 'y ~ x'

plot$layers
[[1]]
geom_point: na.rm = FALSE
stat_identity: na.rm = FALSE
position_identity 

[[2]]
geom_smooth: na.rm = FALSE, orientation = NA, se = TRUE
stat_smooth: na.rm = FALSE, orientation = NA, se = TRUE, method = lm
position_identity 

We touched on statistical transformations earlier. The stat associated with a geom_point is stat_identity which leaves values unchanged. In the case of a scatter plot, we already have the x and y values – they don’t need to be transformed, just plotted on the x and y axes. On the other hand, the stat associated with a geom_smooth is stat_smooth, which computes a smoothed fit to the data; here it fits a straight line because we set method = "lm".
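
To see what each stat actually produces, ggplot2's layer_data() function returns the data computed for a given layer. The sketch below uses the mtcars data set that ships with R instead of the Metabric data, so the variable names are different from those in the plot above.

```r
library(ggplot2)

# scatter plot with a linear fit, using the mtcars data set that ships with R
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# layer 1 (points): stat_identity passes the data through, one row per car
points_data <- layer_data(p, 1)
nrow(points_data) == nrow(mtcars)

# layer 2 (smooth): stat_smooth computes fitted values and a confidence band
smooth_data <- layer_data(p, 2)
head(smooth_data[, c("x", "y", "ymin", "ymax")])
```

The points layer keeps the original observations, whereas the smooth layer contains new values calculated along the fitted line.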

Customizing Plots - continued

Scales

One of the components of the plot is called scales. ggplot2 automatically adds default scales behind the scenes, equivalent to the following:

plot <- metabric |> 
  ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
  geom_point(size = 0.6, alpha = 0.5) +
  geom_smooth(method = "lm") +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_colour_discrete()

Note that we have three aesthetics and ggplot2 adds a scale for each.

plot$mapping
Aesthetic mapping: 
* `x`      -> `NPI`
* `y`      -> `ESR1`
* `colour` -> `ER_IHC`

The x and y variables (NPI and ESR1) are continuous so ggplot2 adds a continuous scale for each. ER_IHC is a discrete variable in this case so ggplot2 adds a discrete scale for colour.
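
The same rule applies to other combinations of aesthetic and variable type. As an illustrative sketch using the mpg data set that ships with ggplot2 (not the Metabric data), a bar plot with a categorical x variable and a categorical fill gets discrete scales for x and fill, and a continuous scale for the counts on the y axis.

```r
library(ggplot2)

bar_plot <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar() +
  scale_x_discrete() +     # class is categorical
  scale_y_continuous() +   # the counts computed by geom_bar are numeric
  scale_fill_discrete()    # drv is categorical
bar_plot
```

Writing the default scales out explicitly like this changes nothing about the plot, but it shows where customizations would go.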

Generalizing, the scales that are required follow the naming scheme:

scale_<NAME_OF_AESTHETIC>_<NAME_OF_SCALE>

Look at the help page for scale_y_continuous to see what we can change about the y-axis scale.

First we’ll change the breaks, i.e. where ggplot2 puts ticks and numeric labels, on the y axis.

metabric |> 
  ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
  geom_point(size = 0.6, alpha = 0.5) +
  geom_smooth(method = "lm") +
  scale_y_continuous(breaks = seq(5, 15, by = 2.5))
`geom_smooth()` using formula = 'y ~ x'

seq() is a useful function for generating regular sequences of numbers. In this case we wanted numbers from 5 to 15 going up in steps of 2.5.

seq(5, 15, by = 2.5)
[1]  5.0  7.5 10.0 12.5 15.0

We could do the same thing for the x axis using scale_x_continuous().
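
A minimal sketch of the x-axis version follows, using the mtcars data set that ships with R so that it runs on its own; the break positions are chosen for that data and are not meant for the NPI axis.

```r
library(ggplot2)

# tick positions from 2 to 5 in steps of 0.5
x_breaks <- seq(2, 5, by = 0.5)
x_breaks

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  scale_x_continuous(breaks = x_breaks)
```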

We can also adjust the extents of the x or y axis.

metabric |> 
  ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
  geom_point(size = 0.6, alpha = 0.5) +
  geom_smooth(method = "lm") +
  scale_y_continuous(breaks = seq(5, 15, by = 2.5), limits = c(4, 12))
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 163 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 163 rows containing missing values or values outside the scale range
(`geom_point()`).

Here, just for demonstration purposes, we set the upper limit to be less than the largest values of ESR1 expression and ggplot2 warned us that some rows have been removed from the plot.
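
If the aim is only to zoom in on part of the plot without discarding any data, one option is to set the limits on the coordinate system with coord_cartesian() instead of on the scale. The sketch below compares the two approaches on the mtcars data set that ships with R; the limits of 15 and 25 are arbitrary values picked for that data.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# scale limits: values outside 15-25 are treated as missing before plotting
clipped <- base_plot + scale_y_continuous(limits = c(15, 25))

# coordinate limits: all values are kept, only the visible window changes
zoomed <- base_plot + coord_cartesian(ylim = c(15, 25))

# count the y values that remain available in the built data for each version
kept_clipped <- sum(!is.na(layer_data(clipped)$y))
kept_zoomed <- sum(!is.na(layer_data(zoomed)$y))
c(clipped = kept_clipped, zoomed = kept_zoomed)
```

The distinction matters most for layers such as geom_smooth, because with scale limits the fit is computed only from the values that remain.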

We can change the minor breaks, e.g. to add more lines that act as guides. These are shown as thin white lines when using the default theme.

metabric |> 
  ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
  geom_point(size = 0.6, alpha = 0.5) +
  geom_smooth(method = "lm") +
  scale_y_continuous(
    breaks = seq(5, 12.5, by = 2.5), 
    minor_breaks = seq(5, 13.5, 0.5), 
    limits = c(5, 13.5))
`geom_smooth()` using formula = 'y ~ x'

Or we can remove the minor breaks entirely.

metabric |> 
  ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
  geom_point(size = 0.6, alpha = 0.5) +
  geom_smooth(method = "lm") +
  scale_y_continuous(
    breaks = seq(6, 14, by = 2), 
    minor_breaks = NULL, 
    limits = c(5, 13.5))
`geom_smooth()` using formula = 'y ~ x'

Similarly we could remove all breaks entirely.

metabric |> 
  ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
  geom_point(size = 0.6, alpha = 0.5) +
  geom_smooth(method = "lm") +
  scale_y_continuous(breaks = NULL)
`geom_smooth()` using formula = 'y ~ x'

A more typical scenario would be to keep the breaks, because we want to display the ticks and their labels, but remove the grid lines. Somewhat confusingly, the position of the grid lines is controlled by a scale but preventing them from being displayed requires changing the theme. The theme controls the way in which non-data components are displayed – we’ll look at how these can be customized later. For now, though, here’s an example of turning off the display of all grid lines for major and minor breaks for both axes.

metabric |> 
  ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
  geom_point(size = 0.6, alpha = 0.5) +
  geom_smooth(method = "lm") +
  scale_y_continuous(breaks = seq(4, 14, by = 2), limits = c(4, 14)) +
  theme(panel.grid = element_blank())
`geom_smooth()` using formula = 'y ~ x'

By default, continuous scales are expanded by 5% of the range on either side. We can add or reduce the space as follows.

metabric |> 
  ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
  geom_point(size = 0.6, alpha = 0.5) +
  geom_smooth(method = "lm") +
  scale_x_continuous(expand = expansion(mult = 0.01)) +
  scale_y_continuous(expand = expansion(mult = 0.25))
`geom_smooth()` using formula = 'y ~ x'

Here we only added 1% (0.01) of the range of NPI values on either side along the x axis but we added 25% (0.25) of the range of ESR1 expression on either side along the y axis.
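
The expansion() function also accepts a vector of two values, so the lower and upper ends of an axis can be treated differently. As a sketch using the mpg data set that ships with ggplot2, the following removes the gap below the bars of a bar plot while keeping a small gap above them.

```r
library(ggplot2)

# no extra space below the bars, 5% of the range above them
bar_expand <- expansion(mult = c(0, 0.05))
bar_expand

ggplot(mpg, aes(x = class)) +
  geom_bar() +
  scale_y_continuous(expand = bar_expand)
```

There is also an add argument for specifying the extra space in the units of the data rather than as a fraction of the range.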

We can move the axis to the other side of the plot. It’s not obvious why you’d want to do this, but with ggplot2 just about anything is possible.

metabric |> 
  ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
  geom_point(size = 0.6, alpha = 0.5) +
  geom_smooth(method = "lm") +
  scale_x_continuous(position = "top")
`geom_smooth()` using formula = 'y ~ x'

Colours

The colour aesthetic is used with a categorical variable, ER_IHC, in the scatter plots we’ve been customizing. The default colour scale used by ggplot2 for categorical variables is scale_colour_discrete. We can manually set the colours we wish to use using scale_colour_manual instead.

metabric |> 
  ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
  geom_point(size = 0.6, alpha = 0.5) +
  geom_smooth(method = "lm") +
  scale_colour_manual(values = c("dodgerblue2", "firebrick2"))
`geom_smooth()` using formula = 'y ~ x'

Setting colours manually is ok when we only have two or three categories but when we have a larger number it would be handy to be able to choose from a selection of carefully-constructed colour palettes. Helpfully, ggplot2 provides access to the ColorBrewer palettes through the functions scale_colour_brewer() and scale_fill_brewer().

metabric |> 
  ggplot(mapping = aes(x = NPI, y = ESR1, colour = THREEGENE)) + 
  geom_point(size = 0.6, alpha = 0.5, na.rm = TRUE) +
  scale_colour_brewer(palette = "Set1")

Look at the help page for scale_colour_brewer to see what other colour palettes are available and visit the ColorBrewer website to see what these look like.
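
The palettes can also be inspected from within R using the RColorBrewer package. This sketch assumes RColorBrewer is installed, which is normally the case when ggplot2 is installed since the scales package depends on it.

```r
library(RColorBrewer)

# one row per palette: maximum number of colours, category and colour-blind suitability
head(brewer.pal.info)

# the first five colours of the Set1 palette as hex codes
set1_colours <- brewer.pal(5, "Set1")
set1_colours
```

Calling display.brewer.all() draws every palette in the plotting window, which is handy for choosing between them.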

Interestingly, you can set attributes other than just the colours at the same time.

metabric |> 
  ggplot(mapping = aes(x = NPI, y = ESR1, colour = ER_IHC)) +
  geom_point(size = 0.6, alpha = 0.5) +
  geom_smooth(method = "lm") +
  scale_colour_manual(
    values = c("dodgerblue2", "firebrick2"), 
    labels = c("ER-negative", "ER-positive")) +
  labs(colour = NULL)  # remove legend title for colour now that the labels are self-explanatory
`geom_smooth()` using formula = 'y ~ x'

We have applied our own set of mappings from levels in the data to aesthetic values.
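
When colours are given as an unnamed vector they are assigned in the order of the levels, which is easy to get wrong. One way to make the pairing explicit is to name the colour vector after the values in the data. The sketch below uses a made-up data frame with a hypothetical status column rather than the Metabric data.

```r
library(ggplot2)

# made-up data with a two-level status column, for illustration only
toy <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(2, 4, 3, 5),
  status = c("neg", "pos", "neg", "pos")
)

named_plot <- ggplot(toy, aes(x = x, y = y, colour = status)) +
  geom_point(size = 3) +
  # names match values in the data, so the pairing does not depend on level order
  scale_colour_manual(values = c(pos = "firebrick2", neg = "dodgerblue2"))
named_plot
```

Even though "pos" is listed first in the colour vector, the points with status "neg" are still drawn in dodgerblue2.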

For continuous variables we may wish to be able to change the colours used in the colour gradient. To demonstrate this we’ll correct the Nottingham prognostic index (NPI) values and use this to colour points in the scatter plot of ESR1 vs GATA3 expression on a continuous scale.

# Nottingham_prognostic_index is incorrectly calculated in the data downloaded from cBioPortal
metabric <- metabric |> 
  mutate(NPI = 0.02 * TUMOR_SIZE + LYMPH_NODES_EXAMINED_POSITIVE + GRADE)

metabric |> 
  filter(!is.na(NPI)) |> 
  ggplot(mapping = aes(x = GATA3, y = ESR1, colour = NPI)) +
  geom_point(size = 0.5)

Higher NPI scores correspond to worse prognosis and lower chance of 5-year survival. We’ll emphasize those points on the scatter plot by adjusting our colour scale.

metabric |> 
  filter(!is.na(NPI)) |> 
  ggplot(mapping = aes(x = GATA3, y = ESR1, colour = NPI)) +
  geom_point(size = 0.75) +
  scale_colour_gradient(low = "white", high = "firebrick2")

In some cases it might make sense to specify two colour gradients either side of a mid-point.

metabric |> 
  filter(!is.na(NPI)) |> 
  ggplot(mapping = aes(x = GATA3, y = ESR1, colour = NPI)) +
  geom_point(size = 0.75) +
  scale_colour_gradient2(
    low = "dodgerblue1", 
    mid = "grey90", 
    high = "firebrick1", 
    midpoint = 4.5)

As before we can override the default labels and other aspects of the colour scale within the scale function.

metabric |> 
  filter(!is.na(NPI)) |> 
  ggplot(mapping = aes(x = GATA3, y = ESR1, colour = NPI)) +
  geom_point(size = 0.5) +
  scale_colour_gradient(
    low = "lightblue", high = "darkblue",
    name = "NPI Values",
    breaks = 2:6,
    limits = c(1.5, 6.5)
  )

Themes

Themes can be used to customize non-data components of a plot. Let’s create a plot showing the expression of estrogen receptor alpha (ESR1) for each of the Integrative cluster breast cancer subtypes.

# convert the INTCLUST variable into a categorical variable with the levels
# in the correct order, replace missing THREEGENE values, and select just
# the columns and rows we're going to use
metabric <- metabric |> 
  mutate(INTCLUST = 
           factor(INTCLUST, 
                  levels = c("1", "2", "3", "4ER-", "4ER+", "5", "6", "7", "8", "9", "10"))) |> 
  mutate(THREEGENE = 
           replace_na(THREEGENE, "Unclassified")) |> 
  select(PATIENT_ID, ER_IHC, PR_STATUS, THREEGENE, INTCLUST, ESR1) |> 
  filter(!is.na(INTCLUST), !is.na(ESR1))
# plot the ESR1 expression for each integrative cluster
plot <- metabric |> ggplot() +
  geom_boxplot(aes(x = INTCLUST, y = ESR1, fill = INTCLUST)) +
  labs(x = "Integrative cluster", y = "ESR1 expression")
plot

The default theme has the characteristic grey background which isn’t particularly suitable for printing on paper. We can change to one of a number of alternative themes available in the ggplot2 package, e.g. the black and white theme.

plot + theme_bw()

Each of these themes is really just a collection of attributes relating to how various non-data elements of the plot will be displayed. We can override any of these individual settings using the theme() function. A look at the help page (?theme) shows that there are a very large number of settings that you can change. The following example demonstrates a few of these.

plot +
  theme_bw() +
  theme(
    panel.grid.major.x = element_blank(),
    axis.ticks.x = element_blank(),
    legend.position = "none"
  )
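
Since themes are ordinary R objects, a combination of settings like this can be stored once and reused across plots. The following sketch uses the mpg data set that ships with ggplot2, and the name clean_theme is just a label chosen for this example.

```r
library(ggplot2)

# bundle a complete theme with a few overrides into one reusable object
clean_theme <- theme_bw() +
  theme(
    panel.grid.major.x = element_blank(),
    axis.ticks.x = element_blank(),
    legend.position = "none"
  )

# apply the same look to any plot; here a box plot of the built-in mpg data
themed_plot <- ggplot(mpg, aes(x = class, y = hwy, fill = class)) +
  geom_boxplot() +
  clean_theme
themed_plot
```

To apply a stored theme to every plot in a session without adding it each time, it can be passed to theme_set().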

Here’s another example that also involves customizing the labels, scales and colours.

metabric |> ggplot() +
  geom_bar(aes(x = THREEGENE, fill = ER_IHC)) +
  scale_y_continuous(
    limits = c(0, 700), 
    breaks = seq(0, 700, 100), 
    expand = expansion(mult = 0)) +
  scale_fill_manual(values = c("firebrick2", "dodgerblue2")) +
  labs(x = NULL, y = "samples", fill = "ER status") +
  theme_bw() +
  theme(
    panel.border = element_blank(),
    panel.grid = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
    axis.line.y = element_line(),
    axis.ticks.length.y = unit(0.2, "cm"),
    legend.position = "bottom"
  )

The ggthemes package contains some extra themes and might be fun to check out. Here’s an example of a plot that uses the theme_gdocs theme that resembles the default look of charts in Google Docs.

library(ggthemes)
metabric |> 
  filter(THREEGENE == "HER2+") |> 
  ggplot(aes(x = PR_STATUS, y = ESR1)) +
  geom_boxplot() +
  geom_jitter(
    aes(colour = PR_STATUS), 
    width = 0.25, 
    alpha = 0.4, 
    show.legend = FALSE) +
  scale_colour_brewer(palette = "Set1") +
  labs(x = "PR status", y = "ESR1 expression") +
  theme_gdocs()

Position adjustments

All geoms in ggplot2 have a position adjustment that can be set using the position argument. This has different effects for different types of plot but essentially this resolves how overlapping geoms are displayed.

For example, let’s consider the stacked bar plot we created earlier showing the numbers of patients in each of the 3-gene classifications subdivided by ER status. The default position value for geom_bar() is “stack” which is why the plot is shown as a stacked bar chart. An alternative way of representing these data would be to show separate bars for each ER status side-by-side by setting position = "dodge".

metabric |> ggplot() +
  geom_bar(aes(x = THREEGENE, fill = ER_IHC), position = "dodge") +
  scale_y_continuous(limits = c(0, 700), breaks = seq(0, 700, 100), expand = expansion(mult = 0)) +
  scale_fill_manual(values = c("firebrick2", "dodgerblue2")) +
  labs(x = NULL, y = "samples", fill = "ER status") +
  theme_bw() +
  theme(
    panel.border = element_blank(),
    panel.grid = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
    axis.line.y = element_line(),
    axis.ticks.length.y = unit(0.2, "cm")
  )
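
A third option for bar plots is position = "fill", which stacks the bars and rescales each one to the same height so that proportions rather than counts are compared. The sketch below shows this with the mpg data set that ships with ggplot2 rather than the Metabric data.

```r
library(ggplot2)

# proportions of drive type within each vehicle class (built-in mpg data)
fill_plot <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "fill") +
  labs(y = "proportion")
fill_plot
```

This view makes it easier to compare the composition of groups that differ a lot in size, at the cost of hiding the absolute numbers.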

Another position adjustment we’ve come across is jitter: geom_jitter() is just a convenient shortcut for geom_point(position = "jitter"). A variation on this, position_jitterdodge(), comes in handy when we are overlaying points on top of a box plot. Here is an example of just such a plot in which we first use position = "jitter".

metabric |> 
  ggplot(aes(x = THREEGENE, y = ESR1, colour = PR_STATUS)) +
  geom_boxplot() +
  geom_point(position = "jitter", size = 0.5, alpha = 0.3) +
  labs(x = "3-gene classification", y = "ESR1 expression", colour = "PR status") +
  scale_color_brewer(palette = "Set1") +
  theme_minimal() +
  theme(
    panel.grid.major.x = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
    axis.ticks.x = element_blank()
  )

The PR-negative and PR-positive points have distinct colours but are overlapping in a way that is aesthetically displeasing. What we want is for the points to be both jittered and dodged in the same way as the boxes. With position_jitterdodge() we get a better looking plot.

metabric |> 
  ggplot(aes(x = THREEGENE, y = ESR1, colour = PR_STATUS)) +
  geom_boxplot() +
  geom_point(position = position_jitterdodge(), size = 0.5, alpha = 0.3) +
  labs(x = "3-gene classification", y = "ESR1 expression", colour = "PR status") +
  scale_color_brewer(palette = "Set1") +
  theme_minimal() +
  theme(
    panel.grid.major.x = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
    axis.ticks.x = element_blank()
  )
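
position_jitterdodge() takes arguments that control the amount of jitter and dodging, and a seed that makes the random jitter reproducible. The sketch below uses the mtcars data set that ships with R, with two of its numeric columns converted to factors for the purpose of the example; the specific widths are illustrative values, not recommendations.

```r
library(ggplot2)

# treat cylinder count and transmission type as categories (built-in mtcars data)
cars_cat <- transform(mtcars, cyl = factor(cyl), am = factor(am))

jd_plot <- ggplot(cars_cat, aes(x = cyl, y = mpg, colour = am)) +
  geom_boxplot(outlier.shape = NA) +   # hide outliers, the points layer already shows them
  geom_point(
    position = position_jitterdodge(
      jitter.width = 0.2,   # horizontal spread of the points within each group
      dodge.width = 0.75,   # amount of dodging between groups (0.75 is the default)
      seed = 1              # fixed seed so the jitter is the same on every render
    ),
    size = 0.8, alpha = 0.6
  )
jd_plot
```

Fixing the seed is useful when a figure needs to look identical each time the script is run.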