10. Graphics with ggplot2

Author: Smit, A. J.

Affiliation: University of the Western Cape

Published: January 1, 2021

“The simple graph has brought more information to the data analyst’s mind than any other device.”

— John Tukey

“If I cannot picture it, I cannot understand it.”

— Albert Einstein

1 Example Figures

Just to whet the appetite, below is a small selection of the figures that R and ggplot2 are capable of producing. These are figures that AJ and/or I have produced for publication or, in some cases, just for personal interest. Remember, just because we are learning this for work does not mean we cannot use it for fun, too. The idea of using R for fun may seem bizarre, but perhaps by the end of Day 5 we will have convinced you otherwise!

A hypsometric map showing the distribution of linefish catches.

Another variation of the previous hypsometric map.

The effect of variance (SD) within a temperature time series on the accurate modelling of decadal trends.

The (currently) three most infamous marine heatwaves (MHWs) around the world.

Changes in seaweed biodiversity along the South African coastline.

Most appropriate autoregressive correlation coefficients for areas around western boundary currents.

The bathymetry of South Africa with SSTs from the MUR product.

The power of the detected decadal trend at each coastal temperature collection site given a hypothetical number of months.

The strength of the relationship between each site based on their biodiversity.

Current velocities of Western Boundary Currents.

2 ggplot2, ggplot, and `ggplot()`

R comes with basic graphing capability, known colloquially (by nerds like me) as base R graphics. The syntax used for this method of creating graphics is often difficult to interpret as there are few human words in the code. In addition to this issue, base R graphics also does not allow the user enough control over the look of the final product to satisfy the demands of many publishers. This means that the figures tend not to look professional enough (but still much better than Excel). To solve both of these problems, and others, the ggplot2 package was born. ggplot2 is a widely used, popular graphics package in R, based on Leland Wilkinson’s The Grammar of Graphics.

To avoid confusion, I will distinguish between three related ideas: ggplot2 (the package), ggplot (the system for building graphics), and ggplot() (the function that starts a plot).

Before we dive into syntax, here are three mental models that will keep you oriented as we move through examples:

ggplot builds plots by accumulation, not transformation. You start with a base layer and add components.
Aesthetics are either data-driven or constant — never both. If it comes from data, it belongs inside aes(). If it is a fixed value, it belongs outside.
Grouping is not decoration; it is data identity. It tells ggplot which observations belong together.

Although there are many advantages to using ggplot2 over base R graphics, people do at times raise a few issues with the package. These are criticisms I see mentioned on the internet from time to time, but my opinion differs, and I add my own views in square brackets after each point:

Learning curve: The package can have a steep learning curve for beginners, as it employs a different syntax and logic compared to base R graphics. Users may need to invest time and effort to become proficient in using ggplot2 effectively. [I think it is mainly long-time R users that hold this view. My view is that for first-time users, it may be easier to learn compared to base R graphics.]
Customisation limitations: Although ggplot2 provides extensive customisation options, there are certain cases where users might find it difficult to achieve the desired level of customisation for their plots. In some instances, base R graphics or other specialised packages might offer better control over specific plot elements. [Maybe, in specialised instances only. I will think of a few such instances and put them here. But I think the recent development of many add-on packages has largely eliminated the shortfalls people experienced early on. See, for example, the various extensions available.]
Performance: Drawing figures made with ggplot2 (not the coding but the computational speed) can be slower than base R graphics, especially when dealing with large datasets or producing a large number of plots. This may not be ideal for users who require fast, real-time plotting or are working with limited computational resources.
Overhead: The package relies on additional packages and dependencies, which can add to the overhead of managing the R environment. Users who prefer a more lightweight approach may find base R graphics more appealing.
Less suitable for 3D plotting: It is true that ggplot2 is primarily designed for creating 2D graphics. It is possible to create 3D plots with some workarounds, but ggplot2 may not be the best choice for users who frequently work with 3D data visualisation. Other packages, such as lattice, scatterplot3d, or rayshader, may be better suited for these purposes.
Layered approach complexity: The layered approach of ggplot2, while powerful and flexible, can become complex and verbose when building more intricate plots. This might lead to less readable and maintainable code in some cases. [I disagree. I personally find the code more readable, and very intuitive to understand. It is true that it can be quite verbose, and that simple graphs can be constructed more quickly in base R graphics with fewer lines of code.]

Nevertheless, ggplot2 remains a highly popular, versatile package for data visualisation in R. Its strengths in creating beautiful (but not always by default!), customisable, and complex graphics often outweigh its limitations. It is also less cluttered with nonsensical jargon, and its vocabulary is easier for mere humans to understand: most of the package’s functions are English verbs. So, let us look at the basic concepts in some detail; I will do so mostly by working through numerous examples.

2.1 `geom_*()`, the Pipe (`%>%` or `|>`), and the `+` Sign

Transition: we now move from ggplot as a system to ggplot() as syntax. Keep the first mental model in mind, that is, ggplot builds plots by accumulation.

As part of the tidyverse (as we saw briefly on Day 1, and will go into in depth on Day 4), the ggplot2 package endeavours to use a clean syntax that is easy for humans to understand and that relies heavily on functions that do what they say. For example, the function geom_point() makes points on a figure. Need a line plot? geom_line() is the way to go! Need both at the same time? No problem. In ggplot we may seamlessly merge a nearly limitless number of objects together to create startlingly sophisticated figures.

Before we go over the code below, it is very important to note the use of the + signs. This is different from the pipe symbol (|> or %>%) used elsewhere in the tidyverse: the pipe passes data from one function to the next, whereas the + sign indicates that one set of geometric features is added to another, each building on top of what came before. In other words, we add one geometry on top of the next, and in this way we can arrive at complex graphical representations of data. Effectively, each line of code adds one new geometric feature, with its own aesthetic appearance, to the figure. It is designed this way to make the code easier for the human eye to read.

+ Signs in ggplot() Code

One may see below that the code indents itself if the previous line ended with a + sign. The editor (RStudio, for example) does this to show that the top line is the parent line and the indented lines are its children. This is a concept that will come up again when we learn about tidying data. What we need to know now is that a block of code that has + signs, like the one below, must be run together. As long as lines of code end in +, R will assume that you want to keep adding lines of code (more geometric features). If we are not mindful of what we are doing, we may leave a + at the end of the last line; the console will then show a + prompt instead of the usual >, which means R is still waiting for more code. If this happens, click inside the console window and press the Esc key to cancel the chain of code you are trying to enter.
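As a minimal sketch of how a chain of + signs reads, the following layers points and a line on a small made-up dataframe. The object `toy` and its columns `day` and `temp` are assumptions for illustration only; they are not part of the course data.

```r
library(ggplot2)

# Hypothetical toy data (invented values, not from the course dataset)
toy <- data.frame(
  day  = 1:10,
  temp = c(14.2, 14.8, 15.1, 15.9, 16.4, 16.1, 17.0, 17.6, 17.3, 18.1)
)

# Each line ends with + so R keeps reading; the last line has no +
p <- ggplot(data = toy, aes(x = day, y = temp)) +
  geom_point(colour = "black", size = 2) +  # layer 1: points
  geom_line(colour = "grey40")              # layer 2: a line drawn over the points

p
```

Select and run the whole block at once. Running only the first two lines would leave R waiting at the + prompt described above.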

Debugging ggplot Messages

Most ggplot2 errors and warnings point to a specific layer or aesthetic. Read them as clues about which layer failed and why. If the message mentions an unknown aesthetic or object, check spelling and whether the variable exists in your data. If it mentions a problem with a scale, check whether you accidentally mapped a constant inside aes() or mixed mapped and constant aesthetics across layers.
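To see what such a message looks like, here is a small sketch that provokes an error on purpose. The dataframe `toy` and the misspelt column `yy` are hypothetical. The exact wording of the message depends on the ggplot2 version installed, so I print it rather than quote it.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 5, 6))

# 'yy' is a deliberate typo: no such column exists in toy
bad_plot <- ggplot(toy, aes(x = x, y = yy)) +
  geom_point()

# The error only surfaces when the plot is built (or printed),
# so we build it inside tryCatch() to capture the message
err <- tryCatch(ggplot_build(bad_plot), error = function(e) e)

# Print the message and look for the layer and variable it mentions
cat(conditionMessage(err), "\n")
```

Note that creating `bad_plot` succeeds silently. The mistake is only detected when ggplot tries to draw, which is why errors often seem to appear "late".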

2.2 `aes()`

Transition: we now move from accumulation to aesthetic mapping. Keep the second mental model in view: aesthetics are data-driven or constant.

Another recurring function within the parent ggplot() function or the associated geom_*() is aes(). The aes() function in ggplot2 is used to specify the mapping between variables in a dataframe and visual properties of a plot. aes() stands for ‘aesthetic,’ which refers to the visual elements of a plot, such as colour, size, shape, etc. In ggplot, the aesthetics of a plot are defined inside the aes() function, which is passed as an argument to the base ggplot() function or its associated geometry.

It helps to separate positional aesthetics from non-positional aesthetics. Positional aesthetics (x, y) place data in the coordinate system and always create scales. Non-positional aesthetics (colour, size, shape, alpha, group) change how the geometry is drawn, and they may or may not create a scale depending on whether you map them to data.

For example, if you have a dataframe with two variables x and y, you can create a scatterplot of x against y by calling ggplot(data, aes(x, y)) + geom_point(). The aes(x, y) function maps the variables (columns) in the dataframe to the x and y positions of the points in the scatterplot. Similarly, we can map variables in the dataframe to non-positional aesthetics, such as colour (e.g., a colour might be more intense as the magnitude of the values in a column increases), size (larger symbols for bigger values), transparency, or grouping.
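The scatterplot described above can be sketched with a tiny invented dataframe. The names `toy`, `x`, `y`, and `z` are placeholders I made up for this illustration.

```r
library(ggplot2)

# Hypothetical toy data: two positional variables and one extra variable
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2),
  z = c(5, 10, 15, 20, 25, 30)
)

# Positional aesthetics only: x and y place each point
p_xy <- ggplot(toy, aes(x = x, y = y)) +
  geom_point()

# Add non-positional aesthetics: colour and size both follow z
p_xyz <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(colour = z, size = z))

p_xy
p_xyz
```

The second plot should gain a legend because colour and size are now mapped to data.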

3 The World Ocean Atlas (WOA18) Core Dataset

In this course we will increasingly use real ocean data. To keep things simple (and fast), we will use a small, tidy extract of World Ocean Atlas 2018 (WOA18) climatologies.

About the dataset used in this chapter (World Ocean Atlas 2018)

In this chapter we use a small, tidy extract of World Ocean Atlas 2018 (WOA18) climatologies for the broader Southern Africa region.

Why WOA matters in ocean science:

Temperature and salinity are the fundamental state variables of seawater, and together shape density and stratification.
Dissolved oxygen is a key indicator of ventilation, productivity, and habitat suitability.
Nutrients (nitrate, phosphate, silicate) constrain primary production and structure ecosystems.

These variables are not “just numbers”: they encode the physical and biogeochemical structure of the ocean.

# Load libraries
library(tidyverse)
library(here)

# Load the core teaching dataset (WOA18 climatology extract)
woa <- readr::read_csv(
  here::here("data", "SAMOS", "processed", "woa18_sa_core_1deg_monthly.csv"),
  show_col_types = FALSE
)

# Quick look
glimpse(woa)

R> Rows: 200,382
R> Columns: 8
R> $ lat      <dbl> -44.5, -44.5, -44.5, -44.5, -44.5, -44.5, -44.5, -44.5, -44.5…
R> $ lon      <dbl> 6.5, 7.5, 9.5, 12.5, 14.5, 15.5, 19.5, 20.5, 22.5, 24.5, 26.5…
R> $ depth_m  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
R> $ value    <dbl> NA, 295.308, 295.840, NA, 280.251, NA, 270.377, 270.764, 289.…
R> $ month    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
R> $ variable <chr> "dissolved_oxygen", "dissolved_oxygen", "dissolved_oxygen", "…
R> $ unit     <chr> "umol/kg", "umol/kg", "umol/kg", "umol/kg", "umol/kg", "umol/…
R> $ source   <chr> "WOA18 decav 1.00° CSV", "WOA18 decav 1.00° CSV", "WOA18 deca…

Data dictionary

See: data/SAMOS/processed/woa18_sa_core_1deg_monthly_DICTIONARY.md

3.1 First plot: a classic Temperature–Salinity (T–S) view

In oceanography, temperature and salinity are often plotted against each other to reveal water-mass structure. Here we do a simple version of that idea using surface climatology (0 m) for February.

woa %>%
  filter(month == 2, depth_m == 0, variable %in% c("temperature", "salinity")) %>%
  select(lon, lat, variable, value) %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  ggplot(aes(x = salinity, y = temperature)) +
  geom_point(alpha = 0.35, size = 0.8) +
  labs(x = "Salinity (PSU)", y = "Temperature (°C)")

Figure 1: WOA18 surface (0 m) February climatology: temperature vs salinity.

3.2 How to Read This Code

Read the plot like a sentence with grammar: subject → verb → modifiers. The subject is the data (here: a filtered slice of woa), the verb is the geometric action (geom_point()), and the modifiers are the aesthetic mappings (aes(...)) that tell ggplot how to draw. The ggplot(...) line sets the stage (data + core mappings), and each geom_*() line adds a new clause. The + sign means “and then add another clause.”

So what is that code doing? We may see from Figure 1 that it is creating a dot for every grid cell, placing salinity on the x‑axis and temperature on the y‑axis (a very common oceanography view).

As a workflow, it is perfectly normal to build plots incrementally: start with a minimal plot, confirm the axes are correct, then add layers one at a time. Partial plots are not failures; they are thinking tools.

The ggplot() line tells R that we want to create a ggplot figure. Normally we would name the dataframe (or tibble) inside that function; here the data arrive through the pipe (%>%), so the filtered and reshaped woa data are passed to ggplot() as its first argument automatically. Lastly, with the aes() function we tell R what the necessary parts of the figure will be. This is also known as ‘mapping’ (variables map to the visual appearance and arrangement of figure elements).

The geom_point() line then takes all of that information and makes points (dots) out of it, added as a layer on the set of axes created by the aes() argument provided within ggplot(...). In other words, we add a ‘geometry’ layer, which is why the name of the kind of ‘shape’ we want to plot the data as is prefixed by geom_.
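The incremental workflow mentioned earlier can be sketched by saving each stage as its own object. The values in `toy` are invented; only the column names mimic the T–S example.

```r
library(ggplot2)

# Hypothetical toy data (invented values)
toy <- data.frame(
  salinity    = c(34.2, 34.6, 35.0, 35.2, 35.4, 35.5),
  temperature = c(8.5, 11.0, 14.2, 17.8, 20.1, 22.4)
)

# Step 1: data and axes only; no geom yet, so the panel is empty
p0 <- ggplot(toy, aes(x = salinity, y = temperature))

# Step 2: add points once the axes look right
p1 <- p0 + geom_point()

# Step 3: add axis labels with units
p2 <- p1 + labs(x = "Salinity (PSU)", y = "Temperature (°C)")

p0
p1
p2
```

Printing `p0` on its own is a quick way to confirm that the right variables ended up on the right axes before anything else is added.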

In many cases (especially time series) you will add a geom_line() layer. With the WOA climatology we typically do not connect points with lines because they are not ordered in time within each location.
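For contrast, here is a small sketch of a case where geom_line() is appropriate: observations ordered in time at two sites. The dataframe `ts_toy`, the site names, and the values are all made up. The example also shows the group aesthetic in action.

```r
library(ggplot2)

# Hypothetical monthly temperatures at two made-up sites
ts_toy <- data.frame(
  month = rep(1:6, times = 2),
  site  = rep(c("A", "B"), each = 6),
  temp  = c(21.0, 21.4, 20.6, 19.1, 17.5, 16.2,
            18.2, 18.9, 18.1, 16.8, 15.4, 14.3)
)

# group = site tells ggplot which observations belong to the same line;
# without it, geom_line() would join all twelve points into one zig-zag
p_ts <- ggplot(ts_toy, aes(x = month, y = temp, group = site)) +
  geom_line() +
  geom_point()

p_ts
```

Try removing `group = site` to see what happens when ggplot does not know which observations belong together.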

However, we can add an extra aesthetic mapping to include a third variable. In Figure 2 we colour points by dissolved oxygen (surface, February). This is an example of mapping a continuous variable to a continuous colour scale.

woa %>%
  filter(month == 2, depth_m == 0, variable %in% c("temperature", "salinity", "dissolved_oxygen")) %>%
  select(lon, lat, variable, value) %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  ggplot(aes(x = salinity, y = temperature)) +
  geom_point(aes(colour = dissolved_oxygen), alpha = 0.6, size = 0.8) +
  scale_colour_viridis_c(name = "Oxygen (µmol/kg)") +
  labs(x = "Salinity (PSU)", y = "Temperature (°C)")

Figure 2: Surface February climatology: T–S scatter coloured by dissolved oxygen.

Do any patterns appear to emerge in Figure 2? Typically oxygen is higher in cooler surface waters (and lower in warmer waters), but the relationship is not purely linear because circulation and biology both matter.

We can still add a simple best‑fit line through the points to demonstrate geom_smooth() (Figure 3):

woa %>%
  filter(month == 2, depth_m == 0, variable %in% c("temperature", "salinity", "dissolved_oxygen")) %>%
  select(lon, lat, variable, value) %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  ggplot(aes(x = temperature, y = dissolved_oxygen)) +
  geom_point(alpha = 0.35, size = 0.8) +
  geom_smooth(method = "lm") +
  labs(x = "Temperature (°C)", y = "Dissolved oxygen (µmol/kg)")

Figure 3: Surface February climatology: temperature vs dissolved oxygen with a simple linear trend (for demonstration).

4 To `aes()` or Not to `aes()`, That Is the Question

Transition: we now move from mapping to setting, which explains why legends appear (or disappear). Keep the second mental model in view.

The astute eye will have noticed by now that most arguments we have added to the code have been inside of the aes() function. So what exactly is that aes() function doing sitting inside of the other functions? The reason for the aes() function is that it controls the look of the other functions dynamically based on the variables you provide it. If we want to change the look of the plot by some static value we would do this by passing the argument for that variable to the geom of our choosing outside of the aes() function. Let us see what this looks like by changing the colour of the dots.

Why does ggplot need this distinction at all? Because it builds a mapping object before it draws anything. That mapping object tells ggplot which variables drive scales, legends, and warnings. Mapped aesthetics (inside aes()) are treated as data-driven and get scales and legends; constant aesthetics (outside aes()) are treated as fixed settings and do not.

woa %>%
  filter(month == 2, depth_m == 0, variable %in% c("temperature", "salinity")) %>%
  select(lon, lat, variable, value) %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  ggplot(aes(x = salinity, y = temperature)) +
  geom_point(colour = "steelblue", alpha = 0.35, size = 0.8) +
  labs(x = "Salinity (PSU)", y = "Temperature (°C)")

Figure 4: Constant colour set outside aes() (no legend is created).

Why did no legend appear in Figure 4? Because the colour was set outside aes(), so ggplot treats it as a fixed setting, not a mapping. If you put a constant inside aes() (for example, aes(colour = "steelblue")), ggplot thinks you are mapping a category called “steelblue” and will create a legend.
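The difference between setting and (accidentally) mapping a constant can be checked directly. This is a minimal sketch on an invented dataframe `toy`; it uses ggplot_build() to look at the colours ggplot computed for each plot.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(x = 1:5, y = c(3, 1, 4, 1, 5))

# Setting: colour is a constant outside aes(); no legend is drawn
p_set <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(colour = "steelblue")

# Mapping a constant by mistake: "steelblue" is treated as a one-level
# category, so a legend appears and the colour comes from the colour scale
p_map <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(colour = "steelblue"))

# Compare the colours actually computed for the point layer
unique(ggplot_build(p_set)$data[[1]]$colour)
unique(ggplot_build(p_map)$data[[1]]$colour)

p_set
p_map
```

In the second plot the points are not steelblue at all, which is usually the first hint that a constant has slipped inside aes().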

Next we map the size of the points to a variable (nitrate), while keeping the other aesthetics constant.

Pitfalls to Watch for

Mixing mapped and constant aesthetics of the same type across layers (e.g., mapping colour in one layer and setting colour = "blue" in another) often creates duplicated legends, literal labels, or confusing warnings.
If a legend appears unexpectedly, check whether you accidentally mapped a constant value inside aes() (e.g., aes(colour = "blue")).
If a legend is missing, check whether you set the aesthetic outside aes() when you meant to map it to data.

woa %>%
  filter(month == 2, depth_m == 0, variable %in% c("temperature", "salinity", "nitrate")) %>%
  select(lon, lat, variable, value) %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  ggplot(aes(x = salinity, y = temperature)) +
  geom_point(aes(size = nitrate), alpha = 0.35, colour = "black") +
  scale_size_continuous(name = "Nitrate (µmol/kg)") +
  labs(x = "Salinity (PSU)", y = "Temperature (°C)")

Figure 5: Point size mapped to nitrate concentration (surface, February).

Notice that in Figure 5 we mapped point size inside aes(size = nitrate). That creates a size scale and legend. Any size you set outside aes() would be constant for all points.

5 Changing Labels

Transition: we now move from mapping vs setting to labels and legends. Labels are set explicitly; legend titles follow the aesthetics you mapped.

When we use ggplot2 we have control over every minute aspect of our figures if we so wish. The point of that control is not decoration, but communication. A simple heuristic is: reduce cognitive friction, make units explicit, and align legend semantics with the mapped variable. Labels and themes operationalise that heuristic. Clear axis labels remove guesswork, and thoughtful legend placement reduces the effort required to scan the figure.

labs() changes text that clarifies meaning (axis labels, legend titles, captions). theme() controls layout and emphasis (e.g., legend position, text size). When you need to change how data values are converted into visual properties, you will use scale functions such as scale_colour_*() or scale_size_*(). These are the bridge between aesthetics and interpretation, and we will return to them later.
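To keep the three roles apart, here is a small sketch that uses one function of each kind on invented data. The dataframe `toy` and its values are hypothetical, and the choice of the "magma" palette with a reversed direction is only an example of what a scale function can change.

```r
library(ggplot2)

# Hypothetical toy data (invented values)
toy <- data.frame(
  temperature = c(2, 6, 10, 14, 18, 22, 26),
  oxygen      = c(340, 310, 285, 260, 240, 220, 205)
)

p_scale <- ggplot(toy, aes(x = temperature, y = oxygen)) +
  geom_point(aes(colour = oxygen), size = 3) +
  # scale_*(): how oxygen values are converted into colours
  scale_colour_viridis_c(option = "magma", direction = -1) +
  # labs(): the text that readers see
  labs(
    x = "Temperature (°C)",
    y = "Oxygen (µmol/kg)",
    colour = "Oxygen (µmol/kg)"
  ) +
  # theme(): layout, here the legend position
  theme(legend.position = "bottom")

p_scale
```

Changing only the scale line alters how values are encoded, while the labels and the layout stay exactly as they were.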

What we want to do next is put the legend on the bottom of our figure with a horizontal orientation and change the axis labels so that they show the units of measurement. To change the labels we will need the labs() function. To change the position of the legend we need the theme() function as it is within this function that all of the little tweaks are performed. This is best placed at the end of your block of ggplot2 code.

woa %>%
  filter(month == 2, depth_m == 0, variable %in% c("temperature", "dissolved_oxygen")) %>%
  select(lon, lat, variable, value) %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  ggplot(aes(x = temperature, y = dissolved_oxygen)) +
  geom_point(aes(colour = dissolved_oxygen), alpha = 0.6, size = 0.9) +
  scale_colour_viridis_c() +
  labs(
    x = "Temperature (°C)",
    y = "Dissolved oxygen (µmol/kg)",
    colour = "Oxygen (µmol/kg)"
  ) +
  theme(legend.position = "bottom")

Figure 6: Temperature vs dissolved oxygen, with explicit labels and legend placement.

Notice that in Figure 6, when we place the legend at the bottom of the figure ggplot automatically makes it horizontal for us. Why do we use ‘colour’ inside of labs() to change the legend title?

6 Reusable WOA examples (for other chapters)

The following small set of plots is deliberately designed to be copy‑pasted and adapted in later chapters: a spatial temperature map (Figure 7), a T–S scatter (Figure 8), and mean vertical profiles (Figure 9).

library(tidyverse)
library(here)

woa <- readr::read_csv(
  here::here("data", "SAMOS", "processed", "woa18_sa_core_1deg_monthly.csv"),
  show_col_types = FALSE
)

# Convenience: a commonly used slice (surface, February)
woa_surf_feb <- woa %>%
  filter(month == 2, depth_m == 0)

woa_surf_feb %>%
  filter(variable == "temperature") %>%
  ggplot(aes(x = lon, y = lat, fill = value)) +
  geom_raster() +
  coord_equal(expand = FALSE) +
  scale_fill_viridis_c(name = "Temp (°C)") +
  labs(x = "Longitude (°E)", y = "Latitude (°N)") +
  theme_minimal()

Figure 7: WOA18 surface temperature (°C) climatology for February (Southern Africa region).

woa_surf_feb %>%
  filter(variable %in% c("temperature", "salinity")) %>%
  select(lon, lat, variable, value) %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  ggplot(aes(x = salinity, y = temperature)) +
  geom_point(alpha = 0.35, size = 0.7) +
  labs(x = "Salinity (PSU)", y = "Temperature (°C)") +
  theme_minimal()

Figure 8: A classic T–S view (surface, February) using WOA18 climatology.

woa %>%
  filter(month == 0, variable %in% c("temperature", "dissolved_oxygen", "nitrate")) %>%
  group_by(variable, unit, depth_m) %>%
  summarise(value = mean(value, na.rm = TRUE), .groups = "drop") %>%
  mutate(
    facet_label = paste0(
      str_to_title(str_replace_all(variable, "_", " ")),
      " (", unit, ")"
    )
  ) %>%
  ggplot(aes(x = value, y = depth_m)) +
  geom_path(linewidth = 0.8) +
  scale_y_reverse() +
  facet_wrap(~ facet_label, scales = "free_x", nrow = 1) +
  labs(x = NULL, y = "Depth (m)") +
  theme_bw() +
  theme(strip.text = element_text(size = 11))

Figure 9: Mean vertical profiles (annual climatology, region-average).

Do This Now

With all of this information in hand, please take another five minutes to either improve one of the plots generated or create a beautiful graph of your own. Here are some ideas:
1. See if you can change the thickness of the points/lines.
2. Change the shape, colour, fill, and size of each of the points.
3. Can you find a way to change the name of the legend? What about its labels?
4. Explore the different geom functions available. These include geom_boxplot(), geom_density(), etc.
5. Try using a different colour palette.
6. Use different themes.

Reuse

CC BY-NC-SA 4.0

Citation

BibTeX citation:

@online{a._j.2021,
  author = {Smit, A. J.},
  title = {10. {Graphics} with {ggplot2}},
  date = {2021-01-01},
  url = {http://samos-r.netlify.app/intro_r/10-graphics.html},
  langid = {en}
}

For attribution, please cite this work as:

Smit, A. J. (2021) 10. Graphics with ggplot2. http://samos-r.netlify.app/intro_r/10-graphics.html.