In this tutorial, we’ll cover how to generate basic two-dimensional plots in R.
To get started, we’ll load the ggplot2 library. We’ll make use of the openly available diamonds dataset that ships with ggplot2.

library(ggplot2)

Let’s take a quick look at the structure of this data.

str(diamonds)

## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

To see the description of the diamonds dataset, we can run:

?diamonds

As we can see from the printout and from the description of the dataset, it contains information about 53,940 diamonds, and their qualities, such as their weight (coded carat), the quality of their cut (coded cut), their price, and so on.
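
A few quick checks can confirm what the printout shows. This is a minimal sketch that uses only base R functions on the diamonds data; the choice of which columns to inspect is ours, purely for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

dim(diamonds)            # number of rows (diamonds) and columns (variables)
levels(diamonds$cut)     # the ordered quality levels of cut
summary(diamonds$price)  # minimum, quartiles, mean and maximum of price
```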

Let’s say we want to explore the relationship between two of the variables, say carat and price. We can plot one of these variables along the \(x\)-axis of a “Cartesian” (i.e., coordinate) plane, and the other on the \(y\)-axis to discover the nature of that relationship.

Plotting two variables in `base R`

In base R, we can use the built-in function plot. (Use ?plot to read its description in more detail.) To draw a scatterplot, we supply two arguments: an x and a y variable. These will be interpreted and displayed by plot as the \(x\) and \(y\) coordinates of each point. (For this reason, plot expects x and y to be of the same length.)

# Let us assign carat to the variable called 'x', and price to 'y'
x <- diamonds$carat
y <- diamonds$price
plot(x, y)

# Note that "x" and "y", as interpreted above by `plot()`,
# are the values we stored in the two lines of code above.
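
As an alternative sketch, plot also accepts a formula together with a data argument, which avoids creating the intermediate x and y variables. This is not the approach used in the rest of this section, which keeps x and y; it is shown only for comparison.

```r
library(ggplot2)  # provides the diamonds dataset

# Formula interface: read as "price as a function of carat".
# The axis labels then default to the column names instead of "x" and "y".
plot(price ~ carat, data = diamonds)
```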

This is called a scatterplot. It shows us the “bivariate” relationship between two variables by plotting, for each datapoint, the value of one variable along one axis, and the corresponding value of the other variable along the other axis. If datapoints with low values of one variable also tend to have low values of the other, and datapoints with high values of one tend to have high values of the other, the two variables are positively correlated. We see such a trend, among other things, from plotting carat and price here.
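
The visual impression of a positive trend can be checked numerically. The following is a minimal sketch using the base R function cor, which computes the Pearson correlation by default; the variable name r is our own.

```r
library(ggplot2)  # provides the diamonds dataset

# Pearson correlation between carat and price. Values near +1 indicate a
# strong positive linear association, values near -1 a strong negative one.
r <- cor(diamonds$carat, diamonds$price)
r
```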

Now that we have gotten a cursory sense of how the two variables co-occur, let’s add a regression line to visualize the statistical relationship between them. To fit a regression in R, we can use the lm function, which will fit a ‘linear model’. (Use ?lm to see a full description of the function.) Here we’ll use the formula syntax built into R, as follows: y ~ x. Our \(y\) variable is the price of the diamond, and we’ll predict it from carat, our \(x\) variable.

diamonds_model <- lm(y ~ x) # recall that we stored price and carat as 'y' and 'x'
# Or, equivalently: `diamonds_model <- lm(price ~ carat, data = diamonds)`
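
Once the model is fit, its contents can be inspected. The sketch below refits the model with the data argument so that it is self-contained, then pulls out the coefficients and makes one prediction; the 1-carat diamond is a hypothetical input chosen for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

diamonds_model <- lm(price ~ carat, data = diamonds)

coef(diamonds_model)  # intercept and slope of the fitted line

# Predicted price for a hypothetical 1-carat diamond (illustrative value only)
predict(diamonds_model, newdata = data.frame(carat = 1))
```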

Now let’s add a line to visualize the relationship between the variables. The function we use to graph a regression line is abline. abline adds to a plot that has already been drawn, so it must be called after plot; R then draws the regression line on top of the existing scatterplot without re-generating it. To abline, we supply the model object that we fit using lm, and abline automatically converts it to a line.

# Here we re-generate the scatterplot:
plot(x = x, y = y)
abline(diamonds_model, col = 'red')
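
To make the conversion from model to line less mysterious, here is a sketch that passes the intercept and slope to abline by hand through its a and b arguments. It should draw the same line as passing the model object; the variable name b is our own.

```r
library(ggplot2)  # provides the diamonds dataset

diamonds_model <- lm(price ~ carat, data = diamonds)
b <- coef(diamonds_model)  # b[1] is the intercept, b[2] is the slope

plot(diamonds$carat, diamonds$price)
# In abline()'s own argument names, `a` is the intercept and `b` the slope
abline(a = b[1], b = b[2], col = 'red')
```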

Customizing the plots

As you may have noticed, we supplied an additional argument above: col = 'red'. col is an argument (or “parameter”) that the function abline recognizes, and with which the user can specify the color of the line drawn by abline.

There are many other parameters that we can customize to make our graphs look publication ready! To see a list of all of the customizable “graphical parameters”, you can run:

?`graphical parameter`

We’ll start with the following, for just a quick overview:

col: we can supply a character string, such as ‘red’, or a number, such as 2, to specify the color of one or all of the points
pch: we can supply an integer from 0 to 25 to change the shape of the points
xlab; ylab: these parameters can be used to set the labels for the x and y axes
main: supply a character string to main to set the title of the plot.
cex: this controls the size of the points, as well as text, such as axis labels, on the plot
xlim; ylim: Set the range of values spanning the x and y axes

For the abline function (and for plot, depending on the type parameter) we can also change different qualities of the line that we plot. These include, but are not limited to:

lty: change the type of line (e.g., dotted, dashed, solid, etc.)
lwd: the thickness of the line

Again, col and other parameters can be applied as in plot.

Let’s give it a whirl:

plot(x, y,
     col = 'chocolate', # there are many in-built colors in R's palette!
     pch = 20,
     xlab = 'Carat',
     ylab = 'Price',
     main = 'Relationship between the carat and price of a sample of diamonds',
     xlim = c(-2, 6))
# And we'll add the regression line
abline(diamonds_model, col = 'blue', lty = 2, lwd = 2.4)
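
To see what the different pch values look like, the following sketch draws one point per symbol. The layout values (cex, ylim, the fill color) are arbitrary choices made for this illustration.

```r
# Draw one point for each plotting symbol from 0 to 25
plot(0:25, rep(1, 26),
     pch = 0:25,
     cex = 1.5,
     ylim = c(0.5, 1.5),
     yaxt = 'n',  # hide the y axis, which carries no information here
     xlab = 'pch value',
     ylab = '',
     main = 'Point shapes available through pch')

# Symbols 21 to 25 also accept a fill color through the bg parameter
points(21:25, rep(0.8, 5), pch = 21:25, bg = 'gold', cex = 1.5)
```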

Bonus advanced aesthetics: Just as the abline function simply draws a line on top of an existing plot, you can command R to superimpose plots any number of times using the following method:

Generate any plot, such as the ones drawn above with plot
In the next line, run: par(new = TRUE) # or equivalently par(new = T)
Generate another plot, such as one using plot, and it will be superimposed
Repeat any number of times

plot(x, y,
     col = 'black', # there are many in-built colors in R's palette!
     pch = 20,
     cex = 4,
     xlab = 'Carat',
     ylab = 'Price',
     main = 'Relationship between the carat and price of a sample of diamonds')
# instruct R to draw the next plot on top of the existing:
par(new = TRUE)
# Generate a new plot
plot(x, y,
     col = 'chocolate', # there are many in-built colors in R's palette!
     pch = 20,
     cex = 3,
     xlab = '', # we don't want, or need, to redraw the same axis label and title text
     ylab = '', # we don't want, or need, to redraw the same axis label and title text
     main = '') # we don't want, or need, to redraw the same axis label and title text
# And add the regression line as before
abline(diamonds_model, col = 'blue', lwd = 3)
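
For this particular effect, a different technique is also available: the base R function points adds points to the current plot and reuses its axes, much as abline does for lines. The sketch below is an alternative to par(new = TRUE), not the method described above, and should produce a similar outlined-point look.

```r
library(ggplot2)  # provides the diamonds dataset

x <- diamonds$carat
y <- diamonds$price

plot(x, y, col = 'black', pch = 20, cex = 4,
     xlab = 'Carat', ylab = 'Price')
# points() draws onto the existing plot, so there is no second set of axes
# or labels to blank out
points(x, y, col = 'chocolate', pch = 20, cex = 3)
abline(lm(y ~ x), col = 'blue', lwd = 3)
```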

Plotting using `ggplot2`

ggplot2 (the successor to the original ggplot package) is one of the packages loaded by the larger tidyverse package, and is developed by the developers of the other tidyverse packages. It is the data visualization arm of the tidyverse.

ggplot2 is based on the so-called “Grammar of Graphics” (hence the “gg” in “ggplot”). This refers to a literal “grammar”: it is a way of speaking about graphs, from the elements that we notice right away, like the color of the points, to more subtle features like the ‘coordinate system’ itself. A basic breakdown of the grammar is as follows:

Layer
- Data
- Mapping
- Stat
- Geom
Scale
Coordinate System
Facets

(Visit https://vita.had.co.nz/papers/layered-grammar.pdf for more extensive information.)

Basically, ggplot2 builds our graphs one layer at a time. In each layer, we have the option to specify which data are being plotted, which variables from that data are associated with which axis in our coordinate plane, and what shape that coordinate plane takes (i.e., rectangular, like a Cartesian plane, or rounded, using polar coordinates).

The basic idea is that by inputting data, assigning, projecting, or “mapping” the variables from that data into a coordinate system, and optionally stretching or shrinking the scale of that system (as well as optionally changing the shape and color of our datapoints), we have all the basic components to turn any variables into a wide variety of plot types: a ‘line graph’, ‘bar graph’, ‘scatterplot’, ‘radar plot’, ‘parallel coordinate plot’, you name it.
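
To make the breakdown concrete, here is a sketch that spells out each component using ggplot2's lower-level layer function. In everyday use the geom_ functions fill in most of these pieces for us; the position argument is an extra component of a layer that the list above does not mention, and the name p is our own.

```r
library(ggplot2)

p <- ggplot() +
  layer(
    data = diamonds,                      # Data
    mapping = aes(x = carat, y = price),  # Mapping
    stat = 'identity',                    # Stat: use the values as they are
    geom = 'point',                       # Geom: draw each row as a point
    position = 'identity'                 # no positional adjustment
  ) +
  scale_x_continuous() +                  # Scale for the x axis
  scale_y_continuous() +                  # Scale for the y axis
  coord_cartesian()                       # Coordinate system
p
```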

1. Create the first layer.

To create a plot in ggplot2, we establish the first layer using ggplot(). We can even see what happens when we plot only this layer:

ggplot()

We created a layer for a graph with no data. Without data, a graph still exists… In this case (by default), we generated a rectangular or ‘Cartesian’ plane. This means that the x-axis is perpendicular to the y-axis. We can’t really tell this from the above because there is no data. So let’s add some.

2. Add data using the `data =` and `mapping =` arguments.

To mapping, we are required to supply the variables that we wish to map using the aes function. The idea behind aes is that we may want to map many aspects of our data to different aesthetic elements of the layer we’re plotting (demonstrated below). Nevertheless, they are all still “aesthetic mappings”, and so we set mapping = aes(). Additionally, if we are to supply variables, we must have some data (unless we’re going to use those “x” and “y” variables that we created earlier in this tutorial). But typically, we wish to instruct ggplot to graph variables from within a dataframe that we are supplying as a whole to ggplot. So below we set the argument data = diamonds, and can then set mapping = aes(x = carat, y = price).

ggplot(data = diamonds, mapping = aes(x = carat, y = price))

Interesting. Now this layer has been populated with some information, but what is still missing?

We have mapped our variables to the x and y axes, but we have not assigned them a geometric representation! Even if we put our variables into a coordinate system, we can conceivably generate several different aesthetic representations of them. To visualize the data as a scatterplot, we use geom_point.
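
The missing geometric representation can also be confirmed by inspecting the plot object itself. This is a minimal sketch that assumes the object exposes its mapping and layers fields through $, as ggplot objects have long done; the name p is our own.

```r
library(ggplot2)

p <- ggplot(data = diamonds, mapping = aes(x = carat, y = price))

p$mapping         # the aesthetic mappings stored on the plot
length(p$layers)  # number of layers; no geom has been added yet
```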

3. Add the “geom”

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point()

In the above example, we separated each layer using the +, in accordance with ggplot syntax. What carried over from layer to layer was the aesthetic mapping.

As in our demonstration using base R, we can add another layer to graph the regression line. The regression line is nothing more than a statistical transformation of our data. In fact, the regression equation itself can be solved using regular arithmetic. In the “grammar of graphics”, this is therefore added using another layer, with a function built into ggplot2 called geom_smooth.

Under the hood, geom_smooth is calculating the transformation of the x and y variables to generate the regression, and then visualizing it as a line.
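
The claim that the regression can be solved with regular arithmetic can be sketched directly. For a single predictor, the least-squares slope is the covariance of x and y divided by the variance of x, and the intercept follows from the two means. The variable names below are our own.

```r
library(ggplot2)  # provides the diamonds dataset

x <- diamonds$carat
y <- diamonds$price

# Ordinary least squares for one predictor, written out by hand
slope <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

c(intercept = intercept, slope = slope)
coef(lm(y ~ x))  # should agree with the hand-computed values
```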

We also request certain other aspects of the representation:
- We set method = 'lm' to request a linear (straight-line) regression.
- We state explicitly that we’re modelling y ~ x, as before.

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = 'y ~ x')
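
For comparison with the base R approach, here is a sketch that fits the model ourselves and draws the line with geom_abline, ggplot2's counterpart to abline. Unlike geom_smooth, this draws no confidence band and the line spans the whole panel. The names fit and p_abline are our own.

```r
library(ggplot2)

fit <- lm(price ~ carat, data = diamonds)

p_abline <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point() +
  # draw the fitted line directly from the model coefficients
  geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2], color = 'red')
p_abline
```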

Customizing other elements of `ggplot` graphs

As in base R, we can customize the color of our plot, the points in it, the lines it generates and so on. In ggplot these are all aesthetic representations of the data, so most of this can be done inside the call to aes. An advantage of doing so is that we have more flexibility than we do in base R to change how each individual data point is represented in our graph.

For example, let’s change the color of each point in the above graph, so that its shade varies continuously as the price increases.

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point(aes(color = price)) +
  geom_smooth(method = 'lm', formula = 'y ~ x')
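
The same idea works for categorical variables. In the sketch below, color is mapped to cut inside aes, so each level gets its own color, while the regression line is given one fixed color outside aes. The choice of cut and of 'black' is ours, for illustration.

```r
library(ggplot2)

p_cut <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  # mapped aesthetic: color varies with the cut of each diamond
  geom_point(aes(color = cut)) +
  # set aesthetic: a single fixed color, written outside aes()
  geom_smooth(method = 'lm', formula = y ~ x, color = 'black')
p_cut
```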

We can also change the palette, so that the maximum price is mapped to a different hue (and likewise for the minimum). This is a reiteration of the concept that our data are at all times merely being mapped into different scales and aesthetic systems.

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point(aes(color = price)) +
  geom_smooth(method = 'lm', formula = 'y ~ x') +
  scale_color_gradient(low = 'gold', high = 'chocolate')

Further customization.

For each geom_ function, a known set of parameters exists. For example, geom_point requires the x and y aesthetics, because no matter what coordinate system we are projecting the data into, a point in the plane is defined by two coordinates. Other “geoms”, such as geom_contour, need a \(z\) variable as well.
Below, we’ll take a look at the parameters that can be passed to geom_point.

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point( 
    # again, mapping the price variable to the color of the points:
    aes(color = price), 
    # the next two aesthetics are not 'mapped', because they are not coming from the data:
    shape = 24, # exactly the same as "pch" in base R
    size = 3 # the size of the point
  ) + 
  geom_smooth(
    method = 'lm', formula = 'y ~ x',
    linetype = 2 # works the same as lty in base R
  ) + 
  scale_color_gradient(low = 'gold', high = 'chocolate') +
  coord_flip() # Here, we flip the location of the x and y axes
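
As a brief aside on geoms that need more than x and y, here is a sketch of geom_contour. It uses faithfuld, another dataset that ships with ggplot2, because contours need a z value on a regular grid of x and y values, which the diamonds data does not provide. The name p_contour is our own.

```r
library(ggplot2)

# faithfuld holds a density estimate on a regular grid of waiting and
# eruptions values, so each (x, y) pair has a matching z value
p_contour <- ggplot(data = faithfuld,
                    mapping = aes(x = waiting, y = eruptions, z = density)) +
  geom_contour()
p_contour
```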

Using a facet, we can create a separate graph for each level of a grouping variable:

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point( 
    # again, mapping the price variable to the color of the points:
    aes(color = price), 
    # the next two aesthetics are not 'mapped', because they are not coming from the data:
    shape = 24, # exactly the same as "pch" in base R
    size = 3 # the size of the point
  ) + 
  geom_smooth(
    method = 'lm', formula = 'y ~ x',
    linetype = 2 # works the same as lty in base R
  ) + 
  scale_color_gradient(low = 'gold', high = 'chocolate') +
  coord_flip() + # Here, we flip the location of the x and y axes
  facet_wrap(~cut) # similar to the formula syntax
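
When there are two grouping variables, facet_grid lays the panels out as a matrix instead of wrapping them. The sketch below splits rows by cut and columns by color; the point size is reduced only so that the small panels stay legible, and the name p_grid is our own.

```r
library(ggplot2)

p_grid <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point(size = 0.5) +
  # rows are split by cut, columns by color
  facet_grid(rows = vars(cut), cols = vars(color))
p_grid
```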

We can also change the type of regression line if we think the relationship is non-linear. In addition, the layer of the graph that constitutes the coordinate plane can be modified to change the color, scale, etc. We’ll change the color and the drawing of the grid lines here as well. These changes are mostly made through calls to theme().

library(tidyverse)
ggplot(data = sample_frac(diamonds, size = .4), # just plot 40% of the data to reduce computational cost
       mapping = aes(x = carat, y = price)) +
  geom_point( 
    # again, mapping the price variable to the color of the points:
    aes(color = price), 
    # the next two aesthetics are not 'mapped', because they are not coming from the data:
    shape = 24, # exactly the same as "pch" in base R
    size = 3 # the size of the point
  ) + 
  geom_smooth(
    method = 'loess', # fit a non-parametric smoothing curve
    formula = 'y ~ x',
    se = FALSE, # let's not plot the standard error band to reduce computing cost
    linetype = 2 # works the same as "lty" in base R
  ) + 
  scale_color_gradient(low = 'gold', high = 'chocolate') +
  # Now to change some of the elements of the coordinate plane
  theme(plot.title = element_text(hjust = .5), text = element_text(size = 12)) +
  # linewidth sets line thickness in ggplot2 3.4.0 and later (older versions use size)
  theme(panel.border = element_rect(color = 'black', fill = NA, linewidth = .9)) +
  theme(panel.background = element_rect(color = 'black', fill = 'white')) +
  theme(panel.grid = element_line(color = 'lightgrey', linewidth = .2)) +
  theme(plot.background = element_rect(fill = 'white')) +
  theme(axis.text = element_text(size = 10, color = 'black')) +
  # We can add a title too
  ggtitle(label = 'Here is the Plot Title!')
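
If the same look is wanted across several plots, the theme settings can be stored in an object and added to each plot. The sketch below is one possible shortcut, not the approach used above: it starts from the built-in theme_bw(), which already gives a white panel with a border and light grid lines, and only centers the title. The names my_theme and p_themed are hypothetical.

```r
library(ggplot2)

# A reusable theme: start from theme_bw() and center the plot title
my_theme <- theme_bw() +
  theme(plot.title = element_text(hjust = .5))

p_themed <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point() +
  ggtitle(label = 'Here is the Plot Title!') +
  my_theme
p_themed
```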

Graphing in R: 101
