Introduction
In this tutorial, we will cover data visualization in R using the ggplot2 package (one of the packages included in the tidyverse). Specifically, we will learn about:
- General ggplot syntax
- Types of plots
  - histograms
  - box plots
  - bar plots
  - scatter plots
- More ggplot aesthetic options
ggplot syntax
The ggplot2 package within the tidyverse suite of packages uses a consistent syntax to create many types of graphs.
All ggplot2 plots begin with a call to the ggplot() function, to which you supply the data you will be using to make the plot and specify the variables (columns) you would like to use in the plot. We can then add extra features on top of this base plot using other ggplot2 functions, strung together with the + symbol.
I tend to use the following syntax:
{data_name} %>%
  ggplot(aes(x = {x_axis_variable_name}, y = {y_axis_variable_name})) +
  geom_{PLOT_TYPE}()
The parts in curly braces {} are placeholders that you should change based on your data and the type of plot you want to make.
The important thing to note about this code is that any variable you want to use from the data frame that you’ve supplied (in this example, data_name) must be wrapped within the aes() function. This is how R understands that you are trying to pull a column from the data frame and use it in the plot.
There are lots of different plots that ggplot2 can make, each with a different function. Some common ones that we will use are:
- geom_histogram(): makes a histogram
- geom_boxplot(): makes a box plot
- geom_bar(): makes a bar plot
- geom_point(): makes a scatter plot
- geom_qq(): makes a quantile-quantile (QQ) plot
Let’s see some examples! We will use the census.rda data from Tutorial 3. I have already loaded the data for you. Reminder: the data frame is saved under the name census. Here’s a reminder of what the data look like:
Since we are going to be using the ggplot2 functions, we need to load the tidyverse! I’m going to have this code auto-run, but when you’re working in RStudio, make sure that you type and run this line at the beginning of any R session/document in which you want to use tidyverse functions!
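For reference, the line in question is presumably the standard call that attaches the tidyverse packages. This sketch assumes the tidyverse is already installed on your machine.

```r
# Attach the tidyverse, which includes ggplot2 and the %>% pipe
library(tidyverse)
```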
Histograms
A histogram is a visual representation of quantitative data in which the range of values is split into adjacent, non-overlapping intervals (or “bins”) and the number of observations that fall into each interval is counted and depicted on the plot.
Let’s make a histogram of the variable total_personal_income.
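Here is a minimal sketch of what that code could look like. Because the real census data are not reproduced here, it uses a small made-up data frame called census_toy (a hypothetical stand-in); with the tutorial data you would start the pipe from census instead.

```r
library(tidyverse)

# Hypothetical stand-in for the census data; the values are made up
census_toy <- tibble(
  total_personal_income = c(0, 8000, 15000, 22000, 30000, 41000,
                            56000, 90000, 250000, NA)
)

# Histogram of income using the default settings
p <- census_toy %>%
  ggplot(aes(x = total_personal_income)) +
  geom_histogram()
p
```

The toy data include one missing value on purpose, so printing the plot produces both a message about the number of bins and a warning about a removed row.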
You should see a few messages shown in the output. These messages aren’t errors (they aren’t a problem), but it’s good to know what they mean.
The first message:
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
is telling us that the geom_histogram() function defaulted to splitting the range into 30 intervals (bins). We can change this in geom_histogram() using either the bins argument, to specify the number of bins desired, or the binwidth argument, to specify the size of the bins/intervals.
For example, if we want more, narrower bins, we could specify bins = 50 instead of the default of 30 using the following code:
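A sketch of the bins argument in use, again with a made-up stand-in data frame (census_toy is hypothetical; swap in census for the tutorial data):

```r
library(tidyverse)

# Hypothetical stand-in for the census data; the values are made up
census_toy <- tibble(
  total_personal_income = c(0, 8000, 15000, 22000, 30000, 41000,
                            56000, 90000, 250000)
)

# Ask for 50 bins instead of the default 30
p <- census_toy %>%
  ggplot(aes(x = total_personal_income)) +
  geom_histogram(bins = 50)
p
```

Specifying bins also makes the message about the default number of bins go away.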
We could also specify the size of the bins, rather than the number of bins. For example, since this variable refers to annual income, maybe we would want to show bins of size $10,000. Specifying the binwidth argument is highly data-dependent (we wouldn’t want to use a bin size of 10,000 if the variable was height measured in inches, for example 🙂 ). Here’s an example of specifying binwidth:
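One way this might look, using a made-up stand-in data frame (census_toy is hypothetical; use census with the tutorial data):

```r
library(tidyverse)

# Hypothetical stand-in for the census data; the values are made up
census_toy <- tibble(
  total_personal_income = c(0, 8000, 15000, 22000, 30000, 41000,
                            56000, 90000, 250000)
)

# Each bin covers a $10,000 range of income
p <- census_toy %>%
  ggplot(aes(x = total_personal_income)) +
  geom_histogram(binwidth = 10000)
p
```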
The other message is a warning message:
Warning: Removed 108 rows containing non-finite outside the scale range (`stat_bin()`)
This is R’s way of telling us that some of the rows were removed and were not included in the plot. We can see that 108 observations (rows) were removed for plotting. This is generally because there are missing values. The ggplot2 functions don’t have any way of plotting missing values, so they get skipped in these plotting functions.
To see that these observations came from missing values, we can look a bit more into the data. For example, we can count the number of rows for which the column total_personal_income is an NA value using the is.na() function:
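A small sketch of that count, using a made-up data frame with two missing incomes (census_toy is a hypothetical stand-in; with the tutorial data you would use census$total_personal_income):

```r
# Hypothetical stand-in for the census data; the values are made up
census_toy <- data.frame(
  total_personal_income = c(12000, NA, 35000, NA, 58000)
)

# Count the rows where income is missing (2 for this toy data)
n_missing <- sum(is.na(census_toy$total_personal_income))
n_missing
```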
Note here that is.na(census$total_personal_income) returns a vector of TRUE/FALSE (logical) values. When you take the sum of TRUE/FALSE values in R, it returns the number that are TRUE.
For example:
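The logical vector below is made up purely to illustrate the counting behavior:

```r
# TRUE counts as 1 and FALSE counts as 0 when summed
flags <- c(TRUE, FALSE, TRUE, TRUE)
n_true <- sum(flags)
n_true  # 3, because three of the four values are TRUE
```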
We can see from these histograms that we have a highly right-skewed variable. There are a lot of observations under $100,000 (1e+05) and then a handful of observations all the way to over $400,000.
Box plots
Another way we can visualize quantitative data is using box plots. Box plots can be particularly helpful if you want to compare the distribution of a particular quantitative variable across different groups. For example, maybe we want to compare the distribution of total_personal_income across different levels of marital_status. We can do this with box plots.
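A minimal sketch of side-by-side box plots, using a made-up stand-in data frame (census_toy and its values are hypothetical; use census with the tutorial data):

```r
library(tidyverse)

# Hypothetical stand-in for the census data; the values are made up
census_toy <- tibble(
  marital_status = rep(c("Divorced", "Separated", "Widowed"), times = 4),
  total_personal_income = c(12000, 18000, 9000, 30000, 26000, 15000,
                            41000, 22000, 19000, 75000, 34000, 28000)
)

# One box per marital status, with income on the y-axis
p <- census_toy %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot()
p
```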
Here we can see that the labels on the x-axis are overlapping, making it hard to read the plot. We’ll get more into this later when we go over other aesthetic changes we can make to plots made with ggplot.
Bar plots
A bar plot is a tool to visualize a categorical variable. It is similar to a histogram in that it shows the number of times each level of a variable is observed in a data set. For example, in the plot above we showed the distribution of personal income across marital status, but maybe we want to know how many people fell into each of these marital status categories. We can use a bar plot for this.
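A sketch of such a bar plot with a made-up stand-in data frame (census_toy is hypothetical; use census with the tutorial data). Note that geom_bar() only needs an x variable, because it counts the rows in each category for us.

```r
library(tidyverse)

# Hypothetical stand-in for the census data; the values are made up
census_toy <- tibble(
  marital_status = c("Divorced", "Divorced", "Divorced",
                     "Separated", "Widowed", "Widowed")
)

# Bar height is the number of rows in each marital status category
p <- census_toy %>%
  ggplot(aes(x = marital_status)) +
  geom_bar()
p
```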
From this plot we can see that “Never married/single” is the most commonly observed marital status in our data, followed closely by “Married/spouse present”. The status “Separated” was the least common, followed by “Married/spouse absent”. The categories “Divorced” and “Widowed” were somewhere in between.
Scatter plots
A scatter plot is a way of visualizing the relationship between two numeric variables. For example, if we want to know how total_personal_income relates to age in this data set, we can use a scatter plot. Here is some example code to make a scatter plot:
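The sketch below uses a made-up stand-in data frame (census_toy is hypothetical; use census with the tutorial data). Putting age on the x-axis is an assumption on my part; you could swap the two variables.

```r
library(tidyverse)

# Hypothetical stand-in for the census data; the values are made up
census_toy <- tibble(
  age = c(19, 24, 31, 38, 45, 52, 60, 67),
  total_personal_income = c(9000, 21000, 38000, 52000,
                            61000, 70000, 48000, 30000)
)

# One point per person: age on the x-axis, income on the y-axis
p <- census_toy %>%
  ggplot(aes(x = age, y = total_personal_income)) +
  geom_point()
p
```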
More ggplot aesthetic options
So far we have only made basic plots using ggplot2; however, this package makes it really easy to make much nicer looking plots without too much extra coding. We can do things like change the background color, add titles and subtitles, change the axis labels, and much more. I’ll go over a few of these options but much more can be found online at resources like these: ggplot tutorial, ggplot reference page
We will return to our side-by-side box plot example showing the total personal income by marital status to demonstrate how we can make the plot look a little nicer.
Facets
One really nice feature of ggplot2 is the ability to make multiple plots of the same variable(s) across different subgroups using something called “facets”. For example, let’s say we want to make the same box plots we did before showing the distribution of personal income by marital status, but we want to split by sex to see if there are any different trends between male and female participants in our data set. We can do this using the function facet_wrap():
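A sketch of faceting by sex, using a made-up stand-in data frame (census_toy and the "Female"/"Male" labels are assumptions; use census with the tutorial data):

```r
library(tidyverse)

# Hypothetical stand-in for the census data; the values are made up
census_toy <- tibble(
  marital_status = rep(c("Divorced", "Separated", "Widowed"), times = 4),
  sex = rep(c("Female", "Male"), each = 6),
  total_personal_income = c(12000, 18000, 9000, 30000, 26000, 15000,
                            41000, 22000, 19000, 75000, 34000, 28000)
)

# Same box plots as before, drawn in a separate panel for each sex
p <- census_toy %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot() +
  facet_wrap(~ sex)
p
```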
Faceting lets us compare trends across groups. We can facet across more than just one variable, too. For example, we could facet across both sex and race (column name race_general). In this case, I like to use a slightly different function called facet_grid() because it makes the labeling a little easier to follow.
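A sketch of a two-variable facet grid with a made-up stand-in data frame. Here census_toy, the "Group A"/"Group B" placeholder labels for race_general, and the choice of sex for rows and race_general for columns are all assumptions for illustration.

```r
library(tidyverse)

# Hypothetical stand-in for the census data; the values are made up
census_toy <- tibble(
  marital_status = rep(c("Divorced", "Separated", "Widowed"), times = 4),
  sex = rep(c("Female", "Male"), each = 6),
  race_general = rep(c("Group A", "Group B"), times = 6),
  total_personal_income = c(12000, 18000, 9000, 30000, 26000, 15000,
                            41000, 22000, 19000, 75000, 34000, 28000)
)

# Rows of panels are defined by sex, columns by race_general
p <- census_toy %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot() +
  facet_grid(sex ~ race_general)
p
```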
Notice that this busied up the plot a lot. It’s generally only a good idea to use multiple faceting variables if each only has a few (2-3) possible levels to make sure our figure is still interpretable.
Sometimes you’ll notice that when you add facets to a plot, not every facet takes up the full range of values on the x-axis or y-axis, so the plots don’t use space very effectively. If you want to change this, you can add one of the following arguments to the facet_wrap() function:
- scales = "free_x": to allow the x-axis to have different scales (ranges/degree of zoom) across the plots
- scales = "free_y": to allow the y-axis to have different scales (ranges/degree of zoom) across the plots
- scales = "free": to allow both the x- and y-axes to have different scales (ranges/degree of zoom) across the plots
Example:
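The sketch below frees the y-axis only, using a made-up stand-in data frame (census_toy is hypothetical; use census with the tutorial data). The incomes in the second panel are deliberately much larger so the effect is visible.

```r
library(tidyverse)

# Hypothetical stand-in for the census data; the values are made up
census_toy <- tibble(
  marital_status = rep(c("Divorced", "Separated", "Widowed"), times = 4),
  sex = rep(c("Female", "Male"), each = 6),
  total_personal_income = c(12000, 18000, 9000, 30000, 26000, 15000,
                            141000, 122000, 119000, 175000, 134000, 128000)
)

# Each panel gets its own y-axis range; the x-axis stays shared
p <- census_toy %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot() +
  facet_wrap(~ sex, scales = "free_y")
p
```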
Labels/titles
One issue with our plot right now is that it doesn’t have a title and the axis labels are column names with underscores. If we wanted to include this figure in a paper, we might want to change these. We can do this by adding the labs() function to the plot to update the plot labels.
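A sketch of labs() on the box plot, using a made-up stand-in data frame. census_toy and the label wording are my own placeholders; choose whatever text suits your figure.

```r
library(tidyverse)

# Hypothetical stand-in for the census data; the values are made up
census_toy <- tibble(
  marital_status = rep(c("Divorced", "Separated", "Widowed"), times = 4),
  total_personal_income = c(12000, 18000, 9000, 30000, 26000, 15000,
                            41000, 22000, 19000, 75000, 34000, 28000)
)

# Replace the default column-name labels with readable text
p <- census_toy %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot() +
  labs(
    title = "Total personal income by marital status",
    x = "Marital status",
    y = "Total personal income ($)"
  )
p
```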
These labels are a lot nicer than the default labels!
Background color
Don’t like the grey grid background? You can change it, too! I like to use a black and white grid background or sometimes just a plain white background. You can achieve this by adding theme_bw() or theme_classic() to the plot. There are many other options for plot themes that you can find here: _____
Black and white grid with theme_bw()
White background with theme_classic()
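Both themes can be sketched on the same base plot. The data frame census_toy is a made-up stand-in (use census with the tutorial data).

```r
library(tidyverse)

# Hypothetical stand-in for the census data; the values are made up
census_toy <- tibble(
  marital_status = rep(c("Divorced", "Separated", "Widowed"), times = 4),
  total_personal_income = c(12000, 18000, 9000, 30000, 26000, 15000,
                            41000, 22000, 19000, 75000, 34000, 28000)
)

# Base plot that both themes are added to
base_plot <- census_toy %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot()

# Black and white grid background
p_bw <- base_plot + theme_bw()
p_bw

# Plain white background with no grid lines
p_classic <- base_plot + theme_classic()
p_classic
```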
Plot colors
You can also add color to plots to help display information. For example, maybe we want to color the box plots by marital status or by sex to further emphasize a comparison that we’re trying to make. If you want to pick your own colors, see this reference for more information about plot colors in ggplot2.
Color by marital status
Color by sex
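A sketch of both versions, using a made-up stand-in data frame (census_toy and the "Female"/"Male" labels are assumptions). It maps the variable to the color aesthetic inside aes(), which is one reasonable way to do this; the original plots may have been drawn differently.

```r
library(tidyverse)

# Hypothetical stand-in for the census data; the values are made up
census_toy <- tibble(
  marital_status = rep(c("Divorced", "Separated", "Widowed"), times = 4),
  sex = rep(c("Female", "Male"), each = 6),
  total_personal_income = c(12000, 18000, 9000, 30000, 26000, 15000,
                            41000, 22000, 19000, 75000, 34000, 28000)
)

# Color by marital status: each box gets its own outline color
p_marital <- census_toy %>%
  ggplot(aes(x = marital_status, y = total_personal_income,
             color = marital_status)) +
  geom_boxplot()
p_marital

# Color by sex: two boxes per marital status, one for each sex
p_sex <- census_toy %>%
  ggplot(aes(x = marital_status, y = total_personal_income,
             color = sex)) +
  geom_boxplot()
p_sex
```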
Given that some people have trouble distinguishing color, it is best practice not to have color be the only way in which information is portrayed.
Static colors
You can also change colors without using variables from data. For example, you could make all of the box plots blue by adding color = "blue" to the geom_boxplot() function. If you are just trying to change the color and are not using color to indicate the value of a variable, this argument doesn’t need to go inside an aes() function. There are lots of colors to choose from in ggplot (link).
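A sketch of a static color, using a made-up stand-in data frame (census_toy is hypothetical; use census with the tutorial data). Notice that color = "blue" sits directly inside geom_boxplot(), outside of aes().

```r
library(tidyverse)

# Hypothetical stand-in for the census data; the values are made up
census_toy <- tibble(
  marital_status = rep(c("Divorced", "Separated", "Widowed"), times = 4),
  total_personal_income = c(12000, 18000, 9000, 30000, 26000, 15000,
                            41000, 22000, 19000, 75000, 34000, 28000)
)

# Every box is outlined in blue, regardless of the data values
p <- census_toy %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot(color = "blue")
p
```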
color vs fill
For some types of plots there is an additional argument fill that you can use to color parts of the plot. For example, in bar plots and histograms, changing the color argument will change the outline color of the bars while changing the fill argument will change the color inside the bars.
Example:
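The sketch below sets both arguments on a bar plot so the difference is easy to see. The data frame census_toy is a made-up stand-in, and the specific colors are arbitrary choices.

```r
library(tidyverse)

# Hypothetical stand-in for the census data; the values are made up
census_toy <- tibble(
  marital_status = c("Divorced", "Divorced", "Divorced",
                     "Separated", "Widowed", "Widowed")
)

# color sets the bar outline; fill sets the inside of the bar
p <- census_toy %>%
  ggplot(aes(x = marital_status)) +
  geom_bar(color = "black", fill = "steelblue")
p
```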
Formatting axis text
As we noted before, the text on the x-axis of our box plots showing the different values of marital_status is overlapping, making it hard to read the plot. One way we can fix this is using the following code to rotate the text:
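One common way to rotate the labels is with theme() and element_text(). In this sketch, census_toy is a made-up stand-in data frame, and the 45 degree angle and hjust = 1 are my own choices rather than values from the original code.

```r
library(tidyverse)

# Hypothetical stand-in for the census data; the values are made up
census_toy <- tibble(
  marital_status = rep(c("Divorced", "Separated", "Widowed"), times = 4),
  total_personal_income = c(12000, 18000, 9000, 30000, 26000, 15000,
                            41000, 22000, 19000, 75000, 34000, 28000)
)

# Rotate the x-axis text 45 degrees; hjust = 1 right-aligns each label
# so that it ends under its tick mark
p <- census_toy %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p
```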
This code is starting to get pretty technical. You probably won’t memorize this (I haven’t! I look up how to do it every time I need to rotate axis labels, including just now while I was making this tutorial 🙂 ).
You aren’t expected to know how to do this from memory; you can use old example code or search online references/forums for help with this kind of thing.