Introduction

In this tutorial, we will cover data visualization in R using the ggplot2 package (one of the packages included in the tidyverse). Specifically, we will learn about

- General ggplot syntax
- Types of plots
  - histograms
  - box plots
  - bar plots
  - scatter plots
- More ggplot aesthetic options

`ggplot` syntax

The ggplot2 package, part of the tidyverse suite of packages, uses the same basic syntax to create many different types of graphs.

All ggplot2 plots begin with a call to the ggplot() function, to which you supply the data you will use to make the plot and specify the variables (columns) you would like to plot. We can then add extra features on top of this base plot using other ggplot2 functions, strung together with the + symbol.

I tend to use the following syntax:

{data_name} %>%

ggplot(aes(x = {x_axis_variable_name}, y = {y_axis_variable_name})) +

geom_{PLOT_TYPE}()

The parts in curly braces {} are placeholders: change them based on your data and the type of plot you want to make.

The important thing to note about this code is that any variable you want to use from the data frame that you’ve supplied (in this example, data_name) must be wrapped within the aes() function. This is how R understands that you are trying to pull a column from the data frame and use it in the plot.

There are lots of different plots that ggplot2 can make, each with a different function. Some common ones that we will use are:

geom_histogram(): makes a histogram
geom_boxplot(): makes a boxplot
geom_bar(): makes a bar plot
geom_point(): makes a scatterplot
geom_qq(): makes a quantile-quantile (QQ) plot

Let’s see some examples! We will use the census.rda data from Tutorial 3. I have already loaded the data for you. Reminder: the data frame is saved under the name census. Here’s a reminder of what the data look like:

Since we are going to be using the ggplot2 functions, we need to load the tidyverse! I’m going to have this code auto-run but make sure when you’re working in RStudio that you type and run this line at the beginning of any R session/document that you want to use tidyverse functions for!
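For reference, this is the standard call that attaches the tidyverse packages (ggplot2 included):

```r
# Attach the tidyverse, which includes ggplot2, dplyr, and the %>% pipe
library(tidyverse)
```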

Histograms

A histogram is a visual representation of quantitative data in which the range of values is split into adjacent, non-overlapping intervals (or “bins”) and the number of observations that fall into each interval is counted and depicted on the plot.

Let’s make a histogram of the variable total_personal_income.
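Here is a minimal sketch of what this histogram code can look like. It assumes a small made-up data frame, census_toy, standing in for census (the values are invented for illustration); with the real data you would start the pipeline with census instead.

```r
library(tidyverse)

# Hypothetical stand-in for the census data; values are invented
census_toy <- tibble(
  total_personal_income = c(18000, 52000, 61000, NA, 34000, 27000, 15000, 410000)
)

# A histogram only needs an x variable; the counts are computed for us
p <- census_toy %>%
  ggplot(aes(x = total_personal_income)) +
  geom_histogram()

p  # printing the plot object displays it
```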

You should see a few messages shown in the output. These messages aren’t errors (they aren’t a problem), but it’s good to know what they mean.

The first message:

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

is telling us that the geom_histogram() function defaulted to splitting the range into 30 intervals (bins). We can change this if we want using either the bins argument to specify the number of bins desired, or the binwidth argument to specify the size of the bins/intervals, in geom_histogram().

For example, if we want more, narrower bins, we could specify bins = 50 instead of the default of 30 using the following code:
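A sketch of setting bins, again assuming the made-up census_toy data frame in place of census:

```r
library(tidyverse)

# Hypothetical stand-in for the census data; values are invented
census_toy <- tibble(
  total_personal_income = c(18000, 52000, 61000, NA, 34000, 27000, 15000, 410000)
)

# bins sets the number of intervals; it goes inside geom_histogram(), not aes()
p <- census_toy %>%
  ggplot(aes(x = total_personal_income)) +
  geom_histogram(bins = 50)

p
```

Because bins is set explicitly, the message about `bins = 30` should no longer appear.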

We could also specify the size of the bins, rather than the number of bins. For example, since this variable refers to annual income, maybe we would want to show bins of size $10,000. Specifying the binwidth argument is highly data-dependent (we wouldn’t want to use a bin size of 10,000 if the variable was height measured in inches, for example 🙂 ). Here’s an example of specifying binwidth:
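A sketch of setting binwidth to 10,000 (dollars, since that is the unit of the variable), assuming the made-up census_toy data frame in place of census:

```r
library(tidyverse)

# Hypothetical stand-in for the census data; values are invented
census_toy <- tibble(
  total_personal_income = c(18000, 52000, 61000, NA, 34000, 27000, 15000, 410000)
)

# binwidth is in the units of the x variable, so each bar spans $10,000
p <- census_toy %>%
  ggplot(aes(x = total_personal_income)) +
  geom_histogram(binwidth = 10000)

p
```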

The other message is a warning message:

Warning: Removed 108 rows containing non-finite outside the scale range (`stat_bin()`)

This is R’s way of telling us that some of the rows were removed and were not included in the plot. We can see that 108 observations (rows) were removed for plotting. This is generally because there are missing values. The ggplot2 functions don’t have any way of plotting missing values, so they get skipped in these plotting functions.

To see that these observations came from missing values, we can look a bit more into the data. For example, we can count the number of rows for which the column total_personal_income is an NA value using the is.na() function:
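A minimal sketch of this count, assuming a made-up data frame census_toy with a single missing income value (with the real data you would use census, where the count should match the number in the warning):

```r
# Hypothetical stand-in for the census data; values are invented
census_toy <- data.frame(
  total_personal_income = c(18000, 52000, NA, 34000, 27000)
)

# is.na() flags each missing value; sum() counts how many flags are TRUE
n_missing <- sum(is.na(census_toy$total_personal_income))
n_missing  # 1 for this toy data
```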

Sum of logical values

Note here that is.na(census$total_personal_income) returns a vector of TRUE/FALSE (logical) values. When you take the sum of TRUE/FALSE values in R, it returns the number that are TRUE.

For example:
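A small illustration with a hand-written logical vector (the vector itself is just made up for the demonstration):

```r
# TRUE counts as 1 and FALSE counts as 0 when summed
flags <- c(TRUE, FALSE, TRUE, TRUE)
n_true <- sum(flags)
n_true  # 3
```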

We can see from these histograms that we have a highly right-skewed variable. There are a lot of observations under $100,000 (1e+05) and then a handful of observations all the way to over $400,000.

Box plots

Another way we can visualize quantitative data is using box plots. Box plots can be particularly helpful if you want to compare the distribution of a particular quantitative variable across different groups. For example, maybe we want to compare the distribution of total_personal_income across different levels of marital_status. We can do this with box plots.
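A sketch of the side-by-side box plots, assuming a made-up data frame census_toy in place of census (only three marital status levels are included to keep it short):

```r
library(tidyverse)

# Hypothetical stand-in for the census data; values are invented
census_toy <- tibble(
  marital_status = rep(c("Never married/single", "Married/spouse present",
                         "Divorced"), each = 4),
  total_personal_income = c(12000, 18000, 25000, 31000,
                            40000, 52000, 61000, 75000,
                            22000, 34000, 47000, NA)
)

# The grouping variable goes on x and the quantitative variable on y
p <- census_toy %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot()

p
```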

Note

Here we can see that the labels on the x-axis are overlapping, making it hard to read the plot. We’ll get more into this later when we go over other aesthetic changes we can make to plots made with ggplot.

Bar plots

A bar plot is a tool to visualize a categorical variable. It is similar to a histogram in that it shows the number of times each level of a variable is observed in a data set. For example, in the plot above we showed the distribution of personal income across marital status, but maybe we want to know how many people fell into each of these marital status categories. We can use a bar plot for this.
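A sketch of a bar plot of marital status, assuming a made-up data frame census_toy in place of census (so the bar heights here will not match the real data):

```r
library(tidyverse)

# Hypothetical stand-in for the census data; values are invented
census_toy <- tibble(
  marital_status = c("Never married/single", "Never married/single",
                     "Never married/single", "Married/spouse present",
                     "Married/spouse present", "Divorced", "Widowed",
                     "Separated")
)

# geom_bar() counts the rows in each category, so only x is needed
p <- census_toy %>%
  ggplot(aes(x = marital_status)) +
  geom_bar()

p
```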

From this plot we can see that “Never married/single” is the most commonly observed marital status in our data, followed closely by “Married/spouse present”. The status “Separated” was the least common, followed by “Married/spouse absent”. The categories “Divorced” and “Widowed” were somewhere in between.

Scatter plots

A scatter plot is a way of visualizing the relationship between two numeric variables. For example, if we want to know how total_personal_income relates to age in this data set, we can use a scatter plot. Here is some example code to make a scatter plot:
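A sketch of such a scatter plot, assuming a made-up data frame census_toy in place of census. Putting age on the x-axis and income on the y-axis is my assumption here; you could swap them.

```r
library(tidyverse)

# Hypothetical stand-in for the census data; values are invented
census_toy <- tibble(
  age = c(23, 35, 47, 52, 61, 29, 70, 44),
  total_personal_income = c(18000, 52000, 61000, NA, 34000, 27000, 15000, 410000)
)

# Each row becomes one point, placed by its age and income
p <- census_toy %>%
  ggplot(aes(x = age, y = total_personal_income)) +
  geom_point()

p
```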

More `ggplot` aesthetic options

So far we have only made basic plots using ggplot2; however, this package makes it really easy to make much nicer looking plots without too much extra coding. We can do things like change the background color, add titles and subtitles, change the axis labels, and much more. I’ll go over a few of these options but much more can be found online at resources like these: ggplot tutorial, ggplot reference page

We will return to our side-by-side box plot example showing the total personal income by marital status to demonstrate how we can make the plot look a little nicer.

Labels/titles

One issue with our plot right now is that it doesn’t have a title and the axis labels are column names with underscores. If we wanted to include this figure in a paper, we might want to change these. We can do this by adding the labs() function to the plot to update the plot labels.
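A sketch of adding labels with labs(), assuming a made-up data frame census_toy in place of census. The title and axis label text below are just examples I chose; use whatever wording suits your figure.

```r
library(tidyverse)

# Hypothetical stand-in for the census data; values are invented
census_toy <- tibble(
  marital_status = rep(c("Never married/single", "Married/spouse present",
                         "Divorced"), each = 4),
  total_personal_income = c(12000, 18000, 25000, 31000,
                            40000, 52000, 61000, 75000,
                            22000, 34000, 47000, NA)
)

# labs() replaces the default column-name labels with readable text
p <- census_toy %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot() +
  labs(
    title = "Total personal income by marital status",
    x = "Marital status",
    y = "Total personal income ($)"
  )

p
```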

These labels are a lot nicer than the default labels!

Background color

Don’t like the grey grid background? You can change it, too! I like to use a black and white grid background or sometimes just a plain white background. You can achieve this by adding theme_bw() or theme_classic() to the plot. There are many other options for plot themes that you can find here: _____

Black and white grid with `theme_bw()`
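A sketch of theme_bw() added to the box plot, assuming a made-up data frame census_toy in place of census:

```r
library(tidyverse)

# Hypothetical stand-in for the census data; values are invented
census_toy <- tibble(
  marital_status = rep(c("Never married/single", "Married/spouse present",
                         "Divorced"), each = 4),
  total_personal_income = c(12000, 18000, 25000, 31000,
                            40000, 52000, 61000, 75000,
                            22000, 34000, 47000, NA)
)

# theme_bw() swaps the grey panel for a white one with grey grid lines
p <- census_toy %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot() +
  theme_bw()

p
```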

White background with `theme_classic()`
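A sketch of theme_classic() added to the box plot, assuming a made-up data frame census_toy in place of census:

```r
library(tidyverse)

# Hypothetical stand-in for the census data; values are invented
census_toy <- tibble(
  marital_status = rep(c("Never married/single", "Married/spouse present",
                         "Divorced"), each = 4),
  total_personal_income = c(12000, 18000, 25000, 31000,
                            40000, 52000, 61000, 75000,
                            22000, 34000, 47000, NA)
)

# theme_classic() gives a plain white background with axis lines and no grid
p <- census_toy %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot() +
  theme_classic()

p
```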

Plot colors

You can also add color to plots to help display information. For example, maybe we want to color the box plots by marital status or by sex to further emphasize a comparison that we’re trying to make. If you want to pick your own colors, see this reference for more information about plot colors in ggplot2.

Color by marital status
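A sketch of coloring the box plots by marital status, assuming a made-up data frame census_toy in place of census. I map the color aesthetic here, which colors the box outlines; mapping fill instead would color the inside of the boxes.

```r
library(tidyverse)

# Hypothetical stand-in for the census data; values are invented
census_toy <- tibble(
  marital_status = rep(c("Never married/single", "Married/spouse present",
                         "Divorced"), each = 4),
  total_personal_income = c(12000, 18000, 25000, 31000,
                            40000, 52000, 61000, 75000,
                            22000, 34000, 47000, NA)
)

# color refers to a column, so it goes inside aes()
p <- census_toy %>%
  ggplot(aes(x = marital_status, y = total_personal_income,
             color = marital_status)) +
  geom_boxplot()

p
```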

Color by sex
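A sketch of coloring the box plots by sex, assuming a made-up data frame census_toy in place of census. Because sex is a different variable from the one on the x-axis, ggplot2 draws a separate box for each sex within each marital status.

```r
library(tidyverse)

# Hypothetical stand-in for the census data; values are invented
census_toy <- tibble(
  marital_status = rep(c("Never married/single", "Married/spouse present",
                         "Divorced"), each = 4),
  sex = rep(c("Female", "Male"), times = 6),
  total_personal_income = c(12000, 18000, 25000, 31000,
                            40000, 52000, 61000, 75000,
                            22000, 34000, 47000, NA)
)

# Mapping color to sex splits each marital status into one box per sex
p <- census_toy %>%
  ggplot(aes(x = marital_status, y = total_personal_income, color = sex)) +
  geom_boxplot()

p
```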

Using color to display information

Given that some people have trouble distinguishing color, it is best practice not to have color be the only way in which information is portrayed.

Static colors

You can also change colors without using variables from the data. For example, you could make all of the box plots blue by adding color = "blue" to the geom_boxplot() function. If you are just trying to change the color, and not trying to use color to indicate the value of a variable, this argument doesn’t need to go inside the aes() function. There are lots of colors to choose from in ggplot (link).
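A sketch of a static color, assuming a made-up data frame census_toy in place of census:

```r
library(tidyverse)

# Hypothetical stand-in for the census data; values are invented
census_toy <- tibble(
  marital_status = rep(c("Never married/single", "Married/spouse present",
                         "Divorced"), each = 4),
  total_personal_income = c(12000, 18000, 25000, 31000,
                            40000, 52000, 61000, 75000,
                            22000, 34000, 47000, NA)
)

# "blue" is a fixed value, not a column, so it sits outside aes()
p <- census_toy %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot(color = "blue")

p
```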

color vs fill

For some types of plots there is an additional argument fill that you can use to color parts of the plot. For example, in bar plots and histograms, changing the color argument will change the outline color of the bars while changing the fill argument will change the color inside the bars.

Example:
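A sketch using a histogram, assuming a made-up data frame census_toy in place of census. The particular colors ("black" and "lightblue") are just my picks for illustration.

```r
library(tidyverse)

# Hypothetical stand-in for the census data; values are invented
census_toy <- tibble(
  total_personal_income = c(18000, 52000, 61000, NA, 34000, 27000, 15000, 410000)
)

# color sets the bar outlines; fill sets the inside of the bars
p <- census_toy %>%
  ggplot(aes(x = total_personal_income)) +
  geom_histogram(bins = 10, color = "black", fill = "lightblue")

p
```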

Formatting axis text

As we noted before, the text on the x-axis of our box plots, which shows the different values of marital_status, is overlapping, making it hard to read the plot. One way we can fix this is using the following code to rotate the text:
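A sketch of one common way to rotate the x-axis text, assuming a made-up data frame census_toy in place of census. The 45 degree angle and the hjust value are my choices for illustration; other angles work too.

```r
library(tidyverse)

# Hypothetical stand-in for the census data; values are invented
census_toy <- tibble(
  marital_status = rep(c("Never married/single", "Married/spouse present",
                         "Divorced"), each = 4),
  total_personal_income = c(12000, 18000, 25000, 31000,
                            40000, 52000, 61000, 75000,
                            22000, 34000, 47000, NA)
)

# angle rotates the tick labels; hjust = 1 right-aligns them under the ticks
p <- census_toy %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p
```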

Looking up technical code

This code is starting to get pretty technical. You probably won’t memorize this (I haven’t! I look up how to do it every time I need to rotate axis labels, including just now while I was making this tutorial 🙂 ).

You aren’t expected to know how to do this from memory; you can use old example code or search online references/forums for help with this kind of thing.

Tutorial 5: Data Visualization in R
