Tutorial 5: Data Visualization in R

R Tutorials
Exploratory Data Analysis (EDA)
Author

Haley Grant

Introduction

In this tutorial, we will cover data visualization in R using the ggplot2 package (one of the packages included in the tidyverse). Specifically, we will learn about

  • General ggplot syntax
  • Types of plots
    • histograms
    • box plots
    • bar plots
    • scatter plots
  • More ggplot aesthetic options

ggplot syntax

The ggplot2 package, part of the tidyverse suite of packages, uses the same basic syntax to create many different types of graphs.

All ggplot2 plots begin with a call to the ggplot() function, to which you supply the data you will be using to make the plot and specify the variables (columns) you would like to use in the plot. We can then add extra features on top of this base plot using other ggplot2 functions, strung together with the + symbol.

I tend to use the following syntax:

{data_name} %>%
  ggplot(aes(x = {x_axis_variable_name}, y = {y_axis_variable_name})) +
  geom_{PLOT_TYPE}()

The parts in curly braces {} are placeholders: they are the things you should change based on your data and the type of plot you want to make.

The important thing to note about this code is that any variable you want to use from the data frame that you’ve supplied (in this example, data_name) must be wrapped within the aes() function. This is how R understands that you are trying to pull a column from the data frame and use it in the plot.
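To make the template concrete, here is a minimal sketch with the placeholders filled in. The data frame toy and its columns height_cm and weight_kg are made up purely for illustration; they are not part of the tutorial's data.

```r
library(tidyverse)

# Hypothetical toy data frame, used only to illustrate the template
toy <- tibble(
  height_cm = c(150, 160, 170, 180, 190),
  weight_kg = c(50, 58, 69, 80, 92)
)

# {data_name} -> toy, {x_axis_variable_name} -> height_cm,
# {y_axis_variable_name} -> weight_kg, {PLOT_TYPE} -> point
p <- toy %>%
  ggplot(aes(x = height_cm, y = weight_kg)) +
  geom_point()

p
```

Because height_cm and weight_kg are columns of toy, they appear inside aes().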

There are lots of different plots that ggplot2 can make, each with a different function. Some common ones that we will use are:

  • geom_histogram(): makes a histogram
  • geom_boxplot(): makes a boxplot
  • geom_bar(): makes a bar plot
  • geom_point(): makes a scatterplot
  • geom_qq(): makes a quantile-quantile (QQ) plot

Let’s see some examples! We will use the census.rda data from Tutorial 3. I have already loaded the data for you. Reminder: the data frame is saved under the name census. Here’s a reminder of what the data look like:

Since we are going to be using the ggplot2 functions, we need to load the tidyverse! I’m going to have this code auto-run but make sure when you’re working in RStudio that you type and run this line at the beginning of any R session/document that you want to use tidyverse functions for!
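The line in question is most likely the standard library call below (assuming the tidyverse package is already installed):

```r
# Load the tidyverse, which attaches ggplot2, dplyr, and the %>% pipe
library(tidyverse)
```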

Histograms

A histogram is a visual representation of quantitative data in which the range of values is split into adjacent, non-overlapping intervals (or “bins”) and the number of observations that fall into each interval is counted and depicted on the plot.

Let’s make a histogram of the variable total_personal_income.
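Here is a sketch of what that code could look like. To keep the example runnable on its own, it uses a tiny made-up data frame called census_demo (a hypothetical stand-in); with the real data you would start the pipeline with census instead.

```r
library(tidyverse)

# Hypothetical stand-in for census; the real data come from census.rda
census_demo <- tibble(
  total_personal_income = c(8000, 12000, 27000, 35000, 58000, 410000, NA)
)

# Histogram of a single quantitative variable: only x is needed in aes()
p <- census_demo %>%
  ggplot(aes(x = total_personal_income)) +
  geom_histogram()

p
```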

You should see a few messages shown in the output. These messages aren’t errors (they aren’t a problem), but it’s good to know what they mean.

The first message:

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

is telling us that the geom_histogram() function defaulted to splitting the range into 30 intervals (bins). We can change this in geom_histogram() using either the bins argument, to specify the number of bins desired, or the binwidth argument, to specify the size of the bins/intervals.

For example, if we want more, narrower bins, we could specify bins = 50 instead of the default of 30 using the following code:
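A sketch of the bins version, again on a made-up stand-in data frame (census_demo is hypothetical; swap in census for the real data):

```r
library(tidyverse)

# Hypothetical stand-in for census
census_demo <- tibble(
  total_personal_income = c(8000, 12000, 27000, 35000, 58000, 410000)
)

# bins sets the number of intervals the range is split into
p <- census_demo %>%
  ggplot(aes(x = total_personal_income)) +
  geom_histogram(bins = 50)

p
```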

We could also specify the size of the bins, rather than the number of bins. For example, since this variable refers to annual income, maybe we would want to show bins of size $10,000. Specifying the binwidth argument is highly data-dependent (we wouldn’t want to use a bin size of 10,000 if the variable was height measured in inches, for example 🙂 ). Here’s an example of specifying binwidth:
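A sketch of the binwidth version, using the same kind of made-up stand-in data (census_demo is hypothetical; swap in census for the real data):

```r
library(tidyverse)

# Hypothetical stand-in for census
census_demo <- tibble(
  total_personal_income = c(8000, 12000, 27000, 35000, 58000, 410000)
)

# binwidth sets the size of each interval, in the units of the variable
# (here dollars, so each bin covers a $10,000 range)
p <- census_demo %>%
  ggplot(aes(x = total_personal_income)) +
  geom_histogram(binwidth = 10000)

p
```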

The other message is a warning message:

Warning: Removed 108 rows containing non-finite outside the scale range (`stat_bin()`)

This is R’s way of telling us that some of the rows were removed and were not included in the plot. We can see that 108 observations (rows) were removed for plotting. This is generally because there are missing values. The ggplot2 functions don’t have any way of plotting missing values, so they get skipped in these plotting functions.

To see that these observations came from missing values, we can look a bit more into the data. For example, we can count the number of rows for which the column total_personal_income is an NA value using the is.na() function:
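A minimal sketch of that count, run on a made-up stand-in data frame (census_demo is hypothetical). On the real data you would call sum(is.na(census$total_personal_income)).

```r
# Hypothetical stand-in for census, with two missing incomes
census_demo <- data.frame(
  total_personal_income = c(8000, 12000, NA, 35000, NA)
)

# is.na() flags each missing value; sum() counts the flags
n_missing <- sum(is.na(census_demo$total_personal_income))
n_missing  # 2 for this toy data
```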

Note here that is.na(census$total_personal_income) returns a vector of TRUE/FALSE (logical) values. When you take the sum of TRUE/FALSE values in R, each TRUE counts as 1 and each FALSE counts as 0, so the sum is the number of values that are TRUE.

For example:
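A small illustration with a made-up logical vector:

```r
# Three of the four values are TRUE, so the sum is 3
flags <- c(TRUE, FALSE, TRUE, TRUE)
total_true <- sum(flags)
total_true  # 3
```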

We can see from these histograms that we have a highly right-skewed variable. There are a lot of observations under $100,000 (1e+05) and then a handful of observations all the way to over $400,000.

Box plots

Another way we can visualize quantitative data is using box plots. Box plots can be particularly helpful if you want to compare the distribution of a particular quantitative variable across different groups. For example, maybe we want to compare the distribution of total_personal_income across different levels of marital_status. We can do this with box plots.
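One way this could be written is sketched below, using a small made-up stand-in data frame (census_demo is hypothetical; swap in census for the real data). The grouping variable goes on the x-axis and the quantitative variable on the y-axis.

```r
library(tidyverse)

# Hypothetical stand-in for census
census_demo <- tibble(
  marital_status = rep(c("Divorced", "Separated", "Widowed"), each = 3),
  total_personal_income = c(12000, 35000, 58000, 9000, 27000,
                            41000, 15000, 22000, 64000)
)

# One box per level of marital_status
p <- census_demo %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot()

p
```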

Note

Here we can see that the labels on the x-axis are overlapping, making it hard to read the plot. We’ll get more into this later when we go over other aesthetic changes we can make to plots made with ggplot.

Bar plots

A bar plot is a tool to visualize a categorical variable. It is similar to a histogram in that it shows the number of times each level of a variable is observed in a data set. For example, in the plot above we showed the distribution of personal income across marital status, but maybe we want to know how many people fell into each of these marital status categories. We can use a bar plot for this.
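A sketch of such a bar plot on made-up stand-in data (census_demo is hypothetical; swap in census for the real data). Only x is supplied because geom_bar() counts the rows in each category for us.

```r
library(tidyverse)

# Hypothetical stand-in for census
census_demo <- tibble(
  marital_status = c("Never married/single", "Never married/single",
                     "Married/spouse present", "Divorced",
                     "Never married/single", "Widowed")
)

# geom_bar() counts how many rows fall into each level of marital_status
p <- census_demo %>%
  ggplot(aes(x = marital_status)) +
  geom_bar()

p
```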

From this plot we can see that “Never married/single” is the most commonly observed marital status in our data, followed closely by “Married/spouse present”. The status “Separated” was the least common, followed by “Married/spouse absent”. The categories “Divorced” and “Widowed” were somewhere in between.

Scatter plots

A scatter plot is a way of visualizing the relationship between two numeric variables. For example, if we want to know how total_personal_income relates to age in this data set, we can use a scatter plot. Here is some example code to make a scatter plot:
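The sketch below runs on a small made-up stand-in data frame (census_demo is hypothetical; swap in census for the real data).

```r
library(tidyverse)

# Hypothetical stand-in for census
census_demo <- tibble(
  age = c(19, 25, 33, 41, 52, 62),
  total_personal_income = c(8000, 27000, 35000, 58000, 72000, 41000)
)

# Each row becomes one point: age on the x-axis, income on the y-axis
p <- census_demo %>%
  ggplot(aes(x = age, y = total_personal_income)) +
  geom_point()

p
```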

More ggplot aesthetic options

So far we have only made basic plots using ggplot2. However, this package makes it really easy to make much nicer looking plots without too much extra coding. We can do things like change the background color, add titles and subtitles, change the axis labels, and much more. I’ll go over a few of these options, but much more can be found online at resources like these: ggplot tutorial, ggplot reference page

We will return to our side-by-side box plot example showing the total personal income by marital status to demonstrate how we can make the plot look a little nicer.

Facets

One really nice feature of ggplot2 is the ability to make multiple plots of the same variable(s) across different subgroups using something called “facets”. For example, let’s say we want to make the same box plots we did before showing the distribution of personal income by marital status, but we want to split by sex to see if there are any different trends between male and female participants in our data set. We can do this using the function facet_wrap():
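A sketch of the faceted box plots, using a made-up stand-in data frame (census_demo is hypothetical; swap in census for the real data):

```r
library(tidyverse)

# Hypothetical stand-in for census
census_demo <- tibble(
  marital_status = rep(c("Divorced", "Widowed"), each = 6),
  sex = rep(c("Female", "Male"), times = 6),
  total_personal_income = c(12000, 35000, 58000, 9000, 27000, 41000,
                            15000, 22000, 64000, 30000, 18000, 47000)
)

# facet_wrap(~sex) draws one panel per level of sex
p <- census_demo %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot() +
  facet_wrap(~sex)

p
```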

Faceting lets us compare trends across groups. We can facet across more than just one variable, too. For example, we could facet across both sex and race (column name race_general). In this case, I like to use a slightly different function called facet_grid() because it makes the labeling a little easier to follow.
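A sketch with facet_grid(), again on made-up stand-in data (census_demo is hypothetical, and the race_general values "Group A" and "Group B" are placeholders rather than the real category labels):

```r
library(tidyverse)

# Hypothetical stand-in for census
census_demo <- tibble(
  marital_status = rep(c("Divorced", "Widowed"), each = 6),
  sex = rep(c("Female", "Male"), times = 6),
  race_general = rep(c("Group A", "Group B"), each = 3, times = 2),
  total_personal_income = c(12000, 35000, 58000, 9000, 27000, 41000,
                            15000, 22000, 64000, 30000, 18000, 47000)
)

# Rows of panels are defined by sex, columns by race_general
p <- census_demo %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot() +
  facet_grid(sex ~ race_general)

p
```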

Caution: Busy plots

Notice that this busied up the plot a lot. It’s generally only a good idea to use multiple faceting variables if each only has a few (2-3) possible levels to make sure our figure is still interpretable.

Sometimes you’ll notice that when you add facets to a plot, not every facet takes up the full range of values on the x-axis or y-axis, so the plots don’t use space very effectively. If you want to change this, you can add one of the following arguments to the facet_wrap() function:

  • scales = "free_x": to allow the x-axis to have different scales (ranges/degree of zoom) across the plots
  • scales = "free_y": to allow the y-axis to have different scales (ranges/degree of zoom) across the plots
  • scales = "free": to allow both the x- and y-axes to have different scales (ranges/degree of zoom) across the plots

Example:
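The sketch below frees the y-axis only, on made-up stand-in data (census_demo is hypothetical; swap in census for the real data). The incomes in the "Male" panel are deliberately much larger so the effect is visible.

```r
library(tidyverse)

# Hypothetical stand-in for census
census_demo <- tibble(
  marital_status = rep(c("Divorced", "Widowed"), times = 4),
  sex = rep(c("Female", "Male"), each = 4),
  total_personal_income = c(12000, 15000, 18000, 21000,
                            120000, 150000, 180000, 210000)
)

# scales = "free_y" lets each panel choose its own y-axis range
p <- census_demo %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot() +
  facet_wrap(~sex, scales = "free_y")

p
```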

Labels/titles

One issue with our plot right now is that it doesn’t have a title and the axis labels are column names with underscores. If we wanted to include this figure in a paper, we might want to change these. We can do this by adding the labs() function to the plot to update the plot labels.
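A sketch of labs() in use, on made-up stand-in data (census_demo is hypothetical, and the label text is just one possible wording):

```r
library(tidyverse)

# Hypothetical stand-in for census
census_demo <- tibble(
  marital_status = rep(c("Divorced", "Separated", "Widowed"), each = 3),
  total_personal_income = c(12000, 35000, 58000, 9000, 27000,
                            41000, 15000, 22000, 64000)
)

# labs() sets the title and replaces the default axis labels
p <- census_demo %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot() +
  labs(
    title = "Total personal income by marital status",
    x = "Marital status",
    y = "Total personal income ($)"
  )

p
```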

These labels are a lot nicer than the default labels!

Background color

Don’t like the grey grid background? You can change it, too! I like to use a black and white grid background or sometimes just a plain white background. You can achieve this by adding theme_bw() or theme_classic() to the plot. There are many other options for plot themes that you can find here: _____

Black and white grid with theme_bw()
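A sketch on made-up stand-in data (census_demo is hypothetical; swap in census for the real data):

```r
library(tidyverse)

# Hypothetical stand-in for census
census_demo <- tibble(
  marital_status = rep(c("Divorced", "Separated", "Widowed"), each = 3),
  total_personal_income = c(12000, 35000, 58000, 9000, 27000,
                            41000, 15000, 22000, 64000)
)

# theme_bw() swaps the grey background for a white one with grid lines
p <- census_demo %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot() +
  theme_bw()

p
```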

White background with theme_classic()
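A sketch on made-up stand-in data (census_demo is hypothetical; swap in census for the real data):

```r
library(tidyverse)

# Hypothetical stand-in for census
census_demo <- tibble(
  marital_status = rep(c("Divorced", "Separated", "Widowed"), each = 3),
  total_personal_income = c(12000, 35000, 58000, 9000, 27000,
                            41000, 15000, 22000, 64000)
)

# theme_classic() gives a plain white background with axis lines and no grid
p <- census_demo %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot() +
  theme_classic()

p
```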

Plot colors

You can also add color to plots to help display information. For example, maybe we want to color the box plots by marital status or by sex to further emphasize a comparison that we’re trying to make. If you want to pick your own colors, see this reference for more information about plot colors in ggplot2.

Color by marital status
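A sketch on made-up stand-in data (census_demo is hypothetical; swap in census for the real data). Because the color depends on a column, it goes inside aes().

```r
library(tidyverse)

# Hypothetical stand-in for census
census_demo <- tibble(
  marital_status = rep(c("Divorced", "Separated", "Widowed"), each = 3),
  total_personal_income = c(12000, 35000, 58000, 9000, 27000,
                            41000, 15000, 22000, 64000)
)

# color = marital_status gives each box its own outline color and a legend
p <- census_demo %>%
  ggplot(aes(x = marital_status, y = total_personal_income,
             color = marital_status)) +
  geom_boxplot()

p
```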

Color by sex
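A sketch on made-up stand-in data (census_demo is hypothetical; swap in census for the real data). Here each marital status gets one box per level of sex, drawn side by side.

```r
library(tidyverse)

# Hypothetical stand-in for census
census_demo <- tibble(
  marital_status = rep(c("Divorced", "Widowed"), each = 6),
  sex = rep(c("Female", "Male"), times = 6),
  total_personal_income = c(12000, 35000, 58000, 9000, 27000, 41000,
                            15000, 22000, 64000, 30000, 18000, 47000)
)

# color = sex splits each marital status into one colored box per sex
p <- census_demo %>%
  ggplot(aes(x = marital_status, y = total_personal_income, color = sex)) +
  geom_boxplot()

p
```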

Using color to display information

Given that some people have trouble distinguishing color, it is best practice not to have color be the only way in which information is portrayed.

Static colors

You can also change colors without using variables from the data. For example, you could make all of the box plots blue by adding color = "blue" to the geom_boxplot() function. If you are just trying to change the color, and not using color to indicate the value of a variable, this argument doesn’t need to go inside an aes() function. There are lots of colors to choose from in ggplot (link).

For some types of plots there is an additional argument fill that you can use to color parts of the plot. For example, in bar plots and histograms, changing the color argument will change the outline color of the bars while changing the fill argument will change the color inside the bars.

Example:
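A sketch of static color and fill on a histogram, using made-up stand-in data (census_demo is hypothetical, and the particular colors are just an illustration):

```r
library(tidyverse)

# Hypothetical stand-in for census
census_demo <- tibble(
  total_personal_income = c(8000, 12000, 27000, 35000, 58000, 91000)
)

# color sets the bar outlines and fill sets the bar interiors;
# both are fixed values, so they sit outside aes()
p <- census_demo %>%
  ggplot(aes(x = total_personal_income)) +
  geom_histogram(binwidth = 10000, color = "black", fill = "lightblue")

p
```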

Formatting axis text

As we noted before, the x-axis labels of our box plots, which show the different values of marital_status, overlap, making the plot hard to read. One way we can fix this is using the following code to rotate the text:
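One common way to do this is sketched below on made-up stand-in data (census_demo is hypothetical). The 45-degree angle and hjust = 1 are choices for illustration, not the only option.

```r
library(tidyverse)

# Hypothetical stand-in for census
census_demo <- tibble(
  marital_status = rep(c("Never married/single", "Married/spouse present",
                         "Married/spouse absent"), each = 3),
  total_personal_income = c(12000, 35000, 58000, 9000, 27000,
                            41000, 15000, 22000, 64000)
)

# Rotate the x-axis labels 45 degrees; hjust = 1 right-aligns each label
# so that it ends under its tick mark
p <- census_demo %>%
  ggplot(aes(x = marital_status, y = total_personal_income)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p
```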

Looking up technical code

This code is starting to get pretty technical. You probably won’t memorize this (I haven’t! I look up how to do it every time I need to rotate axis labels, including just now while I was making this tutorial 🙂 ).

You aren’t expected to know how to do this from memory; you can use old example code or search online references/forums for help with this kind of thing.