Intro Biostatistics Resources - Tutorial 4: Data Frames in R

Introduction

By now you should have learned to install R and RStudio, learned a little about types of objects and functions in R, and learned to import data into R. If you’re not sure about any of those, tutorials for each topic can be found here:

Outline

In this tutorial, we’re going to go over manipulating data frames in R, including:

Data frame structure
Making new columns
Updating variables
Filtering data frames

We’re going to use the cdc data from last time in this tutorial as an example. Here’s a reminder of how we read in that dataset (this code will automatically run when the webpage is ready):

Rows and Columns

`nrow()` and `ncol()` functions

The standard setup for a data frame is to put individual observations (so in this case, different people) in the rows, and different attributes (variables) in the columns. In the cdc data frame, each row corresponds to one individual, and each column corresponds to a variable.

From the output above, we can see that the 1st person in the data set has the value "very good" for the variable genhlth, the value 1 for the variable exerany, and so on.

We may want to know how many rows and columns are included in our data. We can find this out using the functions nrow() and ncol().

This tells us that there are 60 rows (60 observations) and 9 columns (9 variables) recorded in our data.
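As a minimal sketch of these two functions, here is a tiny made-up data frame called cdc_demo. It is only a stand-in for cdc: the name and all of its values are invented for illustration, and it keeps just three of the column names mentioned in this tutorial.

```r
# Hypothetical stand-in for the cdc data (values invented for illustration)
cdc_demo <- data.frame(
  genhlth = c("very good", "excellent", "good", "excellent", "fair", "good"),
  exerany = c(1, 1, 0, 1, 0, 1),
  age     = c(45, 28, 63, 31, 52, 24)
)

nrow(cdc_demo)  # number of rows (observations): 6
ncol(cdc_demo)  # number of columns (variables): 3
```

Running the same two calls on cdc itself is what gives the 60 and 9 reported above.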

`colnames()` function

If we want to get a list of all the column names in the dataset, we can use the colnames() function in R, which will just print out the names of all of the columns in the data.

Here we can see the names of the 9 columns included in our data. Remember that R is case-sensitive so this can be a helpful function when you first start working with a data set. For example, if I was trying to refer to the column named "age" but I accidentally spelled it with a capital A, "Age", R wouldn’t understand which variable I was referring to. Checking the column names at the start can help avoid future confusion/frustration with misspellings 🙂
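To see the case-sensitivity point in action, here is a small sketch using a made-up stand-in data frame (cdc_demo and its values are invented for illustration; only the column names come from this tutorial).

```r
# Hypothetical stand-in for the cdc data (values invented for illustration)
cdc_demo <- data.frame(
  genhlth = c("very good", "excellent", "good", "excellent", "fair", "good"),
  exerany = c(1, 1, 0, 1, 0, 1),
  age     = c(45, 28, 63, 31, 52, 24)
)

colnames(cdc_demo)             # "genhlth" "exerany" "age"

# Column names must match exactly, including capitalization
"age" %in% colnames(cdc_demo)  # TRUE
"Age" %in% colnames(cdc_demo)  # FALSE
```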

Making a new column

As with most things in R, there are multiple ways to make new columns in a data frame. I will be using the tidyverse syntax and functions here.

The function in the tidyverse (and specifically, the dplyr package) to make a new column is mutate().

The syntax will be:

mutate(new_column = some_function(existing_column))

You don’t need to use a pre-existing column to make the new column (you could just type out a vector with the new column values), but we generally will be using pre-existing columns to make new columns.

For example, we can see that there is a column called "exerany" in our data. This column indicates if the individual has exercised in the past month, where a 1 indicates that they have exercised in the past month and 0 indicates that they have not. This is a fairly common coding of binary variables, but unless you have the codebook readily available, you wouldn’t know for sure which response (yes/no) corresponds to which value (1/0). It can be helpful to make a new variable that reads as “Yes”/“No” instead of 1/0 for clarity and to make sure we don’t accidentally treat this categorical variable as a numeric variable in later analyses.

We will make a new factor variable called "exerany_f" that takes on the values 0 and 1 with the labels "No exercise in the past month" and "Exercise in the past month", respectively. We will use the values from the "exerany" variable to help us make the new variable.

Column names vs object names

Oh no! Why did we get an error?

Let’s take a look at the error message:

Error: object ‘exerany’ not found

This error message is telling us that there isn’t an object called 'exerany' that R can find. This is because 'exerany' is the name of a column in our data frame cdc, not a standalone object within our R environment.
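The difference between a column name and an object name can be sketched like this. The data frame cdc_demo is a made-up stand-in for cdc, and the sketch assumes you have not separately created an object called exerany in your own R session.

```r
# Hypothetical stand-in for the cdc data (values invented for illustration)
cdc_demo <- data.frame(exerany = c(1, 1, 0, 1, 0, 1))

# exerany exists as a column name inside cdc_demo...
"exerany" %in% colnames(cdc_demo)

# ...but it is not an object on its own, so using it bare raises an error.
# tryCatch() captures the error message here instead of stopping the script.
msg <- tryCatch(
  factor(exerany),
  error = function(e) conditionMessage(e)
)
msg
```

The captured message names exerany as the thing R could not find, which is the same complaint as in the error above.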

To fix this error, we just need to tell R where to look for the variable called 'exerany'.

Notice that this printed out a data frame that now has 10 columns instead of the previous 9. The new column is the variable we just made, 'exerany_f'. You can’t see it because there is a limit to the number of columns printed, but there is a line telling us that there is 1 more variable: exerany_f <fct>, which means there is also a factor variable called exerany_f that isn’t being shown in the printed output.

Right now, we haven’t actually saved the new column. To see this, let’s use the colnames() function we learned earlier.

Basically, mutate will add a new column to your data frame and return a data frame (that then gets printed in your output). If you want to save the new variable to be able to use it later, we have to either update the cdc data frame or save this as a new data frame.
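Here is a minimal sketch of that behavior, assuming the dplyr package is installed. The data frame cdc_demo is a made-up stand-in for cdc with invented values, so the column counts differ from the real data.

```r
library(dplyr)

# Hypothetical stand-in for the cdc data (values invented for illustration)
cdc_demo <- data.frame(
  genhlth = c("very good", "excellent", "good", "excellent", "fair", "good"),
  exerany = c(1, 1, 0, 1, 0, 1),
  age     = c(45, 28, 63, 31, 52, 24)
)

# Piping cdc_demo into mutate() tells R which data frame holds exerany.
# The result is printed but not assigned to anything.
cdc_demo %>%
  mutate(exerany_f = factor(exerany,
                            levels = c(0, 1),
                            labels = c("No exercise in the past month",
                                       "Exercise in the past month")))

# Because nothing was assigned, cdc_demo still has its original columns
colnames(cdc_demo)
```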

Saving a new data frame

To save a new data frame we can use this code:
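A sketch of this step, assuming dplyr is installed, is below. The names cdc_demo and cdc_demo_new are made up for illustration and play the roles of cdc and cdc_new; the data values are invented.

```r
library(dplyr)

# Hypothetical stand-in for the cdc data (values invented for illustration)
cdc_demo <- data.frame(
  genhlth = c("very good", "excellent", "good", "excellent", "fair", "good"),
  exerany = c(1, 1, 0, 1, 0, 1),
  age     = c(45, 28, 63, 31, 52, 24)
)

# Assigning the result of mutate() to a new name saves the extra column
cdc_demo_new <- cdc_demo %>%
  mutate(exerany_f = factor(exerany,
                            levels = c(0, 1),
                            labels = c("No exercise in the past month",
                                       "Exercise in the past month")))

colnames(cdc_demo_new)  # now includes exerany_f
colnames(cdc_demo)      # the original is untouched
```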

Updating the old data frame

If we don’t want to make a new data frame with a new name, we can also just update the original cdc object by replacing cdc_new in the code above with cdc. This will overwrite the old cdc object with the updated data frame with the additional column (be careful when overwriting objects, especially if you’re removing some entries of the data).

Great! If we want to take a look at the new column, we can use a function from the dplyr package (part of the tidyverse) called select() to select just a few columns to look at. Let’s grab the original variable, exerany, and the new one, exerany_f, and look at the first few entries to make sure we defined the new variable correctly.

It looks like this worked! The first 6 rows all had a 1 in exerany so they all show up as "Exercise in the past month" under exerany_f.
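This kind of check can be sketched as follows, assuming dplyr is installed. The data frame cdc_demo is a made-up stand-in for cdc, and preview is a name invented here just to hold the result.

```r
library(dplyr)

# Hypothetical stand-in for the cdc data (values invented for illustration)
cdc_demo <- data.frame(
  genhlth = c("very good", "excellent", "good", "excellent", "fair", "good"),
  exerany = c(1, 1, 0, 1, 0, 1),
  age     = c(45, 28, 63, 31, 52, 24)
) %>%
  mutate(exerany_f = factor(exerany,
                            levels = c(0, 1),
                            labels = c("No exercise in the past month",
                                       "Exercise in the past month")))

# Keep only the two columns we want to compare, then show the first rows
preview <- cdc_demo %>%
  select(exerany, exerany_f) %>%
  head()

preview
```

Each row with exerany equal to 1 should read "Exercise in the past month", and each row with 0 should read "No exercise in the past month".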

Using the `$` symbol to grab a column

Another way to grab a specific column from a data frame is to use the $ symbol, using the following syntax:

data_name$column_name

This will grab the column and treat it as a vector.

For example, if we want to grab the new column, exerany_f, and the old column, exerany, and take a look at the last few entries, we can use the tail() function after grabbing the desired column:
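A minimal sketch of this, assuming dplyr is installed and using a made-up stand-in data frame (cdc_demo and its values are invented for illustration):

```r
library(dplyr)

# Hypothetical stand-in for the cdc data (values invented for illustration)
cdc_demo <- data.frame(
  genhlth = c("very good", "excellent", "good", "excellent", "fair", "good"),
  exerany = c(1, 1, 0, 1, 0, 1),
  age     = c(45, 28, 63, 31, 52, 24)
) %>%
  mutate(exerany_f = factor(exerany,
                            levels = c(0, 1),
                            labels = c("No exercise in the past month",
                                       "Exercise in the past month")))

# $ pulls out a single column; tail() then shows its last entries
tail(cdc_demo$exerany, n = 3)    # 1 0 1
tail(cdc_demo$exerany_f, n = 3)

# The extracted columns keep their types
is.numeric(cdc_demo$exerany)     # TRUE
is.factor(cdc_demo$exerany_f)    # TRUE
```

By default tail() returns the last 6 entries; the n argument is used here only because the stand-in data is so short.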

Updating existing variables

Sometimes instead of making a new column, we just want to update an existing column. We can use the mutate() function for this too. The syntax this time will be:

mutate(existing_column = some_function(existing_column))

This will overwrite the column with the new values.

For example, let’s say the values "No exercise in the past month" and "Exercise in the past month" that we just created are too long for our liking. Maybe we want to update this column so the values are just "No exercise" and "Exercise". We can do this with the following code:
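One possible way to do this is sketched below, assuming dplyr is installed. It relabels the factor with factor(), which is only one of several options and not necessarily the approach used originally. The data frame cdc_demo is a made-up stand-in for cdc.

```r
library(dplyr)

# Hypothetical stand-in for the cdc data (values invented for illustration)
cdc_demo <- data.frame(
  exerany = c(1, 1, 0, 1, 0, 1)
) %>%
  mutate(exerany_f = factor(exerany,
                            levels = c(0, 1),
                            labels = c("No exercise in the past month",
                                       "Exercise in the past month")))

# Using the same column name on the left of = overwrites the column.
# levels lists the current labels; labels gives their replacements, in order.
cdc_demo <- cdc_demo %>%
  mutate(exerany_f = factor(exerany_f,
                            levels = c("No exercise in the past month",
                                       "Exercise in the past month"),
                            labels = c("No exercise", "Exercise")))

levels(cdc_demo$exerany_f)  # "No exercise" "Exercise"
```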

Notice that the exerany_f column we originally created has been updated and now contains the shorter labels.

Warning: Overwriting

Be careful when updating existing columns (and other R objects, for that matter)!

If you make a mistake, this will overwrite the old column, so you won’t have access to the original contents anymore. You will have to rerun some of your code (like the step where you first read in the data) to restore the data and try again.

Filtering Data Frames

Sometimes we only want to look at a subset of the rows (observations) in our data frame.

For example, maybe I want to run an analysis using only individuals who have a genhlth value of "excellent" or individuals who are at least 30 years old. To do this, I can use the filter() function (also within the dplyr package from the tidyverse).

In this code chunk, I filter so that I am only left with observations that have a genhlth value of "excellent":

In this code chunk, I filter so that I am only left with observations whose age is at least 30:
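Both filters can be sketched as follows, assuming dplyr is installed. The data frame cdc_demo is a made-up stand-in for cdc with invented values, so the number of rows kept will not match the real data.

```r
library(dplyr)

# Hypothetical stand-in for the cdc data (values invented for illustration)
cdc_demo <- data.frame(
  genhlth = c("very good", "excellent", "good", "excellent", "fair", "good"),
  exerany = c(1, 1, 0, 1, 0, 1),
  age     = c(45, 28, 63, 31, 52, 24)
)

# Keep only rows where genhlth is exactly "excellent" (note the double ==)
cdc_demo %>%
  filter(genhlth == "excellent")

# Keep only rows where age is 30 or more
cdc_demo %>%
  filter(age >= 30)
```

In the stand-in data, the first filter keeps 2 rows and the second keeps 4. Neither call changes cdc_demo itself, for the same reason mutate() did not earlier.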

Using the filter() function, we can then run functions on just the subsetted data. For example, we could check the number of observations in each subset using the nrow() function.

Here we can see that there are 17 observations with genhlth value “excellent” and 48 observations with an age of at least 30. If we know we want to continue using these subsets for future statistical analyses, we might choose to save the filtered data sets with new names so we don’t have to keep using the filter() function.
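Counting and saving the subsets might look like the sketch below, assuming dplyr is installed. The data frame cdc_demo is a made-up stand-in for cdc, and cdc_demo_excellent and cdc_demo_30y are names invented here for its subsets, so the counts differ from the 17 and 48 above.

```r
library(dplyr)

# Hypothetical stand-in for the cdc data (values invented for illustration)
cdc_demo <- data.frame(
  genhlth = c("very good", "excellent", "good", "excellent", "fair", "good"),
  exerany = c(1, 1, 0, 1, 0, 1),
  age     = c(45, 28, 63, 31, 52, 24)
)

# Save each filtered subset under its own name
cdc_demo_excellent <- cdc_demo %>% filter(genhlth == "excellent")
cdc_demo_30y       <- cdc_demo %>% filter(age >= 30)

# Count the observations in each subset
nrow(cdc_demo_excellent)  # 2
nrow(cdc_demo_30y)        # 4

# The full data frame is still available
nrow(cdc_demo)            # 6
```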

Now if we want to run any further analysis using just the subset of individuals who were at least 30 years old or had “excellent” general health condition, we can use cdc_30y or cdc_excellent, respectively. This can be helpful if you want to run multiple analyses on the same subset of observations but still want access to the full data set too.