Intro Biostatistics Resources - Tutorial 2: R Objects, Functions, and Packages

Introduction

In this tutorial, you will learn about working with objects in R. In particular, this tutorial will cover…

R objects
Data types in R
Functions in R
Installing and using R packages

Working through the tutorial

Running the code chunks in this tutorial

In this tutorial I will introduce a handful of new concepts in R and will include example code to demonstrate how the code works. When you get to a code chunk (an example of a code chunk is shown below), you can edit the code in the chunk if necessary and run the code to see the output by either hitting the “Run Code” button or by placing your cursor in the code chunk you want to execute and typing command (⌘) + return (↩︎) (on mac) or Ctrl + Enter (on windows).

Sometimes certain code chunks will require output from previous code chunks. I have tried to mark these within the tutorial, but if you ever get any errors, go back and make sure you haven’t skipped any sections.

Code output

If a code chunk produces output, you will see it appear below the code chunk after you execute (run) the code in the chunk. If there are multiple lines of code that produce output, you will see the output appear in order below the code chunk if you hit the “Run Code” button. You can also execute each line separately using the keystrokes mentioned above.

The ‘Start over’ button

Each code chunk has a button in the upper right corner that allows you to reset the code in the chunk. If you make a mistake and want to start from scratch, hit this button!

Comments

You may notice some lines of code that start with the pound symbol, #, and show up in a different text color inside the code chunks. These are called comments. R doesn’t execute anything when it sees a comment. Comments are just notes to yourself or anyone else reading your code! It’s good practice to add comments to your code at each line to help you remember what the code was meant to do, or to help someone else who may be reading your code understand what the code is supposed to be doing.

# this is an example of a comment!

Exercise:

Try writing your own comment in the code chunk below!

Hint

Think about what the line of code should start with.

If you hit “Run Code” you should see that nothing is executed.

Solution

R Objects

One very basic function of R is that it can be used essentially as a calculator. For example, if we want to check what 2 + 3 is, we can run the code:

Note: the spaces I included in this line of code aren’t necessary, but I like to add white space into R code for readability. White spaces generally only matter in R if they are part of character strigns (which we’ll get to later)

When you run the code above, it should print out the number 5 (the [1] on the left just indicates that this is the first line of output, which can be helpful later if R code produces more than one line of output). You can try changing the numbers in this code chunk to see how this changes the output.

That’s great, but maybe I don’t want to have to re-type that line of code every time I want to use those numbers. This is where R objects come in. For example, maybe I have a variable that I want to call x that will hold the number 2 and a different variable called y that will hold the number 3. To do this, run the following chunk of code:

Note: = vs <-

Sometimes you might see the operator <- (for example x <- 2) instead of the equals sign (=). This is an artifact of earlier versions of R when you had to use <- to assign values to objects; both work now and = is shorter so I used it here 🙂

Note that when you run this code, nothing seems to happen! Or at least, that’s what it looks like. If we were running this in RStudio, we would see in the upper right quadrant under the “Environment” tab that we now have two objects, x and y, that have the values 2 and 3 respectively. To see this here, we can print out the values:

If you get an error

If you get an error that looks like this

Error: object ‘x’ not found

make sure you hit “Run Code” in the previous code chunk where we assigned x and y values before trying to print the value of x. This type of error means that you are trying to use an object that hasn’t been defined. Just because you wrote a line of code (say, in an R Markdown file), doesn’t mean you’ve executed that line of code.

Great! The object x is now a placeholder for the number 2.

Exercise:

The goal of the following line of code is to print the value of the object y, but it has an error. Can you find the mistake in the code? See if you can figure out how to fix it so the code returns the correct answer (the value of y is 3).

Solution

The problem with the code above is that I used a capital Y instead of lowercase y. R is case-sensitive, so it won’t assume you mean y if you type Y. This is something to be aware of in the future!

Cool! Now that x and y are placeholders for the numbers 2 and 3, we can use them just like numbers. For example, we can add them together, multiply them, divide, subtract, etc. If we want to save the value of x + y for later use, we can do that too! Below I’ve shown some examples of calculations you can do using these two variables.

Addition, subtraction, multiplication, division

We can do simple arithmetic like adding, subtracting, multiplying, and dividing, using the symbols +, -, *, and /, respectively.

Exercise:

I’ve shown addition and multiplication below, see if you can do subtraction and division!

Powers

We can take powers using the ^ operator:

Logarithm/Exponent

To take the natural (base e) logarithm of a numeric object we can use the log() function. The inverse of this function is exp(), which calculates e^x :

Saving a new variable z

Types of Data

In the example above, we were using numeric variables. However, there are many different types of data in R. Some of the main types of data we will use are

numeric: numbers
character: character strings/words
factor: categorical variables that store data in levels
logical: binary TRUE/FALSE
NA: stand-in for missing values

If you aren’t sure what type of data an R object is, you can use the following code:

class(name_of_object))

where you would replace name_of_object with the name of the R object that you want to check the class (data type) of.

Numeric Data

This is a number so we can use the same types of functions as we used above on the object x_numeric (adding, subtracting, etc.).

Character Data

This is a character. Even though it looks like a number, the quotation marks tell R that we don’t want to treat this as the number 5. For example, maybe this is and identifier for hospital number 5 in a study. In this case, we probably don’t want to do typical calculations as if this was the number 5.

For example, see what happens when you try to multiply x_character by 2.

Error messages

Oh no! You should see the following error message:

Error: non-numeric argument to binary operator

This is R’s way of telling you that you’re trying to apply a function that requires a number to an object that isn’t a number. R gives helpful error messages like this when you try to run code that doesn’t work. Sometimes the language R uses is hard to understand, but you can always ask me or a TA what the error means, or search the error on Google!

Exercise:

If you change x_character to x_numeric in the code above, you should see it works just fine! Try it out yourself!

Vectors

We don’t have to save values in R individually. We can also save a bunch of values (numbers, characters, factors, etc.) together in something called a vector.

We can make this type of R object using the c() function (this stands for “concatenate”) to put a bunch of values together in a vector, separated by a comma. The values within a vector can be any type (numeric, character, logical, factor, etc.), but they must all be the same type, i.e. all numeric, all character, etc. (except for missing values).

We can do lots of fun stuff in R with vectors. For example, we can multiply every entry of a numeric vector by 2, take the mean, sum, or standard deviation of a numeric vector, show the unique entries of the vector, and much more!

Multiply by 2

Take the mean

Take the sum

Take the standard deviation

Show the unique values

Print the length of the vector

Exercise:

The goal of the following line of code is to find and print the maximum value of the vector x_vector, but it has an error. Can you find the mistake in the code? See if you can figure out how to fix it so the code returns the correct answer (the maximum value in the vector is 5).

Solution

The problem with the code above is that I accidentally included the name of the vector object (x_vector) in quotation marks ("x_vector"). If you include the name of an object in quotation marks, R will just think it’s a character string and not the name of a saved object in the directory. If you remove the quotation marks, the code should work as expected!

Factor Data

Factor variables are categorical variables that can either be numeric or character strings. Sometimes we get data that are coded with numbers even though the underlying variable is categorical. For example, we often code No/Yes variables as 0/1. Or maybe we have three levels of health insurance status: public, private, and uninsured. We may label these categories as 1, 2, and 3, but these numbers are just used for convenience, not because “private insurance” has anything to do with the number 2.

In these cases, it can be helpful to convert variables to factors. Turning the variable into a factor means R will treat the variable as categorical rather than as a number. We also have the option to choose an order for the variable and we can give the variables nicer, more readable labels, which can both come in handy when making plots and tables from the data. We can accomplish these two tasks using the levels and labels arguments, respectively.

Here we can see that x_factor is a vector of factor variables. Now instead of showing up as numbers 1 through 5, we see the labels from the Likert scale. Our original vector was 1, 2, 3, 4, 5, 2, 4 which correspond to: Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree, Agree, Disagree. This is the order that is showing up in x_factor. We can also see all possible levels printed when we print our vector of factor variables.

Logical Data

R has another type of data called “logical” values. These are binary TRUE/FALSE (can also abbreviate with T/F) that let you know if a statement is true or false. This can be really useful when we want to filter data. For example, if we want to filter data to everyone over the age of 50, we can use logical variables to check if the age variable is greater than 50.

In the example below we’ll make a new variable to tell us if x_numeric from above is equal to the number 5.

Exercise:

Try changing the value in the code above to check if x_numeric is equal to the number 4 to see how the value would change.

Note: White spaces and case in character objects

Recall that I mentioned above that adding extra white spaces in R generally doesn’t impact your code (but can help with readability). One instance where white spaces do matter is inside characters. Run the code below to see an example of this:

Notice here that the statement is false, these two characters are not equal (because one has an extra space).

R is also case-sensitive:

Other Inequalities/Operators

We saw above that the == operator is R’s way of checking if an object is equal to some value. There some other useful operators such as:

>: “greater than”
>=: “greater than or equal to”
<: “less than”
<=: “less than or equal to”
%in%: “is an element of” (this will be helpful when we get to vectors later)

Go ahead and play around with the code below to try different operators:

Missing Values

R has a special way of denoting missing values. In R, these show up as NA values. For example, if you have a data set with 100 individuals’ height, weight, and age, if you weren’t able to get the values measured for some of the individuals, those values should show up as NA values to indicate that they are missing.

Note: this is different from "NA" the character. R will change the text color of NA when you type it in your code to show that this is a special kind of variable.

Exercise:

Try multiplying the x_missing variable by 2 and see what happens.

Generally any kind of functions applied to missing values return another NA value unless the function has a special way of handling missing values.

The `is.na()` function

There is a special function in R that can help you check if a value is a missing value. This function is is.na(). To see how it works, try running the code below:

Functions

We’ve already shown some examples of functions in R. You can think of R functions kind of like a recipe. You need to tell R what dish you want to make (the function) and give it some ingredients (called arguments), and then R will make the dish for you!

Functions in R all use the same general syntax:

function_name(argument1 , argument2 , ...)

where here the name of the function is function_name and the arguments for the function are passed to the function inside a set of parentheses. Some functions have no arguments, some have one, and some have many.

For example, the mean() function we used above is a built-in R function that takes a vector of numbers (it can also take a vector of logical values or some more advanced data types like dates) and returns the mean (average) value by adding up all of the entries in the vector and dividing by the length. Functions in R are really helpful because it means we don’t have to write out the full code ourselves.

For example, both lines of code below do the same thing, but one is a lot easier (imagine if the vector had been even longer)!

Note: Argument Names

Sometimes the functions we use will take multiple arguments, and sometimes they only need one. With all R functions, the arguments are given names. For example, technically the argument name for the function mean() is called x, which is just a placeholder for any R objects that you can calculate the mean of. If we wanted to be really technical, we could have used the following line of code:

I try to use argument names in functions that use multiple arguments, because if you don’t assign the inputs to specific argument names, R has to assume that you’ve put the input in a specific order. Sometimes, if I’m only using the most basic argument (like x in the mean() function), I may forget to or choose not to include the name of the argument to make the code simpler.

If you ever need help figuring out what arguments a certain function in R takes, you can use the ‘Help’ tab in R Studio (in the bottom right quadrant) or type ?function_name in your R Console (for example, ?mean). This will open a documentation page for the function and should include information about the argument names and the types of objects can be used as input for the function.

R Packages

What is an R package

R comes with a lot of great built-in, ready-to-use functions, but there are plenty of other functions that we might want to use that don’t come built in to R. You can write your own functions in R if you want, but usually someone else has already written the kind of function you want to use. This is where R packages come in.

The following description of R packages comes from this R Basics Tutorial from R-Ladies Sydney.

A package is a bundle of code that a generous person has written, tested, and then given away. Most of the time packages are designed to solve a specific problem, so they to pull together functions related to a particular data science problem (e.g., data wrangling, visualisation, inference). Anyone can write a package, and you can get packages from lots of different places, but for beginners the best thing to do is get packages from CRAN, the Comprehensive R Archive Network. It’s easier than any of the alternatives, and people tend to wait until their package is stable before submitting it to CRAN, so you’re less likely to run into problems. You can find a list of all the packages on CRAN here.

How to install R packages

Note: Installing packages

You should generally only need to install a package one time on your computer. Installing packages is just like downloading any other piece of software on your computer; once you’ve downloaded it once, you have it on your computer unless you delete it.

Installing packages is different from loading packages, which we’ll get to next.

Loading Packages

Just because you installed an R package once, doesn’t mean you want to use it for every task you do in R. Because of this, R will only automatically load a small set of default packages when you launch a new session. Other than the small set of default packages, you’ll need to tell it if you want to use any additional packages. To load a library that isn’t loaded by default, we use the following code:

library(package_name)

Note: Argument Names (continued)

Just to reiterate the previous point about argument names, technically the argument name used here in the library() function is called package. This is assumed to be the first argument, and we generally won’t utilize any of the other arguments, which allow you to customize how you want to load a library and if you want anything printed when you load it, etc. But again, we could utilize the argument name like so:

library(package = package_name)

Think of installing and loading R packages as similar to using the Microsoft Word application on your computer. Once you have Microsoft Word installed on your computer, you have the software available to you. But that doesn’t mean Word is always running on your computer. In order to use Word, you have to open the application. This is essentially what we are doing when we load a library–we’re opening it up so we can use it.

Note: loading packages

Another reason we don’t automatically load all of the packages we’ve ever installed on our computer is because sometimes different packages have functions that are named the same thing (this is because R is open-source)! When this happens, it can be frustrating to try to figure out which version of the function R is using and requires extra code to tell R which version of the function you want to use. To avoid this it is best practice to only load the packages you currently need.

Loading uninstalled packages

Remember that you have to install an R package once before you ever use it. After you’ve installed it once, you don’t need to re-install it, you just need to load it into your session to have access to its contents.

If you ever run a line of code like this:

library(mypackage)

and get an error message like this:

Error in library(mypackage) : there is no package called ‘mypackage’

this means you are trying to load a package you haven’t installed yet and you need to run install.packages("mypackage") once first.

The tidyverse package

The R tidyverse package is a very useful suite of packages that make data manipulation and visualization clean and efficient. To see all of the packages included in the tidyverse package run the following code to print the names of all packages within the tidyverse suite:

Error message

Oh no! We got another error message. You should see the following error:

Error: could not find function “tidyverse_packages”

This is R’s way of telling you that you’re trying to use a function that it doesn’t have access to. The tidyverse_packages() function comes from the tidyverse package. I have already installed the tidyverse for you in the setup of this tutorial, but I didn’t load it.

To give ourselves access to the functions from the tidyverse package, let’s add the line of code library(tidyverse) before the line of code in the chunk above. That should fix the problem.

Exercise:

Add the line of code library(tidyverse) before the line of code in the chunk above and try rerunning the code chunk.

Solution

The pipe operator from the tidyverse

One of the unique functions provided by the tidyverse is called the “pipe operator.” The pipe operator looks like this: %>%. The pipe operator is a way of stringing together a bunch of functions and the purpose is to make your code easier to read.

Note: %>% vs |>

Recent updates to R also let you use |> as the pipe operator, but I’m old so I still use %>% 🙂

To see an example of this, consider the following task:

We want to know how many unique values are contained in the vector x_vector.

To do this, we can start by having R tell us the unique values in x_vector using unique(x_vector). This will give us a vector of the unique values in x_vector. Then, we can print the length of the output of unique(x_vector) using the length() function. This will print out the number of entries in the vector of unique entries of x_vector.

To do this, we can use multiple functions simultaneously by nesting the parentheses:

Notice how the order of operations was:

Start with your input x_vector
Determine the unique values of x_vector
Calculate length of unique(x_vector)

However, reading left-to-right we see the last step first. This can get particularly confusing when we want to apply multiple sequential functions to the same object in R. Imagine we had wanted to do 3 more functions! This would have read as:

function3(function2(function1(length(unique(x_vector)))))

That’s a lot of parentheses and it’s hard to even figure out where the operation starts! The pipe operator can be helpful in these scenarios. Returning to our example, we can print the number of unique entries of x_vector as follows:

See how the output is the same but the order of operations is much clearer in this example? It’s clear that we start with the object x_vector, which we then extract just the unique values from, then finally calculate the length of just the vector of unique values.

In the example above with an extra 3 functions at the end, this would read as:

x_vector %>% unique() %>% length() %>% function1() %>% function2() %>% function3()

Still long, but a lot easier to follow (at least I think so)!

End of tutorial

That’s it! The next tutorial will be about importing data into R.

Introduction

Working through the tutorial

Running the code chunks in this tutorial

Code output

The ‘Start over’ button

Comments

R Objects

Addition, subtraction, multiplication, division

Powers

Logarithm/Exponent

Saving a new variable z

Types of Data

Numeric Data

Character Data

Error messages

Vectors

Multiply by 2

Take the mean

Take the sum

Take the standard deviation

Show the unique values

Print the length of the vector

Factor Data

Logical Data

Missing Values

The is.na() function

Functions

R Packages

What is an R package

How to install R packages

Loading Packages

The tidyverse package

Error message

The pipe operator from the tidyverse

End of tutorial

The `is.na()` function