Introduction
In this tutorial, you will learn about working with objects in R. In particular, this tutorial will coverâŚ
- R objects
- Data types in R
- Functions in R
- Installing and using R packages
Working through the tutorial
Running the code chunks in this tutorial
In this tutorial I will introduce a handful of new concepts in R and will include example code to demonstrate how the code works. When you get to a code chunk (an example of a code chunk is shown below), you can edit the code in the chunk if necessary and run the code to see the output by either hitting the âRun Codeâ button or by placing your cursor in the code chunk you want to execute and typing command (â) + return (âŠď¸)
(on mac) or Ctrl + Enter
(on windows).
Sometimes certain code chunks will require output from previous code chunks. I have tried to mark these within the tutorial, but if you ever get any errors, go back and make sure you havenât skipped any sections.
Code output
If a code chunk produces output, you will see it appear below the code chunk after you execute (run) the code in the chunk. If there are multiple lines of code that produce output, you will see the output appear in order below the code chunk if you hit the âRun Codeâ button. You can also execute each line separately using the keystrokes mentioned above.
R Objects
One very basic function of R is that it can be used essentially as a calculator. For example, if we want to check what 2 + 3
is, we can run the code:
Note: the spaces I included in this line of code arenât necessary, but I like to add white space into R code for readability. White spaces generally only matter in R if they are part of character strigns (which weâll get to later)
When you run the code above, it should print out the number 5 (the [1] on the left just indicates that this is the first line of output, which can be helpful later if R code produces more than one line of output). You can try changing the numbers in this code chunk to see how this changes the output.
Thatâs great, but maybe I donât want to have to re-type that line of code every time I want to use those numbers. This is where R objects come in. For example, maybe I have a variable that I want to call x
that will hold the number 2 and a different variable called y
that will hold the number 3. To do this, run the following chunk of code:
=
vs <-
Sometimes you might see the operator <-
(for example x <- 2
) instead of the equals sign (=
). This is an artifact of earlier versions of R when you had to use <-
to assign values to objects; both work now and =
is shorter so I used it here đ
Note that when you run this code, nothing seems to happen! Or at least, thatâs what it looks like. If we were running this in RStudio, we would see in the upper right quadrant under the âEnvironmentâ tab that we now have two objects, x
and y
, that have the values 2 and 3 respectively. To see this here, we can print out the values:
If you get an error that looks like this
Error: object âxâ not found
make sure you hit âRun Codeâ in the previous code chunk where we assigned x
and y
values before trying to print the value of x
. This type of error means that you are trying to use an object that hasnât been defined. Just because you wrote a line of code (say, in an R Markdown file), doesnât mean youâve executed that line of code.
Great! The object x
is now a placeholder for the number 2.
Cool! Now that x
and y
are placeholders for the numbers 2 and 3, we can use them just like numbers. For example, we can add them together, multiply them, divide, subtract, etc. If we want to save the value of x + y
for later use, we can do that too! Below Iâve shown some examples of calculations you can do using these two variables.
Addition, subtraction, multiplication, division
We can do simple arithmetic like adding, subtracting, multiplying, and dividing, using the symbols +
, -
, *
, and /
, respectively.
Powers
We can take powers using the ^
operator:
Logarithm/Exponent
To take the natural (base e) logarithm of a numeric object we can use the log()
function. The inverse of this function is exp()
, which calculates e^x
:
Saving a new variable z
Types of Data
In the example above, we were using numeric variables. However, there are many different types of data in R. Some of the main types of data we will use are
- numeric: numbers
- character: character strings/words
- factor: categorical variables that store data in levels
- logical: binary TRUE/FALSE
- NA: stand-in for missing values
If you arenât sure what type of data an R object is, you can use the following code:
class(name_of_object))
where you would replace name_of_object
with the name of the R object that you want to check the class (data type) of.
Numeric Data
This is a number so we can use the same types of functions as we used above on the object x_numeric
(adding, subtracting, etc.).
Character Data
This is a character. Even though it looks like a number, the quotation marks tell R that we donât want to treat this as the number 5. For example, maybe this is and identifier for hospital number 5 in a study. In this case, we probably donât want to do typical calculations as if this was the number 5.
For example, see what happens when you try to multiply x_character
by 2.
Error messages
Oh no! You should see the following error message:
Error: non-numeric argument to binary operator
This is Râs way of telling you that youâre trying to apply a function that requires a number to an object that isnât a number. R gives helpful error messages like this when you try to run code that doesnât work. Sometimes the language R uses is hard to understand, but you can always ask me or a TA what the error means, or search the error on Google!
Vectors
We donât have to save values in R individually. We can also save a bunch of values (numbers, characters, factors, etc.) together in something called a vector.
We can make this type of R object using the c()
function (this stands for âconcatenateâ) to put a bunch of values together in a vector, separated by a comma. The values within a vector can be any type (numeric, character, logical, factor, etc.), but they must all be the same type, i.e. all numeric, all character, etc. (except for missing values).
We can do lots of fun stuff in R with vectors. For example, we can multiply every entry of a numeric vector by 2, take the mean, sum, or standard deviation of a numeric vector, show the unique entries of the vector, and much more!
Multiply by 2
Take the mean
Take the sum
Take the standard deviation
Show the unique values
Print the length of the vector
Factor Data
Factor variables are categorical variables that can either be numeric or character strings. Sometimes we get data that are coded with numbers even though the underlying variable is categorical. For example, we often code No/Yes variables as 0/1. Or maybe we have three levels of health insurance status: public, private, and uninsured. We may label these categories as 1, 2, and 3, but these numbers are just used for convenience, not because âprivate insuranceâ has anything to do with the number 2.
In these cases, it can be helpful to convert variables to factors. Turning the variable into a factor means R will treat the variable as categorical rather than as a number. We also have the option to choose an order for the variable and we can give the variables nicer, more readable labels, which can both come in handy when making plots and tables from the data. We can accomplish these two tasks using the levels
and labels
arguments, respectively.
Here we can see that x_factor
is a vector of factor variables. Now instead of showing up as numbers 1 through 5, we see the labels from the Likert scale. Our original vector was 1, 2, 3, 4, 5, 2, 4
which correspond to: Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree, Agree, Disagree
. This is the order that is showing up in x_factor
. We can also see all possible levels printed when we print our vector of factor variables.
Logical Data
R has another type of data called âlogicalâ values. These are binary TRUE
/FALSE
(can also abbreviate with T
/F
) that let you know if a statement is true or false. This can be really useful when we want to filter data. For example, if we want to filter data to everyone over the age of 50, we can use logical variables to check if the age variable is greater than 50.
In the example below weâll make a new variable to tell us if x_numeric
from above is equal to the number 5.
We saw above that the ==
operator is Râs way of checking if an object is equal to some value. There some other useful operators such as:
>
: âgreater thanâ>=
: âgreater than or equal toâ<
: âless thanâ<=
: âless than or equal toâ%in%
: âis an element ofâ (this will be helpful when we get to vectors later)
Go ahead and play around with the code below to try different operators:
Missing Values
R has a special way of denoting missing values. In R, these show up as NA
values. For example, if you have a data set with 100 individualsâ height, weight, and age, if you werenât able to get the values measured for some of the individuals, those values should show up as NA
values to indicate that they are missing.
Note: this is different from "NA"
the character. R will change the text color of NA
when you type it in your code to show that this is a special kind of variable.
The is.na()
function
There is a special function in R that can help you check if a value is a missing value. This function is is.na()
. To see how it works, try running the code below:
Functions
Weâve already shown some examples of functions in R. You can think of R functions kind of like a recipe. You need to tell R what dish you want to make (the function) and give it some ingredients (called arguments), and then R will make the dish for you!
Functions in R all use the same general syntax:
function_name(argument1 , argument2 , ...)
where here the name of the function is function_name
and the arguments for the function are passed to the function inside a set of parentheses. Some functions have no arguments, some have one, and some have many.
For example, the mean()
function we used above is a built-in R function that takes a vector of numbers (it can also take a vector of logical values or some more advanced data types like dates) and returns the mean (average) value by adding up all of the entries in the vector and dividing by the length. Functions in R are really helpful because it means we donât have to write out the full code ourselves.
For example, both lines of code below do the same thing, but one is a lot easier (imagine if the vector had been even longer)!
Sometimes the functions we use will take multiple arguments, and sometimes they only need one. With all R functions, the arguments are given names. For example, technically the argument name for the function mean()
is called x
, which is just a placeholder for any R objects that you can calculate the mean of. If we wanted to be really technical, we could have used the following line of code:
I try to use argument names in functions that use multiple arguments, because if you donât assign the inputs to specific argument names, R has to assume that youâve put the input in a specific order. Sometimes, if Iâm only using the most basic argument (like x
in the mean()
function), I may forget to or choose not to include the name of the argument to make the code simpler.
If you ever need help figuring out what arguments a certain function in R takes, you can use the âHelpâ tab in R Studio (in the bottom right quadrant) or type ?function_name
in your R Console (for example, ?mean
). This will open a documentation page for the function and should include information about the argument names and the types of objects can be used as input for the function.
R Packages
What is an R package
R comes with a lot of great built-in, ready-to-use functions, but there are plenty of other functions that we might want to use that donât come built in to R. You can write your own functions in R if you want, but usually someone else has already written the kind of function you want to use. This is where R packages come in.
The following description of R packages comes from this R Basics Tutorial from R-Ladies Sydney.
A package is a bundle of code that a generous person has written, tested, and then given away. Most of the time packages are designed to solve a specific problem, so they to pull together functions related to a particular data science problem (e.g., data wrangling, visualisation, inference). Anyone can write a package, and you can get packages from lots of different places, but for beginners the best thing to do is get packages from CRAN, the Comprehensive R Archive Network. Itâs easier than any of the alternatives, and people tend to wait until their package is stable before submitting it to CRAN, so youâre less likely to run into problems. You can find a list of all the packages on CRAN here.
How to install R packages
You should generally only need to install a package one time on your computer. Installing packages is just like downloading any other piece of software on your computer; once youâve downloaded it once, you have it on your computer unless you delete it.
Installing packages is different from loading packages, which weâll get to next.
Loading Packages
Just because you installed an R package once, doesnât mean you want to use it for every task you do in R. Because of this, R will only automatically load a small set of default packages when you launch a new session. Other than the small set of default packages, youâll need to tell it if you want to use any additional packages. To load a library that isnât loaded by default, we use the following code:
library(package_name)
Think of installing and loading R packages as similar to using the Microsoft Word application on your computer. Once you have Microsoft Word installed on your computer, you have the software available to you. But that doesnât mean Word is always running on your computer. In order to use Word, you have to open the application. This is essentially what we are doing when we load a libraryâweâre opening it up so we can use it.
Another reason we donât automatically load all of the packages weâve ever installed on our computer is because sometimes different packages have functions that are named the same thing (this is because R is open-source)! When this happens, it can be frustrating to try to figure out which version of the function R is using and requires extra code to tell R which version of the function you want to use. To avoid this it is best practice to only load the packages you currently need.
Remember that you have to install an R package once before you ever use it. After youâve installed it once, you donât need to re-install it, you just need to load it into your session to have access to its contents.
If you ever run a line of code like this:
library(mypackage)
and get an error message like this:
Error in library(mypackage) : there is no package called âmypackageâ
this means you are trying to load a package you havenât installed yet and you need to run install.packages("mypackage")
once first.
The tidyverse package
The R tidyverse
package is a very useful suite of packages that make data manipulation and visualization clean and efficient. To see all of the packages included in the tidyverse
package run the following code to print the names of all packages within the tidyverse
suite:
Error message
Oh no! We got another error message. You should see the following error:
Error: could not find function âtidyverse_packagesâ
This is Râs way of telling you that youâre trying to use a function that it doesnât have access to. The tidyverse_packages()
function comes from the tidyverse
package. I have already installed the tidyverse
for you in the setup of this tutorial, but I didnât load it.
To give ourselves access to the functions from the tidyverse
package, letâs add the line of code library(tidyverse)
before the line of code in the chunk above. That should fix the problem.
The pipe operator from the tidyverse
One of the unique functions provided by the tidyverse
is called the âpipe operator.â The pipe operator looks like this: %>%
. The pipe operator is a way of stringing together a bunch of functions and the purpose is to make your code easier to read.
%>%
vs |>
Recent updates to R also let you use |>
as the pipe operator, but Iâm old so I still use %>%
đ
To see an example of this, consider the following task:
We want to know how many unique values are contained in the vector x_vector
.
To do this, we can start by having R tell us the unique values in x_vector
using unique(x_vector)
. This will give us a vector of the unique values in x_vector
. Then, we can print the length of the output of unique(x_vector)
using the length()
function. This will print out the number of entries in the vector of unique entries of x_vector
.
To do this, we can use multiple functions simultaneously by nesting the parentheses:
Notice how the order of operations was:
Start with your input
x_vector
Determine the unique values of
x_vector
Calculate length of
unique(x_vector)
However, reading left-to-right we see the last step first. This can get particularly confusing when we want to apply multiple sequential functions to the same object in R. Imagine we had wanted to do 3 more functions! This would have read as:
function3(function2(function1(length(unique(x_vector)))))
Thatâs a lot of parentheses and itâs hard to even figure out where the operation starts! The pipe operator can be helpful in these scenarios. Returning to our example, we can print the number of unique entries of x_vector
as follows:
See how the output is the same but the order of operations is much clearer in this example? Itâs clear that we start with the object x_vector
, which we then extract just the unique values from, then finally calculate the length of just the vector of unique values.
In the example above with an extra 3 functions at the end, this would read as:
x_vector %>% unique() %>% length() %>% function1() %>% function2() %>% function3()
Still long, but a lot easier to follow (at least I think so)!
End of tutorial
Thatâs it! The next tutorial will be about importing data into R.
Comments
You may notice some lines of code that start with the pound symbol,
#
, and show up in a different text color inside the code chunks. These are called comments. R doesnât execute anything when it sees a comment. Comments are just notes to yourself or anyone else reading your code! Itâs good practice to add comments to your code at each line to help you remember what the code was meant to do, or to help someone else who may be reading your code understand what the code is supposed to be doing.Try writing your own comment in the code chunk below!
Think about what the line of code should start with.
If you hit âRun Codeâ you should see that nothing is executed.