Tutorial 6: Data Summarization in R

R Tutorials
Exploratory Data Analysis (EDA)
Author

Haley Grant

Introduction

In this tutorial, we will be covering how to summarize data using R. This can include:

  • single variable summaries
    • counts,
    • proportions,
    • measures of center (mean, median),
    • measures of spread (standard deviation, variance, IQR)
  • bivariate summaries
    • counts by group,
    • proportions by group,
    • measures of center by group,
    • measures of spread by group,
    • correlation

We will once again be using the census data from Tutorials 3 and 5 to demonstrate how to create numerical summaries using R. I’ve already read the data in for you; feel free to run the code below to remind yourself of the variables in the data set that we will be working with.

We will also be utilizing the tidyverse package again, so let’s go ahead and load that package into our work space so we don’t forget!

Single variable summaries

We’ll start with single variable summaries, that is, getting summary metrics for just one variable at a time.

Quantitative variables

In Tutorial 5, we made a histogram and box plot to display the variable total_personal_income. This is a quantitative variable, so the summary metrics that we can use are measures of center such as mean or median and measures of spread such as standard deviation, variance, and interquartile range (IQR).

The summarise() function

We can calculate these summary statistics within the data set, we can utilize the summarise() function from the dplyr package (part of the tidyverse). This function is similar to the mutate() function that we used in Tutorial 4 in that it can create new columns. However, instead of just adding new columns to an existing data set, the summarise() function condenses the data and only shows the summarized information. To see this, let’s try taking the mean of the variable total_personal_income both with the mutate() function and with the summarise() function.

Try running the following two code chunks to see the difference between mutate() and summarise()

Notice how the mutate() implementation adds a new column with the mean income in the data set repeated for every row, whereas the summarise() implementation just prints the mean income and exludes all individual rows.

The drop_na() function

Notice that we had to drop the missing values (NA values) from the total_personal_income column to calculate the mean. We will have to do this to calculate all of the additional summary statistics, so one option we have is to exclude all of the rows with a missing value in the total_personal_income column before applying the summarise() function. We can do this with the drop_na() function (from the tidyr package also within the tidyverse). Inside the parentheses, we can include the names of all of the columns that we want drop missing observations from. If we leave this blank (just write drop_na() with nothing in the parentheses, R will remove any row with a missing value in any column). In this case we’ll just drop the rows with missing total personal income. This will be helpful when we want to compute multiple statistics and don’t want to include a na.rm = T argument in each function.

To see how the drop_na() function works, we can run the following code to check the number of rows in the data frame with and without applying the drop_na() function.

Note

Note that drop_na() doesn’t permanently remove the rows with missing values unless save you save the data frame with an assign operator (= or <-). For example,

census2 <- census %>%
  drop_na(total_personal_income)

Would save the filtered data set with no missing values in total_personal_income under the new name census2. If we had used the old name (census instead of census2), this would overwrite our old data frame. Be careful with this and only overwrite your data if you know you want to keep the changes you’re making!

If we use the filtered data with no missing values in the total_personal_income column, we can omit the na.rm = T argument when we take the mean:

Calculating multiple statistics

We don’t have to just calculate one summary statistic at a time. For example, we can calculate the mean, median, minimum, maximum, 25th percentile, 75th percentile, standard deviation, and variance, all within one sumamrise() function.

Notice that the mean is much larger than the median, indicate left skew in the data.

We can couple these statistics with a histogram to get a fuller picture of the distribution.

This is the code used to produce the plot:

```{webr-r}
census %>%
  ggplot(aes(x = total_personal_income)) +
  geom_histogram(binwidth = 10000, fill = "skyblue", color = "black") + # bins of size $10,000
  theme_bw() + 
  labs(x = "Total personal income ($)")
```

Qualitative variables

In Tutorial 5, we made a bar plot showing the breakdown of marital status in the census data. We can also compute summary statistics for this variable. For categorical variables like this, there aren’t as many summary statistics we can compute; we generally stick to counts and proportions.

The xtabs() function

One way to do this is to use the xtabs() function from the stats package (which comes pre-loaded in R). The syntax for this function is:

xtabs( ~ variable_name, data = data_name)

We will also use this function later when we want to make a contingency table in which we tabulate an outcome across different groups.

If we want to show this in terms of proportions instead of raw counts, we can apply the prop.table() function after this:

Here we can see that never married/single is the most frequent category in the data (n = 222, 44.4%), followed by married with spouse present (n = 192, 38.4%), divorced (n = 38, 7.6%), widowed (n = 31, 6.2%), married with spouse absent (n = 14, 2.8%), and the least common was separated (n = 3, 0.6%).

The n() and group_by() functions

We can also make these tables using the summarise function like we did above. However, to do this, we are going to need to first utilize the group_by() function. The group_by() function from the dplyr package allows us to specify groups within which we will calculate summary statistics. To make a table of counts, we will first group by martial status and then use the n() function, which simply counts the number of observations present. By grouping first, we will ensure that we are counting the number of observations within each level of the marital_status column separately.

If we want to calculate proportions, we can add an additional line of code:

Again, we can add a plot to visualize this difference:

This is the code used to produce the plot:

```{webr-r}
census %>%
  ggplot(aes(x = marital_status, fill = marital_status)) +
  geom_bar(color = "black") + 
  theme_classic() + 
  labs(title = "Marital Status", x = element_blank()) + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), # rotate axis labels
      legend.position = "none"  ) # remove color labels (labeled on x-axis, color is just for fun :) )

```

Bivariate summaries

We often want to summarise more than one variable at a time, for example maybe we want to evaluate differences in total personal income by sex or evaluate the association between age and total personal income.

One quantitative and one qualitative

Suppose we want to compare total personal income between biological sexes. To do this, we can calculate the same summary statistics we did above for total personal income, but first group by sex to calculate these statistics within each group. We can then make comparisons to see how similar/different they are. We will again utilize the group_by() function.

Here we can see that within this sample, the mean income for female respondents was $17,441 and median was $14,000 while for male respondents these figures were $39,809 and $24,050, respectively. There was also a lot more variability in income reported by male respondents, with a standard deviation of $59,926 compared to $16,774 among female respondents.

We can also see this in side-by-side boxplots or histograms.

This is the code used to produce the plot:

```{webr-r}
census %>%
  ggplot(aes(x = total_personal_income, fill = sex)) +
  geom_histogram(binwidth = 10000,  color = "black") + # bins of size $10,000
  theme_bw() + 
  labs(x = "Total personal income ($)") + 
  facet_wrap(. ~ sex, nrow = 2) + 
  theme(legend.position = "none")

```

This is the code used to produce the plot:

```{webr-r}
census %>%
  ggplot(aes(y = total_personal_income, x = sex, color = sex)) +
  geom_boxplot() + 
  theme_bw() + 
  labs(y = "Total personal income ($)", x = "Biological Sex") + 
  theme(legend.position = "none")

```

Two qualitative variables

For two categorical variables, we can again make tables of counts and proportions and compare across groups. For example, we could check the breakdown of marital status by biological sex. We can do this using the xtabs() function again by simply adding in the extra variable:

xtabs( ~ variable_name1 + variable_name2, data = data_name)

In this case, adding proportions (or at least marginal totals) will be helpful if the groups are different sizes. In this case we will have options for how we want to take the proportions. We can:

  • take the proportion overall
  • take the proportion within rows
  • take the proportion within columns

Which one of these we want to use will depend on the question we’re trying to answer.

Table with margins (row and column sums)

Table of proportions (overall)

This table shows the overall proportion. It shows that 20% of our sample is male and married with a spouse present and 18.4% of our sample is female and married with a spouse present.

Table of proportions (by row)

This table shows the proportion within rows. The proportions sum to 1 along the rows. This tells us that 48% of male respondents were never married/single compared to 40% of female respondents who were never married/single. About 8% of the female respondents were widowed, compared to only 4% of the male respondents.

Table of proportions (by column)

Finally, this table shows the proportion within columns. The proportions sum to 1 along the columns. This tells us that among those who were widowed, 64.5% were female and 35.5% were male.

Tables with summarise()

We can also make these tables by using the summarise() function paired with group_by as before:

Each of these tables shows something slightly different and the one that is most relevant will depend on the trend we are trying to evaluate and possibly the sampling mechanism.

To view these in a plot, we have a lot of options. Two good options are to either calculate proportions and plot the proportions side-by-side, or to plot the counts for each group separately and place the plots next to/ on top of each other.

This is the code used to produce the plot:

```{webr-r}
census %>%
  group_by(sex, marital_status) %>%
  summarise(count = n()) %>%
  mutate(prop = count/sum(count)) %>%
  ggplot(aes(x = marital_status, y = prop, fill = sex)) +
  geom_bar(color = "black", position = "dodge", stat = "identity") + 
  theme_classic() + 
  labs(title = "Marital Status Proportions", x = element_blank(),
       y = "Proportion", fill = element_blank()) + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)  )# rotate axis labels

```

This is the code used to produce the plot:

```{webr-r}
census %>%
  ggplot(aes(x = marital_status, fill = marital_status)) +
  geom_bar(color = "black") + 
  theme_classic() + 
  labs(title = "Marital Status by Sex", x = element_blank()) + 
  facet_wrap(. ~ sex, nrow = 1) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), # rotate axis labels
      legend.position = "none"  ) # remove color labels (labeled on x-axis, color is just for fun :) )
```

Two quantitative variables

If we want to evaluate the association between two quantitative variables, we typically use the correlation between the two variables (denoted R or r).

Correlation

To take the correlation between two variables, we can use the cor() function from the stats package. The function is expecting data in the form of two vectors. For example, if we want to take the correlation between age and total personal income, we can grab them both from the data set using the data_name$variable_name notation and then take the correlation. Once again, we will need to remove missing values to get a numeric value here. The argument to do this within the cor() function is the use argument, which we will set equal to "pairwise.complete.obs", which tells R to only use rows of the data that are complete (non-missing) for both age and total_personal_income columns.

We can also do this within a summarise() function :

Here we can see that there is a slight positive correlation of 0.131 between age and total personal income. We can also see this somewhat in a scatterplot:

This is the code used to produce the plot:

```{webr-r}
census %>%
  ggplot(aes(x = age, y = total_personal_income)) + 
  geom_point() + 
  theme_bw() + 
  labs(x = "Age (years)", y = "Total Personal Income ($)")
```