Tech Tips

  • Tech Tips
  • 12.14.20

Combining Data from Multiple Sources Using R

  • by Mike Wallinga, Director of Institutional Research, Northwestern College

Gathering the Data

A common request of our institutional research office is to combine data from multiple sources and varying formats. For example, we keep our demographic and academic information about our students in an SQL database, but another office on campus might keep track of their student program signups in Excel. Furthermore, maybe we just received some raw survey data in SPSS format. How can we make these three very different data formats work together?

There are a variety of tools that can accomplish this task, both free and commercial, but my favorite technique is to write a short R script. There are so many R packages for accessing data from different sources that I have yet to run into a file format that I can’t read using R. In this Tech Tip, we’re going to focus on three formats commonly used in IR work: an SQL database, Excel, and SPSS.

Before we do anything else, we’ll load the five R packages we’re going to use (if you’ve never used them before, you’ll have to install them first). The following code accomplishes this:

library(DBI)     # for connecting to databases
library(readxl)  # for reading Excel files
library(writexl) # for creating new Excel files
library(haven)   # for reading files from SPSS, SAS, and Stata
library(dplyr)   # for combining the data sets

First, let’s query the SQL database. R has several packages for connecting to databases. In this example, we will use the DBI package (see https://db.rstudio.com/dbi/). The DBI package supports connecting to a variety of database types, both commercial and open source. You may need to consult with your IT department to determine the proper database driver and connection string to use. In this example, we are using a SQLite database for simplicity (see https://db.rstudio.com/databases/sqlite/).

In our database, there are two tables, one containing information about students, and one containing information about the majors offered at the institution. In the student table, rather than wasting space storing the full name of each student’s major, we only store the three-letter code. If we want to include the majors’ full titles for use in a table or graph, we need to combine the two tables using the field they have in common (the major code). In SQL terms, this is called a join.

# Establish a connection to our SQLite database
con <- dbConnect(RSQLite::SQLite(), "./data/students.db")

# Write our query using SQL syntax
query <- "SELECT Students.stuid, Majors.majorname
          FROM Students INNER JOIN Majors
               ON Students.majorid = Majors.majorid
"

# Execute the query, fetch the results, and store them in a data frame
students <- dbFetch(dbSendQuery(con, query))

The resulting data frame looks like this:

stuid  majorname 
123 Mathematics
345 Computer Science
624 History 
789 Biology 
851 Computer Science
555 History 
992 Biology 
484 Music
717  English
839 Biology 
481  Music
989 English
385 Mathematics
222 Computer Science

Next, let’s say that our student activities department uses Excel to maintain a list of students signed up for intramural athletics. Once they’ve shared that file with us, we can use the read_excel function found in the readxl package (see https://readxl.tidyverse.org/) to retrieve the data and store them in another data frame.

intramurals <- read_excel("./data/intramural_signups.xlsx")

That data frame looks like this:

studentid  sport 
 123  basketball
 123  badminton
 222  badminton
 385  basketball
 385  volleyball
 481  basketball
 555  volleyball
 717  basketball
 789  basketball
 851  volleyball
 851  badminton
 992  badminton
 992  volleyball

Similarly, let’s suppose we received raw data from a survey in SPSS format. The survey asked questions about students’ satisfaction with the on-campus cafeteria and includes responses for the variety and quality of the food served, and the cafeteria’s hours of operation. We can read the dataset using the haven package (https://haven.tidyverse.org/), like so:

cafeteria <- read_sav("./data/cafeteria_survey.sav")

Our third data frame looks like this:

studentid.  variety   quality  hours 
 123  3  4  2
 345  4  4  2
 385  1  1  2
 481  2  3  3
 624  2  3  4
 851  5  4  1
 992  4  5  3

Combining the Data

Now that we have data from all three sources, we want to combine them. Fortunately,  the data frames contain a common variable: the student id number (although our students data frame calls the column stuid and the other two call it studentid). We want to join the data frames based on the common column, in much the same way as our SQL query combined the students and majors information with one important difference. The previous SQL query used an inner join, which only included rows of data that had matches in both tables. All of the students had majors, so we didn’t exclude anyone. However, if you compare the three data frames, not every student signed up for intramural athletics or took the survey and some did neither. If we want to include all students in our final dataset, we need to use a left join instead. This type of join will include every row from the dataset listed on the left-hand side of the join statement and include the matching rows from the dataset on the right-hand side of the statement if they exist.

Since we’re doing this in R, we don’t have to use SQL syntax (although if you really wanted to, you could use the sqldf package - see https://cran.r-project.org/package=sqldf). Instead, we’re going to use the left_join function in the dplyr package (see https://dplyr.tidyverse.org/reference/join.html). We specify that stuid and studentid are the two fields being matched. To merge three data frames, we use two consecutive left_join statements; the first joins the students and intramurals data frames, and the second joins the result of the first join with the cafeteria data frame.

# Merge all three data sets into one using the student ID number
combined_data <- students %>%
  left_join(intramurals, by=c("stuid"="studentid")) %>%
  left_join(cafeteria, by=c("stuid"="studentid"))

Here is the output of our two left joins. The resulting data frame includes all students, with NA (missing) values if a student did not have a match in the intramurals or cafeteria data frames.

 stuid 

majorname

sport

 variety

   quality

   hours 

123

Mathematics

basketball

3

4

123

Mathematics

badminton

3

4

345

Computer Science   

NA

4

4

624

History

NA

2

3

789

Biology

basketball

NA

NA

NA 

851

Computer Science 

volleyball

5

4

851

Computer Science

badminton

5

4

555

History

volleyball

NA

NA

NA 

992

Biology

badminton

4

5

992

Biology

volleyball

4

5

484

Music

NA

NA

NA

NA 

717

English

basketball

NA

NA

NA 

839

Biology

NA

NA

NA

NA 

481

Music

basketball

2

3

989

English

NA

NA

NA

NA 

385

Mathematics

basketball

1

1

385

Mathematics

volleyball

1

1

222

Computer Science

badminton

NA

NA

NA 


Exporting the Data

What happens next? That’s up to you! You can use student demographic information to predict the degree of interest in future intramural offerings, or you can explore whether members of a particular intramural sports team have a statistically significant difference in opinion about the cafeteria food compared to their peers.

Whatever research question you pursue, I would keep this combined dataset in R and use its programming commands to analyze and visualize the data, but your IR office may have a completely different workflow and prefer different tools. That’s OK! We can use the same R packages to export the data into other formats for use with other software packages.

# Write the merged results to a specific program's file format using functions in the haven package:
write_sav(combined_data, "./exports/combined_data.sav")      # creates an SPSS output file
write_sas(combined_data, "./exports/combined_data.sas7bdat") # creates a SAS output file
write_dta(combined_data, "./exports/combined_data.dta")      # creates a Stata output file

If we want to export our data to Excel, we need to use a different package. There are multiple choices for accomplishing this task; one of them is the write_xlsx function in the writexl package (see https://cran.r-project.org/package=writexl):

write_xlsx(combined_data, path = "./exports/combined_data.xlsx")


Conclusion

One of the things I like most about R is its ability to be a “Swiss Army knife” for reading, manipulating, combining, and exporting data. In this article, we’ve seen how R can interact with several common data formats. Even if your IR office uses other software packages or file formats, R is a useful tool for importing and exporting your data to ensure compatibility across your entire workflow.

If you have a free RStudio Cloud account and would like to see the code and input files used for this Tech Tip, you may access them at https://rstudio.cloud/project/1976247.