Combining Data from Multiple Sources Using R

Gathering the Data

A common request of our institutional research office is to combine data from multiple sources and varying formats. For example, we keep our demographic and academic information about our students in an SQL database, but another office on campus might keep track of their student program signups in Excel. Furthermore, maybe we just received some raw survey data in SPSS format. How can we make these three very different data formats work together?

There are a variety of tools that can accomplish this task, both free and commercial, but my favorite technique is to write a short R script. There are so many R packages for accessing data from different sources that I have yet to run into a file format that I can’t read using R. In this Tech Tip, we’re going to focus on three formats commonly used in IR work: an SQL database, Excel, and SPSS.

Before we do anything else, we’ll load the five R packages we’re going to use (if you’ve never used them before, you’ll have to install them first). The following code accomplishes this:

library(DBI) # for connecting to databases library(readxl) # for reading Excel files library(writexl) # for creating new Excel files library(haven) # for reading files from SPSS, SAS, and Stata library(dplyr) # for combining the data sets

First, let’s query the SQL database. R has several packages for connecting to databases. In this example, we will use the DBI package (see https://db.rstudio.com/dbi/). The DBI package supports connecting to a variety of database types, both commercial and open source. You may need to consult with your IT department to determine the proper database driver and connection string to use. In this example, we are using a SQLite database for simplicity (see https://db.rstudio.com/databases/sqlite/).

In our database, there are two tables, one containing information about students, and one containing information about the majors offered at the institution. In the student table, rather than wasting space storing the full name of each student’s major, we only store the three-letter code. If we want to include the majors’ full titles for use in a table or graph, we need to combine the two tables using the field they have in common (the major code). In SQL terms, this is called a join.

# Establish a connection to our SQLite database con <- dbConnect(RSQLite::SQLite(), "./data/students.db") # Write our query using SQL syntax query <- "SELECT Students.stuid, Majors.majorname FROM Students INNER JOIN Majors ON Students.majorid = Majors.majorid" # Execute the query, fetch the results, and store them in a data frame students <- dbFetch(dbSendQuery(con, query))

The resulting data frame looks like this:

stuid	majorname
123	Mathematics
345	Computer Science
624	History
789	Biology
851	Computer Science
555	History
992	Biology
484	Music
717	English
839	Biology
481	Music
989	English
385	Mathematics
222	Computer Science

Next, let’s say that our student activities department uses Excel to maintain a list of students signed up for intramural athletics. Once they’ve shared that file with us, we can use the read_excel function found in the readxl package (see https://readxl.tidyverse.org/) to retrieve the data and store them in another data frame.

intramurals <- read_excel("./data/intramural_signups.xlsx")

That data frame looks like this:

studentid	sport
123	basketball
123	badminton
222	badminton
385	basketball
385	volleyball
481	basketball
555	volleyball
717	basketball
789	basketball
851	volleyball
851	badminton
992	badminton
992	volleyball

Similarly, let’s suppose we received raw data from a survey in SPSS format. The survey asked questions about students’ satisfaction with the on-campus cafeteria and includes responses for the variety and quality of the food served, and the cafeteria’s hours of operation. We can read the dataset using the haven package (https://haven.tidyverse.org/), like so:

cafeteria <- read_sav("./data/cafeteria_survey.sav")

Our third data frame looks like this:

studentid.	variety	quality	hours
123	3	4	2
345	4	4	2
385	1	1	2
481	2	3	3
624	2	3	4
851	5	4	1
992	4	5	3

Combining the Data

Now that we have data from all three sources, we want to combine them. Fortunately, the data frames contain a common variable: the student id number (although our students data frame calls the column stuid and the other two call it studentid). We want to join the data frames based on the common column, in much the same way as our SQL query combined the students and majors information with one important difference. The previous SQL query used an inner join, which only included rows of data that had matches in both tables. All of the students had majors, so we didn’t exclude anyone. However, if you compare the three data frames, not every student signed up for intramural athletics or took the survey and some did neither. If we want to include all students in our final dataset, we need to use a left join instead. This type of join will include every row from the dataset listed on the left-hand side of the join statement and include the matching rows from the dataset on the right-hand side of the statement if they exist.

Since we’re doing this in R, we don’t have to use SQL syntax (although if you really wanted to, you could use the sqldf package - see https://cran.r-project.org/package=sqldf). Instead, we’re going to use the left_join function in the dplyr package (see https://dplyr.tidyverse.org/reference/join.html). We specify that stuid and studentid are the two fields being matched. To merge three data frames, we use two consecutive left_join statements; the first joins the students and intramurals data frames, and the second joins the result of the first join with the cafeteria data frame.

# Merge all three data sets into one using the student ID number combined_data <- students %>% left_join(intramurals, by=c("stuid"="studentid")) %>% left_join(cafeteria, by=c("stuid"="studentid"))

Here is the output of our two left joins. The resulting data frame includes all students, with NA (missing) values if a student did not have a match in the intramurals or cafeteria data frames.

stuid	majorname	sport	variety	quality	hours
123	Mathematics	basketball	3	4	2
123	Mathematics	badminton	3	4	2
345	Computer Science	NA	4	4	2
624	History	NA	2	3	4
789	Biology	basketball	NA	NA	NA
851	Computer Science	volleyball	5	4	1
851	Computer Science	badminton	5	4	1
555	History	volleyball	NA	NA	NA
992	Biology	badminton	4	5	3
992	Biology	volleyball	4	5	3
484	Music	NA	NA	NA	NA
717	English	basketball	NA	NA	NA
839	Biology	NA	NA	NA	NA
481	Music	basketball	2	3	3
989	English	NA	NA	NA	NA
385	Mathematics	basketball	1	1	2
385	Mathematics	volleyball	1	1	2
222	Computer Science	badminton	NA	NA	NA

Exporting the Data

What happens next? That’s up to you! You can use student demographic information to predict the degree of interest in future intramural offerings, or you can explore whether members of a particular intramural sports team have a statistically significant difference in opinion about the cafeteria food compared to their peers.

Whatever research question you pursue, I would keep this combined dataset in R and use its programming commands to analyze and visualize the data, but your IR office may have a completely different workflow and prefer different tools. That’s OK! We can use the same R packages to export the data into other formats for use with other software packages.

# Write the merged results to a specific program's file format using functions in the haven package: write_sav(combined_data, "./exports/combined_data.sav") # creates an SPSS output file write_sas(combined_data, "./exports/combined_data.sas7bdat") # creates a SAS output file write_dta(combined_data, "./exports/combined_data.dta") # creates a Stata output file

If we want to export our data to Excel, we need to use a different package. There are multiple choices for accomplishing this task; one of them is the write_xlsx function in the writexl package (see https://cran.r-project.org/package=writexl):

write_xlsx(combined_data, path = "./exports/combined_data.xlsx")

Conclusion

One of the things I like most about R is its ability to be a “Swiss Army knife” for reading, manipulating, combining, and exporting data. In this article, we’ve seen how R can interact with several common data formats. Even if your IR office uses other software packages or file formats, R is a useful tool for importing and exporting your data to ensure compatibility across your entire workflow.

If you have a free RStudio Cloud account and would like to see the code and input files used for this Tech Tip, you may access them at https://rstudio.cloud/project/1976247.

Tech Tips

Combining Data from Multiple Sources Using R

Gathering the Data

Combining the Data

Exporting the Data

Conclusion

About eAIR

View Articles By Column

Subscribe

Contribute

Thank You!