Data Validation in R

Brendan J. Dugan, Research Analyst, Center for Postsecondary Research, Indiana University

Introduction

There are many ways that mistakes can be introduced when using data, leading to wasted time and much frustration. Using different recoded variables, drawing from different data sources, and dealing with a large number of cases, for instance, can allow errors to creep into reports and potentially go unnoticed.

This eAIR Tech Tip demonstrates how to take advantage of R for checking the accuracy of data in a raw format, such as CSV, SPSS, or Excel files, by comparing two separately produced data files. Those interested in learning how to use R for the first time should consider exploring this eAIR Tech Tip, the Resources section, or searching the web for introductions to R.

This process generally follows the same steps that someone might if checking two tables in Excel: comparing corresponding cells against each other, either visually or by logical tests, throughout the entire table, and then repeating this process for all the tables, to ensure they all match. However, this process quickly becomes burdensome, if not infeasible, if the tables are large, or there are many reports to check, or if one must do it often or on-the-fly. Using R, this process becomes easily scalable and adaptable across situations.

Importing Data

For the sake of demonstration, these files contain randomly generated data for several variables: sex, age, SAT math and verbal scores, and GPA. There are ten cases in each of the ten files, representing some generic group (schools, majors, etc.). The basic idea is that the cases in the one output file, named primary, ought to match the cases in the corresponding checksheet. We can see this is true for a few cases (step 3), but visually checking data becomes infeasible when there are many cases, variables, or reports to check.

1. Set up our directory using assignment; here we will need our specific folder. This establishes the directory where our data reside.

2. Make a list of files in specified folder and show them by reading everything in this particular directory with dir().

suppressMessages(library(R.utils))
suppressMessages(library(tidyverse))
suppressMessages(library(readxl))
myDir <- "./sample data/"

print(myFiles <- dir(myDir))
##  [1] "checksheet_1.xlsx"  "checksheet_10.xlsx" "checksheet_2.xlsx"
##  [4] "checksheet_3.xlsx"  "checksheet_4.xlsx"  "checksheet_5.xlsx"
##  [7] "checksheet_6.xlsx"  "checksheet_7.xlsx"  "checksheet_8.xlsx"
## [10] "checksheet_9.xlsx"  "primary_1.xlsx"     "primary_10.xlsx"  
## [13] "primary_2.xlsx"     "primary_3.xlsx"     "primary_4.xlsx"   
## [16] "primary_5.xlsx"     "primary_6.xlsx"     "primary_7.xlsx"   
## [19] "primary_8.xlsx"     "primary_9.xlsx"
3.           Preview a pair of files with head().
head(checksheet_1)
##   sex age SATmath SATreading   GPA
## 1   M  39     323        359 2.705
## 2   F  22     306        423 2.999
## 3   F  32     612        544 2.734
## 4   M  20     430        745 1.965
## 5   F  23     662        321 3.569
## 6   M  26     499        739 3.171
head(primary_1)
##   sex age SATmath SATreading   GPA
## 1   M  39     323        359 2.705
## 2   F  22     306        423 2.999
## 3   F  32     612        544 2.734
## 4   M  20     430        745 1.965
## 5   F  23     662        321 3.569
## 6   M  26     499        739 3.171

 

3. Split the list of files into checksheets and those to be checked by filtering using regular expressions and grep().

a)  A regular expression is a pattern of characters (letters, punctuation, etc.) describing string objects 1. The usefulness of regular expressions lies in that we can specify both a constant string (“checksheet_”) as well as a variable (between one and two digits, indicated by \\d{1,2}), and much more.

b)  grep() takes as its arguments a regular expression – here, a file name – and where to find it, and if we want the name or its index returned as output. (This is similar in ways to Excel’s INDEX()). This allows us to cycle through a list of numbered files with similar names and separate them accordingly, but would not be necessary if the files are already segregated in different folders.

checksheets <- grep("checksheet_\\d{1,2}.xlsx", myFiles, value = TRUE)
primary_files <- grep("primary_\\d{1,2}.xlsx", myFiles, value = TRUE)

# primary_files is a list filtered with grep():
primary_files
##  [1] "primary_1.xlsx"  "primary_10.xlsx" "primary_2.xlsx"
##  [4] "primary_3.xlsx"  "primary_4.xlsx"  "primary_5.xlsx"
##  [7] "primary_6.xlsx"  "primary_7.xlsx"  "primary_8.xlsx"
## [10] "primary_9.xlsx"

Comparing Data

1. Make a data frame (table) with three columns and the same number of rows as the number of files we need to check. We name the three columns as well. This table will eventually hold our results.

nfiles <- 1:length(checksheets)
results <- data.frame(nfiles, nfiles, nfiles)
colnames(results) <- c("Checksheet","Primary File","% Match")


Now we can start comparing files for each pair by iterating over the lists of files, i.e., performing the same set of tasks for the first pair, the second, and so on.

2. Read in the Excel files, referring to them in our directory and in their respective lists.

3. Round GPA to the second decimal position, in case negligible differences exist; this is equivalent to using Excel’s ROUND(x, 2) and allows the following test to evaluate correctly. Notice that we can reference columns by either their indexed location (data[row, column]) or by their names (data["column name"]).

4. Check the data, one table against another. The number of matches should equal the product of the dimensions of table. (Here, a and b are just scratch variables held in memory).

5. Identify the files that do not match by conditionally subsetting them.

6. Print a message with some diagnostic details when the loop is finished checking.

for (i in 1:length(checksheets)) {

 
# load the Excel data into R. paste0() concatenates the folder and filename for the given iteration.
  fileCheck <- read_excel(paste0(myDir, checksheets[i]))
  filePrime <- read_excel(paste0(myDir, primary_files[i]))

 
# Round columns 2 to 5, or age, SATMath, SATReading, and GPA.
  fileCheck[,2:5] <- round(fileCheck[,2:5], 2)
  filePrime[,c("age", "SATmath", "SATreading", "GPA")] <- round(
    filePrime[,c("age", "SATmath", "SATreading", "GPA")], 2)

  a <- sum(fileCheck == filePrime)
# summing the number of matching cells
  b <- (nrow(fileCheck) * ncol(filePrime))
# the area of the table

 
# Write results in the corresponding row.
  results[i,"Checksheet"] <- checksheets[i]
# name of the check file
  results[i,"Primary File"] <- primary_files[i]
# name of the file to check
  results[i,"% Match"] <- a / b * 100
# the % of matching cells
  mismatches <- results[which(results$`% Match` < 100),]

}

print(paste0("FINISHED -- ",
             sum(results$`% Match`) / nrow(results),
             "% match."))
## [1] "FINISHED -- 96.6% match."
if (nrow(mismatches) > 0) {
  print(mismatches)
}
##          Checksheet   Primary File % Match
## 4 checksheet_3.xlsx primary_3.xlsx      82
## 8 checksheet_7.xlsx primary_7.xlsx      84

Troubleshooting Errors

Voilà! Files 3 and 7 do not match their checksheets, so we can investigate the data further and resolve the issues.

1. Read in data we know to be amiss.

2. Compare contents of cells in our tables. Here, it’s clear that GPA doesn’t match, and we’ll need to determine why.

3. Repeat for file 7.

checksheet_3 <- read_excel(paste0(myDir, "checksheet_3.xlsx"))
primary_3 <- read_excel(paste0(myDir, "primary_3.xlsx"))
checksheet_3 == primary_3
##        sex  age SATmath SATreading   GPA
##  [1,] TRUE TRUE    TRUE       TRUE FALSE
##  [2,] TRUE TRUE    TRUE       TRUE FALSE
##  [3,] TRUE TRUE    TRUE       TRUE FALSE
##  [4,] TRUE TRUE    TRUE       TRUE FALSE
##  [5,] TRUE TRUE    TRUE       TRUE FALSE
##  [6,] TRUE TRUE    TRUE       TRUE FALSE
##  [7,] TRUE TRUE    TRUE       TRUE FALSE
##  [8,] TRUE TRUE    TRUE       TRUE FALSE
##  [9,] TRUE TRUE    TRUE       TRUE  TRUE
## [10,] TRUE TRUE    TRUE       TRUE FALSE
head(checksheet_3)
## # A tibble: 6 x 5
##     sex   age SATmath SATreading   GPA
##   <chr> <dbl>   <dbl>      <dbl> <dbl>
## 1     M    22     507        300 2.448
## 2     F    17     503        685 2.433
## 3     F    20     520        431 1.925
## 4     M    19     534        396 2.633
## 5     F    22     721        561 2.146
## 6     M    35     698        563 2.341
head(primary_3)
## # A tibble: 6 x 5
##     sex   age SATmath SATreading   GPA
##   <chr> <dbl>   <dbl>      <dbl> <dbl>
## 1     M    22     507        300 1.925
## 2     F    17     503        685 2.005
## 3     F    20     520        431 2.019
## 4     M    19     534        396 2.146
## 5     F    22     721        561 2.341
## 6     M    35     698        563 2.433

Conclusion

This method can be extended to verify the accuracy of reports in other situations as well. For example, one could compare data over time by identifying relatively large differences between semesters, academic years, or the last data extract, rather than checking if the values simply match. One could also compare data housed in a database against other records to ensure validity.

If a table exists in a heavily formatted Excel file (i.e., with text, images, tables, and charts for printing), the read_excel() function can read data from a specified Excel-style cell range:

  cashFlowData <- read_excel("C:/.../sample.xlsx",
    range = "My Report!C31:N33",
    col_names = FALSE)

grep() can be modified to incoporate 4- or 6-digit CIP codes: e.g., grep("45.\\d{2,4}.csv", x, value = TRUE) will yield all Social Sciences CSV files in some list of files x.

Ensuring data and reporting accuracy is a common duty of IR and IE professionals. With R, this can become much easier, scalable to large tasks, reliably reproducible, and adaptable for different circumstances.

Resources

 

 Comments

 
To add a comment, Sign In
Total Comments: 1
 
Gary posted on 2/7/2018 1:37 PM
Interesting tip - thanks for the contribution!