Importing/Exporting Fixed-Width Data with R

​By Blake Madden, Institutional Data Analyst, Edison State Community College

A common format used for reporting data is fixed-width columns. In this format, columns are set to a specific width, with no delimiters between them. This is an optimal way to store data, but can prove to be difficult when importing and exporting it. For example, most spreadsheet applications (e.g., Microsoft® Excel) offer fixed-width importing, but using this feature is tedious and prone to errors. Likewise, exporting tabular data into columns of specific widths requires truncating and padding each column—an equally arduous process.

In this article, we will discuss more elegant solutions which use R to entirely automate these processes.

Importing Fixed-Width Data

As an example, we will import some fictitious data similar to what we may receive from a job-services agency. (Data such as this may be cross-referenced with students who have graduated to track success in their respective fields.) We will parse its columns into tabular data, remove columns that we don’t need, reformat the data, and finally rearrange the columns.

Before delving into any R code, let’s first review how to import this type of data. The steps for importing fixed-width data are as follows:

  1. Understand the file specification

  2. Decide which columns to import (and which to ignore)

  3. Review any idiosyncrasies that may occur in the data

  4. Plan how to transform and reformat the data

  5. Decide how to reorder the columns (this may be optional)

In our example, the input file will appear as such:

tt2nov2018.png

This data’s format specification is:

  • Column 1: 6 spaces (a placeholder)
  • Column 2: the person’s social security number (9 characters)
  • Column 3: the person’s last name, which will be 10 characters
  • Column 4: the person’s first name, which will be 10 characters
  • Column 5: the company that the person currently works for (25 characters)
  • Column 6: the letter ‘Y’ (a placeholder/control character)
  • Column 7: the year that he/she started working there (8 characters)
  • Columns 8 through 11: his/her pay for quarters 4 through 1 for the last year (6 characters, each)
  • Column 12: the letter ‘X’ (a placeholder/control character)

At this point, we will decide which columns we need to import, as well as how to import them. This is the import logic that we will use:

  • Column 1: skip this
  • Columns 2 through 5: import these as text (also, we will combine the person’s first and last name into one column)
  • Column 6: skip this
  • Column 7: import as a date
  • Columns 8 through 11: import as integers (something to note is that if no income is available, then these columns are filled with "XXXXXX")
  • Column 12: skip this

Now that we have our specifications, we will began coding it in R. The first step is to define the required libraries:

suppressMessages(library(readr))     #used for read_fwf function
suppressMessages(
library(lubridate)) #used for reformatting dates
suppressMessages(
library(dplyr))     #data transformation functions
suppressMessages(
library(magrittr))  #used for pipe (data assignment) operations
suppressMessages(
library(stringr))   #string manipulation functions

Next, we will import our data from "C:/data/EmploymentData.txt" with the read_fwf() function (R: Read Fixed Width Format Files). Within this function, we name the columns and specify how many characters they are. After importing the data, it will be assigned to a dataset named EmploymentData:

EmploymentData <- read_fwf(file = "C:/data/EmploymentData.txt",
                           col_positions = fwf_cols(
                               placeholder =
6,
                               SSN =
9,
                               LastName =
10,
                               FirstName =
10,
                               Company =
25,
                               ctrl_char =
1,
                               DateOfHire =
8,
                               Q4_Pay =
6,
                               Q3_Pay =
6,
                               Q2_Pay =
6,
                               Q1_Pay =
6,
                               ctrl_char2 =
1) ,
                            na=
"XXXXXX"
)

Note that the final step in our call to read_fwf() involves dealing with missing data in our quarterly payment columns. According to the file’s specification, missing data in these columns will be written as XXXXXX. To handle this, the argument na="XXXXXX" instructs R to treat these strings as missing data.

At this point, we will have our employment data; however, we may still need to apply some additional transformations to get the data into a format that we can work with. The first step to transforming data after importing it is to “pipe” (i.e., reroute) the dataset to a series of downstream functions. To pipe our data, add a %>% after the call to read_fwf():


na=
"XXXXXX") %>%

Now, we can add our various post-import functions to edit the data. In R, a useful method for transforming and creating columns is the mutate() function (Add New Variables). (In some ways, mutate() can be thought of as a spreadsheet formula in Excel.) Our first operation will be to convert the hire-date column from a string to an actual date:

#convert string to yyyy-mm-dd date
mutate(DateOfHire = ymd(DateOfHire)) %>%

Note that we included another %>% after this function call, meaning that the output from this operation will be rerouted to yet another transformation function.

Because we imported the alphanumeric columns using specific column widths, some of our text values (e.g., company names) may have unnecessary whitespace after them. To remove this, we can apply the str_trim() function (Trim whitespace from a string) to each of these columns:

#trim whitespace from columns
mutate(LastName=str_trim(LastName), FirstName=str_trim(FirstName),
           Company = str_trim(Company)) %>%

The next change to make is to combine the individuals’ first and last names into one column.

#create a new column, Name, formatted as "Last, First"
mutate(Name = sprintf("%s, %s", LastName, FirstName)) %>%

Finally, remove the LastName and FirstName columns (since we now have a Name column to include this information), as well as rearrange the quarterly payment columns into a more logical order. The select() function (Select/rename variables by name) enables us to pick which columns to include in our final output, as well as their order:

#select which columns to keep, and reorder them
dplyr::select(SSN, Name, Company, DateOfHire,
           Q1_Pay, Q2_Pay, Q3_Pay, Q4_Pay)

In the above code, we have requested only the columns that we want in the final dataset, as well as rearranging the payment columns in a Q1…Q4 order.

Finally, we will have our employment dataset. This data can now be viewed by either entering its name (EmploymentData) or by calling View(EmploymentData) (if you are using RStudio).

At this point, we can export this data to a tab-delimited file by using the write.table() function (R: Data Output).

#tab-delimited, non-quoted output
write.table(EmploymentData,
"C:/data/EmploymentDataTabbed.txt",
                sep=
"\t", quote=FALSE, row.names = FALSE, na = "")

By creating this R script, we will now be able to completely automate importing any future fixed-width data (in this particular format) and exporting it to a tab-delimited file.

The full R script is as follows:

suppressMessages(library(readr))     #used for read_fwf function
suppressMessages(library(lubridate)) #used for reformatting dates
suppressMessages(library(dplyr))     #data transformation functions
suppressMessages(library(magrittr))  #used for pipe (data assignment) operations
suppressMessages(library(stringr))   #string manipulation functions
# data prior to reading-in:
#      555555551Dexter    Charles   SELF-EMPLOYED            Y20160701012000011200010925009450X
#      555555552Dyer      William   Miskatonic University    Y20140822023100021500019930018500X
#      555555553West      Herbert   Miskatonic University    Y20040605XXXXXXXXXXXXXXXXXXXXXXXXX
EmploymentData <- read_fwf(file = "C:/data/EmploymentData.txt",
                           col_positions = fwf_cols(
                               placeholder =
6,
                               SSN =
9,
                               LastName =
10,
                               FirstName =
10,
                               Company =
25,
                               ctrl_char =
1,
                               DateOfHire =
8,
                               Q4_Pay =
6,
                               Q3_Pay =
6,
                               Q2_Pay =
6,
                               Q1_Pay =
6,
                               ctrl_char2 =
1),
                               na=
"XXXXXX") %>%
                    
#convert string to yyyy-mm-dd date
                    mutate(DateOfHire = ymd(DateOfHire)) %>%
                   
#trim whitespace from columns
                    mutate(LastName=str_trim(LastName), FirstName=str_trim(FirstName),
                           Company = str_trim(Company)) %>%
                    #create a new column, Name, formatted as "Last, First"
                    mutate(Name = sprintf(
"%s, %s", LastName, FirstName)) %>%
                   
#select which columns to keep, and reorder them
                    dplyr::select(SSN, Name, Company, DateOfHire,
                           Q1_Pay, Q2_Pay, Q3_Pay, Q4_Pay)
#tab-delimited, non-quoted output
write.table(EmploymentData, "C:/data/EmploymentDataTabbed.txt",
                    sep=
"\t", quote=FALSE, row.names = FALSE, na = "")
# output will appear as such:
#SSN    Name    Company DateOfHire  Q1_Pay  Q2_Pay  Q3_Pay  Q4_Pay
#555555551  Dexter, Charles SELF-EMPLOYED   2016-07-01  009450  010925  011200  012000
#555555552  Dyer, William   Miskatonic University   2014-08-22  018500  019930  021500  023100
#555555553  West, Herbert   Miskatonic University   2004-06-05

Exporting Fixed-Width Data

Along with importing fixed-width data, sometimes we may need to convert our data into this format as well. For example, state agencies often require data reports to be uploaded as fixed-width data.

For exporting tabular data into fixed-width format, the major issue to deal with is ensuring that each column is the proper width. This will involve:

  1. Truncating text that is too long for its column

  2. Whitespace (or zero) padding data so that it fills its entire column (this will also require knowing which side of the data should be padded)

Consider the following tab-delimited data:

tt1nov2018.png

Our goal will be to import this and then export it with the following specification:

     1.  A header row which consists of these columns:

  • a.   The name of our college (20 characters, should be right padded with spaces)
  • b  The report year (4 characters)
  • c.   The report semester code (2 characters)
  • d.   The number of records in the report (a number, 4 characters wide, should be filled with leading zeros)

     2.  The rows of each section taught (from our tabular data), consisting of these columns:

  • a.   The name of the course (30 characters, should be right padded with spaces)
  • b.   The number of sections taught (a number, 2 characters wide, should be filled with leading zeros)
  • c.   The credits earned from each section (a number, 4 characters wide, should be filled with leading zeros)

First, define the information used for the header:

AcademicYear <- 2018        #Report year
ReportingSemester <- "AU"   #Report semester (autumn)
College <- "INNSMOUTH COLLEGE"

Next, import the tab-delimited data using the read.delim() function:

CoursesTaught <- read.delim("C:/data/SchoolInput.txt")

Now, format the header that will be the first line of our output file:

Header <- sprintf("%s%d%s%04d", str_pad(College,20,"right"),
                  AcademicYear, ReportingSemester, nrow(CoursesTaught))

Here we are using the sprintf() function (R: Use C-Style String Formatting Commands) to format text and numbers into a single line of text, as well as controlling how some of this data are printed. sprintf() takes a format string as its first argument, where:

  • %s indicates a placeholder for a string

  • %d indicates a placeholder for an integer

  • %04d indicates a placeholder for an integer that needs to be 4 characters wide (and will be left-filled with zeros if shorter than that)

In the case of our format string "%s%d%s%04d", sprintf() will expect us to provide a string, integer, string, and a final integer. The first string that we provide is the college name, which will be right padded with spaces to a length of 20 using the str_pad() function. After that, we provide our AcademicYear variable (an integer), our ReportingSemester variable (a string), and finally the number of courses in the data that we imported. Something to note about this last argument is that we get the number of records from our data by calling the nrow() function on our dataset. Also, the %04d syntax in our format string will ensure that it is 4-characters wide and filled with leading zeros.

Our next step will be to format the rest of the data into a single array of text and then combine it with our header row. Although our dataset consists of 3 columns, our goal is to combine each row into a single string (where each column has been properly padded and aligned). To do this, we will again use the sprintf() and str_pad() functions to accomplish this; however, instead of calling these functions on single variables, we will use them on entire columns. (An advantage of the R language is that many of its functions are vectorized, meaning that they can be ran on both individual values and arrays.)

Recall that we are formatting these rows to be a 30-character course name, a 2-character section count, and a 4-character credit hour count. Hence, we will use the format string "%s%02d%04d" for sprintf(), and then provide the columns CourseName, SectionsTaught, and TotalCreditHours from our CoursesTaught dataset. Our call to sprintf() (which will convert our entire dataset to a text array) will appear like this:

sprintf("%s%02d%04d",
        
#pad the course name to 30 characters, and also  
         
#truncate it to 30 characters if too long
         str_trunc(str_pad(CoursesTaught$CourseName,
30,"right")
         ,
30,"right",ellipsis = ""),
         CoursesTaught$SectionsTaught, CoursesTaught$TotalCreditHours)

Next, combine the header row with the text data that sprintf() returns:

FinalReport <- c(Header,
                 sprintf(
"%s%02d%04d",
                 
#pad the course name to 30 characters, and also  
                  
#truncate it to 30 characters if too long
                  str_trunc(str_pad(CoursesTaught$CourseName,
30,"right")
                           ,
30,"right",ellipsis = ""),
                  CoursesTaught$SectionsTaught, CoursesTaught$TotalCreditHours))

Now that the header and data are in a fixed-column format, export it:

#convert text array to a 1-column dataset and export via write.table()
write.table(as.data.frame(FinalReport), "C:/data/FinalReport.txt",
            col.names =
FALSE, row.names = FALSE, quote = FALSE)

Our data are now properly formatted and ready to submit to our state agency. Also, we now have an R script to automate creating these reports for any future submissions.

The final script is as follows:

suppressMessages(library(stringr))   #string manipulation functions 
# data prior to reading-in:
#SectionsTaught CourseName  TotalCreditHours
#7  Marine Biology  195
#3  Hotel Management    54
AcademicYear <- 2018        #Report year
ReportingSemester <- "AU"   #Report semester (autumn)
College <- "INNSMOUTH COLLEGE"
CoursesTaught <- read.delim("C:/data/SchoolInput.txt")
Header <- sprintf("%s%d%s%04d", str_pad(College,20,"right"),
                  AcademicYear, ReportingSemester, nrow(CoursesTaught))
FinalReport <- c(Header,
                 sprintf(
"%s%02d%04d",
                 
#pad the course name to 30 characters, and also  
                  
#truncate it to 30 characters if too long
                  str_trunc(str_pad(CoursesTaught$CourseName,
30,"right")
                  ,
30,"right",ellipsis = ""),
                  CoursesTaught$SectionsTaught, CoursesTaught$TotalCreditHours))
#convert text array to a 1-column dataset and export via write.table()
write.table(as.data.frame(FinalReport), "C:/data/FinalReport.txt",
            col.names =
FALSE, row.names = FALSE, quote = FALSE)

# output will appear as such:
#INNSMOUTH COLLEGE   2018AU0002
#Marine Biology                070195
#Hotel Management              030054
 

 

 

 Comments

 
To add a comment, Sign In
Total Comments: 1
 
Gary posted on 11/28/2018 6:01 PM
Yay! Nice tip. Reading, writing, and modifying these types of files is an important skill in IR. Thanks for showing us how you accomplish it with R.