Row-Level Data Containing Duplicate ID’s

​Michael Johnston, Director of Institutional Research, Pensacola State College

There are circumstances where datasets have variables or combinations of variables that are not completely unique. Unique identifiers are expected to exist in duplicate in some datasets; however, they can make aggregation and calculations very difficult, especially when other values conflict. Most statistical software packages include a non-duplicate count function allowing a unique identifier to count only once. However, in some circumstances a row of data will be removed when a duplicate exists, eliminating differences between outcomes. Often, this type of situation is seen in course record lists.  

2018may1.PNG

We see in the above list, there are three distinct unique identifiers across 9 rows of data. Questions are often asked about the data specific to unique identifiers, which in this case exists across rows. Suppose this question was asked: “How many students are taking at least one math course, indicated by an enrollment of a course with prefix MAC, MAT, or MAP?” The analysis may prove to be difficult because the calculation must be written to aggregate vertically controlling for the Unique Identifier.

2018may2.PNG

The yellow highlighted courses in the above list are the math courses as described in the question. If a count calculation was applied to courses, the data would identify 1 MAC1105, 1 MAT1033, and 1 MAP2206 enrollment, totaling 3 math enrollments. However, student 9658 is enrolled in two math courses and should not be counted twice (per the question asking for at least one). Student 9749 is enrolled in one math course (the student should be counted once) and student 0100 is not enrolled in any math course (the student should not be counted at all). An option to transpose the data such that each unique identifier exists in only one row is an option, but that process is very cumbersome and time consuming.

An option to aggregate vertically controlling for one common identifier exists within Tableau using the set calculation. The set calculation will allow users to group unique identifiers based on an outcome. The process is as follows: first, import the data into Tableau, specifically with the Unique Identifier as a string variable (it may or may not default to a string variable).

2018may3.PNG

Second, a calculation must be written to extract the prefix of the courses (at least in this example, a dataset may already have the course prefix identified), therefore, create a calculated field to extract the course prefix. In order to write a calculation, open the data source pane and left click on the downward arrow on the upper right corner of the Unique Identifier variable (highlighted in yellow below). This downward arrow can be found (and used) on all variables in the data source pane, but for this example the arrow in Unique Identifier will be used.

2018may4.PNG
Once this downward arrow is identified, select “Create A Calculated Field…” and use the formula: left([Course],3). This will extract the first three characters beginning from the left in the Course variable.

2018may5.png
The new variable, in this example, is labeled Course Prefix and the formula will resemble the image below.

2018may6.png

The data will have a new variable with the first three characters of the Course identified. Note in the image below, Tableau identified the new variable as a calculation above the variable name Course Prefix.

2018may7.png

Third, create a set using the following formula:

2018may8.png

This formula will assign a value of 1 for every math prefix that exists and a value of 0 otherwise. The max statement will allow a count distinct function to group the students who took at least one math course separate from those that did not take a math course.

Many Unique Identifiers will consist of both numbers and letters. In this example, the Unique Identifier is four numbers that have no numerical value therefore it should be a string variable. However, Tableau will probably recognize them (in this circumstance) as a Measure (when in fact it is not). Therefore, confirm that the Unique Identifier is a dimension (if the variable is not a dimension, drag the Unique Identifier pill from measures to dimension). Once confirmed, drag the Unique Identifier into rows and convert the pill to a distinct count.

 

2018may9.png

 

A distinct count of the unique identifier will not occur automatically. In order to determine a distinct count of the Unique Identifier, drag the Unique Identifier pill to the rows shelf. Once the Unique Identifier is placed in the rows shelf, right click on the pill, select measure, and then select Count (Distinct). Once completed successfully the color of the Unique Identifier pill will change from blue to green.

2018may10.png
This will guarantee that each unique identifier only counts once. Next, drag the Math Enrollment Set into Columns as shown below.

2018may11.png
In the visualization, there are 2 students that categorize as “In” and then 1 student that categorizes as “Out”. This parallels the 2 students that are enrolled in at least one math course and the 1 student that was not taking any math courses. This prevents the student taking two math courses from being counted twice giving an accurate answer to the question asked.

In conclusion, circumstances where non-unique row-level data exists do not need to be linearized horizontally where each row is unique. The set calculation allows the user to control for a unique variable while performing analysis aggregating outcome variables that are different.

 

 Comments

 
To add a comment, Sign In
There are no comments.