SPSS: Using LAG to Identify/Remove Duplicates

By Dale Amburgey, Assistant Director, Institutional Research, Embry-Riddle Aeronautical University
 
​This tip outlines how to use the LAG function in SPSS syntax to identify and remove duplicates.
 
Seeing as most of the eAIR Tech Tips are related to the usage of Excel, I thought I’d take a different path with this tip. Mastery of SPSS syntax has long been a personal goal of mine because of the inherent analytical possibilities. Using SPSS syntax allows for customization of an analysis, the ability to document a process, and consistency in output.  This tip utilizes the SPSS LAG function to assist in the identification and removal of duplicates in a data set. I like this approach because I can insert it into any of my SPSS syntax to assist with duplicate identification.
 
We’ll work with a sample list of names and ID numbers with the goal of identifying and removing the duplicate records.  First, the data is imported into SPSS from an Excel worksheet. Then we open a syntax window by selecting the following path: File > New > Syntax (Figure 1).
 1SPSS.JPGFigure 1
 
The next step is to sort the data in ascending order by the ID number. In the SPSS syntax window, we will add the documentation first and then use the following syntax command (Figure 2):
 
SORT CASES ID (A).
EXECUTE.
 
After the command is entered, you may either highlight the command or click the cursor in any part of the command and select Run Selection (the green triangle icon).
 
2SPSS.JPG
Figure 2

3SPSS.JPG

Figure 3

Now, the data is sorted in an ascending order by ID number, and we can see that there are several duplicate entries (Figure 3).Creating the LAG variable returns a value associated with the previous case. In this example, you want to return the ID number for the previous case in our selection in a new variable called Previous_ID. In the syntax window, enter the following command and then click Run Selection (Figure 4):

COMPUTE Previous_ID = LAG(ID).
EXECUTE.

4SPSS.JPG

Figure 4

5SPSS.JPG
Figure 5

In the output, you will notice that the ID number for each previous case is now listed. Also, notice how the LAG function omits the previous value for the first case in your data set. We will make accommodations for the missing data in the next step (Figure 5).

We can use the ID number and the Previous_ID together to determine if a duplicate value exists because the two values should not be equal. We will create a new variable called Duplicate by using the following command and clicking Run Selection (Figure 6).

COMPUTE Duplicate = ID= Previous_ID.
IF MISSING (Duplicate) Duplicate = 0.
EXECUTE.

6SPSS.JPG
Figure 6

7SPSS.JPG
Figure 7

You will notice that we added a MISSING statement to the command that allows us to accommodate for the missing Previous_ID in the first case by calculating a 0 value. Now, all of the cases have a duplicate indicator of 1/0 that we can use to remove duplicated cases (Figure 7).

The last step is to remove all the duplicate cases so that we are left with only the unique values in our data set. To do so, we will use the following command syntax and click Run Selection (Figure 8).

SELECT IF Duplicate = 0.
EXECUTE.

8SPSS.JPG
Figure 8

9SPSS.JPG
Figure 9

As we can see in Figure 9, the unique values contained within the data set are all that remain. This example removed all of the duplicate values, but there are other syntax commands that would allow you to filter the results instead of removing data. Your individual needs would dictate the method you should follow.

This is one of numerous ways to identify and remove duplicates within SPSS. Hopefully, you will use this example as a catalyst to involve SPSS syntax to a greater degree within your data analysis.

 

 Comments

 
To add a comment, Sign In
Total Comments: 5
 
Jonathan posted on 4/14/2016 2:27 PM
It is nice to see another SPSS user - thanks for the tip. SPSS also has an "Identify Duplicate Cases" function that allows you to do the same thing (available under "Data" on the toolbar). One can plug the information in the dialog box (enter the ID in the "define matching cases by" box) and then paste the syntax. The syntax is longer than your example, but I just copy it from other syntax files and make changes when necessary.
Eileen posted on 4/14/2016 3:08 PM
I agree -- love to see SPSS users. I'm glad to see more methods to remove duplicates. I've always used the Restructure Data feature under the Data menu in which you can create a single case from multiple variables, or conversely make multiple cases from a single variable. An example of use would be if you had a file of multiple course records per student and wanted to see the records by student. The Restructure Wizard walks you through, but the simple syntax is

SORT CASES BY SCHOOLID .
CASESTOVARS
/ID=SCHOOLID
/GROUPBY=VARIABLE.

Where SCHOOLID is the variable for which you want to remove duplicates.
Alex posted on 4/16/2016 2:43 PM
Agreed on nice to see SPSS users. Wouldn't nearly be as effective and efficient without being able to use SPSS with an ODBC connection to our Student Information System. I wonder how many IR departments use SPSS or similar tools to directly pull their data. I utilize the identify duplicate cases function Jonathan mentioned all the time. This is however a great example and is given me some additional ideas on how to use the LAG function.
Emily posted on 8/4/2016 6:58 PM
I agree-- it's so nice to see other SPSS Syntax users! I utilize the following syntax to identify duplicate cases:

Sort Cases by ID.
Match Files file = *
/first = first /by ID.
Execute.

In this case "ID" is the variable for which you are identifying duplicates. You may also add more than one variable for which you would like to identify duplicates (e.g., ID term).
Komla posted on 1/21/2017 5:29 PM
Thanks for exposing me to yet another flexible function within SPSS. My problem is not one of eliminating variables but to lag my independent variables in a predictive regression model. I am modeling cross-sectional panel data involving several firms across ten years of financial statement information. Without physically shifting all rows of data downward, how would I use the lag function to lag the IVs within a multilevel regression model? Is this an unreasonable expectation?

Thanks.