Useful Options in the SAS® Sort Procedure

This Tech Tip was provided by Onecia Gibson, Assistant Director Research and Analysis for the Office of Institutional Research, University of Kentucky.

The SAS® sort procedure is used to sort SAS® databases by specified variables.   This tech tip will illustrate two useful sorting options, nodupkeys and sortseq, and both will be presented using Base SAS® and SAS® Enterprise Guide.

 I. Sorting Data in SAS®

 Suppose you have an enrollment database and wish to sort the data by the students’ last and first names.  You might use the following code in Base SAS®. 

 proc sort data=enrollment;

    by Last_Name First_Name;

run;

To sort by last and first name in SAS® Enterprise Guide, open an SAS® database and choose Data . . . Sort Data.



 

In the sort window, highlight the variable “Last_Name,” click the right arrow, and choose “Sort by.”  Do the same for First_Name and click the “Run” button.



Below is the sorted database.



II. Removing Duplication Observations

Despite our best data manipulation skills, we sometimes find duplicate observations (records) in databases.  In the example above, notice that the student “Kyle Last_Name1” is duplicated.  An easy method by which to remove duplicate observations is by using the nodupkeys option in the Base SAS® sort procedure. 

proc sort data=enrollment nodupkeys;

    by _all_;

run;

Using the variable identifier _all_ with the nodupkeys option tells SAS® to remove the duplicate observations where all variables are identical to a previous observation.  If you need to know which observations were removed, use the dupout option. In the proc sort code below, the output dataset dups contains all the duplicate observations removed by the nodupkeys option. 

proc sort data=enrollment nodupkeys dupout=dups;

    by _all_;

run;

To remove duplicates in SAS® Enterprise Guide, use the following selections.   Choose “Options” from the menu on the left and then choose “Keep only one of each record which is entirely duplicated.”



After running the procedure, the student "Kyle Last_Name1" is no longer duplicated.

III. Linguistic Sorting

In the above example, note that one student has a last name beginning with a lower-case letter.  By default, SAS® sorts uppercase and lowercase letters separately, so the student with “last_Name3” has been placed at the end of the database.  This default method can provide confusing results when real last names such as “de Costa” are placed at the end of the database. 

To overcome this confusion, SAS® 9.2 (and higher versions) includes a linguistic collation option which sorts characters according to rules of the language specified in the current locale setting.  By using the option sortseq=linguistic with the sub-option strength=primary on the proc sort statement, one can easily sort databases by names where uppercase and lowercase letters are not sorted separately.   Below is an example of code in which the observation with last name “de Costa” will be placed alphabetically regardless of letter case.

proc sort data=enrollment sortseq=linguistic(strength=primary);

    by Last_Name First_Name;

run;

The linguistically sorted database is below.



Currently, the linguistic sort option is not available in SAS® Enterprise Guide.  SAS® technical support has opened a request to add the linguistic sort to Enterprise Guide in a future release.

Reference:  Base SAS® 9.3 Procedures Guide, (February 2012).

 

 AIR Comment Board

 
To add a comment, Sign In
Total Comments: 3
 
John posted on 3/28/2012 11:27 AM
Good tip! I hadn't know of the linguistic option. Now I'm curious, how is this different from UPCASE-ing the last name before doing the sort? Will I get the same result, or is the linguistic option doing something more?
Irene posted on 3/29/2012 10:47 AM
Good tips! The linguistic sorting feature is valuable for accurately sorting hand-entered data which is more likely to vary in capitalization. Other applications include generating lists and labels for student services programs or when conducting mailed surveys that require ZIP code or city sorting.

Getting at unduplicated counts is critical in data processing, such as to find unduplicated lists and counts of students in at least one online course.
Onecia posted on 3/30/2012 10:11 AM
John, I believe the upcase function would produce the same results. I see the advantage of the linguistic sort method being that an upcased version of the variable (or new variable) would not need to be created.

Past Tech Tips

Using Graphics More Effectively in Word

Creating a TOC in MIcrosoft Word

Auto Grouping in Excel PDF

Use SPSS to Test the Difference in Two Independent Proportions PDF

Automate Import Process in Access PDF

Hide Buttons in Excel PivotCharts PDF