This Tech Tip was provided by Onecia Gibson, Assistant Director Research and Analysis for the Office of Institutional Research, University of Kentucky.
The SAS® sort procedure is used to sort SAS® databases by specified variables. This tech tip will illustrate two useful sorting options, nodupkeys and sortseq, and both will be presented using Base SAS® and SAS® Enterprise Guide.
I. Sorting Data in SAS®
Suppose you have an enrollment database and wish to sort the data by the students’ last and first names. You might use the following code in Base SAS®.
proc sort data=enrollment;
by Last_Name First_Name;
To sort by last and first name in SAS® Enterprise Guide, open an SAS® database and choose Data . . . Sort Data.
In the sort window, highlight the variable “Last_Name,” click the right arrow, and choose “Sort by.” Do the same for First_Name and click the “Run” button.
Below is the sorted database.
II. Removing Duplication Observations
Despite our best data manipulation skills, we sometimes find duplicate observations (records) in databases. In the example above, notice that the student “Kyle Last_Name1” is duplicated. An easy method by which to remove duplicate observations is by using the nodupkeys option in the Base SAS® sort procedure.
proc sort data=enrollment nodupkeys;
Using the variable identifier _all_ with the nodupkeys option tells SAS® to remove the duplicate observations where all variables are identical to a previous observation. If you need to know which observations were removed, use the dupout option. In the proc sort code below, the output dataset dups contains all the duplicate observations removed by the nodupkeys option.
proc sort data=enrollment nodupkeys dupout=dups;
To remove duplicates in SAS® Enterprise Guide, use the following selections. Choose “Options” from the menu on the left and then choose “Keep only one of each record which is entirely duplicated.”
After running the procedure, the student "Kyle Last_Name1" is no longer duplicated.
III. Linguistic Sorting
In the above example, note that one student has a last name beginning with a lower-case letter. By default, SAS® sorts uppercase and lowercase letters separately, so the student with “last_Name3” has been placed at the end of the database. This default method can provide confusing results when real last names such as “de Costa” are placed at the end of the database.
To overcome this confusion, SAS® 9.2 (and higher versions) includes a linguistic collation option which sorts characters according to rules of the language specified in the current locale setting. By using the option sortseq=linguistic with the sub-option strength=primary on the proc sort statement, one can easily sort databases by names where uppercase and lowercase letters are not sorted separately. Below is an example of code in which the observation with last name “de Costa” will be placed alphabetically regardless of letter case.
proc sort data=enrollment sortseq=linguistic(strength=primary);
by Last_Name First_Name;
The linguistically sorted database is below.
Currently, the linguistic sort option is not available in SAS® Enterprise Guide. SAS® technical support has opened a request to add the linguistic sort to Enterprise Guide in a future release.
Reference: Base SAS® 9.3 Procedures Guide, (February 2012).
Past Tech Tips
Using Graphics More Effectively in Word
Creating a TOC in MIcrosoft Word
Auto Grouping in Excel PDF
Use SPSS to Test the Difference in Two Independent Proportions PDF
Automate Import Process in Access PDF
Hide Buttons in Excel PivotCharts PDF