SAS Notes
Compiled by : Wilson Suraweera @ CGHR
- Contents
- Random Numbers in SAS
- Data Sub-setting
- Selection of two Random Samples
- Identification
and Separation of duplicate records in SAS dataset
-
Reshape Long list variable dataset to
Wide list dataset
- Missing values replacement with Means of the
respective Variables
- Logistic Regression with SAS
- SAS made
easy using Proc SQL
- SAS PROC SQL procedure to access external ODBC
data sources
- SAS String Data Handling
- SAS Missing value arithmetic's
-
Update_a_Table_using_Another_Table_in_SAS
-
Text analysis - an Epidemiological Case Study by WS - SAS Institute HUG -01 April2011
-
- External Resources
-
SAS Knowledge Base - Glossary of SAS Procedures from SAS.com
-
SAS resources from
- UCLA
- Logistic regression
-
Survival Analysis : Usage of Proc LifeTest and Proc PHREG
- SAS Dinosaur - Old and New way
of SAS programming
- Paul Dicman's Web Page for SAS- This little old discusses SAS 8, but useful
- Global Statements Dictionary -
Alphabetical listing and Description of SAS Key words
- SAS Study Blog
- SAS Canada - User Groups
-
/* Copy these codes and make your own program */
Identification and separation of duplicate records in a SAS
dataset
Objective:
This example illustrate:
(a) Test whether data has duplicate records
(b) Separation of duplicate and non-duplicate data
(c) Elimination of all duplicate records
Method 1: Elimination of duplicate records attributed to a key variable, using "first" or "last" options
Description:
The variable 'id_no' in dataset 'test' uniquely identifies the individuals in this data.
Assume accidentally id_no=17 and id_no=18 have repeated in this data.
(1) Read data into SAS dataset
Data Test;
INPUT id_no sex $ age inc r1 r2 r3 ;
DATALINES;
1 F 35 17 7 2 2
17 M 50 14 5 5 3
33 F 45 6 7 2 7
17 M 50 14 5 5 3
65 F 52 9 4 7 7
81 M 44 11 7 7 7
18 M 34 17 6 5 3
18 M 34 17 6 5 2
34 F 47 6 6 5 6
18 M 34 17 6 5 3
;
run;
(2) Sort data by variable 'id_no' ;
proc sort data=test;
by id_no;
run;
proc print data=test; run;
Print Output:
Obs id_no sex age inc r1 r2 r3
1 1 F 35 17 7 2 2
2 17 M 50 14 5 5 3
3 17 M 50 14 5 5 3
4 18 M 34 17 6 5 3
5 18 M 34 17 6 5 3
6 18 M 34 17 6 5 2
7 33 F 45 6 7 2 7
8 34 F 47 6 6 5 6
9 65 F 52 9 4 7 7
10 81 M 44 11 7 7 7
;
(3) Separate data into 2 datasets. Non-repeating data output to "Norepeat" dataset and repeating more than one time data output to "Repeat";
data Norepeat Repeat;
set test;
by id_no;
if first.id_no then output Norepeat;
else output Repeat;
run;
Proc print data=Norepeat; run;
Proc print data=Repeat; run;
Output:
dataset=Norepeat
Obs id_no sex age inc r1 r2 r3
1 1 F 35 17 7 2 2
2 17 M 50 14 5 5 3
3 18 M 34 17 6 5 3
4 33 F 45 6 7 2 7
5 34 F 47 6 6 5 6
6 65 F 52 9 4 7 7
7 81 M 44 11 7 7 7
dataset=Repeat
Obs id_no sex age inc r1 r2 r3
1 17 M 50 14 5 5 3
2 18 M 34 17 6 5 2
3 18 M 34 17 6 5 3
Method 2: Using nodup and nodupkey options
In Proc Sort procedure 'nodup' and 'nodupkey' could also be used to eliminate duplicate records in a dataset
Note 'nodup' and 'nodupkey' are very tricky confusing options to use and need very carefully handle.
Here also we use the same 'Test' dataset used in the Method 1.
You would be noticed 2 recodes attributed to Id_no=17 are identical records and only 2 of 3 records attributed
to id_no=18 are identical.
(1) "nodup" option use
proc sort nodup data=test out=norepeat;
by id_no;
run;
proc print data=norepeat; run;
In following output you would be seen Repeated observation attributed to Id_no 17 has now eliminated.
But repeated records attributed to Id_no=18 is remaining. Reason for not elimination is obs no 3 and 5
are identical but they are not in the sequential positions.
Obs id_no sex age inc r1 r2 r3
1 1 F 35 17 7 2 2
2 17 M 50 14 5 5 3
3 18 M 34 17 6 5 3
4 18 M 34 17 6 5 2
5 18 M 34 17 6 5 3
6 33 F 45 6 7 2 7
7 34 F 47 6 6 5 6
8 65 F 52 9 4 7 7
9 81 M 44 11 7 7 7
You can test putting obs 3 and 5 in adjacent and run the code it would eliminate one of this record
and become dataset size to 8 with non-duplicated records.
(2) "nodupkey" option use
proc sort nodupkey data=test out=repeat;
by id_no;
run;
proc print data=repeat; run;
In following output you can see all repeating records attributed to id_no has been eliminated.
output:
Obs id_no sex age inc r1 r2 r3
1 1 F 35 17 7 2 2
2 17 M 50 14 5 5 3
3 18 M 34 17 6 5 3
4 33 F 45 6 7 2 7
5 34 F 47 6 6 5 6
6 65 F 52 9 4 7 7
7 81 M 44 11 7 7 7
Method 3: What about removing all duplicate valuse blindly ?
This is a smart short trick that if you sort all variables in ascending order using 'by _all_" and apply 'nodup' option then
all duplicated recods will be eliminated.
See in the outthe output for id_no 18, 2 records remains because their r3 values are different.
proc sort nodup data=test out=noduplicate;
by _all_;
run;
proc print data=noduplicate; run;
output:
id_no sex age inc r1 r2 r3
1 F 35 17 7 2 2
17 M 50 14 5 5 3
18 M 34 17 6 5 2
18 M 34 17 6 5 3
33 F 45 6 7 2 7
34 F 47 6 6 5 6
65 F 52 9 4 7 7
81 M 44 11 7 7 7
CONCLUSION: The NODUP option in the SORT procedure eliminates observations that are exactly the same and adjacent
"across all variables". The NODUPKEY option eliminates observations that are exactly the same
across the "BY variables". Keep in mind that both of these options compare adjacent observations
in the output data set. Now you may be understand how the NODUP and NODUPKEY options work,
you can use them in confidence to get the data set you want!
Suraweera Wilson's SAS Notes