SAS Notes

Compiled by : Wilson Suraweera @ CGHR

Contents: Random Numbers in SAS; Data Sub-setting; Selection of two Random Samples; Identification and Separation of duplicate records in SAS dataset; Reshape Long list variable dataset to Wide list dataset; Missing values replacement with Means of the respective Variables; Logistic Regression with SAS; SAS made easy using Proc SQL; SAS PROC SQL procedure to access external ODBC data sources; SAS String Data Handling; SAS Missing value arithmetic's
Update_a_Table_using_Another_Table_in_SAS: Text analysis - an Epidemiological Case Study by WS - SAS Institute HUG -01 April2011
External Resources: SAS Knowledge Base - Glossary of SAS Procedures from SAS.com; SAS resources from - UCLA; Logistic regression
Survival Analysis : Usage of Proc LifeTest and Proc PHREG: SAS Dinosaur - Old and New way of SAS programming; Paul Dicman's Web Page for SAS- This little old discusses SAS 8, but useful; Global Statements Dictionary - Alphabetical listing and Description of SAS Key words; SAS Study Blog; SAS Canada - User Groups

/* Copy these codes and make your own program */

Identification and separation of duplicate records in a SAS dataset

Objective: 

	This example illustrate:
	  (a) Test whether data has duplicate records
	  (b) Separation of duplicate and non-duplicate data
	  (c) Elimination of all duplicate records


Method 1: Elimination of duplicate records attributed to a key variable, using "first" or "last" options 

Description:
       The variable  'id_no' in dataset 'test' uniquely identifies the individuals in this data.
       Assume accidentally id_no=17 and id_no=18 have repeated in this data.


    (1) Read data into SAS dataset

		   Data Test;
		   INPUT id_no sex $ age inc r1 r2 r3 ; 
		   DATALINES; 
				 1  F  35 17  7 2 2 
				17  M  50 14  5 5 3 
				33  F  45  6  7 2 7 
				17  M  50 14  5 5 3 
				65  F  52  9  4 7 7 
				81  M  44 11  7 7 7 
				18  M  34 17  6 5 3 
				18  M  34 17  6 5 2 
				34  F  47  6  6 5 6 
				18  M  34 17  6 5 3  
				; 
		   run;

    (2) Sort data by variable 'id_no' ;

				proc sort data=test;
				   by id_no;
				run;
				
				proc print data=test; run;

      Print Output:
				Obs id_no sex age inc r1 r2 r3 
				1  1 F 35 17 7 2 2 
				2 17 M 50 14 5 5 3 
				3 17 M 50 14 5 5 3 
				4 18 M 34 17 6 5 3 
				5 18 M 34 17 6 5 3 
				6 18 M 34 17 6 5 2 
				7 33 F 45  6 7 2 7 
				8 34 F 47  6 6 5 6 
				9 65 F 52  9 4 7 7 
			   10 81 M 44 11 7 7 7 

            ;

    (3) Separate data into 2 datasets. Non-repeating data output to "Norepeat" dataset and repeating more than one time data output to "Repeat";

			  data Norepeat Repeat;
			     set test;
			       by id_no;
			     if first.id_no then output Norepeat;
			     else output Repeat;
			  run; 
			  
			  Proc print data=Norepeat; run;
			  Proc print data=Repeat; run;

Output:

              dataset=Norepeat
           		               Obs   id_no    sex      age     inc     r1    r2  r3 
					1 	1  	F 	35 	17 	7    2 	  2 
					2 	17 	M 	50 	14 	5    5    3 
					3 	18 	M 	34 	17 	6    5    3 
					4 	33 	F 	45 	6 	7    2    7 
					5 	34 	F 	47 	6 	6    5    6 
					6 	65 	F 	52 	9 	4    7    7 
					7 	81 	M 	44 	11 	7    7    7 

              dataset=Repeat
					Obs  id_no     sex    age inc  r1 r2 r3 
					  1 	17 	M 	50 14 	5 5 3 
					  2 	18 	M 	34 17 	6 5 2 
					  3 	18 	M 	34 17 	6 5 3 

Method 2: Using nodup and nodupkey options 
            
            In Proc Sort procedure 'nodup' and 'nodupkey' could also be used to eliminate duplicate records in a dataset
            Note 'nodup' and 'nodupkey' are very tricky confusing options to use and need very carefully handle.
            Here also we use the same 'Test' dataset used in the Method 1.
            
            You would be noticed 2 recodes attributed to Id_no=17 are identical records and only 2 of 3 records attributed 
            to id_no=18 are identical.
            
     (1)  "nodup" option use
                  
				proc sort nodup data=test out=norepeat;
				   by id_no;
				run;
				
				proc print data=norepeat; run;
			
			In following output you would be seen Repeated observation attributed to Id_no 17 has now eliminated.
			But repeated records attributed to Id_no=18 is remaining. Reason for not elimination is obs no 3 and 5 
			are identical but they are not in the sequential positions.
			
				  Obs id_no sex age inc r1 r2 r3 
					1 	1 	F 	35 	17 	7 	2 2 
					2 	17 	M 	50 	14 	5 	5 3 
					3 	18 	M 	34 	17 	6 	5 3 
					4 	18 	M 	34 	17 	6 	5 2 
					5 	18 	M 	34 	17 	6 	5 3 
					6 	33 	F 	45 	6 	7 	2 7 
					7 	34 	F 	47 	6 	6 	5 6 
					8 	65 	F 	52 	9 	4 	7 7 
					9 	81 	M 	44 	11 	7 	7 7 
 
			You can test putting obs 3 and 5 in adjacent and run the code it would eliminate one of this record 
			and become  dataset size to 8 with non-duplicated records.	
				
	  (2)  "nodupkey" option use
     
             
				proc sort nodupkey data=test out=repeat;
				   by id_no;
				run;
				proc print data=repeat; run;
				
				In following output you can see all repeating records attributed to id_no has been eliminated.
				
			 output:
					Obs id_no  sex age inc r1 r2 r3 
					 1 		1 	F 	35 	17 	7 2  2 
					 2 		17 	M 	50 	14 	5 5  3 
					 3 		18 	M 	34 	17 	6 5  3 
					 4 		33 	F 	45 	6 	7 2  7 
					 5 		34 	F 	47 	6 	6 5  6 
					 6 		65 	F 	52 	9 	4 7  7 
					 7 		81 	M 	44 	11 	7 7  7 
Method 3: What about removing all duplicate valuse blindly ? 

                This is a smart short trick that if you sort all variables in ascending order using 'by _all_" and apply 'nodup' option then 
                all duplicated recods will be eliminated. 
                See in the outthe output for id_no 18, 2 records remains because their r3 values are different. 
                
				proc sort nodup data=test out=noduplicate;
				   by _all_;
				run;
				
				proc print data=noduplicate; run;
				
			 output:
					id_no sex age  inc r1 r2 r3 
						1 	F 35 	17 	7 2  2 
						17 	M 50 	14 	5 5  3 
						18	M 34 	17 	6 5  2 
						18 	M 34 	17 	6 5  3 
						33 	F 45 	6 	7 2  7 
						34 	F 47 	6 	6 5  6 
						65 	F 52 	9 	4 7  7 
						81 	M 44 	11 	7 7  7 



		CONCLUSION:  The NODUP option in the SORT procedure eliminates observations that are exactly the same and adjacent 
			     "across all variables". The NODUPKEY option eliminates observations that are exactly the same
			     across the "BY variables". Keep in mind that both of these options compare adjacent observations
			     in the output data set. Now you may be understand how the NODUP and NODUPKEY options work, 
			     you can use them in confidence to get the data set you want!	
				
				
	 Suraweera Wilson's SAS Notes