SAS Hash Object: My New Best Friend Demonstration Of Time Savings Using A Hash Object By Denise A. Kruse SAS Contractor
SAS Hash Object: My New Best Friend
Demonstration Of Time Savings Using A Hash ObjectBy Denise A. Kruse
SAS Contractor
2
Program Objectives
campaign11,346 obs
disposition 97 obs
program 446 obs
disposition category 6 obs
dec_offers
6,145,029 obs
3
Matching Datasets
What is the best way to get the fields from the 4 small datasets into the main population of 6.1 million observations?
• PROC merge• HASH
4
PROC merge
• Both datasets need to be sorted prior to the merge• Merge datasets• Sort again• Merge again
5
proc sort data=oms_prod.disposition out=disp ;by disposition_id ; run ;
proc sort data=dec_offers ;by disposition_id ; run ;
data dec_match ;merge dec_offers (in=a) disp(keep=disposition_id description touched
disposition_category_code in=b) ;by disposition_id ;if a and b ;run ;
Sort / Merge Code
6
NOTE: Sorting was performed by the data source.NOTE: There were 97 observations read from the data set OMS_PROD.DISPOSITION.NOTE: The data set WORK.DISP has 97 observations and 10 variables.NOTE: Compressing data set WORK.DISP decreased size by 0.00 percent. Compressed is 2 pages; un-compressed would require 2 pages.
NOTE: PROCEDURE SORT used (Total process time): real time 0.23 seconds cpu time 0.00 seconds
NOTE: There were 6145029 observations read from the data set WORK.DEC_OFFERS.NOTE: The data set WORK.DEC_OFFERS has 6145029 observations and 4 variables.NOTE: Compressing data set WORK.DEC_OFFERS increased size by 58.15 percent. Compressed is 38412 pages; un-compressed would require 24289 pages.
NOTE: PROCEDURE SORT used (Total process time): real time 28.44 seconds cpu time 39.81 seconds
NOTE: There were 6145029 observations read from the data set WORK.DEC_OFFERS.NOTE: There were 97 observations read from the data set WORK.DISP.NOTE: The data set WORK.DEC_MATCH has 6145029 observations and 7 variables.NOTE: Compressing data set WORK.DEC_MATCH decreased size by 74.94 percent. Compressed is 27499 pages; un-compressed would require 109733 pages.
NOTE: DATA statement used (Total process time): real time 42.81 seconds cpu time 42.58 seconds
Log For Sort / Merge
7
proc sort data=oms_prod.campaign out=camp ;by campaign_id ; run ;
proc sort data=dec_match ;by campaign_id ; run ;
data dec_match2 ;merge dec_match (in=a)
camp(keep=campaign_id program_id campaign_code description
in=b) ;by campaign_id ;if a and b ;run ;
Sort / Merge Code Continued
8
NOTE: Sorting was performed by the data source.NOTE: There were 11346 observations read from the data set OMS_PROD.CAMPAIGN.NOTE: The data set WORK.CAMP has 11346 observations and 19 variables.NOTE: Compressing data set WORK.CAMP decreased size by 43.48 percent. Compressed is 143 pages; un-compressed would require 253 pages.
NOTE: PROCEDURE SORT used (Total process time): real time 0.67 seconds cpu time 0.43 seconds
NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH.NOTE: The data set WORK.DEC_MATCH has 6145029 observations and 7 variables.NOTE: Compressing data set WORK.DEC_MATCH decreased size by 74.94 percent. Compressed is 27496 pages; un-compressed would require 109733 pages.
NOTE: PROCEDURE SORT used (Total process time): real time 1:09.07 cpu time 1:59.52
NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH.NOTE: There were 11346 observations read from the data set WORK.CAMP.NOTE: The data set WORK.DEC_MATCH2 has 6145029 observations and 9 variables.NOTE: Compressing data set WORK.DEC_MATCH2 decreased size by 71.53 percent. Compressed is 34306 pages; un-compressed would require 120491 pages.
NOTE: DATA statement used (Total process time): real time 51.29 seconds cpu time 51.05 seconds
Log For Sort / Merge
9
Sort / Merge Code Continued
proc sort data=oms_prod.program out=pgm ;by program_id ; run ;
proc sort data=dec_match2 ;by program_id ; run ;
data dec_match3 ;merge dec_match (in=a) pgm(keep=program_id name
in=b) ;by program_id ;if a and b ;run ;
10
Log For Sort / Merge
NOTE: Sorting was performed by the data source.NOTE: There were 446 observations read from the data set OMS_PROD.PROGRAM.NOTE: The data set WORK.PGM has 446 observations and 16 variables.NOTE: Compressing data set WORK.PGM decreased size by 40.00 percent. Compressed is 6 pages; un-compressed would require 10 pages.
NOTE: PROCEDURE SORT used (Total process time): real time 0.25 seconds cpu time 0.03 seconds
NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH2.NOTE: The data set WORK.DEC_MATCH2 has 6145029 observations and 9 variables.NOTE: Compressing data set WORK.DEC_MATCH2 decreased size by 71.53 percent. Compressed is 34306 pages; un-compressed would require 120491 pages.
NOTE: PROCEDURE SORT used (Total process time): real time 1:17.37 cpu time 2:02.37
NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH2.NOTE: There were 446 observations read from the data set WORK.PGM.NOTE: The data set WORK.DEC_MATCH3 has 6145029 observations and 10 variables.NOTE: Compressing data set WORK.DEC_MATCH3 decreased size by 72.06 percent. Compressed is 26016 pages; un-compressed would require 93107 pages.
NOTE: DATA statement used (Total process time): real time 59.00 seconds cpu time 58.97 seconds
11
Sort / Merge Code
proc sort data= oms_prod.disposition_category out=disp_cat(rename=(description=disp_desc)) ;
by disposition_category_code ; run ;
proc sort data=dec_match3 ;by disposition_category_code ; run ;
data dec_match4 ;merge dec_match3 (in=a)
disp_cat(keep=disposition_category_code disp_desc in=b) ;
by disposition_category_code ;if a and b ;run ;
12
Log For Sort / Merge
NOTE: Sorting was performed by the data source.NOTE: There were 6 observations read from the data set OMS_PROD.DISPOSITION_CATEGORY.NOTE: The data set WORK.DISP_CAT has 6 observations and 2 variables.NOTE: Compressing data set WORK.DISP_CAT increased size by 100.00 percent. Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: PROCEDURE SORT used (Total process time): real time 0.03 seconds cpu time 0.02 seconds
NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH3.NOTE: The data set WORK.DEC_MATCH3 has 6145029 observations and 10 variables.NOTE: Compressing data set WORK.DEC_MATCH3 decreased size by 72.06 percent. Compressed is 26017 pages; un-compressed would require 93107 pages.
NOTE: PROCEDURE SORT used (Total process time): real time 1:26.08 cpu time 2:14.65
NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH3.NOTE: There were 6 observations read from the data set WORK.DISP_CAT.NOTE: The data set WORK.DEC_MATCH4 has 6145029 observations and 11 variables.NOTE: Compressing data set WORK.DEC_MATCH4 decreased size by 71.05 percent. Compressed is 31209 pages; un-compressed would require 107808 pages.
NOTE: DATA statement used (Total process time): real time 1:03.35 cpu time 1:03.28
13
HASH code
data dec_match ; if _n_ = 1 then do ; IF 0 then set oms_prod.disposition(keep=disposition_id
description touched disposition_category_code ) ;
declare hash ht(dataset: "oms_prod.disposition") ; ht.defineKEY("disposition_id ") ; ht.defineData("disposition_id ", "description “
“touched","disposition_category_code") ; ht.defineDone() ; end ; set dec_offers ;
if ht.find()=0 ; run ;
No sorting !!
14
HASH Log
NOTE: There were 97 observations read from the data set OMS_PROD.DISPOSITION.
NOTE: There were 6145029 observations read from the data set WORK.DEC_OFFERS.
NOTE: The data set WORK.DEC_MATCH has 6145029 observations and 7 variables.
NOTE: Compressing data set WORK.DEC_MATCH decreased size by 74.94 percent.
Compressed is 27499 pages; un-compressed would require 109733 pages.
NOTE: DATA statement used (Total process time):
real time 48.38 seconds
cpu time 48.14 seconds
15
HASH Code
data dec_match2 ; if _n_ = 1 then do ; IF 0 then set oms_prod.campaign(keep=campaign_id
program_id campaign_code description ) ; declare hash ht(dataset: "oms_prod.campaign") ; ht.defineKEY("campaign_id") ; ht.defineData("campaign_id", "program_id",
"campaign_code", "description") ; ht.defineDone() ; end ; set dec_match ;if ht.find()=0 ; run ;
16
HASH Log
NOTE: There were 11346 observations read from the data set OMS_PROD.CAMPAIGN.
NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH.
NOTE: The data set WORK.DEC_MATCH2 has 6145029 observations and 9 variables.
NOTE: Compressing data set WORK.DEC_MATCH2 decreased size by 38.33 percent.
Compressed is 39071 pages; un-compressed would require 63352 pages.
NOTE: DATA statement used (Total process time):
real time 55.35 seconds
cpu time 55.21 seconds
17
HASH Code
data dec_match3; if _n_ = 1 then do; IF 0 then set oms_prod.program(keep=program_id
name );
declare hash ht(dataset: "oms_prod.program"); ht.defineKEY("program_id"); ht.defineData("program_id", "name"); ht.defineDone(); end; set dec_match2;
if ht.find()=0; run;
18
HASH Log
NOTE: There were 446 observations read from the data set OMS_PROD.PROGRAM.NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH2.NOTE: The data set WORK.DEC_MATCH3 has 6145029 observations and 10 variables.NOTE: Compressing data set WORK.DEC_MATCH3 decreased size by 48.53 percent. Compressed is 43928 pages; un-compressed would require 85348 pages.
NOTE: DATA statement used (Total process time): real time 1:00.38 cpu time 1:00.17
19
HASH Code
data disposition_category (rename=(description=disp_desc)); set oms_prod.disposition_category; run;
data dec_match4; if _n_ = 1 then do; IF 0 then set
disposition_category(keep=disposition_category_code disp_desc); declare hash ht(dataset: "disposition_category"); ht.defineKEY("disposition_category_code"); ht.defineData("disposition_category_code", "disp_desc"); ht.defineDone(); end; set dec_match3;if ht.find()=0; run;
20
HASH Log
NOTE: There were 6 observations read from the data set OMS_PROD.DISPOSITION_CATEGORY.NOTE: The data set WORK.DISPOSITION_CATEGORY has 6 observations and 2 variables.NOTE: Compressing data set WORK.DISPOSITION_CATEGORY increased size by 100.00 percent. Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: DATA statement used (Total process time): real time 0.02 seconds cpu time 0.01 seconds
NOTE: There were 6 observations read from the data set WORK.DISPOSITION_CATEGORY.NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH3.NOTE: The data set WORK.DEC_MATCH4 has 6145029 observations and 11 variables.NOTE: Compressing data set WORK.DEC_MATCH4 decreased size by 49.47 percent. Compressed is 51750 pages; un-compressed would require 102418 pages.
NOTE: DATA statement used (Total process time): real time 1:02.45 cpu time 1:02.30
21
Comparison Of Processing Time
Sort / Merge HASH
~70 sec 48 sec
~2 min 55 sec
~2 min 16 sec 1 min
~2 min 29 sec 1 min 2 sec
~8 min TOTAL ~4 min TOTAL
dec_match
dec_match2
dec_match3
dec_match4
22
Conclusion
When looking for efficiencies HASH objects are definitely worth considering. In larger programs, HASH objects can save valuable processing time.
23
References
Linda Jolley – Using Table Lookup Techniques Efficiently
Jason Secosky – The DATA Step In Version 9: What’s New?
Paul Dorfman- DATA Step HASH Objects As Programming Tools