SAS Hash Object: My New Best Friend Demonstration Of Time Savings Using A Hash Object By Denise A. Kruse SAS Contractor.

Post on 01-Apr-2015

213 Views

Category:

Documents

1 Downloads

Preview:

Click to see full reader

Transcript

SAS Hash Object: My New Best Friend

Demonstration Of Time Savings Using A Hash ObjectBy Denise A. Kruse

SAS Contractor

2

Program Objectives

campaign11,346 obs

disposition 97 obs

program 446 obs

disposition category 6 obs

dec_offers

6,145,029 obs

3

Matching Datasets

What is the best way to get the fields from the 4 small datasets into the main population of 6.1 million observations?

• PROC merge• HASH

4

PROC merge

• Both datasets need to be sorted prior to the merge• Merge datasets• Sort again• Merge again

5

proc sort data=oms_prod.disposition out=disp ;by disposition_id ; run ;

proc sort data=dec_offers ;by disposition_id ; run ;

data dec_match ;merge dec_offers (in=a) disp(keep=disposition_id description touched

disposition_category_code in=b) ;by disposition_id ;if a and b ;run ;

Sort / Merge Code

6

NOTE: Sorting was performed by the data source.NOTE: There were 97 observations read from the data set OMS_PROD.DISPOSITION.NOTE: The data set WORK.DISP has 97 observations and 10 variables.NOTE: Compressing data set WORK.DISP decreased size by 0.00 percent. Compressed is 2 pages; un-compressed would require 2 pages.

NOTE: PROCEDURE SORT used (Total process time): real time 0.23 seconds cpu time 0.00 seconds

NOTE: There were 6145029 observations read from the data set WORK.DEC_OFFERS.NOTE: The data set WORK.DEC_OFFERS has 6145029 observations and 4 variables.NOTE: Compressing data set WORK.DEC_OFFERS increased size by 58.15 percent. Compressed is 38412 pages; un-compressed would require 24289 pages.

NOTE: PROCEDURE SORT used (Total process time): real time 28.44 seconds cpu time 39.81 seconds

NOTE: There were 6145029 observations read from the data set WORK.DEC_OFFERS.NOTE: There were 97 observations read from the data set WORK.DISP.NOTE: The data set WORK.DEC_MATCH has 6145029 observations and 7 variables.NOTE: Compressing data set WORK.DEC_MATCH decreased size by 74.94 percent. Compressed is 27499 pages; un-compressed would require 109733 pages.

NOTE: DATA statement used (Total process time): real time 42.81 seconds cpu time 42.58 seconds

Log For Sort / Merge

7

proc sort data=oms_prod.campaign out=camp ;by campaign_id ; run ;

proc sort data=dec_match ;by campaign_id ; run ;

data dec_match2 ;merge dec_match (in=a)

camp(keep=campaign_id program_id campaign_code description

in=b) ;by campaign_id ;if a and b ;run ;

Sort / Merge Code Continued

8

NOTE: Sorting was performed by the data source.NOTE: There were 11346 observations read from the data set OMS_PROD.CAMPAIGN.NOTE: The data set WORK.CAMP has 11346 observations and 19 variables.NOTE: Compressing data set WORK.CAMP decreased size by 43.48 percent. Compressed is 143 pages; un-compressed would require 253 pages.

NOTE: PROCEDURE SORT used (Total process time): real time 0.67 seconds cpu time 0.43 seconds

NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH.NOTE: The data set WORK.DEC_MATCH has 6145029 observations and 7 variables.NOTE: Compressing data set WORK.DEC_MATCH decreased size by 74.94 percent. Compressed is 27496 pages; un-compressed would require 109733 pages.

NOTE: PROCEDURE SORT used (Total process time): real time 1:09.07 cpu time 1:59.52

NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH.NOTE: There were 11346 observations read from the data set WORK.CAMP.NOTE: The data set WORK.DEC_MATCH2 has 6145029 observations and 9 variables.NOTE: Compressing data set WORK.DEC_MATCH2 decreased size by 71.53 percent. Compressed is 34306 pages; un-compressed would require 120491 pages.

NOTE: DATA statement used (Total process time): real time 51.29 seconds cpu time 51.05 seconds

Log For Sort / Merge

9

Sort / Merge Code Continued

proc sort data=oms_prod.program out=pgm ;by program_id ; run ;

proc sort data=dec_match2 ;by program_id ; run ;

data dec_match3 ;merge dec_match (in=a) pgm(keep=program_id name

in=b) ;by program_id ;if a and b ;run ;

10

Log For Sort / Merge

NOTE: Sorting was performed by the data source.NOTE: There were 446 observations read from the data set OMS_PROD.PROGRAM.NOTE: The data set WORK.PGM has 446 observations and 16 variables.NOTE: Compressing data set WORK.PGM decreased size by 40.00 percent. Compressed is 6 pages; un-compressed would require 10 pages.

NOTE: PROCEDURE SORT used (Total process time): real time 0.25 seconds cpu time 0.03 seconds

NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH2.NOTE: The data set WORK.DEC_MATCH2 has 6145029 observations and 9 variables.NOTE: Compressing data set WORK.DEC_MATCH2 decreased size by 71.53 percent. Compressed is 34306 pages; un-compressed would require 120491 pages.

NOTE: PROCEDURE SORT used (Total process time): real time 1:17.37 cpu time 2:02.37

NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH2.NOTE: There were 446 observations read from the data set WORK.PGM.NOTE: The data set WORK.DEC_MATCH3 has 6145029 observations and 10 variables.NOTE: Compressing data set WORK.DEC_MATCH3 decreased size by 72.06 percent. Compressed is 26016 pages; un-compressed would require 93107 pages.

NOTE: DATA statement used (Total process time): real time 59.00 seconds cpu time 58.97 seconds

11

Sort / Merge Code

proc sort data= oms_prod.disposition_category out=disp_cat(rename=(description=disp_desc)) ;

by disposition_category_code ; run ;

proc sort data=dec_match3 ;by disposition_category_code ; run ;

data dec_match4 ;merge dec_match3 (in=a)

disp_cat(keep=disposition_category_code disp_desc in=b) ;

by disposition_category_code ;if a and b ;run ;

12

Log For Sort / Merge

NOTE: Sorting was performed by the data source.NOTE: There were 6 observations read from the data set OMS_PROD.DISPOSITION_CATEGORY.NOTE: The data set WORK.DISP_CAT has 6 observations and 2 variables.NOTE: Compressing data set WORK.DISP_CAT increased size by 100.00 percent. Compressed is 2 pages; un-compressed would require 1 pages.

NOTE: PROCEDURE SORT used (Total process time): real time 0.03 seconds cpu time 0.02 seconds

NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH3.NOTE: The data set WORK.DEC_MATCH3 has 6145029 observations and 10 variables.NOTE: Compressing data set WORK.DEC_MATCH3 decreased size by 72.06 percent. Compressed is 26017 pages; un-compressed would require 93107 pages.

NOTE: PROCEDURE SORT used (Total process time): real time 1:26.08 cpu time 2:14.65

NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH3.NOTE: There were 6 observations read from the data set WORK.DISP_CAT.NOTE: The data set WORK.DEC_MATCH4 has 6145029 observations and 11 variables.NOTE: Compressing data set WORK.DEC_MATCH4 decreased size by 71.05 percent. Compressed is 31209 pages; un-compressed would require 107808 pages.

NOTE: DATA statement used (Total process time): real time 1:03.35 cpu time 1:03.28

13

HASH code

data dec_match ; if _n_ = 1 then do ; IF 0 then set oms_prod.disposition(keep=disposition_id

description touched disposition_category_code ) ;

declare hash ht(dataset: "oms_prod.disposition") ; ht.defineKEY("disposition_id ") ; ht.defineData("disposition_id ", "description “

“touched","disposition_category_code") ; ht.defineDone() ; end ; set dec_offers ;

if ht.find()=0 ; run ;

No sorting !!

14

HASH Log

NOTE: There were 97 observations read from the data set OMS_PROD.DISPOSITION.

NOTE: There were 6145029 observations read from the data set WORK.DEC_OFFERS.

NOTE: The data set WORK.DEC_MATCH has 6145029 observations and 7 variables.

NOTE: Compressing data set WORK.DEC_MATCH decreased size by 74.94 percent.

Compressed is 27499 pages; un-compressed would require 109733 pages.

NOTE: DATA statement used (Total process time):

real time 48.38 seconds

cpu time 48.14 seconds

15

HASH Code

data dec_match2 ; if _n_ = 1 then do ; IF 0 then set oms_prod.campaign(keep=campaign_id

program_id campaign_code description ) ; declare hash ht(dataset: "oms_prod.campaign") ; ht.defineKEY("campaign_id") ; ht.defineData("campaign_id", "program_id",

"campaign_code", "description") ; ht.defineDone() ; end ; set dec_match ;if ht.find()=0 ; run ;

16

HASH Log

NOTE: There were 11346 observations read from the data set OMS_PROD.CAMPAIGN.

NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH.

NOTE: The data set WORK.DEC_MATCH2 has 6145029 observations and 9 variables.

NOTE: Compressing data set WORK.DEC_MATCH2 decreased size by 38.33 percent.

Compressed is 39071 pages; un-compressed would require 63352 pages.

NOTE: DATA statement used (Total process time):

real time 55.35 seconds

cpu time 55.21 seconds

17

HASH Code

data dec_match3; if _n_ = 1 then do; IF 0 then set oms_prod.program(keep=program_id

name );

declare hash ht(dataset: "oms_prod.program"); ht.defineKEY("program_id"); ht.defineData("program_id", "name"); ht.defineDone(); end; set dec_match2;

if ht.find()=0; run;

18

HASH Log

NOTE: There were 446 observations read from the data set OMS_PROD.PROGRAM.NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH2.NOTE: The data set WORK.DEC_MATCH3 has 6145029 observations and 10 variables.NOTE: Compressing data set WORK.DEC_MATCH3 decreased size by 48.53 percent. Compressed is 43928 pages; un-compressed would require 85348 pages.

NOTE: DATA statement used (Total process time): real time 1:00.38 cpu time 1:00.17

19

HASH Code

data disposition_category (rename=(description=disp_desc)); set oms_prod.disposition_category; run;

data dec_match4; if _n_ = 1 then do; IF 0 then set

disposition_category(keep=disposition_category_code disp_desc); declare hash ht(dataset: "disposition_category"); ht.defineKEY("disposition_category_code"); ht.defineData("disposition_category_code", "disp_desc"); ht.defineDone(); end; set dec_match3;if ht.find()=0; run;

20

HASH Log

NOTE: There were 6 observations read from the data set OMS_PROD.DISPOSITION_CATEGORY.NOTE: The data set WORK.DISPOSITION_CATEGORY has 6 observations and 2 variables.NOTE: Compressing data set WORK.DISPOSITION_CATEGORY increased size by 100.00 percent. Compressed is 2 pages; un-compressed would require 1 pages.

NOTE: DATA statement used (Total process time): real time 0.02 seconds cpu time 0.01 seconds

NOTE: There were 6 observations read from the data set WORK.DISPOSITION_CATEGORY.NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH3.NOTE: The data set WORK.DEC_MATCH4 has 6145029 observations and 11 variables.NOTE: Compressing data set WORK.DEC_MATCH4 decreased size by 49.47 percent. Compressed is 51750 pages; un-compressed would require 102418 pages.

NOTE: DATA statement used (Total process time): real time 1:02.45 cpu time 1:02.30

21

Comparison Of Processing Time

Sort / Merge HASH

~70 sec 48 sec

~2 min 55 sec

~2 min 16 sec 1 min

~2 min 29 sec 1 min 2 sec

~8 min TOTAL ~4 min TOTAL

dec_match

dec_match2

dec_match3

dec_match4

22

Conclusion

When looking for efficiencies HASH objects are definitely worth considering. In larger programs, HASH objects can save valuable processing time.

23

References

Linda Jolley – Using Table Lookup Techniques Efficiently

Jason Secosky – The DATA Step In Version 9: What’s New?

Paul Dorfman- DATA Step HASH Objects As Programming Tools

24

Contact Information

Denise A. Kruse

SAS Contractor

DeniseAKruse@gmail.com

top related