Page 1: Base SAS Programming Fundamentals

Base SAS programming skills

© 2008 Infosys Technologies Ltd. Strictly private and confidential. No part of this document should be reproduced or distributed without the prior permission of Infosys Technologies Ltd.


-Archit Kumar

SI –BI

BOFA


Page 2: Base SAS Programming Fundamentals


Contents

Introduction to SAS

Introduction to SAS programs

SAS data libraries

Producing list reports – PRINT procedure

Customizing report appearance – creating HTML reports

Reading raw data files

Dropping and keeping variables

Concatenating SAS data sets

Producing summary reports

Introduction to graphics

Controlling input and output

Summarizing data

Page 3: Base SAS Programming Fundamentals


Reading and writing different types of data

Data transformation –

- Manipulating character values

- Manipulating numeric values

- Manipulating date values

Do loops in SAS

SAS arrays

Match merging two or more data sets

Using SQL queries in SAS

SAS macros

Basic efficiency techniques

Page 4: Base SAS Programming Fundamentals

Overview of SAS system

The functionality of the SAS system is built around four data-driven tasks:

1. Data access – addresses the data required by the application

2. Data management – shapes data into the form required by the application

3. Data analysis – summarizes, reduces, or otherwise transforms raw data into meaningful and useful information

4. Data presentation – communicates information in ways that clearly demonstrate its significance

Page 5: Base SAS Programming Fundamentals

Data Processing

Raw data or an existing SAS data set -> DATA step -> SAS data set -> PROC step -> Report

Process of delivering meaningful information –

- Accessing data

- Transforming data

- Managing data

- Storing and retrieving data

- Analysis

Page 6: Base SAS Programming Fundamentals

Introduction to SAS program

/********************* Data step *********************/
data work.staff;
   infile 'raw data file';
   input LastName $ 1-20 FirstName $ 21-30
         JobTitle $ 36-43 Salary 54-59;
run;

/********************* Proc step *********************/
proc print data=work.staff;
run;

proc means data=work.staff;
   class JobTitle;
   var Salary;
run;

Page 7: Base SAS Programming Fundamentals

Fundamental Concepts

SAS data sets

A SAS data set has a descriptor portion and a data portion.

Descriptor portion

proc contents data=SAS-data-set;
run;

PROC CONTENTS displays the following information about the data set –

- General information such as the data set name, number of observations and number of variables.

- Variable attributes such as name, type, length, position, informat and format.

Data portion

proc print data=SAS-data-set;
run;

The data portion holds the data values in tabular form. Variables (columns) correspond to fields and observations (rows) correspond to data lines.

Page 8: Base SAS Programming Fundamentals

SAS variables

There are two types of variables –

Character – Contains any value, i.e. letters, numbers, special characters and blanks. Character values can be 1 to 32,767 characters long.

Numeric – Stored as floating point numbers in 8 bytes of storage by default. Eight bytes of floating point storage provide room for roughly 15 to 16 significant digits.

SAS variable names –

- Can be up to 32 characters long.

- Can be uppercase, lowercase or mixed case.

- Must start with a letter or underscore. Subsequent characters can be letters, underscores or numeric digits.

Date values – SAS date values are stored as numeric values. A date is stored as the number of days between January 1, 1960 and that date.
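A minimal sketch of how date values behave as ordinary numbers. The data set name work.date_demo and its variable names are made up for illustration and are not part of the original material.

```sas
/* Illustrative only: work.date_demo and its variables are hypothetical. */
data work.date_demo;
   ref_date     = '01JAN1960'd;          /* date constant for the reference date, stored as 0 */
   some_date    = '16OCT2001'd;          /* stored as a count of days after the reference date */
   days_between = some_date - ref_date;  /* ordinary arithmetic works on date values */
run;

proc print data=work.date_demo;
   format some_date date9.;              /* display one column as a readable date */
run;
```

Without a format, a date prints as a plain number, which is why the date formats described later are needed for reports.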

Page 9: Base SAS Programming Fundamentals

SAS - Syntax Rules

SAS statements –

- Usually begin with an identifying keyword.

- Always end with a semicolon.

SAS statements are free format –

- They can begin and end in any column.

- A single statement can span multiple lines.

- Several statements can be on the same line.

Comments –

- A block comment begins with /* and ends with */ and can span multiple lines.

- A comment statement begins with an asterisk and ends with a semicolon, e.g. * this is a comment;

Page 10: Base SAS Programming Fundamentals

SAS – Data Libraries

A SAS data library is a collection of SAS files that SAS recognizes as a unit. Here, SAS files means SAS data sets.

Types of SAS libraries –

Temporary library – When SAS is invoked, it automatically provides a temporary library named WORK. Data sets created here are removed when the SAS session ends.

Permanent library – SASUSER is a permanent library that SAS provides. Further permanent libraries can be assigned with the LIBNAME statement.

Syntax – libname libref 'SAS data library' <options>;

Rules for the libref –

1. Must be 8 characters or less.

2. Must begin with a letter or underscore.

3. Remaining characters are letters, numbers or underscores.

e.g. libname Test_lib 'c:\workshop\prog1';

Once the libref is assigned, data sets inside the library are referred to by a two-level name of the form libref.data-set-name.

Page 11: Base SAS Programming Fundamentals

PRINT Procedure

General form of the PRINT procedure

proc print data=SAS-data-set;
run;

By default the PRINT procedure prints all variables of the data set and adds an Obs column that shows the observation (row) number.

Features of the PRINT procedure

1. Titles and footnotes – Discussed in subsequent slides

2. Formatted values – Discussed in subsequent slides

3. Printing selected variables –

proc print data=ia.empdata;
   var empname salary jobcode;
run;

The VAR statement prints only the listed variables, in the order in which they are written.

Page 12: Base SAS Programming Fundamentals

4. Suppressing the observation column – NOOBS option

proc print data=ia.empdata noobs;
run;

5. Subsetting data – The WHERE statement selects only the observations that satisfy a condition.

Syntax – where <condition>;

The condition is built from operands (constants or variables) and operators (comparison, logical and special operators, or functions).

e.g. –

Comparison – where Salary > 25000;

Logical – where Jobcode = 'A' and Salary = 25000; (OR and NOT can be used in the same way)

Special operators –

BETWEEN – where Salary between 5000 and 7000;

CONTAINS (?) – where Lastname ? 'LAM';

Example of a PROC step with a WHERE statement –

proc print data=ia.empdata;
   var Jobcode Empid Salary;
   where Jobcode = 'A' and Salary between 20000 and 30000;
run;

6. Column totals – The SUM statement prints a column total.

e.g. –

proc print data=ia.empdata;
   var Jobcode Salary Empid;
   sum Salary;
run;
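The NOOBS, VAR, WHERE and SUM features above can be tried on a small stand-alone data set. This is only a sketch: work.emp_demo and its four rows are invented sample data, not the ia.empdata table used in the slides.

```sas
/* Illustrative only: work.emp_demo and its rows are made-up sample data. */
data work.emp_demo;
   input Empid $ Jobcode $ Salary;
   datalines;
E001 A 22000
E002 B 31000
E003 A 28000
E004 A 45000
;
run;

proc print data=work.emp_demo noobs;   /* NOOBS drops the Obs column */
   var Jobcode Empid Salary;           /* columns print in this order */
   where Jobcode = 'A' and Salary between 20000 and 30000;
   sum Salary;                         /* column total of the selected rows */
run;
```

With this sample data the WHERE condition keeps the rows for E001 and E003, so the printed total covers only those two salaries.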

Page 13: Base SAS Programming Fundamentals

Special where statements

Additional special operators supported by the WHERE statement are –

LIKE – Selects observations by comparing character values to a specified pattern.

e.g. – where code like 'E_U%';

This selects code values that begin with E, followed by any single character, followed by U, followed by any number of characters.

Sounds like – The sounds-like (=*) operator selects observations that contain spelling variations of the specified word.

e.g. – where name =* 'SMITH';

Selects names such as SMYTHE and SMITT.

IS NULL or IS MISSING – Selects observations in which the value of the variable is missing.

e.g. – where flight is missing;

where flight is null;

Page 14: Base SAS Programming Fundamentals

Sequencing and Grouping observations

Sort procedure – The SORT procedure is used to sequence the observations. It –

1. Rearranges the observations in a SAS data set.

2. Can create a new data set containing the rearranged observations (OUT= option). Without OUT=, the input data set is replaced.

3. Can sort on multiple variables.

4. Does not generate printed output.

5. Treats missing values as the smallest possible value.

6. Sorts in ascending order by default.

Syntax –

proc sort data=input-data-set out=output-data-set;
   by <descending> by-variable(s);
run;

e.g.

proc sort data=ia.empdata out=work.jobsal;
   by jobcode descending salary;
run;

Page 15: Base SAS Programming Fundamentals

Grouping data and printing subtotals and grand totals – A BY statement in PROC PRINT groups the output by the values of the BY variable.

e.g. –

proc print data=ia.empdata;
   by jobcode;
   sum salary;
run;

The above code groups the data by jobcode, and the SUM statement prints the subtotal of salary for each jobcode group followed by the grand total.

Note – The data set must be sorted or indexed by the BY variable before a BY statement can be used.

Page breaks – The PAGEBY statement puts each BY group on a separate page.

e.g. –

proc print data=ia.empdata;
   by jobcode;
   pageby jobcode;
   sum salary;
run;

PAGEBY must be used together with a BY statement, and only a variable that appears in the BY statement can be used in the PAGEBY statement.
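Sorting and BY-group printing are normally used together, so a small end-to-end sketch may help. The data set names work.emp_demo and work.emp_sorted and the sample rows are assumptions made for this illustration.

```sas
/* Illustrative only: the data set names and rows are made-up sample data. */
data work.emp_demo;
   input Jobcode $ Salary;
   datalines;
A 22000
B 31000
A 28000
A 45000
;
run;

/* Sort first: BY-group processing in PROC PRINT needs sorted (or indexed) data. */
proc sort data=work.emp_demo out=work.emp_sorted;
   by Jobcode descending Salary;
run;

proc print data=work.emp_sorted;
   by Jobcode;      /* one group per Jobcode value */
   sum Salary;      /* subtotal per group plus a grand total */
run;
```

Writing the sorted copy to a separate OUT= data set leaves the original order of work.emp_demo untouched.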

Page 16: Base SAS Programming Fundamentals

Enhancing outputs

ID statement – The ID statement suppresses the Obs column and places the ID variable in its place as the leftmost column. The ID statement can be combined with a BY statement: when the same variable appears in both, the output is grouped by that variable and its value is shown once per group in the leftmost column.

e.g. –

proc print data=ia.empdata;
   id Jobcode;
   by Jobcode;
   pageby Jobcode;
   sum Salary;
run;

The above code prints each Jobcode group on its own page, with Jobcode in place of the Obs column, and shows the sum of the Salary values for that group at the end of each page.

Page 17: Base SAS Programming Fundamentals

Customizing Report Appearance

Titles and Footnotes –

1. Titles appear at the top of the page.

2. The default SAS title is The SAS System.

3. The null title statement, title;, cancels all titles.

4. Footnotes appear at the bottom of the page.

5. No footnote appears unless one is specified.

6. The null footnote statement, footnote;, cancels all footnotes.

7. More than one title or footnote line can be defined by numbering the statements: title1, title2, ..., titlen (and footnote1, ..., footnoten), where n can be at most 10. E.g. title1 'First Line'; title2 'Second Line';

8. Titles and footnotes stay in effect until they are changed or cancelled. A new TITLEn statement replaces the existing title on line n and cancels all titles with higher numbers; footnotes behave the same way.

Page 18: Base SAS Programming Fundamentals

Column Labels – The LABEL statement assigns descriptive labels to variables, and the LABEL option in the PROC PRINT statement displays them as column headings.

e.g. –

proc print data=ia.empdata label;
   label lastname='Last Name'
         Firstname='First Name';
run;

The SPLIT=' ' option, used in place of LABEL in the PROC PRINT statement, also displays the labels and breaks each label onto a new line wherever the specified split character occurs.

SAS System Options – SAS system options are used to change the appearance of a report.

Page 19: Base SAS Programming Fundamentals

1. DATE – Prints the date and time at which the SAS session began at the top of each page.

2. NODATE – Does not print the date and time.

3. LINESIZE=width – Specifies the line size (alias LS=).

4. PAGESIZE=n – Specifies the number of lines per page (alias PS=).

5. NUMBER – Prints the page number on the first line of each page of output.

6. NONUMBER – Does not print page numbers.

7. PAGENO=n – Specifies the page number at which numbering starts.

Example – options nodate nonumber ls=72;

The OPTIONS statement is a global statement; it does not belong to a DATA or PROC step and is normally placed outside them.

Page 20: Base SAS Programming Fundamentals

Formatting Data Values

To apply a format to a specific SAS variable, use the FORMAT statement.

General form of the FORMAT statement –

FORMAT variable(s) format;

Example –

proc print data=ia.empdata;
   format Salary dollar11.2;
run;

The above code prints the Salary values with a leading dollar sign, commas and 2 decimal places, in a total width of 11 characters (the dollar sign, commas and decimal point count toward the width).

Page 21: Base SAS Programming Fundamentals

SAS Formats

Format       Example      Description
w.d          8.2          Standard numeric format; width 8, 2 decimal places
$w.          $5.          Standard character format; width 5
COMMAw.d     COMMA9.2     Commas in a number; width 9, 2 decimal places
DOLLARw.d    DOLLAR10.2   Dollar sign and commas; width 10, 2 decimal places
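To see what these formats produce, the PUT function can apply each one to the same number. This is a sketch only; the data set work.fmt_demo, the variable names and the value 12345.678 are chosen for illustration.

```sas
/* Illustrative only: work.fmt_demo and the sample value are hypothetical. */
data work.fmt_demo;
   x = 12345.678;
   length plain commas dollars $ 12;
   plain   = put(x, 8.2);         /* '12345.68'   : 8 characters, 2 decimals */
   commas  = put(x, comma9.2);    /* '12,345.68'  : the comma counts toward the width */
   dollars = put(x, dollar10.2);  /* '$12,345.68' : dollar sign and comma count too */
   put plain= commas= dollars=;   /* write the results to the log */
run;
```

Each width was picked to be exactly large enough for the formatted value, which shows that every printed character, including punctuation, uses one position of the width.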

Page 22: Base SAS Programming Fundamentals

Date Formats

SAS dates are stored as the number of days between 1st January 1960 and the specified date, so date formats are used to print dates in a readable form. Some of the available date formats and the values they display (for the date 16Oct2001) are –

Format       Displayed value
MMDDYY6.     101601
MMDDYY8.     10/16/01
MMDDYY10.    10/16/2001
DATE7.       16OCT01
DATE9.       16OCT2001

Page 23: Base SAS Programming Fundamentals

User Defined Format

The FORMAT procedure can be used to define custom formats.

General form of PROC FORMAT –

proc format;
   value format-name range1='label'
                     ...;
run;

Example –

proc format;
   value gender 1='Female'
                2='Male'
                other='Miscoded';
run;

The above code defines a user-defined format named gender that displays the values 1 and 2, and any other value, as the respective labels.

Page 24: Base SAS Programming Fundamentals

Assigning labels to character values and to ranges of character values

proc format;
   value $grade 'A'='Good'
                'B'-'D'='Fair'
                'F'='Poor'
                other='Miscoded';
run;

Applying the format

proc print data=ia.student;
   format CGPA $grade.;
run;
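Both user-defined formats can be tried on a few rows of invented data. The data set work.student_demo and its variables Name, Sex and Grade are assumptions for this sketch; only the format definitions come from the slides.

```sas
/* Format definitions as on the slides. */
proc format;
   value gender 1='Female'
                2='Male'
                other='Miscoded';
   value $grade 'A'='Good'
                'B'-'D'='Fair'
                'F'='Poor'
                other='Miscoded';
run;

/* Illustrative only: work.student_demo and its rows are made-up sample data. */
data work.student_demo;
   input Name $ Sex Grade $;
   datalines;
Ann 1 A
Bob 2 C
Cyd 9 F
;
run;

proc print data=work.student_demo noobs;
   format Sex gender. Grade $grade.;   /* stored values stay unchanged; only the display differs */
run;
```

The numeric format is applied to the numeric variable and the character format (name starting with $) to the character variable; the value 9 falls into the OTHER range.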

Page 25: Base SAS Programming Fundamentals

Creating HTML reports

ODS (Output Delivery System) is used to create output in a variety of forms.

The ODS HTML statement opens, closes and manages the HTML destination.

General form of the ODS method –

ods html file='HTML file specification';
SAS code;
ods html close;

Example –

ods html file='D:\odscode.html';
proc print data=ia.empdata;
run;
ods html close;

Page 26: Base SAS Programming Fundamentals

Reading raw data file

Steps for creating a SAS data set

1. Start a DATA step and name the SAS data set being created (DATA statement).

DATA libref.SAS-data-set;

e.g. – data work.dwflax;

2. Identify the location of the raw data file to read (INFILE statement).

INFILE 'filename';

e.g. – infile 'C:\workshop\dwflax.txt';

3. Describe how to read the data fields from the raw data file (INPUT statement).

INPUT input-specifications;

Page 27: Base SAS Programming Fundamentals

Input specification –

• Names the SAS variables.

• Identifies each variable as character or numeric.

• Specifies the locations of the fields in the raw data file.

• Can be specified as column, formatted, list or named input.

Example (column input) –

data work.dwflax;
   infile 'C:\workshop\dwflax.txt';
   input Flight $ 1-3 Date $ 4-11
         Dest $ 12-14 FirstClass 15-17;
run;

Page 28: Base SAS Programming Fundamentals

Formatted Input

Formatted input is used to read data values by –

• Moving the input pointer to the starting position of the field.

• Specifying a variable name.

• Specifying an informat.

Pointer controls –

@n : Moves the pointer to column n.

+n : Moves the pointer forward n positions.

An informat is specified in the following way –

<$>informat-name<w>.<d>

Here $ indicates a character informat, w is the total width of the field, the period (.) is a required delimiter that marks the name as an informat, and d is the number of decimal places.

Page 29: Base SAS Programming Fundamentals

Example

data work.dfwlax;
   infile 'Raw data file';
   input @1  Flight $3.
         @4  Date mmddyy8.
         @12 Dest $3.
         @15 Firstclass 3.
         @18 Economy 3.;
run;

The above code reads Flight as 3 characters starting at position 1, Date from position 4 with the mmddyy8. informat, Dest as 3 characters starting at position 12, Firstclass as a 3-digit number starting at position 15 and Economy as a 3-digit number starting at position 18.
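The same INPUT statement can be run without an external file by placing the records in the program with DATALINES. This is a sketch under stated assumptions: work.flights_demo and the two sample records are invented, laid out to match the column positions described above.

```sas
/* Illustrative only: work.flights_demo and the records below are made-up sample data. */
data work.flights_demo;
   input @1  Flight     $3.
         @4  Date       mmddyy8.   /* converts text such as 12/11/00 to a SAS date value */
         @12 Dest       $3.
         @15 FirstClass 3.
         @18 Economy    3.;
   format Date date9.;             /* display the stored number as a date */
   datalines;
43912/11/00LAX 20137
92112/11/00DFW 20131
;
run;

proc print data=work.flights_demo;
run;
```

Because Date is read with an informat it is stored as a numeric SAS date, unlike the column-input example earlier where Date was read as a character value.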

Page 30: Base SAS Programming Fundamentals

Reading SAS data sets

Steps for creating a SAS data set from another SAS data set –

1. Use a DATA statement to start a DATA step and name the SAS data set being created.

2. Use a SET statement to identify the SAS data set being read.

3. Use assignment statements to create new variables or to modify the values of existing variables.

Example –

data work.new_data;
   set ia.dwflax;
   total = FirstClass + Economy;
run;

The above code reads all variables and observations from dwflax and creates a new variable named total in new_data.

Page 31: Base SAS Programming Fundamentals

Operators

Operator   Action            Example          Priority
+          Addition          Sum = x + y      III
-          Subtraction       Diff = x - y     III
*          Multiplication    Mul = x * y      II
/          Division          Div = x / y      II
**         Exponentiation    Raise = x ** y   I
-          Negative prefix   Negative = -x    I

Operations of priority I are performed first, then II, then III. Priority I operations are evaluated right to left; priority II and III operations are evaluated left to right.
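A few one-line expressions illustrate the priority rules. The data set work.op_demo and the variable names a to d are hypothetical and exist only for this sketch.

```sas
/* Illustrative only: work.op_demo and its variables are hypothetical. */
data work.op_demo;
   a = 2 + 3 * 4;      /* multiplication (II) before addition (III): 2 + 12 = 14 */
   b = (2 + 3) * 4;    /* parentheses override the priority: 5 * 4 = 20 */
   c = 2 ** 3 ** 2;    /* priority I runs right to left: 2 ** (3 ** 2) = 512 */
   d = 10 - 4 - 3;     /* priority III runs left to right: (10 - 4) - 3 = 3 */
   put a= b= c= d=;    /* write the results to the log */
run;
```

When the evaluation order is not obvious, adding parentheses as in b makes the intent explicit.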

Page 32: Base SAS Programming Fundamentals

Using SAS functions

SUM function – Calculates the sum of its arguments.

e.g. – Total = sum(FirstClass,Economy);

The SUM function ignores missing values, so it returns a total even when some arguments are missing, whereas the + operator returns a missing value if any operand is missing.

TODAY() – Obtains the current date from the system clock as a SAS date value.

MDY(month,day,year) – Uses numeric month, day and year values to return the corresponding SAS date value.

YEAR(SAS date) – Extracts the year from a SAS date and returns a four-digit value.

QTR(SAS date) – Extracts the quarter from a SAS date and returns a number from 1 to 4.

Page 33: Base SAS Programming Fundamentals

MONTH(SAS date) – Extracts the month from a SAS date and returns a number from 1 to 12.

WEEKDAY(SAS date) – Extracts the day of the week from a SAS date and returns a number from 1 to 7, where 1 is Sunday, 2 is Monday and so on.
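The functions above can be checked with a single DATA step. This is a sketch only; work.func_demo, its variable names and the sample values are assumptions for illustration.

```sas
/* Illustrative only: work.func_demo and its values are hypothetical. */
data work.func_demo;
   FirstClass = 20;
   Economy    = .;                         /* missing value */
   total_plus = FirstClass + Economy;      /* + returns missing when an operand is missing */
   total_sum  = sum(FirstClass, Economy);  /* SUM ignores the missing argument: 20 */

   d   = mdy(10, 16, 2001);                /* SAS date value for 16 October 2001 */
   yr  = year(d);                          /* 2001 */
   q   = qtr(d);                           /* 4 */
   mon = month(d);                         /* 10 */
   dow = weekday(d);                       /* 3, a Tuesday (1 is Sunday) */
   format d date9.;
   put total_plus= total_sum= d= yr= q= mon= dow=;
run;
```

The log also shows a note that missing values were generated for the + operation, which is a useful hint when totals unexpectedly come out missing.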

Page 34: Base SAS Programming Fundamentals

Dropping and Keeping variables

The DROP and KEEP statements control which variables are written to the new data set.

General form – DROP variables; or KEEP variables;

Example –

data test_new;
   set ia.dwflax;
   drop FirstClass Economy;
   Total = FirstClass + Economy;
run;

The above code creates a new data set that contains the Total variable but not the FirstClass and Economy variables. The dropped variables are still available for calculations inside the DATA step; they are only left out of the output.

Page 35: Base SAS Programming Fundamentals

Conditional processing

IF-THEN-ELSE statements are used to execute statements conditionally for each observation.

Example –

data flightrev;
   set ia.dwflax;
   total=sum(Firstclass,Economy);
   if Dest='LAX' then
      revenue=sum(2000*Firstclass,1200*Economy);
   else if Dest='DFW' then
      revenue=sum(1500*Firstclass,900*Economy);
run;

Page 36: Base SAS Programming Fundamentals

Executing a set of conditional statements

DO and END statements are used to execute a group of statements when a condition is true.

Example –

data flightrev;
   set ia.dwflax;
   total=sum(Firstclass,Economy);
   if Dest='DFW' then do;
      revenue=sum(1500*Firstclass,900*Economy);
      city='Dallas';
   end;
   else if Dest='LAX' then do;
      revenue=sum(2000*Firstclass,1200*Economy);
      city='Los Angeles';
   end;
run;

Page 37: Base SAS Programming Fundamentals

Variable Lengths

At compile time, the length of a variable is determined the first time the variable is encountered. To control the length, specify it with a LENGTH statement before the first assignment.

e.g. – In the previous example the first value assigned to city is 'Dallas', so the length of city is 6 and 'Los Angeles' is truncated to 'Los An'. To avoid this, specify the length of city before the IF condition –

length city $ 11;

The $ indicates a character variable.
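The truncation can be reproduced side by side with the fix. In this sketch work.len_demo, CityShort and CityFixed are hypothetical names; the two input rows are made up.

```sas
/* Illustrative only: work.len_demo and its variables are hypothetical. */
data work.len_demo;
   input Dest $;
   length CityFixed $ 11;              /* declared before first use: room for 11 characters */
   if Dest = 'DFW' then do;
      CityShort = 'Dallas';            /* first assignment sets the length of CityShort to 6 */
      CityFixed = 'Dallas';
   end;
   else if Dest = 'LAX' then do;
      CityShort = 'Los Angeles';       /* truncated to 6 characters */
      CityFixed = 'Los Angeles';       /* stored in full */
   end;
   datalines;
DFW
LAX
;
run;

proc print data=work.len_demo;
run;
```

The length comes from the first assignment in the program text, not from the first row of data, so the order of the input rows does not change the result.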

Page 38: Base SAS Programming Fundamentals

Deleting or Selecting Rows

Rows can be deleted using a DELETE statement with an IF condition.

Example –

In the previous example we can add one more statement after the total assignment –

if total le 175 then delete;

This statement deletes the rows for which the value of total is less than or equal to 175.

Similarly, rows can be selected with a subsetting IF statement, i.e. an IF statement without THEN.

Example –

if total gt 175;

In the same way, date values can be compared with a date constant written in the form 'ddMMMyyyy'd.

Page 39: Base SAS Programming Fundamentals

Concatenating SAS data sets

Steps for concatenating data sets –

• Use the SET statement in a DATA step to concatenate SAS data sets.

• Use the RENAME= data set option to change the names of variables.

• Use the SET and BY statements to interleave data sets.

• General form –

DATA SAS-data-set;
   SET SAS-data-set1 SAS-data-set2;
run;

• The above code works much like UNION ALL in an SQL query: the observations of the second data set are appended after those of the first, and duplicates are kept.

Page 40: Base SAS Programming Fundamentals

Example –

data newhires;
   set na1 na2;
run;

If the number and names of the variables are the same in na1 and na2, newhires contains all the variables, with the data from na2 following the data from na1.

If the variable names differ, they can be aligned with the RENAME= data set option. E.g. if na1 contains Name, Gender and Jobcode and na2 contains Name, Gender and Jcode, then Jcode can be renamed to Jobcode.

Page 41: Base SAS Programming Fundamentals

Example –

data newhires;
   set na1 na2(rename=(Jcode=Jobcode));
run;

The data sets can also be interleaved by adding a BY statement. Each input data set must already be sorted by the BY variable.

data newhires;
   set na1 na2(rename=(Jcode=Jobcode));
   by name;
run;

The above code produces the newhires data set in name order.
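A self-contained sketch of concatenation with RENAME= and interleaving. The slides do not show the contents of na1 and na2, so the rows below and the WORK library names are invented for illustration.

```sas
/* Illustrative only: the rows in work.na1 and work.na2 are made-up sample data. */
data work.na1;
   input Name $ Gender $ Jobcode $;
   datalines;
Adams M PILOT1
Lee F FLTAT1
;
run;

data work.na2;
   input Name $ Gender $ Jcode $;
   datalines;
Brown F FLTAT2
Moore M PILOT2
;
run;

/* Interleave: both inputs are already in Name order, as the BY statement requires. */
data work.newhires;
   set work.na1 work.na2(rename=(Jcode=Jobcode));
   by Name;
run;

proc print data=work.newhires;
run;
```

Removing the BY statement gives plain concatenation instead: all rows of work.na1 followed by all rows of work.na2.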

Page 42: Base SAS Programming Fundamentals

Merging Data Sets

The MERGE statement is used to combine corresponding observations from two or more data sets.

General form –

DATA SAS-data-set;
   MERGE SAS-data-sets;
   BY by-variable(s);
run;

The resulting data set contains the BY variable and all other variables from the input data sets. Observations that share a BY value are combined into one observation; where a BY value occurs in only one data set, the variables from the other data sets are set to missing. The input data sets must be sorted or indexed by the BY variable.

In this respect the MERGE statement works much like a join in SQL.

Page 43: Base SAS Programming Fundamentals

Conditional merging

The IN= data set option is used to determine which data sets contribute to the current observation. With it we can decide whether the merge behaves like a left join, a right join or any other condition.

Example –

data work.combine;
   merge ia.gercrew(in=InCrew)
         work.gersched(in=InSched);
   by EmpId;
   if InSched=1;
run;

The IN= option creates a temporary variable that is 1 when that data set contributed to the current observation and 0 otherwise. The IF condition writes an observation to the resulting data set only when gersched contributed to it, so every observation from gersched is kept, as in a right join.
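The effect of the IN= variables is easiest to see on tiny inputs. In this sketch work.crew_demo, work.sched_demo, work.combine_demo and their rows are all hypothetical stand-ins for the gercrew and gersched tables.

```sas
/* Illustrative only: all data set names and rows here are made-up sample data. */
data work.crew_demo;
   input EmpId $ Name $;
   datalines;
E01 Ann
E02 Bob
E03 Cyd
;
run;

data work.sched_demo;
   input EmpId $ Flight $;
   datalines;
E02 439
E03 921
E04 114
;
run;

/* Both inputs are already sorted by EmpId, as match-merging requires. */
data work.combine_demo;
   merge work.crew_demo(in=InCrew)
         work.sched_demo(in=InSched);
   by EmpId;
   if InSched = 1;   /* keep only observations to which the schedule contributed */
run;

proc print data=work.combine_demo;
run;
```

E01 is dropped because it has no schedule row, while E04 is kept with a missing Name. Using if InCrew=1 and InSched=1; instead would keep only the matches E02 and E03.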

Page 44: Base SAS Programming Fundamentals

Additional Features

In addition to one-to-one merges, there can be one-to-many and many-to-many merges.

In a one-to-many merge, a BY value that is unique in one data set has several matches in the other data set. The result has one observation per match, each repeating the values from the first data set alongside the different values from the second.

In a many-to-many merge, a BY value occurs more than once in both data sets. Observations are paired in sequence within the BY group; when one data set runs out of observations for that value, its last observation is matched with the remaining observations of the other data set.

Page 45: Base SAS Programming Fundamentals


Summary Reports

Summary report procedures used are –

Proc Freq – Calculates frequency counts.

Proc Means – Produces simple statistics.

Proc Report – Produces flexible, detailed and summary reports.

Page 46: Base SAS Programming Fundamentals

Proc Freq

The FREQ procedure displays frequency counts of the data values in a SAS data set. By default it –

- Analyzes every variable in the SAS data set.

- Displays each distinct data value.

- Calculates the number of observations in which each data value appears and the corresponding percentage.

- Indicates for each variable how many observations have missing values.

Example –

proc freq data=ia.dfwlax;
run;

Page 47: Base SAS Programming Fundamentals

Features of proc freq

We can limit the variables for which frequencies are shown.

The TABLES statement limits the analysis to the listed variables. SAS creates a separate frequency table for each variable named in the TABLES statement, separated by spaces.

Example –

proc freq data=ia.dfwlax;
   tables economy flight;
run;

The NLEVELS option displays the number of levels, i.e. the number of distinct values, of each variable.

The NOPRINT option suppresses the frequency tables; it is generally used with NLEVELS when only the number of levels is required.

Example –

proc freq data=ia.dfwlax nlevels;
   tables _all_ / noprint;
   title 'Number of levels';
run;

Formats can also be used when displaying frequency reports.

Page 48: Base SAS Programming Fundamentals

Cross tabular frequency

A cross tabular frequency report analyzes all possible combinations of the distinct values of two variables.

Example –

proc format;
   value $codefmt
      'FLTAT1'-'FLTAT2' = 'Flight Attendant'
      'PILOT1'-'PILOT2' = 'Pilot';
   value money
      low-<25000  = 'Less than 25000'
      25000-50000 = '25,000 to 50,000'
      50000<-high = 'More than 50000';
run;

proc freq data=ia.crew;
   tables jobcode*salary;
   format jobcode $codefmt. salary money.;
run;

The CROSSLIST option can be added after the slash in the TABLES statement, in the same way as NOPRINT, to display the result in list form.

Page 49: Base SAS Programming Fundamentals

Proc Means

By default this procedure gives the number of non-missing observations, mean, standard deviation, minimum and maximum for every numeric variable in the SAS data set. Additional statistics that can be requested include range, median, sum and nmiss (number of missing values).

The VAR statement limits the output to the listed variables, and the CLASS statement groups the output by the values of the class variable.

Example –

proc means data=ia.crew;
   var salary;
   class jobcode;
   title 'Salary for Job code';
run;

Page 50: Base SAS Programming Fundamentals

Proc Report

The REPORT procedure enables –

- Creating listing reports.

- Creating summary reports using the SUM, GROUP and ORDER usages.

- Enhancing reports.

- Requesting separate subtotals and grand totals.

Extra features provided by the REPORT procedure in comparison to the PRINT procedure are –

1. Summary reports.

2. Cross tabular reports.

3. Sorting data for the report.

Page 51: Base SAS Programming Fundamentals

Report procedure

The default listing displays –

- Each data value as it is stored in the data set.

- Variable names as report column headings.

- A default width for columns.

- Character values left justified.

- Numeric values right justified.

Printing selected variables –

The COLUMN statement prints only the selected variables, in the order in which they are specified.

Example –

title 'Salary Analysis';
proc report data=ia.crew;
   column Jobcode Location Salary;
run;

Page 52: Base SAS Programming Fundamentals

Define statement

Reports can be enhanced by assigning attributes to report items with the DEFINE statement.

General form – DEFINE variable / <attribute list>;

Functions of the DEFINE statement –

1. Format – Formats the variable; the default is the format stored in the SAS data set.

2. Width – Sets the column width; the default is the variable length for character variables and 9 for numeric variables, or the width of the format stored in the data set.

3. Order – Orders the rows by the values of that variable, ascending by default; DESCENDING must be requested explicitly. Repeated values are suppressed.

Page 53: Base SAS Programming Fundamentals

Group – The GROUP usage can be applied to several variables; groups are nested in the order in which the variables are listed. ORDER cannot be used together with GROUP. GROUP also displays the sum of the numeric variables for each group; if no group is defined, the grand total of the numeric values is displayed.

Sum – Prints the sum of all values.

Mean – Displays the mean of all values.

N – Displays the number of non-missing values.

Max – Displays the maximum value.

Min – Displays the minimum value.

Page 54: Base SAS Programming Fundamentals

RBREAK

The RBREAK statement is used for the following purposes –

1. Adding a grand total at the top or the bottom of the report.

2. Adding a line before the grand total.

3. Adding a line after the grand total.

General form – RBREAK BEFORE | AFTER </ options>;

Options –

1. SUMMARIZE – Prints the total.

2. OL – Prints a single line above the total.

3. DOL – Prints a double line above the total.

4. UL – Prints a single line below the total.

5. DUL – Prints a double line below the total.
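COLUMN, DEFINE and RBREAK are usually combined in one step. The following is a minimal sketch on invented data; work.crew_demo, its rows and the column headings are assumptions and do not come from the ia.crew table.

```sas
/* Illustrative only: work.crew_demo and its rows are made-up sample data. */
data work.crew_demo;
   input Jobcode $ Location $ Salary;
   datalines;
PILOT1 LONDON 60000
PILOT1 CARY 65000
FLTAT1 CARY 25000
FLTAT1 LONDON 27000
;
run;

proc report data=work.crew_demo nowd;        /* NOWD: run without the interactive window */
   column Jobcode Salary;                    /* only these columns, in this order */
   define Jobcode / group 'Job Code';        /* one report row per Jobcode value */
   define Salary  / sum format=dollar12. 'Total Salary';  /* summed within each group */
   rbreak after / summarize dol;             /* grand total at the end, double line above */
run;
```

The line options such as DOL take effect in the traditional listing output; other ODS destinations may ignore them.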

Page 55: Base SAS Programming Fundamentals

Introduction to Graphics – Bar and Pie Charts

The GCHART procedure is used to produce a chart. With it you –

1. Specify the form of the chart.

2. Identify the chart variable.

3. Optionally identify an analysis variable.

General form –

proc gchart data=SAS-data-set;
   HBAR | VBAR | PIE chart-variable </ options>;
run;

This produces a chart with one bar or pie slice for each value of the chart variable, where the length of the bar or the size of the slice depends on the frequency of that value.

For a numeric chart variable SAS automatically divides the values into intervals, identifies their midpoints and creates one bar for each midpoint. To avoid this, use the DISCRETE option.

Page 56: Base SAS Programming Fundamentals

Options Contd.

SUMVAR= – Specifies the summary (analysis) variable; the bars then represent this variable instead of the frequency.

TYPE= – Used along with SUMVAR= to specify which statistic of the summary variable is charted for each value of the chart variable, e.g. MEAN | SUM.

Example –

proc gchart data=ia.crew;
   vbar Jobcode / sumvar=Salary type=mean;
run;

The above code produces a vertical bar chart with Jobcode as the chart variable, where the length of each bar is the mean of Salary for that Jobcode.

FILL= – Used with pie charts to specify whether the slices are filled with a solid (FILL=S) or a cross-hatched (FILL=X) pattern.

EXPLODE= – EXPLODE='value' pulls the slice for that particular value away from the rest of the pie.

Page 57: Base SAS Programming Fundamentals


Producing PLOTS

GPLOT is used to plot one variable against another variable using coordinate axis.

General Form –

Proc GPLOT data=SAS data set;

PLOT vertical variable* horizontal variable </Options>;

Run;

You can –

1. Specify the symbol to represent data.

2. Use different methods of interpolation.

3. Specify line styles, colors and thickness.

4. Draw reference lines within the axes.

5. Place one or more plot lines within the axes.

Page 58: Base SAS Programming Fundamentals

Example

proc gplot data=ia.Flight;
   where date between '02mar2001'd and '08mar2001'd;
   plot Boarded * Date;
   title 'Total Passengers for flight 114';
   title2 'between 02mar2001 and 08mar2001';
run;

This plots Boarded against Date for the specified flight dates.

By default the plot symbol is a plus sign (+) and the points are shown without any interpolation.

Page 59: Base SAS Programming Fundamentals

Options

SYMBOL – Options that the SYMBOL statement can take are –

1. VALUE= – Specifies the plot symbol, which can be plus (default), star, diamond, square, triangle or none.

2. I= – Specifies the interpolation method, e.g. I=join | needle | spline.

3. WIDTH= (W=) – Specifies the width of the line.

4. COLOR= (C=) – Specifies the color of the line.

Example –

proc gplot data=ia.Flight;
   plot Boarded * Date;
   symbol value=square i=join w=2 c=red;
   title 'Total Passengers for flight 114';
run;

Page 60: Base SAS Programming Fundamentals


Controlling Axis

We can use the following options with PLOT statement –

1. HAXIS – It scales the horizontal axis.

2. VAXIS – It scales the vertical axis.

3. CAXIS – Specifies color of both the axes.

4. CTEXT – Specifies the color of text on both axes.

Example –

Plot Boarded * Date / Vaxis = 100 to 200 by 25 ctext=blue;

Page 61: Base SAS Programming Fundamentals

Outputting Observations

By default a DATA step writes the contents of the program data vector (PDV) to the output data set at the end of each iteration (implicit output). If an explicit OUTPUT statement is written, it overrides the implicit output.

General form – OUTPUT <SAS-data-set1> <SAS-data-set2> ...;

The OUTPUT statement can be used to –

1. Create two or more SAS observations from each line of input.

2. Write observations to multiple SAS data sets.

Example –

Page 62: Base SAS Programming Fundamentals

62

Data forecast;

drop numemps;

set prog2.growth;

year=1;

Newtotal=Numemps *(1 + increase);

output;

year=2;

Newtotal=newtotal*(1 + increase);

output;

year=3;

Newtotal=newtotal*(1 + increase);

output;

Run;

Page 63: Base SAS Programming Fundamentals

63

Output statement is used to write observations to desired data sets.

Example –

data army navy airforce;

drop type;

set prog2.military;

if type eq 'Army' then

output army;

else if type eq 'Navy' then

output navy;

else if type eq 'Air force' then

output airforce;

run;

Writing to multiple data sets

Page 64: Base SAS Programming Fundamentals

64

The FIRSTOBS= and OBS= data set options can be used to control which observations are read from a data set.

OBS= option – Set prog2.military(obs = 25); this statement reads only the first 25 observations of the input data set.

FIRSTOBS= option – Set prog2.military (firstobs=11 obs=25); this statement starts reading at the 11th observation of the input data set and stops after the 25th observation, so 15 observations are read.
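As a small self-contained illustration of these two options, the sketch below reads from SASHELP.CLASS, a sample data set shipped with SAS, rather than from prog2.military. The output data set name is made up for the example.

```sas
/* Read observations 11 through 15 of the sample data set SASHELP.CLASS. */
/* OBS= names the last observation to read, not the number to read.      */
data work.class_subset;                 /* hypothetical output name */
   set sashelp.class(firstobs=11 obs=15);
run;

proc print data=work.class_subset;
   title 'Observations 11 to 15 of SASHELP.CLASS';
run;

title;                                  /* clear the title afterwards */
```

Because OBS= is the number of the last observation, the step above reads five observations, not fifteen.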

Page 65: Base SAS Programming Fundamentals

65

Writing to an external file

Data can be written to an external file using either ODS method or FILE statement.

ODS method –

ods csvall file='raw-data-file';

proc print data=prog2.maysales noobs;

format listdate selldate date9.;

run;

ods csvall close;

File statement –

data _null_;

set prog2.maysales;

file 'raw-data-file';

put description listdate : date9.;

run;

Page 66: Base SAS Programming Fundamentals

66

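_N_ automatic variable and the END= option - _N_ counts the iterations of the data step. END=ISLAST on the SET statement creates a temporary variable ISLAST that equals 1 when the last observation is being processed.

data _null_;

set prog2.maysales end=IsLast;

file 'raw-data-file';

if _N_=1 then

put 'Description' 'ListDate';

put description listdate : date9.;

if IsLast = 1 then

put 'End of data';

run;

Specifying delimiter – DLM= option is used to specify the delimiter in the file.

Example – file 'raw-data-file' DLM=',';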

Page 67: Base SAS Programming Fundamentals

67

Summarizing data

Creating an accumulating variable – We can use RETAIN statement to create a variable having a running sum of another numeric variable.

Retain statement –

1. Retains the value of the variable in the PDV across iterations of the data step.

2. Initializes retain variable to missing if no default value is specified.

Example –

data mnthtot;

set prog2.daysales;

retain mth2dte 0;

mth2dte=mth2dte+saleamt;

run;

The above code will create a new variable mth2dte holding a running sum of saleamt. However, if saleamt has a missing value, all subsequent values of mth2dte will be missing. To avoid this, use the sum statement (variable + expression;), which retains the variable automatically, initializes it to 0 and ignores missing values.
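A minimal sketch of the sum statement follows. The input data set is built inline with made-up values (including one missing value) so the step can be run on its own; it stands in for prog2.daysales.

```sas
/* Hypothetical input: four daily sales amounts, one of them missing. */
data work.daysales;
   input SaleAmt;
   datalines;
100
.
250
50
;
run;

data work.mnthtot;
   set work.daysales;
   /* Sum statement: Mth2Dte is retained, starts at 0, and the   */
   /* missing SaleAmt is skipped instead of wiping out the total. */
   Mth2Dte + SaleAmt;
run;

proc print data=work.mnthtot;
run;
```

With these values the running total is 100, 100, 350 and 400; the missing amount leaves the total unchanged.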

Page 68: Base SAS Programming Fundamentals

68

Accumulating totals for a group of data

To accumulate totals for each value of a particular variable, the data must first be sorted by that variable. We can then name it in a BY statement and use an IF statement on FIRST.variable to reset the total at the start of each group, in the following manner.

Example -

data work.divsal(keep= jcode divsal);

set work.salary;

by jcode;

if first.jcode then divsal=0;

divsal + sal;

run;
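The example above writes a running total on every observation. If only one row per group is wanted, one possible variation is to add a subsetting IF on LAST.jcode; this extra statement is not part of the example above. The sketch below builds a small hypothetical work.salary inline and sorts it first, since BY-group processing needs sorted input.

```sas
/* Hypothetical input data, not sorted by jcode. */
data work.salary;
   input jcode $ sal;
   datalines;
FA 2000
PT 3500
FA 1500
PT 4000
;
run;

proc sort data=work.salary;
   by jcode;
run;

data work.divsal(keep=jcode divsal);
   set work.salary;
   by jcode;
   if first.jcode then divsal = 0;   /* reset at the start of each group   */
   divsal + sal;                     /* accumulate within the group        */
   if last.jcode;                    /* output only the last row per group */
run;
```

With these values the output has two rows: FA with 3500 and PT with 7500.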

Page 69: Base SAS Programming Fundamentals

69

Reading delimited raw data file

Common delimiters used are blanks, commas and tab characters. Default delimiter is space.

To tell SAS how to read a data value, we can specify an informat.

To specify an informat, place a colon between the variable name and the informat name. The colon signals SAS to read from delimiter to delimiter.

The length of a character variable can also be specified in advance using a LENGTH statement. With the length declared, the colon and informat are not needed for that variable.

Example – data airplanes;

length ID $5;

infile 'raw data file';

input ID $

Inservice : date9.

passcap cargocap;

run;

Page 70: Base SAS Programming Fundamentals

70

Delimiters and missing data

DLM= option is used to specify the delimiter in the following manner

infile 'raw data file' dlm=':';

If you specify a series of characters in the DLM= option, each of those characters is treated as a delimiter, e.g. – DLM=':!';

If a record has fewer values than the INPUT statement expects, SAS by default moves to the next record to read the remaining values. To avoid this, the MISSOVER option is used; it sets the remaining variables to missing instead.

infile 'raw data file' dlm=':' missover;

If a data value is shorter than the width specified for it, MISSOVER sets it to missing. To keep the partial value instead, we use the TRUNCOVER option in place of MISSOVER.

infile 'raw data file' dlm=':' truncover;

Two consecutive delimiters are treated as one, so to specify a missing value there should be a placeholder, which can be '.' for a numeric field and blank for a character field.

Page 71: Base SAS Programming Fundamentals

71

If placeholder is not present then we can use the DSD option.

Features of DSD option –

1. Sets the default delimiter to comma.

2. Treats consecutive delimiters as missing values.

3. Enables SAS to read values with embedded delimiters if the value is surrounded by double quotes.

Example – infile 'Raw data file' dsd;
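The sketch below shows DSD together with MISSOVER on inline data, so it can be run without an external file. The records and their values are made up for illustration; the variable names follow the airplanes example used earlier.

```sas
data work.airplanes;
   length ID $ 5;
   /* DSD: comma is the delimiter and two commas in a row mean a missing value. */
   /* MISSOVER: a short record leaves the remaining variables missing.          */
   infile datalines dsd missover;
   input ID $ InService : date9. PassCap CargoCap;
   format InService date9.;
   datalines;
50001,04JUL1989,,36000
50002,11JAN1984,255
;
run;

proc print data=work.airplanes;
run;
```

In the first record PassCap is missing because of the consecutive commas; in the second, CargoCap is missing because the record ends early.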

Page 72: Base SAS Programming Fundamentals

72

Controlling when a record loads

SAS loads a new record into the input buffer each time it encounters an INPUT statement.

We can also use forward slash which moves the pointer to next line.

input Lname $20. Fname $10. /

City $10. State $20.;

This code will read Lname and Fname from first line and then move to next line and start reading city and state.

#n moves the pointer to desired line.

input #1 Lname $20. Fname $10.

#2 City $10. State $20.;

This will read Lname and Fname from the first line and City and State from the second line. The cycle repeats for the 3rd and 4th records, and so on until the end of the file.

Page 73: Base SAS Programming Fundamentals

73

If statement can also be used to control loading of observations based on the value of any field.

Example –

input salesid 5. Location $3.;

if Location='USA' then

input Saledate : mmddyy10.

Amount;

if Location='EUR' then

input Saledate : date9.

Amount: comma8.;

The above code reads salesid and location first and then, depending on the value of location, reads saledate and amount with the matching informat. However, every INPUT statement without a trailing @ loads a new record, so the second INPUT reads from the next line instead of the rest of the current line.

For records that satisfy neither condition, saledate and amount will be missing.

Page 74: Base SAS Programming Fundamentals

74

To avoid this scenario, we can use the trailing '@' line-hold specifier.

A trailing @ holds the raw data record in the input buffer until SAS –

1. Executes an INPUT statement with no trailing @, or

2. Reaches the end of the current iteration of the data step.

Input var1 var2 var3 … @;
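A minimal sketch of the trailing @ applied to the location-dependent layout follows. It uses list input on two made-up inline records, so the informat widths and sample values are assumptions for illustration only.

```sas
data work.sales;
   infile datalines;
   /* The trailing @ keeps the current record in the input buffer ... */
   input SalesID Location : $3. @;
   /* ... so the next INPUT continues on the same line.               */
   if Location = 'USA' then
      input SaleDate : mmddyy10. Amount;
   else if Location = 'EUR' then
      input SaleDate : date9. Amount : comma8.;
   format SaleDate date9.;
   datalines;
101 USA 01-20-2001 3295.50
202 EUR 20JAN2001 1,876.30
;
run;

proc print data=work.sales;
run;
```

Each record yields one observation, with the date read by the informat that matches its location.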

Reading multiple observations from one record – Multiple observations can be read from one record if we use the double trailing '@@', which holds the record across iterations of the data step.

Input var1 var2 var3 … @@;

Page 75: Base SAS Programming Fundamentals

75

Data Transformation

SAS provides variable lists, which can be used to refer to a set of variables together.

Numbered range list

X1-Xn – Specifies all variables from X1 to Xn inclusive. It can begin with any number and end with any number as long as the rules for user-supplied variable names are not violated.

Name range lists

X--A – Specifies all variables from X to A.

X-NUMERIC-A – Specifies all numeric variables from X to A.

X-CHARACTER-A – Specifies all character variables from X to A.

Name prefix lists

Sum(of REV:) – Calculates the sum of all the variables that begin with REV.

Special SAS names

_ALL_ – All variables defined in a data step.

_NUMERIC_ – All numeric variables in a data step.

_CHARACTER_ – All character variables in a data step.

Page 76: Base SAS Programming Fundamentals

76

SAS Functions

Substr function – Used to extract a part of string.

General form – Newvar = Substr(string, start,<length>);

Here string can be a string or a variable name, start is the start position and length is the number of characters to be extracted, if length is not written then all characters till end are extracted.

Right/Left function – Used for right justification or left justification.

General form – Newvar = RIGHT(argument);

Here the argument is right justified and its trailing blanks are moved to the start. LEFT does the reverse: leading blanks are moved to the end.

Scan function – SCAN function returns the nth word of a string.

General form – Newvar = SCAN(string , n , <delimiter>);

The delimiter can be omitted; in that case SAS uses a default set of delimiters, which includes the blank and several punctuation characters.

Page 77: Base SAS Programming Fundamentals

77

Concatenation operator - This operator is used to concatenate two or more strings. To concatenate, we can use either (!!) or (||).

General Form – Newvar = String1 !! String2;

Trim function – This function removes trailing blanks from the string.

General form – Newvar = TRIM(argument);

If the argument is blank then it returns a single blank. TRIM does not remove leading blanks; for that we can use a combination of LEFT and TRIM.

Example – Fullname = trim(left(Firstname)) !! ' ' !! Lastname;

CATX function – This function concatenates character strings, removes leading and trailing blanks and inserts separators.

General Form – CATX(separator, string 1,……,string n);

Similar to this CAT concatenates without removing blanks, CATS concatenates and removes leading and trailing blanks and CATT concatenates and removes trailing blanks only.
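The character functions described so far can be tried together in a short data step that writes its results to the log. The sketch below uses a made-up name value; the variable names are chosen for the example only.

```sas
data _null_;
   length First Last $ 15 FullName $ 31;
   Name     = 'Smith, John';              /* hypothetical value               */
   Last     = scan(Name, 1, ',');         /* text before the comma            */
   First    = left(scan(Name, 2, ','));   /* leading blank moved to the end   */
   Initial  = substr(First, 1, 1);        /* first character of First         */
   FullName = catx(' ', First, Last);     /* strips blanks, inserts one space */
   Joined   = trim(First) !! ' ' !! Last; /* same result with TRIM and !!     */
   put FullName= Initial= Joined=;
run;
```

Both FullName and Joined come out as John Smith, and Initial as J. Declaring the lengths up front avoids relying on the default lengths that SCAN and the concatenation would otherwise produce.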

Page 78: Base SAS Programming Fundamentals

78

Find function – This function searches for a specific substring within a string and returns its location if found and returns 0 if not found.

General Form – Position = FIND(target,value,<modifiers>,<start>);

- Modifier can be I or T. I indicates that the search is case insensitive; by default it is case sensitive. T indicates that the search ignores trailing blanks.

- Start identifies the start position of the search; a positive value signifies a forward search and a negative value signifies a backward search.

The INDEX function works the same way as FIND, except that it does not have the modifier and start arguments.

UPCASE function – This converts all the letters in the argument to upper case and has no effect on digits and special characters.

General Form – NewVal = UPCASE(argument);

LOWCASE function converts the text to lowercase.

PROPCASE function converts the text to proper case, that is, the first letter of each word is capitalized and the rest are lowercase.

Page 79: Base SAS Programming Fundamentals

79

TRANWRD function – This function replaces every occurrence of a particular set of characters in a string with another set of characters.

General Form – Desert = Tranwrd(Desert , 'Pumpkin' , 'Apple');

This replaces Pumpkin with Apple in Desert. If the replacement string is longer than the string it replaces, the result may be truncated when the variable receiving it is not long enough, so declare a suitable length in advance.

SUBSTR left side – If the SUBSTR function is used on the left side of the assignment statement then it replaces that substring in the text with the value on the right.

General Form – SUBSTR(string , start , <length>)=value;

Page 80: Base SAS Programming Fundamentals

80

Manipulating numeric values

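Round function - This function rounds the argument to the nearest multiple of the round-off unit.

General Form – NewVar = ROUND(argument,<round-off unit>);

The round-off unit is numeric and positive. It is the unit to round to, not a number of decimal places: a unit of 0.01 rounds to the nearest hundredth, and when the unit is omitted the value is rounded to the nearest integer.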

CEIL function – This function returns the smallest integer greater than or equal to the argument.

Floor function – This function returns the greatest integer less than or equal to the argument.

INT function – This function returns the integer part of the argument.

MEAN function – This returns the mean of all the arguments.

MIN function – This returns the smallest nonmissing value among the arguments.

MAX function - This returns the largest value among the arguments.
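The difference between these functions shows most clearly on a negative value. The sketch below applies them to two made-up numbers and writes the results to the log.

```sas
data _null_;
   x = 6.478;                        /* hypothetical values             */
   y = -6.478;
   r1 = round(x, 0.01);              /* nearest hundredth: 6.48         */
   r2 = round(x);                    /* nearest integer:   6            */
   c1 = ceil(x);    c2 = ceil(y);    /* 7 and -6                        */
   f1 = floor(x);   f2 = floor(y);   /* 6 and -7                        */
   i1 = int(x);     i2 = int(y);     /* 6 and -6 (fraction dropped)     */
   avg = mean(4, ., 8);              /* missing values are ignored: 6   */
   lo  = min(4, ., 8);               /* 4                               */
   hi  = max(4, ., 8);               /* 8                               */
   put r1= r2= c1= c2= f1= f2= i1= i2= avg= lo= hi=;
run;
```

For negative arguments INT behaves like CEIL, while for positive arguments it behaves like FLOOR.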

Page 81: Base SAS Programming Fundamentals

81

Manipulating Date values

Creating SAS date value – MDY function returns a SAS date value from month, day and year given separately.

General Form - Newdate=MDY(month,day,year);

TODAY() – This function returns the current date as a SAS date value.

Extracting information – We can extract day , month or year from SAS date using DAY(SAS date ), MONTH(SAS date) or YEAR(SAS date) respectively. Similarly we can use QTR and WEEKDAY.

Calculating Interval of Years– YRDIF function calculates year difference between two SAS dates.

General Form – Diff= YRDIF(sdate , edate , basis);

Basis can take following values –

1. 'ACT/ACT' – Uses the actual number of days between the dates and the actual number of days in each year, giving a fractional number of years.

2. '30/360' – Specifies 30 day month and 360 days year.

3. 'ACT/360' – Takes actual number of days and divides it by 360.

4. 'ACT/365' – Takes actual number of days and divides it by 365.

Page 82: Base SAS Programming Fundamentals

82

Converting variable type

The INPUT function is used to convert a character value to a numeric value.

General Form – Numvar=INPUT(source,informat);

The result cannot be assigned back to the variable being converted, because the type of a variable cannot change within a data step. To keep the original name, rename the source variable with the RENAME= data set option and assign the converted value to the original name.

The PUT function is used to convert a numeric value to a character value.

General Form – Charvar=PUT(Source,format);

The same rule applies to the PUT function. The format must match the type of the source, so a numeric source needs a numeric format; the result is always character.
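A minimal sketch of both conversions, including the rename step, is shown below. The data set and variable names (work.raw, GrossPay, CharGross, ID, CharID) and the values are made up for the example.

```sas
/* Hypothetical input: GrossPay is stored as character, ID as numeric. */
data work.raw;
   length GrossPay $ 8;
   GrossPay = '52,000';
   ID = 1024;
run;

data work.converted(drop=CharGross);
   /* Rename the character variable on the way in so that the */
   /* original name is free for the numeric version.          */
   set work.raw(rename=(GrossPay=CharGross));
   GrossPay = input(CharGross, comma8.);   /* character to numeric */
   CharID   = put(ID, 4.);                 /* numeric to character */
run;

proc contents data=work.converted;
run;
```

PROC CONTENTS should list GrossPay as numeric and CharID as character, which confirms the explicit conversions without any automatic-conversion notes in the log.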

Page 83: Base SAS Programming Fundamentals

83

Automatic conversions

Automatic conversion from character to numeric is done in following cases –

1. Assignment to a numeric variable.

2. An arithmetic operation.

3. Logical comparison with a numeric value.

4. A function that takes a numeric argument.

If the character value does not conform to standard numeric notation, the conversion produces a numeric missing value.

Automatic numeric to character conversion is done in following cases –

1. Assignment to a character variable.

2. A concatenation operation.

3. A function that accepts character arguments.

Page 84: Base SAS Programming Fundamentals

84

Do loop Processing

A DO loop is used to eliminate redundant code and to perform repetitive work.

General Form – DO index-variable = start TO stop <BY increment>;

End;

Example- Data invest;

do year = 2001 to 2003;

Capital + 5000;

Capital + (Capital * .075);

end;

run;

The above code will write the final value of Capital into the data set.

If we write output; before the end of do loop then it will write all the intermediate values of Capital in the data set.

Page 85: Base SAS Programming Fundamentals

85

Do While loop – This is used for conditional iteration of a set of statements.

General form – DO WHILE(expression);

END;

The expression is evaluated at the top of the loop, so the loop body executes only if the expression is true.

Do Until loop - This is used for conditional iteration of a set of statements.

General form – DO UNTIL(expression);

END;

The expression is evaluated at the bottom of the loop, so the loop body executes at least once, even if the expression is already true.

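Combining WHILE and UNTIL with an iterative DO – This method is used to avoid an infinite loop, because the loop ends when the index passes the stop value even if the condition is never met.

DO index-variable = start TO stop <BY increment> WHILE | UNTIL (expression);

END;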
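As a sketch of the combined form, the step below reuses the investment example: it stops either after 30 years or as soon as the capital passes a target, whichever comes first. The 30-year limit and the 250000 target are made-up values.

```sas
data work.invest;
   /* The index limits the loop to 30 passes; UNTIL ends it early */
   /* once the target is exceeded (checked after each pass).      */
   do Year = 1 to 30 until (Capital > 250000);
      Capital + 5000;                /* yearly deposit       */
      Capital + (Capital * 0.075);   /* yearly interest      */
   end;
run;

proc print data=work.invest noobs;
   format Capital dollar14.2;
run;
```

The single output observation holds the year in which the loop stopped and the capital at that point.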

Page 86: Base SAS Programming Fundamentals

86

Nested Do loops

Rules for nesting Do loops are –

1. Use different iteration variable for all the Do loops.

2. Make sure that every DO has a corresponding END.

Example – Data invest;

Do Year = 1 to 5;

Capital + 5000;

Do Quarter = 1 to 4;

Capital + (Capital * (.075/4));

End;

Output;

End;

Run;

Page 87: Base SAS Programming Fundamentals

87

SAS arrays

Creating variables with arrays –

Example -

Data percent (drop = qtr);

Set donate;

Total = sum(of qtr1-qtr4);

array contrib(4) qtr1-qtr4;

array percent(4);

do qtr=1 to 4;

percent(qtr)=contrib(qtr)/total;

end;

run;

In the above code, contrib refers to the existing variables qtr1 to qtr4, and percent creates the new variables percent1 to percent4. We can also apply a format to the array variables when they are printed.

Example (in a PROC PRINT step) - var ID Percent1-Percent4;

Format percent1-percent4 percent6.;

The PERCENTw.d format multiplies the value by 100 and adds a % sign at the end.

Page 88: Base SAS Programming Fundamentals

88

Assigning initial values Example –

data compare(drop = qtr goal1-goal4);

set donate;

array contrib(4) qtr1-qtr4;

array diff(4);

array goal(4) goal1-goal4 (10,15,5,10);

do qtr=1 to 4;

diff(qtr) = contrib(qtr) - goal(qtr);

end;

run;

The above code refers to the existing variables qtr1 to qtr4 through contrib, assigns initial values to the new array goal with variable names goal1 to goal4, and calculates the values of the diff array. Initial values are retained until new values are assigned. If fewer initial values are given than the array has elements, the remaining variables are set to missing.

Page 89: Base SAS Programming Fundamentals

89

Temporary arrays

A temporary array can be created when an array is needed only for calculation purposes, e.g. – in the previous example, array goal is an intermediate array and it is not required in the output data set.

For that we can use _TEMPORARY_ instead of the variable names.

Example – array Goal(4) _temporary_ (10,15,5,10);
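Put into a complete step, the temporary array removes the need to drop goal1 to goal4. The sketch below builds a small hypothetical donate data set inline so that it can be run on its own; the values match the shape of the earlier examples but are otherwise made up.

```sas
/* Hypothetical input: one row per donor, four quarterly amounts. */
data work.donate;
   input ID $ Qtr1-Qtr4;
   datalines;
E00224 12 33 22 .
E00367 35 48 40 30
;
run;

data work.compare(drop=Qtr);
   set work.donate;
   array Contrib(4) Qtr1-Qtr4;
   array Diff(4);                           /* creates Diff1-Diff4          */
   array Goal(4) _temporary_ (10,15,5,10);  /* no Goal1-Goal4 are created   */
   do Qtr = 1 to 4;
      Diff(Qtr) = Contrib(Qtr) - Goal(Qtr);
   end;
run;

proc print data=work.compare;
run;
```

Where a quarterly amount is missing, the corresponding difference is also missing and SAS writes a note about missing values to the log.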

Page 90: Base SAS Programming Fundamentals

90

Rotating SAS data set

Input Data Set

ID      QTR1  QTR2  QTR3  QTR4
E00224    12    33    22     .
E00367    35    48    40    30

Output Data Set

ID      QTR  Amount
E00224    1      12
E00224    2      33
E00224    3      22
E00224    4       .
E00367    1      35
E00367    2      48
E00367    3      40
E00367    4      30

Page 91: Base SAS Programming Fundamentals

91

SAS Program for rotation

Data rotate(drop = Qtr1-Qtr4);

Set donate;

array Contrib(4) Qtr1-Qtr4;

do Qtr=1 to 4;

Amount = Contrib(qtr);

Output;

end;

run;

For every observation read from the donate data set in the above code, the array contrib refers to the values of Qtr1 to Qtr4. Inside the loop these values are assigned to Amount one at a time, and on every iteration the current value is written to the output data set along with the value of the Qtr variable.

Page 92: Base SAS Programming Fundamentals

92

Conditional match merging of SAS data sets

Suppose we have two data sets: transact, which holds the week's transactions with account number, transaction type and amount as fields, and branches, which holds the account number and branch location for each account.

Our objective is to create three data sets.

Newtrans, holding the week's transactions, with fields account number, transaction type, amount and branch.

Noactiv, showing accounts with no transaction this week, with fields account number and branch.

Noacct, showing transactions whose account number has no match in branches, with fields account number, transaction type and amount.

Page 93: Base SAS Programming Fundamentals

93

Solution

Data Newtrans

Noactiv(drop = trans amt)

Noacct(drop = branch);

Merge transact(IN = Intrans)

Branches(IN = InBanks);

By actnum;

If Intrans and Inbanks

Then output Newtrans;

Else if Inbanks and not InTrans

then output Noactiv;

Else If Intrans and not Inbanks

then output Noacct;

Run;

Page 94: Base SAS Programming Fundamentals

94

Writing SQL queries in SAS

We can use SQL queries in SAS by enclosing them in PROC SQL; and QUIT;

When two data sets are joined with an SQL query, they need not be sorted. This is in contrast to the MERGE statement, where the input data sets must be sorted by the BY variable.

Example –

Proc SQL;

Select T.Actnum, T.Trans, T.Amt, B.Branch

from Transact T , Branches B

where T.Actnum = B.Actnum;

Quit;

No RUN command is required for an SQL query.

Page 95: Base SAS Programming Fundamentals

95

SAS Macros

Macros construct input for the SAS compiler.

Functions of the SAS macro processor:

• pass symbolic values between SAS statements and steps

• establish default symbolic values

• conditionally execute SAS steps

• invoke very long, complex code in a quick, short way.

Page 96: Base SAS Programming Fundamentals

96

Advantages of SAS macros -

• substitute text in statements like TITLEs

• communicate across SAS steps

• establish default values

• conditionally execute SAS steps

• hide complex code that can be invoked easily.

Page 97: Base SAS Programming Fundamentals

97

Components of SAS macros

Macro variables:

• used to store and manipulate character strings

• follow SAS naming rules

• are NOT the same as DATA step variables

• are stored in memory in a macro symbol table.

Macro statements:

• begin with a % and a macro keyword and end with semicolon (;)

• assign values, substitute values, and change macro variables

• can branch or generate SAS statements conditionally.

Page 98: Base SAS Programming Fundamentals

98

Automatic macro variables

Some of the automatic macro variables are –

SYSDATE – Date on which the SAS session or job started, in date7. format.

SYSDAY – Day of the week on which the SAS session or job started.

SYSDSN/SYSLAST – Last dataset built.

These are the most commonly used macro variables.

Example –

footnote "this report was run on &SYSDAY, &SYSDATE";

The above code resolves to –

footnote "this report was run on Friday, 25JUL08";

Page 99: Base SAS Programming Fundamentals

99

Displaying macro variables

%PUT is used to display macro variables on the log.

Example –

%PUT **** SYSDAY = &SYSDAY;

%PUT **** SYSTIME = &SYSTIME;

%PUT **** SYSDATE = &SYSDATE;

The above code prints –

**** SYSDAY = Friday

**** SYSTIME = 13:42

**** SYSDATE = 25JUL08

Example of PROC CONTENTS using a macro variable –

proc contents data=&SYSLAST;

title "contents of &SYSLAST";

run;

Page 100: Base SAS Programming Fundamentals

100

User defined macro variables

Macro variables can be defined by using %LET statement.

General form - %LET var_name = value;

This variable can be used anywhere using a ‘&’ sign.

Example –

%LET NAME=PAYROLL;

PROC PRINT DATA=&NAME;

TITLE "PRINT OF DATASET &NAME";

RUN;

The above code will substitute NAME with PAYROLL in the proc print procedure and prints the data set.

%STR allows values that contain a semicolon (;).

Example - %LET CHART=%STR(PROC CHART;VBAR EMP;RUN;);

&CHART;

Page 101: Base SAS Programming Fundamentals

101

Defining and Using Macros

%MACRO and %MEND can be used to define macros.

%macro-name (a percent sign followed by the name of the macro) is used to call a macro.

Example –

%MACRO CHART;

PROC CHART DATA=&NAME;

VBAR EMP;

RUN;

%MEND;

%CHART;

%CHART will invoke the macro and run the code inside the definition of the macro.

Page 102: Base SAS Programming Fundamentals

102

Parameterized Macro

Example –

%MACRO CHART(NAME,BARVAR);

PROC CHART DATA=&NAME;

VBAR &BARVAR;

RUN;

%MEND;

%CHART(PAYROLL,EMP);

The above macro resolves to –

PROC CHART DATA=PAYROLL;

VBAR EMP;

RUN;

Page 103: Base SAS Programming Fundamentals

103

Conditional Macro

%IF and %DO can be used inside macro to execute a set of steps conditionally.

Example –

%MACRO PTCHT(PRTCH,NAME,BARVAR);

%IF &PRTCH=YES %THEN

%DO;

PROC PRINT DATA=&NAME;

TITLE "PRINT OF DATASET &NAME";

RUN;

%END;

PROC CHART DATA=&NAME;

VBAR &BARVAR;

RUN;

%MEND;

%PTCHT(YES,PAYROLL,EMP)

Page 104: Base SAS Programming Fundamentals

104

Transferring values between SAS steps

SYMGET and SYMPUT can be used to transfer values between data steps or proc steps.

Example –

%MACRO OBSCOUNT(NAME);

DATA _NULL_;

SET &NAME NOBS=OBSOUT;

CALL SYMPUT('MOBSOUT',OBSOUT);

STOP;

RUN;

PROC PRINT DATA=&NAME;

TITLE "DATASET &NAME CONTAINS &MOBSOUT OBSERVATIONS";

RUN;

%MEND;

%OBSCOUNT(PAYROLL);
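CALL SYMPUT converts a numeric value to character with leading blanks, which then appear in the resolved title. One possible variation, shown here outside a macro and against the sample data set SASHELP.CLASS rather than PAYROLL, is CALL SYMPUTX, which removes leading and trailing blanks before storing the value. The macro variable name is reused from the example above.

```sas
data _null_;
   /* NOBS= is filled in when the step is compiled, so no */
   /* observation has to be read from the data set.       */
   if 0 then set sashelp.class nobs=ObsOut;
   call symputx('MOBSOUT', ObsOut);   /* value stored without blanks */
   stop;
run;

%put SASHELP.CLASS contains &MOBSOUT observations;
```

The %PUT line writes the count to the log, which is a quick way to check the value before using it in a title.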

Page 105: Base SAS Programming Fundamentals

105

Efficiency Techniques

• Selecting observations – Comparison of the IN operator, the OR operator and the WHERE statement when selecting observations.

• Reducing observation length – Comparison between SCAN and SUBSTR function in terms of disk space usage.

• Indexing – Usage of an index in a WHERE statement as compared to an IF statement.

• Compressing – Making a data set from another sorted data set, in the different cases where the input or the output is compressed.

• Subsetting external files – Usage of the IF statement at different stages while subsetting an external file.

• Concatenating data sets – Comparison between simple concatenation, APPEND, INSERT INTO in SQL and UNION.

• Interleaving data sets - Using a separate sort, a BY statement, and ORDER BY with UNION.


Page 106: Base SAS Programming Fundamentals

106

Selecting Observations

When we want to test for different values of a variable using the IF statement, we can choose between the IN operator and the OR operator. The examples below show that the IN operator requires more CPU time. The difference becomes even more important when testing a huge set of records.

PROGRAM 1-A
DATA PRODUCTSALES;
SET DATA1.SALES;
WHERE PRODUCT_ID IN ('111', '142', '152', '165', '166');
RUN;

PROGRAM 1-B
DATA PRODUCTSALES;
SET DATA1.SALES;
IF PRODUCT_ID = '111' OR
PRODUCT_ID = '142' OR
PRODUCT_ID = '152' OR
PRODUCT_ID = '165' OR
PRODUCT_ID = '166';
RUN;

Page 107: Base SAS Programming Fundamentals

107

PROGRAM 1-C
DATA PRODUCTSALES;
SET DATA1.SALES;
WHERE PRODUCT_ID IN ('111', '142', '152',
'165', '166', '411',
'412', '417', '421',
'423', '519', '525',
'526', '733', '736');
RUN;

PROGRAM 1-D
DATA PRODUCTSALES;
SET DATA1.SALES;
IF PRODUCT_ID = '111' OR
PRODUCT_ID = '142' OR
PRODUCT_ID = '152' OR
PRODUCT_ID = '165' OR
PRODUCT_ID = '166' OR
PRODUCT_ID = '411' OR
PRODUCT_ID = '412' OR
PRODUCT_ID = '417' OR
PRODUCT_ID = '421' OR
PRODUCT_ID = '423' OR
PRODUCT_ID = '519' OR
PRODUCT_ID = '525' OR
PRODUCT_ID = '526' OR
PRODUCT_ID = '733' OR
PRODUCT_ID = '736';
RUN;

Page 108: Base SAS Programming Fundamentals

108

Comparison on the basis of time

Program number   Method used and size of data   CPU time elapsed
1-A              5 values – IN operator         1.94 sec
1-B              5 values – OR operator         0.80 sec
1-C              15 values – IN operator        3.92 sec
1-D              15 values – OR operator        0.90 sec

Page 109: Base SAS Programming Fundamentals

109

Subsetting data in a DATA step is possible through the IF statement or the WHERE statement. Usually the WHERE statement is more efficient than the IF statement, because the IF statement is executed on data that is already in the Program Data Vector, whereas the WHERE statement is executed before the data is brought into the Program Data Vector. The following examples show this behavior.

PROGRAM 2-A
DATA CLIENT;
SET DATA1.CLIENT;
IF LAST_NAME = 'VAN BRUSSELS';
RUN;

PROGRAM 2-B
DATA CLIENT;
SET DATA1.CLIENT;
WHERE LAST_NAME = 'VAN BRUSSELS';
RUN;

Page 110: Base SAS Programming Fundamentals

110

PROGRAM 2-C
DATA CLIENT;
SET DATA1.CLIENT;
IF SUBSTR (LAST_NAME, 1, 3) = 'VAN';
RUN;

PROGRAM 2-D
DATA CLIENT;
SET DATA1.CLIENT;
WHERE SUBSTR (LAST_NAME, 1, 3) = 'VAN';
RUN;

PROGRAM 2-E
DATA CLIENT;
SET DATA1.CLIENT;
WHERE LAST_NAME LIKE 'VAN%';
RUN;

There is an exception for the WHERE statement, however. The above examples show that using the SUBSTR function in a WHERE statement increases the CPU time considerably compared to the corresponding IF statement. When a typical WHERE operator (LIKE) is used instead, the same subset is created, but CPU time decreases and performance is again better than with the subsetting IF statement.

Page 111: Base SAS Programming Fundamentals

111

Comparison on the basis of time

Program number   Method used       CPU time elapsed (seconds)
2-A              IF                0.90
2-B              Where             0.07
2-C              IF – SUBSTR       0.11
2-D              Where – SUBSTR    0.22
2-E              Where – LIKE      0.09

Page 112: Base SAS Programming Fundamentals

112

Reducing Observation Length

Several data manipulation functions have 'space leaks': if a LENGTH statement is not specified for the resulting variable, a lot of disk space might be wasted. Two examples illustrate this behavior. In the first example the variable INITIALS contains the output of the SUBSTR function, but the length of this variable equals the sum of the lengths of the contributing variables. As a result, every observation in the output table contains (length of first name + length of last name - 2) redundant blanks. If the length of first name and last name is 20 each, every INITIALS value will have 38 redundant blanks.

PROGRAM 1-A
DATA CLIENT;
SET DATA1.CLIENT;
INITIALS = SUBSTR (FIRST_NAME, 1, 1) !!
SUBSTR (LAST_NAME, 1, 1);
RUN;

PROGRAM 1-B
DATA CLIENT;
SET DATA1.CLIENT;
LENGTH INITIALS $ 2;
INITIALS = SUBSTR (FIRST_NAME, 1, 1) !!
SUBSTR (LAST_NAME, 1, 1);
RUN;

Page 113: Base SAS Programming Fundamentals

113

Some functions – like the SCAN function – create a result with a default length of 200, being the maximum length of a character variable. Following is an example of space wastage in that case.

Some functions – like the SCAN function – create a result with a default length of 200, being the maximum length of a character variable. Following is an example of space wastage in that case.

PROGRAM 1-C
DATA CLIENT;
  SET DATA1.CLIENT;
  COUNTRY = SCAN(CLIENT_ID, 1, '-');
  CITY    = SCAN(CLIENT_ID, 2, '-');
  NUMBER  = SCAN(CLIENT_ID, 3, '-');
RUN;

PROGRAM 1-D
DATA CLIENT;
  SET DATA1.CLIENT;
  LENGTH COUNTRY CITY $ 2
         NUMBER $ 8;
  COUNTRY = SCAN(CLIENT_ID, 1, '-');
  CITY    = SCAN(CLIENT_ID, 2, '-');
  NUMBER  = SCAN(CLIENT_ID, 3, '-');
RUN;

Program number   Method used          Length of variables in different cases
1-A              SUBSTR               20 + 20 = 40
1-B              SUBSTR with LENGTH   2
1-C              SCAN                 3 x 200 = 600
1-D              SCAN with LENGTH     2 + 2 + 8 = 12

Comparison on the basis of size

Indexing

SAS considers an index for a WHERE statement but never for a subsetting IF statement, yet many programs still use an IF statement to subset an indexed table. The smaller the subset returned by the index, the larger the gain in CPU time. In the following examples, a simple index exists on each of the variables SHOP_ID and CUSTOMER_ID. SHOP_ID has only 7 distinct values, whereas CUSTOMER_ID has approximately 80,000. Accessing the data through the index on SHOP_ID returns roughly 15% of the data, so the difference between the WHERE statement (using the index) and the IF statement (performing a sequential search) is small.

PROGRAM 1-A
DATA SALES_B_B;
  SET DATA1.SALES_INDEXED;
  IF SHOP_ID = 'B-B';
RUN;

PROGRAM 1-B
DATA SALES_B_B;
  SET DATA1.SALES_INDEXED;
  WHERE SHOP_ID = 'B-B';
RUN;
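The slides do not show how the indexes were created or how to confirm that one is used. The sketch below is one possible way to do both, under stated assumptions: the generated table SALES_INDEXED in the WORK library, its 1000 rows, and the customer value '00123' are made up for illustration. The INDEX= data set option creates the two simple indexes, and the MSGLEVEL=I system option asks SAS to report index usage in the log.

```sas
/* Write informational notes about index usage to the log */
options msglevel=i;

/* Hypothetical test table with a simple index on each variable */
data sales_indexed (index=(shop_id customer_id));
  length shop_id $ 3 customer_id $ 5;
  do i = 1 to 1000;
    shop_id     = cats('S', mod(i, 7));   /* 7 distinct shops        */
    customer_id = put(i, z5.);            /* one row per customer id */
    output;
  end;
  drop i;
run;

/* WHERE can be optimized with an index; a subsetting IF cannot */
data sales_00123;
  set sales_indexed;
  where customer_id = '00123';
run;
```

SAS decides at run time whether the index is worth using, so on a very small table it may still read sequentially; the log note shows which choice was made.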

Accessing the data through the index on CUSTOMER_ID returns less than 0.01% of the data and is extremely fast compared with the subsetting IF statement.

PROGRAM 2-A
DATA SALES_12345;
  SET DATA1.SALES_INDEXED;
  IF CUSTOMER_ID = '12345';
RUN;

PROGRAM 2-B
DATA SALES_12345;
  SET DATA1.SALES_INDEXED;
  WHERE CUSTOMER_ID = '12345';
RUN;

Program number   Description               CPU time (seconds)
1-A              7 shops, IF               1.31
1-B              7 shops, WHERE            1.02
2-A              100.00 clients, IF        0.76
2-B              100.00 clients, WHERE     0.01

Comparison on the basis of time

Compressing

Compression can be useful when disk space is a problem, but it must be applied sensibly: both compressing and decompressing data cost CPU time. For that reason, avoid specifying COMPRESS = YES in the global OPTIONS statement, where it applies to every data set created in the session; request compression only for the data sets that benefit from it. The following examples illustrate the CPU cost of compression: an input SAS data set is sorted into an output SAS data set.

PROGRAM 1-A
PROC SORT DATA = DATA1.CLIENT
          OUT  = CLIENT;
  BY HOME_CITY;
RUN;

PROGRAM 1-B
PROC SORT DATA = DATA1.CLIENT
          OUT  = CLIENT_COMPRESSED (COMPRESS = YES);
  BY HOME_CITY;
RUN;

PROGRAM 1-C
PROC SORT DATA = DATA1.CLIENT_COMPRESSED
          OUT  = CLIENT;
  BY HOME_CITY;
RUN;

PROGRAM 1-D
PROC SORT DATA = DATA1.CLIENT_COMPRESSED
          OUT  = CLIENT_COMPRESSED (COMPRESS = YES);
  BY HOME_CITY;
RUN;

Program   Description                                      CPU time (seconds)
1-A       Input not compressed, output not compressed      0.51
1-B       Input not compressed, output compressed          0.78
1-C       Input compressed, output not compressed          0.48
1-D       Input compressed, output compressed              0.80

Comparison on the basis of time
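The slides do not say how the CPU times were collected. One common way to obtain comparable figures is the FULLSTIMER system option, which adds CPU time and memory statistics to the log note of every step. The sketch below assumes a generated test table (the data set CLIENT in WORK, its 100,000 rows and its values are hypothetical) and repeats the uncompressed and compressed sort so the two log notes can be compared.

```sas
/* Report real time, CPU time and memory for every step in the log */
options fullstimer;

/* Hypothetical test data: wide, mostly blank character variables */
data client;
  length client_id $ 14 last_name first_name home_city $ 20
         home_country $ 15;
  do i = 1 to 100000;
    client_id    = put(i, z8.);
    last_name    = 'NAME';
    first_name   = 'X';
    home_city    = cats('CITY', mod(i, 500));
    home_country = 'UK';
    output;
  end;
  drop i;
run;

/* Output not compressed */
proc sort data = client out = client_plain;
  by home_city;
run;

/* Output compressed with the data set option, not the global option */
proc sort data = client out = client_compressed (compress = yes);
  by home_city;
run;
```

Actual timings depend on the machine and the data, so the figures in the table should be read as relative rather than absolute.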

Subsetting external files

The INPUT statement, which structures the content of the input buffer into variables in the Program Data Vector, consumes a considerable amount of CPU time. If you need only a subset of the external file, read just the part of the input buffer required by the subsetting condition, and read the rest of the record only when the condition is met. A trailing @ in the INPUT statement holds the contents of the input buffer, so that the next INPUT statement in the same iteration reads from the same record.

PROGRAM 1-A
DATA CLIENT;
  INFILE CLIENT;
  INPUT CLIENT_ID    $  1 - 14
        LAST_NAME    $ 16 - 35
        FIRST_NAME   $ 37 - 56
        HOME_CITY    $ 58 - 77
        HOME_COUNTRY $ 79 - 93
        …;
RUN;
DATA CLIENT_LONDON;
  SET CLIENT;
  IF HOME_CITY = 'LONDON';
RUN;

PROGRAM 1-B
DATA CLIENT_LONDON;
  INFILE CLIENT;
  INPUT CLIENT_ID    $  1 - 14
        LAST_NAME    $ 16 - 35
        FIRST_NAME   $ 37 - 56
        HOME_CITY    $ 58 - 77
        HOME_COUNTRY $ 79 - 93
        …;
  IF HOME_CITY = 'LONDON';
RUN;

PROGRAM 1-C
DATA CLIENT_LONDON;
  INFILE CLIENT;
  INPUT HOME_CITY    $ 58 - 77 @;
  IF HOME_CITY = 'LONDON';
  INPUT CLIENT_ID    $  1 - 14
        LAST_NAME    $ 16 - 35
        FIRST_NAME   $ 37 - 56
        HOME_COUNTRY $ 79 - 93
        …;
RUN;
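The programs above depend on an external file that is not shown. To try the trailing @ technique without that file, the following sketch reads in-stream records instead; the three sample records and the shortened column layout are assumptions for illustration only.

```sas
data client_london;
  infile datalines truncover;

  /* Read only the subsetting field and hold the record with @ */
  input home_city $ 12 - 21 @;

  /* Records that fail the test are dropped before the costly read */
  if home_city = 'LONDON';

  /* Same record, still held: now read the remaining fields */
  input client_id $ 1 - 10
        last_name $ 23 - 32;
  datalines;
UK-LO-0001 LONDON     SMITH
BE-BR-0002 BRUSSELS   PEETERS
UK-LO-0003 LONDON     JONES
;
run;
```

Only the two LONDON records reach the second INPUT statement. When the subsetting IF fails, the DATA step returns to the top and the held record is released, so the next iteration reads a new line.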

Program number   Description                    CPU time (min:sec)
1-A              DATA (Input), DATA (If)        4:22.80
1-B              DATA (Input, If)               2:25.98
1-C              DATA (Input, If, Input)        0:15.91

Comparison on the basis of time

EFFICIENTLY COMBINING DATA - CONCATENATING SAS DATA SETS

Many users know that the APPEND procedure adds a new table directly to a master table without reading or rewriting the master table. Still, they rarely code the APPEND procedure, because they are used to typing the DATA step, which is quick to code. The next example compares the traditional DATA step concatenation with the APPEND procedure and with the OUTER UNION CORR operator in the SQL procedure. The same result can also be created with the SQL INSERT statement, which adds all observations of the second table to the end of the master table.

PROGRAM 1-A
DATA SALES;
  SET SALES DATA1.SALES2003;
RUN;

PROGRAM 1-B
PROC APPEND BASE = SALES
            DATA = DATA1.SALES2003;
RUN;

PROGRAM 1-C
PROC SQL;
  INSERT INTO SALES
    SELECT * FROM DATA1.SALES2003;
QUIT;

PROGRAM 1-D
PROC SQL;
  CREATE TABLE SALES AS
    SELECT * FROM SALES
    OUTER UNION CORR
    SELECT * FROM DATA1.SALES2003;
QUIT;

Program number   Description               CPU time (seconds)
1-A              DATA (Set)                1.65
1-B              Append                    0.11
1-C              SQL (Insert into)         0.59
1-D              SQL (Outer Union Corr)    3.98

Comparison on the basis of time

Interleaving data sets

You can combine two sorted input SAS data sets into a sorted result in several ways. The following example compares three approaches: a DATA step concatenation followed by a SORT procedure, a DATA step with a BY statement (interleaving), and the OUTER UNION CORR operator with an ORDER BY clause in the SQL procedure. As expected, the SQL procedure requires more CPU time than the DATA step.

PROGRAM 1-A
DATA SALES;
  SET DATA1.SALES_B DATA1.SALES_NL;
RUN;
PROC SORT DATA = SALES;
  BY SALES_DATE;
RUN;

PROGRAM 1-B
DATA SALES;
  SET DATA1.SALES_B DATA1.SALES_NL;
  BY SALES_DATE;
RUN;

PROGRAM 1-C
PROC SQL;
  CREATE TABLE SALES AS
    SELECT * FROM DATA1.SALES_B
    OUTER UNION CORR
    SELECT * FROM DATA1.SALES_NL
    ORDER BY SALES_DATE;
QUIT;
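Interleaving with SET and BY only works when every input data set is already sorted by the BY variable. The following self-contained sketch illustrates this with two tiny tables; the data sets SALES_B and SALES_NL in WORK, the variable AMOUNT and all values are hypothetical stand-ins for the DATA1 tables used above.

```sas
/* Hypothetical input 1, already in SALES_DATE order */
data sales_b;
  input sales_date :date9. amount;
  format sales_date date9.;
  datalines;
02JAN2003 100
05JAN2003 250
;
run;

/* Hypothetical input 2, already in SALES_DATE order */
data sales_nl;
  input sales_date :date9. amount;
  format sales_date date9.;
  datalines;
01JAN2003 80
04JAN2003 120
;
run;

/* Interleave: one pass, result stays sorted, no PROC SORT needed */
data sales;
  set sales_b sales_nl;
  by sales_date;
run;

proc print data = sales;
run;
```

The output lists the four observations in date order (01, 02, 04, 05 January). If one of the inputs were not sorted by SALES_DATE, the DATA step would stop with an error that the BY variables are not properly sorted.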

Program number   Description                          CPU time (seconds)
1-A              DATA (Set), Sort                     6.15
1-B              DATA (Set, By)                       2.10
1-C              SQL (Outer Union Corr, Order By)     11.32

Comparison on the basis of time