1Ron Briggs-UTD
Introduction to SAS
a programming environment and language for data manipulation and analysis.
Files referenced here are available in the Green Lab at:
P:\briggs\poec5317
Single best book on SAS: Lora D. Delwiche and Susan Slaughter “The Little SAS Book: A Primer,” SAS Institute, 2nd edition, 1999
First Example Program
Sample SAS program: sasex0.sas
*my first SAS program;
data grocery;
input product $ var1 var2;
label var1='Number on Hand'
var2='Number on Order';
lines;
cheese 25 32
hotdogs 26 14
mustard 13 32
;;;;;;;;
run;
proc print label; run;
proc means; run;
Output file ('printout') from the program file above:

The SAS System                                                   1

                  Number     Number
OBS    PRODUCT    on Hand    on Order
  1    cheese       25         32
  2    hotdogs      26         14
  3    mustard      13         32

The SAS System                 19:13 Tuesday, July 16, 1996      2

Variable  Label              N        Mean      Std Dev      Minimum      Maximum
---------------------------------------------------------------------------------
VAR1      Number on Hand     3  21.3333333    7.2341781   13.0000000    26.000000
VAR2      Number on Order    3  26.0000000   10.3923048   14.0000000    32.000000
---------------------------------------------------------------------------------
SAS Log from First Sample Program: sasex0.log
(with minor editing)
1 *my first SAS program;
3 data grocery;
4 input product $ var1 var2;
5 label var1='Number on Hand'
6 var2='Number on Order';
7 lines;
NOTE: The data set WORK.GROCERY has 3 observations and 3 variables.
NOTE: DATA statement used:
real time 0.148 seconds
cpu time 0.070 seconds
11 ;;;;;;;;
12 run;
13 proc print label; run;
NOTE: The PROCEDURE PRINT printed page 1.
14 proc means; run;
NOTE: The PROCEDURE MEANS printed page 2.
Second Sample Program
Here is data! How do we get it into SAS?
Alabama 1AL ESC 63 3894 3444 3267
Alaska 2AK PAC 94 402 304 228
Arizona 4AZ MTN 86 2718 1776 1306
Arkansas 5AR WSC 71 2286 1924 1787
California 6CA PAC 93 23668 19971 15735
Colorado 8CO MTN 84 2890 2211 1758
Connecticut 9CT NE 16 3108 3033 2536
Delaware 10DE SA 51 594 547 445
D of Columbia 11DC SA 53 638 757 764
Florida 12FL SA 59 9746 6797 4959
Georgia 13GA SA 58 5463 4587 3941
...
10 *create some new variables using assign statements;
11 PPOP6070=(POP70-POP60)/POP60*100;
12 PPOP7080=(POP80-POP70)/POP70*100;
13
14 *label all variables;
15 LABEL STATE='NAME OF STATE'
16 FIPS='FEDERAL INFO. PROCESSING CODE'
17 ST='STATE POSTAL CODE'
18 DIV='CENSUS BUREAU DIVISION'
19 CB='CENSUS BUREAU STATE CODE'
20 POP80='POPULATION IN 1980'
21 POP70='POPULATION IN 1970'
22 POP60='POPULATION IN 1960'
23 PPOP6070='% POP. CHANGE 1960-1970'
24 PPOP7080='% POP. CHANGE 1970-1980';
25 RUN;
NOTE: The infile 'p:\briggs\poec5317\state.dat';
FILENAME='p:\briggs\poec5317\state.dat';
RECFM=V,LRECL=256
NOTE: 51 records were read from the infile
The minimum record length was 80.
The maximum record length was 80.
NOTE: The data set WORK.USPOP has 51 observations and 10 variables.
28 *USE PROC STEPS TO ANALYSE THIS SAS DATA SET;
30 *produce a simple print of the data;
31 PROC PRINT DATA=USPOP;
32 VARIABLES ST POP80 PPOP6070 PPOP7080 ;
33 TITLE1 'US DEMOGRAPHIC CHANGE BY STATE' ;
34 TITLE2 '1960-1980';
35 RUN;
NOTE: The PROCEDURE PRINT used 0.98 seconds
37 *sort and print the data by geographic region;
38 PROC SORT DATA=USPOP; BY CB;
NOTE: The data set WORK.USPOP has 51 observations and 10 variables.
NOTE: The PROCEDURE SORT used 0.44 seconds.
39 PROC PRINT DATA=USPOP(OBS=51) NOOBS LABEL;
40 VARIABLES ST POP80 PPOP6070 PPOP7080;
41 FORMAT PPOP6070 PPOP7080 5.1;
42 FOOTNOTE 'STATES ARE ORDERED ACCORDING TO CENSUS BUREAU DIVISION';
43 RUN;
NOTE: The PROCEDURE PRINT used 0.27 seconds.
44 FOOTNOTE;
47
48 *regression to predict growth in seventies from sixties;
49 PROC REG DATA=USPOP;
50 MODEL PPOP7080=PPOP6070 /STB;
51 TITLE 'Population Growth in the Seventies Predicted from Sixties';
52 RUN;
NOTE: 51 observations read.
NOTE: 51 observations used in computations.
53 *produce descriptive statistics;
NOTE: The PROCEDURE PLOT used 0.48 seconds.
54 PROC MEANS SUM MIN MAX;
55 TITLE 'Population Statistics';
56 RUN;
NOTE: The PROCEDURE MEANS used 0.55 seconds.
Example of Output from sasex1

US Demographic Change by State
1960-1980
OBS   ST   POP80   PPOP6070   PPOP7080
  1   AL    3894     5.4178    13.0662
  2   AK     402    33.3333    32.2368
  3   AZ    2718    35.9877    53.0405
  .
  .
 51   WY     470     0.6061    41.5663

US Demographic Change by State
1960-1980
State                   % POP.      % POP.
Postal   POPULATION     Change      Change
Code     in 1980        1960-1970   1970-1980
ME         1125            2.5        13.2
NH          921           21.6        24.8
VT          511           14.1        15.1
 .
 .
CA        23668           26.9        18.5
AK          402           33.3        32.2
HI          965           21.6        25.3
STATES are Ordered According to Census Bureau Division

Population Growth in the Seventies Predicted from Sixties
Model: MODEL1
Dependent Variable: PPOP7080   % POP. Change 1970-1980

Analysis of Variance
                     Sum of          Mean
Source     DF       Squares        Square     F Value    Prob>F
Model       1    4600.88079    4600.88079      36.285    0.0001
Error      49    6213.20649     126.80013
C Total    50   10814.08729

Root MSE    11.26056    R-square    0.4255
Dep Mean    15.78436    Adj R-sq    0.4137
C.V.        71.33995

Parameter Estimates
                Parameter    Standard    T for H0:                   Standardized
Variable   DF   Estimate     Error       Parameter=0   Prob > |T|    Estimate
• SAS names (for variables, datasets, etc)
  – 8 characters or less
  – start with alpha character
  – no special characters except _
• SAS variables: 2 types
  – character (max. of 200 characters long)
  – numeric (default 8 bytes long)
Some of these rules for names and variables are less stringent in newer versions of SAS.
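A minimal sketch of these rules in use. The data set name namedemo, the variable names, and the data values are made up for illustration; all names stay within 8 characters, start with a letter, and use only the underscore as a special character.

```sas
*sketch: naming rules and the two variable types;
data namedemo;
   length city $ 20;        /* character variable, up to 20 characters */
   input city $ pop_80;     /* pop_80 is numeric, 8 bytes by default   */
   datalines;
Dallas 904
Houston 1595
;
run;

proc contents data=namedemo;   /* lists each variable with its type and length */
run;
```

The PROC CONTENTS report should show city as a character variable of length 20 and pop_80 as a numeric variable of length 8.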
SAS Jobs
raw data
   | read by
SAS DATA step      e.g.  Data State; input pop80 pop90; growth=pop90-pop80;
   | creates
SAS DATA SET       e.g.  State
   | analyzed by
SAS PROC step      e.g.  Proc MEANS; variables pop80 pop90;
   | generates
report/output etc.
Raw Data & SAS Data Sets
• data may exist in one of two forms:
  – as a raw data file, managed by the operating system (e.g. Windows or UNIX)
  – as a SAS data set, managed by SAS
• SAS data sets are matrices (tables):
  – variables occupy the columns
    • each variable is identified by a name (e.g. Population)
  – observations occupy the lines or rows (e.g. Texas)
    • there is no limit to the number of observations
    • SAS always processes just one observation at a time
An entry (matrix element) in this table is called a data value.
Once data are in a SAS data set, only the meaning of the variable names needs to be remembered--all physical details of the raw data file can be forgotten.
SAS data set
descriptors (reported by proc contents;):
  data set name
  variable attributes: name, type, length, format, informat, label
  history: source statements used to create the data set
data (reported by proc print;), with variables in columns:
         Name    Gender   Age   Height   Weight
obs1     Jane    F        11    57.3     83
obs2     John    M        12    59.0     99.5
...
obs19    Alice   F        13    56.5     84.0
SAS Procedures can only process SAS data sets.
A data step is used to read raw data into a SAS data set.
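As a small sketch of the descriptor/data distinction, the step below reads the three sample rows shown on this slide into a SAS data set (the name class is an assumption) and then reports each portion separately.

```sas
*sketch: one data step, then one report per portion of the data set;
data class;
   input name $ gender $ age height weight;
   datalines;
Jane F 11 57.3 83
John M 12 59.0 99.5
Alice F 13 56.5 84.0
;
run;

proc contents data=class;   /* descriptor portion: names, types, lengths */
run;

proc print data=class;      /* data portion: the observations themselves */
run;
```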
Example Structure of a SAS Program
*my program;                     /* comment statement--not part of PROC or DATA step */
OPTIONS nocenter linesize=80;    /* establishes options--not part of PROC or DATA step */
DATA sasdata1;                   /* start data step & name sas data set being created */
  INFILE 'c:\rawdata';           /* location of raw data file */
  INPUT var1 var2;               /* identify variables within raw data set */
  ….                             /* other data step statements */
RUN;                             /* end first data step */
DATA sasdata2;                   /* start second data step */
  …..
RUN;                             /* end second data step */
DATA sasdata;                    /* start third data step */
  MERGE sasdata1 sasdata2;       /* combine first and second data sets into 'master' set */
  …
RUN;                             /* end third data step */
PROC first;                      /* begin first PROC step to analyze data */
  ….
RUN;                             /* end first PROC step */
PROC second;                     /* begin second PROC step for more analysis */
RUN;                             /* end second PROC step */
How SAS data sets are built: one observation at a time
From another SAS data set
• the first observation is selected from the source data set
• it is processed through all statements in the data step (up to the run;)
  – it is output to the data set(s) being created
• the second observation from the source is selected, etc.
From a raw data file
• the first n logical records (based on the INPUT statement) are selected from the source
• they are processed through all statements in the data step
  – they are output to the data set(s) being created, usually as one observation
• the next n logical records from the source are selected, etc.
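The loop described above can be made visible with a short sketch. The data set name growth is an assumption; the three records reuse POP70 and POP80 values from the state data shown earlier. The PUT statement writes to the SAS log once per pass through the data step, and _N_ is the automatic counter of those passes.

```sas
*sketch: watch the data step process one observation per pass;
data growth;
   input state $ pop70 pop80;
   change = pop80 - pop70;
   put _n_= state= change=;   /* one log line for each observation */
   datalines;
AL 3444 3894
AK 304 402
AZ 1776 2718
;
run;
```

The log should contain three such lines, one for each record read, before the NOTE reporting that the data set has 3 observations.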
How SAS data sets are named: temporary v. permanent
• Temporary (duration of program execution only)
--single level name: datasetname e.g. USPOP
• Permanent (saved to disk)
--double level name: libraryname.datasetname e.g. POPDATA.USPOP
Using Permanent SAS data sets
Once a SAS data set has been saved permanently (by using a two-level name), it can be recalled for use in PROCs or for further modification.
*recall and use in a PROC;
LIBNAME POPDATA 'c:\sasdata\usstate\';
PROC MEANS DATA=POPDATA.USPOP; RUN;
*recall and further modify (transform) the data set temporarily;
LIBNAME POPDATA 'c:\sasdata\usstate\';
DATA USPOP;
SET POPDATA.USPOP;
lines of code
RUN;
*recall, modify and save new version permanently;
LIBNAME POPDATA 'c:\sasdata\usstate\';
DATA POPDATA.USPOPV2;
SET POPDATA.USPOP;
lines of code
RUN;
The new version could keep the same name (USPOP) instead of being called USPOPV2. This is dangerous because it overwrites the original, but may be necessary if disk space is short.
If the POPDATA library has been defined in SAS Explorer, the LIBNAME statement may be omitted.
For SAS V7, the folder 'c:\sasdata\usstate\' contains a file called USPOP.SAS7BDAT.
Third Example SAS Program: sasex2A.sas (1 of 3)
* read in some data from an O/S file and create a temporary SAS data set called USPOP;
*illustrates use of:
--fixed field format
--multiple lines of input per observation;
DATA USPOP;
INFILE 'p:\appsdata\briggs\poec5317\state2.dat';
INPUT STATE $ 1-17 FIPS 18-20 ST $ 21-23 DIV $ 24-27 CB 28-30 POP80 31-36 POP70 37-42 POP60 43-48
#2 UNEMP80 3.1 AGE2040 5.1 @10 INC80 6.0;
/* describes the data in the file*/
RUN;
*it is good practice to print some of the data read and check it;
PROC PRINT DATA=USPOP(OBS=5); /* prints first five observations.*/
TITLE 'Check of Data Read In';
* read additional data into a second SAS data set called POP85;
*illustrates use of:
--free-field format
--identification of O/S file on the INFILE statement
--FILENAME statement to assign an 'alias' to a file;
DATA POP85;
INFILE in1; /*uses nickname to identify the data file */
INPUT POP85 POP82 FIPS;
DROP POP82;
RUN;
/************************************************
With free-field format, variables don't necessarily occupy the same column position (field) on each line.
This can be convenient but also dangerous, since a single error (e.g. a missing space or data value) can result in ALL subsequent data being read incorrectly.
************************************************/
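A hedged sketch of one way to reduce this risk. The data values below are invented for illustration. A period marks a missing value so that later values stay in their proper positions, and the MISSOVER option on the INFILE statement stops SAS from moving to the next line to find values when a line runs short.

```sas
*sketch: free-field input with a missing value marked by a period;
data pop85;
   infile datalines missover;   /* short lines give missing values, not a jump to the next line */
   input pop85 pop82 fips;
   datalines;
4021 3943 1
521 . 2
3187 2860 4
;
run;

proc print data=pop85;
run;
```

If the period on the second line were left out, POP82 would be read as 2. By default (without MISSOVER) SAS would then take FIPS from the next line, so every later observation would be misread.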
*We now combine together our two data sets into a single ‘research’ data set;
PROC SORT DATA=POP85;BY FIPS;
DATA USPOP;
MERGE POP85 USPOP;
BY FIPS;
RUN;
/********************************************
MERGE allows us to combine observations in one data set with those in another. The data sets are placed 'side-by-side.' Observations are matched based upon common values on a BY variable. Note that:
--BY variable must exist with same name in both data sets.
--Both data sets must be sorted by the BY variable.
*********************************************/
/********************************************
If no BY variable is available, the two data sets will be placed 'side-by-side' with observations matched only by sequential order. Dangerous! If they are not in the exact same order, or if an observation is missing from one data set, there will be a mismatch from that point onwards and no error or warning message will be issued.
*********************************************/
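A minimal sketch of a matched merge that also reports unmatched observations. The data set names st80, st85 and both, the IN= flag names, and the 1985 values are assumptions made for illustration. The IN= data set option creates a temporary 0/1 variable showing whether each data set contributed to the current observation.

```sas
*sketch: match-merge by FIPS and flag observations without a partner;
data st80;
   input fips st $ pop80;
   datalines;
1 AL 3894
2 AK 402
4 AZ 2718
;
run;

data st85;
   input fips pop85;
   datalines;
1 4021
4 3187
;
run;

proc sort data=st80; by fips; run;
proc sort data=st85; by fips; run;

data both;
   merge st80(in=in80) st85(in=in85);
   by fips;
   if not (in80 and in85) then
      put 'Unmatched FIPS code: ' fips;   /* message goes to the log */
run;
```

Here FIPS 2 has no 1985 record, so it stays in the merged data set with POP85 missing and one message is written to the log.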
Example SAS Program Set: sasex2A.sas (3 of 3)
*we now save our final research data set permanently on disk;
*be sure that the subdirectory (folder) specified in the LIBNAME statement exists on your file system before running program;
LIBNAME POPDATA 'c:\sasdata\usstate\';
*specifies the subdirectory to use to save the SAS data sets;
DATA POPDATA.USPOP; *note two level name;
SET USPOP; *identifies source SAS data set;
KEEP STATE FIPS ST DIV REG POP85 POP90 POP80 POP70 POP60; * selects variables for inclusion;
LABEL STATE='Name of State'
FIPS='Federal Information Processing code'
ST='State postal code'
DIV='Census Bureau Division'
REG='Census Bureau Region'
POP90='Population in 1990'
POP80='Population in 1980'
POP70='Population in 1970'
POP60='Population in 1960';
RUN;
/*****************************************
SAS data sets with a 'two level' name (POPDATA.USPOP) are saved permanently after the job runs.
The 'first level' (POPDATA in this example) is a SAS alias (often referred to as a SAS library) which identifies, via a LIBNAME command, the sub-directory in which the SAS data set will be stored.
The SAS data set is written to that subdirectory with the 'second level' name as the filename, and an extension of ".sd2" in SAS/PC (or ".ssd01" in UNIX); that is, in the example, 'uspop.sd2' or 'uspop.ssd01'.
‘Single level’ SAS data sets are internally given a first level name by SAS of WORK. (e.g. WORK.USPOP). They exist only for the duration of the job.
A SET statement is used to identify the source SAS data set.
We are ‘setting’ WORK.USPOP into the data step to be saved as
POPDATA.USPOP, if you like.
***********************************************/
/**********************************************
The above two steps could have been accomplished together:
LIBNAME POPDATA 'c:\sasdata\usstate\';
New SAS data set(s) can be created from existing SAS data set(s) by:
• transferring and reshaping using SET
  – changing variables with KEEP, DROP and ASSIGN (# obs =)
  – changing observations with IF, DELETE, OUTPUT (# obs <)
• merging with MERGE or UPDATE ('side by side') (# obs =)
  – with BY statement (matched)
  – without BY statement (unmatched)
• concatenating 2 or more with SET ('one after the other') (# obs >)
• sorting observations with PROC SORT (# obs =)
• integrating/combining observations with PROC SUMMARY (# obs <)
(# obs =, < or > indicates whether the new data set has the same number of observations as the source, fewer, or more.)
Note: more than one data set can be created within one data step; data sets can also be created by PROC steps.
[Diagram: four ways of creating a SAS data set from existing ones]
Reshaping: Data version2; SET version1;
Merge (1980 and 1990 values for cnty1, cnty2 placed side by side): Data p8090; MERGE p80 p90;
Concatenate (TX and LA placed one after the other): Data Two; SET TX LA;
Summary (States TX, LA, OK, NM combined into Regions NE, MW, S, W)
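Concatenation and summarizing can be sketched together as follows. The county names and population values are invented, and the output data set name states is an assumption. SET with two data sets stacks them, and PROC SUMMARY with a CLASS variable collapses the result to one observation per state.

```sas
*sketch: concatenate two county data sets, then summarize to state level;
data tx;
   input county $ pop80;
   state = 'TX';
   datalines;
cnty1 100
cnty2 200
;
run;

data la;
   input county $ pop80;
   state = 'LA';
   datalines;
cnty1 50
cnty2 75
;
run;

data two;               /* concatenate: 2 + 2 = 4 observations */
   set tx la;
run;

proc summary data=two nway;
   class state;
   var pop80;
   output out=states sum=pop80;   /* one observation per state */
run;

proc print data=states;
run;
```

The concatenated data set has more observations than either source, while the summary data set has fewer, in line with the (# obs >) and (# obs <) notes.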
SAS Action Overview: Outputting Information
To an O/S data file:
'Flat files' can be created from SAS data sets by using:
– FILE ('reverse' of INFILE) to identify the file
  • optionally FILENAME also
– PUT ('reverse' of INPUT) to create the format of the file
For printing:
• Many SAS procedures produce 'printout'
  – that is, they write information to the SAS Output window (or the file .lst in UNIX) in a format suitable for printing
• Use standard print commands to get a paper copy (e.g. File/Print pull-down menu on PC)
• PROC PRINT is no different from other PROCs as regards printing
  – better name might be PROC DATASHOW
Data _null_;
set oldsas;
file 'c:\rawdata';
if _n_=1 then put 'var1,var2,var3';
put var1 ',' var2 ',' var3;
run;
*creates file for export to spreadsheet;
Steps in debugging a program
ASSUME THAT THE PROGRAM HAS NOT WORKED CORRECTLY!
• correct all syntax problems (ERROR messages)
• establish reason for all WARNING messages
• check that every data set has the correct number of obs. and vars
• are values on all variables reasonable?
  – check against hard copy/data read in
  – no 500 kid families
• run test data and hand-check calculations for at least three observations
  – first, last, one-in-between
  – be sure sample data includes missing values!
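A minimal sketch of these checks on a tiny test data set. The data set family and its values are invented, and it deliberately includes one implausible value and one missing value.

```sas
*sketch: routine checks after reading data;
data family;
   input id kids income;
   datalines;
1 2 31000
2 500 28000
3 . 45000
;
run;

proc contents data=family;                 /* correct number of obs. and vars? */
run;

proc means data=family n nmiss min max;    /* missing counts and value ranges */
run;

proc print data=family(obs=3);             /* compare with the raw data */
run;
```

In the PROC MEANS report, the maximum of 500 for kids and the NMISS count of 1 are the kind of results that should prompt a look back at the raw data.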