DSCI 325: Handout 13 – Combining Data Sets in SAS Spring 2017
A variety of methods exist for combining datasets. Specifically in this handout, we will discuss
the following methods:
Appending and Concatenating – these involve adding ROWS to a data set
Merging – this involves adding COLUMNS to a data set
The following table gives a more complete definition of and an example of each method:
Method
Appending: adds the observations in the second data set directly to the end of the original data set.
Concatenating: copies all observations from the data sets you want to combine and creates a new data set.
Merging: combines observations from two or more data sets into a single observation in a new data set.
[Example figures for each method are not reproduced here.]
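As a rough illustration of the three definitions above, the following sketch builds a few tiny data sets and combines them each way. All data set names, variable names, and values here (PartA, PartB, Extra, Master, ID, Score, Grp) are made up for illustration and are not part of the course files.

```sas
/* Made-up data for illustration only. */
DATA PartA;
   INPUT ID Score;
   DATALINES;
1 10
2 20
;
RUN;

DATA PartB;
   INPUT ID Score;
   DATALINES;
3 30
4 40
;
RUN;

DATA Extra;
   INPUT ID Grp $;
   DATALINES;
1 A
2 B
;
RUN;

/* Appending: the rows of PartB are added to the end of the existing Master. */
DATA Master;
   SET PartA;
RUN;

PROC APPEND BASE = Master
            DATA = PartB;
RUN;

/* Concatenating: PartA and PartB are stacked into a NEW data set. */
DATA Combined;
   SET PartA PartB;
RUN;

/* Merging: columns from Extra are matched to PartA by ID.      */
/* Both inputs are already in ID order, as a BY merge requires. */
DATA Joined;
   MERGE PartA Extra;
   BY ID;
RUN;
```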
Questions:
1. Suppose that three data sets named JanSales, FebSales, and MarSales need to be
combined to create a data set named Qtr1Sales. Which method should be used?
2. Suppose that a Sales data set needs to be combined with a Target data set by month
to compare the sales data to the target data. Which method should be used?
3. Suppose the FebSales data set needs to be added to the YTD data set. Which method
should be used?
Using PROC APPEND to Combine Datasets with the Same Variable Structures
Consider the following SAS data sets. Emps originally contains employee information on all
employees hired prior to 2012, and Emps2012 contains only those employees hired in 2012.
Note that both files have the same variables.
Emps
Emps2012
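The printed tables for Emps and Emps2012 are not reproduced here. To follow along without the course files, stand-in versions could be created as sketched below. The variable names (Name, Gender, HireYear) and all values are assumptions made up for illustration; only the fact that both data sets share the same variables comes from the handout.

```sas
/* Stand-in data with made-up rows; not the handout's actual data. */
DATA Emps;
   INPUT Name $ Gender $ HireYear;
   DATALINES;
Ann F 2009
Bob M 2011
;
RUN;

/* Same variables, names, and types as Emps. */
DATA Emps2012;
   INPUT Name $ Gender $ HireYear;
   DATALINES;
Cal M 2012
Dee F 2012
;
RUN;
```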
To combine these two data sets and view the result, we can run the following program:
PROC APPEND BASE = Emps
DATA = Emps2012;
RUN;
PROC PRINT DATA = Emps;
RUN;
Emps
The log shows the following:
Using PROC APPEND to Combine Datasets with Different Variable Structures
Once again, consider the Emps data set. Recall that this now contains employee information on
all employees hired through 2012. Suppose that in 2013, we stopped recording the gender of
the employees. These data are given in the file Emps2013.
Emps
Emps2013
Now, suppose we run the following program to combine the two data sets:
PROC APPEND BASE = Emps
DATA = Emps2013;
RUN;
PROC PRINT DATA = Emps;
RUN;
Emps
The log displays the following warning, but the procedure still worked.
Next, suppose instead that we decided in 2013 to stop recording gender and also to start
recording information on the employees’ highest degree earned.
Emps
Emps2013
Consider the following program and result in the log window:
PROC APPEND BASE = Emps
DATA = Emps2013;
RUN;
PROC PRINT DATA = Emps;
RUN;
Note that when the DATA= data set contains a variable that is not in the BASE= data set,
SAS does not execute the procedure. We could use the FORCE option, as the log suggests:
PROC APPEND BASE = Emps
DATA = Emps2013 FORCE;
RUN;
PROC PRINT DATA = Emps;
RUN;
SAS returns the following:
Emps
Another Example:
Consider the following data, which were recorded for the Rushford-Peterson boys’ basketball
team. These data can be found in the RP Game 1 through RP Game 3 csv files on the course
storage space.
Game1
Game2
Read the above data into SAS data sets named Game1 and Game2, respectively.
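One possible way to read the csv files is PROC IMPORT, sketched below. The folder path and the exact file names are assumptions; point DATAFILE= at wherever your copies are saved.

```sas
/* The path and file names are placeholders, not the real course locations. */
PROC IMPORT DATAFILE = "C:\path\to\RP Game 1.csv"
            OUT = Game1
            DBMS = CSV
            REPLACE;          /* overwrite Game1 if it already exists */
   GETNAMES = YES;            /* take variable names from the first row */
RUN;

PROC IMPORT DATAFILE = "C:\path\to\RP Game 2.csv"
            OUT = Game2
            DBMS = CSV
            REPLACE;
   GETNAMES = YES;
RUN;
```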
Then, run a PROC CONTENTS for each data set.
PROC CONTENTS DATA=Game1;
RUN;
PROC CONTENTS DATA=Game2;
RUN;
Next, run the following program to add the data from Game2 to the original Game1 data set:
PROC APPEND
BASE = Game1
DATA = Game2;
RUN;
Once again, PROC APPEND does not produce any output. The Log window can
be used to verify that this procedure was not successful.
Question: Why was PROC APPEND not successful in this example?
The FORCE option can be used to overcome this problem:
PROC APPEND
BASE = Game1
DATA = Game2 FORCE;
RUN;
PROC CONTENTS DATA=Game1;
RUN;
Consider the following output from PROC CONTENTS.
Questions:
1. What is the name of the appended dataset?
2. How many observations are in the appended dataset?
3. How many variables are in the appended dataset?
4. Suppose that I accidentally submitted my program a second time (i.e., I hit the Submit
button again). Consider the upper portion of the PROC CONTENTS output and a print-out of
the data for the Game1 dataset.
What is the effect of the second submission of this program?
To resolve this error, you can remove certain observations (via their observation
number) using the automatic SAS variable _N_. This is shown next.
DATA GAME1;
SET GAME1;
IF _N_ >= 15 then DELETE;
RUN;
PROC PRINT DATA=GAME1;
RUN;
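As an alternative sketch, the OBS= data set option keeps only the first observations as they are read, which has the same effect as deleting everything from observation 15 onward. The cutoff of 14 simply mirrors the code above, and the output name Game1_fixed is a made-up name chosen so that Game1 itself is not overwritten.

```sas
/* Read only the first 14 observations of Game1 into a new data set. */
DATA Game1_fixed;
   SET Game1 (OBS = 14);
RUN;

PROC PRINT DATA = Game1_fixed;
RUN;
```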
When is the FORCE Option Needed?
The FORCE option is needed when the DATA= data set contains variables that
- are not in the BASE= data set (note that SAS drops these extra variables from the
  appended data),
- are longer than the variables in the BASE= data set (note that SAS truncates the values
  from the DATA= data set so that they fit into the length specified in the BASE= data set), or
- do not have the same type as the variables in the BASE= data set (SAS replaces all
  values for the variable in the DATA= data set with missing values and keeps the variable
  type that was specified in the BASE= data set).
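The three situations listed above can be tried on a small scale with the sketch below. The data set names (BaseDemo, NewDemo), the variables, and the values are all made up for illustration. NewDemo has a longer Name, a character Minutes where BaseDemo has a numeric one, and an extra variable Fouls.

```sas
/* Made-up base data: Name has length 4, Points and Minutes are numeric. */
DATA BaseDemo;
   LENGTH Name $ 4;
   INPUT Name $ Points Minutes;
   DATALINES;
Ann 12 30
Bob 7 18
;
RUN;

/* Made-up data to append: longer Name, character Minutes, extra Fouls. */
DATA NewDemo;
   LENGTH Name $ 10 Minutes $ 5;
   INPUT Name $ Points Minutes $ Fouls;
   DATALINES;
Charlotte 9 25:00 3
;
RUN;

/* Without FORCE this step is not executed; with FORCE the rules above apply. */
PROC APPEND BASE = BaseDemo
            DATA = NewDemo FORCE;
RUN;

/* Compare the appended row with the rules listed above. */
PROC PRINT DATA = BaseDemo;
RUN;

PROC CONTENTS DATA = BaseDemo;
RUN;
```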
Final Comments on the APPEND Procedure
Note that PROC APPEND works with only two data sets at a time in one step. Also, the
observations in the base data set are not read, and the variable information in the descriptor
portion of the base data set cannot change. We have a lot more flexibility when we use the SET
statement, which is discussed in the next section.
Concatenating Data Sets with the Same Variables
To concatenate two or more data sets in SAS, we use the SET statement in the DATA step. For
example, consider the data from Game 1 and Game 2 used previously in the handout.
PROC CONTENTS DATA=Game1;
RUN;
PROC CONTENTS DATA=Game2;
RUN;
The output from PROC CONTENTS:
Game1
Game2
The following code can be used to concatenate the data from Game 1 and Game 2 to create a
new data set called Games.
DATA Games;
SET Game1 Game2;
RUN;
PROC CONTENTS DATA=Games; RUN;
The result:
Note that any number of data sets can be used in the SET statement. The observations from the
first data set in the SET statement will appear first. The observations from the second data set
follow, and so on.
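Because the observations are simply stacked, the new data set does not record which input each row came from. One way to keep track, sketched here, is the IN= data set option. The variable GameNum and the temporary flags fromGame1 and fromGame2 are hypothetical names introduced for this example; they are not part of the original files.

```sas
DATA Games;
   SET Game1 (IN = fromGame1)
       Game2 (IN = fromGame2);
   /* IN= flags equal 1 when the current row came from that data set. */
   /* They are temporary and are not written to Games.                */
   IF fromGame1 THEN GameNum = 1;
   ELSE IF fromGame2 THEN GameNum = 2;
RUN;

PROC PRINT DATA = Games;
RUN;
```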
Questions:
1. How is the following code different from what was shown above?
DATA Game1;
SET Game1 Game2;
RUN;
PROC CONTENTS DATA=Game1;
RUN;
2. Try running the following code.
DATA Games;
SET Game2 Game1;
RUN;
PROC CONTENTS DATA=Games; RUN;
PROC PRINT DATA=Games; RUN;
What is the result of this code?
Concatenating Data Sets with Different Variables
Recall that we also have data on a third game of R-P boys’ basketball. Read in the data for
Game 3 and run the following PROC CONTENTS:
PROC CONTENTS DATA=Games; RUN;
PROC CONTENTS DATA=Game3; RUN;
Games
Game3
What do you notice about the variables in the two data sets?
Run the following code to concatenate all three data sets:
DATA Game123;
SET Game2 Game1 Game3;
RUN;
PROC PRINT DATA = Game123;
RUN;
The results are shown below:
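When the data sets in a SET statement do not all share the same variables, observations that come from a data set lacking a variable receive missing values for it. A quick way to see where this happened, sketched below, is to count non-missing and missing values for each numeric variable in the concatenated data set.

```sas
/* N = number of non-missing values, NMISS = number of missing values. */
/* With no VAR statement, all numeric variables are summarized.        */
PROC MEANS DATA = Game123 N NMISS;
RUN;
```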
Finally, note that the following code can be used to rename the variables in the Game3 data set
so that all of the FG3 and FGA3 data will be read into the same column.