DSCI 325: Handout 13 – Combining Data Sets in SAS Spring 2017
A variety of methods exist for combining datasets. Specifically in this handout, we will discuss
the following methods:
Appending and Concatenating – these involve adding ROWS to a data set
Merging – this involves adding COLUMNS to a data set
The following list gives a more complete definition of each method (the example figures from
the original table are not reproduced here):
Appending – this adds the observations in the second data set directly to the end of the
original data set
Concatenating – this copies all observations from the data sets you want to combine and
creates a new data set
Merging – this involves combining observations from two or more data sets into a single
observation in a new data set
Questions:
1. Suppose that three data sets named JanSales, FebSales, and MarSales need to be
combined to create a data set named Qtr1Sales. Which method should be used?
2. Suppose that a Sales data set needs to be combined with a Target data set by month
to compare the sales data to the target data. Which method should be used?
3. Suppose the FebSales data set needs to be added to the YTD data set. Which method
should be used?
Using PROC APPEND to Combine Datasets with the Same Variable Structures
Consider the following SAS data sets. Emps originally contains employee information on all
employees hired prior to 2012, and Emps2012 contains only those employees hired in 2012.
Note that both files have the same variables.
Emps
Emps2012
To combine these two data sets and view the result, we can run the following program:
PROC APPEND BASE = Emps
DATA = Emps2012;
RUN;
PROC PRINT DATA = Emps;
RUN;
Emps
The log shows the following:
Using PROC APPEND to Combine Datasets with Different Variable Structures
Once again, consider the Emps data set. Recall that this now contains employee information on
all employees hired through 2012. Suppose that in 2013, we stopped recording the gender of
the employees. These data are given in the file Emps2013.
Emps
Emps2013
Now, suppose we run the following program to combine the two data sets:
PROC APPEND BASE = Emps
DATA = Emps2013;
RUN;
PROC PRINT DATA = Emps;
RUN;
Emps
The log displays the following warning, but the procedure still worked.
Next, suppose instead that we decided in 2013 to stop recording gender and also to start
recording information on the employees’ highest degree earned.
Emps
Emps2013
Consider the following program and result in the log window:
PROC APPEND BASE = Emps
DATA = Emps2013;
RUN;
PROC PRINT DATA = Emps;
RUN;
Note that when the DATA= data set contains a variable that is not in the BASE= data set,
SAS does not perform the append. We could use the FORCE option, as the log suggests:
PROC APPEND BASE = Emps
DATA = Emps2013 FORCE;
RUN;
PROC PRINT DATA = Emps;
RUN;
SAS returns the following:
Emps
Another Example:
Consider the following data which was recorded from the Rushford-Peterson boys’ basketball
team. This data can be found in the RP Game 1 – RP Game 3 csv files on the course storage
space.
Game1
Game2
Read the above data into SAS data sets named Game1 and Game2, respectively.
Then, run a PROC CONTENTS for each data set.
PROC CONTENTS DATA=Game1;
RUN;
PROC CONTENTS DATA=Game2;
RUN;
Next, run the following program to add the data from Game2 to the original Game1 data set:
PROC APPEND
BASE = Game1
DATA = Game2;
RUN;
Once again, PROC APPEND does not produce any output. The Log window can
be used to verify that this procedure was not successful.
Question: Why was PROC APPEND not successful in this example?
The FORCE option can be used to overcome this problem:
PROC APPEND
BASE = Game1
DATA = Game2 FORCE;
RUN;
PROC CONTENTS DATA=Game1;
RUN;
Consider the following output from PROC CONTENTS.
Questions:
1. What is the name of the appended dataset?
2. How many observations are in the appended dataset?
3. How many variables are in the appended dataset?
4. Suppose that I accidentally submitted my program a second time (i.e., I hit the button
again). Consider the upper portion of the PROC CONTENTS output and a print-out of
the data for the Game1 dataset.
What is the effect of the second submission of this program?
To resolve this error, you can remove certain observations (via their observation
number) using the automatic SAS variable _N_, which counts the iterations of the DATA
step and here matches the observation number. This is shown next.
DATA GAME1;
SET GAME1;
IF _N_ >= 15 then DELETE;
RUN;
PROC PRINT DATA=GAME1;
RUN;
When is the FORCE Option Needed?
The FORCE option is needed when the DATA= data set contains variables that either
  - are not in the BASE= data set (note that SAS drops this extra variable from the data set)
  - are longer than the variables in the BASE= data set (note that SAS truncates the values
    from the DATA= data set so that they fit into the length specified in the BASE= data set)
  - do not have the same type as the variables in the BASE= data set (SAS replaces all
    values for the variable in the DATA= data set with missing values and keeps the variable
    type that was specified in the BASE= data set)
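The first two cases in this list can be tried out with a small self-contained sketch. The data set names (base_emps, new_emps) and all values below are hypothetical and are not part of the course files; they are chosen only so that the DATA= data set has one extra variable and one longer character variable.

```sas
/* Hypothetical data: names and values are invented for illustration. */
DATA base_emps;
   LENGTH Name $ 5;               /* short character length in the base */
   INPUT EmpID Name $;
   DATALINES;
101 Alice
102 Bob
;
RUN;

DATA new_emps;
   LENGTH Name $ 12 Degree $ 3;   /* longer Name, plus a variable not in the base */
   INPUT EmpID Name $ Degree $;
   DATALINES;
103 Christopher BS
104 Dana MS
;
RUN;

/* Without FORCE, SAS refuses to append these data sets.
   With FORCE, Degree is dropped and Name values are truncated
   to the length defined in the BASE= data set (5 characters). */
PROC APPEND BASE = base_emps
            DATA = new_emps FORCE;
RUN;

PROC PRINT DATA = base_emps;
RUN;
```

Checking the log after running this should show warnings about the dropped variable and the differing lengths, which is a useful habit whenever FORCE is used.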
Final Comments on the APPEND Procedure
Note that PROC APPEND works with only two data sets at a time in one step. Also, the
observations in the base data set are not read, and the variable information in the descriptor
portion of the base data set cannot change. We have a lot more flexibility when we use the SET
statement, which is discussed in the next section.
Concatenating Data Sets with the Same Variables
To concatenate two or more data sets in SAS, we use the SET statement in the DATA step. For
example, consider the data from Game 1 and Game 2 used previously in the handout.
PROC CONTENTS DATA=Game1;
PROC CONTENTS DATA=Game2;
RUN;
The output from PROC CONTENTS:
Game1
Game2
The following code can be used to concatenate the data from Game 1 and Game 2 to create a
new data set called Games.
DATA Games;
SET Game1 Game2;
RUN;
PROC CONTENTS DATA=Games; RUN;
The result:
Note that any number of data sets can be used in the SET statement. The observations from the
first data set in the SET statement will appear first. The observations from the second data set
follow, and so on.
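To see this ordering without the course files, here is a minimal sketch using three small made-up data sets. The names game_a, game_b, game_c, and all_games and their values are hypothetical. The INDSNAME= option on the SET statement is not used elsewhere in this handout; it is included here only as one possible way to record which data set each observation came from.

```sas
/* Hypothetical data sets; names and values are invented for illustration. */
DATA game_a;
   INPUT Number Points;
   DATALINES;
3 10
12 4
;
RUN;

DATA game_b;
   INPUT Number Points;
   DATALINES;
3 8
;
RUN;

DATA game_c;
   INPUT Number Points;
   DATALINES;
12 6
25 2
;
RUN;

DATA all_games;
   /* INDSNAME= stores the name of the contributing data set in a
      temporary variable, so copy it to a new variable to keep it. */
   SET game_a game_b game_c INDSNAME = src;
   Source = src;
RUN;

PROC PRINT DATA = all_games;
RUN;
```

In the printed result, the Source column should change from the first data set to the second to the third as you read down the rows, matching the order in the SET statement.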
Questions:
1. How is the following code different from what was shown above?
DATA Game1;
SET Game1 Game2;
RUN;
PROC CONTENTS DATA=Game1;
RUN;
2. Try running the following code.
DATA Games;
SET Game2 Game1;
RUN;
PROC CONTENTS DATA=Games; RUN;
PROC PRINT DATA=Games; RUN;
What is the result of this code?
Concatenating Data Sets with Different Variables
Recall that we also have data on a third game of R-P boys’ basketball. Read in the data for
Game 3 and run the following PROC CONTENTS:
PROC CONTENTS DATA=Games; RUN;
PROC CONTENTS DATA=Game3; RUN;
Games
Game3
What do you notice about the variables in the two data sets?
Run the following code to concatenate all three data sets:
DATA Game123;
SET Game2 Game1 Game3;
RUN;
PROC PRINT DATA = Game123;
RUN;
The results are shown below:
Finally, note that the following code can be used to rename the variables in the Game3 data set
so that all of the FG3 and FGA3 data will be read into the same column.
DATA Game123;
SET Game2 Game1 Game3 (RENAME = (FG_3 = FG3 FGA_3 = FGA3));
RUN;
PROC PRINT DATA = Game123;
RUN;
PROC APPEND versus using the SET statement
  - The data set that results from concatenating two data sets with the SET statement is the
    same as the data set that results from concatenating them with the APPEND procedure
    if the two data sets contain the same variables.
  - The APPEND procedure is typically faster than the SET statement, especially when the
    BASE= data set is large, because it does not process the observations from the BASE=
    data set.
  - The two methods differ when the variables differ between data sets.
      - PROC APPEND uses all variables in the BASE= data set and assigns missing
        values to observations from the DATA= data set where appropriate; it cannot
        include variables found only in the DATA= data set
      - The SET statement uses all variables and assigns missing values where
        appropriate
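The difference described above can be checked side by side with a short sketch. The data sets emps_old and emps_new, and the output names via_set and via_append, are hypothetical and use invented values; they simply mimic the case where one data set records Gender and the other records Degree.

```sas
/* Hypothetical data: names and values are invented for illustration. */
DATA emps_old;
   INPUT EmpID Gender $;
   DATALINES;
1 F
2 M
;
RUN;

DATA emps_new;
   INPUT EmpID Degree $;
   DATALINES;
3 BS
;
RUN;

/* SET keeps every variable found in either input data set. */
DATA via_set;
   SET emps_old emps_new;
RUN;

/* PROC APPEND modifies its BASE= data set, so work on a copy. */
DATA via_append;
   SET emps_old;
RUN;

/* FORCE is required because Degree is not in the base;
   Degree is dropped and Gender is missing for the new row. */
PROC APPEND BASE = via_append
            DATA = emps_new FORCE;
RUN;

PROC PRINT DATA = via_set;    RUN;
PROC PRINT DATA = via_append; RUN;
```

Both results contain three observations, but via_set has three variables (EmpID, Gender, Degree) while via_append has only the two variables of its base data set.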
Interleaving Data Sets in SAS
Consider the data sets Game1, Game2, and Game3 from the previous examples. Suppose that
these had already been sorted according to Number.
Note that when we concatenate these data sets, the resulting data set is no longer sorted by
Number.
DATA Game123;
SET Game2 Game1 Game3 (RENAME = (FG_3 = FG3 FGA_3 = FGA3));
RUN;
PROC PRINT DATA = Game123 HEADING=vertical WIDTH=minimum;
RUN;
Of course, we could use the following code to perform this sort.
DATA Game123;
SET Game2 Game1 Game3 (RENAME = (FG_3 = FG3 FGA_3 = FGA3));
RUN;
PROC SORT DATA=Game123;
BY Number;
RUN;
PROC PRINT; RUN;
However, if the original data sets are already sorted, it is more efficient to preserve that order
when combining the data sets. This can be accomplished by using a BY statement with a SET
statement in the DATA step.
DATA Game123;
SET Game2 Game1 Game3 (RENAME = (FG_3 = FG3 FGA_3 = FGA3));
BY Number;
RUN;
PROC PRINT DATA = Game123 HEADING=vertical WIDTH=minimum;
RUN;
This is known as interleaving the data sets. Note that before you can interleave observations,
the original data sets must be sorted by the BY variable(s).
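Interleaving can also be tried on a small scale. The sketch below uses two hypothetical data sets (team_a and team_b, with invented values) that are already in ascending order of Number, so no PROC SORT step is needed before the DATA step.

```sas
/* Hypothetical data, already sorted by Number within each data set. */
DATA team_a;
   INPUT Number Points;
   DATALINES;
3 10
12 4
25 7
;
RUN;

DATA team_b;
   INPUT Number Points;
   DATALINES;
5 6
12 9
30 2
;
RUN;

/* SET with BY interleaves the inputs: the output stays sorted by Number.
   When both inputs share a BY value (Number = 12 here), the observation
   from the data set listed first in the SET statement comes first. */
DATA interleaved;
   SET team_a team_b;
   BY Number;
RUN;

PROC PRINT DATA = interleaved;
RUN;
```

If either input were not in order by Number, the DATA step would stop with an error in the log, which is why the sorting requirement matters.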
Combining SAS Data Sets with a One-to-One Merge
Recall that merging data sets involves combining observations from two or more data sets into a
single observation in a new data set.
The above is an example of a one-to-one match merge; i.e., each observation in one data set is
related to exactly one observation in the other data set(s). To see how this merge is successfully
accomplished in SAS, suppose the original data sets were given as follows.
EmpsAU
PhoneH
Unsuccessful Attempt #1
First, try using the following code to merge the data sets:
DATA EmpsAUH;
MERGE EmpsAU PhoneH;
RUN;
PROC PRINT DATA=EmpsAUH; RUN;
What is the problem?
Unsuccessful Attempt #2
To get around the above problem, specify a BY variable. For example, try the following code:
DATA EmpsAUH;
MERGE EmpsAU PhoneH;
BY EmpID;
RUN;
PROC PRINT DATA=EmpsAUH; RUN;
Now, check the log window:
Note that observations must be sorted by the common variable(s) that are being matched;
otherwise, the merge is unsuccessful.
Successful Attempt
Consider the following code:
PROC SORT DATA = EmpsAU; BY EmpID;
PROC SORT DATA = PhoneH; BY EmpID;
DATA EmpsAUH;
MERGE EmpsAU PhoneH;
BY EmpID;
RUN;
PROC PRINT DATA=EmpsAUH; RUN;
Note that if the data have been sorted in descending sequence, the following merge attempt is
unsuccessful.
PROC SORT DATA = EmpsAU; BY DESCENDING EmpID;
PROC SORT DATA = PhoneH; BY DESCENDING EmpID;
DATA EmpsAUH;
MERGE EmpsAU PhoneH;
BY EmpID;
RUN;
To remedy this, use the DESCENDING option in the BY statement of the DATA step:
PROC SORT DATA = EmpsAU; BY DESCENDING EmpID;
PROC SORT DATA = PhoneH; BY DESCENDING EmpID;
DATA EmpsAUH;
MERGE EmpsAU PhoneH;
BY DESCENDING EmpID;
RUN;
PROC PRINT DATA=EmpsAUH; RUN;
Merging Data Sets with Identically Named Variables
Suppose the original data sets used above had been initially stored as follows:
EmpsAU
PhoneH
Note that both data sets contain a variable named First; however, Togar’s name is misspelled as
“Togur” in the PhoneH data set. Suppose the data sets are merged with the following code:
PROC SORT DATA = EmpsAU; BY EmpID;
PROC SORT DATA = PhoneH; BY EmpID;
DATA EmpsAUH;
MERGE EmpsAU PhoneH;
BY EmpID;
RUN;
PROC PRINT DATA=EmpsAUH; RUN;
The resulting data set is shown below:
What did SAS do here?
Combining SAS Data Sets with a One-to-Many Merge
A one-to-many merge occurs when a single observation in one data set is related to more than
one observation in another data set. For example, consider the following data sets:
EmpsAU
PhoneHW
Consider the following program and output:
PROC SORT DATA = EmpsAU; BY EmpID;
PROC SORT DATA = PhoneHW; BY EmpID;
DATA EmpsAUHW;
MERGE EmpsAU PhoneHW;
BY EmpID;
RUN;
PROC PRINT DATA=EmpsAUHW; RUN;
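Since the data listings are not reproduced here, the following sketch shows the same kind of one-to-many merge on made-up data. The data set names (emps_demo, phones_demo, emps_phones), the variable names, and the values are all hypothetical stand-ins, not the contents of EmpsAU or PhoneHW.

```sas
/* Hypothetical data: one row per employee. */
DATA emps_demo;
   INPUT EmpID First $;
   DATALINES;
10 Ann
20 Ben
;
RUN;

/* Hypothetical data: an employee may have several phone rows. */
DATA phones_demo;
   INPUT EmpID Type $ Phone :$12.;
   DATALINES;
10 Home 555-0101
10 Work 555-0199
20 Home 555-0102
;
RUN;

PROC SORT DATA = emps_demo;   BY EmpID; RUN;
PROC SORT DATA = phones_demo; BY EmpID; RUN;

/* Each phone row is paired with the single matching employee row,
   so the employee's values are repeated for every phone number. */
DATA emps_phones;
   MERGE emps_demo phones_demo;
   BY EmpID;
RUN;

PROC PRINT DATA = emps_phones;
RUN;
```

The merged data set has one observation per phone row (three here), with the First value for employee 10 carried onto both of that employee's rows.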
Merging with Nonmatches
Consider the following data sets:
EmpsAU
PhoneC
Note that Employees 121152 and 121153 are listed in only one of the data sets; i.e., they have no
match. Consider the following code and output:
PROC SORT DATA = EmpsAU; BY EmpID;
PROC SORT DATA = PhoneC; BY EmpID;
DATA EmpsAUC;
MERGE EmpsAU PhoneC;
BY EmpID;
RUN;
PROC PRINT DATA=EmpsAUC; RUN;
Note that the final result contains both the matches (observations with data from both input
data sets) and the non-matches (observations with data from only one of the data sets).
Using the IN= Option
Suppose you wanted to eliminate the non-matches from the previous data set.
This could be easily accomplished using the IN= option to create a variable that indicates
whether a data set contributed data to the current observation.
For example, consider the following code.
PROC SORT DATA = EmpsAU; BY EmpID;
PROC SORT DATA = PhoneC; BY EmpID;
DATA EmpsAUC;
MERGE EmpsAU (in=Emps)
PhoneC (in=PhoneNum);
BY EmpID;
RUN;
PROC PRINT DATA=EmpsAUC; RUN;
When you run the above program, the output is identical to what was obtained in
the previous example. This is because the IN= option does not create new variables to be stored
in the final data set; instead, the variables Emps and PhoneNum exist only during the DATA
step. They can, however, be used to create other variables or for subsetting. For example,
consider the following programs and resulting output:
DATA EmpsAUC;
MERGE EmpsAU (in=Emps)
PhoneC (in=PhoneNum);
BY EmpID;
IF Emps=1 and PhoneNum=1;
RUN;
PROC PRINT DATA=EmpsAUC; RUN;
DATA EmpsAUC;
MERGE EmpsAU (in=Emps)
PhoneC (in=PhoneNum);
BY EmpID;
IF Emps=1;
RUN;
PROC PRINT DATA=EmpsAUC; RUN;
DATA EmpsAUC;
MERGE EmpsAU (in=Emps)
PhoneC (in=PhoneNum);
BY EmpID;
IF PhoneNum=1;
RUN;
PROC PRINT DATA=EmpsAUC; RUN;
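The programs above use the IN= variables for subsetting. As a sketch of the other use mentioned earlier (creating new variables), the following self-contained example copies the temporary flags into permanent variables and also keeps only the non-matches. All data set names, variable names, and values here are hypothetical and are not taken from EmpsAU or PhoneC.

```sas
/* Hypothetical data: employee 30 has no phone; phone 40 has no employee. */
DATA emps_demo;
   INPUT EmpID First $;
   DATALINES;
10 Ann
20 Ben
30 Cara
;
RUN;

DATA phones_demo;
   INPUT EmpID Phone :$12.;
   DATALINES;
10 555-0101
20 555-0102
40 555-0104
;
RUN;

PROC SORT DATA = emps_demo;   BY EmpID; RUN;
PROC SORT DATA = phones_demo; BY EmpID; RUN;

/* Copy the temporary IN= flags into ordinary variables so that
   they are stored in the output data set. */
DATA merged_flags;
   MERGE emps_demo   (IN = InEmps)
         phones_demo (IN = InPhones);
   BY EmpID;
   HasEmp   = InEmps;
   HasPhone = InPhones;
RUN;

/* Keep only observations that are missing from at least one input. */
DATA nonmatches;
   MERGE emps_demo   (IN = InEmps)
         phones_demo (IN = InPhones);
   BY EmpID;
   IF NOT (InEmps AND InPhones);
RUN;

PROC PRINT DATA = merged_flags; RUN;
PROC PRINT DATA = nonmatches;   RUN;
```

Here merged_flags contains all four EmpID values with 0/1 indicator columns, while nonmatches contains only the two observations (EmpID 30 and 40) that appear in just one of the input data sets.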