Data Preparation
Data preparation is the first thing you do as a data analyst, and it takes up a lot of your time, well before you try to build predictive models on the data.
In essence, data preparation is about processing data to get it ready for all kinds of analysis. Industry data collection is mostly driven by front-end business processes, not by the needs of predictive models. At some point or other, these processes introduce errors into the data.
There can be many reasons [not necessarily errors] for which we need to pre-process our data and change it for the better:
• Missing data
• Potentially incorrect data
• Need for changing the form of the data
We'll discuss these reasons, and methods to achieve our pre-processing goals, going forward.
Handling Missing Values and Outliers
You'll find that the treatment of missing values and outliers can at times be very similar. The reason is that both kinds of observations are not in a usable state, because of missing information or misinformation.
Treatment of missing values:
• Removing observations with missing values
This is the most common method in the industry, because missing values are generally a very small chunk of the data that you deal with. However, keep the following in mind when removing observations because of missing data:
1. If observations with missing values are a significant chunk of the data, you should not drop all of them.
2. If the variable that had missing values has entered your model, you need to plan what to do when you encounter missing values in unseen data once the model has been put in production.
• Imputing [filling up] missing values with the mean/median/mode of the respective variable
We don't need to get into the details of this.
• Imputing with business logic
Many a time, we know what a missing value might mean in the context of the business process. For example, if the account balance is missing for a bank account, it might mean that the account balance is zero.
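The three treatments above can be sketched in SAS as follows. This is a minimal illustration, not a recommended recipe: the dataset accounts and its columns are hypothetical, and mean imputation assumes that proc stdize (part of SAS/STAT) is available, where the reponly option replaces only the missing values.

```sas
/* Hypothetical data: balance and income contain missing values (.) */
data accounts;
  input acct_id balance income;
  cards;
1 500 40
2 . 55
3 1200 .
4 300 70
;
run;

/* 1. Remove observations where income is missing */
data accounts_dropped;
  set accounts;
  if income = . then delete;
run;

/* 2. Impute with the mean: reponly leaves non-missing values untouched */
proc stdize data=accounts out=accounts_mean method=mean reponly;
  var income;
run;

/* 3. Impute with business logic: a missing balance is taken to be zero */
data accounts_logic;
  set accounts;
  if balance = . then balance = 0;
run;
```

Whichever treatment is chosen, the same rule has to be applied to unseen data once the model is in production.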
Treatment of Outliers:
• Removing observations with outliers
There are two issues with including outliers in predictive analysis:
1. Because of outliers, the ranges of the predictor variables get inflated artificially. The model that you get might not be applicable across that range.
2. Some outliers have high leverage in the modelling process. In the presence of such observations you'll get a model that is not a good fit for the general population [data].
If you are preparing data for predictive modelling, you need to remove outliers. However, if the variable with outliers is present in the model, you need to figure out what to do when you encounter outlier values in unseen data once the model has been put in production.
• Flooring/Capping
In some cases it makes sense to replace outlying values with the lower or upper limit when they fall outside those limits. Imputing with the lower limit is called flooring, and imputing with the upper limit is called capping.
• Imputing with business logic
Many a time, we know what an outlier value might mean in the context of the business process.
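Flooring and capping can be sketched in a single data step. The limits 15000 and 60000 below are hypothetical values chosen only for illustration; in practice they would come from percentiles of the data or from business rules.

```sas
data cars_capped;
  set sashelp.cars(keep=make model invoice);
  lower = 15000;   /* hypothetical floor */
  upper = 60000;   /* hypothetical cap   */
  /* max() applies the floor, min() applies the cap.
     Caution: max() ignores missing values, so a missing invoice would
     come out as the floor; treat missing values before this step. */
  invoice_capped = min(max(invoice, lower), upper);
  drop lower upper;
run;
```

Keeping the original column next to the capped one makes it easy to check how many values were actually changed.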
Need for changing form of the data
Transforming and extracting information from the existing data
Consider a simple transaction date and time column for an eCommerce website. A raw column of dates is not of much use by itself, but a lot of information can be extracted from it, e.g. gaps between transactions, or the number of transactions happening every day, week or month.
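As a rough sketch of what such extraction can look like, the data step below derives a month, a weekday and the gap since the previous transaction from a date column. The dataset txn and its values are hypothetical, and the rows are assumed to be already sorted by date.

```sas
/* Hypothetical transactions, already in date order */
data txn;
  input txn_date :date9. amount;
  format txn_date date9.;
  cards;
02JAN2014 120
05JAN2014 80
17JAN2014 45
03FEB2014 300
;
run;

data txn_features;
  set txn;
  txn_month   = month(txn_date);    /* 1 to 12 */
  txn_weekday = weekday(txn_date);  /* 1 = Sunday, ..., 7 = Saturday */
  gap_days    = dif(txn_date);      /* days since the previous row; missing on the first row */
run;
```

SAS stores dates as a count of days, which is why the difference between two dates is directly a number of days.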
Collapsing and Summarising Data:
Many a time we need to collapse data based on some grouping variables [this is more or less the same as what we discussed in univariate statistics], e.g. finding a monthly summary from daily transaction data. In addition to the tools we learned in the Univariate Statistics module, we will learn a few new things in the "to do with SAS" section.
Transposing Data
This is one of the most useful procedures we'll learn here. Below is an example of long data.
Since SAS processes data row by row in many procedures as well as in data step code, these kinds of transformations are often needed. We'll learn how to achieve them with Proc Transpose.
Formatting Data Columns, Creating Reports
In addition to other tools, we'll also learn very useful procedures for creating all kinds of reports and user-defined data formats: Proc Report and Proc Format.
Data Preparation with SAS
In the coming sections we'll learn many tools, SAS functions and utility procedures to achieve the data preparation tasks discussed so far, and then some more. We'll start by answering a few simple questions based on the data "bank_transactions", using tools that we learned in the Univariate Statistics module. Later we'll see how the same can be achieved in a much simpler and faster manner.
Q: Find the category of the highest transaction in debit/credit for each month.
A: We can sort the data by year, month and dc, and then by amount in descending order. Then within each group we can find the maximum amount.
proc sort data=dp.bank_transactions;
  by year month dc descending amount;
run;
proc means data=dp.bank_transactions max;
  var amount;
  by year month dc;
run;
Q: Find the total transaction amount in debit/credit for each month.
A: We can again use a combination of proc sort and proc means, this time with the "sum" option.
proc sort data=dp.bank_transactions;
  by year month dc;
run;
proc means data=dp.bank_transactions sum;
  var amount;
  by year month dc;
run;
This works out all right but, as we have seen before, getting the output of proc means into an output dataset is not a straightforward task. Let's learn about "first." and "last."; these are temporary variables created behind the scenes when a by statement is used in data step code [keep in mind that a "by" statement can be used only after sorting your data]. Let's create the data that we'll use to learn about them:
data example;
  input grps section $ score;
  cards;
1 a 10
1 a 20
1 b 30
1 b 40
2 a 50
2 a 60
2 b 0
2 b -10
;
run;
The dataset that we have created is already sorted, hence we can use the "by" statement without sorting it first. When we use a "by" statement, SAS creates the temporary variables "first.variable" and "last.variable", which take the values "1" and "0" for each observation depending on the groups formed by the variables in the "by" statement. Let's look at the example below to understand this better:
data example;
  set example;
  by grps;
  first_grps=first.grps;
  last_grps=last.grps;
run;
data example1;
  set example;
  by grps section;
  first_section=first.section;
  last_section=last.section;
run;
In the first program we used "by grps"; the variable "grps" creates two groups in the data, one for the value "1" and another for the value "2". "first.grps" takes the value "1" for the first observation in each group and "0" for the others; "last.grps" takes the value "1" for the last observation in each group and "0" for the others.
In the second program we used "by grps section"; this makes more groups in the data, and first.section and last.section take the values "1" and "0" accordingly.
We don't really need to store first. and last. in new variables to use them; in the programs above we did so just for demonstration. Let's use them to solve a problem similar to the one we had for the bank_transactions data. Let's get the top score for each section.
proc sort data=example;
  by grps section descending score;
run;
data top_example;
  set example;
  by grps section;
  if first.section;
run;
Get the total score for each section:
data total_scores(drop=score);
  set example;
  by grps section;
  total_score+score;
  /* reset the running total at the start of each section */
  if first.section then total_score=score;
  if last.section then output;
run;
In a similar fashion, we can solve the original problems for the dataset bank_transactions:
proc sort data=dp.bank_transactions;
  by year month dc descending amount;
run;
/* amount is in descending order, so the first row of each group is the highest transaction; category is kept because it is what we are looking for */
data bt_summary(drop=day);
  set dp.bank_transactions;
  by year month dc;
  if first.dc then output;
run;
data bt_summary_total(drop=amount day category);
  set dp.bank_transactions;
  by year month dc;
  total_amount+amount;
  total_transac+1;
  if first.dc then do;
    total_amount=amount;
    total_transac=1;
  end;
  if last.dc then output;
run;
Numeric Functions
Before we start to learn about SAS functions, let's learn a way to avoid creating a dataset every time we just want to see what a function does. A handy way is to name the outgoing dataset "_null_"; this tells SAS not to create any dataset in the data step program. But we do need something that will show us the result of the function we just used. The "put" statement comes to the rescue: it prints whatever we ask it to in the log. Remember, in the log window, not in the result window. Let's look at a few numeric functions available in SAS:
data _null_; x=sqrt(2000000); y=log(x); z=sum(23,34,56); put x; put y; put z; run;
There are several such numeric functions. A quick list that comes to mind: log, exp, sqrt, mean, median, sum, n, nmiss. These functions do what their names suggest. This is not an exhaustive list; in fact you can find almost every direct mathematical formula you use in the SAS function list if you look in the documentation. We'll not be going through all the functions.
One important thing to understand is that data processing in SAS happens row by row, not column by column. Let's create a dataset and see how these functions work row by row.
data func;
  input x y z;
  cards;
10 20 30
1 2 3
5.4 6.7 9.33
100 200 0
;
run;
Now let's apply some numeric functions and see what they do.
data func; set func; s1=sum(x); s2=sum(x,y,z); run;
You will notice that the variable "s1" above does not contain the sum of the entire column x. In fact it contains exactly the same values as x. Why? Because these functions work on rows, not on columns. In any one row there is only one value of x to be summed, so the result is just x.
On the other hand, "s2" is the sum of the values of x, y and z in the same row.
Note: you may be wondering why we need a sum function when we can use the "+" operator for the same purpose. There is a small difference. When the sum function encounters a missing value, it ignores it, whereas the "+" operator returns a missing value as the result. Let's see an example:
data _null_; x=sum(10,20,30,.); y=10+20+30+.; put x; put y; run;
String Functions
We saw that most numeric functions simply carry their mathematical names, which readily tell us what the functions are for. The same is not the case for string functions, i.e. functions used to process character variables. We'll talk about a few important character functions in detail, with examples.
scan
This function takes a string as its first input. Imagine that the input string is an address whose elements, such as home number, street, city etc., are separated by "/". The third input to scan is this "delimiter", which separates the elements of the string. The second input is the position of the element you want to extract. For example, we have this address:
"1502/Panch Mahal/Malad/Mumbai"
We want to extract the suburb name from this address, which is the third element if we take "/" as the delimiter. Let's see:
data _null_;
  address="1502/Panch Mahal/Malad/Mumbai";
  suburb=scan(address,3,"/");
  put suburb;
run;
Explore Yourself: Can we use multiple delimiters with scan?
substr
The function substr extracts a substring from a larger string when we know the starting position of the substring and its length. Keep in mind that counting starts at one, not zero as in many other programming languages. Here are a few examples:
data _null_;
  IP="192.168.1.1:543";
  octet=substr(IP,5,3);    /* 3 characters starting at position 5: 168 */
  put octet;
run;
data _null_;
  IP="192.168.1.1:543,AutomatedMails";
  port=substr(IP,13);      /* from position 13 to the end of the string */
  port1=substr(IP,13,3);   /* 3 characters starting at position 13: 543 */
  put port;
  put port1;
run;
Explore Yourself: What happens if we do not give the third argument (the length) to substr?
trim , strip , || ,catx,compress
The functions named above remove white space from the input string in various ways [trim, strip, compress], and the operator || and the function catx combine strings. We'll learn through some examples:
data _null_;
  x="Lalit";
  y="Sachan";
  z=x||y;
  m=x||"-7@"||y;
  put z;
  put m;
run;
You can see that the operator || [the double pipe symbol] simply combines strings. Let's look at the white-space-removing functions and their peculiarities. We use the length function to check whether trim is working, in addition to printing the results in the log with the "put" statement.
data _null_;
  x=trim(" Lalit ");
  y=trim(" Sachan ");
  z="@"||x||"@"||y||"@";
  x_l=length(x);
  y_l=length(y);
  put x_l;
  put y_l;
  put z;
run;
You can see that in the example above none of the spaces appear to be removed. This is a peculiarity of storing the result of trim in a variable: trim removes only trailing spaces, and SAS pads the stored value back to the variable's length with blanks. The effect is visible only when trim is used directly inside the expression, where it removes the trailing spaces from the string:
data _null_;
  x=" Lalit ";
  y=" Sachan ";
  z="@"||trim(x)||"@"||trim(y)||"@";
  put z;
run;
Now let's look at how strip behaves.
data _null_;
  x=strip(" Lalit ");
  y=strip(" Sachan ");
  z="@"||x||"@"||y||"@";
  put z;
run;
Unlike trim, strip in the above example visibly removes the leading spaces [the trailing spaces are padded back when the value is stored in a variable]. Let's see how it behaves when used directly in the expression that creates the new variable.
data _null_;
  x=" Lalit ";
  y=" Sachan ";
  z="@"||strip(x)||"@"||strip(y)||"@";
  put z;
run;
In this case it removes all the leading and trailing spaces [but not the ones in between].
compress
This function removes all spaces from the string , including the ones which are in between.
data _null_; x=" Lalit Sachan "; z="@"||compress(x)||"@"; put z; run;
catx
This function concatenates strings after removing leading and trailing spaces from them. The first argument is the delimiter to be used while combining the strings. Any string to be combined that consists only of white space is ignored. Here is an example. Notice how the white-space-only string is simply ignored while creating y. In both cases "$" has been used as the delimiter.
data _null_; x=catx("$"," 45 "," ytfy ","asdf "); y=catx("$"," xd ", " ","dr "); put x; put y; run;
Explore Yourself: Find out what functions "upcase" and "lowcase" do? Come up with a functioning example.
find
This function finds the starting position of a smaller substring in a larger input string. Remember that counting starts at one from the beginning of the string. The first argument is the larger string in which we search. The second argument is the string we are looking for. The third argument is the position in the larger string at which the search should start. If that number is positive, the search is done from left to right; if it is negative, the search starts at that position and is done from right to left. In either case the returned value is the starting position of the smaller string counted from the beginning of the larger string. If the smaller string is not found, find returns 0.
If the third argument is omitted, the search starts at the beginning of the string and is done from left to right. Also note that if there are multiple occurrences of the smaller string, the position returned is that of the occurrence encountered first, given the starting position and direction of the search. Here are a few examples:
data _null_;
  x="akjs@askj@asdkf@a";
  z=find(x,"@a");
  m=find(x,"@a",7);
  k=find(x,"@a",-17);
  a=find(x,"@a",-7);
  b=find(x,"@a",17);
  put z;
  put m;
  put k;
  put a;
  put b;
run;
The search is case sensitive by default, as can be seen in the example below. "a" is not found [find returns 0] because the only "A" in the larger string is in caps.
data _null_;
  x="SjdksdA";
  y=FiNd(x,"a");
  put y;
run;
If you want your search to be case insensitive, you need to use the modifier "i". The first and second arguments are the string to be searched in and the string to be searched for. The modifier "i" and the starting position can follow in either order, as the example below shows.
data _null_;
  x="akjs@askIj@asdkf@a";
  z=find(x,"@A");
  m=find(x,"@A","i",7);
  n=find(x,"i",7,"i");
  put m;
  put z;
  put n;
run;
Explore Yourself: What does the modifier "t" do in the function "find"?
Tranwrd
This function replaces occurrences of a substring in a larger input string. In the example below we replace all hyphens with "/". The first argument is the string in which we want to make the replacements, the second argument is what we want to replace, and the third is what we want to replace it with.
data _null_;
  address="1203-Some Tower-powai/Mumbai";
  proper_add=tranwrd(address,"-","/");
  put proper_add;
run;
Here is an exercise. Run the code given below to create the dataset:
data Add;
  length address $40;
  input address $;
  cards;
1604-some-chandiwali,Mumbai
12-a/Delhi
First-Street,Chennai
;
run;
Once that is done, create a column in the dataset which contains the city names extracted from these addresses. Use whatever functions you think are appropriate.
Exercise Solution:
data add(drop=a1 a2 z);
  set add;
  /* turn both kinds of delimiter into commas */
  a1=tranwrd(address,"-",",");
  a2=tranwrd(a1,"/",",");
  /* search from the right to find the last comma; the city follows it */
  z=find(a2,",",-length(a2));
  city=substr(a2,z+1);
run;
Utility Functions and Procedures
In addition to numeric and string functions, SAS has many utility functions and procedures that let us do tasks other than simply extracting or transforming numeric or categorical variables.
Input
This function reads a character value using a specific informat while creating a new variable. Remember that it cannot be used to change the type of an existing variable.
data temp;
  x="12/01/2013";
run;
/* In the dataset temp above, x is essentially a string, as can be confirmed by looking at its type. We can now read it with a date informat to create another variable which contains the same value, but stored as a date. */
data temp;
  set temp;
  format y mmddyy10.;
  y=input(x,ddmmyy10.);
  put y;
run;
Many a time a variable that is supposed to be numeric comes out as character when the data is imported, because some of its values are characters. We can use the input function to convert it into a numeric variable by applying the informat "8.". Let's see an example:
data temp;
  input some $;
  cards;
10
20
30
a
b
12
13
14
;
run;
If you look at the type of the variable "some" in the dataset temp, it is character. Let's convert it to a numeric variable.
data temp;
  set temp;
  some_num=input(some,8.);  /* "a" and "b" cannot be read as numbers and become missing */
run;
smallest , largest
The functions min and max always give the smallest and largest value; however, at times we might need the nth smallest or largest value among many. For that we can use the smallest and largest functions. The first argument to these functions is the value of "n". The example below gets the 3rd smallest and 3rd largest values from the data respectively.
data _null_;
  x=smallest(3,23,1,4,-5,7,0,10);
  y=largest(3,23,1,4,-5,7,0,10);
  put x;
  put y;
run;
Lag
Since by default SAS processes data row by row, there is no direct way to access previous observations in a data step. For that we have to use the lag function, which is designed specifically for this:
data temp;
  input A $ B C;
  cards;
truck 10 1
truck 20 2
truck 30 3
car 40 4
car 50 5
car 60 6
;
run;
data temp;
  set temp;
  D=lag(B);
run;
You can see that the new variable "D" simply takes the previous values of variable "B". In other words, it is equivalent to column "B" with one lag. You can apply multiple lags too by using the function lagn. Following is an example with lag3.
data temp;
  set temp;
  D=lag3(B);
run;
However, this gets tricky if you use the lag function inside a condition. In that case lag returns only those values which it gets to see within the condition block. Here is an example. Try to understand it, and if it doesn't make sense ask for a detailed explanation in class:
proc sort data=temp;
  by A;
run;
data temp;
  set temp;
  by A;
  new_var=first.A;
  if first.A then D=lag(B);
  else D=lag(C);
run;
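One way to see why lag behaves this way: each call to lag keeps its own queue of values, and the queue is updated only on the rows where that call actually executes. A common workaround, sketched below with a hypothetical dataset lag_demo, is to call lag unconditionally on every row and apply the condition afterwards.

```sas
/* Hypothetical data, already sorted by A */
data lag_demo;
  input A $ B;
  cards;
car 40
car 50
truck 10
truck 20
;
run;

data lag_demo2;
  set lag_demo;
  by A;
  prev_B = lag(B);             /* runs on every row, so the queue is always current */
  if first.A then prev_B = .;  /* do not carry a value across groups */
run;
```

Here prev_B holds the previous value of B within each group of A and is missing on the first row of every group.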
Round
The round function rounds off numeric values. The first argument is the value being rounded, and the second argument is the rounding unit.
data _null_; x=123.45567; y=round(x); z=round(x,0.001); put z; put y; run;
In the above example, the second input is .001, which means x will be rounded off to the 3rd digit after the decimal point. You can think of the process like this: first x is divided by .001, then rounded off to the nearest integer, and then multiplied by .001.
So x/.001 = 123455.67; rounded off to the nearest integer this becomes 123456, which multiplied by .001 becomes 123.456.
Let's take a few more examples:
data _null_;
  x=123.45567;
  y=round(x,0.1);
  z=round(x,100);
  m=round(x,10);
  put m;
  put z;
  put y;
run;
Consider m=round(x,10): first x gets divided by 10, which gives 12.345567; then it gets rounded off to the nearest integer, which is 12; then it gets multiplied by 10 and becomes 120, which is the final value of m.
Explore Yourself: Repeat the above process for y and z as well, and see whether the final values match your calculations.
Proc Rank
Proc rank is used to make bins in your data, based on a numeric variable of your choice. For example, in the dataset sashelp.cars, we want to make bins by the variable invoice. The observations are ordered by invoice and then, starting from the smallest values, equal numbers of observations are put into each bin.
proc rank data=sashelp.cars out=car_rank groups=10;
  var invoice;
  ranks basket;
run;
"groups=10" tells proc rank that there are going to be 10 bins/groups in the data. "ranks basket" names the variable containing the group/bin number "basket". Bin numbering starts at 0.
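A quick way to check the result is to count the observations in each bin and look at the range of invoice within it. The sketch below repeats the proc rank step so that it runs on its own; the checks with proc freq and proc means are an addition for illustration, not part of the original material.

```sas
proc rank data=sashelp.cars out=car_rank groups=10;
  var invoice;
  ranks basket;
run;

/* number of observations per bin: the counts should be roughly equal */
proc freq data=car_rank;
  tables basket;
run;

/* smallest and largest invoice in each bin */
proc means data=car_rank min max;
  class basket;
  var invoice;
run;
```

The bins are numbered 0 to 9, and the invoice ranges should increase with the bin number.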
Proc transpose
This is used to turn your data from long to wide or from wide to long, as discussed before. Let's create the same data which we showed there.
The following program uses proc transpose to convert the long format data into wide:
proc transpose data=long1 out=wide1 prefix=year_;
  by famid;
  id year;
  var faminc;
run;
by statement: makes one row for each unique value of the variable specified in the by statement.
id statement: makes one column for each unique value of the variable specified in the id statement.
var statement: fills the values of the variable specified in the var statement into the resulting cells of the transposed dataset. Cells that have no corresponding value in the incoming dataset are assigned missing values, such as the cell for year 97 and famid 3 in the above example.
The next question that might be bothering you is what happens if there is more than one variable to be filled in: you simply get multiple rows for each value of the variable in the "by" statement. For example, in the example given below you get 2 rows for each famid.
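To try the program end to end, here is a self-contained sketch. The dataset below is a hypothetical stand-in for long1: the column names famid, year and faminc come from the program above, but the values are made up for illustration, with the row for famid 3 and year 97 deliberately left out.

```sas
/* Hypothetical stand-in for long1; values are illustrative only */
data long1;
  input famid year faminc;
  cards;
1 96 40000
1 97 40500
1 98 41000
2 96 45000
2 97 45400
2 98 45800
3 96 75000
3 98 77000
;
run;

proc transpose data=long1 out=wide1 prefix=year_;
  by famid;
  id year;
  var faminc;
run;

proc print data=wide1;
run;
```

The output has one row per famid with the columns year_96, year_97 and year_98, and year_97 is missing for famid 3.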
Proc Format
Proc format is used to create user-defined formats. It does not require any input dataset, and the created format can be applied to any variable in any dataset. It does not change the underlying values of the variable; it only changes how they are displayed. Here is an example:
proc format;
  value $jc 'one'='Management' 'two'='Trainees';
  value Grade 0-32="F" 33-45="C" 46-58="B" 60-100="A";
run;
The "value" statement is the one that creates the format. If the format is going to be applied to character values, the format name starts with a "$" sign; otherwise the name starts as usual. The naming constraints for formats are largely the same as for variable names. In the value statement above we created the format $jc; if we apply it to a categorical variable and the value is "one", the displayed value will be "Management", and "Trainees" if the value is "two". If the value matches neither "one" nor "two", it will be displayed as is.
For the numeric format Grade, if the value of the numeric variable to which it is applied is in the range 0-32 then "F" will be displayed, and so on. If a value does not fall in any of the given ranges [such as 59 here], a * will be displayed in its place. Let's see an example of these formats being applied to the dataset temp. To emphasize that the underlying values don't change, I have also created a numeric variable in the same data step.
data temp;
  input jobs $ marks;
  cards;
one 10
two 75
one 34
two 59
abc 79
one 49
one 56
two 90
abc 20
;
run;
data temp;
  set temp;
  format jobs $jc.;
  format marks grade.;
  marks2=marks/2;
run;
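A format can also be used to create a new variable, rather than only to change how an existing one is displayed. The sketch below assumes a hypothetical format named price_band with made-up cut-offs; the put function applies the format and returns the label as a character value.

```sas
/* Hypothetical format: the name and the cut-offs are for illustration only */
proc format;
  value price_band
    low -< 20000   = "Low"
    20000 -< 40000 = "Mid"
    40000 - high   = "High";
run;

data cars_banded;
  set sashelp.cars(keep=make model invoice);
  length band $4;
  band = put(invoice, price_band.);  /* stores the label itself */
run;
```

Unlike the format statement, this stores the label in the dataset, so band can be used directly as a grouping variable.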
Proc SQL
This is the implementation of the SQL language within SAS. All of the tasks we'll see here can also be achieved with what we have learned so far. SQL queries are, however, at times easier to read and write. Be careful when using them with large datasets: they might not be as fast as their data step counterparts.
You will see that SQL queries read very much like English. They are mostly used to subset, summarize and pre-process data. There are no predictive modeling procedures in the SQL framework.
We'll see that all SQL queries are just select statements. These select statements have incremental capabilities, which we'll see starting with the simplest form, where you select all the observations from the incoming dataset. All SQL queries go in a block starting with "proc sql" and closed with "quit". The result of the selection is displayed in the result window. If we want to put the result of the selection in a dataset, we can simply add "create table table_name as" in front of the select statement. Let's see some examples.
proc sql; select * from sashelp.cars; quit;
All observations from sashelp.cars are displayed in result window.
proc sql ; create table lalit as select * from sashelp.cars; quit;
This time the observations are not displayed; instead a table named "lalit" is created in the work library [you can supply a libref to create it in some other location] with all the observations. From here onwards we'll not use create table; whenever you want to do that, simply add that part in front of the select statement.
If you do not want to select all the columns of the data, you can restrict the selection by listing the variable names separated by commas.
proc sql; select name,nhits from sashelp.baseball; quit;
This controls the number of variables/columns you select from the dataset. Now what if I want to restrict the number of observations? There are many ways to do it.
proc sql inobs=10;
  select name from sashelp.baseball;
  select make from sashelp.cars;
quit;
Using inobs/outobs on the proc sql statement restricts the number of incoming/outgoing observations for all the select statements in that block. If we want to restrict the number of observations separately for each select statement, we can do the following.
proc sql;
  select name from sashelp.baseball(obs=10);
  select make from sashelp.cars(obs=20);
quit;
There is also an option called outobs, which specifies the number of observations that go out. In the current example it works the same as inobs, but when you are processing data it behaves differently.
proc sql outobs=10;
  select name from sashelp.baseball;
quit;
As we saw in the data step, just restricting the number of observations is not enough; we need some way to filter observations conditionally. We can achieve that by using "where" with the select statement, as follows:
proc sql;
  select invoice,drivetrain
  from sashelp.cars
  where origin="Asia";
quit;
We can write multiple conditions as well, by combining them with the and/or operators.
proc sql;
  create table temp as
  select invoice,origin,drivetrain,type,mpg_city
  from sashelp.cars
  where origin="USA" and type="Sedan" and mpg_city>15;
quit;
Remember that you don't necessarily need to select the variable on which you apply the condition. The next requirement is to sort the data; for that we add order by to our select statement.
proc sql; select invoice,origin from sashelp.cars order by invoice; quit;
The default sort order is ascending. If you want to sort in descending order, use the keyword desc as shown below:
proc sql; select invoice,origin from sashelp.cars order by invoice desc; quit;
you can order by multiple variables as well:
proc sql; select origin,msrp from sashelp.cars order by origin,msrp desc ; quit;
Next is to group observations and get aggregated/summary statistics such as mean, std etc., which are defined for a group of values rather than for an individual observation.
proc sql;
  select origin,drivetrain,mean(msrp) as msrp_avg
  from sashelp.cars
  group by origin,drivetrain;
quit;
Here the summary operation [such as calculating the mean in the above example] is carried out on the groups created by "group by". Here are a few more examples, one of which includes order by as well.
proc sql;
  select origin, std(msrp) as price_std
  from sashelp.cars
  group by origin;
quit;
proc sql;
  select make, std(msrp) as price_var
  from sashelp.cars
  group by make
  order by price_var;
quit;
Now suppose we want to put a condition on the new variable that is created [price_var]; let's see if a simple where condition works:
proc sql;
  select make, std(msrp) as price_var
  from sashelp.cars
  where price_var>10000
  group by make
  order by price_var;
quit;
The above code throws an error:
ERROR: The following columns were not found in the contributing tables: price_var.
To apply conditions on variables that are created in SQL queries by summary functions, we need to use "having":
proc sql ; select make, std(msrp) as price_var from sashelp.cars group by make having(price_var>10000) order by price_var ; quit;
The sequence in which you should write the clauses is: where > group by > having > order by. Next we'll see how to get data from multiple tables.
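Before moving on, here is a small sketch that uses all four clauses in that order. The query itself is an illustration on sashelp.cars and the threshold 30000 is arbitrary; the having clause repeats the summary function instead of using the alias, which is the most portable way to write it.

```sas
proc sql;
  select make, mean(msrp) as avg_price
  from sashelp.cars
  where origin="Asia"          /* 1. filter rows first     */
  group by make                /* 2. form the groups       */
  having mean(msrp) > 30000    /* 3. filter the groups     */
  order by avg_price desc;     /* 4. sort the final result */
quit;
```

The where clause acts on individual observations before grouping, while having acts on the groups after the summary has been computed.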
The key is to give names [aliases] to the tables, which can be used to reference a table while extracting columns from it. We'll try to solve the following case, which involves getting data from multiple tables.
Case: datasets gaming1, gaming2 and gaming3 contain information on customers of a gaming company which provides an online platform for playing team games such as AOE, DOTA and CS. We want the ids of those customers who play DOTA on mac os in solo sessions with a free license type, and whose average time per session is more than 40 minutes.
Let's first list what information is stored where:
gaming1 = gamer_id, game name, atps
gaming2 = gamer_id, os, license
gaming3 = gamer_id, session_type, netspeed
We'll give names to the tables in the select statement itself. I have written the following select statement on multiple lines for better readability.
proc sql;
  select a.gamer_id
  from dp.gaming1 as a, dp.gaming2 as b, dp.gaming3 as c
  where b.os="mac" and a._game_name="dota" and
        a.atps>40 and c.session_type="solo" and b.license="free" and
        a.gamer_id=b.gamer_id and a.gamer_id=c.gamer_id;
quit;
The part "a.gamer_id=b.gamer_id and a.gamer_id=c.gamer_id" is a must for setting up the correspondence between observations of the multiple tables. If you don't do that you'll get a cross product of the observations, as shown below:
data s1;
  input id a $;
  cards;
1 q
2 a
3 z
;
run;
data s2;
  input id b $;
  cards;
1 p
2 l
3 m
;
run;
proc sql;
  select a,b from s1,s2;
quit;
Now if we add a where condition that sets up the correspondence, we'll get the desired result.
proc sql;
  select a,b
  from s1,s2
  where s1.id=s2.id;
quit;
Explore Yourself:
* How to join/merge tables using SQL
* What distinct and count do when used in SQL queries
We'll conclude here. In case of any doubts regarding the content of this study material, please post on the QA forum in the LMS.