Page 1: DP

 

 

Data Preparation

Data preparation is the first thing you do as a data analyst, and something you spend a lot of time on, well before you try to build predictive models from the data.

In essence, data preparation is about processing data to get it ready for all kinds of analysis. Industry data collection is mostly driven by front-end business processes, not by the needs of predictive models. At some point or other, these processes introduce errors into the data.

There can be many reasons (not necessarily errors) why we need to pre-process our data and change it for the better:

• Missing data
• Potentially incorrect data
• Need for changing the form of the data

We'll discuss these reasons, and the methods for achieving our pre-processing goals, going forward.

Handling Missing Values and Outliers

You'll find that the treatment of missing values and of outliers can at times be very similar. The reason is that both kinds of observations are not in a usable state, because their information is either missing or misleading.

Treatment of missing values:

• Removing observations with missing values

This is the most common method in the industry, because missing values are generally a very small part of the data you deal with. However, keep the following in mind when removing observations because of missing data:

1. If observations with missing values make up a significant chunk of the data, you should not drop all of them.

2. If a variable that had missing values has entered your model, you need to plan what to do when you encounter missing values in unseen data once the model is in production.

• Imputing [filling in] missing values with the mean/median/mode of the respective variable

We don't need to get into the details of this.

• Imputing with business logic

Many a time, we know what a missing value might mean in the context of the business process. For example, if the account balance is missing for a bank account, it might mean that the balance is zero.
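As a rough illustration of the imputation approaches above, here is a minimal sketch. The dataset accounts and its variables acct_id, balance and income are made up for this example, and proc stdize with the reponly option is only one of several ways to fill in a mean (it assumes SAS/STAT is available).

```sas
/* Hypothetical accounts data; "." marks a missing numeric value */
data accounts;
  input acct_id balance income;
  cards;
1 500 40000
2 . 52000
3 1200 .
4 300 61000
;
run;

/* Business logic (assumed here): a missing balance means a zero balance */
data accounts_bl;
  set accounts;
  if missing(balance) then balance = 0;
run;

/* Mean imputation: REPONLY replaces only the missing values of income
   and leaves the non-missing values unchanged */
proc stdize data=accounts_bl out=accounts_imputed reponly method=mean;
  var income;
run;
```

Replacing method=mean with method=median would give median imputation instead.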

Treatment of outliers:

• Removing observations with outliers

There are two issues with including outliers in predictive analysis:

1. Because of outliers, the ranges of the predictor variables get inflated artificially. The model you get might not be applicable across that whole range.

2. Some outliers have high leverage in the modelling process. In the presence of such observations you'll get a model which is not a good fit for the general population [data].

If you are preparing data for predictive modelling, you generally need to remove outliers. However, if the variable with outliers is present in the model, you need to figure out what to do when you encounter outlier values in unseen data once the model is in production.

• Flooring/Capping

In some cases it makes sense to replace outlying values with a lower or upper limit when they fall outside those limits. Replacing with the lower limit is called flooring, and replacing with the upper limit is called capping.

• Imputing with business logic

Many a time, we know what an outlier value might mean in the context of the business process.
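Flooring and capping can be sketched in a data step as follows. The dataset raw, the variable amount and the limits 10 and 40 are all assumptions made for this illustration; in practice the limits might come from percentiles of the data or from business rules.

```sas
/* Hypothetical data with one very small and one very large amount */
data raw;
  input amount @@;
  cards;
-900 12 15 18 20 22 25 27 30 5000
;
run;

/* Illustrative limits, chosen by hand for this sketch */
%let lower = 10;
%let upper = 40;

data capped;
  set raw;
  amount_capped = amount;
  /* SAS treats a missing value as smaller than any number,
     so missing amounts are left alone explicitly */
  if not missing(amount) then do;
    if amount < &lower then amount_capped = &lower;       /* flooring */
    else if amount > &upper then amount_capped = &upper;  /* capping  */
  end;
run;
```

Keeping the original amount next to amount_capped makes it easy to check how many values were changed.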

Need for changing form of the data

Transforming and extracting information from the existing data

Consider a simple transaction date and time column for an eCommerce website. On its own, a column of dates is not of much use, but a lot of information can be extracted from this simple-looking data, for example the gaps between transactions, or the number of transactions happening every day, week or month.
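A minimal sketch of this kind of extraction is given below. The dataset txn and the variable txn_date are hypothetical; month, weekday and lag are standard SAS functions.

```sas
/* Hypothetical transaction dates, already in date order */
data txn;
  input txn_date :date9.;
  format txn_date date9.;
  cards;
01JAN2014
05JAN2014
20JAN2014
;
run;

data txn_features;
  set txn;
  txn_month   = month(txn_date);    /* 1 to 12 */
  txn_weekday = weekday(txn_date);  /* 1 = Sunday, ..., 7 = Saturday */
  /* SAS dates are day counts, so a difference is a gap in days;
     the first row has no previous date and gets a missing gap */
  gap_days = txn_date - lag(txn_date);
run;
```

Variables such as txn_month can then serve as grouping variables when counting transactions per month.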

Collapsing and Summarising Data:

Many a time we need to collapse data based on some grouping variables [this is more or less the same as what we discussed in univariate statistics], for example producing a monthly summary from daily transaction data. In addition to the tools we learned in the Univariate Statistics module, we will learn a few new things in the "to do with SAS" section.

Transposing Data

This is one of the very useful procedures we'll learn here. Below is an example of long data:

    famid  year  faminc
    1      96    40000
    1      97    40500
    1      98    41000
    2      96    45000
    2      97    45400
    2      98    45800
    3      96    75000
    3      98    77000

Sometimes it makes sense to convert this kind of data into a wide format. Below is the same data in a wide format:

    famid  year_96  year_97  year_98
    1      40000    40500    41000
    2      45000    45400    45800
    3      75000    .        77000

Since SAS processes data row by row in many procedures as well as in data step code, these kinds of transformations are often needed. We'll learn how to achieve this with Proc Transpose.

Formatting Data Columns, Creating Reports

In addition to other tools, we'll also learn very useful procedures for creating all kinds of reports and user-defined data formats, using Proc Report and Proc Format.

Data Preparation with SAS

In the coming section we'll learn many tools, SAS functions and utility procedures for carrying out the data preparation tasks we have discussed so far, and then some more. We'll start by answering a few simple questions based on the data "bank_transactions", using tools that we learned in the Univariate Statistics module. Later we'll see how the same can be achieved in a much simpler and faster manner.

    libname dp "/folders/myfolders/Datasets/Data Prep";

Q: Find the category of the highest transaction in debit/credit for each month.

A: We can sort the data by year, month and then amount in descending order. Then within each group we can find the observation with the maximum amount.

    proc sort data=dp.bank_transactions;
    by year month dc descending amount;
    run;

    proc means data=dp.bank_transactions max;
    var amount;
    by year month dc;
    run;

Q: Find the total transaction amount for debit/credit each month.

A: We can again use a combination of proc sort and proc means to find this out, with the "sum" option.

    proc sort data=dp.bank_transactions;
    by year month dc;
    run;

    proc means data=dp.bank_transactions sum;
    var amount;
    by year month dc;
    run;

This works fine, but as we have seen before, getting the output of proc means into an output dataset is not a straightforward task. Let's learn about "first." and "last.": these are temporary variables created behind the scenes when a by statement is used in data step code. [Keep in mind that a "by" statement can be used only after sorting your data.] Let's create the data that we'll use to learn about them:

    data example;
    input grps section $ score;
    cards;
    1 a 10
    1 a 20
    1 b 30
    1 b 40
    2 a 50
    2 a 60
    2 b 0
    2 b -10
    ;
    run;

The dataset that we have created is already sorted, hence we can use the "by" statement without sorting it first. When we use a "by" statement, SAS creates the temporary variables "first." and "last.", which take the values "1" and "0" for each observation depending on the groups formed by the variables in the "by" statement. Let's look at the example given below to understand this better:

    data example;
    set example;
    by grps;
    first_grps=first.grps;
    last_grps=last.grps;
    run;

    data example1;
    set example;
    by grps section;
    first_section=first.section;
    last_section=last.section;
    run;

In the first program we used "by grps". The variable "grps" creates two groups in the data, one for the value "1" and another for the value "2". The variable "first.grps" takes the value "1" for the first observation in each group and "0" for the others; similarly, "last.grps" takes the value "1" for the last observation in each group and "0" for the others.

In the second program we used "by grps section". This makes more groups in the data, and first.section and last.section take the values "1" and "0" accordingly.

We don't really need to store these first. and last. variables in order to use them; in the programs above we did so just for demonstration. Let's use them to solve a problem similar to the one we solved for the bank_transactions data. Let's get the top score for each section.

    proc sort data=example;
    by grps section descending score;
    run;

    data top_example;
    set example;
    by grps section;
    if first.section;
    run;

Get the total score for each section:

    data total_scores(drop=score);
    set example;
    by grps section;
    total_score+score;
    /* restart the running total at the first row of each section */
    if first.section then total_score=score;
    /* write one row per section, at its last row */
    if last.section then output;
    run;

In a similar fashion, we can solve the original problems that we solved for the dataset bank_transactions:

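    proc sort data=dp.bank_transactions;
    by year month dc descending amount;
    run;

    /* amount is sorted in descending order, so the first row of each
       year/month/dc group holds the highest transaction;
       category is kept because the question asks for it */
    data bt_summary(drop=day);
    set dp.bank_transactions;
    by year month dc;
    if first.dc then output;
    run;

    data bt_summary_total(drop=amount day category);
    set dp.bank_transactions;
    by year month dc;
    total_amount+amount;
    total_transac+1;
    /* restart both running totals at the first row of each group */
    if first.dc then do;
        total_amount=amount;
        total_transac=1;
    end;
    if last.dc then output;
    run;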

Numeric Functions

Before we start to learn about SAS functions, let's learn a way to avoid creating a dataset every time we just want to see what a function does. A handy way is to name the outgoing dataset "_null_"; this tells SAS not to create any dataset in the data step program. But we do need something which will show us the result of the function that we just used. The "put" statement comes to the rescue: it prints whatever we ask it to in the log. Remember, in the log window, not in the result window. Let's look at a few numeric functions available in SAS:

    data _null_;
    x=sqrt(2000000);
    y=log(x);
    z=sum(23,34,56);
    put x;
    put y;
    put z;
    run;

There are several such numeric functions. A longer list can be found here:

http://support.sas.com/documentation/cdl/en/imlug/59656/HTML/default/viewer.htm#langref_sect321.htm

A quick list that comes to mind: log, exp, sqrt, mean, median, sum, n, nmiss. These functions do what their names suggest. This is not an exhaustive list either; in fact, you can find almost every mathematical formula you commonly use in the SAS function list if you look through the documentation. We will not go through all the functions.

One important thing to understand, however, is that data processing in SAS happens row by row, not column by column. Let's create a dataset and see how these functions work row by row:

    data func;
    input x y z;
    cards;
    10 20 30
    1 2 3
    5.4 6.7 9.33
    100 200 0
    ;
    run;

Now let's apply some numeric functions and see what they do.

    data func;
    set func;
    s1=sum(x);
    s2=sum(x,y,z);
    run;

You will notice that the variable "s1" above does not contain the sum of the entire column x. In fact it contains exactly the same values as x. Why? Because these functions work on rows, not on columns. In any one row there is only one value of x to be summed, so the result is just x.

On the other hand, "s2" is the sum of the values of the variables x, y and z in the same row.

Note: you may be wondering why we need a function for sum when we can use the algebraic sign "+" for the same purpose. Well, there is a small difference. When the sum function encounters a missing value while adding, it ignores it, whereas with the "+" operator you get a missing value as the result. Let's see an example:

    data _null_;
    x=sum(10,20,30,.);
    y=10+20+30+.;
    put x;
    put y;
    run;

String Functions

We saw that most of the numeric functions simply carry their mathematical names. These names readily make sense and tell us what the functions are used for. The same is not true of string functions, that is, functions used to process character variables. We'll talk about a few important character functions in detail, with examples.

scan

This function takes a string as its first input. Imagine a scenario where this input string is an address whose elements, such as house number, street and city, are separated by "/". The second input is the position of the element you want to extract from the string, and the third input is the "delimiter" which separates the elements. For example, we have this address:

"1502/Panch Mahal/Malad/Mumbai"

We want to extract the suburb name from this address, which is the third element if we consider "/" to be the delimiter in the string. Let's see:

    data _null_;
    address="1502/Panch Mahal/Malad/Mumbai";
    suburb=scan(address,3,"/");
    put suburb;
    run;

Explore Yourself: Can we use multiple delimiters with scan?

substr

The substr function can be used to extract a substring from a larger string if we know the starting position of the substring in the larger string and its length. Keep in mind that counting starts at one, not zero as in some other programming languages. Here are a few examples:

    data _null_;
    IP="192.168.1.1:543";
    /* 3 characters starting at position 5: the second part of the address */
    part2=substr(IP,5,3);
    put part2;
    run;

    data _null_;
    IP="192.168.1.1:543,AutomatedMails";
    /* without a length, everything from position 13 onwards is returned */
    port=substr(IP,13);
    port1=substr(IP,13,3);
    put port;
    put port1;
    run;

Explore Yourself: What happens if we leave out the third argument (the length) in the function substr?

trim, strip, ||, catx, compress

The functions named above and the operator || are used to remove white space from the input string in various ways [trim, strip, compress] and to combine strings [||, catx]. We'll learn through some examples:

    data _null_;
    x="Lalit";
    y="Sachan";
    z=x||y;
    m=x||"-7@"||y;
    put z;
    put m;
    run;

You can see that the operator || [the double pipe symbol] simply combines strings. Let's look at the white-space-removing functions and the peculiarities associated with them.

    data _null_;
    x=trim("    Lalit    ");
    y=trim("    Sachan    ");
    z="@"||x||"@"||y||"@";
    x_l=length(x);
    y_l=length(y);
    put x_l;
    put y_l;
    put z;
    run;

You can see that in the above example none of the spaces appear to be removed. This is a peculiarity of how trim interacts with variable assignment: trim removes only trailing spaces, and when its result is stored in a variable, SAS pads the value with blanks again up to the variable's length. trim is effective only when it is used directly inside the expression, where it removes the trailing spaces from the string:

    data _null_;
    x="    Lalit    ";
    y="    Sachan    ";
    z="@"||trim(x)||"@"||trim(y)||"@";
    put z;
    run;

Now let's look at how strip behaves. We can use the length function to check whether the trim/strip functions are working, in addition to printing the values in the log using the "put" statement.

    data _null_;
    x=strip("    Lalit    ");
    y=strip("    Sachan    ");
    z="@"||x||"@"||y||"@";
    put z;
    run;

As opposed to the trim function, in the above example strip removes the leading spaces. Let's see how it behaves when used directly inside the expression that creates a new variable.

    data _null_;
    x="    Lalit    ";
    y="    Sachan    ";
    z="@"||strip(x)||"@"||strip(y)||"@";
    put z;
    run;

In this case it removes all the leading and trailing spaces [but not the ones in between].

compress

This function removes all spaces from the string, including the ones in between.

    data _null_;
    x="    Lalit      Sachan    ";
    z="@"||compress(x)||"@";
    put z;
    run;

catx

This function concatenates strings after removing leading and trailing spaces from them. The first argument here, however, is the delimiter which will be used while combining the strings. If any of the strings to be combined consists only of white space, it is ignored. Here is an example to make this clearer. Notice how the white-space-only string is simply ignored while creating y. In both cases "$" has been used as the delimiter.

    data _null_;
    x=catx("$","    45      ","  ytfy        ","asdf    ");
    y=catx("$","    xd        ","            ","dr          ");
    put x;
    put y;
    run;

Explore Yourself: Find out what the functions "upcase" and "lowcase" do. Come up with a working example.

find

This function is used to find the starting position of a smaller substring in a larger input string. Remember that counting starts at one from the beginning of the string. The first argument is the larger string in which we want to find the smaller one. The second argument is the string we are looking for. The third argument is the position in the larger string at which the search should start. If that number is positive, the search is done from left to right; if it is negative, the search is done from right to left. In both cases, however, the returned value is the starting position of the smaller string counted from the beginning of the larger string.

If the third argument is omitted, the search by default starts at the beginning of the string and goes left to right. Also note that if there are multiple occurrences of the smaller string, the function returns the starting position of the occurrence encountered first, which depends on the starting position and direction of the search. Here are a few examples:

    data _null_;
    x="akjs@askj@asdkf@a";
    z=find(x,"@a");
    m=find(x,"@a",7);
    k=find(x,"@a",-17);
    a=find(x,"@a",-7);
    b=find(x,"@a",17);
    put z;
    put m;
    put k;
    put a;
    put b;
    run;

The search is case sensitive by default, as can be seen in the example below. The lowercase "s" is found at position 5 rather than at position 1, because the "S" at position 1 is in caps. (The function name itself is not case sensitive, which is why "FiNd" works.)

    data _null_;
    x="SjdksdA";
    y=FiNd(x,"s");
    put y;
    run;

If you want your search to be case insensitive, you need to use the modifier "i". The first and second arguments are the string to be searched in and the string to be searched for. Beyond those, "i" is the modifier which makes your search case insensitive; it can be given before or after the starting position.

    data _null_;
    x="akjs@askIj@asdkf@a";
    z=find(x,"@A");
    m=find(x,"@A","i",7);
    n=find(x,"i",7,"i");
    put m;
    put z;
    put n;
    run;

Explore Yourself: What does the modifier "t" do in the function "find"?

tranwrd

This function is used to replace occurrences of a substring in the larger input string. In the example given below we replace all hyphens with "/". The first argument is the string in which we want to make the replacements, the second argument is what we want to replace, and the third is what we want to replace it with.

    data _null_;
    address="1203-Some Tower-powai/Mumbai";
    proper_add=tranwrd(address,"-","/");
    put proper_add;
    run;

Here is an exercise. Run the code given below to create the dataset:

    data Add;
    length address $40;
    input address $;
    cards;
    1604-some-chandiwali,Mumbai
    12-a/Delhi
    First-Street,Chennai
    ;
    run;

Once that is done, create a column in the dataset which contains the city names extracted from these addresses. Do that using whatever functions you think are appropriate.

Exercise Solution:

    data add(drop=a1 a2 z);
    set add;
    /* turn every "-" and "/" into "," so that only one delimiter remains */
    a1=tranwrd(address,"-",",");
    a2=tranwrd(a1,"/",",");
    /* search backwards from the end to find the last comma */
    z=find(a2,",",-length(a2));
    /* the city is everything after the last comma */
    city=substr(a2,z+1);
    run;

Utility Functions and Procedures

In addition to numeric and string functions, SAS has many utility functions and procedures which let us do tasks other than simply extracting or transforming numeric or categorical variables.

Input

This function reads a value using a specified informat while creating a new variable. Remember that it cannot be used to change the format of an existing variable.

    data temp;
    x="12/01/2013";
    run;

    /* In the data set temp above, x is essentially a string, as can be
       confirmed by looking at its type. Now we can apply a date informat
       to it to create another variable which contains the same value
       but stored as a SAS date */
    data temp;
    set temp;
    format y mmddyy10.;
    y=input(x,ddmmyy10.);
    put y;
    run;

Many a time a variable which is supposed to be numeric comes out as character when the data is imported, because of the presence of some character values. We can use the input function to convert such a variable into a numeric one by applying the informat "8.". Let's see an example:

    data temp;
    input some $;
    cards;
    10
    20
    30
    a
    b
    12
    13
    14
    ;
    run;

If you look at the type of the variable "some" in the data temp, it is character. Let's convert it to a numeric variable.

    data temp;
    set temp;
    some_num=input(some,8.);
    run;
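When a conversion like the one above meets values such as "a" or "b", SAS writes invalid-data messages to the log for each of them. A possible variation, sketched here on a small made-up dataset temp_chr, is the ?? modifier of the input function, which suppresses those messages while still returning a missing value.

```sas
/* Hypothetical character column containing one non-numeric value */
data temp_chr;
  input some $;
  cards;
10
a
12
;
run;

data temp_num;
  set temp_chr;
  /* ?? suppresses the invalid-data messages in the log;
     "a" still becomes a missing value in some_num */
  some_num = input(some, ?? 8.);
run;
```

Whether to silence the messages is a judgment call: they are useful while exploring new data, and mostly noise once the bad values are known.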

smallest, largest

The functions min and max always give the smallest and largest values respectively; however, at times we might need the nth largest or smallest value among many. For that we can use the smallest or largest functions. The first argument to these functions is the value of "n". The example below gets the 3rd smallest and 3rd largest values from the list, respectively.

    data _null_;
    x=smallest(3,23,1,4,-5,7,0,10);
    y=largest(3,23,1,4,-5,7,0,10);
    put x;
    put y;
    run;

Lag

Since SAS by default processes data row by row, there is no direct way to access previous observations in a data step. To do so we have to use the lag function, which is designed specifically for this:

    data temp;
    input A $ B C;
    cards;
    truck 10 1
    truck 20 2
    truck 30 3
    car 40 4
    car 50 5
    car 60 6
    ;
    run;

    data temp;
    set temp;
    D=lag(B);
    run;

You can see that the new variable "D" simply takes the previous values of variable "B"; in other words, it is equivalent to column "B" with one lag. You can apply lags of more than one row too, by using the function lagn. Following is an example with lag3.

    data temp;
    set temp;
    D=lag3(B);
    run;

However, this gets tricky if you use the lag function inside a condition. In that case the lag function returns only values it has seen on earlier executions within that condition block, not necessarily the value from the previous row. Here is an example. Try to understand it, and if it doesn't make sense ask for a detailed explanation in class:

    proc sort data=temp;
    by A;
    run;

    data temp;
    set temp;
    by A;
    new_var=first.A;
    if first.A then D=lag(B);
    else D=lag(C);
    run;
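One common way to sidestep the conditional-lag surprise is to call lag on every row and apply the condition afterwards. The sketch below assumes that approach; the dataset vehicles and the variable prev_B are made up for the illustration.

```sas
/* Hypothetical data, already sorted by A */
data vehicles;
  input A $ B;
  cards;
car 40
car 50
car 60
truck 10
truck 20
truck 30
;
run;

data vehicles_lag;
  set vehicles;
  by A;
  /* lag runs on every row, so it always returns the previous row's B */
  prev_B = lag(B);
  /* the condition is applied afterwards: no previous value within a new group */
  if first.A then prev_B = .;
run;
```

Here prev_B is missing on the first row of each group and holds the previous row's B everywhere else.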

Round

The round function is used to round off numeric values. The first argument is the value being rounded off and the second argument is the rounding unit.

    data _null_;
    x=123.45567;
    y=round(x);
    z=round(x,0.001);
    put z;
    put y;
    run;

In the above example, the second input is .001, which means x will be rounded off to the 3rd digit after the decimal point. You can think of the process like this: first x is divided by .001, the result is rounded off to the nearest integer, and that is then multiplied by .001.

So x/.001 = 123455.67; rounded off to the nearest integer this becomes 123456, which multiplied by .001 becomes 123.456.

Let's take a few more examples:

    data _null_;
    x=123.45567;
    y=round(x,0.1);
    z=round(x,100);
    m=round(x,10);
    put m;
    put z;
    put y;
    run;

Consider m=round(x,10): first x gets divided by 10, which gives 12.345567; this gets rounded off to the nearest integer, which is 12; then it gets multiplied by 10 and becomes 120, which is the final value of m.

Explore Yourself: Work through the same process for y and z and see whether the final values match your calculations.

Proc Rank

Proc rank is used to make bins in your data, based on a numeric variable of your choice. For example, in the data set sashelp.cars we want to make bins by the variable invoice. What happens is that the data is ordered by invoice and then, starting from the top, an equal number of observations is put into each bin.

    proc rank data=sashelp.cars out=car_rank groups=10;
    var invoice;
    ranks basket;
    run;

groups=10 tells proc rank that there are going to be 10 bins/groups in the data. "ranks basket" names the variable containing the group/bin number "basket". Bin numbering starts at 0.
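One way to check what the binning did is to summarise the ranked variable within each bin. The sketch below repeats the ranking into a separately named dataset, car_rank_check (a name chosen only for this illustration), and then uses proc means with a class statement.

```sas
proc rank data=sashelp.cars out=car_rank_check groups=10;
  var invoice;
  ranks basket;
run;

/* Count of cars and the range of invoice values in each bin */
proc means data=car_rank_check n min max;
  class basket;
  var invoice;
run;
```

The bins should come out at roughly equal sizes, with the invoice ranges increasing from bin 0 upwards; tied values can make the counts differ slightly.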

Proc transpose

This is used to convert your data from long to wide or from wide to long, as discussed before. Let's create the same data which we showed there:

    data long1;
    input famid year faminc;
    cards;
    1 96 40000
    1 97 40500
    1 98 41000
    2 96 45000
    2 97 45400
    2 98 45800
    3 96 75000
    3 98 77000
    ;
    run;

The following program uses proc transpose to convert the long format data into wide:

    proc transpose data=long1 out=wide1 prefix=year_;
    by famid;
    id year;
    var faminc;
    run;

by statement: makes one row for each unique value of the variable specified in the by statement

id statement: makes one column for each unique value of the variable specified in the id statement

var statement: fills the cells of the transposed dataset with the values of the variable specified in the var statement. If some cells have no corresponding value in the incoming dataset they are assigned missing values, such as the cell corresponding to year 97 and famid 3 in the above example.

The next question that might be bothering you is what happens if there is more than one variable to be filled in. You simply get multiple rows for each value of the variable in the "by" statement. For instance, in the example given below you get 2 rows for each famid.

    data long2;
    input famid year faminc spend;
    cards;
    1 96 40000 38000
    1 97 40500 39000
    1 98 41000 40000
    2 96 45000 42000
    2 97 45400 43000
    2 98 45800 44000
    3 96 75000 70000
    3 97 76000 71000
    3 98 77000 72000
    ;
    run;

    proc transpose data=long2 out=wides;
    by famid;
    id year;
    var faminc spend;
    run;

Proc  Format  

Proc  format  is  used  to  create  user  defined  format.  This  does  not  require  any  input  from  a  dataset  and  create  format  can  be  applied  on  any  variable  in  any  dataset.  Here  is  an  example  given  below.  Also  it  does  not  change  underlying  format  of  the  variable,  it  only  changes  how  it  is  displayed.  

proc  format;  value  $jc  'one'='Management'                        'two'='Trainees';  value  Grade  0-­‐32="F"                          33-­‐45="C"                          46-­‐58="B"                          60-­‐100="A";  run;  

"value"  statement  here  is  the  one  which  essentially  creates  the  format  for  you.  If  this  format  is  going  to  be  *applied  on  on  character  values  then  the  format  name  starts  with  a  "$"  sign  otherwise  the  name  starts  as  usual.  Naming  constraints  for  formats  is  same  as  variable  names.  in  the  value  statement  given  above  we  created  format  $jc,  if  we  apply  it  on  a  categorical  variable  and  the  value  is  "Management"  then  displayed  value  will  be  'one'  and  'two'  if  the  value  is  "Trainees".  If  the  value  does  not  match  with  either  of  the  "Management"  or  "Trainee"  then  value  will  displayed  as  is.  

For the numeric format Grade, if the value of the numeric variable it is applied to lies in the range 0-32, "F" is displayed, and similarly for the other ranges. A value that matches none of the given ranges (59, for example, falls between 46-58 and 60-100) is printed as a plain number; because the default width of this format is a single character, a two-digit number does not fit and a * is displayed in its place. Let's see an example of these formats being applied to the dataset temp. To emphasize that the underlying values don't change, I have also created a new numeric variable from marks in the same data step.


data temp;
    input jobs $ marks;
    cards;
one 10
two 75
one 34
two 59
abc 79
one 49
one 56
two 90
abc 20
;
run;

data temp;
    set temp;
    format jobs $jc.;
    format marks grade.;
    marks2=marks/2;
run;
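If you would like unmatched values to get a label of their own rather than fall through, the value statement accepts the keyword other. The sketch below assumes the temp dataset and the Grade ranges from the example above; the format name gradeall, the label "Unmatched" and the dataset temp_labeled are made up for illustration. The put function is used here to store the displayed label in a new character variable, which again leaves marks itself untouched.

```sas
proc format;
    value gradeall
        0-32   = "F"
        33-45  = "C"
        46-58  = "B"
        60-100 = "A"
        other  = "Unmatched";   /* catches anything outside the ranges, e.g. 59 */
run;

data temp_labeled;
    set temp;
    /* put() returns the formatted text; marks keeps its numeric value */
    grade_label = put(marks, gradeall.);
run;

proc print data=temp_labeled;
run;
```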

Proc  SQL  

This is the implementation of the SQL language within SAS. All of the tasks which we'll see here can also be achieved with what we have learned so far. SQL queries are, however, at times easier to read and write. Be careful when using them on large datasets: they might not be as fast as their data step counterparts.

You will see that SQL queries read much like English. They are mostly used to subset, summarize and pre-process the data. There are no predictive modeling procedures in the SQL framework.

We'll see that all our SQL queries are just select statements. These select statements have incremental capabilities, which we'll see starting with the simplest form, where you select all the observations from the incoming dataset. All SQL queries go in a block starting with "proc sql" and closed with "quit". The result of the selection is displayed in the result window. If we want to put the result of the selection in a dataset, we can simply add "create table table_name as" in front of the select statement. Let's see some examples.

proc sql;
    select * from sashelp.cars;
quit;

All  observations  from  sashelp.cars  are  displayed  in  result  window.  

proc sql;
    create table lalit as
    select * from sashelp.cars;
quit;


This time nothing is displayed in the result window; instead, a table named "lalit" is created in the work library [you can supply a libref for it to be created in some other location] with all the observations. From here onwards we'll not use create table; whenever you want a table, simply add that part in front of the select statement.

If you do not want to select all columns of the data, you can restrict the selection by listing the variable names separated by commas.

proc sql;
    select name,nhits from sashelp.baseball;
quit;

This controls the number of variables/columns which you select from the dataset. Now what if I want to restrict the number of observations? There are many ways to do it.

proc sql inobs=10;
    select name from sashelp.baseball;
    select make from sashelp.cars;
quit;

Using inobs/outobs on the proc sql statement restricts the number of incoming/outgoing observations for all the select statements in that block. If we want to restrict the number of observations separately for each select statement, we can do the following.

proc sql;
    select name from sashelp.baseball(obs=10);
    select make from sashelp.cars(obs=20);
quit;

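There is also an option called outobs. Outobs specifies the number of observations which go out, i.e. are returned by the query. In the current example it works the same as inobs, but the two behave differently once the query processes the data: inobs limits the rows read from each source table before any processing, whereas outobs limits the rows returned after processing.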

proc sql outobs=10;
    select name from sashelp.baseball;
quit;
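To see where inobs and outobs part ways, it helps to run a query that summarizes the data. The following is a rough sketch rather than a definitive demonstration; it borrows "group by" and mean(), which are covered further below, and the column name msrp_avg is only illustrative.

```sas
/* inobs=10: only the first 10 rows of sashelp.cars are read,
   so the averages are computed from those 10 rows alone */
proc sql inobs=10;
    select origin, mean(msrp) as msrp_avg
    from sashelp.cars
    group by origin;
quit;

/* outobs=2: every row is read and summarized,
   but only the first 2 rows of the result are returned */
proc sql outobs=2;
    select origin, mean(msrp) as msrp_avg
    from sashelp.cars
    group by origin;
quit;
```

Comparing the averages from the two queries shows that inobs changes what is computed, while outobs only truncates what is shown.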

As we saw in the data step, just restricting the number of observations is not enough. We need some way to filter observations conditionally. We can achieve that by using "where" with the select statement as follows:

proc sql;
    select invoice,drivetrain
    from sashelp.cars
    where origin="Asia";
quit;

We can write multiple conditions as well, by combining them with the and/or operators.

proc sql;
    create table temp as
    select invoice,origin,drivetrain,type,mpg_city
    from sashelp.cars
    where origin="USA" and type="Sedan" and mpg_city>15;
quit;

Remember that you don't necessarily need to select the variable on which you apply the condition. The next requirement is to sort the data; for that we add "order by" to our select statement.

proc sql;
    select invoice,origin
    from sashelp.cars
    order by invoice;
quit;

The default sort order is ascending. If you want to sort in descending order, you'll have to use the keyword desc as given below:

proc sql;
    select invoice,origin
    from sashelp.cars
    order by invoice desc;
quit;

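You can order by multiple variables as well. Note that desc applies only to the variable it follows, so the query below sorts by origin in ascending order and, within each origin, by msrp in descending order: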

proc sql;
    select origin,msrp
    from sashelp.cars
    order by origin,msrp desc;
quit;

Next is grouping, i.e. getting aggregated/summary statistics such as mean, std etc., which are defined for a group of values rather than for an individual observation.

proc sql;
    select origin,drivetrain,mean(msrp) as msrp_avg
    from sashelp.cars
    group by origin,drivetrain;
quit;

Here the summary operation [such as calculating the mean in the above example] is carried out on the groups created by "group by". Here are a few more examples, one of which includes order by as well.

proc sql;
    select origin, std(msrp) as price_std
    from sashelp.cars
    group by origin;
quit;

proc sql;
    select make, std(msrp) as price_var
    from sashelp.cars
    group by make
    order by price_var;
quit;

Now suppose we want to put a condition on the new variable which is created [price_var]. Let's see if a simple where condition works:

proc sql;
    select make, std(msrp) as price_var
    from sashelp.cars
    where price_var>10000
    group by make
    order by price_var;
quit;

The above code throws an error:

ERROR:  The  following  columns  were  not  found  in  the  contributing  tables:  price_var.  

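To apply conditions on summary variables created in an SQL query we need to use "having". A where condition is evaluated on the rows of the incoming table before grouping, so it cannot see price_var; having is evaluated after the groups have been summarized.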

proc sql;
    select make, std(msrp) as price_var
    from sashelp.cars
    group by make
    having(price_var>10000)
    order by price_var;
quit;
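Where and having can also appear in the same query, each doing its own job. The sketch below is only an illustration built on the example above; the choice of origin="Asia" as the row filter is an arbitrary assumption.

```sas
proc sql;
    select make, std(msrp) as price_var
    from sashelp.cars
    where origin="Asia"          /* row filter: applied before grouping */
    group by make
    having price_var>10000       /* group filter: applied to the summarized values */
    order by price_var desc;
quit;
```

Only cars with origin "Asia" enter the calculation, and only those makes whose price_var exceeds 10000 are kept in the output.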

The sequence in which you should write the clauses is: where > group by > having > order by. Next we'll see how to get data from multiple tables.

libname  dp  "/folders/myfolders/Datasets/Data  Prep";  

The key is to give names [aliases] to the tables, which can then be used to reference a table while extracting columns from it. We'll try to solve the following case, which involves getting data from multiple tables.

Case: datasets gaming1, gaming2 and gaming3 contain information on customers of a gaming company which provides an online platform for playing team games such as AOE, DOTA and CS. We want the ids of those customers who play DOTA on Mac OS in solo sessions with a free license type and whose average time per session is more than 40 minutes.

Let's first list what information is stored where:

gaming1 = gamer_id, game name, atps
gaming2 = gamer_id, os, license
gaming3 = gamer_id, session_type, netspeed

We'll give names to the tables in the select statement itself. I have written the following select statement across multiple lines for better readability.

proc sql;
    select a.gamer_id
    from dp.gaming1 as a, dp.gaming2 as b, dp.gaming3 as c
    where b.os="mac" and a._game_name="dota" and
          a.atps>40 and c.session_type="solo" and b.license="free" and
          a.gamer_id=b.gamer_id and a.gamer_id=c.gamer_id;
quit;

The part "a.gamer_id=b.gamer_id and a.gamer_id=c.gamer_id" is a must for setting up the correspondence between observations of the multiple tables. If you leave it out, you'll get a cross product of the observations [every row of one table paired with every row of the other], as shown below:

data s1;
    input id a $;
    cards;
1 q
2 a
3 z
;
run;

data s2;
    input id b $;
    cards;
1 p
2 l
3 m
;
run;

proc sql;
    select a,b from s1,s2;
quit;

Now if we add the where condition that sets up the correspondence, we get the desired result.

proc sql;
    select a,b
    from s1,s2
    where s1.id=s2.id;
quit;

Explore Yourself:
* How to join/merge tables using SQL
* What distinct and count do when used in SQL queries

We'll conclude here. In case of any doubts regarding the content of this study material, please post on the QA forum in the LMS.


Prepared  By:  Lalit  Sachan  

Contact:  [email protected]