Transcript
ISQS 6347, Data & Text Mining 1
ISQS 3358, Business Intelligence
Data Preparation for AnalyticsUsing SAS
Zhangxi LinTexas Tech University
ISQS 6347, Data & Text Mining 2
Outline
Exercise 9 An overview of data preparation for
analytics SAS Programming Essentials
Running SAS programs Mastering fundamental concepts SAS program debugging
Exercise 10
ISQS 6347, Data & Text Mining 3
Exercise 9: Creating Summarized Output
Chapter 5 exercises: p A-15 to A-201. Producing summary statistics2. Creating a custom format3. Producing a summary table4. Optional: Displaying percentage in a table5. Producing a bar chart
Deliverables: Screenshot of the Project Designer Screenshots of the above html outputs (not
necessary to show the complete output)
ISQS 6347, Data & Text Mining 4
Structure and Components of Business Intelligence
ISQS 6347, Data & Text Mining 5
Overview: From Data Warehousing to Data Analysis
Previous major topics in data warehousing (using SQL Server 2005) Dimensional model design ETL Cubes design and OLAP
Data analysis topics (using SAS) Data preparation
Analytic business questions Data format and data conversion
Data quality Data exploratory Data analysis
ISQS 6347, Data & Text Mining 6
US Car Theft
The number of U.S. motor vehicle thefts decreased by 1.9 percent from 2003 to 2004, the first decrease since 1999. In 2004, the value of stolen motor vehicles was $7.6 billion, down from $8.6 billion in 2003. The average value of a motor vehicle reported stolen in 2004 was $6,143, compared with $6,797 in 2003.
ISQS 6347, Data & Text Mining 7
2004 Theft Statistics Every 26 seconds, a motor vehicle is stolen in the United
States. The odds of a vehicle being stolen were 1 in 190 in 2003. The odds are highest in urban areas.
U.S. motor vehicle thefts fell 1.9 percent in 2004 from 2003, according to the FBI's Uniform Crime Reports. In 2004, 1,237,114 motor vehicles were reported stolen.
The West was the only region with an increase in motor vehicle thefts from 2003 to 2004, up 3.2 percent. Thefts fell 9.7 percent in the Northeast, 4.4 percent in the Midwest and 2.9 percent in the South.
Nationwide, the 2004 motor vehicle theft rate per 100,000 people was 421.3, down 2.9 percent from 433.7 in 2003.
Only 13.0 percent of thefts were cleared by arrests in 2004. Carjackings occur most frequently in urban areas. They
account for only 3.0 percent of all motor vehicle thefts. The average comprehensive insurance premium in the U.S.
rose 11.2 percent from 1999 to 2003
ISQS 6347, Data & Text Mining 8
Business Question
If the number of used Honda Accord thefts is ranked the top in auto theft, should the premium of insurance for Honda Accord be high enough than other brand of cars? Should the insurance for a user Honda higher than a brand new Honda?
Why?
ISQS 6347, Data & Text Mining 9
Other Analytic Business Questions How do factors such as time, branch, promotion, and price
influence the sale of a soft drink? Which customers have a high cancellation risk in the next
month? How can customers be segmented based on their purchase
behavior? Statistics showed that an online recommendation system may
increase the sale 20%, and the accuracy rate of the system is 40%. A newer algorithm can increase the accuracy rate to 50%. Should the sale be promoted to 20%*125% = 25%?
The airline companies are considering allowing seats over-booked because certain percentage of customers will cancel their flight at the last minute. If the average cancellation rate is 10%, should the over-booking rate be 10% as well? If a cancellation is charged 5% of the fare and how much should the penalty for sold-out situation with over-booking?
ISQS 6347, Data & Text Mining 10
Analysis Process Selecting an analysis method Identify data source Prepare the data (collecting, cleansing, reorganizing,
extracting transforming, loading) Execute the analysis Interpret the analysis Automate data preparation and execution of analysis, if
the business question has to be answered more than once
ETL Stored procedures
The above steps can also be iterated, not necessarily performed in sequential order
We focus on the data preparation step
ISQS 6347, Data & Text Mining 11
Characteristics of Analytic Business Questions
Analysis complexity: real analysis or reporting Analysis paradigm: statistics or data mining Data preparation paradigm: as much as data as
possible or business knowledge first Analysis method: supervised or unsupervised
analysis Scoring needed – yes/no Periodicity of analysis: one-shot or re-run Historic data needed, yes/no Data structure: one row or multiple rows per
subject Complexity of the analysis team
ISQS 6347, Data & Text Mining 12
Components of the SAS System
ReportingAnd
Graphics
Data AccessAnd
ManagementUser
Interface
Analytical Base SAS ApplicationDevelopment
VisualizationAnd Discovery
BusinessSolutions
WebEnablement
ISQS 6347, Data & Text Mining 13
SAS Programming Essentials
Find more information from http://support.sas.com
ISQS 6347, Data & Text Mining 14
Data-driven Tasks
The functionality of the SAS System is built around four data-driven tasks common to virtually any applications Data access Data management Data analysis Data presentation
ISQS 6347, Data & Text Mining 15
Turning Data into Information
Process of delivery meaningful information 80% data-related
Access Scrub Transform Mange Store and retrieve
20% analysis
ISQS 6347, Data & Text Mining 16
SAS Programs
Raw Data
SAS DataSet
SAS DataSet
DATAStep
PROCStep
Report
DATA steps are typically used to create SAS data sets
PROC steps are typically used to process SAS data sets
A SAS program is a sequence of steps that the users submits for execution
ISQS 6347, Data & Text Mining 17
Preparation of SAS Programming
Data sets: \SAS-Programming Create a user defined library reference
StatementLIBNAME libref ‘SAS-data-library’ <options>;
Example LIBNAME ia ‘c:\workshop\winsas\prog1’;
Two-levels of SAS files namesLibref.fielname
ISQS 6347, Data & Text Mining 18
SAS Programming Essentials
Demon: c02s2d1 Exercise: c02ex1
ISQS 6347, Data & Text Mining 19
SAS Data Sets
SAS data sets have a description portion and a data portion. The description portion contains the
general information about the SAS data set and variable attributes
The CONTENTS procedure displays the descriptor portion of a SAS data set
PROC CONTENTS DATA=SAS-data-set;RUN;
ISQS 6347, Data & Text Mining 20
SAS Data Sets: Data Portion
The PRINT procedure displays the data portion of a SAS data set
PROC PRINT DATA=SAS-data-set;RUN;
ISQS 6347, Data & Text Mining 21
SAS Data Set Terminology
SAS data set = SAS table Variable = Column Observation = Row
ISQS 6347, Data & Text Mining 22
SAS Comments
/* begins a comment */ ends a comment
ISQS 6347, Data & Text Mining 23
SAS Program Debugging
Demon: c02s4d1 Exercise: c02ex7
ISQS 6347, Data & Text Mining 24
Outline of SAS Programming – Level I
Fundamentals (ch1-3) Producing list reports (ch4) Enhancing output (ch5) Creating data sets (ch6) Data step programming (ch7)
Reading data Creating variables Conditional processing Keeping and dropping variables Reading Excel files
Combining SAS data sets (ch8) Producing summary reports (ch9) SAS graphing (ch10)
ISQS 6347, Data & Text Mining 25
Making Use of SAS Enterprise Guide Code
Import a text file Example: Orders.txt
Import an Excel file Example: SupplyInfo.xls
ISQS 6347, Data & Text Mining 26
Learn from Examples
SAS Help Contents -> Learning to use SAS ->
Sample SAS Programs -> Base SAS “Base Usage Guide Examples”
Chapter 3, 4
ISQS 6347, Data & Text Mining 27
Import an Excel Sheet
proc import out=work.commrexdatafile ="C:\Lin\Shared\ISQS6339\Commrex_3358.xls" dbms=excel replace;
sheet="Company";getnames=yes;mixed=no;scantext=yes;usedate=yes;scantime=yes;run;proc print data=work.commrex;run;
ISQS 6347, Data & Text Mining 28
Excel SAS/ACCESS LIBNAME Engine
libname xlsdata 'C:\Lin\Shared\ISQS6339\Commrex_3358.xls';
proc print data=xlsdata.New1;run;
ISQS 6347, Data & Text Mining 29
Exercise 10: SAS Data Step Programming
Import and print text file Userlog00.txt Import and print Excel sheet customers in
Afs_customers.xls Send the Word file containing the
Screenshots of two outputs to zhangxi.lin@hotmail.com
top related