1 Business System Analysis & Decision Making Data Mining and Web Mining Zhangxi Lin ISQS 5340 Summer II 2006
Page 1: Cases: Banking credit record

1

Business System Analysis & Decision Making – Data Mining and Web Mining

Zhangxi Lin

ISQS 5340

Summer II 2006

Page 2: Cases: Banking credit record

2

Outline

Introduction to data mining & text mining

Constructing a decision tree using SAS Enterprise Miner

Web mining

Page 3: Cases: Banking credit record

3

Data Mining and Text Mining

Page 4: Cases: Banking credit record

4

Review - Decision Tree (1)

Root node
  Total: 10   Accept: 4   Reject: 6
  Accuracy: 40%   Coverage: 100%

Split on Gender

  Female
    Total: 5   Accept: 3   Reject: 2
    Accuracy: 60%   Coverage: 75%

    Split on Credit Card Insurance

      Yes
        Total: 2   Accept: 2   Reject: 0
        Accuracy: 100%   Coverage: 50%

      No
        Total: 3   Accept: 1   Reject: 2
        Accuracy: 33.3%   Coverage: 25%

  Male
    Total: 5   Accept: 1   Reject: 4
    Accuracy: 20%   Coverage: 25%

Page 5: Cases: Banking credit record

5

Review - Decision Tree (2)

Root node
  Total: 10   Accept: 4   Reject: 6
  Accuracy: 40%   Coverage: 100%

Split on Gender

  Female
    Total: 4   Accept: 3   Reject: 1
    Accuracy: 75%   Coverage: 75%

    Split on Credit Card Insurance

      Yes
        Total: 2   Accept: 2   Reject: 0
        Accuracy: 100%   Coverage: 50%

      No
        Total: 2   Accept: 1   Reject: 1
        Accuracy: 50%   Coverage: 25%

  Male
    Total: 6   Accept: 1   Reject: 5
    Accuracy: 16.7%   Coverage: 25%

How does this decision tree differ from the previous one?

Page 6: Cases: Banking credit record

6

Confusion Matrix (Rule: “Gender=Female”)

                   Actual Accept   Actual Reject   Total
Computed Accept          3               2           5
Computed Reject          1               4           5

Accuracy = 3 / (2 + 3) = 0.6

Coverage = 3 / (3 + 1) = 0.75
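The accuracy and coverage figures for the rule "Gender = Female" can be recomputed from the four cell counts of the confusion matrix. The sketch below is only an illustration: the function and variable names are assumptions, and the two measures as defined on this slide are what other texts often call precision and recall.

```python
# Cell counts for the rule "Gender = Female", read from the slide's matrix.
true_accept = 3    # computed Accept, actual Accept
false_accept = 2   # computed Accept, actual Reject
missed_accept = 1  # computed Reject, actual Accept
true_reject = 4    # computed Reject, actual Reject


def rule_accuracy(true_accept: int, false_accept: int) -> float:
    """Share of the records the rule accepts that are actual Accepts."""
    return true_accept / (true_accept + false_accept)


def rule_coverage(true_accept: int, missed_accept: int) -> float:
    """Share of all actual Accepts that the rule captures."""
    return true_accept / (true_accept + missed_accept)


accuracy = rule_accuracy(true_accept, false_accept)
coverage = rule_coverage(true_accept, missed_accept)
print(f"accuracy={accuracy:.2f} coverage={coverage:.2f}")
```

Calling the same two functions with the counts of another rule (for example 3, 1 and 3, 1) gives the figures for that rule.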

Page 7: Cases: Banking credit record

7

Confusion Matrix (Rule: “Credit Promotion = Yes”)

                   Actual Accept   Actual Reject   Total
Computed Accept          3               1           4
Computed Reject          1               5           6

Accuracy = 3 / (1 + 3) = 0.75

Coverage = 3 / (3 + 1) = 0.75

Page 8: Cases: Banking credit record

8

Generalizing data analysis ideas

Question: How can we find useful rules in the large amount of data generated by business operations?

Answer: Applying data mining techniques/tools

Page 9: Cases: Banking credit record

9

What is Data Mining? (See Wikipedia data mining)

Many definitions:

- Non-trivial extraction of implicit, previously unknown, and potentially useful information from data

- Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Page 10: Cases: Banking credit record

10

Origins of Data Mining

Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems

Traditional techniques may be unsuitable due to:
- Enormity of data
- High dimensionality of data
- Heterogeneous, distributed nature of data

[Diagram: Data Mining at the overlap of Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]

Page 11: Cases: Banking credit record

11

Why Mine Data? Commercial Viewpoint

Lots of data is being collected and warehoused:
- Web data, e-commerce
- Purchases at department/grocery stores
- Bank/credit card transactions

Computers have become cheaper and more powerful

Competitive pressure is strong:
- Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Page 12: Cases: Banking credit record

12

Why Mine Data? Scientific Viewpoint

Data collected and stored at enormous speeds (GB/hour):
- remote sensors on a satellite
- telescopes scanning the skies
- microarrays generating gene expression data
- scientific simulations generating terabytes of data

Traditional techniques infeasible for raw data

Data mining may help scientists:
- in classifying and segmenting data
- in hypothesis formation

Page 13: Cases: Banking credit record

13

Data Mining Tasks

Prediction Methods
- Use some variables to predict unknown or future values of other variables.

Description Methods
- Find human-interpretable patterns that describe the data.

From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996

Page 14: Cases: Banking credit record

14

Data Mining Tasks...

Classification [Predictive]

Clustering [Descriptive]

Association Rule Discovery [Descriptive]

Sequential Pattern Discovery [Descriptive]

Regression [Predictive]

Deviation Detection [Predictive]

Page 15: Cases: Banking credit record

15

What Text Mining Is (See Wikipedia text mining)

Text mining is a process that employs a set of algorithms for converting unstructured text into structured data objects and the quantitative methods used to analyze these data objects.

“SAS defines text mining as the process of investigating a large collection of free-form documents in order to discover and use the knowledge that exists in the collection as a whole.” (SAS Text Miner: Distilling Textual Data for Competitive Business Advantage)

Page 16: Cases: Banking credit record

16

A simple text mining example

A tiny case - 9 documents (Fin = finance, Riv = river, Par = parade)

1. deposit the cash and check in the bank - Fin
2. the river boat is on the bank - Riv
3. borrow based on credit - Fin
4. river boat floats up the river - Riv
5. boat is by the dock near the bank - Riv
6. with credit, I can borrow cash from the bank - Fin
7. boat floats by dock near the river bank - Riv
8. check the parade route to see the floats - Par
9. along the parade route - Par
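To make the idea of turning text into structured data concrete, the nine documents above can be converted into per-document term counts. This is a minimal sketch using only the Python standard library, not the procedure SAS Text Miner applies; the tokenizer and the helper name `categories_containing` are assumptions made for illustration.

```python
from collections import Counter

# The nine documents and their category labels from the slide.
documents = [
    ("deposit the cash and check in the bank", "Fin"),
    ("the river boat is on the bank", "Riv"),
    ("borrow based on credit", "Fin"),
    ("river boat floats up the river", "Riv"),
    ("boat is by the dock near the bank", "Riv"),
    ("with credit, I can borrow cash from the bank", "Fin"),
    ("boat floats by dock near the river bank", "Riv"),
    ("check the parade route to see the floats", "Par"),
    ("along the parade route", "Par"),
]


def tokenize(text):
    """Lower-case the text, drop commas, and split on whitespace."""
    return text.lower().replace(",", " ").split()


# One term-count vector per document (a row of a term-document matrix).
term_counts = [Counter(tokenize(text)) for text, _ in documents]

# Document frequency: in how many documents each term occurs.
doc_freq = Counter()
for counts in term_counts:
    doc_freq.update(counts.keys())


def categories_containing(word):
    """Count, per category label, the documents that contain the word."""
    return Counter(
        label
        for (_, label), counts in zip(documents, term_counts)
        if word in counts
    )


for word in ("bank", "floats", "credit"):
    print(word, doc_freq[word], dict(categories_containing(word)))
```

Words such as "bank" and "floats" occur in more than one category, while "credit" occurs only in the finance documents, which is the kind of difference a clustering or classification step can exploit.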

Page 17: Cases: Banking credit record

17

Text Mining Strengths

- Clustering documents in a corpus
- Investigating word (token) distribution across documents within a corpus
- Identifying words with the highest discriminatory power
- Classifying documents into predefined categories
- Integrating text data with structured data to enrich predictive modeling endeavors

Page 18: Cases: Banking credit record

18

Text Mining Deficiencies

Text mining algorithms perform poorly in distinguishing negations, for example:
- Herman was involved in a motor vehicle accident.
- Herman was NOT involved in a motor vehicle accident.

Text mining cannot generally make value judgments, for example, classifying an article as positive or negative with respect to any tokens it contains.

Text mining algorithms do not work well with large documents:
- Performance is slow.
- Increased term occurrence across documents decreases separation of documents.
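The negation problem is easy to see once the two example sentences are reduced to sets of tokens, which is roughly what a bag-of-words representation does. The sketch below is an illustration under that assumption; the `tokens` helper and the use of Jaccard similarity are choices made here, not something the slide prescribes.

```python
def tokens(sentence):
    """Lower-case the sentence, drop a trailing period, and split into a token set."""
    return set(sentence.lower().rstrip(".").split())


plain = tokens("Herman was involved in a motor vehicle accident.")
negated = tokens("Herman was NOT involved in a motor vehicle accident")

# Jaccard similarity: shared tokens divided by all distinct tokens.
jaccard = len(plain & negated) / len(plain | negated)

print("tokens only in the negated sentence:", negated - plain)
print(f"Jaccard similarity: {jaccard:.2f}")
```

The two sentences differ by a single token, so a method that compares token sets sees them as nearly identical even though their meanings are opposite.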

Page 19: Cases: Banking credit record

19

Using Data Mining Tools

Statistical Analysis System (http://www.sas.org) “SAS®9 is the most recent release of SAS. It delivers analytical, data manipulation and reporting capabilities within a completely new framework.”

SPSS (http://www.spss.com) “SPSS customers include telecommunications, banking, finance, insurance, healthcare, manufacturing, retail, consumer packaged goods, higher education, government, and market research. ”

Weka, an open source software product (http://www.cs.waikato.ac.nz/ml/weka/ )

Microsoft SQL Server comes with major data mining utilities

There are more…

Page 20: Cases: Banking credit record

20

Using SAS Enterprise Miner to Construct a Decision Tree

Page 21: Cases: Banking credit record

21

SAS Enterprise Miner 4.3

Basics
- How to use the application main menu
- Using the pop-up menus
- Enterprise Miner documentation
- Project – Diagram

The SEMMA methodology
- Sample
- Explore
- Modify
- Model
- Assess

Page 22: Cases: Banking credit record

22

Exercise 5.0

Explore SAS and SAS Enterprise Miner

Page 23: Cases: Banking credit record

23

Decision Tree Example

Life Insurance Promotion Dataset (CreditProm)

Page 24: Cases: Banking credit record

24

Life Insurance Promotion Data

Income Range   Magazine Promo   Watch Promo   Life Ins Promo   Credit Card Ins.   Sex      Age
40-50,000      Yes              No            No               No                 Male     45
30-40,000      Yes              Yes           Yes              No                 Female   40
40-50,000      No               No            No               No                 Male     42
30-40,000      Yes              Yes           Yes              Yes                Male     43
50-60,000      Yes              No            Yes              No                 Female   38
20-30,000      No               No            No               No                 Female   55
30-40,000      Yes              No            Yes              Yes                Male     35
20-30,000      No               Yes           No               No                 Male     27
30-40,000      Yes              No            No               No                 Male     43
30-40,000      Yes              Yes           Yes              No                 Female   41
40-50,000      No               Yes           Yes              No                 Female   43
20-30,000      No               Yes           Yes              No                 Male     29
50-60,000      Yes              Yes           Yes              No                 Female   39
40-50,000      No               Yes           No               No                 Male     55
20-30,000      No               No            Yes              Yes                Female   19

Page 25: Cases: Banking credit record

25

Tree Algorithm: Find Best Split for Input X1 (Credit Prom)

Consider that the consumers in the life insurance promotion dataset have two attributes: credit card promotion, gender.

[Figure: training data plotted on x1; labels: 0.7, "Missing in left branch", "Missing in right branch", "Best Split x1"]

Page 26: Cases: Banking credit record

26

Tree Algorithm: Repeat for Other Inputs: X2 (Gender)

[Figure: training data plotted on x2; labels: 0.7, "Missing in left branch", "Missing in right branch", "Logworth", "Kass Adjusted"]
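Logworth is commonly defined as -log10 of the p-value of a chi-square test of the split against the target. The sketch below compares two candidate binary splits on the Life Insurance Promotion data under several assumptions: Life Ins Promo is taken as the target, Sex and Credit Card Ins. as the inputs, the counts are tallied by hand from the 15-row table, and no Kass adjustment is applied. It is not a reproduction of what SAS Enterprise Miner computes internally.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def logworth(table):
    """-log10(p-value) of the chi-square test with 1 degree of freedom."""
    chi2 = chi_square_2x2(table)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return -math.log10(p_value)


# Rows: branch of the split; columns: (Life Ins Promo = Yes, No).
sex_split = ((6, 1), (3, 5))          # Female, Male
credit_card_split = ((3, 0), (6, 6))  # Credit Card Ins. = Yes, No

for name, table in (("Sex", sex_split), ("Credit Card Ins.", credit_card_split)):
    print(f"{name}: chi2={chi_square_2x2(table):.3f} logworth={logworth(table):.3f}")
```

Under these assumptions the split with the larger logworth would be the one chosen at the root; with only 15 records the chi-square approximation is rough, so the numbers are illustrative only.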

Page 27: Cases: Banking credit record

27

Tree Algorithm: Compare Best Splits

[Figure: training data with the best split on x1 and the best split on x2 shown side by side; labels: 0.7, "Missing in left branch", "Missing in right branch"]

Page 28: Cases: Banking credit record

28

Tree Algorithm: Partition with Best Split

[Figure: training data plotted on x1 and x2, with the best split marked]

Page 29: Cases: Banking credit record

29

Tree Algorithm: Repeat within Partitions

[Figure: training data plotted on x1 and x2]

Page 30: Cases: Banking credit record

30

Tree Algorithm: Partition with Best Split

[Figure: training data plotted on x1 and x2]

Page 31: Cases: Banking credit record

31

Tree Algorithm: Construct Maximal Tree

[Figure: training data plotted on x1 and x2]

Page 32: Cases: Banking credit record

32

Overfitting

Overfitting: the tree is split too many times, so it fits the training data closely while its classification error rate on other data gets higher.

We use a training dataset to find the decision rules. These must be applicable to other datasets.

In order to test the validity of the rules, a test dataset is used.

By comparing the outcomes between these two datasets, we can identify any inconsistency and create a good decision tree.

Page 33: Cases: Banking credit record

33

Overfitting due to Insufficient Examples

Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region

- Insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task

Page 34: Cases: Banking credit record

34

How to Address Overfitting

Pre-Pruning (Early Stopping Rule)
- Stop the algorithm before it becomes a fully-grown tree

We typically use two datasets:
- Training dataset for growing the decision tree and obtaining rules
- Test dataset for testing whether the rules are good enough, judged by the error rate when the rules from the training dataset are applied to the test dataset

If there is no test dataset, the original dataset will be partitioned into two subsets for the above purpose.
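Partitioning one dataset into a training subset and a test subset can be sketched in a few lines. The 70/30 proportion, the fixed random seed, and the function name `partition` are assumptions chosen for illustration; the slide does not state a proportion, and SAS Enterprise Miner has its own node for this step.

```python
import random


def partition(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and cut it into (training, test) subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 15 rows of the Life Insurance Promotion dataset.
records = list(range(1, 16))
train, test = partition(records)
print(len(train), "training records,", len(test), "test records")
```

The tree is grown on `train` only; the error rate of its rules on `test` then indicates whether the tree has been split too far.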

Page 35: Cases: Banking credit record

35

Exercise 5

Download the Life Insurance Promotion dataset (CreditProm)

Import the data to SAS

Try out SAS Decision Tree modeling

Page 36: Cases: Banking credit record

36

SAS Data Mining Example

A German Bank’s Credit Data

Online SAS materials (PDF, 2.24MB):
- P70, dataset description
- P71, decision matrix

Page 37: Cases: Banking credit record

37

Web Mining

Page 38: Cases: Banking credit record

38

Case study: CarPort.com

CarPort.com is a fictitious Web site that is used to illustrate:
- components of Web site design and Web log analysis
- a services Web site.

Page 39: Cases: Banking credit record

39

CarPort.com

Visitor profile could be any of the following:
1. buyer looking for a car
2. seller looking to sell a used car
3. curious information seeker
4. competitor
5. robot or spider
6. lost Web surfer
7. SAS course developer.

Page 40: Cases: Banking credit record

40

CarPort.com

Services:
- car locator (want ads)
- car ownership information

Sources of revenue:
- banner ads
- used car ads
- partnership agreements (fee for referral)

Page 41: Cases: Banking credit record

41

How Did You Get Here?

- Followed a link from another site
- Clicked on a banner ad
- Did a Google search
- Saw an advertisement on television, or heard one on radio
- Received a direct mail solicitation
- Received a phone solicitation
- Heard the site mentioned or recommended on a news or specialty program, or read about it in the printed media

Page 42: Cases: Banking credit record

42

Links

Banner Ad = Link + Image

Title

URL

Images

Page 43: Cases: Banking credit record

43

Click on this link to find out more or e-mail the seller.

Link to dealer’s Web site.

Page 44: Cases: Banking credit record

44

Web Mining for Profitability

- Increase viewing, navigation, and transaction efficiency.
- Improve the customer experience.
- Add services and features that promote cross-selling and up-selling opportunities.
- Identify problem areas.
- Improve security.
- Attract more high quality customers.

Page 45: Cases: Banking credit record

45

Michael Berry’s Internet Business Taxonomy

Classification is based on an Internet company’s business model, which may include:
- selling things that get delivered in a truck
- selling things that get delivered through the ether
- selling eyes to advertisers
- connecting sellers and buyers
- empowering communities and collecting donations.

Page 46: Cases: Banking credit record

46

Some Business Questions

- Who is visiting my Web site?
- Who is buying my product(s)?
- Who are my repeat buyers?
- Which customers are churning?
- Which Web design produces the most purchases?
- What campaign strategies are most effective in increasing Web site visits?

Page 47: Cases: Banking credit record

47

More Questions

What factors influence product purchases?
• Time-of-day effects
• Gender, Age, Income, and so forth
• Latent factors: e-shopper, Web expert, and so forth

Which sales channels produce the most profitable customers?

Do any site-visit patterns correlate with outcomes that can be exploited for business advantage?

Page 48: Cases: Banking credit record

48

Web Log Fields

User’s IP address, also called
- Remote host name
- Client IP address

User name, also called
- Remote user log name (may be different)
- Authenticated user name

Date and time of request, with or without a UTC offset

Request type, also called “method”
- HTTP request with (CLF) or without (IIS) argument

Status: HTTP three digit status code

Number of bytes sent to client

continued...

Page 49: Cases: Banking credit record

49

Web Log Fields

- The URL path requested, if request type has no argument
- The port to which the request was served
- The name of the server
- The IP address of the server
- The time taken to serve the request
- Number of bytes in the request received from the client
- User agent, which is usually a text string with the name and version number of Web browser used by the client and the operating system of the client machine
- The domain name or IP address of the referring URL
- Query information in a text string
- Cookie information in a text string

Page 50: Cases: Banking credit record

50

Common Log Format

Value                   Example
Remote Host Name        111.22.333.44
Remote User Log Name    -
Username                IRVINE/terry
Date                    15/Apr/2000
Time and UTC Offset     11:28:14 -0700
Request Type            GET /index.html HTTP/1.1
Service Status Code     200
Bytes Sent              2792
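A Common Log Format entry can be split into these fields with a regular expression. In the sketch below the log line is assembled from the example values on the slide (the slide shows the fields separately, so the exact line is an assumption), and the pattern and group names are illustrative rather than taken from any particular tool.

```python
import re

# One named group per Common Log Format field.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^:]+):(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

# Assumed line built from the slide's example values.
line = (
    '111.22.333.44 - IRVINE/terry [15/Apr/2000:11:28:14 -0700] '
    '"GET /index.html HTTP/1.1" 200 2792'
)

match = CLF_PATTERN.match(line)
fields = match.groupdict()

# The request field itself holds the method, the path, and the protocol.
method, path, protocol = fields["request"].split()

for name, value in fields.items():
    print(f"{name:8} {value}")
```

Once each line is parsed into a dictionary like `fields`, the records can be written out as a table and imported into an analysis tool.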

Page 51: Cases: Banking credit record

51

The User Session

Web Server

Browser

User requests index.htm.

Server sends copy of index.htm.

Browser parses index.htm, finds references to image files, and requests image files.

...

Page 52: Cases: Banking credit record

52

Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Example of Association Rules

{Diaper} → {Beer}

{Milk, Bread} → {Eggs, Coke}

{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

Page 53: Cases: Banking credit record

53

Definition: Association Rule

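Association Rule
- An implication expression of the form X → Y, where X and Y are itemsets
- Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
- Support (s): fraction of transactions that contain both X and Y
- Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}, where $\sigma(\cdot)$ is the number of transactions containing the itemset and $|T|$ is the total number of transactions:

$$s = \frac{\sigma(\text{Milk}, \text{Diaper}, \text{Beer})}{|T|} = \frac{2}{5} = 0.4$$

$$c = \frac{\sigma(\text{Milk}, \text{Diaper}, \text{Beer})}{\sigma(\text{Milk}, \text{Diaper})} = \frac{2}{3} \approx 0.67$$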

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke
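Support and confidence can be checked directly against the five market-basket transactions. This is a minimal sketch in plain Python; the function names are assumptions, and it only evaluates a given rule rather than searching for rules the way an association-mining tool does.

```python
# The five market-basket transactions from the slide (TID 1 to 5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(x, y):
    """Fraction of transactions that contain both X and Y."""
    return support_count(x | y) / len(transactions)


def confidence(x, y):
    """How often the items in Y appear in transactions that contain X."""
    return support_count(x | y) / support_count(x)


x, y = {"Milk", "Diaper"}, {"Beer"}
print(f"support={support(x, y):.2f} confidence={confidence(x, y):.2f}")
```

For the rule {Milk, Diaper} → {Beer} this reproduces the slide's values of 0.4 and 0.67.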

Page 54: Cases: Banking credit record

54

Obtaining a Dataset from Web Log for SAS Data Analysis

Example: IMW’s Web Log Data (raw data, SAS dataset)

Data Processing Skills
- Converting the dataset into an Excel file
- Importing the data into SAS

Page 55: Cases: Banking credit record

55

SAS Association Model

Page 56: Cases: Banking credit record

56

Association Rules from IMW’s Dataset

Page 57: Cases: Banking credit record

57

Exercise 6

Download IMW’s Web Log raw data

Data conversion within Excel

Import the dataset to SAS

Try out SAS Association Analysis model