Data Mining David L. Olson James & H.K. Stuart Professor in MIS University of Nebraska Lincoln
Page 1: DATA MINING

Data Mining

David L. Olson

James & H.K. Stuart Professor in MIS

University of Nebraska Lincoln

Page 2: DATA MINING

Definition

• DATA MINING: exploration & analysis
– by automatic means
– of large quantities of data
– to discover actionable patterns & rules

• Data mining is a way to use the massive quantities of data that businesses generate

Page 3: DATA MINING

Retail Outlets

• Bar coding & Scanning generate masses of data
– customer service
– inventory control
– MICROMARKETING
– CUSTOMER PROFITABILITY ANALYSIS
– MARKET BASKET ANALYSIS

Page 4: DATA MINING

FINGERHUT

• Founded 1948
– today sends out 130 different catalogs
– to over 65 million customers
– 6 terabyte data warehouse
– 3000 variables of 12 million most active customers
– over 300 predictive models

• Focused marketing

Page 5: DATA MINING

Fingerhut

• Purchased by Federated Department Stores for $1.7 billion in 1999 (for database)

• Fingerhut had $1.6 to $2 billion business per year, targeted at lower-income households

• Can mail 400,000 packages per day

• Each product line has its own catalog

Page 6: DATA MINING

Fingerhut

• Uses segmentation, decision tree, regression, neural network tools from SAS and SPSS

• Segmentation - combines order & demographic data with product offerings
– can target mailings to greatest payoff

• customers who had recently moved tripled their purchasing in the 12 weeks after the move

• send them furniture, telephone, decoration catalogs

Page 7: DATA MINING

Data for SEGMENTATION

cluster indices

subj  age  income  marital  grocery  dine out  savings
1001   53   80000  wife         180        90    30000
1002   48  120000  husband      120       110    20000
1003   32   90000  single        30       160     5000
1004   26   40000  wife          80        40        0
1005   51   90000  wife         110        90    20000
1006   59  150000  wife         160       120    30000
1007   43  120000  husband      140       110    10000
1008   38  160000  wife          80       130    15000
1009   35   70000  single        40       170     5000
1010   27   50000  wife         130        80        0

Page 8: DATA MINING

Initial Look at Data

• Want to know features of those who spend a lot dining out

• INCLUDE AS MANY ACTIONABLE VARIABLES AS POSSIBLE
– things you can identify

• Manipulate data
– sort on most likely indicator (dine out)

Page 9: DATA MINING

Sorted by Dine Out

cluster indices

subject  age  income  marital  grocery  dine out  savings
1004      26   40000  wife          80        40        0
1010      27   50000  wife         130        80        0
1001      53   80000  wife         180        90    30000
1005      51   90000  wife         110        90    20000
1002      48  120000  husband      120       110    20000
1007      43  120000  husband      140       110    10000
1006      59  150000  wife         160       120    30000
1008      38  160000  wife          80       130    15000
1003      32   90000  single        30       160     5000
1009      35   70000  single        40       170     5000

Page 10: DATA MINING

Analysis

• Best indicators
– marital status
– groceries

• Available
– marital status might be easier to get
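A minimal sketch, in Python, of the manipulation described on the preceding slides: sort the ten sample records on dine out, then average dine out within each marital status. The tuple layout and the helper names (`sort_by_dine_out`, `mean_dine_out_by_marital`) are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict

# Rows copied from the "Data for SEGMENTATION" slide:
# (subject, age, income, marital, grocery, dine_out, savings)
ROWS = [
    (1001, 53, 80000, "wife", 180, 90, 30000),
    (1002, 48, 120000, "husband", 120, 110, 20000),
    (1003, 32, 90000, "single", 30, 160, 5000),
    (1004, 26, 40000, "wife", 80, 40, 0),
    (1005, 51, 90000, "wife", 110, 90, 20000),
    (1006, 59, 150000, "wife", 160, 120, 30000),
    (1007, 43, 120000, "husband", 140, 110, 10000),
    (1008, 38, 160000, "wife", 80, 130, 15000),
    (1009, 35, 70000, "single", 40, 170, 5000),
    (1010, 27, 50000, "wife", 130, 80, 0),
]

MARITAL = 3   # column index of marital status
DINE_OUT = 5  # column index of dine-out spending


def sort_by_dine_out(rows):
    """Return the rows ordered from lowest to highest dine-out spending."""
    return sorted(rows, key=lambda row: row[DINE_OUT])


def mean_dine_out_by_marital(rows):
    """Average dine-out spending for each marital status."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[MARITAL]].append(row[DINE_OUT])
    return {status: sum(vals) / len(vals) for status, vals in groups.items()}
```

On this small sample the group means are 165 for single, 110 for husband, and about 92 for wife, which is consistent with marital status standing out as an indicator. Ten records are far too few to generalize from.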

Page 11: DATA MINING

Fingerhut

• Mailstream optimization
– which customers most likely to respond to existing catalog mailings
– saves nearly $3 million per year
– reversed trend of catalog sales industry in 1998
– reduced mailings by 20% while increasing net earnings to over $37 million

Page 12: DATA MINING

Banking

• Among first users of data mining

• Used to find out what motivates their customers (reduce churn)

• Loan applications

• Target marketing

• Norwest: 3% of customers provided 44% of profits

• Bank of America: program cultivating top 10% of customers

Page 13: DATA MINING

CREDIT SCORING

Bank Loan Applications

Age  Income  Assets   Debts   Want  On-time
 24   55557   27040   48191   1500        1
 20   17152   11090   20455    400        1
 20   85104       0   14361   4500        1
 33   40921   91111   90076   2900        1
 30   76183  101162  114601   1000        1
 55   80149  511937   21923   1000        1
 28   26169   47355   49341   3100        0
 20   34843       0   21031   2100        1
 20   52623       0   23054  15900        0
 39   59006  195759  161750    600        1

Page 14: DATA MINING

Characteristics of Not On-time

Age  Income  Assets   Debts   Want  On-time
 28   26169   47355   49341   3100        0
 20   52623       0   23054  15900        0

Here, Debts exceed Assets

Age: Young

Income: Low

BETTER: Base on statistics, large sample; supplement data with other relevant variables
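To illustrate why the slide recommends statistics on a large sample over an eyeballed rule, here is a small Python sketch that applies the "debts exceed assets" rule to all ten loan applications from the credit scoring slide. The function and variable names are hypothetical; the data are copied from the table.

```python
# (age, income, assets, debts, want, on_time) from the "Bank Loan Applications" table
LOANS = [
    (24, 55557, 27040, 48191, 1500, 1),
    (20, 17152, 11090, 20455, 400, 1),
    (20, 85104, 0, 14361, 4500, 1),
    (33, 40921, 91111, 90076, 2900, 1),
    (30, 76183, 101162, 114601, 1000, 1),
    (55, 80149, 511937, 21923, 1000, 1),
    (28, 26169, 47355, 49341, 3100, 0),
    (20, 34843, 0, 21031, 2100, 1),
    (20, 52623, 0, 23054, 15900, 0),
    (39, 59006, 195759, 161750, 600, 1),
]

ASSETS, DEBTS, ON_TIME = 2, 3, 5  # column indices


def debts_exceed_assets(loan):
    """The naive rule read off the two not-on-time applicants."""
    return loan[DEBTS] > loan[ASSETS]


flagged = [loan for loan in LOANS if debts_exceed_assets(loan)]
late = [loan for loan in LOANS if loan[ON_TIME] == 0]

# Applicants the rule flags even though they paid on time
false_alarms = [loan for loan in flagged if loan[ON_TIME] == 1]
```

The rule catches both late payers but also flags five applicants who paid on time, so on this sample it would reject seven of ten applicants to avoid two bad loans.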

Page 15: DATA MINING

CHURN

• Customer turnover

• critical to:
– telecommunications
– banks
– human resource management
– retailers

Page 16: DATA MINING

Identify characteristics of those who leave

Age (years) Time-job (months) Time-town (months) Min bal ($) checking savings card loan

27 12 12 549 x x

41 18 41 3259 x x x

28 9 15 286 x x

55 301 5 2854 x x x

43 18 18 1112 x x x

29 6 3 0 x

38 55 20 321 x x x

63 185 3 2175 x x x

26 15 15 386 x x

46 13 12 1187 x x x

37 32 25 1865 x x x

Page 17: DATA MINING

Analysis

• What are the characteristics of those who leave?
– Correlation analysis

• Which customers do you want to keep?
– Customer value - net present value of customer to the firm

Page 18: DATA MINING

Correlation

          Age  Time Job  Time Town  Min-Bal  Check  Saving  Card  Loan
Age       1.0       0.6        0.4     -0.4    0.0     0.4   0.2   0.3
Job                 1.0        0.9     -0.6    0.1     0.6   0.9  -0.2
Town                           1.0     -0.5   -0.1     0.3   0.5   0.4
Min-Bal                                 1.0   -0.2     0.3   0.6  -0.1
Check                                          1.0     0.5   0.2   0.2
Saving                                                 1.0   0.9   0.3
Card                                                         1.0   0.5
Loan                                                               1.0
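A correlation matrix like the one above can be computed with Pearson's coefficient. The Python sketch below is one way to do it, applied to the four numeric columns of the eleven sample customers on the earlier churn slide. The names `pearson`, `correlation_matrix`, and `CHURN_COLUMNS` are assumptions for illustration. The sample behind the slide's matrix is not given, so the values this code produces are not claimed to reproduce it.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map every (name_a, name_b) pair to its Pearson coefficient."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


# Numeric columns from the "Identify characteristics of those who leave" slide
CHURN_COLUMNS = {
    "age": [27, 41, 28, 55, 43, 29, 38, 63, 26, 46, 37],
    "time_job": [12, 18, 9, 301, 18, 6, 55, 185, 15, 13, 32],
    "time_town": [12, 41, 15, 5, 18, 3, 20, 3, 15, 12, 25],
    "min_bal": [549, 3259, 286, 2854, 1112, 0, 321, 2175, 386, 1187, 1865],
}
```

Calling `correlation_matrix(CHURN_COLUMNS)` returns a symmetric matrix with 1.0 on the diagonal. Only the upper triangle needs to be reported, as on the slide.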

Page 19: DATA MINING

Mortgage Market

• Early 1990s - massive refinancing

• need to keep customers happy to retain

• contact current customers who have rates significantly higher than market
– a major change in practice
– data mining & telemarketing increased Crestar Mortgage’s retention rate from 8% to over 20%

Page 20: DATA MINING

Banking

• Fleet Financial Group
– $30 million data warehouse
– hired 60 database marketers, statistical/quantitative analysts & DSS specialists
– expect to add $100 million in profit by 2001

Page 21: DATA MINING

Banking

• First Union
– concentrated on contact-point
– previously had very focused product groups, little coordination
– Developed offers for customers

Page 22: DATA MINING

CREDIT SCORING

• Data warehouse including demand deposits, savings, loans, credit cards, insurance, annuities, retirement programs, securities underwriting, other

• Statistical & mathematical models (regression) to predict repayment

Page 23: DATA MINING

CUSTOMER RELATIONSHIP MANAGEMENT (CRM)

• understanding value customer provides to firm
– Kathleen Khirallah - The Tower Group
• Banks will spend $9 billion on CRM by end of 1999
– Deloitte
• only 31% of senior bank executives confident that their current distribution mix anticipated customer needs

Page 24: DATA MINING

Customer Value

Middle aged (41-55), 3-9 years on job, 3-9 years in town, savings account

Discount rate: 1.3

year  annual purchases  profit  discounted  net
   1              1000     200         153  153
   2              1000     200         118  272
   3              1000     200          91  363
   4              1000     200          70  433
   5              1000     200          53  487
   6              1000     200          41  528
   7              1000     200          31  560
   8              1000     200          24  584
   9              1000     200          18  603
  10              1000     200          14  618

Page 25: DATA MINING

Younger Customer

Young (21-29), 0-2 years on job, 0-2 years in town, no savings account

Discount rate: 1.3

year  annual purchases  profit  discounted  net
   1               300      60          46   46
   2               360      72          43   89
   3               432      86          39  128
   4               518     104          36  164
   5               622     124          34  198
   6               746     149          31  229
   7               896     179          29  257
   8              1075     215          26  284
   9              1290     258          24  308
  10              1548     310          22  331
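The two customer value tables can be reproduced with a short net present value calculation. The Python sketch below assumes, as the tables suggest, that profit is 20% of annual purchases, that the younger customer's purchases grow 20% per year from 300, and that year-t profit is divided by 1.3 to the power t. These parameters are inferred from the numbers on the slides and the function names are illustrative.

```python
def discounted_profits(profits, rate=1.3):
    """Discount each year's profit: year 1 is divided by rate, year 2 by rate**2, ..."""
    return [profit / rate ** year for year, profit in enumerate(profits, start=1)]


def customer_value(profits, rate=1.3):
    """Net present value of a stream of yearly profits."""
    return sum(discounted_profits(profits, rate))


MARGIN = 0.20  # assumed profit share of purchases (200 / 1000 and 60 / 300 on the slides)
YEARS = 10

# Middle-aged customer: flat purchases of 1000 per year
middle_aged_profits = [MARGIN * 1000 for _ in range(YEARS)]

# Younger customer: purchases start at 300 and grow 20% per year (assumed from the table)
young_profits = [MARGIN * 300 * 1.2 ** year for year in range(YEARS)]

middle_aged_value = customer_value(middle_aged_profits)
young_value = customer_value(young_profits)
```

Under these assumptions the ten-year values come to about 618 for the middle-aged customer and about 331 for the younger one, matching the final net figures in the tables. The first table appears to truncate intermediate values while the second rounds them, so individual rows can differ by one.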

Page 26: DATA MINING

Credit Card Management

• Very profitable industry

• Card surfing - pay old balance with new card

• promotions typically generate 1000 responses, about 1%

• in early 1990s, almost all mass-marketing

• data mining improves response rates (lift)

Page 27: DATA MINING

LIFT

• LIFT = probability in class by sample divided by probability in class by population
– if population probability is 20% and sample probability is 30%, LIFT = 0.3/0.2 = 1.5

• best lift not necessarily best
– need sufficient sample size
– as confidence increases, longer list but lower lift

Page 28: DATA MINING

Lift Example

• Product to be promoted

• Sampled over 10 identifiable segments of potential buying population
– Profit $50 per item sold
– Mailing cost $1
– Sorted by Estimated response rates

Page 29: DATA MINING

Lift Data

Seg  Rate   Rev    Cost  Profit
  1  0.042  $2.10  $1     $1.10
  2  0.035  $1.75  $1     $0.75
  3  0.025  $1.25  $1     $0.25
  4  0.017  $0.85  $1    -$0.15
  5  0.015  $0.75  $1    -$0.25
  6  0.013  $0.65  $1    -$0.35
  7  0.009  $0.45  $1    -$0.55
  8  0.005  $0.25  $1    -$0.75
  9  0.004  $0.20  $1    -$0.80
 10  0.001  $0.05  $1    -$0.95
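The lift definition and the segment table can be tied together in a few lines of Python. This sketch assumes the ten segments are the same size and that profit is measured per piece mailed in each segment. Those assumptions, and the helper names, are not stated in the slides.

```python
RATES = [0.042, 0.035, 0.025, 0.017, 0.015, 0.013, 0.009, 0.005, 0.004, 0.001]
PROFIT_PER_SALE = 50.0  # dollars per item sold
MAIL_COST = 1.0         # dollars per piece mailed


def lift(sample_rate, population_rate):
    """LIFT = probability in the sample divided by probability in the population."""
    if population_rate <= 0:
        raise ValueError("population rate must be positive")
    return sample_rate / population_rate


def segment_profit(rate):
    """Expected profit per piece mailed to a segment with this response rate."""
    return rate * PROFIT_PER_SALE - MAIL_COST


def cumulative_lift(rates):
    """Lift of mailing the top k segments, for k = 1..len(rates) (equal-size segments)."""
    overall = sum(rates) / len(rates)
    lifts, running = [], 0.0
    for k, rate in enumerate(rates, start=1):
        running += rate
        lifts.append(lift(running / k, overall))
    return lifts


def best_cutoff(rates):
    """Number of top segments to mail that maximizes cumulative profit."""
    best_k, best_profit, running = 0, 0.0, 0.0
    for k, rate in enumerate(rates, start=1):
        running += segment_profit(rate)
        if running > best_profit:
            best_k, best_profit = k, running
    return best_k, best_profit
```

With the slide's rates, `best_cutoff(RATES)` stops after segment 3, where cumulative profit per piece reaches about $2.10. `cumulative_lift(RATES)` falls steadily toward 1.0 as more segments are added, which is the "longer list but lower lift" trade-off.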

Page 30: DATA MINING

Lift Chart

[Chart: LIFT. Cumulative proportion (y-axis, 0 to 1.2) by segment (x-axis, 0 to 10); series: Cum Response, Random]

Page 31: DATA MINING

Profit Impact

[Chart: PROFIT. Dollars (y-axis, -4 to 12) by segment (x-axis, 0 to 10); series: Cum Revenue, Cum Cost, Cum Profit]

Page 32: DATA MINING

INSURANCE

• Marketing, as in retailing & banking

• Special:
– Farmers Insurance Group - underwriting system generating $ millions in higher revenues, lower claims
• 7 databases, 35 million records
– better understanding of market niches
• lower rates on sports cars, increasing business

Page 33: DATA MINING

Insurance Fraud

• Specialist criminals - multiple personas

• InfoGlide specializes in fraud detection products
– similarity search engine

• link names, telephone numbers, streets, birthdays, variations

• identify 7 times more fraud than exact-match systems
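To show the general idea of similarity matching versus exact matching, here is a small Python sketch using the standard library's `difflib.SequenceMatcher`. It is not InfoGlide's method, which the slides do not describe. The records, field names, and the 0.75 threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

FIELDS = ("name", "phone", "street")


def normalize(text):
    """Lowercase and keep only letters and digits, so formatting differences vanish."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def field_similarity(a, b):
    """Similarity of two field values in [0, 1]."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def record_similarity(r1, r2):
    """Average field similarity across the compared fields."""
    return sum(field_similarity(r1[f], r2[f]) for f in FIELDS) / len(FIELDS)


def similar_pairs(records, threshold=0.75):
    """Index pairs whose average similarity meets the (hypothetical) threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if record_similarity(a, b) >= threshold
    ]


def exact_pairs(records):
    """Index pairs that agree exactly on every field."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if all(a[f] == b[f] for f in FIELDS)
    ]


# Made-up records: the first two are variations of one persona
RECORDS = [
    {"name": "Jonathan Q. Smith", "phone": "402-555-0142", "street": "1200 Elm Street"},
    {"name": "Jon Smith", "phone": "4025550142", "street": "1200 Elm St."},
    {"name": "Maria Alvarez", "phone": "308-555-0199", "street": "77 Prairie Road"},
]
```

On these made-up records, `exact_pairs` finds nothing while `similar_pairs` links the first two, which illustrates how variations in names, numbers, and streets slip past exact-match systems.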

Page 34: DATA MINING

Insurance Fraud - Link Analysis

claim type  amount  physician  attorney
back         50000  Welby      McBeal
neck         80000  Frank      Jones
arm          40000  Barnard    Fraser
neck         80000  Frank      Jones
leg          30000  Schmidt    Mason
multiple    120000  Heinrich   Feiffer
neck         80000  Frank      Jones
back         60000  Schwartz   Nixon
arm          30000  Templer    White
internal    180000  Weiss      Richards
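A very simple form of link analysis on the table above is to count how often the same physician and attorney appear together. The Python sketch below does only that; real link analysis tools do considerably more, and the function name and `min_count` parameter are illustrative assumptions.

```python
from collections import Counter

# (claim_type, amount, physician, attorney) from the table above
CLAIMS = [
    ("back", 50000, "Welby", "McBeal"),
    ("neck", 80000, "Frank", "Jones"),
    ("arm", 40000, "Barnard", "Fraser"),
    ("neck", 80000, "Frank", "Jones"),
    ("leg", 30000, "Schmidt", "Mason"),
    ("multiple", 120000, "Heinrich", "Feiffer"),
    ("neck", 80000, "Frank", "Jones"),
    ("back", 60000, "Schwartz", "Nixon"),
    ("arm", 30000, "Templer", "White"),
    ("internal", 180000, "Weiss", "Richards"),
]


def repeated_links(claims, min_count=2):
    """Physician-attorney pairs that co-occur on at least min_count claims."""
    counts = Counter((physician, attorney) for _, _, physician, attorney in claims)
    return {pair: n for pair, n in counts.items() if n >= min_count}
```

Here `repeated_links(CLAIMS)` singles out the Frank and Jones pairing, which appears on three identical neck claims of 80000 and would be a natural candidate for review.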

Page 35: DATA MINING

Insurance Fraud

• Analytics’ NetMap for Claims
– uses industry-wide database
– creates data mart of internal, external data
– unusual activity for specific chiropractors, attorneys

• HNC Insurance Solutions
– workers compensation fraud

• VeriComp - predictive software (neural nets)
– saved Utah over $2 million

Page 36: DATA MINING

TELECOMMUNICATIONS

• Deregulation - widespread competition
– churn
• 1/3rd poor call quality, 1/2 poor equipment
– wireless performance monitor tracking
• reduced churn about 61%, $580,000/year
– cellular fraud prevention
– spot problems when cell phones begin to go bad

Page 37: DATA MINING

Telecommunications

• Metapath’s Communications Enterprise Operating System
– help identify telephone customer problems
• dropped calls, mobility patterns, demographics
• to target specific customers
– reduce subscription fraud
• $1.1 billion
– reduce cloning fraud
• cost $650 million in 1996

Page 38: DATA MINING

Telecommunications

• Churn Prophet, ChurnAlert
– data mining to predict subscribers who cancel

• Arbor/Mobile
– set of products, including churn analysis

Page 39: DATA MINING

TELEMARKETING

• MCI uses data marts to extract data on prospective customers
– typically a 2 month program
– 20% improvement in sales leads
– multimillion investment in data marts & hardware
– staff of 45
– trend spotting (which approaches specific customers like)

Page 40: DATA MINING

Telemarketing

• Australian Tourist Commission
– maintained database since 1992

• responses to travel inquiries on tours, hotels, airlines, travel agents, consumers

• data mine to identify travel agents & consumers responding to various media

• sales closure rate at 10% and up

• lead lists faxed weekly to productive travel agents

Page 41: DATA MINING

Telemarketing

• Segmentation
– which customers respond to new promotions, to discounts, to new product offers
– Determine who
• to offer new service to
• is most likely to commit fraud

Page 42: DATA MINING

Human Resource Management

• Identify individuals liable to leave company without additional compensation or benefits

• Firm may already know 20% use 80% of offered services
– don’t know which 20%
– data mining (business intelligence) can identify

• Use most talented people in highest priority (or most profitable) business units

Page 43: DATA MINING

Human Resource Management

• Downsizing
– identify right people, treat them well
– track key performance indicators
– data on talents, company needs, competitor requirements

• State of Mississippi’s MERLIN network
– 30 databases (finance, payroll, personnel, capital projects)
– Cognos Impromptu system - 230 users

Page 44: DATA MINING

CASINOS

• Casino gaming one of richest data sets known

• Harrah’s - incentive programs
– about 8 million customers hold Total Gold cards, used whenever the customer spends money in the casino
– comprehensive data collection

• Trump’s Taj Card similar

Page 45: DATA MINING

Casinos

• Bellagio & Mandalay Bay
– strategy of luxury visits
– child entertainment
– change from old strategy - cheap food

• Identify high rollers - cultivate
– identify those to discourage from play
– estimate lifetime value of players

Page 46: DATA MINING

ARTS

• computerized box offices lead to high volumes of data

• Identify potential consumers for shows

• software to manage shows
– similar to airline seating chart software

Page 47: DATA MINING

Research Projects

• Techniques
– Statistics (difference between data mining, conventional statistics)

• Data Management
– How to beat data into usable form

• Visualization
– Manny Parzen

• Applications

Page 48: DATA MINING

Class Projects

• Application
– Gallup: rehabilitation of drug-using women
– Relationship between strengths-based counseling, success
– Finding: good counselor relationship key

• Data: limited (but Gallup tends to agglomerate thousands over time)

• Technique
– Regression
• On ordinal data

Page 49: DATA MINING

Sleep Disorder Prediction

• OSHA data
– 11 Nebraska plants
– Demographic data
– Epworth sleepiness scale
– 21 sleep disorder variables

• Applied Clementine models
– Trained on 1500
– Tested on 214

• If 4 of 6 models predicted a problem, case assigned as a problem
– Increased prediction accuracy
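The "4 of 6 models" rule is a simple voting ensemble. A minimal Python sketch follows; the function name and the example votes are hypothetical, and the six Clementine models themselves are not reproduced here.

```python
def ensemble_flag(votes, min_votes=4):
    """Flag a case when at least min_votes of the individual models predict a problem.

    votes: iterable of truthy/falsy predictions, one per model.
    """
    return sum(1 for vote in votes if vote) >= min_votes


# Hypothetical outputs of six models for two cases (1 = predicts a sleep disorder)
case_a = [1, 1, 0, 1, 1, 0]  # four of six agree, so the case is flagged
case_b = [1, 0, 0, 1, 1, 0]  # only three agree, so it is not
```

Requiring a supermajority rather than a single positive trades some sensitivity for fewer false alarms. The threshold of 4 is the one stated on the slide.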

Page 50: DATA MINING

Test Bank Analysis

• Effort to develop on-line test
– Math department
– Service course - freshmen, entire University
• Thousands of cases

• Data manipulation problem
– Some link analysis

• Early prediction of performance

• Identified which questions predicted results
– Used to take corrective action early

Page 51: DATA MINING

Survey Mode Effects

• Gallup

• Surveys via telephone or internet
– Effect of Interviewer

• DATA
– 2,979 Internet, 900 telephone

• DATA MINING
– Decision trees & neural networks
– Provided valuable information when traditional models limited by missing data

Page 52: DATA MINING

IT R&D & Economy

• Proposal: more IT R&D, better economy
– Apparently not reverse

• Data: 30 thousand cases, COMPUSTAT
– 28 quarter differences

• Technique:
– Decision Tree, SQL Server

• Results
– Weak
– Look at more variables

Page 53: DATA MINING

IT Effect on Firm Size

• IT reduces transaction costs, reducing firm size

• IT reduces coordination costs, increasing firm size

• DATA:
– Private fixed investment on IT
– Firm size
– Compustat

• DATA MINING:
– Rules
• Source of hypotheses

Page 54: DATA MINING

Online Auction Fraud Prediction

• eBay
– Over 16 million items per day
– Fraud: 3,700 in 1999, 6,600 in 2000

• Purpose:
– Predict seller fraud profile, which products

• User Trust

• DATA: golf clubs, humidifiers

• Initial results inconclusive - more work