Chapter 1 Initial Description of Data Mining in Business

Post on 27-Jan-2015



Prepared by: Dr. Tsung-Nan Tsai


Contents

Introduces data mining concepts

Presents typical business data applications

Explains the meaning of key concepts

Gives a brief overview of data mining tools

Outlines the remaining chapters of the book


Definition

Data mining refers to the analysis of the large quantities of data that are stored in computers.

DATA MINING: exploration & analysis
• by automatic means
• of large quantities of data
• to discover actionable patterns & rules

Data mining is a way to use the massive quantities of data that businesses generate.

GOAL - improve marketing, sales, and customer support through better understanding of customers


Retail Outlets

Bar coding & scanning generate masses of data
• customer service (grocery stores can quickly process the purchases and accurately determine product prices)
• inventory control (determine the quantity of each product on hand; supply chain management)

MICROMARKETING

CUSTOMER PROFITABILITY ANALYSIS

MARKET-BASKET ANALYSIS


Political Data Mining

Grossman et al., 10/18/2004, Time, 38

2004 Election

Republicans: VoterVault
• From mid-1990s
• About 165 million voters
• Massive get-out-the-vote drive for those expected to vote Republican

Democrats: Demzilla
• Also about 165 million voters
• Names typically have 200 to 400 information items


Medical Diagnosis

J. Morris, Health Management Technology Nov 2004, 20, 22-24

Electronic Medical Records

Associated Cardiovascular Consultants
• 31 physicians
• 40,000 patients per year, southern New Jersey

Data mined to identify efficient medical practice
• Enhance patient outcomes
• Reduced medical liability insurance


Mayo Clinic

Swartz, Information Management Journal Nov/Dec 2004, 8

IBM developed EMR program
• Complete records on almost 4.4 million patients
• Doctors can ask how the last 100 Mayo patients with the same gender, age, and medical history responded to particular treatments


Business Uses of Data Mining

Toyota used data mining of its data warehouse to determine more efficient transportation routes, reducing time-to-market by an average of 19 days.

Banks used data mining in soliciting credit card customers.

Insurance and telecommunication companies used DM to detect fraud.

Manufacturing firms used DM in quality control.

Many …


Business Uses of Data Mining

1. Customer profiling
• Identify which subsets of customers are profitable

2. Targeting
• Determine characteristics of most profitable customers

3. Market-basket analysis
• Determine correlation of purchases by customer profile
• Cross-selling
• Part of Customer Relationship Management


What is needed to do DM?

DM requires the identification of a problem, along with data collection that can lead to a better understanding of the market.

Computer models provide statistical or other means of analysis.

Two general types of DM studies:

1. Hypothesis testing: expressing a theory about the relationship between actions and outcomes, then testing it against the data.

2. Knowledge discovery: a preconceived notion may not be present; instead, relationships are identified by looking at the data (correlation analysis).


Reasons why Data Mining is now effective

Data are there

Data are warehoused (computerized)
• Walmart: 35 thousand queries per week

Computing economically available

Competitive pressure

Commercial products available


Trends

Every business is service
• hotel chains record your preferences
• car rental companies the same
• service versus price
  – credit card companies
  – long distance providers
  – airlines
  – computer retailers


Trends

Information as Product
• Custom Clothing Technology Corporation - fit jeans, other clothing

INFORMATION BROKERING
• IMS - collects prescription data from pharmacies, sells to drug firms
• AC Nielsen - TV


Trends

Commercial Software Available
• using statistical, artificial intelligence tools that have been developed
• Enterprise Miner (SAS)
• Intelligent Miner (IBM)
• Clementine (SPSS)
• PolyAnalyst (Megaputer)
• Specialty products


Fingerhut’s DM models

Fingerhut used segmentation, decision tree, regression analysis, and neural network modeling tools, with SAS for regression analysis and SPSS for neural networks.

The segmentation model combines order and basic demographic data with Fingerhut’s product offerings.

Neural network models were used to identify patterns in mailings and in order-filling telephone calls.

Goal:
• Create new mailings targeted at customers with the greatest potential payoff.
• Create a catalog containing products those customers are interested in, such as furniture, telephones…


How Data Mining Is Being Used

U.S. Government

Track down Oklahoma City bombers, Unabomber, many others

Treasury department - international funds transfers, money laundering

Internal Revenue Service


How Data Mining Is Used

Firefly
• asks members to rate music and movies
• subscribers clustered
• clusters get custom-designed recommendations


Warranty Claims Routing

Diesel engine manufacturer
• stream of warranty claims
• each examined by an expert
• determine whether charges are reasonable & appropriate
• think of expert system to automate claims processing


Data mining application areas

Application area            Applications                       Specifics
Retailing                   Affinity positioning               Position products effectively
                            Cross-selling                      Find more products for customers
Banking                     Customer relationship management   Identify customer value; develop programs to maximize revenue
Credit card management      Lift                               Identify effective market segments
                            Churn                              Identify likely customer turnover
                            Fraud detection
Insurance                   Fraud detection                    Identify claims meriting investigation
Telecommunications          Churn                              Identify likely customer turnover
Telemarketing               Online information                 Aid telemarketers with easy data access
Human resource management   Churn                              Identify potential employee turnover


Retailing

Affinity positioning is based upon the identification of products that the same customer is likely to want.
• Example: cold medicine and tissues

Cross-selling: knowledge of products that go together can be used to market the complementary product.
• Grocery stores do this through product shelf positioning.

Grocery stores generate mountains of cash register data. Current technology enables grocers to look at customers who have defected from a store, their purchase history, and characteristics of other potential defectors.
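As a rough illustration of the affinity idea above, the sketch below counts how often pairs of products appear in the same basket and computes a simple confidence figure (the share of baskets with one item that also hold another). The basket contents and the function names `pair_counts` and `confidence` are invented for illustration; commercial DM tools use more elaborate association-rule methods.

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) gives each pair one canonical key per basket
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

def confidence(baskets, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

# Hypothetical cash register baskets, invented for illustration only.
baskets = [
    ["cold medicine", "tissues", "juice"],
    ["cold medicine", "tissues"],
    ["bread", "juice"],
    ["cold medicine", "bread"],
]

counts = pair_counts(baskets)
print(counts.most_common(3))
print(confidence(baskets, "cold medicine", "tissues"))
```

Pairs with high counts and high confidence are candidates for adjacent shelf placement or cross-selling offers.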


Cross-selling

USAA insurance
• doubled number of products held by average customer due to data mining
• detailed records on customers
• predict products they might need

Fidelity Investments
• regression - what makes customer loyal


Banking

CRM involves the application of technology to monitor customer service, a function that is enhanced through data mining support.

DM applications in finance include predicting the prices of equities, which involves a dynamic environment with surprise information, some of which might be inaccurate …

At Norwest Bank, only 3% of customers provided 44% of profits.

CRM products enable banks to define and identify customer and household relationships.
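A profit concentration figure like the Norwest one can be computed from per-customer profit data. The sketch below is a minimal version assuming a flat list of profits per customer; the numbers are invented and are not Norwest's data, and `top_share` is a hypothetical helper name.

```python
def top_share(profits, top_fraction):
    """Fraction of total profit contributed by the most profitable customers."""
    ranked = sorted(profits, reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:n_top]) / sum(ranked)

# Hypothetical per-customer profits: 3 large accounts and 97 small ones.
profits = [500.0] * 3 + [20.0] * 97

share = top_share(profits, 0.03)
print(f"Top 3% of customers provide {share:.1%} of profit")
```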


Retaining Good Customers

Customer loss:
• Banks - attrition
• Cellular phone companies - churn

Study who might leave, and why

Southern California Gas

– customer usage, credit information

– direct mail contact - most likely best billing plan

– who is price sensitive

Who should get incentives, whom to keep


Credit card management

Bank credit card marketing promotions typically generate 1,000 responses to mailed solicitations, a response rate of about 1%. The rate is improved significantly through data mining analysis.

DM tools used by banks include credit scoring, which is a quantified analysis of credit applicants with respect to predictions of on-time loan repayment. (Data covering deposits, savings, loans, credit card, insurance…).

These credit scores can be used to make accept/reject recommendations, as well as to establish the size of a credit line.

ATMs could be rigged up with electronic sales pitches for products that a particular customer is likely to be interested in.
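The response-rate arithmetic above can be made explicit: 1,000 responses at a 1% rate implies roughly 100,000 mailed solicitations. The sketch below also computes lift, taken here in its usual sense of a targeted segment's response rate divided by the baseline rate. The segment figures (300 responses from 10,000 targeted mailings) are invented for illustration.

```python
def implied_mailings(responses, response_rate):
    """Number of solicitations implied by a response count and a response rate."""
    return responses / response_rate

def lift(segment_responses, segment_size, baseline_rate):
    """Response rate of a targeted segment relative to the baseline rate."""
    return (segment_responses / segment_size) / baseline_rate

# Figures from the text: 1,000 responses at about a 1% response rate.
print(round(implied_mailings(1000, 0.01)))

# Hypothetical targeted segment chosen by a data mining model.
print(round(lift(300, 10_000, 0.01), 2))
```

A lift above 1 means the selected segment responds better than an untargeted mailing would.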


Fairbank & Morris

Credit card company’s most valuable asset: INFORMATION ABOUT CUSTOMERS

Signet Banking Corporation
• obtained behavioral data from many sources
• built predictive models
• aggressively marketed balance transfer card

First Union
• who will move soon - improve retention


Telecommunications

Retention of customers in telecommunications is very difficult. The phenomenon of a customer switching carriers is referred to as churn, a fundamental concept in telemarketing as well as in other fields.

One communications company estimated that 1/3 of churn is due to poor call quality, and up to 1/2 is due to poor equipment.

A cellular fraud prevention system monitors traffic to spot problems with faulty telephones. When a telephone begins to go bad, telemarketing personnel are alerted to contact the customer and suggest bringing the equipment in for service.

Another way to reduce churn is to protect customers from subscription and cloning (duplication) fraud. Fraud prevention systems provide verification that is transparent to legitimate subscribers.


Human resource management

Business intelligence is a way to truly understand markets, competitors, and processes.

Software technology such as data warehouses, data marts, online analytical processing (OLAP), and data mining can be used to improve a firm’s profitability.

In HRM, the analysis can lead to the identification of individuals who are liable to leave the company unless additional compensation or benefits are provided.

HRM would identify the right people so that organizations could treat them well and retain them (reduce churn).


Methodology and Tools

Analyzing data
• given management goals
• and given that management can translate knowledge into action


Basic Styles

Top-Down: HYPOTHESIS TESTING
• SUPERVISED
• have a theory, experiment to prove or disprove
• SCIENCE

Bottom-Up: KNOWLEDGE DISCOVERY
• UNSUPERVISED
• start with data, see new patterns
• CREATIVITY


Hypothesis Testing

Generate theory

Determine data needed

Get data

Prepare data

Build computer model

Evaluate model results
• confirm or reject hypotheses


Generate Theory

Systematically tie different input sources together (MENTAL MODEL)

What causes sales volume?
• sales rep performance
• economy, seasonality
• product quality, price, promotion, location


Generate Theory

Brainstorm:
• diverse representatives for broad coverage of perspectives (electronic)
• keep under control (keep positive)
• generate testable hypotheses


Define Data Needed

Determine data needed to test hypothesis
• Lucky - query existing database
• More often - gather: pull together from diverse databases, survey, buy


Locate Data

Usually scattered or unavailable

Sources:
• warranty claims
• point-of-sale data (cash register records)
• medical insurance claims
• telephone call detail records
• direct mail response records
• demographic data, economic data

PROFILE: counts, summary statistics, cross-tabs, cleanup


Prepare Data for Analysis

Summarize:
• too much - no discriminant information
• too little - swamped with useless detail

Process for computer: ASCII, spreadsheet

Data encoding: how data are recorded can vary - may have been collected with specific purpose

Textual data: avoid if possible (may need to code)

Missing values: missing salary - use mean?
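The "use mean?" option for missing values can be sketched as follows. This is one possible treatment, not a recommendation: filling with the mean leaves the average unchanged but understates variability. The salary figures are invented, missing entries are assumed to be recorded as `None`, and `impute_mean` is a hypothetical helper name.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical salary column with two missing entries.
salaries = [40_000, None, 55_000, 61_000, None]
print(impute_mean(salaries))
```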


Build and Evaluate Model

Build computer model
• Choose the appropriate modeling tools and algorithms
• Training and test data sets

Determine if hypotheses supported
• statistical practice
• test rule-based systems for accuracy

Requires both business and analytic knowledge


SUPERVISED

Dorn, National Underwriter Oct 18, 2004, 34, 39

Health care fraud
• Use statistics to identify indicators of fraud or abuse
• Can rapidly sort through large databases
• Identify patterns different from norm

Moderately successful

But only effective on schemes already detected

To benefit firm, need to identify fraud before paying claim


Knowledge Discovery

Machine learning?
• Usually need intelligent analyst

Directed: explain value of some variable

Undirected: no dependent variable selected
• identify patterns

Use undirected to recognize relationships; use directed to explain once found


Directed

Goal-oriented

Examples:
• If discount applies, impact on products
• Who is likely to purchase credit insurance?
• Predicted profitability of new customer
• What to bundle with a particular package

Identify sources of preclassified data

Prepare data for analysis

Build & train computer model

Evaluate


Identify Data Sources

Best - existing corporate data warehouse
• data clean, verified, consistent, aggregated

Usually need to generate
• most data in form most efficient for designed purpose
• historical sales data often purged for dormant customers (but you need that information)


Prepare Data

Put in needed format for computer

Make consistent in meaning

Need to recognize what data are missing
• change in balance = new - old
• add missing but known-to-be-important data

Divide data into training, test, evaluation

Decide how to treat outliers
• statistically biasing, but may be most important
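Two of the preparation steps above lend themselves to a short sketch: adding the derived field (change in balance = new - old) and dividing the data into training, test, and evaluation sets. The field names `old_balance` and `new_balance`, the 60/20/20 proportions, and the fixed seed are assumptions made for illustration; the slides do not specify them.

```python
import random

def add_balance_change(records):
    """Add a derived field: change in balance = new - old."""
    return [dict(r, change=r["new_balance"] - r["old_balance"]) for r in records]

def three_way_split(rows, train=0.6, test=0.2, seed=0):
    """Shuffle rows, then split into training, test, and evaluation sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is repeatable
    n_train = round(len(rows) * train)
    n_test = round(len(rows) * test)
    return (rows[:n_train],
            rows[n_train:n_train + n_test],
            rows[n_train + n_test:])

# Hypothetical account records, invented for illustration.
records = [{"old_balance": 100 * i, "new_balance": 100 * i + 25} for i in range(10)]

train_set, test_set, eval_set = three_way_split(add_balance_change(records))
print(len(train_set), len(test_set), len(eval_set))
```

Shuffling before splitting keeps any ordering in the source file (for example by date or customer ID) from leaking into one subset.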


Build & Train Model

Regression - human builds (selects independent variables, IVs)

Automatic systems train
• give it data, let it hammer

OVERFITTING: the model fits the training data too closely

TEST SET
• a means to evaluate model against data not used in training
• tune weights before using to evaluate


Evaluate Model

ERROR RATE: proportion of classifications in evaluation set that were wrong

too little training: poor fit on training data and poor error rate

optimal training: good fit on both

too much training: great fit on training data and poor error rate
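The error rate defined above is simple to compute once the model's classifications on the evaluation set are available. The labels below are invented for illustration, and `error_rate` is a hypothetical helper name.

```python
def error_rate(actual, predicted):
    """Proportion of classifications in the evaluation set that were wrong."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("evaluation set is empty")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

# Hypothetical evaluation set: true outcomes versus model classifications.
actual = ["churn", "stay", "stay", "churn", "stay"]
predicted = ["churn", "stay", "churn", "churn", "stay"]

print(error_rate(actual, predicted))  # 0.2 (one of five wrong)
```

Comparing this figure on the training data and on the evaluation set is what separates the three cases listed above: an over-trained model shows a low error rate on training data and a high one on the evaluation set.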


Undirected Discovery

What items sell together? Strawberries & cream

Directed: What items sell with tofu? Tabasco

Long distance caller market segmentation
• Uniform usage - weekday & weekend, spikes on holidays
• After segmentation: high & uniform except for several months of nothing


UNSUPERVISED

Dorn, National Underwriter Oct 18, 2004, 34, 39

Health care fraud

Look at historical claim submissions

Build ad hoc model to compare with current claims

Assign similarity score to fraudulent claims

Predict fraud potential


Undirected Process

Identify data sources

Prepare data

Build & train computer model

Evaluate model

Apply model to new data

Identify potential targets for undirected

Generate new hypotheses to test


Generate hypotheses

Any commonalities in data?

Are they useful?
• Many adults watch children’s movies
• chaperones are an important market segment
• they probably make final decision

When hypothesis is generated, that determines data needed


Bank Case Study

Directed knowledge discovery to recognize likely prospects for home equity loan
• training set - current loan holders
• developed model for propensity to borrow
• got continuous scores, ranked customers
• sent top 11% material

Undirected: segmented market into clusters
• in one, 39% had both business & personal accounts
• cluster had 27% of the top 11%

Hypothesis: people use home equity to start business
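The two steps of the case study can be sketched in a few lines: rank customers by a continuous propensity score and keep the top 11%, then check how much of that group falls in a given cluster. The scores and cluster labels below are invented; the bank's actual model and clustering method are not described in the slides, and `top_fraction` and `cluster_share` are hypothetical helper names.

```python
def top_fraction(scores, fraction):
    """Return the customer ids with the highest scores, covering `fraction` of all."""
    n_top = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_top]

def cluster_share(selected, cluster_of, cluster):
    """Share of the selected customers that belong to `cluster`."""
    return sum(cluster_of[c] == cluster for c in selected) / len(selected)

# Hypothetical propensity scores and cluster labels for 100 customers.
scores = {f"c{i}": i / 100 for i in range(100)}
cluster_of = {f"c{i}": ("A" if i % 4 == 0 else "B") for i in range(100)}

mailing_list = top_fraction(scores, 0.11)
print(len(mailing_list))
print(round(cluster_share(mailing_list, cluster_of, "A"), 3))
```

A cluster that is over-represented among the top-scored customers, as in the 27% figure above, is what prompts a new hypothesis to test.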


Data mining products and data sets

A good source to view current DM products is www.KDnuggets.com.

The UCI Machine Learning Repository is a source of very good data mining datasets at www.ics.uci.edu/~mlearn/MLOther.html.

Weka DM software at http://www.cs.waikato.ac.nz/ml/weka/

Tanagra DM software at http://eric.univ-lyon2.fr/~ricco/tanagra/index.html
