Lecture 7 MARK2039 Summer 2006 George Brown College Wednesday 9-12.

2

Exam

1) You are running an analysis to determine the number of customers who are poor credit risks, who live in Montreal, and who have been promoted in the last month. There are 3 million customers and 50 million promotion records. The analysis has taken over a day. The customer file and promotion file contain the following fields:

Customer            Promotion Code
Account ID          Account ID
Household Number    Date of Promotion
Credit Score        Promotion Type
Postal Code

Answer the following:
a) What fields would you pull to do the query?
b) Give one suggestion on how you would improve the run time of this query.

a) Account ID, Date of Promotion, Credit Score, Postal Code
b) Index Account ID and make the DB relational
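A minimal sketch of answer b), using Python's built-in sqlite3 module as a stand-in for the real database. The table names, column names, sample rows, the credit score cutoff of 500, and the use of a leading "H" to mean a Montreal postal code are all assumptions made for illustration; none of them come from the exam question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical tables holding only the fields pulled in answer a).
cur.execute("CREATE TABLE customer (account_id INTEGER, credit_score INTEGER, postal_code TEXT)")
cur.execute("CREATE TABLE promotion (account_id INTEGER, promo_date TEXT)")

cur.executemany("INSERT INTO customer VALUES (?, ?, ?)", [
    (1, 450, "H3A2B4"),   # Montreal, low score
    (2, 720, "H4B2E5"),   # Montreal, high score
    (3, 430, "M5A3S6"),   # Toronto, low score
])
cur.executemany("INSERT INTO promotion VALUES (?, ?)", [
    (1, "2006-06-15"), (1, "2006-06-20"), (2, "2006-06-10"), (3, "2006-06-12"),
])

# Answer b): index the account ID used to join the two files.
cur.execute("CREATE INDEX idx_customer_account ON customer (account_id)")
cur.execute("CREATE INDEX idx_promotion_account ON promotion (account_id, promo_date)")

POOR_CREDIT_CUTOFF = 500  # assumed definition of "poor credit risk"
query = """
    SELECT COUNT(DISTINCT c.account_id)
    FROM customer AS c
    JOIN promotion AS p ON p.account_id = c.account_id
    WHERE c.credit_score < ?
      AND c.postal_code LIKE 'H%'
      AND p.promo_date >= ?
"""
poor_risk_count = cur.execute(query, (POOR_CREDIT_CUTOFF, "2006-06-01")).fetchone()[0]
print(poor_risk_count)
```

Pulling only the four needed fields and joining on an indexed key is the point of the sketch; the actual speed-up on 50 million promotion records would depend on the database being used.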

3

Exam

2) Listed below are 3 columns, each containing 5 values. Answer the following:

1) What is the mean and median of each column?
2) Which column contains the normal distribution and why?
3) What would be the better reporting measure for the non-normal distribution and why?

Column A    Column B    Column C
120         10          200
80          5           20000
40000       20          18000
140         15          22000
90          25          24000

1) Col. A: mean = 8086, median = 120. Col. B: mean = 15, median = 15. Col. C: mean = 16840, median = 20000.
2) The normal distribution is B because the mean and median are the same.
3) The median, as it is not skewed by outliers.
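The means and medians above can be checked with a short sketch using Python's standard statistics module; the dictionary layout is just an illustrative way to hold the three columns.

```python
from statistics import mean, median

# Values copied from the three columns in the question.
columns = {
    "A": [120, 80, 40000, 140, 90],
    "B": [10, 5, 20, 15, 25],
    "C": [200, 20000, 18000, 22000, 24000],
}

summary = {name: (mean(values), median(values)) for name, values in columns.items()}
for name, (avg, med) in summary.items():
    print(f"Column {name}: mean={avg}, median={med}")
```

The single outlier in column A (40000) pulls the mean far above the median, which is why the median is the better reporting measure there.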

4

Exam

3) The current expected performance of a given campaign is 4%. Two strategies have been tested with the following results. What would you conclude for each strategy, and what would you do for the next campaign based on the learning? (Hint: you have to conduct your calculations on both tests.)

Strategy      # of names    Response Rate
Strategy A    40000         3.80%
Strategy B    2000          2.00%

Str. A: std. error = .00096 – CI (about 2 std. errors): .0361 <= .0380 <= .0399
Str. B: std. error = .003 – CI (about 2 std. errors): .014 <= .02 <= .026

Both intervals fall below the expected 4%, so do not use either strategy and continue with the existing strategy.
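A minimal sketch of the calculation for both tests, assuming the usual normal approximation for a response rate (standard error = sqrt(p(1-p)/n)) and an interval of two standard errors on either side. The function name and the choice of z = 2 are assumptions for illustration.

```python
from math import sqrt

def response_interval(names, rate, z=2.0):
    """Return (std_error, lower, upper) for an observed response rate."""
    std_error = sqrt(rate * (1 - rate) / names)
    return std_error, rate - z * std_error, rate + z * std_error

EXPECTED = 0.04  # current expected campaign performance
results = {}
for label, names, rate in [("A", 40000, 0.038), ("B", 2000, 0.02)]:
    se, low, high = response_interval(names, rate)
    results[label] = (se, low, high)
    verdict = "below expected" if high < EXPECTED else "not clearly below expected"
    print(f"Strategy {label}: se={se:.5f}, CI=({low:.4f}, {high:.4f}) -> {verdict}")
```

Under these assumptions the upper bound of each interval sits below 4%, which is what supports staying with the existing strategy.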

5

Exam

4) Three initiatives are outlined below. Assume that data mining can yield a 15% lift. What initiative would you pursue and why? Show your calculations.

- Outbound Telemarketing campaign with an available universe of 75000 names at $3.00 per name
- Email Campaign with an available universe of 10000000 names at $.10 per name
- Direct Mail Campaign with an available universe of 100000 names at $2.00 per name

Outbound Telemarketing
Names       Responders   Lift
75000       862          1.15
86250       862          1
Cost Diff: $33,750

Email
Names       Responders   Lift
10000000    115000       1.15
11500000    115000       1
Cost Diff: $150,000

Direct Mail
Names       Responders   Lift
100000      1150         1.15
115000      1150         1
Cost Diff: $30,000
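The cost differences above can be reproduced with a short sketch. It reads the figures as follows, which is an interpretation rather than something the slide states: without the 15% lift, 15% more names would have to be contacted to get the same number of responders, and the cost difference is the cost of those extra names. Costs are held in cents to avoid floating-point rounding.

```python
# (campaign, universe of names, cost per name in cents) taken from the question
CAMPAIGNS = [
    ("Outbound Telemarketing", 75_000, 300),
    ("Email", 10_000_000, 10),
    ("Direct Mail", 100_000, 200),
]
LIFT_PERCENT = 15

def cost_difference(names, cost_cents, lift_percent=LIFT_PERCENT):
    """Dollar cost of the extra names needed to match the lifted result without data mining."""
    extra_names = names * lift_percent // 100
    return extra_names * cost_cents / 100

savings = {name: cost_difference(names, cents) for name, names, cents in CAMPAIGNS}
best = max(savings, key=savings.get)
for name, amount in savings.items():
    print(f"{name}: ${amount:,.0f}")
print("Largest cost difference:", best)
```

On this reading the email campaign shows the largest cost difference, driven by the size of its universe rather than its cost per name.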

6

Exam

5) The marketing team wants the flexibility and the ability to conduct its own analysis without I/T or system resources. The customer file and transaction file contain the following fields:

Answer the following:
a) What type of technology would you use?
b) Give me a design that contains three dimensions and one measure.
c) Provide a query that can be conducted based on your above design.

a) Cube
b) Dimensions: product type, 1st digit of postal code, payment type. Measure: account ID
c) Give me a count of all customers who bought product A with cash

6) You are given the postal code data of each customer for company XYZ. How might company XYZ use this information to better target prospects to become customers?

Determine the number of customers in each postal code, and determine the number of persons in each postal code from Stats Can data. Create a penetration index: number of customers / number of persons at the postal code. Rank postal codes by penetration index and use the ranked postal codes to target prospects.
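The penetration index described above can be sketched as follows. The postal code areas and all counts are made-up illustration values, not real customer or Stats Can data.

```python
# Hypothetical counts per postal code area (illustration only).
customers = {"M5A": 120, "H3A": 45, "T5A": 10}
population = {"M5A": 4000, "H3A": 3000, "T5A": 2500}

# Penetration index = number of customers / number of persons in the area.
penetration = {
    code: customers.get(code, 0) / persons
    for code, persons in population.items()
    if persons > 0
}

# Rank areas from highest to lowest penetration.
ranked = sorted(penetration, key=penetration.get, reverse=True)
for code in ranked:
    print(f"{code}: {penetration[code]:.3%}")
```

How the ranked list is then used for prospecting (for example, targeting look-alike areas) is a business decision the sketch does not try to make.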

7

Exam

7) You are given a customer file with postal code data only. You can then append Stats Can taxfiler data and Stats Can Census data. Which data would be richer in terms of providing more granular data and why? What might be the advantage of using Stats Can taxfiler data?

Stats Can Census is richer as it has more records (50000 vs. 28000 for taxfiler).

The advantage of using taxfiler data is that the data is more recent.

8) Answer the following questions:
a) What is the last stage of data mining?
b) What is more important in data mining: reducing costs or maximizing revenues?
c) What must happen to the data before it gets used in a data mining application?
d) What is the metric that allows us to look at how data varies within a population?

a) Implementation
b) Reducing costs
c) Must be one to one in analytical file
d) Standard deviation or variation

8

Exam

9) What is a more accurate estimate of weight?
- Sample A: 150 lbs with std. dev. of 5 lbs
- Sample B: 25 lbs with std. dev. of 4 lbs

Explain why.

Sample A. Although its std. dev. is larger, if we look at std. dev. on a relative basis, comparing it to the magnitude of the values in the sample, we get a much tighter bound around A (5/150, about 3%) than around B (4/25, or 16%).

10) Give me one example of a legacy type system file. Give me one advantage of why you might build a data mart.

Legacy: billing or call detail files, external data such as Stats Can

Advantages of building a data mart:
- Data is aggregated and summarized
- Easier to use for analysis
- Quicker processing
- Easier interpretation as data deals solely with functional area

9

Exam

11) Answer yes or no on whether data mining should be used:

i) Creating a national advertising program
ii) Identifying your most profitable customers
iii) Trying to maximize the revenue of a campaign
iv) Using survey results (10% of customer base) to create a targeted customer list
v) Analyzing the results of a direct marketing campaign

i) No, ii) Yes, iii) No, iv) No, v) Yes

12) Listed below is a table containing 5 variables. For each variable, do the following:
a) Indicate if it is nominal, ordinal or interval.
b) Indicate whether the variable is useful and provide 1 sentence for your reasoning.

Variable              # of records   # of unique values   # of missing values
Promotion Date        100000         1                    0
Promotion Codes       150000         5000                 0
Income                75000          70000                70000
Number of Children    75000          6                    10000
Credit Decile Rank    75000          10                   0

Promotion Date: interval, not useful, only one value
Promotion Codes: nominal, not useful, too granular
Income: interval, not useful, too many missing values
Number of Children: interval, useful, few missing values
Credit Decile Rank: ordinal, useful, 0 missing values

10

Creating the Analytical File-Reviewing Data Dumps

Initial dump of 1st few records

Account   Postal    Birth   Start   Behav.   Income   # in
Number    Code      Date    Date    Score             House
123456    M5A3S6    07/49   03/91   500      30000    6
345231    H3A2B4    08/54   04/92   550      42500    1
543236    T5A3S7    06/92   600     35000    3        543210
etc…

Missing values in the data are not being treated properly.

11


Proper treatment of missing values results in the following dump:

Account   Postal    Birth   Start   Behav.   Income   # in
Number    Code      Date    Date    Score             House
123456    M5A3S6    07/49   03/91   500      30000    6
345231    H3A2B4    08/54   04/92   550      42500    1
543236    T5A3S7            06/92   600      35000    3
543210    etc…

Effective programming can ensure that records are being properly loaded into the system.

Initial dump of 1st few records
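One way the loading problem above can be avoided is sketched below. It assumes the raw file is comma-delimited, which the slides do not state; the field names are also hypothetical. The idea is to split on the delimiter, so that an empty birth date stays in its own position instead of letting later values shift left.

```python
import csv
import io

FIELDS = ["account", "postal_code", "birth_date", "start_date",
          "behav_score", "income", "num_in_house"]

# Assumed comma-delimited layout; the third record has no birth date.
raw = io.StringIO(
    "123456,M5A3S6,07/49,03/91,500,30000,6\n"
    "345231,H3A2B4,08/54,04/92,550,42500,1\n"
    "543236,T5A3S7,,06/92,600,35000,3\n"
)

records = []
for row in csv.reader(raw):
    if len(row) != len(FIELDS):
        raise ValueError(f"unexpected field count in record: {row}")
    # Keep empty fields as None so later columns do not shift left.
    records.append({name: (value or None) for name, value in zip(FIELDS, row)})

print(records[2])
```

Checking the field count on every record is a cheap way to catch the kind of misaligned dump shown on the earlier slide before it reaches the analytical file.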

12


A dump of a few records from a billing file revealed the following after sorting by account number:

Account   Purchase   Product    Date of
          Amount     Category   Purchase
123460    $50        ABC123     19980630
123460    $75        DEF789     19980703
456720    $90        GHI123     19980701
456720    $100       ABC456     19980715
333121    $25        JKL432     19980315
333121    $40        GHI342     19980401
789232    $30        GHI261     19980228
789232    $20        236phi     19980307

View of the Transaction File

13


A dump of a few promotion history records revealed the following after sorting by account number:

Account No.   Promotion ID   Promotion Date
123460        ABA123         19970115
123460        ACB431         19970315
123460        AAC221         19970618
456720        BAA123         19970115
456720        BBA321         19980115
456720        BCB330         19980315
456720        BAC112         19980618
333121        CBA321         19980115
789232        BAD333         19980415

View of the Promo History File

14


• Using your marketing knowledge, give me examples of variables that we might create from the last three slides
– Slide 11
– Slide 12
– Slide 13

• Slide 11: Age, region of country, tenure
• Slide 12: Total amount, total amount for a given product, and recency of purchase
• Slide 13: Total promotions, total promotions by type, and recency of last promotion
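As an illustration of the slide 12 variables, the sketch below derives total amount and recency of purchase per account from the billing file records shown earlier. The analysis date of July 31, 1998 is a hypothetical choice; recency depends entirely on which reference date is used.

```python
from collections import defaultdict
from datetime import date, datetime

# Records from the billing file dump: (account, amount, product category, purchase date)
transactions = [
    ("123460", 50, "ABC123", "19980630"),
    ("123460", 75, "DEF789", "19980703"),
    ("456720", 90, "GHI123", "19980701"),
    ("456720", 100, "ABC456", "19980715"),
    ("333121", 25, "JKL432", "19980315"),
    ("333121", 40, "GHI342", "19980401"),
    ("789232", 30, "GHI261", "19980228"),
    ("789232", 20, "236phi", "19980307"),
]
AS_OF = date(1998, 7, 31)  # hypothetical analysis date

total_amount = defaultdict(int)
last_purchase = {}
for account, amount, _category, yyyymmdd in transactions:
    purchased = datetime.strptime(yyyymmdd, "%Y%m%d").date()
    total_amount[account] += amount
    last_purchase[account] = max(purchased, last_purchase.get(account, purchased))

# Recency = days between the most recent purchase and the analysis date.
recency_days = {acct: (AS_OF - day).days for acct, day in last_purchase.items()}
for acct in total_amount:
    print(acct, total_amount[acct], recency_days[acct])
```

The same pattern (group by account, then sum, count, or take the latest date) would apply to the promotion history file for the slide 13 variables.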

15

Creating the Analytical File-Data Hygiene and Cleansing

• Once the data has been dumped in order to view records, data hygiene and cleansing typically have to take place

• Two key deliverables
– Clean name and address information
– Standard rules for coding of data values

16

Creating the Analytical File-Data Hygiene and Cleansing

• Clean Name and Address Information
– Market to right Individual
– Create Match keys

17

• Clean Name and Address Information
– Market to right Individual
– Create Match keys
– Name and Address Standardization

Original record:
BankID        987654321
Name          JONH SMITH JR.
Address1      123 WILLIAMS STRET
Address2      2ND FLOOR
Address3      TRT., O.N. M5G-1F3
Country       CDN
UnIndivID     123456789

Standardized record:
BankID        987654321
PreName
FirstName
Surname       JONH SMITH JR.
PostName
Street1       123 WILLIAMS STRET
Street2       2ND FLOOR
City          TRT
Province      O.N.
Postal Code   M5G-1F3
Country       CANADA
UnIndivID     123456789
Origin        Bank

Creating the Analytical File-Name and Address Standardization

18

DATA CLEANING
• Address correction
• Name parsing
• Genderizing
• Casing

Before cleaning:
BankID        987654321
PreName
FirstName
Surname       JONH SMITH JR.
PostName
Street1       123 WILLIAMS STRET
Street2       2ND FLOOR
City          TRT
Province      O.N.
Postal Code   M5G-1F3
Country       CANADA
UnIndivID     123456789
Origin        Bank

After cleaning:
BankID        987654321
PreName       Mr.
FirstName     John
Surname       Smith
PostName      Jr.
Street1       200-123 Williams Street
Street2
City          Toronto
Province      ON
Postal Code   M5G 1F3
Country       Canada
UnIndivID     123456789
Origin        Bank

Creating the Analytical File-Name and Address Standardization

19

Creating the Analytical File-Merge Purge of Names

• What are the reasons for creating unique match customer keys?
– Generating a marketing list
– Conducting analysis

Should the match keys be the same for both of the above scenarios?
No: tighter match keys when generating lists and looser match keys when conducting analysis.

In what situations are match keys numeric?
When matching files that involve only existing customer data.

20


Common fields to use in creating Match keys

• First Name;

• Surname;

• Unique Individual ID;

• Postal Code

• Credit Card Number

• Duns Number for Businesses

• Phone Number

Unique I.D’s or number type I.D’s are the preferred choice when creating match keys

• Let’s take a closer look at creating match keys using name and address

21


• Let’s take a look at 8 records and see what this means.

Surname   First Name   Address            Postal Code   Match Key
Smith     John         12345 Elm Street   L1A2A1        L1A2A1SMITHJ
Smith     James        45678 Elm Street   L1A2A1        L1A2A1SMITHJ
Brown     Tim          5678 Oak           M5A3A2        M5A3A2BROWNT
Brown     T.           5678 Oak Road      M5A3A2        M5A3A2BROWNT
Green     Ted          3478 Pine          V6A2A1        V6A2A1GREENT
Green     Tanya        3478 Pine          V6A2A2        V6A2A1GREENT
Filler    Robert       2345 Nurr          M5A3A2        M5A3A2FILLERR
Filler    Larry        5672 Bolton Dr.    M6A2A1        M6A2A1FILLERL
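The match keys in the table appear to follow the pattern postal code + surname + first initial. A minimal sketch of that construction is below; the function name and the cleaning rule (upper-case, keep only letters and digits) are assumptions, and only a subset of the records is used.

```python
def match_key(surname, first_name, postal_code):
    """Postal code + surname + first initial, upper-cased with punctuation and spaces removed."""
    def clean(text):
        return "".join(ch for ch in text.upper() if ch.isalnum())
    return clean(postal_code) + clean(surname) + clean(first_name)[:1]

records = [
    ("Smith", "John", "L1A2A1"),
    ("Smith", "James", "L1A2A1"),
    ("Brown", "Tim", "M5A3A2"),
    ("Brown", "T.", "M5A3A2"),
    ("Filler", "Robert", "M5A3A2"),
    ("Filler", "Larry", "M6A2A1"),
]
keys = [match_key(*record) for record in records]
for record, key in zip(records, keys):
    print(record, key)
```

Note that John Smith and James Smith share a key even though their addresses differ. A loose key like this may be acceptable for analysis, while generating a mailing list would call for a tighter key that also includes address elements.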

22


• Example: You have one record here:
– Richard Boire, 4628 Mayfair Ave., H4B2E5

– How would you use the above information for a backend analysis if I were a responder to an acquisition campaign?
  BOIREH4B2E5

– What about if you were conducting analysis on me as an existing customer who responded to a cross-sell campaign?
  Need only customer ID

– How about if you wanted to send me a direct mail piece?
  BOIRERICHARDH4B2E54628MAYFAIR

23

Creating the Analytical File- Data standardization

• Refers to a process where values of a common variable from different files are mapped to the same value. Some common examples:

• SIC Code Industry Classification Table
– Industry categories have a common set of codes

• Postal Code Variable
– Postal code has to have 6 characters in the pattern alpha, numeric, alpha, numeric, alpha, numeric, excluding the following alphas: D, F, O, Q, U, and Z.

• Give me examples of bad postal codes vs. good postal codes.
– D4B2E5, H442E6, etc. are bad postal codes.
– M5J1A1, A1A1A3, etc. are good postal codes.
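The postal code rule above can be sketched as a small validation check. This follows the rule exactly as the slide states it (six characters, alternating alpha and numeric, with D, F, O, Q, U and Z excluded); Canada Post's published rules may differ in detail, so treat the allowed-letter list as an assumption taken from the slide.

```python
import re

# Rule as stated on the slide: alpha-numeric-alpha-numeric-alpha-numeric,
# with D, F, O, Q, U and Z excluded from the alpha positions.
ALLOWED = "ABCEGHIJKLMNPRSTVWXY"
PATTERN = re.compile(rf"^[{ALLOWED}][0-9][{ALLOWED}][0-9][{ALLOWED}][0-9]$")

def is_valid_postal_code(code):
    """Check a postal code against the slide's pattern after removing spaces and hyphens."""
    normalized = code.upper().replace(" ", "").replace("-", "")
    return bool(PATTERN.match(normalized))

samples = ["D4B2E5", "H442E6", "M5J1A1", "A1A1A3", "m5j 1a1"]
checks = {code: is_valid_postal_code(code) for code in samples}
print(checks)
```

D4B2E5 fails because of the excluded letter D, and H442E6 fails because a digit sits where a letter is required, matching the bad examples on the slide.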

24

Creating the Analytical File- Data Standardization

• Here is an example of how disposition codes for telemarketing outcomes might be handled

Code   Description
21     Do Not Call
21     Do Not Call
21     Do Not Call
32     Do Not Call
9      Do Not Call - Place on “Do Not Call” list permanently
20     Do Not Solicit - Do not call, mail, email or attempt any other form of solicitations to this customer
22     Do Not Mail - Place permanently on “Do Not Mail” list; future calling solicitations ok
U28    No sale - Do not sollicitate
B22    Never call again, <<Client>>
B23    Never call again, general
C08    Scrubbed Vendor DNS
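One way to standardize codes like these is a lookup table that maps every raw code to a single agreed value. The sketch below is illustrative only: the three standard values are hypothetical, the assignment of each code is a reading of its description, and treating "DNS" as "do not solicit" is an assumption.

```python
# Hypothetical standard values; the source codes come from the disposition table.
STANDARD_DISPOSITION = {
    "21": "DO_NOT_CALL",
    "32": "DO_NOT_CALL",
    "9": "DO_NOT_CALL",
    "B22": "DO_NOT_CALL",
    "B23": "DO_NOT_CALL",
    "20": "DO_NOT_SOLICIT",
    "U28": "DO_NOT_SOLICIT",
    "C08": "DO_NOT_SOLICIT",   # assumes "DNS" stands for "do not solicit"
    "22": "DO_NOT_MAIL",
}

def standardize(code):
    """Map a raw disposition code to a standard value, flagging unknown codes."""
    return STANDARD_DISPOSITION.get(str(code).strip().upper(), "UNKNOWN")

raw_codes = ["21", 9, " b22 ", "22", "X99"]
standardized = [standardize(code) for code in raw_codes]
print(standardized)
```

Returning an explicit "UNKNOWN" rather than dropping unmapped codes keeps new vendor codes visible so the mapping can be extended.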

25

Creating the Analytical File- Data Standardization

• Postal Code Standardization
– Six-character code comprising alpha, numeric, alpha, numeric, alpha, numeric
– 1st letters: A,B,C,E,G,H,J,K,L,M,N,P,R,S,T,V,X,Y

• SIC (Standard Industrial Classification) code
– 4-digit code used to classify all companies into a standard set of industries

26

Creating the Analytical File- Data standardization

• Example:
– You have been asked to build a retention model. You have two years' worth of transaction data. Changes in the product category codes occurred six months ago. Key information that you would look at would be as follows:
• Income category
• Product Category
• Transaction Codes
• Transaction Amount
• Postal Code
• Transaction Date
• Gender

What would you need to do?
• Need to map the old product category code definitions from prior to six months ago to the new product category code definitions

27

• Geocoding is the process that assigns a latitude-longitude coordinate to an address. Once a latitude-longitude coordinate is assigned, the address can be displayed on a map or used in a spatial search.

• Data miners often use these coordinates to calculate such things as “distance to the nearest store”

Creating the Analytical File- Geocoding

28

Demographic Analysis

[Figure: GeoProfile map combining population count, age distribution, average age, and store location]

29

Creating the Analytical File-What is Geocoding?

• Let’s look at a sample of what some data might look like.

Postal Code   Latitude   Longitude
A1A5A2        5          10
B5V1A2        7          20
M6B2A2        10         30
T4B1A2        6          40
V4H2B5        11         50

How do we use this data to create meaningful variables?
– Using the Pythagorean theorem, where distance**2 = (difference in latitude)**2 + (difference in longitude)**2, with the differences taken between the two points being compared. This is extremely useful in calculating distance-type variables between a customer and a given location.
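A minimal sketch of the Pythagorean distance calculation, using the sample coordinates from the table. The function names are hypothetical. It treats the coordinates as points on a flat plane, as the slide does; for real latitudes and longitudes over long distances a great-circle formula would be more accurate.

```python
from math import hypot

# Sample coordinates from the table (illustrative values, not real latitudes/longitudes).
COORDINATES = {
    "A1A5A2": (5, 10),
    "B5V1A2": (7, 20),
    "M6B2A2": (10, 30),
    "T4B1A2": (6, 40),
    "V4H2B5": (11, 50),
}

def distance(code_a, code_b):
    """Straight-line distance between two postal codes in coordinate units."""
    lat_a, lon_a = COORDINATES[code_a]
    lat_b, lon_b = COORDINATES[code_b]
    return hypot(lat_a - lat_b, lon_a - lon_b)

def nearest_store(customer_code, store_codes):
    """Return the store postal code closest to the customer."""
    return min(store_codes, key=lambda store: distance(customer_code, store))

print(round(distance("A1A5A2", "B5V1A2"), 2))
print(nearest_store("M6B2A2", ["A1A5A2", "V4H2B5"]))
```

A "distance to the nearest store" variable for each customer is then just the minimum of these distances over all store locations.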

30

Creating the Analytical File-What is Geocoding

• Example:
– A retailer has the following information:
• Name and address of its customers
• Address of its stores
• Stats Can information

– As a marketer, how would you intelligently use this information?
– Find the distance between the nearest store and a given customer.
– Create a trading area around a given store. Find out which stores have the best penetration. At the same time, analyze these best-penetration stores and determine some key Stats Can attributes around them.

Region              # of Customers   % of Total
Prairie Provinces   25 M             2.5%
Quebec              100 M            10%
Ontario             350 M            35%
West                25 M             2.5%
Missing Values      500 M            50%
Total               1 MM             100%

Frequency Distribution

• The report below uses first digit of postal code to assign customers to region.

• For example, postal codes beginning with ‘G’, ‘H’, or ’J’ represent the Quebec region.

Customer Profiling
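A sketch of how a regional frequency distribution like the one above might be produced from the first character of the postal code. Only the Quebec letters (G, H, J) come from the slide; the other letter groupings follow Canada Post's first-letter provinces and their assignment to the slide's region names is an assumption. The sample postal codes are illustrative.

```python
from collections import Counter

# Only G, H, J -> Quebec is stated on the slide; the rest are assumed groupings.
REGION_BY_FIRST_LETTER = {
    **dict.fromkeys("GHJ", "Quebec"),
    **dict.fromkeys("KLMNP", "Ontario"),
    **dict.fromkeys("RST", "Prairie Provinces"),
    **dict.fromkeys("V", "West"),
}

def region(postal_code):
    """Assign a region from the first character of the postal code."""
    if not postal_code:
        return "Missing Values"
    # Letters not covered by the mapping (e.g. Atlantic codes) fall into "Other".
    return REGION_BY_FIRST_LETTER.get(postal_code.strip().upper()[:1], "Other")

postal_codes = ["M5A3S6", "H3A2B4", "T5A3S7", "V6A2A1", None, "", "H4B2E5"]
counts = Counter(region(code) for code in postal_codes)
total = sum(counts.values())
for name, count in counts.most_common():
    print(f"{name:<18}{count:>3}{count / total:>8.1%}")
```

Counting missing values as their own category, as the report does, makes data coverage problems visible before the field is used in analysis.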

32


Tenure    # of Customers   % of Customers
1998      9800             14%
1999      10000            14%
2000      12000            17%
2001      8000             11%
Missing   30000            43%
Total     69800            100%

This tenure report would tell us that the tenure field was not on this database prior to 1998 and that 30,000 customers began prior to that date. Given the high percentage of customers with missing values, we would need to determine whether we could capture tenure from another field in the database or not use the field.

33


Type of Product/Services Purchased   # of Customers   % of Customers
Product A                            35000            29.66%
Product B                            40000            33.90%
Product C                            25000            21.19%
Product D                            15000            12.71%
Other                                3000             2.54%
Total                                118000           100.00%

The Product/Service field has good coverage, and we can conclude that product B has been the best-selling product, followed closely by product A.

34

Creating Variables

Source/ Raw File Variables

# in Household

Income

Credit score

Total lifetime spend

Total number of promotions

Derived Variables

Region of country

Total spend within certain period

Age

Tenure

Number of promotions in last year by campaign category

• Example of source variables

• Example of derived variables

35

• Other variables
– Total spend in certain time periods
– Total spend by product category in certain time periods
– Decline in spend: total and by product type
– Trend variables related to spending and product category:
• Median
• Mean
• Variation
– Index variables
• Grouping of a variable into meaningful categories where category values are index values
• Binary variables: yes/no type variables such as gender

More Creations

36


A dump of a few records from a billing file revealed the following after sorting by account number:

Account   Purchase   Product    Date of
          Amount     Category   Purchase
123460    $50        ABC123     19980630
123460    $75        DEF789     19980703
456720    $90        GHI123     19980701
456720    $100       ABC456     19980715
333121    $25        JKL432     19980315
333121    $40        GHI342     19980401
789232    $30        GHI261     19980228
789232    $20        236phi     19980307

View of the Transaction File

• What kind of variables can be derived?

37

Creating Binary Groups

Income      % of Customers   Response Rate   Response Index   Income > 40K
under 20K   16%              1.50%           0.43             0
20-30K      16%              2.50%           0.71             0
30-40K      16%              2.00%           0.57             0
40-55K      16%              6%              1.71             1
55-80K      16%              5%              1.43             1
80K+        16%              4%              1.14             1
Average     100%             3.50%           1.00
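The response index in the table is each band's response rate divided by the average rate, and the binary variable flags the bands above 40K. A short sketch of both calculations follows; the variable names are hypothetical, and the simple average is used because the bands hold equal shares of customers.

```python
# Response rates (in percent) by income band, from the table.
RESPONSE_RATE = {
    "under 20K": 1.5,
    "20-30K": 2.5,
    "30-40K": 2.0,
    "40-55K": 6.0,
    "55-80K": 5.0,
    "80K+": 4.0,
}
HIGH_INCOME_BANDS = {"40-55K", "55-80K", "80K+"}

# Equal-sized bands, so the overall rate is the simple average of the band rates.
average_rate = sum(RESPONSE_RATE.values()) / len(RESPONSE_RATE)

# Index = band rate relative to the average; binary flag = 1 when income is over 40K.
response_index = {band: round(rate / average_rate, 2) for band, rate in RESPONSE_RATE.items()}
income_over_40k = {band: int(band in HIGH_INCOME_BANDS) for band in RESPONSE_RATE}

for band in RESPONSE_RATE:
    print(f"{band:<10}{response_index[band]:>6}{income_over_40k[band]:>3}")
```

The split at 40K works as a binary variable here because every band above it has an index over 1.00 and every band below it has an index under 1.00.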

38

Creating Indices

# of Months Since   % of Customers   Response Rate   Response Index   Months Since Last
Last Promotion                                                        Promotion (Index)
1                   16%              2.50%           0.71             0.57
2                   16%              1.50%           0.43             0.57
3                   16%              3.75%           1.07             1.00
4                   16%              3.25%           0.93             1.00
5                   16%              6.00%           1.71             1.43
6                   16%              4.00%           1.14             1.43
Average             100%             3.50%           1.00

39

More Variable Creation

Spending   # of customers   Response Rate
0-100      1000             1%
100-200    1000             0.80%
200-300    1000             1.20%
300-400    1000             0.90%
400+       1000             0.95%

• What would you do here?
• Is there any trend? Given that there seems to be no trend or impact between spend and response, it is highly unlikely that further information would be derived from this field.

40

More Variable Creation

Tenure     # of customers   Response Rate
< 1 year   1000             3%
1-2 yrs    1000             2.00%
2-3 yrs    1000             1.00%
3-4 yrs    1000             0.75%
4 yrs+     1000             0.30%

• What would you do here?
• Here, this variable in all likelihood would be useful given its trend with response rate.
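One crude way to express the "is there a trend?" question for the spending and tenure tables is to check whether the response rate only rises or only falls across the ordered bands. This monotonicity check is an illustrative heuristic chosen here, not a method named in the lecture.

```python
def is_monotonic(rates):
    """True when the rates only rise or only fall across the ordered bands."""
    pairs = list(zip(rates, rates[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

spending_rates = [1.0, 0.8, 1.2, 0.9, 0.95]   # response rate (%) by spending band
tenure_rates = [3.0, 2.0, 1.0, 0.75, 0.3]     # response rate (%) by tenure band

spending_has_trend = is_monotonic(spending_rates)
tenure_has_trend = is_monotonic(tenure_rates)
print("Spending trend:", spending_has_trend)
print("Tenure trend:", tenure_has_trend)
```

Spending bounces around with no direction, while tenure falls steadily, which lines up with the conclusions on the two slides.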

41

Stage 3 of Data Mining

• What stage are we at?
– Application of data mining tools

• Give me some examples of what data miners would be doing in stage 3
– Data discovery
• Data Audit/Frequency Distribution Analysis, Value Segmentation
– Models, profiles, etc.
– Post Campaign Analysis
– Reporting, such as standard KBM (Key Business Measure) reports
– Ad hoc reports

• Modelling and profiling represent some examples of what we might be doing in this stage.

42

Types of Predictive Models

• Examples: Discrete Models
– Response Models
• Cross Sell
• Upsell
• Acquisition
– Attrition Models
– Product Affinity Models
– Risk Models

43

Types of Predictive Models

• Examples: Continuous Models
– Profitability/Value Models
– Spending Models

• What is the concept of the objective function or dependent variable?
– This is the variable that we are trying to predict
• Response, bad credit, defection, spend, etc.
– What we are trying to optimize essentially becomes our objective function.
