Data Mining At Tech Journal

Agenda

• Background

• Questions of Interest

• Data Overview

• Selected Approach

• Potential Issues

• Current Status

• First Results

The Company

• A US company (“TechJournal”) publishes an on-line journal (“TechPub”) with content specifically aimed at IT professionals

• TechJournal is 15 years old; TechPub is 5 years old

• Content for TechPub comes from three sources:

– Aggregated content from public sources

– TechJournal created content

– Peer contributed content

• TechJournal’s core business is to produce a high-end list product for the marketing departments of IT manufacturers

The Journal

• The content on the publication website is available to both anonymous and registered users

• Registered users get access to some premium services as well

• Most content is free; some whitepapers are for sale

• Three unique features of the site:

– Peer contributed content

– Auction system: readers get paid to contribute content

– New: personalized content for each reader

The Readers

• Target: IT professionals involved in their organization’s technology purchasing decisions

• Different levels of “readership”:

– E-mail recipients

– Anonymous visits

– E-mail recipients who visited the site

– E-mail recipients who are repeat visitors

– Registered light readers

– Registered heavy readers

[Figure: readership levels; axis: Number of Individuals]

• The company continuously tries to stimulate new readership through e-mail campaigns

The Business Model

[Figure: diagram of the feedback loops in the business model]

Elements: TechPub Reader Activity; Knowledge of Readers’ Interests; Quality of List Product; List Value to Technology Manufacturers; Gathering New Content; Tuning of Content; New Readers: Reader Word of Mouth; New Readers: Company Prospecting; Company Resources for Reinvestment; Total Readers

Loops:

• “Active Readers Produce Better Lists” Loop

• “Known Readers Make For Better Journal” Loop

• “Success Breeds Success” Loop

• “Buzz Marketing” Loop

Focal Areas For Data Mining

[Figure: the business model diagram, with the “Known Readers Make For Better Journal”, “Active Readers Produce Better Lists”, and “Success Breeds Success” loops marked and annotated with the questions below]

• Is TechJournal’s current content taxonomy effective, or would some other content taxonomy be more useful?

• Given email recipient attributes, what is the likelihood of a visit to the website?

• Which content headlines would maximize that visit likelihood?

• Given registered readers’ attributes, which stories will they be interested in?

• Given past stories read, what is a registered reader most likely to also read?

• Given registered readers’ attributes, which will be most active?

The Data

My “Chunk of Data” to Mine:

• Issues Table: 713,110 records

• Issues - Content Linker Table: 2,185,664 records

• Content Items Table: 590 records

• Page Visit Table: 43,580 records

• Recipients Table: 195,455 records

• Taxonomy Click Table: 9,385 records

Attributes to Work With

• Reader Attributes

– Primary Key: Recipient ID, IP Address

– Data Mining Attributes: Title, City, State, Country, Zip, Phone; IT Budget, Employees, Sales, SIC Code, Industry; Time Sent, Time Opened, Time of Visit, Time Content Click

• Content Attributes

– Primary Key: Content ID, Issue ID

– Data Mining Attributes: Abstract, Headline Main, Content Type, Media Type, Author, Content Taxonomy, Click Rate

• Format Attributes

– Data Mining Attributes: Template Type, Media Type (HTML or Video)

Legend: highlighted attributes are features that can be used directly, or derived from, for classification

Creating Content Classes

TechJournal’s current taxonomy for classifying content:

• Manually derived

• Aggregation of other credible taxonomy fragments

• From a content provider point of view

• Goes out to 21 levels in some cases; others are as shallow as three

[Figure: taxonomy tree, number of classes at each level]

Level     Classes
1         1
2         5
3         46
4         798
5         1909
... 21    5000 +

9,750 visits spread over 31 classes:

#Visits   Content Class
2925      |Software|Business
2736      |Hardware|Storage
1187      |Software|Operating Systems
670       |Hardware|Networking
314       |Software|Software Development
282       |Hardware|Computers
278       |Industries|News
131       |Hardware|Telecom
118       |Industries|IT Management
97        |Hardware|Mobile Devices
75        |Online|Search
53        |Online|Portal
42        |Hardware|Printers
40        |Software|Consumer
38        |Industries|PCs
36        |Industries|Legal
32        |Hardware|Power
28        |Software|Networking
21        |Hardware|News
13        |Industries|Standards
8         |Hardware
8         |Industries|Hacking
7         |Online|News
4         |Online|Software as a Service
4         |Hardware|Chips
4         |Services|Disaster Recovery
3         |Online|Email
3         |Online|IM
2         |Services|Security
1         |Hardware|Software
1         |Services|Software Development
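The visit counts above are heavily concentrated in a few classes. A minimal Python sketch of how that skew could be quantified follows; the function names and the 50-visit threshold are illustrative assumptions, and shares are computed against the sum of the listed rows rather than the 9,750 headline figure.

```python
# Visit counts per content class, copied from the slide's table.
VISITS = {
    "|Software|Business": 2925, "|Hardware|Storage": 2736,
    "|Software|Operating Systems": 1187, "|Hardware|Networking": 670,
    "|Software|Software Development": 314, "|Hardware|Computers": 282,
    "|Industries|News": 278, "|Hardware|Telecom": 131,
    "|Industries|IT Management": 118, "|Hardware|Mobile Devices": 97,
    "|Online|Search": 75, "|Online|Portal": 53, "|Hardware|Printers": 42,
    "|Software|Consumer": 40, "|Industries|PCs": 38, "|Industries|Legal": 36,
    "|Hardware|Power": 32, "|Software|Networking": 28, "|Hardware|News": 21,
    "|Industries|Standards": 13, "|Hardware": 8, "|Industries|Hacking": 8,
    "|Online|News": 7, "|Online|Software as a Service": 4,
    "|Hardware|Chips": 4, "|Services|Disaster Recovery": 4,
    "|Online|Email": 3, "|Online|IM": 3, "|Services|Security": 2,
    "|Hardware|Software": 1, "|Services|Software Development": 1,
}


def cumulative_share(visits, top_n):
    """Fraction of all listed visits captured by the top_n most-visited classes."""
    total = sum(visits.values())
    top = sorted(visits.values(), reverse=True)[:top_n]
    return sum(top) / total


def sparse_classes(visits, min_visits):
    """Classes with fewer than min_visits visits (candidates for merging)."""
    return [name for name, n in visits.items() if n < min_visits]


if __name__ == "__main__":
    print(len(VISITS), "classes,", sum(VISITS.values()), "listed visits")
    print(f"share of top 3 classes: {cumulative_share(VISITS, 3):.1%}")
    # 50 is an arbitrary, assumed threshold for "too few visits to model".
    print(len(sparse_classes(VISITS, 50)), "classes with fewer than 50 visits")
```

Classes that fall under the chosen threshold are the ones most likely to be under-represented when used as classification targets.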

A Variety of Approaches

PREDICTIVE MODELING

• Given email recipient attributes, what is the likelihood of a visit to the website?

• Which content headline would maximize that visit likelihood?

• Given registered readers’ attributes, which readers will be most active?

• Given registered readers’ attributes, which types of content will they read?

CLUSTER ANALYSIS

• Is TechJournal’s current content taxonomy effective, or would some other taxonomy be more useful?

ASSOCIATION ANALYSIS

• Given past stories read, what is a registered reader most likely to also read?
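The “what is a reader most likely to also read” question is the classic setting for association analysis. As a minimal sketch (not the method used in the project), the Python below counts how often pairs of stories appear in the same reading history and reports confidence, the estimated probability of reading one story given another. The story IDs, the `min_support` threshold, and the function name are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def also_read_rules(histories, min_support=2):
    """Return {(a, b): confidence}, where confidence estimates P(b read | a read).

    Only pairs read together by at least min_support readers are kept.
    """
    item_counts = Counter()
    pair_counts = Counter()
    for stories in histories:
        unique = set(stories)  # ignore repeat reads by the same reader
        item_counts.update(unique)
        for a, b in combinations(sorted(unique), 2):
            pair_counts[(a, b)] += 1

    rules = {}
    for (a, b), n in pair_counts.items():
        if n >= min_support:
            rules[(a, b)] = n / item_counts[a]
            rules[(b, a)] = n / item_counts[b]
    return rules


# Hypothetical reading histories, one list of story IDs per registered reader.
HISTORIES = [
    ["s1", "s2", "s3"],
    ["s1", "s2"],
    ["s2", "s3"],
    ["s1", "s4"],
]

if __name__ == "__main__":
    for (a, b), conf in sorted(also_read_rules(HISTORIES).items()):
        print(f"read {a} -> also read {b}: confidence {conf:.2f}")
```

With sparse visit data, the support threshold matters: set too low, the rules reflect coincidence; set too high, few rules survive.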

Potential Issues

• Database evolution produces noisy, dirty, unevenly populated data

• Data comes from multiple sources; producing consistent data has been a challenge

• Still not clear whether we will end up with enough data to see anything meaningful

• Content taxonomy is relatively new and most likely has real problems with how it’s structured

• Taxonomy measures article subject matter, but behavior-stimulating content may be in headlines

• Features are somewhat related: Industry, Location, Size, Title, Sales, Employees

• Features have a high number of discrete values and need to be put into meaningful groupings

• Under-representation of several feature and class values
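One common way to handle both high-cardinality features and under-represented values is to fold rare values into a single catch-all group. The sketch below illustrates that idea only; the industry values, the `min_count` threshold, and the function name are hypothetical, and the project’s actual groupings may have been built differently.

```python
from collections import Counter


def collapse_rare(values, min_count, other_label="Other"):
    """Replace values seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Hypothetical self-reported industry values for seven recipients.
INDUSTRIES = ["Finance", "Finance", "Finance", "Retail", "Retail",
              "Mining", "Forestry"]

GROUPED = collapse_rare(INDUSTRIES, min_count=2)

if __name__ == "__main__":
    print(Counter(GROUPED))
```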

Feature Grouping - Location

[Figure: locations grouped into regions numbered 1 to 10, plus “Other” as group 11]

Feature Grouping - Title

• Start with ~1000 distinct self-reported Titles in the Database

• Most interested in Title as it correlates with impact and influence on IT buying decisions

• Reclassify them based on three concepts: Seniority, Function, Employees in Company

[Figure: title hierarchy with seniority levels 1 to 4 (level 2 coded 20 - 29, level 3 coded 30 - 39); labels: Owner/Chairman/CEO, Assistant, Manager of Managers, Manager of Doer, Doer, and Functional Areas 1 to 10]

Result: 24 Categories
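Reclassifying free-text titles usually starts with simple keyword rules. The sketch below covers the seniority dimension only and is purely illustrative: the keyword lists are hypothetical (the slides do not give the real rules), and the mapping of levels 1 to 4 onto Owner/Chairman/CEO, Manager of Managers, Manager of Doer, and Doer is an assumed reading of the figure.

```python
# Hypothetical keyword rules, checked in order from most to least senior.
SENIORITY_RULES = [
    (1, ("owner", "chairman", "ceo")),          # assumed: Owner/Chairman/CEO
    (2, ("vp", "vice president", "director")),  # assumed: Manager of Managers
    (3, ("manager", "supervisor")),             # assumed: Manager of Doer
]
DEFAULT_SENIORITY = 4                           # assumed: Doer


def seniority_level(title):
    """Map a self-reported title to a seniority level by substring match.

    Substring matching is crude (e.g. "Assistant to the CEO" lands in
    level 1), so real rules would need exceptions and manual review.
    """
    text = title.lower()
    for level, keywords in SENIORITY_RULES:
        if any(keyword in text for keyword in keywords):
            return level
    return DEFAULT_SENIORITY


if __name__ == "__main__":
    for title in ["CEO", "VP of Engineering", "IT Manager", "Network Engineer"]:
        print(title, "->", seniority_level(title))
```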

Where I Am In The Process

Problem Definition -> Data Gathering -> Data Prep -> Data Mining -> Results Analysis / Visualization -> Sum Up Insights

First Results

Q: Given registered readers’ attributes, which readers will be most active?

Method: Decision Tree Induction (Training Set: 599 Records; Test Set: 187 Records)

MSE on Test Set = .1451

MSE on Training Set = .1313

[Figure: decision tree; visible leaf labels: 0.7037 (n = 27), 0.1429 (n = 7)]

n= 786

node), split, n, deviance, yval
      * denotes terminal node

 1) root 786 223508.000 29.44402
   2) LocGrpID< 1.5 96 23784.990 24.01042
     4) RIC>=70.5 53 10433.890 19.66038 *
     5) RIC< 70.5 43 11112.050 29.37209
      10) RIC< 66 33 8432.545 25.27273 *
      11) RIC>=66 10 294.900 42.90000 *
   3) LocGrpID>=1.5 690 196494.400 30.20000
     6) RIC< 71.5 438 127844.900 28.34475
      12) RIC>=14.5 411 120569.000 27.69586 *
      13) RIC< 14.5 27 4468.667 38.22222 *
     7) RIC>=71.5 252 64521.570 33.42460
      14) Title_Code>=38 20 4712.950 20.45000 *
      15) Title_Code< 38 232 56151.570 34.54310 *
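The printed tree can be read as a set of nested rules: each split sends a record left or right until it reaches a terminal node (marked `*`), whose `yval` is the prediction. The Python sketch below transcribes the seven terminal nodes from the listing above. The function and constant names are assumptions, and the slides do not say what `RIC` measures, so it is treated as an opaque numeric feature.

```python
def predict_yval(loc_grp_id, ric, title_code):
    """Follow the printed splits to a terminal node and return its yval."""
    if loc_grp_id < 1.5:                # node 2
        if ric >= 70.5:
            return 19.66038             # node 4
        if ric < 66:
            return 25.27273             # node 10
        return 42.90000                 # node 11
    if ric < 71.5:                      # node 6
        if ric >= 14.5:
            return 27.69586             # node 12
        return 38.22222                 # node 13
    if title_code >= 38:                # node 7
        return 20.45000                 # node 14
    return 34.54310                     # node 15


# (n, yval) for each terminal node, copied from the listing.
LEAVES = [
    (53, 19.66038), (33, 25.27273), (10, 42.90000), (411, 27.69586),
    (27, 38.22222), (20, 20.45000), (232, 34.54310),
]


def weighted_leaf_mean(leaves):
    """Record-weighted mean of leaf predictions; should match the root yval."""
    total = sum(n for n, _ in leaves)
    return sum(n * y for n, y in leaves) / total


if __name__ == "__main__":
    print(predict_yval(loc_grp_id=1, ric=80, title_code=10))
    print(round(weighted_leaf_mean(LEAVES), 5))
```

Checking that the leaf sizes add up to the root's `n` and that the weighted leaf mean reproduces the root `yval` is a quick way to confirm the listing was transcribed correctly.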

First Results

Q: Given the attributes of a registered reader, which content types will they read?

Method: Decision Tree Induction

[Figure: decision tree; visible leaf labels: 20.45 (n = 20), 35.54 (n = 232)]

First Results

Q: Given registered reader attributes, which types of content will they read?

Method: Kernel SVM with Gaussian Kernel

Overall Training Error = .569975

Class legend:

15 |Industries|Hacking          24 |Online|Email                   37 |Services|Security
16 |Industries|IT Management    25 |Online|IM                      42 |Software|Business
17 |Industries|Legal            26 |Online|News                    43 |Software|Consumer
18 |Industries|News             27 |Online|Portal                  44 |Software|Networking
20 |Industries|PCs              30 |Online|Search                  45 |Software|Operating Systems
21 |Industries|Standards        33 |Online|Software as a Service   46 |Software|Software Development

Confusion matrix (rows: predicted class; columns: true class):

Pred |   0   1   2   5   6   7   9  10  12  13  16  17  18  20  24  25  27  30  33  42  43  44  45  46 | % Predictions Were Accurate
   2 |   0   0   2   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1 | 67%
   6 |   0   0   0   0  15   0   0   1   2   1   0   3   0   0   0   0   0   0   0   1   0   0   1   1 | 60%
  12 |   0   0   3   0   9   0   0   1  33   1   0   1   5   0   0   0   1   2   0  12   1   3   7   4 | 40%
  16 |   0   0   0   0   1   0   0   0   0   0   5   0   0   0   0   0   0   0   0   0   0   0   0   0 | 83%
  42 |   3   0  21   5  29   0   2   1  34   3   5   1  17   1   0   0   5   4   1 151   0   1  44   9 | 45%
  45 |   0   2  19   6  20   3   3   4  18  10  10   0  16   2   2   1   2   5   0  42   1   3 126  28 | 39%
  46 |   0   0   0   0   0   0   0   0   2   0   0   0   0   0   0   0   0   0   0   1   0   0   0   6 | 67%

% In Class Pred:
     |  0%  0%  4%  0% 20%  0%  0%  0% 37%  0% 25%  0%  0%  0%  0%  0%  0%  0%  0% 73%  0%  0% 71% 12%
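The percentages on this slide can be recomputed from the raw counts. The sketch below, with assumed function names, derives the row percentages (share of predictions for a class that were right), the column percentages (share of a true class that was predicted correctly), and the overall training error from the confusion matrix as transcribed from the slide.

```python
# Column order of the true classes, as on the slide.
TRUE_CLASSES = [0, 1, 2, 5, 6, 7, 9, 10, 12, 13, 16, 17,
                18, 20, 24, 25, 27, 30, 33, 42, 43, 44, 45, 46]

# Rows: predicted class -> counts per true class (same column order).
CONFUSION = {
    2:  [0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    6:  [0, 0, 0, 0, 15, 0, 0, 1, 2, 1, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
    12: [0, 0, 3, 0, 9, 0, 0, 1, 33, 1, 0, 1, 5, 0, 0, 0, 1, 2, 0, 12, 1, 3, 7, 4],
    16: [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    42: [3, 0, 21, 5, 29, 0, 2, 1, 34, 3, 5, 1, 17, 1, 0, 0, 5, 4, 1, 151, 0, 1, 44, 9],
    45: [0, 2, 19, 6, 20, 3, 3, 4, 18, 10, 10, 0, 16, 2, 2, 1, 2, 5, 0, 42, 1, 3, 126, 28],
    46: [0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 6],
}


def correct_count(cls):
    """Records of true class cls that were also predicted as cls."""
    if cls not in CONFUSION:
        return 0  # this class was never predicted
    return CONFUSION[cls][TRUE_CLASSES.index(cls)]


def prediction_accuracy(pred):
    """Share of predictions for class pred that were right (row percentage)."""
    return correct_count(pred) / sum(CONFUSION[pred])


def in_class_predicted(true_cls):
    """Share of true class true_cls predicted correctly (column percentage)."""
    col = TRUE_CLASSES.index(true_cls)
    total = sum(row[col] for row in CONFUSION.values())
    return correct_count(true_cls) / total if total else 0.0


def overall_error():
    """Fraction of all records whose predicted class differs from the true one."""
    total = sum(sum(row) for row in CONFUSION.values())
    correct = sum(correct_count(c) for c in CONFUSION)
    return 1 - correct / total


if __name__ == "__main__":
    print(f"overall training error: {overall_error():.6f}")
    for cls in CONFUSION:
        print(cls, f"{prediction_accuracy(cls):.0%}", f"{in_class_predicted(cls):.0%}")
```

Only seven of the 24 classes are ever predicted, which is why most column percentages are 0% and why the error stays high despite reasonable results on classes 42 and 45.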

Defining Project Success

Success for this project could come in different forms:

• Insights gained on any of the six questions within the project’s scope;

- and/or –

• Insight into how TechJournal should modify its data capture policies to facilitate data mining for the answers to these questions in the future

Questions/Comments
