Data Mining At Tech Journal

Jan 03, 2016

Transcript
Page 1: Data Mining At Tech Journal

Data Mining At Tech Journal

Page 2: Data Mining At Tech Journal

Agenda

• Background

• Questions of Interest

• Data Overview

• Selected Approach

• Potential Issues

• Current Status

• First Results

Page 4: Data Mining At Tech Journal

The Company

• A US company (“TechJournal”) publishes an on-line journal (“TechPub”) with content specifically aimed at IT professionals

• TechJournal is 15 years old; TechPub is 5 years old

• Content for TechPub comes from three sources:

– Aggregated content from public sources

– TechJournal created content

– Peer contributed content

• TechJournal’s core business is to produce a high-end list product for the marketing departments of IT manufacturers

Page 5: Data Mining At Tech Journal

The Journal

• The content on the publication website is available to both anonymous and registered users

• Registered users get access to some premium services as well

• Most content is free. Some whitepapers are for sale.

• Three unique features of the site:

– Peer contributed content

– Auction system: readers get paid to contribute content

– New: personalized content for each reader

Page 6: Data Mining At Tech Journal

The Readers

• Target: IT professionals involved in their organization’s technology purchasing decisions

• Different levels of “readership” (figure: number of individuals at each level):

– E-mail recipients

– Anonymous visits

– E-mail recipients, visited site

– E-mail recipients, repeat visitor

– Registered, light reader

– Registered, heavy reader

• The company continuously tries to stimulate new readership through e-mail campaigns

Page 7: Data Mining At Tech Journal

The Business Model

[Figure: causal loop diagram of the business model]

Nodes: TechPub Reader Activity; Knowledge of Readers' Interests; Quality of List Product; List Value to Technology Manufacturers; Gathering New Content; New Readers: Reader Word of Mouth; New Readers: Company Prospecting; Company Resources for Reinvestment; Total Readers; Tuning of Content

Loops:

• “Active Readers Produce Better Lists” Loop

• “Known Readers Make For Better Journal” Loop

• “Success Breeds Success” Loop

• “Buzz Marketing” Loop

Page 9: Data Mining At Tech Journal

Focal Areas For Data Mining

[Figure: the business model causal loop diagram, annotated with the questions below]

Nodes: TechPub Reader Activity; Knowledge of Readers' Interests; Quality of List Product; List Value to Technology Manufacturers; Gathering New Content; New Readers: Reader Word of Mouth; New Readers: Company Prospecting; Company Resources for Reinvestment; Total Readers; Tuning of Content

Loops shown: “Known Readers Make For Better Journal” Loop; “Active Readers Produce Better Lists” Loop; “Success Breeds Success” Loop

Questions:

• Is TechJournal’s current content taxonomy effective, or would some other content taxonomy be more useful?

• Given email recipient attributes, what is the likelihood of a visit to the website?

• Which content headlines would maximize that visit likelihood?

• Given registered readers’ attributes, which stories will they be interested in?

• Given past stories read, what is a registered reader most likely to also read?

• Given registered readers’ attributes, which will be most active?

Page 11: Data Mining At Tech Journal

The Data

My “Chunk of Data” to Mine:

• Issues Table: 713,110 records

• Issues - Content Linker Table: 2,185,664 records

• Content Items Table: 590 records

• Page Visit Table: 43,580 records

• Recipients Table: 195,455 records

• Taxonomy Click Table: 9,385 records

Page 12: Data Mining At Tech Journal

Attributes to Work With

Reader Attributes

– Primary key: Recipient ID, IP Address

– Data mining attributes: Title, City, State, Country, Zip, Phone; IT Budget, Employees, Sales, SIC Code, Industry; Time Sent, Time Opened, Time of Visit, Time Content Click

Content Attributes

– Primary key: Content ID, Issue ID

– Data mining attributes: Abstract, Headline Main, Content Type, Media Type, Author, Content Taxonomy, Click Rate

Format Attributes

– Data mining attributes: Template Type, Media Type (HTML or Video)

Legend: highlighted attributes are features that can be used directly, or derived from, for classification

Page 13: Data Mining At Tech Journal

Creating Content Classes

TechJournal’s current taxonomy for classifying content:

• Manually derived

• Aggregation of other credible taxonomy fragments

• From a content provider point of view

• Goes out to 21 levels in some cases; others are as shallow as three

Level     Classes
1         1
2         5
3         46
4         798
5         1909
... 21    5000 +

9,750 visits spread over 31 classes:

#Visits   ContentClass
2925      |Software|Business
2736      |Hardware|Storage
1187      |Software|Operating Systems
670       |Hardware|Networking
314       |Software|Software Development
282       |Hardware|Computers
278       |Industries|News
131       |Hardware|Telecom
118       |Industries|IT Management
97        |Hardware|Mobile Devices
75        |Online|Search
53        |Online|Portal
42        |Hardware|Printers
40        |Software|Consumer
38        |Industries|PCs
36        |Industries|Legal
32        |Hardware|Power
28        |Software|Networking
21        |Hardware|News
13        |Industries|Standards
8         |Hardware
8         |Industries|Hacking
7         |Online|News
4         |Online|Software as a Service
4         |Hardware|Chips
4         |Services|Disaster Recovery
3         |Online|Email
3         |Online|IM
2         |Services|Security
1         |Hardware|Software
1         |Services|Software Development
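The visit counts above are heavily skewed toward a few classes, which matters for any classifier trained on them. The following is a minimal Python sketch that quantifies the skew from the counts as listed. The function names and the "fewer than 10 visits" threshold are illustrative assumptions, not part of the original analysis. Note that the listed counts sum to 9,161 rather than the 9,750 stated on the slide, so shares here are computed against the listed total.

```python
# Visit counts per content class, copied from the slide.
VISITS = {
    "|Software|Business": 2925, "|Hardware|Storage": 2736,
    "|Software|Operating Systems": 1187, "|Hardware|Networking": 670,
    "|Software|Software Development": 314, "|Hardware|Computers": 282,
    "|Industries|News": 278, "|Hardware|Telecom": 131,
    "|Industries|IT Management": 118, "|Hardware|Mobile Devices": 97,
    "|Online|Search": 75, "|Online|Portal": 53, "|Hardware|Printers": 42,
    "|Software|Consumer": 40, "|Industries|PCs": 38, "|Industries|Legal": 36,
    "|Hardware|Power": 32, "|Software|Networking": 28, "|Hardware|News": 21,
    "|Industries|Standards": 13, "|Hardware": 8, "|Industries|Hacking": 8,
    "|Online|News": 7, "|Online|Software as a Service": 4,
    "|Hardware|Chips": 4, "|Services|Disaster Recovery": 4,
    "|Online|Email": 3, "|Online|IM": 3, "|Services|Security": 2,
    "|Hardware|Software": 1, "|Services|Software Development": 1,
}


def top_share(counts, k):
    """Fraction of all listed visits that fall in the k most-visited classes."""
    ordered = sorted(counts.values(), reverse=True)
    return sum(ordered[:k]) / sum(ordered)


def rare_classes(counts, threshold=10):
    """Classes with fewer than `threshold` visits (threshold is an assumption)."""
    return [name for name, n in counts.items() if n < threshold]


if __name__ == "__main__":
    print(f"classes: {len(VISITS)}, listed visits: {sum(VISITS.values())}")
    print(f"share of top 3 classes: {top_share(VISITS, 3):.1%}")
    print(f"classes under 10 visits: {len(rare_classes(VISITS))}")
```

Roughly three quarters of the listed visits fall in the top three classes, so merging or dropping the sparse classes before modeling is one option worth considering.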

Page 15: Data Mining At Tech Journal

A Variety of Approaches

PREDICTIVE MODELING

• Given email recipient attributes, what is the likelihood of a visit to the website?

• Which content headline would maximize that visit likelihood?

• Given registered readers’ attributes, which readers will be most active?

• Given registered readers’ attributes, which types of content will they read?

CLUSTER ANALYSIS

• Is TechJournal’s current content taxonomy effective, or would some other taxonomy be more useful?

ASSOCIATION ANALYSIS

• Given past stories read, what is a registered reader most likely to also read?
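As an illustration of the association-analysis question ("given past stories read, what is a reader most likely to also read?"), here is a minimal Python sketch that derives pairwise "also read" rules from reading histories using support and confidence. The function names, the `min_support` threshold, and the sample histories are hypothetical; the slides do not say which algorithm or tool was used.

```python
from collections import Counter
from itertools import combinations


def also_read_rules(histories, min_support=2):
    """Return {(a, b): confidence} for rules "readers of a also read b".

    histories   -- one iterable of story identifiers per reader
    min_support -- minimum number of readers who read both stories
    """
    item_counts = Counter()
    pair_counts = Counter()
    for stories in histories:
        unique = sorted(set(stories))          # count each story once per reader
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = {}
    for (a, b), n_both in pair_counts.items():
        if n_both >= min_support:
            rules[(a, b)] = n_both / item_counts[a]   # confidence of a -> b
            rules[(b, a)] = n_both / item_counts[b]   # confidence of b -> a
    return rules


def recommend(rules, story, top_n=3):
    """Stories most likely to be read by someone who read `story`."""
    candidates = [(conf, b) for (a, b), conf in rules.items() if a == story]
    return [b for conf, b in sorted(candidates, reverse=True)[:top_n]]


if __name__ == "__main__":
    # Hypothetical reading histories, one list per registered reader.
    sample = [
        ["storage", "backup"],
        ["storage", "backup", "linux"],
        ["storage", "linux"],
        ["linux"],
    ]
    rules = also_read_rules(sample)
    print(recommend(rules, "backup"))
```

Pairwise rules keep the sketch short; a full frequent-itemset method such as Apriori would extend the same counting to larger sets of stories.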

Page 17: Data Mining At Tech Journal

Potential Issues

• Database evolution produces noisy, dirty, unevenly populated data

• Data comes from multiple sources; producing consistent data has been a challenge

• Still not clear whether we will end up with enough data to see anything meaningful

• Content taxonomy is relatively new and most likely has real problems with how it is structured

• Taxonomy measures article subject matter, but behavior-stimulating content may be in headlines

• Features are somewhat related: Industry, Location, Size, Title, Sales, Employees

• Features have a high number of discrete values and need to be put into meaningful groupings

• Under-representation of several feature and class values

Page 18: Data Mining At Tech Journal

Feature Grouping - Location

[Figure: locations assigned to groups 1 through 10, with “Other” as group 11]
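A grouping like this can be applied with a simple lookup that falls back to the "Other" group. The Python sketch below assumes the grouping is keyed on the State attribute, and the three state-to-group assignments are placeholders: the actual membership of groups 1 to 10 is not recoverable from the slide.

```python
from typing import Optional

OTHER_GROUP = 11  # "Other" is group 11 on the slide

# Hypothetical assignments for illustration only; the real group
# boundaries are not given in the source.
STATE_TO_GROUP = {
    "CA": 1,
    "NY": 2,
    "TX": 3,
}


def location_group(state: Optional[str]) -> int:
    """Map a raw state value to a location group ID, defaulting to Other."""
    if not state:
        return OTHER_GROUP
    return STATE_TO_GROUP.get(state.strip().upper(), OTHER_GROUP)


if __name__ == "__main__":
    for raw in [" ca ", "NY", "ZZ", None]:
        print(repr(raw), "->", location_group(raw))
```

Routing missing and unrecognized values to the catch-all group keeps the feature fully populated, which helps with the unevenly populated data noted under Potential Issues.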

Page 19: Data Mining At Tech Journal

Feature Grouping - Title

• Start with ~1000 distinct self-reported Titles in the database

• Most interested in Title as it correlates with impact and influence on IT buying decisions

• Reclassify them based on three concepts: Seniority, Function, Employees in Company

[Figure: title hierarchy. Labels: Owner, Chairman/CEO; Manager of Managers; Manager of Doer; Doer; Assistant; Functional Area 1 ... Functional Area N; Functional Area 1 ... Functional Area 10. Codes shown: 1; 2, 20 - 29; 3, 30 - 39; 4]

Result: 24 Categories
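To make the seniority part of this reclassification concrete, here is a minimal Python sketch that assigns a free-text title to one of the four seniority levels named on the slide. The keyword lists and the matching rule are hypothetical, and the sketch does not attempt the Function or Employees dimensions or the slide's numeric codes.

```python
import re

# Seniority levels taken from the slide; keyword lists are illustrative
# assumptions, checked from most to least senior.
SENIORITY_KEYWORDS = [
    ("Owner/Chairman/CEO", ("owner", "chairman", "ceo", "president")),
    ("Manager of Managers", ("vice president", "vp", "director", "cio", "cto")),
    ("Manager of Doer", ("manager", "supervisor", "lead")),
]
DEFAULT_LEVEL = "Doer"


def seniority(title: str) -> str:
    """Classify a self-reported title by whole-word keyword match."""
    words = re.findall(r"[a-z]+", title.lower())
    normalized = " " + " ".join(words) + " "
    # "vice president" must not be caught by the "president" keyword.
    normalized = normalized.replace(" vice president ", " vp ")
    for level, keywords in SENIORITY_KEYWORDS:
        if any(f" {kw} " in normalized for kw in keywords):
            return level
    return DEFAULT_LEVEL


if __name__ == "__main__":
    for t in ["CEO", "Vice President, IT", "Network Manager", "Systems Administrator"]:
        print(f"{t!r} -> {seniority(t)}")
```

With about 1000 distinct titles, a keyword pass like this would likely need manual review of whatever falls through to the default level.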

Page 21: Data Mining At Tech Journal

Where I Am In The Process

Problem Definition -> Data Gathering -> Data Prep -> Data Mining -> Results Analysis / Visualization -> Sum Up Insights

Page 23: Data Mining At Tech Journal

First Results

Q: Given registered readers’ attributes, which readers will be most active?

Method: Decision Tree Induction (Training Set 599 Records, Test Set 187 Records)

MSE on Training Set = .1313

MSE on Test Set = .1451

[Figure: decision tree; visible leaf labels: 0.7037 (n = 27), 0.1429 (n = 7)]
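The training and test figures above come from a hold-out evaluation. The Python sketch below shows how such a 599/187 split and the MSE metric could be computed. The random seed, the function names, and the use of a random shuffle are assumptions, since the slides do not describe how the records were partitioned.

```python
import random


def split_indices(n_records, n_train, seed=0):
    """Shuffle record indices and split them into training and test sets."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


def mse(actual, predicted):
    """Mean squared error between observed and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


if __name__ == "__main__":
    # 599 training + 187 test records, as on the slide.
    train_idx, test_idx = split_indices(599 + 187, 599)
    print(len(train_idx), len(test_idx))

    # Toy illustration of the metric, not the project's data.
    print(mse([1, 0, 1, 0], [0.7, 0.1, 0.7, 0.1]))
```

A test MSE only slightly above the training MSE, as reported here, is the pattern one would expect from a tree that is not badly overfit.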

Page 24: Data Mining At Tech Journal

First Results

Q: Given the attributes of a registered reader, which content types will they read?

Method: Decision Tree Induction

n= 786

node), split, n, deviance, yval
      * denotes terminal node

 1) root 786 223508.000 29.44402
   2) LocGrpID< 1.5 96 23784.990 24.01042
     4) RIC>=70.5 53 10433.890 19.66038 *
     5) RIC< 70.5 43 11112.050 29.37209
      10) RIC< 66 33 8432.545 25.27273 *
      11) RIC>=66 10 294.900 42.90000 *
   3) LocGrpID>=1.5 690 196494.400 30.20000
     6) RIC< 71.5 438 127844.900 28.34475
      12) RIC>=14.5 411 120569.000 27.69586 *
      13) RIC< 14.5 27 4468.667 38.22222 *
     7) RIC>=71.5 252 64521.570 33.42460
      14) Title_Code>=38 20 4712.950 20.45000 *
      15) Title_Code< 38 232 56151.570 34.54310 *

[Figure: decision tree; visible leaf labels: 20.45 (n = 20), 34.54 (n = 232)]
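The printed tree can be read as a set of nested rules. The Python sketch below transcribes the splits and terminal-node values from the listing into a function, as a way of checking how a record is routed. The function and argument names are illustrative, and the meaning of the RIC and Title_Code variables is taken as given from the listing.

```python
def predict(loc_grp_id: float, ric: float, title_code: float) -> float:
    """Return the terminal-node value (yval) for one record."""
    if loc_grp_id < 1.5:                 # node 2
        if ric >= 70.5:
            return 19.66038              # node 4
        if ric < 66:
            return 25.27273              # node 10
        return 42.90000                  # node 11
    if ric < 71.5:                       # node 6
        if ric >= 14.5:
            return 27.69586              # node 12
        return 38.22222                  # node 13
    if title_code >= 38:                 # node 7
        return 20.45000                  # node 14
    return 34.54310                      # node 15


# (n, yval) for each terminal node, copied from the listing.
LEAVES = [
    (53, 19.66038), (33, 25.27273), (10, 42.90000), (411, 27.69586),
    (27, 38.22222), (20, 20.45000), (232, 34.54310),
]


def weighted_leaf_mean(leaves):
    """Record-weighted mean of leaf values; should match the root yval."""
    return sum(n * y for n, y in leaves) / sum(n for n, _ in leaves)


if __name__ == "__main__":
    print(predict(loc_grp_id=2, ric=80, title_code=38))
    print(sum(n for n, _ in LEAVES), round(weighted_leaf_mean(LEAVES), 3))
```

The leaf counts add up to the 786 records at the root, and their weighted mean reproduces the root value of about 29.444, which is a quick consistency check on the transcription.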

Page 25: Data Mining At Tech Journal

First Results

Q: Given registered reader attributes, which types of content will they read?

Method: Kernel SVM with Gaussian Kernel

Overall Training Error = .569975

Class IDs (partial legend):

15 |Industries|Hacking          24 |Online|Email                   37 |Services|Security
16 |Industries|IT Management    25 |Online|IM                      42 |Software|Business
17 |Industries|Legal            26 |Online|News                    43 |Software|Consumer
18 |Industries|News             27 |Online|Portal                  44 |Software|Networking
20 |Industries|PCs              30 |Online|Search                  45 |Software|Operating Systems
21 |Industries|Standards        33 |Online|Software as a Service   46 |Software|Software Development

Confusion matrix (rows: predicted class; columns: true class):

% Predictions                                              True
Were Accurate  Pred   0   1   2   5   6   7   9  10  12  13  16  17  18  20  24  25  27  30  33  42  43  44  45  46
          67%     2   0   0   2   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1
          60%     6   0   0   0   0  15   0   0   1   2   1   0   3   0   0   0   0   0   0   0   1   0   0   1   1
          40%    12   0   0   3   0   9   0   0   1  33   1   0   1   5   0   0   0   1   2   0  12   1   3   7   4
          83%    16   0   0   0   0   1   0   0   0   0   0   5   0   0   0   0   0   0   0   0   0   0   0   0   0
          45%    42   3   0  21   5  29   0   2   1  34   3   5   1  17   1   0   0   5   4   1 151   0   1  44   9
          39%    45   0   2  19   6  20   3   3   4  18  10  10   0  16   2   2   1   2   5   0  42   1   3 126  28
          67%    46   0   0   0   0   0   0   0   0   2   0   0   0   0   0   0   0   0   0   0   1   0   0   0   6
% In Class Pred      0%  0%  4%  0% 20%  0%  0%  0% 37%  0% 25%  0%  0%  0%  0%  0%  0%  0%  0% 73%  0%  0% 71% 12%
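The percentages on this slide follow directly from the confusion matrix counts. The Python sketch below recomputes the row percentages ("% Predictions Were Accurate"), the column percentages ("% In Class Pred") and the overall training error from the counts as transcribed. The variable and function names are illustrative.

```python
# True-class IDs in column order, as on the slide.
TRUE_CLASSES = [0, 1, 2, 5, 6, 7, 9, 10, 12, 13, 16, 17,
                18, 20, 24, 25, 27, 30, 33, 42, 43, 44, 45, 46]

# One row per predicted class; entries are counts per true class.
CONFUSION = {
    2:  [0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    6:  [0, 0, 0, 0, 15, 0, 0, 1, 2, 1, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
    12: [0, 0, 3, 0, 9, 0, 0, 1, 33, 1, 0, 1, 5, 0, 0, 0, 1, 2, 0, 12, 1, 3, 7, 4],
    16: [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    42: [3, 0, 21, 5, 29, 0, 2, 1, 34, 3, 5, 1, 17, 1, 0, 0, 5, 4, 1, 151, 0, 1, 44, 9],
    45: [0, 2, 19, 6, 20, 3, 3, 4, 18, 10, 10, 0, 16, 2, 2, 1, 2, 5, 0, 42, 1, 3, 126, 28],
    46: [0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 6],
}


def precision(pred_class):
    """Share of predictions of `pred_class` that were correct (row percentage)."""
    row = CONFUSION[pred_class]
    return row[TRUE_CLASSES.index(pred_class)] / sum(row)


def recall(true_class):
    """Share of records in `true_class` predicted correctly (column percentage)."""
    col = TRUE_CLASSES.index(true_class)
    column_total = sum(row[col] for row in CONFUSION.values())
    hits = CONFUSION[true_class][col] if true_class in CONFUSION else 0
    return hits / column_total if column_total else 0.0


def training_error():
    """Overall fraction of misclassified training records."""
    total = sum(sum(row) for row in CONFUSION.values())
    correct = sum(row[TRUE_CLASSES.index(p)] for p, row in CONFUSION.items())
    return 1 - correct / total


if __name__ == "__main__":
    for p in CONFUSION:
        print(f"pred {p:>2}: precision {precision(p):.0%}, recall {recall(p):.0%}")
    print(f"overall training error: {training_error():.6f}")
```

The model only ever predicts 7 of the 24 classes, and most of its correct predictions fall in classes 42 and 45, the two largest, which is consistent with the under-representation issue raised earlier.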

Page 26: Data Mining At Tech Journal

Defining Project Success

Success for this project could come in different forms:

• Insights gained on any of the six questions within the project’s scope;

and/or

• Insight into how TechJournal should modify its data capture policies to facilitate data mining for the answers to these questions in the future

Page 27: Data Mining At Tech Journal

Questions/Comments
