FeatureRank - An Insight Data Science Consulting Project
Kuhan Wang
October 8th, 2015

Insight Consulting Project

Apr 12, 2017

Data & Analytics

Kuhan Wang
Transcript
Page 1: Insight Consulting Project

FeatureRank - An Insight Data Science Consulting Project

Kuhan Wang

October 8th, 2015

Page 2: Insight Consulting Project

Consulting Scenario

Company X wishes to maximize user engagement through optimal placement of advertisements on content URLs.

Ad Type: Tourism
- Keyword: Cuba
- Keyword: Package Tour
- Keyword: Airplane

Ad Type X
- Keyword 1
- Keyword 2
- Keyword 3
- ...
- Keyword N

Example: Tourism ads not ideal on investment content URL.

Page 3: Insight Consulting Project

A Pipeline to Analyze Textual Features

Developed and implemented a pipeline to analyze the importance of textual features on content URLs relative to engagement.

Begin -> Scrape URL -> Process Text -> Model Features -> Extract Keywords -> Update Keywords -> Collect Data, Reiterate

Page 4: Insight Consulting Project

User Engagement Data

[Figure: Summary of Engagement Data. Axes: Occurrences, Counts. Series: Page Loaded, Ad Viewed, Ad Clicked.]

Page 5: Insight Consulting Project

Modeling

Attempted linear regression.
Classify engagement as yes/no.
- Features are bags of words from content URL.

[Figure: Logistic Classification Model. Axes: Word Count (0 to 10), Probability [%] (0 to 1). Series: Ad Clicked, Ad Not Clicked.]
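As an illustration of the yes/no classification on bag-of-words features, here is a minimal pure-Python sketch of logistic regression fitted by gradient descent. It is not the delivered pipeline: the vocabulary, the toy pages, the labels, and all function names are assumptions made up for this example.

```python
import math
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    """Probability of engagement (ad clicked) for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


# Hypothetical toy data: (page text, 1 if the ad was clicked else 0).
vocabulary = ["mortgage", "loan", "beach"]
pages = [
    ("mortgage loan rates loan", 1),
    ("loan mortgage advice", 1),
    ("beach holiday beach", 0),
    ("sunny beach trip", 0),
]
X = [bag_of_words(text, vocabulary) for text, _ in pages]
y = [label for _, label in pages]

w, b = train_logistic(X, y)
p_finance = predict_proba(bag_of_words("mortgage loan", vocabulary), w, b)
p_travel = predict_proba(bag_of_words("beach beach", vocabulary), w, b)
```

In a model of this kind, the sign and size of each learned weight indicate how strongly a word is associated with engagement, which is one way keyword rankings could be derived.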

Page 6: Insight Consulting Project

Validation

Randomly split data into training/test sets.
- Generate distribution of validation scores.

[Figure: Distribution of Precision vs Recall. Axes: Precision (0.55 to 0.85), Recall (0.4 to 0.85), Number of MC Toys (0 to 30). Marker: ⟨Precision, Recall⟩.]
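The validation procedure can be sketched as repeated random train/test splits, each yielding one (precision, recall) pair, so that many trials build up a distribution of scores. The sketch below is an assumption-laden illustration: the single-score threshold model, the synthetic data, and the function names are hypothetical stand-ins, not the project's actual classifier or data.

```python
import random


def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = engaged)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def fit_threshold(train):
    """Hypothetical stand-in model: midpoint between the two class means."""
    pos = [x for x, label in train if label == 1]
    neg = [x for x, label in train if label == 0]
    if not pos or not neg:
        return 0.0  # fallback when a class is missing from the training split
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def validation_scores(data, n_trials=100, test_fraction=0.3, seed=0):
    """One (precision, recall) pair per random train/test split."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        test, train = shuffled[:n_test], shuffled[n_test:]
        threshold = fit_threshold(train)
        y_true = [label for _, label in test]
        y_pred = [1 if x > threshold else 0 for x, _ in test]
        scores.append(precision_recall(y_true, y_pred))
    return scores


# Synthetic one-dimensional scores, purely for demonstration.
data_rng = random.Random(1)
data = [(data_rng.gauss(1.0, 1.0), 1) for _ in range(100)]
data += [(data_rng.gauss(-1.0, 1.0), 0) for _ in range(100)]

scores = validation_scores(data)
mean_precision = sum(p for p, _ in scores) / len(scores)
mean_recall = sum(r for _, r in scores) / len(scores)
```

Histogramming `scores` in two dimensions would give a plot of the same kind as the figure, with the mean pair as the marker.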

Page 7: Insight Consulting Project

Deliverables

Extracted keywords:

Rank  Ad Type 1  Ad Type 2       Ad Type 3    Ad Type 4
1     debt       coordinator     mortgage     gold
2     gift       administrative  home         0
3     profit     minimum         procurement  stock
4     check      minimum wage    loan         fund
5     balance    reports         trustee      event

Pipeline in Python is delivered to the company for implementation.

Page 8: Insight Consulting Project

About Myself

PhD in Particle Physics, McGill University; researcher on the Large Hadron Collider.

Led the search for black holes and string objects as part of the ATLAS Collaboration.

About project and myself at http://kuhanw.zohosites.com/.

Page 9: Insight Consulting Project

Backup

[Figure: Ad Type 1. Axes: Feature Frequency/Documents (0 to 1), Relative Number of Documents [%] (log scale, 10^-4 to 1).]

Page 10: Insight Consulting Project

Page 11: Insight Consulting Project

FeatureRank

Kuhan Wang, Insight Data Science

October 2, 2015

Abstract

FeatureRank is a software tool for extracting correlations between text n-gram features and user engagement, thereby optimizing the placement of financial widgets on URL articles.
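For readers unfamiliar with n-gram features, the following minimal sketch shows one way to count unigrams and bigrams in a piece of text. The whitespace tokenization and the function names are assumptions for illustration; the tool's own feature extraction may differ.

```python
from collections import Counter


def ngrams(text, n):
    """Return all runs of n consecutive lowercase tokens, joined by spaces."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, max_n=2):
    """Count every n-gram of length 1 up to max_n in the text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(text, n))
    return counts


counts = ngram_counts("minimum wage rises as minimum wage law passes")
```

Counts like these, computed per URL, can serve as the feature vector that is later compared against engagement.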

1 Directory Structure

• /

processing.py

Pre-processing to parse relevant information from engagement csv files.

crawl.py

A simple web crawler that pulls the title and <p> tag text from URLs.

FeatureRank.py

Driver file to execute main functions.

feature_extraction_model.py

The core program that contains the machine learning algorithms.

post_processing.py

Post processing to produce evaluation metrics and ngram rankings.

web_text_data_set_1_2.json

A file containing the sorted JSON dictionaries of each URL; this is the input to FeatureRank.

read_json.py

A script converting the JSON file into a format that can be read into the model learning functions.
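To make the crawler's extraction step concrete, here is a minimal sketch that pulls the title and <p> tag text out of an HTML string using only the Python standard library. It is not the contents of crawl.py: the class and function names are hypothetical, and fetching the page from a URL is left out.

```python
from html.parser import HTMLParser


class TitleAndParagraphParser(HTMLParser):
    """Collect the <title> text and the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self._current = None  # "title", "p", or None when outside both
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = "title"
        elif tag == "p":
            self._current = "p"
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        # Text inside nested inline tags (for example <b>) is kept as well.
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


def extract_title_and_paragraphs(html):
    parser = TitleAndParagraphParser()
    parser.feed(html)
    parser.close()
    paragraphs = [p.strip() for p in parser.paragraphs if p.strip()]
    return parser.title.strip(), paragraphs


sample = (
    "<html><head><title>Rates</title></head><body>"
    "<p>Fixed <b>mortgage</b> rates.</p><div>skip</div><p>Loans.</p>"
    "</body></html>"
)
title, paragraphs = extract_title_and_paragraphs(sample)
```

The resulting title and paragraph text could then be stored per URL, for instance as one JSON dictionary per page.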
