FeatureRank - An Insight Data Science Consulting Project
Kuhan Wang
October 8th, 2015
Consulting Scenario
Company X wishes to maximize user engagement through optimal placement of advertisements on content URLs.
Ad Type: Tourism
Keywords: Cuba, Package Tour, Airplane
Ad Type X
Keywords: Keyword 1, Keyword 2, Keyword 3, ..., Keyword N
Example: tourism ads are not ideal on an investment content URL.
A Pipeline to Analyze Textual Features
Developed and implemented a pipeline to analyze the importance of textual features on content URLs relative to engagement.
Pipeline stages: Begin -> Scrape URL -> Process Text -> Model Features -> Extract Keywords -> Update Keywords -> Collect Data, Reiterate
User Engagement Data
Figure: Summary of Engagement Data (x-axis: Occurrences; y-axis: Counts; series: Page Loaded, Ad Viewed, Ad Clicked).
Modeling
Attempted linear regression.
Classify engagement as yes/no.
- Features are bags of words from the content URL.
Figure: Logistic Classification Model (x-axis: Word Count; y-axis: Probability [%]; series: Ad Clicked, Ad Not Clicked).
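A minimal sketch of the yes/no classification described on this slide, assuming a plain bag-of-words count per page and a logistic model fitted by batch gradient descent. The toy pages, labels, and function names are illustrative and are not taken from the delivered pipeline, which may well use a library implementation instead.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lowercase whitespace-separated tokens in a page's text."""
    return Counter(text.lower().split())


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, counts):
    """Probability that an ad on a page with these word counts is clicked."""
    z = bias + sum(weights.get(w, 0.0) * c for w, c in counts.items())
    return sigmoid(z)


def fit_logistic(pages, labels, lr=0.1, epochs=500):
    """Fit per-word weights by batch gradient descent on the log-loss."""
    weights, bias = {}, 0.0
    n = len(pages)
    for _ in range(epochs):
        grad_w, grad_b = Counter(), 0.0
        for counts, y in zip(pages, labels):
            err = predict_proba(weights, bias, counts) - y
            grad_b += err
            for w, c in counts.items():
                grad_w[w] += err * c
        bias -= lr * grad_b / n
        for w, g in grad_w.items():
            weights[w] = weights.get(w, 0.0) - lr * g / n
    return weights, bias


# Hypothetical toy data: page text and whether the ad was clicked (1) or not (0).
texts = [
    "mortgage loan rates home",
    "home loan mortgage advice",
    "beach holiday package tour",
    "cheap flights package holiday",
]
clicked = [1, 1, 0, 0]

pages = [bag_of_words(t) for t in texts]
weights, bias = fit_logistic(pages, clicked)

# Words ranked by learned weight; the largest weights favor a click.
ranking = sorted(weights, key=weights.get, reverse=True)
```

Sorting words by their learned weight is one simple way to turn such a model into a keyword ranking per ad type.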
Validation
Randomly split data into training/test sets.
- Generate a distribution of validation scores.
Figure: Distribution of Precision vs Recall (x-axis: Precision; y-axis: Recall; color scale: Number of MC Toys; marker: ⟨Precision, Recall⟩).
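The validation procedure on this slide can be sketched as repeated random train/test splits, each yielding one (precision, recall) pair; the collection of pairs is the kind of distribution the figure summarizes. The helper names, the 70/30 split, the toy samples, and the stand-in keyword classifier are all assumptions made for illustration.

```python
import random


def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def random_split(samples, test_fraction, rng):
    """Shuffle a copy of the samples and split it into (train, test)."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def validation_scores(samples, train_fn, n_trials=100, test_fraction=0.3, seed=0):
    """Collect one (precision, recall) pair per random train/test split.

    train_fn(train) must return a function mapping a text to 0 or 1.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        train, test = random_split(samples, test_fraction, rng)
        predict = train_fn(train)
        y_true = [label for _, label in test]
        y_pred = [predict(text) for text, _ in test]
        scores.append(precision_recall(y_true, y_pred))
    return scores


def train_keyword_rule(train):
    """Stand-in model (not the project's classifier): predict a click when
    the text contains a word seen only in clicked training pages."""
    clicked_words, other_words = set(), set()
    for text, label in train:
        (clicked_words if label else other_words).update(text.split())
    keywords = clicked_words - other_words
    return lambda text: int(any(w in keywords for w in text.split()))


# Hypothetical toy samples: (page text, ad clicked?).
samples = [
    ("mortgage loan home", 1), ("home loan rates", 1), ("loan fund stock", 1),
    ("mortgage advice", 1), ("stock fund gold", 1),
    ("beach holiday tour", 0), ("package tour flights", 0),
    ("holiday package deal", 0), ("cheap flights", 0), ("tour guide", 0),
]

scores = validation_scores(samples, train_keyword_rule, n_trials=50)
mean_precision = sum(p for p, _ in scores) / len(scores)
mean_recall = sum(r for _, r in scores) / len(scores)
```

Fixing the seed keeps the set of splits reproducible, so the spread of scores can be compared between model variants.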
Deliverables
Extracted keywords:
Rank  Ad Type 1  Ad Type 2       Ad Type 3    Ad Type 4
1     debt       coordinator     mortgage     gold
2     gift       administrative  home         0
3     profit     minimum         procurement  stock
4     check      minimum wage    loan         fund
5     balance    reports         trustee      event
The pipeline, written in Python, is delivered to the company for implementation.
About Myself
PhD in Particle Physics, McGill University; researcher on the Large Hadron Collider.
Led the search for black holes and string objects as part of the ATLAS Collaboration.
More about the project and myself at http://kuhanw.zohosites.com/.
Backup
Figure: Ad Type 1 (x-axis: Feature Frequency/Documents; y-axis: Relative Number of Documents [%], logarithmic scale).
FeatureRank
Kuhan Wang (Insight Data Science)
October 2, 2015
Abstract
FeatureRank is a software tool for extracting correlations between text n-gram features and user engagement, thereby optimizing the placement of financial widgets on URL articles.
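For readers unfamiliar with the term, an n-gram is a run of n consecutive tokens. The following is a minimal sketch of n-gram counting, assuming lowercase whitespace tokenization (the tokenization actually used by FeatureRank is not specified here); the function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of contiguous n-token tuples in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, max_n=2):
    """Count all n-grams of length 1..max_n in lowercase, whitespace-split text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(" ".join(gram) for gram in ngrams(tokens, n))
    return counts


# Hypothetical example text, not taken from the project's data set.
counts = ngram_counts("minimum wage and minimum wage reports")
```

With `max_n=2` the counts cover both single words such as "minimum" and two-word phrases such as "minimum wage".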
1 Directory Structure
• /
processing.py
Pre-processing to parse relevant information from engagement csv files.
crawl.py
A simple web crawler that pulls the title and <p> tag text from URLs.
FeatureRank.py
Driver file to execute main functions.
feature_extraction_model.py
The core program that contains the machine learning algorithms.
post_processing.py
Post processing to produce evaluation metrics and ngram rankings.
web_text_data_set_1_2.json
A file containing the sorted JSON dictionaries for each URL; this is the input to FeatureRank.
read_json.py
A script converting the JSON file into a format that can be read by the model learning functions.
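As an illustration of what crawl.py is described as doing (pulling the title and <p> tag text), here is a minimal sketch using only the Python standard library. The class and function names are hypothetical and this is not the project's actual crawler; fetching the page over HTTP is omitted so that the example runs on an inline HTML string.

```python
from html.parser import HTMLParser


class TitleAndParagraphParser(HTMLParser):
    """Collect the <title> text and the text inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            # Start a new paragraph buffer, even if the previous <p> was unclosed.
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            # Text inside nested inline tags (e.g. <b>) is also collected.
            self.paragraphs[-1] += data


def extract_text(html):
    """Return the page title and the whitespace-normalized <p> text."""
    parser = TitleAndParagraphParser()
    parser.feed(html)
    parser.close()
    paragraphs = [" ".join(p.split()) for p in parser.paragraphs]
    return {
        "title": parser.title.strip(),
        "text": " ".join(p for p in paragraphs if p),
    }


# Hypothetical page content used only to exercise the parser.
sample_html = (
    "<html><head><title>Gold prices</title></head><body>"
    "<h1>Skip me</h1><p>Gold <b>fund</b> rises.</p>"
    "<div>nav</div><p>Stock falls.</p></body></html>"
)
page = extract_text(sample_html)
```

A dictionary of this shape per URL could then be serialized with the standard `json` module, in the spirit of the JSON input file listed above.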