Context from Big Data

Transcript
Page 1: Context from Big Data

Context from Big Data
Startup Showcase

IEEE Big Data Conference
November 1, 2015

Santa Clara, CA

Delroy Cameron, Data Scientist

@urxtech | urx.com | [email protected]

Page 2: Context from Big Data

Who is URX?

URX is a mobile technology platform that focuses on publisher monetization, content distribution, and user engagement.

People: URX has 40 people: 75% product/eng, 25% business.

Customers: URX partners with the world's top publishers & advertisers.

Funding: URX raised $15M from Accel, Google Ventures, and others.

Page 3: Context from Big Data

What problem does URX solve?

Page 4: Context from Big Data

URX serves contextually relevant native ads.

URX interprets page context to dynamically determine the best message & action.

Page 5: Context from Big Data

How does URX affect the mobile ecosystem?

Page 6: Context from Big Data

Why is this a Big Data problem?

Volume (Apps), Volume (web pages), Variety (entities)

1.6M apps (Android), 1.5M apps (Apple App Store)

Examples: Rhapsody (Music), Fansided (Sports), Apple (Music, TV, Books)

Source: The Statistics Portal - http://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/

Page 7: Context from Big Data

How do we collect, store, and process the data needed to build our machine learning models?

Page 8: Context from Big Data

Important tasks

1. Data Collection and Parsing
2. Data Storage
   • Persistent Storage
   • Search Index
3. Data Processing
   • Dictionary Building
   • Vectorization (Feature Vector Creation)

Page 9: Context from Big Data

1. Data collection & parsing

Wikipedia Corpus (English)
• 11GB XML dump (gzip file)
• 15M pages (but only 4M articles)
• Wikitext grammar

https://dumps.wikimedia.org/enwiki/latest/

<page>
  <title>AccessibleComputing</title>
  <ns>0</ns>
  <id>10</id>
  <redirect title="Computer accessibility"/>
  <revision>
    <id>631144794</id>
    <parentid>381202555</parentid>
    <timestamp>2014-10-26T04:50:23Z</timestamp>
    <contributor>
      <username>Paine Ellsworth</username>
      <id>9092818</id>
    </contributor>
    <comment>add [[WP:RCAT|rcat]]s</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{Redr|move|from CamelCase|up}}</text>
    <sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1>
  </revision>
</page>

Page 10: Context from Big Data

1. Data collection & parsing

https://dumps.wikimedia.org/enwiki/latest/

Page 11: Context from Big Data

1. Data collection & parsing

FullWikiParser (mediawikiparser): sax library, generator; 20 secs/doc, ~10 years

FastWikiParser (mwparserfromhell): sax library, generator; 200 docs/sec, ~21 hours

HTMLWikiParser (URX Index): hbase, lxml parser; 6 docs/sec, ~one month

GensimWikiCorpusParser: multithreading, generator; ~3 hours

wikipedia-parser: ~20 minutes
1. pyspark (64 cores, 8GB RAM)
2. wikihadoop (StreamWikiDumpInputFormat): split input file
3. mwparserfromhell: parse to raw text
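For illustration, a minimal sketch of the FastWikiParser-style streaming pass, assuming Python 3 with mwparserfromhell installed; the function name, the gzip handling, and the XML namespace constant are assumptions, not URX's actual code:

import gzip
import xml.etree.ElementTree as ET
import mwparserfromhell

# Namespace used by recent enwiki dumps; check the dump's <mediawiki> tag.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_articles(dump_path):
    """Yield (title, plain_text) for each main-namespace, non-redirect page."""
    with gzip.open(dump_path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag != NS + "page":
                continue
            ns = elem.findtext(NS + "ns")
            redirect = elem.find(NS + "redirect")
            if ns == "0" and redirect is None:
                title = elem.findtext(NS + "title")
                wikitext = elem.findtext(NS + "revision/" + NS + "text") or ""
                # mwparserfromhell handles the wikitext grammar; strip_code()
                # drops templates and markup, leaving plain text.
                yield title, mwparserfromhell.parse(wikitext).strip_code()
            elem.clear()  # release memory so 15M pages stream in constant space

The wikipedia-parser variant distributes the same mwparserfromhell parse across 64 cores by letting wikihadoop's StreamWikiDumpInputFormat split the dump into page-aligned chunks for pyspark.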

Page 12: Context from Big Data

2. Data storage

[Diagram: wikipedia-parser writes parsed pages to HDFS (namenode plus datanodes 1..n); wikipedia-indexer loads them into an Elasticsearch index (cluster nodes 1..m).]
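As a rough sketch of the wikipedia-indexer step, assuming the official elasticsearch-py client; the index name, document shape, and host are illustrative assumptions:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(["http://localhost:9200"])  # assumed host

def index_articles(articles, index_name="wikipedia"):
    """articles: iterable of (title, text) pairs from the parser stage."""
    actions = (
        {"_index": index_name, "_source": {"title": title, "text": text}}
        for title, text in articles
    )
    bulk(es, actions)  # batches requests instead of one HTTP call per document

Bulk indexing matters here: one request per document would make a 4M-article load far slower.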

Page 13: Context from Big Data

3. Data Processor (Dictionary building)

Example (token id, token) entries from the ~2M-token dictionary:
(0 taylor) (1 alison) (2 swift) (3 born) (4 december) . . . (1999995 zion) (1999996 dozer) (1999997 tank) (1999998 trinity) (1999999 neo)

Pyspark (Gensim): wikihadoop, StreamWikiDumpInputFormat; builds dictionary, tfidfmodel; ~1 hour

GensimWikiCorpusParser: multithreading, generator; builds corpus, dictionary, tfidfmodel; ~6 hours
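A minimal sketch of this stage with gensim, assuming the article token streams come from the parser above; names are illustrative:

from gensim.corpora import Dictionary
from gensim.models import TfidfModel

def build_models(token_streams):
    """token_streams: list of token lists, one per article (must be re-iterable)."""
    dictionary = Dictionary(token_streams)       # maps token <-> integer id
    dictionary.filter_extremes(keep_n=2000000)   # cap near the ~2M tokens on the slide
    bow_corpus = (dictionary.doc2bow(tokens) for tokens in token_streams)
    tfidf = TfidfModel(bow_corpus)               # collects document frequencies
    return dictionary, tfidf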

Page 14: Context from Big Data

4. Data Processor (Vectorization)

Alias          Candidate Entity                          f1     f2     …    fn
Taylor Swift   wikipedia:Taylor_Swift                    0.91   0.81   …    0.34
               wikipedia:Taylor_Swift_(album)            0.42   0.10   …    0.42
               wikipedia:1989_(Taylor_Swift_album)       0.71   0.23   …    0.31
               wikipedia:Fearless_(Taylor_Swift_song)    0.13   0.22   …    0.23
               wikipedia:John_Swift                      0.00   0.19   …    0.56

Gensim: ~350ms to predict an entity per alias

Cython: ~100ms to predict an entity per alias
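One way such features can be computed, sketched with gensim's similarity index; this shows a single TF-IDF cosine feature against the alias's surrounding context, not URX's full feature set, and all names are illustrative:

from gensim import similarities

def rank_candidates(context_tokens, candidate_bows, dictionary, tfidf):
    """Score candidate entity pages against an alias's surrounding context.

    candidate_bows: one doc2bow vector per candidate entity page.
    """
    index = similarities.MatrixSimilarity(
        tfidf[candidate_bows], num_features=len(dictionary))
    query = tfidf[dictionary.doc2bow(context_tokens)]
    scores = index[query]  # cosine similarity to each candidate, in order
    return sorted(enumerate(scores), key=lambda pair: -pair[1])

The ~3.5x gain on the slide presumably comes from reimplementing this hot scoring path in Cython.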

Page 15: Context from Big Data

[Diagram: end-to-end pipeline, steps 1-7. The Wikipedia, Wikilinks, and X corpora flow through corpus-parser into HDFS (Wikipedia, Wikilinks, X Corpus), then through corpus-indexer into Elasticsearch nodes 1..n; the Data Processor builds the Dictionary and TF-IDF Model that feed the Machine Learning Module.]

Page 16: Context from Big Data

Demo

Page 17: Context from Big Data

Linked Entities
1. http://en.wikipedia.org/wiki/Macgyver
2. http://en.wikipedia.org/wiki/Neil_deGrasse_Tyson
3. http://en.wikipedia.org/wiki/Richard_Dean_Anderson
4. http://en.wikipedia.org/wiki/Josh_Holloway
5. http://en.wikipedia.org/wiki/NBC
6. http://en.wikipedia.org/wiki/CBS
7. http://en.wikipedia.org/wiki/James_Wan
8. http://en.wikipedia.org/wiki/Netflix
9. http://en.wikipedia.org/wiki/America_America

http://zap2it.com/2015/10/5-reasons-cbs-macgyver-reboot-isnt-the-worst-idea-ever/

Page 18: Context from Big Data

Things to watch out for

● Tuning pyspark jobs (64 cores, 8GB driver RAM)
● Bringing down the elasticsearch cluster
● Rejoining the union after secession (elasticsearch nodes)
● Text cleaning (lowercasing, character encoding); see the sketch below
● Merging in Hadoop for dictionary creation
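A minimal sketch of the text-cleaning bullet, assuming Unicode NFKC normalization is the intended encoding fix; the slide names only lowercasing and character encoding:

import unicodedata

def clean(text):
    """Normalize encoding quirks, then lowercase for dictionary lookups."""
    text = unicodedata.normalize("NFKC", text)  # fold compatibility/width variants
    return text.lower()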

Page 20: Context from Big Data

Thank [email protected]