[@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Post on 21-Nov-2014


DESCRIPTION

Video available at: http://www.youtube.com/watch?v=MFilAoiV5nE Decision trees are a widely used machine learning technique for supervised classification. Indeed's data sets consist of tens of billions of documents with millions of distinct features. Since decision trees back some of our most important features, we built a custom distributed system to efficiently train them. Every day, we now build dozens of decision trees across this data. This same system now powers our internal analytical tools that enable quick data-driven decision-making at Indeed. This presentation provides a brief introduction to decision trees followed by a detailed overview of our approach to building them. The talk will be presented by our CTO, Andrew Hudson.

Transcript

Machine Learning at Indeed

Scaling Decision Trees

Andrew Hudson, CTO

I help people get jobs.

Indeed is a Search Engine for Jobs

Which jobs to show?

18,749 jobs

Maximize job seeker’s chance to get the job

● Will job seeker click on the job?
● Is the job still available?
● Will job seeker apply to the job?
● Is job seeker qualified for the job?

How?

Log job seeker behavior

Analyze the logs: what best explains why they clicked on some jobs and not on others?

May help predict future behavior

Supervised learning

Supervised Learning Approaches

Neural networks, Bayesian methods, Decision trees, Genetic programming, Logistic model tree, Nearest neighbor, Support Vector Machines, Random forests, Boosting, Bagging, Regression, Ensemble methods

Supervised Learning Approaches

Decision trees, Genetic programming, Logistic model tree, Random forests, Boosting, Bagging, Ensemble methods

Decision Trees

What is a Decision Tree?

A tree-like structure that presents a relevant sequence of questions, which determine a path and ultimately some outcome or prediction

I’m Thinking About Buying a Laptop

Is quality important?
  NO  → ASUS - or whatever woot has
  YES → Want to run linux?
    NO  → MACBOOK
    YES → (one more split) IDGAF → DELL, YES → LENOVO, HELL YES → SYSTEM76

Benefits of Decision Trees

The algorithm is relatively simple to understand and implement

The model produced is also human-understandable

Decision Tree Learning

Programmatic creation of decision trees

Decision Tree Learning

Given a set of documents, split it into two or more subsets that optimize some criteria

Repeat this process until a set can no longer be split

Titanic Example

1309 passengers, 500 survivors, 38.2% survival rate

What best explains who survived?

class: class of ticket; first, second or third

fsize: family size; number of family members onboard

gender: male or female

What best explains who survived?

Root: 1309 passengers, 500 survivors, 38.2% survival

Candidate split: class = 1

class = 1: 323 passengers, 200 survivors, 61.9% survival
class ≠ 1: 986 passengers, 300 survivors, 30.4% survival

Score = ?

Score

conditional entropy

Conditional Entropy as Score

lower conditional entropy → less uncertainty about the prediction based on that term
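As a rough illustration of this scoring, here is a minimal Python sketch of conditional entropy for a binary split, fed with the Titanic counts from these slides. The function names are made up for illustration, and the slides' exact scores may use a slightly different formulation (smoothing, log base, and so on).

import math

def entropy(p):
    # Entropy of a Bernoulli(p) outcome, in nats; 0 by convention when p is 0 or 1.
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def conditional_entropy(count, vsum, total_count, total_vsum):
    # H(outcome | doc matches the term or not) for a binary split.
    # count/vsum describe the matching side; the rest is the non-matching side.
    neg_count = total_count - count
    neg_vsum = total_vsum - vsum
    h_pos = entropy(vsum / count) if count else 0.0
    h_neg = entropy(neg_vsum / neg_count) if neg_count else 0.0
    return (count * h_pos + neg_count * h_neg) / total_count

# Titanic root node: 1309 passengers, 500 survivors. Lower score is better.
print(conditional_entropy(323, 200, 1309, 500))   # split on class = 1
print(conditional_entropy(466, 339, 1309, 500))   # split on gender = female
print(conditional_entropy(790, 239, 1309, 500))   # split on fsize = 0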

Scoring class = 1:

class = 1: 323 passengers, 200 survivors, 61.9% survival
class ≠ 1: 986 passengers, 300 survivors, 30.4% survival

Score = 0.6267

Best Score: 0.6267, class = 1

Candidate split: class ≤ 2

class ≤ 2: 600 passengers, 319 survivors, 53.2% survival
class > 2: 709 passengers, 181 survivors, 25.5% survival

Score = 0.6244

Best Score: 0.6244, class ≤ 2

(Expressed as class ≠ 3 / class = 3, this is the same split: 600 vs 709 passengers, Score = 0.6244)

Candidate split: gender = female

gender = female: 466 passengers, 339 survivors, 72.7% survival
gender ≠ female: 843 passengers, 161 survivors, 19.1% survival

Score = 0.5525

Best Score: 0.5525, gender = female

Candidate split: fsize = 0

fsize = 0: 790 passengers, 239 survivors, 30.3% survival
fsize ≠ 0: 519 passengers, 261 survivors, 50.3% survival

Score = 0.6448

Best Score (unchanged): 0.5525, gender = female

Splitting the next level: the gender = male node (843 passengers, 161 survivors, 19.1% survival)

class = 1: 179 passengers, 61 survivors, 34.1% survival
class ≠ 1: 664 passengers, 100 survivors, 15.1% survival

Score = 0.4700

Resulting tree (survival rate at each node):

38.2% (all passengers)
  FEMALE: 72.7%
    CLASS <= 2: 93.2%
    CLASS > 2: 49.1%
      FSIZE <= 2: 54.9%
      FSIZE > 2: 24.4%
  MALE: 19.1%
    CLASS = 1: 34.1%
    CLASS ≠ 1: 15.1%
      FSIZE = 2: 33.9%
      FSIZE ≠ 2: 13.1%

Predicting Click Probabilities

Passenger → Job Impression
Survived → Clicked on Job

For each candidate job, follow the path through the tree, then take the click-through rate of the terminal node
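A minimal sketch of that lookup, assuming a toy node structure and a simple "term appears in the title" test; this is for illustration only, not Indeed's actual model format.

class Node:
    def __init__(self, term=None, yes=None, no=None, ctr=None):
        self.term = term   # term tested at this node (None for a leaf)
        self.yes = yes     # subtree if the job title contains the term
        self.no = no       # subtree if it does not
        self.ctr = ctr     # click-through rate stored at a leaf

def predict(node, title_terms):
    # Walk from the root to a terminal node and return its CTR.
    while node.term is not None:
        node = node.yes if node.term in title_terms else node.no
    return node.ctr

# A toy two-level tree, loosely in the spirit of the query="sales" example.
tree = Node(term="sales",
            yes=Node(term="representative",
                     yes=Node(ctr=0.046),
                     no=Node(ctr=0.029)),
            no=Node(ctr=0.019))

print(predict(tree, {"sales", "representative"}))  # 0.046
print(predict(tree, {"store", "manager"}))         # 0.019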

Simplified Decision Tree for query="sales"

(Tree diagram: each node tests whether the job title contains a term such as "sales", "representative", "account", "manager", "associate", "outside", "service", or "inside"; the leaves hold predicted click-through rates. The slides trace the path through this tree for each of the job titles listed below.)

Final CTR Predictions

5.1% outside sales representative
4.6% sales representative
4.4% inside sales representative
3.8% account executive
2.9% sales manager
2.9% service sales representative
2.6% sales consultant
2.1% store manager
1.9% customer service representative
1.8% sales associate

Single Machine Implementation

Overview

Tree Building Strategies

One node at a time
- depth first
- breadth first

Depth First

(Diagram: nodes are split one at a time, fully expanding one branch of the tree before moving on to its sibling.)

Breadth First

(Diagram: nodes are split one at a time, completing each level of the tree before starting the next.)

Tree Building Strategies

One node at a time
- depth first
- breadth first

One layer at a time, all nodes simultaneously

(Diagram: building one layer at a time: iteration #1 splits the root, iteration #2 splits every node in the next layer, iteration #3 the layer after that, and so on.)

Data Format

id  class  fsize  gender  survived
 0    1      0      f        1
 1    1      3      m        1
 2    1      3      f        0
 3    1      3      m        0
 4    1      3      f        0
 5    1      0      m        1
 6    1      1      f        1
 7    1      0      m        0
 8    1      2      f        1
 9    1      0      m        0
10    1      1      m        0
11    1      1      f        1
12    1      0      f        1
13    1      0      f        1
14    1      0      m        1
15    1      0      m        0
16    1      1      m        0
17    1      1      f        1
18    1      0      f        1
19    1      0      m        0
….

Data Format

Create an inverted index

Key to efficiently building one layer at a time

Inverted Index

Maps terms to the list of documents that contain that term

Terms and docs stored in sorted order

Inverted Index

class=1 → 0,1,2,3,4,5,6,7,8,9,10,11,12,13….
class=2 → 323,324,325,326,327,328,329….
class=3 → 600,601,602,603,604,605,606….

Field: "class"   Terms: class=1, class=2, class=3   Docs: the sorted doc ids listed after each term

Inverted Index

fsize=0 → 0,5,7,9,12,13,14,15,18,19,22….
fsize=1 → 6,10,11,16,17,26,27,36,49,50….
fsize=2 → 8,20,21,42,76,77,78,79,81,82….
fsize=3 → 1,2,3,4,54,55,56,57,90,339….
fsize=4 → 249,250,251,252,253,449,806….
….

Inverted Index

gender=f → 0,2,4,6,8,11,12,13,17,18,21….
gender=m → 1,3,5,7,9,10,14,15,16,19,20….

Inverted Index

survived=0 → 2,3,4,7,9,10,15,16,19,25….
survived=1 → 0,1,5,6,8,11,12,13,14,17….
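A toy in-memory version of such an index might look like the sketch below; production indexes such as Lucene or Flamdex store the same mapping in compressed, on-disk form. The dict-of-lists layout here is only for illustration.

# Toy in-memory inverted index: term -> sorted list of doc ids containing it.
docs = [
    {"class": 1, "fsize": 0, "gender": "f", "survived": 1},  # doc 0
    {"class": 1, "fsize": 3, "gender": "m", "survived": 1},  # doc 1
    {"class": 1, "fsize": 3, "gender": "f", "survived": 0},  # doc 2
    {"class": 3, "fsize": 0, "gender": "m", "survived": 0},  # doc 3
]

inverted = {}
for doc_id, doc in enumerate(docs):
    for field, value in doc.items():
        # Doc ids are appended in increasing order, so each list stays sorted.
        inverted.setdefault(f"{field}={value}", []).append(doc_id)

print(inverted["gender=f"])   # [0, 2]
print(inverted["class=1"])    # [0, 1, 2]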

Inverted Index Implementations

Lucene

Flamdex

Primary Lookup Tables

groups[doc]
Where in the tree each doc is
Initialized to all ones; all docs start in the root

values[doc]
Value to be classified, for each doc
In this case it’s 1 if survived, 0 otherwise

Primary Lookup Tables

values[doc]

Constructed from an inverted index of the values

Invert the field of interest (e.g. survived)
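A minimal sketch of initializing the two tables, assuming the survived=1 posting list shown earlier; the array size and doc ids are toy values.

num_docs = 10
survived_docs = [0, 1, 5, 6, 8]    # posting list for survived=1 (sorted doc ids)

groups = [1] * num_docs            # every doc starts in the root group (group 1)
values = [0] * num_docs            # value to classify: 1 if survived, 0 otherwise
for doc_id in survived_docs:
    values[doc_id] = 1

print(groups)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(values)  # [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]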

Main Loop Overview

foreach field
  foreach term
    get group stats
    evaluate splits
apply best splits
repeat n times or until no more splits found

Main Loop - First Iteration

foreach field (class, fsize, gender)
  foreach term (class=1, class=2, class=3...)
    get group stats

Get Group Stats

count[grp]
Count of how many documents within that group contain the current term, initialized to zeros

vsum[grp]
Summation of the value to be classified from the documents within that group that contain the current term, initialized to zeros

Get Group Stats

for current field/term
  foreach doc
    grp = grps[doc]
    if grp == 0 skip
    count[grp]++
    vsum[grp] += vals[doc]
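A sketch of that loop in Python, using toy arrays; term_docs, groups, values, count and vsum mirror the names in the pseudocode above.

def get_group_stats(term_docs, groups, values, num_groups):
    count = [0] * num_groups   # docs containing the term, per group
    vsum = [0] * num_groups    # sum of the classified value, per group
    for doc_id in term_docs:
        grp = groups[doc_id]
        if grp == 0:           # group 0 means the doc has been dropped
            continue
        count[grp] += 1
        vsum[grp] += values[doc_id]
    return count, vsum

groups = [1] * 10
values = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
class1_docs = list(range(10))              # toy posting list for class=1
count, vsum = get_group_stats(class1_docs, groups, values, num_groups=2)
print(count[1], vsum[1])                   # 10 5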

Get Group Stats

for current field/term (class=1)
  foreach doc (0,1,2,3,4,5,6,7,8...)
    grp = grps[doc] (1,1,1,1,1,1,1,1,1…)
    if grp == 0 skip
    count[grp]++
    vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…)

…after all docs containing class=1: count[1] = 323, vsum[1] = 200

Group 1 (the root): 1309 passengers, 500 survivors, 38.2% survival

class = 1: 323 passengers (count[1]), 200 survivors (vsum[1])

Get Group Stats

for current field/term (class=2)
  foreach doc (323,324,325,326,327,328,329...)
    grp = grps[doc] (1,1,1,1,1,1,1,1,1…)
    if grp == 0 skip
    count[grp]++
    vsum[grp] += vals[doc] (0,1,0,0,0,0,1,0,1…)
…count[1] = 277, vsum[1] = 119

for current field/term (class=3)
  foreach doc (600,601,602,603,604,605,606...)
    grp = grps[doc] (1,1,1,1,1,1,1,1,1…)
    if grp == 0 skip
    count[grp]++
    vsum[grp] += vals[doc] (0,0,0,1,1,1,1,1,0…)
…count[1] = 709, vsum[1] = 181

Main Loop - First Iteration

foreach field (class, fsize, gender)
  foreach term (class=1, class=2, class=3...)
    get group stats
    evaluate splits

Evaluate Splits

Consider current field/term as a potential split for each group

1) check if the split is admissible
   balance check, significance check

2) score the split
   conditional entropy or some other heuristic

3) keep the best scoring split

Evaluate Splits

totalcount[group] / totalvalue[group]
Total number of documents and total values for each group, i.e. # passengers / # survivors

bestsplit[group] / bestscore[group]
Current best split and score for each group, initially nulls

foreach field/term (class=1)
  get group stats (count[1]=323, vsum[1]=200)
  foreach group
    if not admissible( … ) skip
    score = calcscore(cnt[grp], vsum[grp], totcnt[grp], totval[grp])
    if score < bestscore[grp]
      bestscore[grp] = score
      bestsplit[grp] = field/term
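A hedged sketch of this evaluation step; admissible and calcscore below are stand-ins, since the slides mention balance/significance checks and conditional entropy but do not spell out the exact rules.

import math

def entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def calcscore(cnt, vsum, totcnt, totval):
    # Conditional entropy of the value given "doc matches the term or not".
    neg_cnt, neg_val = totcnt - cnt, totval - vsum
    h_pos = entropy(vsum / cnt) if cnt else 0.0
    h_neg = entropy(neg_val / neg_cnt) if neg_cnt else 0.0
    return (cnt * h_pos + neg_cnt * h_neg) / totcnt

def admissible(cnt, totcnt, min_docs=50):
    # Toy balance check: both sides of the split must keep enough docs.
    return cnt >= min_docs and (totcnt - cnt) >= min_docs

def evaluate_splits(term, count, vsum, totalcount, totalvalue,
                    bestscore, bestsplit):
    for grp in range(1, len(totalcount)):
        if not admissible(count[grp], totalcount[grp]):
            continue
        score = calcscore(count[grp], vsum[grp],
                          totalcount[grp], totalvalue[grp])
        if bestscore[grp] is None or score < bestscore[grp]:
            bestscore[grp] = score
            bestsplit[grp] = term

# One group (the root), candidate split class=1 from the Titanic example.
bestscore, bestsplit = [None, None], [None, None]
evaluate_splits("class=1", [0, 323], [0, 200], [0, 1309], [0, 500],
                bestscore, bestsplit)
print(bestsplit[1], round(bestscore[1], 4))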

Main Loop - First Iteration

foreach field (class, fsize, gender)
  foreach term (class=1, class=2, class=3...)
    get group stats
    evaluate splits
apply best splits (bestsplit[1] = “gender=f”)

Apply Best Splits

Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group

target group: 1
condition: gender=female
positive group: 3
negative group: 2

Apply Best Splits

Using inverted index, iterate over docs that match split condition

If current document is in targeted group, move it to the positive group

At the end, move anything left in target group to negative group
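A minimal sketch of those two passes over the groups array, with a toy posting list standing in for the inverted index.

def apply_split(groups, matching_docs, target, positive, negative):
    # First pass: docs matching the split condition move to the positive group.
    for doc_id in matching_docs:
        if groups[doc_id] == target:
            groups[doc_id] = positive
    # Second pass: anything still in the target group did not match,
    # so it moves to the negative group.
    for doc_id in range(len(groups)):
        if groups[doc_id] == target:
            groups[doc_id] = negative

groups = [1] * 10
gender_f_docs = [0, 2, 4, 6, 8]            # toy posting list for gender=f
apply_split(groups, gender_f_docs, target=1, positive=3, negative=2)
print(groups)  # [3, 2, 3, 2, 3, 2, 3, 2, 3, 2]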

Apply Best Splits

gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23….

Before (all docs start in group 1):
group[0] = 1 group[7] = 1 group[14] = 1
group[1] = 1 group[8] = 1 group[15] = 1
group[2] = 1 group[9] = 1 group[16] = 1
group[3] = 1 group[10] = 1 group[17] = 1
group[4] = 1 group[11] = 1 group[18] = 1
group[5] = 1 group[12] = 1 group[19] = 1
group[6] = 1 group[13] = 1 group[20] = 1

After (docs matching gender=f move to group 3, the rest to group 2):
group[0] = 3 group[7] = 2 group[14] = 2
group[1] = 2 group[8] = 3 group[15] = 2
group[2] = 3 group[9] = 2 group[16] = 2
group[3] = 2 group[10] = 2 group[17] = 3
group[4] = 3 group[11] = 3 group[18] = 3
group[5] = 2 group[12] = 3 group[19] = 2
group[6] = 3 group[13] = 3 group[20] = 2

Main Loop

foreach field
  foreach term
    get group stats
    evaluate splits
apply best splits
repeat n times or until no more splits found

(Diagram: after iteration #1 the root (group 1) is split on gender ≠ female / gender = female into groups 2 and 3; iteration #2 then splits those nodes in turn.)

Main Loop - Second Iteration

foreach field (class, fsize, gender)
  foreach term (class=1, class=2, class=3...)
    get group stats

Get Group Stats

for current field/term (class=1)
  foreach doc (0,1,2,3,4,5,6,7,8...)
    grp = grps[doc] (3,2,3,2,3,2,3,2,3…)
    if grp == 0 skip
    count[grp]++
    vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…)
…count[2] = 179, vsum[2] = 61
 count[3] = 144, vsum[3] = 139

for current field/term (class=2)
  foreach doc (323,324,325,326,327,328,329...)
    grp = grps[doc] (2,3,2,2,2,2,3,2,2…)
    if grp == 0 skip
    count[grp]++
    vsum[grp] += vals[doc] (0,1,0,0,0,0,1,0,1…)
…count[2] = 171, vsum[2] = 25
 count[3] = 106, vsum[3] = 94

for current field/term (class=3)
  foreach doc (600,601,602,603,604,605,606...)
    grp = grps[doc] (2,2,2,3,3,2,2,3,2…)
    if grp == 0 skip
    count[grp]++
    vsum[grp] += vals[doc] (0,0,0,1,1,1,1,1,0…)
…count[2] = 493, vsum[2] = 75
 count[3] = 216, vsum[3] = 106

for current field/term (gender=female)
  foreach doc (0,2,4,6,8,11,12,13,17,18,21,23….)
    grp = grps[doc] (3,3,3,3,3,3,3,3,3,3,3,3…)
    if grp == 0 skip
    count[grp]++
    vsum[grp] += vals[doc] (1,0,0,1,1,1,1,1,1…)
…count[2] = 0, vsum[2] = 0
 count[3] = 467, vsum[3] = 339

for current field/term (gender=male)
  foreach doc (1,3,5,7,9,10,14,15,16,19,20,22...)
    grp = grps[doc] (2,2,2,2,2,2,2,2,2,2,2…)
    if grp == 0 skip
    count[grp]++
    vsum[grp] += vals[doc] (1,0,1,0,0,0,1,0,0…)
…count[2] = 844, vsum[2] = 161
 count[3] = 0, vsum[3] = 0

What AboutInequality Splits?

e.g. class ≤ 2

Main Loop + Inequality Splits

foreach field
  reset inequality stats
  foreach term
    get group stats
    update inequality stats
    evaluate splits
    evaluate inequality splits
apply best splits for each group
repeat n times or until no more splits found
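One way to read the inequality bookkeeping: because a field's terms stream by in sorted order, a running cumulative count and value sum per group gives the "≤ term" side of every inequality candidate for free. A sketch under that assumption, using the Titanic class stats; the update rules themselves are not spelled out in the slides.

# Per-term stats for group 1 from the Titanic example: term -> (count, vsum)
class_stats = {1: (323, 200), 2: (277, 119), 3: (709, 181)}
total_count, total_vsum = 1309, 500

cum_count, cum_vsum = 0, 0                 # reset inequality stats for the field
for term in sorted(class_stats):
    count, vsum = class_stats[term]
    cum_count += count                     # update inequality stats
    cum_vsum += vsum
    # Candidate inequality split: class <= term  vs  class > term
    print(f"class <= {term}: {cum_count} docs, {cum_vsum} survivors | "
          f"class > {term}: {total_count - cum_count} docs, "
          f"{total_vsum - cum_vsum} survivors")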

Scalability

Performs quite well on a single machine

Worked well for a while, but started to hit limits

Ultimately needed to distribute to multiple machines

Multiple Machine Implementation

Hadoop?

Hadoop

Experimented with using Hadoop

Each level took five sequential MapReduce jobs

Much slower than a single machine; repeatedly writes intermediate data and does lots of shuffling

Hadoop is not great for iterative algorithms

Partition Data

(Diagram: the inverted index is partitioned into shards; Shard 1 lives on Machine 1, Shard 2 on Machine 2, and so on.)

Main Loop

foreach field
  foreach term
    get group stats
    evaluate splits
apply best splits for each group
repeat n times or until no more splits found

The "foreach field / foreach term / get group stats" part of the loop is the FTGS operation (Field, Term, Group, Stats).

FTGS Stream - Single Machine

class=1|1|323|200
class=2|1|277|119
class=3|1|709|181
fsize=0|1|790|239
fsize=1|1|235|126
fsize=2|1|159|90
fsize=3|1|43|30
fsize=4|1|22|6
fsize=5|1|25|5
fsize=6|1|16|4
fsize=7|1|8|0
fsize=10|1|11|0
gender=f|1|466|339
gender=m|1|843|161

Sorted by field and term; each entry is field=term | group | doc count | value sum.
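A sketch of how a single shard could produce such a stream from its inverted index and group/value arrays; the pipe-separated text mirrors the slides' rendering, not Imhotep's actual wire format, and the toy data here is illustrative.

inverted = {                               # toy shard: term -> sorted doc ids
    "class=1": [0, 1, 2], "class=3": [3],
    "gender=f": [0, 2], "gender=m": [1, 3],
}
groups = [1, 1, 1, 1]
values = [1, 1, 0, 0]

def ftgs_stream(inverted, groups, values):
    for term in sorted(inverted):          # terms visited in sorted order
        stats = {}                         # group -> (count, vsum)
        for doc_id in inverted[term]:
            grp = groups[doc_id]
            if grp == 0:
                continue
            count, vsum = stats.get(grp, (0, 0))
            stats[grp] = (count + 1, vsum + values[doc_id])
        for grp in sorted(stats):
            count, vsum = stats[grp]
            yield f"{term}|{grp}|{count}|{vsum}"

for line in ftgs_stream(inverted, groups, values):
    print(line)
# class=1|1|3|2
# class=3|1|1|0
# gender=f|1|2|1
# gender=m|1|2|1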

FTGS Stream

How to distribute?

(Diagram: Machine 1 produces FTGS 1 from Shard 1, Machine 2 produces FTGS 2 from Shard 2, and the two streams are merged on Machine 3.)

FTGS Stream Merge

Machine 1:
class=1|1|198|111
class=2|1|277|119
class=3|1|511|129
fsize=0|1|790|239
fsize=1|1|94|53
fsize=2|1|75|48
fsize=3|1|21|17
fsize=4|1|3|1
fsize=5|1|3|1
gender=f|1|308|237
gender=m|1|678|122

Machine 2:
class=1|1|125|89
class=3|1|198|52
fsize=1|1|141|73
fsize=2|1|84|42
fsize=3|1|22|13
fsize=4|1|19|5
fsize=5|1|22|4
fsize=6|1|16|4
fsize=7|1|8|0
fsize=10|1|11|0
gender=f|1|158|102
gender=m|1|165|39

Merged stream: entries with the same field/term/group are summed, e.g. class=1|1|198|111 + class=1|1|125|89 = class=1|1|323|200, giving:

class=1|1|323|200
class=2|1|277|119
class=3|1|709|181
fsize=0|1|790|239
fsize=1|1|235|126
…

(Diagram: with many shards the FTGS streams are combined with hierarchical k-way merges: FTGS 1 through FTGS 6 merge into FTGS 1-6, groups of merged streams merge into FTGS 1-18, then FTGS 1-36, and so on.)
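A sketch of merging sorted FTGS streams; heapq.merge plays the role of the k-way merge, and the sample rows are the two machine streams from the merge example above.

import heapq
from itertools import groupby

machine1 = ["class=1|1|198|111", "class=2|1|277|119", "class=3|1|511|129",
            "gender=f|1|308|237", "gender=m|1|678|122"]
machine2 = ["class=1|1|125|89", "class=3|1|198|52",
            "gender=f|1|158|102", "gender=m|1|165|39"]

def parse(line):
    term, group, count, vsum = line.split("|")
    return term, int(group), int(count), int(vsum)

def merge_ftgs(*streams):
    # k-way merge of sorted streams, then sum consecutive entries that share
    # the same (field=term, group) key; the output stays in sorted order.
    entries = heapq.merge(*(map(parse, s) for s in streams))
    for (term, group), rows in groupby(entries, key=lambda e: (e[0], e[1])):
        count = vsum = 0
        for _, _, c, v in rows:
            count += c
            vsum += v
        yield f"{term}|{group}|{count}|{vsum}"

for line in merge_ftgs(machine1, machine2):
    print(line)
# class=1|1|323|200
# class=2|1|277|119
# class=3|1|709|181
# gender=f|1|466|339
# gender=m|1|843|161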

Main Loop

foreach field
  foreach term
    get group stats      ← FTGS
    evaluate splits
apply best splits for each group
repeat n times or until no more splits found

The remaining step to distribute is "apply best splits for each group": the Regroup operation.

Regroup

(Diagram: like FTGS, the regroup operation runs on each machine against its own shards: Regroup 1-6, Regroup 7-12, Regroup 13-18, ...)

Imhotep

Distributed system that does efficient FTGS and Regroup operations on inverted indexes

32 machines

2 CPU x 6-core Xeon Westmere E5649
128GB RAM
10 x 1TB 7200 RPM SATA

Total: 384 cores, 4TB RAM, 320TB disk

Decision tree on 13 billion documents
330GB → ~25 bytes per doc

First FTGS: 314 seconds (36.3 million terms)
First Regroup: 9.6 seconds (7 groups)

Second FTGS: 57 seconds
Second Regroup: 23 seconds (217 groups)

Imhotep

Distributed system that does efficient FTGS and Regroup operations

Powers our internal analytical tools

… and more

Imhotep - Next @IndeedEng Talk

Sharding and shard management
Session / FTGS network protocol
Memory management
Inverted Indexes
FTGS Merge
Regroup operations
Fault Tolerance

Conclusion

Now scales to larger and larger data sets by adding more machines

Increased freshness and frequency of builds

Decision trees have lots of tunable components; we regularly get 1% wins via A/B tests

Continuous Improvement

Sponsored Job Click-through Rate (CTR)

Thanks.

Q & A

More Questions? Jason, David, James, Jeff

Next @IndeedEng Talk: Imhotep: Large Scale Analytics and Machine Learning at Indeed

Jeff Plaisance, Engineering Manager
March 26, 2014

http://engineering.indeed.com/talks
