Sidi Chang Insight Data Science Data Engineering Fellow Jul 2016 JustBid

Sidi chang demo

Apr 13, 2017


Engineering

Sidi Chang
Transcript
Page 1: Sidi chang demo

Sidi Chang Insight Data Science Data Engineering Fellow

Jul 2016

JustBid

Page 2: Sidi chang demo

Sealed/blind second-price auction

Item

Bidder
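As a minimal sketch of the auction rule named on this slide: in a sealed-bid second-price auction, the highest bidder wins but pays the second-highest bid. The function name, the bidder names, and the tie-breaking rule below are illustrative assumptions, not taken from JustBid.

```python
from typing import Dict, Tuple


def second_price_winner(bids: Dict[str, float]) -> Tuple[str, float]:
    """Return (winner, price) for a sealed-bid second-price auction.

    The highest bidder wins and pays the second-highest bid.
    Ties are broken by bidder name, which is an assumption of this sketch.
    """
    if len(bids) < 2:
        raise ValueError("need at least two bids to define a second price")
    # Rank bidders by bid, highest first; the name breaks ties deterministically.
    ranked = sorted(bids.items(), key=lambda kv: (-kv[1], kv[0]))
    winner = ranked[0][0]
    price = ranked[1][1]
    return winner, price


# Hypothetical sealed bids on a single item.
example_bids = {"alice": 120.0, "bob": 95.0, "carol": 110.0}
winner, price = second_price_winner(example_bids)
print(winner, price)  # alice 110.0
```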

Page 3: Sidi chang demo

• Demo

Page 4: Sidi chang demo

Data Pipeline

Simulated Data

Page 5: Sidi chang demo

Data

• 10K bidders

• Nearly 15 million bids

Page 6: Sidi chang demo

Recommendation—Jaccard Similarity

Jaccard Similarity: J(C_i, C_j) = |C_i ∩ C_j| / |C_i ∪ C_j|

D_i = user_i

C_i = items(user_i)
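The Jaccard similarity of two users is the size of the intersection of their item sets divided by the size of their union. A minimal sketch follows, with made-up item ids standing in for items(user_i); returning 0.0 for two empty sets is a convention chosen here, not something the slides specify.

```python
from typing import Hashable, Set


def jaccard(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|; 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical item sets C_i for two users (item ids are invented).
items_user_a = {"item_1", "item_4"}
items_user_b = {"item_1", "item_3", "item_4"}
print(jaccard(items_user_a, items_user_b))  # 2 shared items out of 3 distinct
```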

Page 7: Sidi chang demo

Recommendation

For N = 10 million, it takes more than a year (AWS m4.large cluster)…

Then we will need to use the MinHash algorithm, which can be easily distributed…

Do an unbiased estimation by Chernoff bounds and Markov inequality: the expected error is
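One way to see why the exact approach scales poorly, assuming it compares every pair of users (the slides do not spell out the procedure), is to count the pairs. The helper name below is hypothetical.

```python
def pair_count(n: int) -> int:
    """Number of unordered user pairs among n users: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# 10K matches the simulated data set; 10 million is the N quoted above.
for n in (10_000, 10_000_000):
    print(f"{n:>12,} users -> {pair_count(n):,} pairs")
```

At N = 10 million this is roughly 5 × 10^13 pairwise comparisons, each of which needs a set intersection and a set union.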

Page 8: Sidi chang demo

MinHash Example

Item  Row  User 1  User 2  User 3  User 4  (x+1) mod 5  (3x+1) mod 5
1     0    1       0       0       1       1            1
2     1    0       0       1       0       2            4
3     2    0       1       0       1       3            2
4     3    1       0       1       1       4            0
5     4    0       0       1       0       0            3

        U1  U2  U3  U4
Hash 1
Hash 2

Page 11: Sidi chang demo

MinHash Example

Item  Row  User 1  User 2  User 3  User 4  (x+1) mod 5  (3x+1) mod 5
1     0    1       0       0       1       1            1
2     1    0       0       1       0       2            4
3     2    0       1       0       1       3            2
4     3    1       0       1       1       4            0
5     4    0       0       1       0       0            3

        U1  U2  U3  U4
Hash 1  1
Hash 2

Page 12: Sidi chang demo

MinHash Example

Item  Row  User 1  User 2  User 3  User 4  (x+1) mod 5  (3x+1) mod 5
1     0    1       0       0       1       1            1
2     1    0       0       1       0       2            4
3     2    0       1       0       1       3            2
4     3    1       0       1       1       4            0
5     4    0       0       1       0       0            3

        U1  U2  U3  U4
Hash 1  1   3   0   1
Hash 2  0   2   0   0
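The signature values in this example can be reproduced with a short single-machine sketch: for each user and each hash function, take the minimum hash value over the rows where the user has a 1. The variable and function names are illustrative, and this is not the distributed implementation the talk refers to.

```python
from typing import Callable, Dict, List, Set

# Rows (0-based) in which each user has a 1, read off the example table.
user_rows: Dict[str, Set[int]] = {
    "U1": {0, 3},
    "U2": {2},
    "U3": {1, 3, 4},
    "U4": {0, 2, 3},
}

# The two hash functions from the table header.
hash_funcs: List[Callable[[int], int]] = [
    lambda x: (x + 1) % 5,
    lambda x: (3 * x + 1) % 5,
]


def minhash_signature(rows: Set[int]) -> List[int]:
    """For each hash function, keep the smallest hash value over the user's rows."""
    return [min(h(r) for r in rows) for h in hash_funcs]


def estimated_similarity(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature positions on which two users agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


signatures = {user: minhash_signature(rows) for user, rows in user_rows.items()}
for user, sig in signatures.items():
    print(user, sig)  # U1 [1, 0], U2 [3, 2], U3 [0, 0], U4 [1, 0]
print(estimated_similarity(signatures["U1"], signatures["U4"]))  # 1.0
```

With only two hash functions the estimate is coarse: U1 and U4 receive identical signatures, so the estimate is 1.0, while their exact Jaccard similarity from the table is 2/3. Using more hash functions tightens the estimate.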

Page 13: Sidi chang demo

Performance

Page 14: Sidi chang demo

Challenges

• MinHash algorithm implemented in a distributed system

• Jaccard similarity tested in a distributed system

• Use the right data structures for faster computation

• Use both Scala and Python

Page 15: Sidi chang demo

About me

• MS in CS and Operations Research