Page 1
© Nube Technologies
Real Time Fuzzy Matching With Spark and ElasticSearch
Page 2
About Us
The only way to do great work is to love what you do.
- Steve Jobs
Page 3
The problem - lake or swamp?
Page 4
Duplicates
Page 5
Challenges
● Quadratic problem
● No standard notion of similarity
● Omissions, typos and other issues
● Different languages
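To make the first two challenges concrete, here is a minimal sketch, not taken from the slides. It counts the pairs a naive all-against-all comparison would score, then scores the same names with two different similarity measures. The sample names and the choice of measures (the standard library's `SequenceMatcher` ratio and a word-set Jaccard overlap) are illustrative assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered record pairs an all-vs-all comparison must score."""
    return n * (n - 1) // 2


def char_similarity(a: str, b: str) -> float:
    """Character-level ratio from the standard library (one notion of similarity)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Word-set overlap (a different, equally plausible notion of similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


# Hypothetical customer names, chosen only for illustration.
records = ["Jon Smith", "John Smith", "Smith John"]

# The number of pairs grows quadratically with the number of records.
for n in (1_000, 1_000_000):
    print(f"{n:>9,} records -> {pair_count(n):,} pairs")

# The two measures can rank the same pairs very differently.
for a, b in combinations(records, 2):
    print(
        f"{a!r} vs {b!r}: "
        f"char={char_similarity(a, b):.2f} jaccard={token_jaccard(a, b):.2f}"
    )
```

A one-letter typo barely moves the character ratio but removes a whole token from the word overlap, while swapping the word order does the opposite. Neither measure is right for every field, which is the point of the second bullet.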
Page 6
Use Case - Customer Record Dedup
Page 7
Use Case - Customer Record Dedup
Page 8
Use Case - Shopping Site Comparison
Page 9
Use Case - Shopping Site Comparison
Page 10
Other Use Cases
● Cross selling
● Financial credit ratings
● Fraud analytics
● Catalog and inventory management
● Household and individual level analytics
Page 11
Let's start wishing...
● Data variety
● Scalable
● No manual configuration of rules or algorithms
● Multi-language
● Real time
Page 12
Reifier - learn
Page 13
Reifier - learn
Page 14
Reifier - learn
Page 15
Reifier - learn
Page 16
Real Time
Spark + ElasticSearch
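The slides do not show how ElasticSearch is queried. One plausible way to look up candidate matches in real time is a `match` query with fuzziness enabled. The sketch below only builds the request body. The field name `name` and the helper `fuzzy_match_query` are hypothetical, and sending the body to a cluster is left out.

```python
import json


def fuzzy_match_query(field: str, text: str, size: int = 5) -> dict:
    """Build an Elasticsearch search body that tolerates small typos in `text`."""
    return {
        "size": size,  # return only the top few candidates
        "query": {
            "match": {
                field: {
                    "query": text,
                    # "AUTO" scales the allowed edit distance with term length.
                    "fuzziness": "AUTO",
                    # Require every term in the text to match (fuzzily).
                    "operator": "and",
                }
            }
        },
    }


# Hypothetical field and incoming record value.
body = fuzzy_match_query("name", "Jon Smith")
print(json.dumps(body, indent=2))
```

A query like this would only narrow the search to a handful of candidates per incoming record. Deciding which candidate is a true match would still be up to a separate scoring step.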
Page 17
Spark Benefits
● Distributed
● Scalable
● Fast
● Machine learning
● Sampling
● No need to orchestrate multiple jobs
Page 18
Thank You!
www.nubetech.co