Cardlytics & Drill Use Case: Matching Big Data David Kim Principal Engineer 2015.06.25

Atlhug 20150625

Aug 03, 2015

Transcript
Page 1: Atlhug 20150625

Cardlytics & Drill Use Case: Matching Big Data

David Kim

Principal Engineer

2015.06.25

Page 2: Atlhug 20150625

About Cardlytics

© 2013 Cardlytics. Proprietary and Confidential. 2

•  Privately held company leveraging proprietary purchase-driven intelligence platform to provide actionable insights into consumer behavior to numerous organizations using consumer purchase data that we have exclusive rights to

•  Founded in 2008 by Scott Grimes (CEO) and Lynne Laube (COO), both former executives at Capital One

•  Headquartered in Atlanta, we have 320 employees with offices in NY, Chicago, San Francisco & London

•  Owns multiple patents and nearly 700 banking relationships in the US and the UK representing over 100 million households and $1 trillion in yearly spend

Page 3: Atlhug 20150625

Problem Statement

A customer (advertiser) requested analysis that would give them insight into their own business and customer base, so they could make better business decisions.

•  Must match advertiser customers to Cardlytics customers

•  Matches must be highly confident and unique


Page 4: Atlhug 20150625

Our Approach: Pattern Matching

[Figure: pattern-matching diagram; axis label: time]


Page 5: Atlhug 20150625

Challenges

•  Matches must be unique

•  Matches must be highly confident

•  Limited information available to match data points (no PII)

•  Missing data points

•  Scale (Drill)

»  Depending on the advertiser, data points are sparse or densely packed
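
One way to read the "unique" and "highly confident" requirements is as a filter over scored candidate pairs. The sketch below is only an illustration of that reading: the score field, the 0.9 threshold, and the function name unique_confident_matches are assumptions, not details given in the talk.

```python
from collections import Counter

def unique_confident_matches(candidates, min_score=0.9):
    """Keep pairs that clear the score threshold and are one-to-one.

    candidates: iterable of (advertiser_id, cardlytics_id, score) tuples.
    The scoring scheme and the threshold are hypothetical placeholders.
    """
    confident = [(a, c) for a, c, score in candidates if score >= min_score]
    # Count how often each id appears on its own side among confident pairs.
    adv_counts = Counter(a for a, _ in confident)
    card_counts = Counter(c for _, c in confident)
    # A match is unique only if neither side appears in any other pair.
    return [(a, c) for a, c in confident
            if adv_counts[a] == 1 and card_counts[c] == 1]

candidates = [
    ("adv1", "card1", 0.97),  # confident and unique
    ("adv2", "card2", 0.95),  # adv2 also matches card3, so ambiguous
    ("adv2", "card3", 0.93),
    ("adv3", "card4", 0.40),  # below the threshold
]
print(unique_confident_matches(candidates))
```

Dropping every ambiguous pair, rather than picking the best-scoring one, is the conservative choice here; it trades match volume for confidence.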


Page 6: Atlhug 20150625

Scale Issues with Dense Data Points


Page 7: Atlhug 20150625

Scale Issues

•  60M x 40M = 2.4 quadrillion (2.4 x 10^15) potential matches evaluated

•  120M x 120M = 14.4 quadrillion (1.44 x 10^16) potential matches evaluated

•  590M x 130M = 76.7 quadrillion (7.67 x 10^16) potential matches evaluated
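
The scale figures are the sizes of full cross products between the two data sets. As a quick check of the arithmetic, taking "M" as millions, the products can be computed directly. The names pocs and potential_matches are illustrative only.

```python
# Cross-product sizes for the three proofs of concept, taking "M" as millions.
MILLION = 10**6

pocs = {
    "POC 1": (60 * MILLION, 40 * MILLION),
    "POC 2": (120 * MILLION, 120 * MILLION),
    "POC 3": (590 * MILLION, 130 * MILLION),
}

def potential_matches(left_rows, right_rows):
    """Number of pairs in a full cross product of the two data sets."""
    return left_rows * right_rows

for name, (left, right) in pocs.items():
    pairs = potential_matches(left, right)
    print(f"{name}: {pairs:.3e} potential pairs")
```

Each product lands in the 10^15 to 10^16 range, which is why a naive pairwise comparison is impractical on a single server.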


Page 8: Atlhug 20150625

Our Environment…

SQL Server: 64 cores (32 physical), 256GB RAM, direct-attached storage w/enterprise disks

Hadoop Cluster: 10 nodes, 32 cores/node, 128GB RAM, 12 x 4TB consumer grade disks


Page 9: Atlhug 20150625

Actual Results…

•  POC 1: 60M customer data points x 40M Cardlytics data points collected over 2 years

»  SQL Server: ~20 hours

•  POC 2: 120M x 120M over 6 months

»  SQL Server: 1-2 months

»  Hive: Killed after several days (estimated to take about a week)

»  Drill: 17-18 hours yielding 91+B matching data points

•  POC 3: 590M x 130M over 1 year

»  Drill: ~17 hours to yield 1.3T matches and 72TB

»  Required some tweaking and turning some secret knobs


…PROBABLY

Page 10: Atlhug 20150625

…from the MapR Drill team

Compliments of Jacques Nadeau/Aman Sinha

•  store.format

•  store.parquet.block-size

•  planner.broadcast_threshold

•  planner.broadcast_factor

•  planner.join.row_count_estimate_factor

•  planner.enable_multiphase_agg

•  planner.enable_mux_exchange

•  exec.min_hash_table_size

•  planner.enable_hashjoin

•  select * from sys.options;
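
Options like these can be changed per session with Drill's ALTER SESSION SET statement. The sketch below only renders such statements as text. The helper name alter_session is hypothetical, and the values are placeholders chosen for illustration, not the settings used in the talk.

```python
def alter_session(option, value):
    """Render a Drill ALTER SESSION SET statement for one option."""
    if isinstance(value, bool):      # check bool first: bool is a subclass of int
        rendered = "true" if value else "false"
    elif isinstance(value, str):
        rendered = "'" + value.replace("'", "''") + "'"
    else:
        rendered = str(value)
    # Option names contain dots and dashes, so they are quoted with backticks.
    return f"ALTER SESSION SET `{option}` = {rendered};"

# Placeholder values for illustration only; NOT the settings from the talk.
example_options = {
    "store.format": "parquet",
    "planner.enable_hashjoin": True,
    "planner.enable_multiphase_agg": True,
    "planner.broadcast_threshold": 10_000_000,
}

for name, value in example_options.items():
    print(alter_session(name, value))
```

Querying sys.options before and after a change is a simple way to confirm which values are actually in effect.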


Page 11: Atlhug 20150625

Other Nuggets

•  Drill is memory intensive

•  You will always know more about your data than Drill

•  Hadoop and Drill are great tools but don’t solve stupidity

•  Some of the basic principles of querying a dataset still apply

»  Intelligent batching

»  Applying filters early to work with smaller datasets

»  Bringing back only the data that you need

»  Partitioning

»  Understanding the configurations and internals of your tools
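
The filtering and partitioning principles above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the matching logic from the talk: the row layout (id, day, amount), the partition key, and the function name candidate_pairs are all hypothetical.

```python
from collections import defaultdict

def candidate_pairs(left, right, key, keep=lambda row: True):
    """Yield (left_row, right_row) pairs that share a partition key.

    Rows failing `keep` are dropped before any pairing (filter early), and
    each partition is paired independently, so partitions could be processed
    as separate batches.
    """
    buckets = defaultdict(list)
    for rrow in right:
        if keep(rrow):
            buckets[key(rrow)].append(rrow)
    for lrow in left:
        if keep(lrow):
            for rrow in buckets.get(key(lrow), []):
                yield lrow, rrow

# Hypothetical rows: (id, day, amount). Field names are illustrative only.
left = [("a1", "2015-06-01", 20.0), ("a2", "2015-06-02", 35.5), ("a3", "2015-06-02", 0.0)]
right = [("c1", "2015-06-01", 20.0), ("c2", "2015-06-02", 35.5), ("c3", "2015-06-03", 12.0)]

pairs = list(candidate_pairs(
    left, right,
    key=lambda row: row[1],        # partition by day
    keep=lambda row: row[2] > 0,   # drop empty amounts up front
))
print(len(left) * len(right), "pairs in the full cross product")
print(len(pairs), "pairs after filtering and partitioning")
```

The same idea carries over to a SQL engine: a restrictive join key and an early WHERE clause shrink the number of pairs the engine has to evaluate.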


Page 12: Atlhug 20150625

"Louis, I think this is the beginning of a beautiful friendship."

Our close partnership with MapR includes…

•  Semi-weekly check-ins with Drill dev team

•  Weekly check-ins with MapR product managers

•  Improving Drill with real world applications, tests, and data

•  Input to future roadmap

»  Large IN-clause

»  DST support

»  Auto-partitioning

»  Windowing functions

»  Support for inserts


Page 13: Atlhug 20150625

Grab a seat at the cool kids’ table!!

Careers @Cardlytics

http://cardlytics.com/cardlytics/?s=career

Apache Drill

https://drill.apache.org/

https://drill.apache.org/docs/

MapR

https://www.mapr.com/products/product-overview/apache-drill


Michael Fabacher, VP of Data Development

David Kim, Principal Engineer