
Hadoop meetup zhemzhitsky

Aug 11, 2015

Page 1: Hadoop meetup zhemzhitsky

Company Profile

User Segmentation in Online Advertising

Spark vs Hadoop

Sergey Zhemzhitsky, CTO, CleverDATA, May 22, 2015

Page 2: Hadoop meetup zhemzhitsky

cleverdata.ru | [email protected]

International market business development since 2012

• One of three leading IT companies in Russia
• 43 branches in Russia and abroad
• +5500 employees
• 100K projects for 10K customers

• Data management innovative platform (Data Exchange Service)
• Cloud Service
• In-house development

Internet advertising solutions
• Data Management Platforms
• Customers Base Management
• Web Analytics
• Marketing automation

• Big Data
• Data Mining
• Digital Intelligence
• Operational Intelligence
• Low Latency and NoSQL
• Cloud Computing

Page 3: Hadoop meetup zhemzhitsky

Agenda

• The problem;
• Hadoop vs. Spark;
• Specifics;
• What's next.

Page 4: Hadoop meetup zhemzhitsky

Real Time Bidding (RTB)

Diagram: publishers, SSP, ad networks, DSP, advertisers.

Page 5: Hadoop meetup zhemzhitsky

Data Management Platform (DMP)

Diagram: publishers, SSP, DSP, advertisers, DMP. Data feeding the DMP: tracking data, cookie syncs, access logs, partner's data, 3rd party data, click streams.

Page 6: Hadoop meetup zhemzhitsky

Typical data flows

Diagram: 3rd party data, Relational Data Store, raw data, Raw Data Store & Processing, RealTime Data Store, user profiles, aggregates.

Page 7: Hadoop meetup zhemzhitsky

Typical data flows :: RTB

Diagram: 3rd party data, Relational Data Store, RTB SRV, Exchange SSP (bid req., bid resp.), pixels :: impressions :: clicks, bid requests, user profiles, raw data, Raw Data Store & Processing, RealTime Data Store, aggregates.

Page 8: Hadoop meetup zhemzhitsky

1st-party data

Diagram: 1st-party data, 3rd party data, Relational Data Store, RTB SRV, Exchange SSP (bid req., bid resp.), pixels :: impressions :: clicks, bid requests, user profiles, raw data, Raw Data Store & Processing, RealTime Data Store, aggregates.

Page 9: Hadoop meetup zhemzhitsky

1st-party data

• Why monetize?

• How to monetize?

• What to monetize with?

Page 10: Hadoop meetup zhemzhitsky

Why monetize?

Find all users who took part in the “Star Wars” ad campaign
[and] saw one of the banners “Darth Vader” or “Luke Skywalker” within the last 6 days
[and] clicked on that banner
[and] visited the purchase page for Darth Vader's lightsaber
[but] never bought anything

in order to retarget them with a personalized banner offering a 40% discount on the lightsaber.

Page 11: Hadoop meetup zhemzhitsky

How to monetize?

find all users who have
taken part in campaign[s] “Star Wars”
[and] viewed banner[s] “Darth Vader” or “Luke Skywalker” during [last] 6 day[s]
[and] clicked banner[s] “Darth Vader's lightsaber”
[and] visited buying area of “Darth Vader's lightsaber”
[and] not visited order confirmed area of “Darth Vader's lightsaber”

Event sources: [cookies] [impression] [click] [tr. pixel] [tr. pixel]

id  cookie  event_id                    event_type  campaign_id  timestamp                …
1   c1      “Darth Vader”               impression  “Star Wars”  2015-04-20 14:25:11.462  …
2   c1      “Darth Vader's lightsaber”  click       “Star Wars”  2015-04-21 06:31:12.157  …
3   c1      “Darth Vader's lightsaber”  tr. pixel   “Star Wars”  2015-04-22 18:57:19.628  …

Page 12: Hadoop meetup zhemzhitsky

How to monetize?

find all users who have

map
0: taken part in campaign[s] “Star Wars” → (c1, 0)
1: viewed banner[s] “Darth Vader” or “Luke Skywalker” during [last] 6 day[s] → (c1, 1)
2: clicked banner[s] “Darth Vader's lightsaber” → (c1, 2)
3: visited buying area of “Darth Vader's lightsaber” → (c1, 3)
4: not visited order confirmed area of “Darth Vader's lightsaber” → Ø

reduce
(c1, 0;1;2;3)
true(0) and true(1) and true(2) and true(3) and not false(4)
→ c1
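
The map and reduce steps on this slide can be sketched in plain Python. This is only an illustration under stated assumptions, not the speaker's implementation: the event dictionaries follow the column names of the table on the previous slide, the `ORDER_CONFIRMED` event id is a hypothetical marker (the slides do not show how the two tracking pixels are told apart), and user `c2` is invented to show the exclusion clause.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SABER = "Darth Vader's lightsaber"
# Hypothetical id for the "order confirmed" tracking pixel (an assumption).
ORDER_CONFIRMED = "Darth Vader's lightsaber: order confirmed"


def make_rules(now):
    """Return the five clauses as predicates over one event (indices 0..4)."""
    since = now - timedelta(days=6)
    return [
        lambda e: e["campaign_id"] == "Star Wars",
        lambda e: (e["event_type"] == "impression"
                   and e["event_id"] in ("Darth Vader", "Luke Skywalker")
                   and e["timestamp"] >= since),
        lambda e: e["event_type"] == "click" and e["event_id"] == SABER,
        lambda e: e["event_type"] == "tr. pixel" and e["event_id"] == SABER,
        lambda e: e["event_type"] == "tr. pixel" and e["event_id"] == ORDER_CONFIRMED,
    ]


def map_phase(events, rules):
    """Emit (cookie, clause_index) for every clause an event satisfies."""
    for event in events:
        for index, rule in enumerate(rules):
            if rule(event):
                yield event["cookie"], index


def reduce_phase(pairs):
    """Group clause indices per cookie; keep cookies matching 0..3 but not 4."""
    matched = defaultdict(set)
    for cookie, index in pairs:
        matched[cookie].add(index)
    return sorted(cookie for cookie, hit in matched.items()
                  if {0, 1, 2, 3} <= hit and 4 not in hit)


def make_event(cookie, event_id, event_type, timestamp):
    return {"cookie": cookie, "event_id": event_id, "event_type": event_type,
            "campaign_id": "Star Wars", "timestamp": timestamp}


def sample_events():
    """c1 mirrors the table rows; c2 is an invented user who completed the order."""
    rows = []
    for cookie in ("c1", "c2"):
        rows += [
            make_event(cookie, "Darth Vader", "impression", datetime(2015, 4, 20, 14, 25, 11)),
            make_event(cookie, SABER, "click", datetime(2015, 4, 21, 6, 31, 12)),
            make_event(cookie, SABER, "tr. pixel", datetime(2015, 4, 22, 18, 57, 19)),
        ]
    rows.append(make_event("c2", ORDER_CONFIRMED, "tr. pixel", datetime(2015, 4, 22, 19, 5, 0)))
    return rows


def segment(events, now):
    """Run map then reduce and return the cookies that belong to the segment."""
    return reduce_phase(map_phase(events, make_rules(now)))
```

Calling `segment(sample_events(), datetime(2015, 4, 23))` keeps `c1` and drops `c2`, because `c2` also produced clause index 4.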

Page 13: Hadoop meetup zhemzhitsky


VS.

Page 14: Hadoop meetup zhemzhitsky

MR vs Spark :: The facts of life

• Stylish;

• Trendy;

• Youthful.

Page 15: Hadoop meetup zhemzhitsky

Spark :: Size

Page 16: Hadoop meetup zhemzhitsky

Before looking at Hadoop

Page 17: Hadoop meetup zhemzhitsky

Map-Reduce :: Size

Page 18: Hadoop meetup zhemzhitsky

Materials and tools

Hardware (3 Nodes)
• 12 Core AMD Opteron™ 6338P ~ 2.8 GHz
• 64 GB RAM
• 1 Gbps NICs

Software
• CDH 5.3.1 (Hadoop 2.5.0)
• Spark 1.2.0

Data
• 14.2 GB of raw data
• 61.1 M transactions
• 128 MB block size

Page 19: Hadoop meetup zhemzhitsky

MR vs Spark :: Execution time

Page 20: Hadoop meetup zhemzhitsky


Spark :: Exec-cores vs Num-execs

Page 21: Hadoop meetup zhemzhitsky

MR vs Spark :: Initialization

MR
• protected void setup(Context ctx)
• o.a.h.c.Configured
• distributed cache

Spark
• mapPartitions
• broadcast vars
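
The per-partition initialization idea can be illustrated without a cluster. The sketch below is plain Python standing in for Spark, not the Spark API itself: `map_partitions` is a hypothetical helper that mimics calling a function once per partition, `CampaignResolver` is an invented expensive-to-build resource, and `CAMPAIGN_NAMES` plays the role of a read-only broadcast table. In a real job these roles would presumably be taken by `RDD.mapPartitions` and `SparkContext.broadcast`.

```python
# Hypothetical read-only lookup table, standing in for a broadcast variable.
CAMPAIGN_NAMES = {"sw": "Star Wars"}


class CampaignResolver:
    """Invented resource that is costly to build, so it is built once per partition."""

    created = 0  # counts constructions, to make the once-per-partition effect visible

    def __init__(self, names):
        CampaignResolver.created += 1
        self.names = names

    def resolve(self, campaign_id):
        return self.names.get(campaign_id, "unknown")


def resolve_partition(records, names=CAMPAIGN_NAMES):
    """Per-partition function: initialize once (like Mapper.setup), then stream records."""
    resolver = CampaignResolver(names)
    for cookie, campaign_id in records:
        yield cookie, resolver.resolve(campaign_id)


def map_partitions(partitions, func):
    """Stand-in for mapPartitions: call func once with an iterator per partition."""
    return [list(func(iter(partition))) for partition in partitions]
```

With two partitions, `map_partitions(partitions, resolve_partition)` constructs the resolver twice regardless of how many records each partition holds.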

Page 22: Hadoop meetup zhemzhitsky

MR vs Spark :: Parallelism

MR
• mapred.reduce.tasks
• mapreduce.job.reduces
• splittable formats

Spark
• spark.default.parallelism
• num-executors, executor-cores on YARN
• numTasks in groupByKey, reduceByKey, aggregateByKey…
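
A rough back-of-the-envelope calculation shows how these settings interact. It assumes a splittable input with one split per 128 MB block and one task per executor core; the 3 executors with 4 cores each are hypothetical values chosen for illustration, not the settings used in the talk.

```python
import math


def input_splits(data_gb, block_mb):
    """Approximate number of input splits, assuming one split per full or partial block."""
    return math.ceil(data_gb * 1024 / block_mb)


def task_waves(num_tasks, num_executors, executor_cores):
    """Number of scheduling waves when each executor core runs one task at a time."""
    slots = num_executors * executor_cores
    return math.ceil(num_tasks / slots)


# 14.2 GB of raw data with a 128 MB block size, as listed under "Materials and tools".
splits = input_splits(14.2, 128)
# Hypothetical layout: 3 executors x 4 cores = 12 concurrent task slots.
waves = task_waves(splits, num_executors=3, executor_cores=4)
```

Under these assumptions the input yields 114 splits, which 12 slots process in 10 waves; more slots than splits would leave cores idle during the map stage.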

Page 23: Hadoop meetup zhemzhitsky

MR vs Spark :: Dependencies

MR
• o.a.h.u.Tool
• o.a.h.u.ToolRunner
• -conf app.conf
• -files
• -libjars
• setUserClassesTakesPrecedence

Spark
• --jars
• --files
• --conf
• --driver-java-options
• spark.driver.extraJavaOptions
• spark.executor.extraJavaOptions
• spark.driver.userClassPathFirst
• spark.executor.userClassPathFirst

Page 24: Hadoop meetup zhemzhitsky

MR vs Spark :: Secondary Sort

MR
• setSortComparatorClass
• setGroupingComparatorClass
• setPartitionerClass

Spark
• repartitionAndSortWithinPartitions
• mapPartitions
• Entire partition processing result must be able to fit in memory
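
The secondary sort pattern can be emulated in plain Python to show what the Spark bullets combine into. This is a sketch of the idea only, not Spark code: `repartition_and_sort` is a hypothetical helper that partitions by cookie alone and sorts each partition by (cookie, timestamp), and `events_per_cookie` then walks one sorted partition the way a per-partition function would. The CRC32-based partitioner and the sample records are assumptions.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def repartition_and_sort(records, num_partitions):
    """Partition by cookie only, then sort each partition by (cookie, timestamp)."""
    partitions = [[] for _ in range(num_partitions)]
    for cookie, timestamp, event_type in records:
        # Deterministic stand-in for a partitioner that looks at the cookie alone.
        index = zlib.crc32(cookie.encode("utf-8")) % num_partitions
        partitions[index].append((cookie, timestamp, event_type))
    for partition in partitions:
        partition.sort(key=itemgetter(0, 1))
    return partitions


def events_per_cookie(partition):
    """Stream one sorted partition, yielding each cookie with its events in time order."""
    for cookie, group in groupby(partition, key=itemgetter(0)):
        yield cookie, [event_type for _, _, event_type in group]


def ordered_events(records, num_partitions=2):
    """Collect the per-cookie, time-ordered event lists from all partitions."""
    result = {}
    for partition in repartition_and_sort(records, num_partitions):
        result.update(events_per_cookie(partition))
    return result
```

Because all events of a cookie land in the same partition and arrive already sorted, the per-cookie logic needs to hold only one cookie's events at a time instead of sorting a whole group in memory.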

Page 25: Hadoop meetup zhemzhitsky

MR vs Spark :: Testing

MR
• MRUnit
• o.a.h.h.MiniDFSCluster
• o.a.h.m.MiniMRCluster
• o.a.h.y.s.MiniYARNCluster
• o.a.h.m.v2.MiniMRYarnCluster

Spark
• Local executor

Page 26: Hadoop meetup zhemzhitsky

What's next and why Spark?

• Spark Streaming;

• Micro Batches;

• λ-architecture.

without major surgery

Page 27: Hadoop meetup zhemzhitsky

Thank you for your questions!

Page 28: Hadoop meetup zhemzhitsky

[email protected] :: [email protected]

cleverleaf.co.uk :: cleverdata.ru

1dmp.io :: crawler.1dmp.io

facebook.com/CleverData :: +7 (495) 967-66-50