Top Banner
1 Yahoo’s Next Generation User Profile Platform Kai Liu, Lu Niu Yahoo Inc.
34

Yahoo’s next generation user profile platform

Apr 14, 2017

Download

Kai Liu
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Yahoo’s next generation user profile platform

1

Yahoo’s Next Generation User Profile Platform

Kai Liu, Lu Niu

Yahoo Inc.

Page 2: Yahoo’s next generation user profile platform

2

Agenda

- What is User Profile - Architecture Evolution - Schema Design- Optimization- Future Work

Page 3: Yahoo’s next generation user profile platform

3

Agenda

- What is User Profile - Definition- Use Cases- Logical View- User ID Type

- Architecture Evolution - Schema Design- Optimization- Future Work

Page 4: Yahoo’s next generation user profile platform

4

What is User ProfileA User Profile is a visual display of personal data associated with a specific user.(Wikipedia)

Page 5: Yahoo’s next generation user profile platform

5

Use Cases

Page 6: Yahoo’s next generation user profile platform

6

Logical View

Page 7: Yahoo’s next generation user profile platform

7

User ID Type

- Desktop- BID: for anonymous users- SID: for registered users

- Mobile- IDFA: for iOS devices- GPSAID: for Android devices

Page 8: Yahoo’s next generation user profile platform

8

Agenda

- What is User Profile - Architecture Evolution

- Old architecture- Problems- New architecture

- Schema Design- Optimization- Future Work

Page 9: Yahoo’s next generation user profile platform

9

Classic Architecture of Data System

Data Preparation

(ETL)

Computation(Hadoop)

Deep Storage(HDFS)

Page 10: Yahoo’s next generation user profile platform

10

Old Architecture

HDFS(full) Hive

AggregationETLBatch Data(daily, hourly,

minutely)

Ad Serving

HDFS(incre)

1 day 1 day Modeling

Insights

Page 11: Yahoo’s next generation user profile platform

11

Problems

- Aggregation is very expensive- HDFS follows Write Once Read Many approach. - Actually only ~30% of users get updates every day.

- Impossible to support multiple update frequencies- Lack of capability to process event stream

Page 12: Yahoo’s next generation user profile platform

12

- Spark- Fast- Consistent stack (batch/streaming)

- HBase- Random read/write capabilities- Flexible schema

- Hive- Large scale ad-hoc query engine- SQL like interface

New Architecture Components

Page 13: Yahoo’s next generation user profile platform

13

New Architecture

HBase

Hive

HDFS

Kafka

Batch Data

Stream Data

10 mins - 1 day

1 sec - 10 minsAd Serving

Spark Batch

Spark Streaming

Modeling

Insights

Page 14: Yahoo’s next generation user profile platform

14

How problems get solved

- Incremental updates avoid full data load. - Multiple Spark jobs with different frequencies running

concurrently.- Spark streaming for event stream processing.

Page 15: Yahoo’s next generation user profile platform

15

Agenda

- What is User Profile - Architecture Evolution - Schema Design

- Understand the data- Table design

- Optimization- Future Work

Page 16: Yahoo’s next generation user profile platform

16

Understand the data

* (1) Ad serving; (2) User Modeling; (3) Audience Insights

Split user profile into multiple HBase tables.

Data Type Update Pattern Use Cases

Properties K/V pairs Overwrite (1)(2)(3)

Events Time Series Append only (3)

Segments List of K/V pairs Read-Modify-Write (1)(3)

Features Hybrid Overwrite + Read-Modify-Write (1)(2)

Page 17: Yahoo’s next generation user profile platform

17

HBase Data Model

Page 18: Yahoo’s next generation user profile platform

18

Table Design - Properties

Row KeyColumn Family: Properties

c: age c: gender c: device1 c: device2 …...

0_284386766

1_1877933007

id_type + user_id

val 1 val 2 val 3

Page 19: Yahoo’s next generation user profile platform

19

Table Design - Events

Row KeyColumn Family: Events

c: event

0_284386766_1463848639

0_284386766_1463935039

id_type + user_id + event_type + timestamp

valueRows are sorted by timestamp

Page 20: Yahoo’s next generation user profile platform

20

Table Design - Segments

Row KeyColumn Family: Segments

c: type1 c: type2 c: type3 …...

0_284386766

1_1877933007

id_type + user_id

* Different segments in different column to avoid atomic operation

value

Page 21: Yahoo’s next generation user profile platform

21

Properties Events

Query “Get age, gender of user A” “Get events of user A from 05/21/2016 to 05/22/2016”

Write Pattern ❏ Write only❏ Keep multiple versions

❏ Append only❏ Use TTL to auto-remove records

Rollback❏ Set TIMERANGE to

fetch last version in application layer

❏ Filtered out bad records in application layer

❏ Deletion based on timestamp if necessary

Different Access Patterns

Page 22: Yahoo’s next generation user profile platform

22

Agenda

- What is User Profile - Architecture Evolution - Schema Design- Optimization

- Pre-split tables- Pre-aggregation in Spark- Lazy aggregation for inactive users- Sequential read on Hive

- Future Work

Page 23: Yahoo’s next generation user profile platform

23

Pre-Split Tables

Page 24: Yahoo’s next generation user profile platform

24

Pre-Split Tables- Data Skew: User data is not evenly distributed across different id types - Pre-split tables based on data distribution

{SPLITS =>["\x00\x00\x00\x01\x50", "\x00\x00\x00\x01\xA0", "\x00\x00\x00\x02\x00", "\x00\x00\x00\x02\x40", "\x00\x00\x00\x02\x80", "\x00\x00\x00\x02\xC0", , "\x00\x00\x00\x03\x00", "\x00\x00\x00\x04\x00"]

}

Page 25: Yahoo’s next generation user profile platform

25

- 1 Billion native ads events per day on 0.1 Billion users - Group by (user id, time interval)- Reduce the writes by 10X

Pre-Aggregate events in Spark

Page 26: Yahoo’s next generation user profile platform

26

Pre-Aggregate features in Spark- 5 Billion app activities per day on 0.5 Billion devices- 1 Billion search keywords per day on 0.06 Billion devices- Aggregate on user id for both features. One Spark job instead of two.

Page 27: Yahoo’s next generation user profile platform

27

Lazy aggregation for inactive users

- Problem: read-modify-write is expensive- Facts:

- A large portion of the users might not be accessed frequently- Update jobs are not evenly distributed over time

- Solution: Lazy aggregation for inactive users

Page 28: Yahoo’s next generation user profile platform

28

- Maintain a set of users as active users- Active users

- read-modify-write- Inactive users

- Append updates only- Merging updates:

- Batch job- Upon request

Lazy aggregation for inactive users

Spark

r-m-w

w

HBase

r-m-w

Active Users

Inactive Users

update1

update2

Page 29: Yahoo’s next generation user profile platform

29

Sequential read on Hive

- HBase to Hive- Sync data to Hive using HBase snapshots without

impact Region Servers. - Hive access the data using HBaseStorageHandler.

- Move sequential reads to Hive- User modeling- Audience insights

Page 30: Yahoo’s next generation user profile platform

30

Agenda

- What is User Profile - Architecture Evolution - Schema Design- Optimization- Future Work

Page 31: Yahoo’s next generation user profile platform

31

Future Work

- Explore Impala/Presto for better query performance; - Expose API for incremental modeling capability.

Page 32: Yahoo’s next generation user profile platform

32

Questions?

Page 33: Yahoo’s next generation user profile platform

33

Appendix

Page 34: Yahoo’s next generation user profile platform

34

More optimization- Less column family as possible- Turn off autoflush- Throttling writes if necessary- Compress data before sending to Hbase- Kryo for serialization