S2Graph : A large-scale graph database with Hbase daumkakao

Apr 27, 2020


S2Graph : A large-scale graph database

with Hbase

daumkakao


Reference

1. HBase Conference 2015
   1. http://www.slideshare.net/HBaseCon/use-cases-session-5
   2. https://vimeo.com/128203919
2. Deview 2015
3. Apache Con BigData Europe
   1. http://sched.co/3ztM
4. Github: https://github.com/daumkakao/s2graph


Our Social Graph

[Figure: social graph in which each edge carries an affinity score. Edge types and their properties: Message, Write (length), Read, Coupon (price), Present (price: 3), Friend, Group (size: 6), Emoticon, Eat (rating), View (count), Play (level: 6), Style (share: 3), Advertise, Search (keyword), Listen (count), Like (count: 7), Comment.]


Our Social Graph

[Figure: the social graph with property values filled in. Edge types and properties: Message (length: 9), Write (length: 3), Friend, Play (level: 6), Style (share: 3), Advertise (ctr: 0.32), Search (keyword: "HBase"), Listen (count: 6), Comment (length: 15). Affinity scores on the edges: 6, 9, 3, 3, 4, 1, 2, 2, 9, 3. Target vertices: Message ID 201, Ad ID 603, Music ID 603, Item ID 13, Post ID 97, Game ID 1984.]


Technical Challenges

1. Large social graph constantly changing

a. Scale

more than:
- social network: 10 billion edges, 200 million vertices, 50 million updates on existing edges.
- user activities: over 1 billion new edges per day.


Technical Challenges (cont)

2. Low latency for breadth first search traversal on connected data.

a. performance requirement

- peak graph-traversing queries per second: 20000
- response time: 100ms


Technical Challenges (cont)

3. Realtime update capabilities for viral effects

[Figure: Person A, Person B, Person C and Person D linked by Post, Comment, Sharing and Mention actions; each link is labelled "Fast".]


Technical Challenges (cont)

4. Support for Dynamic Ranking logic

a. Push strategy: Hard to change data ranking logic dynamically.

b. Pull strategy: Enables user to try out various data ranking logics.


Before

Each app server has to know each DB's sharding logic. Highly inter-connected architecture.

[Figure: the Messaging App, SNS App and Blog App each connect directly to several data stores: Friend relationship, SNS feeds, Blog user activities, Messaging.]


After

[Figure: the SNS App, Blog App and Messaging App (stateless app servers) all connect to a single S2Graph DB.]


What is S2Graph?

daumkakao


What is S2Graph?

Storage-as-a-Service + Graph API = Realtime Breadth First Search


Example: Messenger Data Model

[Figure: a user "Participates" in a Chat Room, and the Chat Room "Contains" messages.]

Recent messages in my chat rooms.

SELECT b.*
FROM user_chat_rooms a, chat_room_messages b
WHERE a.user_id = 1
  AND a.chat_room_id = b.chat_room_id
  AND b.created_at >= yesterday


Example: Messenger Data Model

[Figure: a user "Participates" in a Chat Room, and the Chat Room "Contains" messages.]

Recent messages in my chat rooms. Each inner array under "steps" is one step of the traversal.

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "user_chat_rooms", "direction": "out", "limit": 100}],
    [{"label": "chat_room_messages", "direction": "out", "limit": 10, "where": "created_at >= yesterday"}]
  ]
}
'
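The request body above follows a regular shape, so it can be assembled programmatically. The following is a minimal sketch that only builds the JSON document; the helper name `build_query` is hypothetical and not part of S2Graph, and the field names are copied from the slide.

```python
import json


def build_query(user_id, steps, service="s2graph", column="user_id"):
    """Assemble a getEdges request body as a plain dict (hypothetical helper)."""
    return {
        "srcVertices": [{"serviceName": service, "columnName": column, "id": user_id}],
        # Each step is a list of query params; here one param per step.
        "steps": [[dict(param)] for param in steps],
    }


recent_messages = build_query(1, [
    {"label": "user_chat_rooms", "direction": "out", "limit": 100},
    {"label": "chat_room_messages", "direction": "out", "limit": 10,
     "where": "created_at >= yesterday"},
])
body = json.dumps(recent_messages, indent=2)
print(body)
```

The printed document can be passed to the `-d` argument of the curl command shown on the slide.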


Example: News Feed (cont)

[Figure: Friends create/like/share Post 1, Post 2 and Post 3.]

Posts that my friends interacted with.

SELECT a.*, b.*
FROM friends a, user_posts b
WHERE a.user_id = b.user_id
  AND b.updated_at >= yesterday
  AND b.action_type IN ('create', 'like', 'share')


Example: News Feed (cont)

[Figure: Friends create/like/share Post 1, Post 2 and Post 3.]

Posts that my friends interacted with.

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "friends", "direction": "out", "limit": 100}],
    [{"label": "user_posts", "direction": "out", "limit": 10, "where": "created_at >= yesterday"}]
  ]
}
'


Example: Recommendation (User-based CF) (cont)

[Figure: Similar Users (labelled "Batch") are linked to Product 1, Product 2 and Product 3 through user-product interaction (click/buy/like/share).]

Products that similar users interacted with recently.

SELECT a.*, b.*
FROM similar_users a, user_products b
WHERE a.sim_user_id = b.user_id
  AND b.updated_at >= yesterday


Example: Recommendation (User-based CF) (cont)

[Figure: Similar Users (labelled "Batch") are linked to Product 1, Product 2 and Product 3 through user-product interaction (click/buy/like/share).]

Products that similar users interacted with recently.

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "filterOut": {
    "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
    "steps": [[{"label": "user_products_interact"}]]
  },
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "similar_users", "direction": "out", "limit": 100, "where": "similarity > 0.2"}],
    [{"label": "user_products_interact", "direction": "out", "limit": 10, "where": "created_at >= yesterday and price >= 1000"}]
  ]
}
'
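The "filterOut" clause in the query appears to remove products the user has already interacted with from the recommendation result. A minimal sketch of that idea as a set difference, assuming this reading of the slide; the function name and the sample product ids are hypothetical.

```python
def filter_out(candidates, already_seen):
    """Drop candidates that appear in already_seen, keeping the original order."""
    seen = set(already_seen)
    return [product for product in candidates if product not in seen]


# Products reached through similar users, and products user 1 already touched.
from_similar_users = ["p1", "p2", "p3"]
my_products = ["p2"]
print(filter_out(from_similar_users, my_products))  # ['p1', 'p3']
```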


Example: Recommendation (Item-based CF) (cont)

[Figure: user-product interaction (click/buy/like/share) leads to Product 1, Product 2 and Product 3; each of them is linked to Similar Products (labelled "Batch").]

Products that are similar to those I have shown interest in.

SELECT a.*, b.*
FROM similar_ a, user_products b
WHERE a.sim_user_id = b.user_id
  AND b.updated_at >= yesterday


Example: Recommendation (Item-based CF) (cont)

[Figure: user-product interaction (click/buy/like/share) leads to Product 1, Product 2 and Product 3; each of them is linked to Similar Products (labelled "Batch").]

Products that are similar to those I have shown interest in.

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "user_products_interact", "direction": "out", "limit": 100, "where": "created_at >= yesterday and price >= 1000"}],
    [{"label": "similar_products", "direction": "out", "limit": 10, "where": "similarity > 0.2"}]
  ]
}
'


Example: Recommendation (Content + Most popular) (cont)

[Figure: user-product interaction (click/buy/like/share) reaches Product 1, Product 2 and Product 3, which belong to Category 1 and Category 2. Each category keeps a TopK (k=1) product per timeUnit (day): Product 10 and Product 20 appear under Today and Yesterday.]

Daily top product per category among the products that I liked.

SELECT c.*
FROM user_products a, product_categories b, category_daily_top_products c
WHERE a.user_id = 1
  AND a.product_id = b.product_id
  AND b.category_id = c.category_id
  AND c.time BETWEEN yesterday AND today


Example: Recommendation (Content + Most popular) (cont)

[Figure: user-product interaction (click/buy/like/share) reaches Product 1, Product 2 and Product 3, which belong to Category 1 and Category 2. Each category keeps a TopK (k=1) product per timeUnit (day): Product 10 and Product 20 appear under Today and Yesterday.]

Daily top product per category among the products that I liked.

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "user_products_interact", "direction": "out", "limit": 100, "where": "created_at >= yesterday and price >= 1000"}],
    [{"label": "product_cates", "direction": "out", "limit": 3}],
    [{"label": "category_products_topK", "direction": "out", "limit": 10}]
  ]
}
'
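The slide does not say how the per-category TopK edges are produced. As an illustration only, the sketch below counts interactions per (category, day) bucket in memory and keeps the k most frequent products; the event tuples and the function name are assumptions, not S2Graph's counter implementation.

```python
from collections import Counter, defaultdict


def top_k_per_day(events, k=1):
    """events: iterable of (category, day, product) tuples (hypothetical format)."""
    per_bucket = defaultdict(Counter)
    for category, day, product in events:
        per_bucket[(category, day)][product] += 1
    # Keep only the k most frequent products in each (category, day) bucket.
    return {bucket: [product for product, _ in counts.most_common(k)]
            for bucket, counts in per_bucket.items()}


events = [
    ("category1", "today", "product10"),
    ("category1", "today", "product10"),
    ("category1", "today", "product20"),
    ("category2", "today", "product20"),
]
top = top_k_per_day(events, k=1)
print(top)
```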


Example: Recommendation (Spreading Activation) (cont)

[Figure: user-product interaction (click/buy/like/share) with Product 1, Product 2 and Product 3.]

Products interacted with by users who interacted with the products that I interact with.

SELECT b.product_id, count(*)
FROM user_products a, user_products b
WHERE a.user_id = 1
  AND a.product_id = b.product_id
GROUP BY b.product_id


Example: Recommendation (Spreading Activation) (cont)

[Figure: user-product interaction (click/buy/like/share) with Product 1, Product 2 and Product 3.]

Products interacted with by users who interacted with the products that I interact with.

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "user_products_interact", "direction": "out", "limit": 100, "where": "created_at >= yesterday and price >= 1000"}],
    [{"label": "user_products_interact", "direction": "in", "limit": 10, "where": "created_at >= today"}],
    [{"label": "user_products_interact", "direction": "out", "limit": 10, "where": "created_at >= 1 hour ago"}]
  ]
}
'
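The three steps (out, in, out) over the same label can be pictured with a small in-memory sketch. The toy `interactions` data and the function name are made up for illustration, and the sketch ignores the limits and time filters from the query; it only shows how products end up with counts.

```python
from collections import Counter

# Hypothetical toy data: user id -> set of products the user interacted with.
interactions = {
    1: {"p1", "p2"},
    2: {"p1", "p3"},
    3: {"p2", "p3", "p4"},
    4: {"p5"},
}


def spreading_activation(user, interactions):
    mine = interactions[user]                      # step 1 (out): my products
    neighbours = [u for u, products in interactions.items()
                  if u != user and products & mine]  # step 2 (in): users who touched them
    # Step 3 (out): products of those users, counted per product.
    return Counter(p for u in neighbours for p in interactions[u])


scores = spreading_activation(1, interactions)
print(scores.most_common(1))  # [('p3', 2)]
```

Here each neighbouring user is counted once; a traversal that counts every path separately would weight users reached through several shared products more heavily.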


Realization

1. These examples resemble graphs.
2. An object is a Vertex, a relationship is an Edge.
3. Necessary APIs: breadth first search on a large-scale graph.


S2Graph API: Vertex

Vertex:
1. insert, delete, getVertex
2. vertex id: whatever the user provides (string/int/long)

ID      1231-123
Prop1   Val1
Prop2   Val2
…       …


S2Graph API: Edge

Edges:
1. insert, delete, update, getEdge (like CRUD in an RDBMS)
2. Edge reference: (from, to, label, direction)
3. Multiple props on an edge.
4. All edges are ordered (details follow).

Edge Reference   1, 101, "friend", "out"
Prop1            Val1
Prop2            Val2
…                …


S2Graph API: indices

Row key           Degree   Q1      Q2      Q3
1-friend-out-PK   3        c-103   b-102   a-101

Ordered (DESC)

[Figure: vertex 1 has "friend" edges to vertices 101 (Name: a), 102 (Name: b) and 103 (Name: c).]

Indices:
1. addIndex, createIndex
2. Automatically keep edges ordered for multiple indices.
3. Support int/long/float/string data types.

class Index { // define how to order edges.
  String indexName;
  List[Prop] indexProps;
}
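To make "automatically keep edges ordered" concrete, here is a minimal in-memory sketch that keeps (index value, target id) pairs sorted on insert and reads them back in descending order. It is an illustration with Python's `bisect` module, not how S2Graph stores edges in HBase; the class name `OrderedEdges` is hypothetical.

```python
import bisect


class OrderedEdges:
    """Keeps (index_value, target_id) pairs sorted so reads need no extra sort."""

    def __init__(self):
        self._edges = []  # ascending list of (index_value, target_id)

    def insert(self, index_value, target_id):
        bisect.insort(self._edges, (index_value, target_id))

    def top(self, limit):
        # Read from the end of the ascending list to get descending order.
        return self._edges[::-1][:limit]


friends = OrderedEdges()
for name, target in [("a", 101), ("c", 103), ("b", 102)]:
    friends.insert(name, target)
print(friends.top(2))  # [('c', 103), ('b', 102)]
```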


S2Graph API: Query

Query: getEdges, countEdges, removeEdges

class Query { // Define breadth first search
  List[VertexId] startVertices;
  List[Step] steps;
}
class Step { // Define one breadth
  List[QueryParam] queryParams;
}
class QueryParam { // Define each edge to traverse for the current breadth
  String label;
  String direction;
  Map options;
}

[Figure: a Query consists of Step1 and Step2; each Step contains QueryParam entries.]
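The Query / Step / QueryParam structure can be mirrored in a few lines to show how a step-wise breadth first search proceeds. This is a sketch over a hypothetical in-memory adjacency dict; the `run` function, the `edges` layout and the default limit of 10 are assumptions made for the example and do not describe the S2Graph engine.

```python
from dataclasses import dataclass, field


@dataclass
class QueryParam:
    label: str
    direction: str = "out"
    options: dict = field(default_factory=dict)


@dataclass
class Step:
    query_params: list


@dataclass
class Query:
    start_vertices: list
    steps: list


def run(query, edges):
    """edges: dict mapping (vertex, label, direction) -> ordered list of targets."""
    frontier = list(query.start_vertices)
    for step in query.steps:
        next_frontier = []
        for vertex in frontier:
            for param in step.query_params:
                limit = param.options.get("limit", 10)  # assumed default
                targets = edges.get((vertex, param.label, param.direction), [])
                next_frontier.extend(targets[:limit])
        frontier = next_frontier  # the next step starts from this step's results
    return frontier


edges = {
    (1, "friends", "out"): [2, 3],
    (2, "user_posts", "out"): ["post-a", "post-b"],
    (3, "user_posts", "out"): ["post-c"],
}
query = Query([1], [
    Step([QueryParam("friends", options={"limit": 100})]),
    Step([QueryParam("user_posts", options={"limit": 1})]),
])
result = run(query, edges)
print(result)  # ['post-a', 'post-c']
```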


What is S2Graph

Storage-as-a-Service + Graph API = Realtime Breadth First Search

S2Graph is Not

- It does not support global computation (unlike Apache Giraph or GraphX).
- It does not support graph algorithms such as PageRank or shortest path.


Why S2Graph: Push vs Pull. Feeds with Push

1. Only the timestamp can be used for scoring.
2. Hard to change the scoring function dynamically.

[Figure: a Post/Like is written (fanout) to the Feed Queue of each of the Friends.]

Write     # of friends
Read      O(1) for friends
Storage   AVG(# of friends) * total user activity
Query     O(1)


Why S2Graph: Push vs Pull. Feeds with Pull

1. Different weights for different action types: Like = 0.8, Click = 0.1, …
2. The client can change scoring dynamically.

[Figure: a Post/Like is stored once and read through Friends at query time.]

Write     O(1)
Read      None
Storage   total user activity
Query     O(1) for friends + O(# of friends)
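With pull, the ranking is computed at read time, so the weights can be supplied per request. A minimal sketch of that idea follows; the Like and Click weights come from the slide, while the Share weight, the activity tuples and the function name are made up for illustration.

```python
def rank_feed(activities, weights, limit=10):
    """Score posts by summing the weight of each friend action on them."""
    scores = {}
    for post_id, action in activities:
        scores[post_id] = scores.get(post_id, 0.0) + weights.get(action, 0.0)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:limit]


activities = [("post1", "like"), ("post2", "click"),
              ("post2", "click"), ("post3", "share")]

# Like = 0.8 and Click = 0.1 are from the slide; Share = 0.5 is an assumed value.
default_ranking = rank_feed(activities, {"like": 0.8, "click": 0.1, "share": 0.5})
# The same stored activities, re-ranked with different weights at query time.
click_heavy = rank_feed(activities, {"like": 0.1, "click": 0.8})
print([post for post, _ in default_ranking])
print([post for post, _ in click_heavy])
```

The stored activities never change between the two calls; only the scoring function does, which is the flexibility the push model lacks.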


Why S2Graph: S2Graph Supports Pull + Push

Pull >> push only if
1. fast response time: 10 ~ 100ms
2. throughput: 10K ~ 20K QPS

S2Graph provides linear scalability on
1. the number of machines.
2. the BFS search space (how many edges a single query will traverse).

More detail in the benchmark section later.


Why S2Graph: Simplify Data Flow

[Figure: data flow between S2Graph (Write API + Query DSL), a WAL log, Apache Spark (Batch Computing Layer) running User/Item Similarity, TopK Counter and other jobs, and the S2Graph Bulk Loader. Annotations on the figure: "OpenSourced" and "will be open sourced soon".]


Why S2Graph: Built in A/B test

1. Register a query template: each query template has an impressionId.
2. Insert click/impression events into S2Graph as edge inserts.


Why S2Graph: Just Insert Edge

S2Graph

1. user activity history.
2. friends feed.
3. user-item based collaborative filtering.
4. topK ranking (most popular, segmented most popular).

And many, many more. Just think of your service as a graph model.


S2Graph Internal

daumkakao


How to store the data - Edge (Logical View)

1. Snapshot edges: up-to-date status of an edge
   a. Fetching an edge between two specific vertices
   b. Lookup table to reach indexed edges for update, increment, delete operations

                    Tgt Vertex ID1   Tgt Vertex ID2   Tgt Vertex ID3
   Src Vertex ID1   Properties       Properties       Properties
   Src Vertex ID2   Properties       Properties       Properties

2. Indexed edges: edges with index
   a. Fetches edges originating from a certain vertex in order of index

                    Index Values | Tgt Vertex ID1   Index Values | Tgt Vertex ID2
   Src Vertex ID1   Non-index Properties            Non-index Properties


How to store the data - Vertex (Logical View)

1. Vertex: up-to-date status of a vertex

   row \ column     Property Key1   Property Key2
   Src Vertex ID1   Value1          Value2
   Vertex ID2       Value1          Value2


Problem

Update/Delete edge is hard.
- It is not feasible to traverse all edges to find the edges to delete.
- indexedEdge is ordered, so in the worst case the client needs to fetch all edges to find the edge to delete.
- This makes delete/update of an edge O(N).

Backtracking from snapshotEdge
- Read the snapshot edge, then use random access to the indexed edge to delete it. O(1)

Problem
- Need atomicity when updating the snapshot edge and the indexed edge.
- This requires an atomic operation (transaction) on multiple rows in HBase.
- Sometimes a partial failure of this update yields broken states, such as zombie edges.
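The backtracking idea can be sketched with two dictionaries: the snapshot remembers which index value an edge is currently stored under, so the stale indexed entry can be removed directly instead of scanning. This is an in-memory illustration only; the `EdgeStore` class and its layout are hypothetical, and it performs the two writes without any atomicity, which is exactly the problem the slide raises.

```python
class EdgeStore:
    def __init__(self):
        self.snapshot = {}  # (src, tgt) -> index value the edge is stored under
        self.indexed = {}   # src -> {(index_value, tgt): props}

    def upsert(self, src, tgt, index_value, props):
        row = self.indexed.setdefault(src, {})
        old = self.snapshot.get((src, tgt))
        if old is not None:
            row.pop((old, tgt), None)  # direct delete, no scan over the row
        row[(index_value, tgt)] = props
        self.snapshot[(src, tgt)] = index_value  # second write, not atomic with the first

    def delete(self, src, tgt):
        old = self.snapshot.pop((src, tgt), None)
        if old is not None:
            self.indexed[src].pop((old, tgt), None)


store = EdgeStore()
store.upsert(1, 101, "a", {"age": 15})
store.upsert(1, 101, "d", {"age": 26})  # replaces the entry indexed under "a"
print(list(store.indexed[1]))  # [('d', 101)]
```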


How to update edge

IndexedEdge 1-friend-out-PK (Degree 3)
  Q1 c-103: age:30, gender:M
  Q2 b-102: age:21
  Q3 a-101: age:15, gender:F

SnapshotEdge 1-friend-out-PK (Degree 3)
  Q1 103: name:c:t0, age:30:t0, gender:M:t0
  Q2 102: name:b:t0, age:21:t0
  Q3 101: name:a:t0, age:15:t0, gender:F:t0

curl -XPOST localhost:9000/graphs/edges/insert -H 'Content-Type: application/json' -d '[
  {"timestamp": t0, "from": 1, "to": 101, "label": "friend", "props": {"name": "a", "age": 15, "gender": "F"}},
  {"timestamp": t0, "from": 1, "to": 102, "label": "friend", "props": {"name": "b", "age": 21}},
  {"timestamp": t0, "from": 1, "to": 103, "label": "friend", "props": {"name": "c", "age": 30, "gender": "M"}}
]'


How to update edge (cont)

curl -XPOST localhost:9000/graphs/edges/update -H 'Content-Type: application/json' -d '[
  {"timestamp": t-1, "from": 1, "to": 101, "label": "friend", "props": {"name": "k", "age": -10}},
  {"timestamp": t1, "from": 1, "to": 101, "label": "friend", "props": {"name": "d", "age": 26}}
]'

1. Fetch SnapshotEdge
2. Check pending mutations and retry
3. Build update on Snapshot/Indexed Edge
4. CAS on new SnapshotEdge
5. Mutate indexedEdge
6. CAS on new SnapshotEdge

If pending mutations exist, another thread is mutating this edge, so commit the pending mutations and retry. If the CAS fails at 4, another thread has locked this edge, so retry.

SnapshotEdge 1-friend-out-PK (Degree 3)
  Q1 103: name:c:t0, age:30:t0, gender:M:t0
  Q2 102: name:b:t0, age:21:t0
  Q3 101: name:a:t0, age:15:t0, gender:F:t0

SnapshotEdge 1-friend-out-PK (Degree 3)
  Q1 103: name:c:t0, age:30:t0, gender:M:t0
  Q2 102: name:b:t0, age:21:t0
  Q3 101: name:a:t0 name:d:t1, age:15:t0 age:26:t1, gender:F:t0

IndexedEdge: delete(1, (a-101)) insert(1, (d-101))


How to update edge (cont)

IndexedEdge 1-friend-out-PK (Degree 3)
  Q0 d-101: age:26, gender:F
  Q1 c-103: age:30, gender:M
  Q2 b-102: age:21
  Q3 a-101: age:15, gender:F

SnapshotEdge 1-friend-out-PK (Degree 3)
  Q1 103: name:c:t0, age:30:t0, gender:M:t0
  Q2 102: name:b:t0, age:21:t0
  Q3 101: name:a:t0 name:d:t1, age:15:t0 age:26:t1, gender:F:t0

IndexedEdge: delete(1, (a-101)) insert(1, (d-101))

1. Fetch SnapshotEdge
2. Apply mutations stored in SnapshotEdge if any exist
3. Build update on Snapshot/Indexed Edge
4. CAS on new SnapshotEdge
5. Mutate indexedEdge
6. CAS on new SnapshotEdge

If any failure occurs at 5, abort and retry from 1. It is safe to issue the same mutation multiple times since s2graph is idempotent.


How to update edge (cont)

IndexedEdge 1-friend-out-PK (Degree 3)
  Q0 d-101: age:26, gender:F
  Q1 c-103: age:30, gender:M
  Q2 b-102: age:21

SnapshotEdge 1-friend-out-PK (Degree 3)
  Q1 103: name:c:t0, age:30:t0, gender:M:t0
  Q2 102: name:b:t0, age:21:t0
  Q3 101: name:d:t1, age:26:t1, gender:F:t0

1. Fetch SnapshotEdge
2. Apply mutations stored in SnapshotEdge if any exist
3. Build update on Snapshot/Indexed Edge
4. CAS on new SnapshotEdge
5. Mutate indexedEdge
6. CAS on new SnapshotEdge

If the CAS fails at 6, retry from 1.


How to update edge (cont)

1. Fetch SnapshotEdge
2. Build update on Snapshot/Indexed Edge
3. CAS on new SnapshotEdge
4. Mutate indexedEdge
5. CAS on new SnapshotEdge

IndexedEdge 1-friend-out-PK (Degree 3)
  Q0 d-101: age:26, gender:F
  Q1 c-103: age:30, gender:M
  Q2 b-102: age:21

SnapshotEdge 1-friend-out-PK (Degree 3)
  Q1 103: name:c:t0, age:30:t0, gender:M:t0
  Q2 102: name:b:t0, age:21:t0
  Q3 101: name:d:t1, age:26:t1, gender:F:t0
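The fetch, build, compare-and-set, retry cycle can be sketched in memory. The example below covers only the retry loop around a single CAS; it leaves out the indexed-edge mutation and the second CAS from the slides. `SnapshotCell`, `update_with_retry` and `apply_if_newer` are hypothetical names, and the lock merely stands in for the atomicity that a real check-and-put on the store would provide.

```python
import threading


class SnapshotCell:
    """In-memory stand-in for one snapshot edge value with compare-and-set."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        return self._value

    def compare_and_set(self, expected, new):
        with self._lock:
            if self._value != expected:
                return False  # someone else changed the value in between
            self._value = new
            return True


def update_with_retry(cell, build_update, max_retries=5):
    for _ in range(max_retries):
        current = cell.get()                    # fetch the snapshot edge
        new = build_update(current)             # build the update
        if cell.compare_and_set(current, new):  # CAS; on failure, start over
            return new
    raise RuntimeError("too many CAS failures")


def apply_if_newer(name, age, ts):
    """Build an update that only wins when its timestamp is newer."""
    def build(current):
        return (name, age, ts) if ts > current[2] else current
    return build


cell = SnapshotCell(("a", 15, 0))                      # (name, age, timestamp)
update_with_retry(cell, apply_if_newer("k", -10, -1))  # stale timestamp, ignored
update_with_retry(cell, apply_if_newer("d", 26, 1))    # newer timestamp, applied
print(cell.get())  # ('d', 26, 1)
```

Because an update with an older timestamp leaves the value unchanged, re-issuing the same request is harmless, which matches the idempotency remark in the slides.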


Summary

No atomic update between SnapshotEdge and IndexedEdge.
- This can cause an update/delete on an edge to fail, which is a problem.

Then should we use transactions?

Pros: solves our problem; no failure on update/delete of an edge.
Cons: needs one more read even when there is no contention.

The probability of contention on an edge is very low, so we stick to our retry logic with CAS.


Benchmarks

daumkakao


HBase Table Configuration

1. setDurability(Durability.ASYNC_WAL)

2. setCompressionType(Compression.Algorithm.LZ4)

3. setBloomFilterType(BloomType.ROW)

4. setDataBlockEncoding(DataBlockEncoding.FAST_DIFF)

5. setBlockSize(32768)

6. setBlockCacheEnabled(true)

7. Pre-split by (Int.MaxValue / regionCount), with regionCount = 120 when creating the table (on 20 region servers).
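The pre-split arithmetic in item 7 can be worked through as follows. This sketch only computes evenly spaced split points over the 32-bit signed integer range; how the resulting values are encoded into HBase row keys is not described on the slide and is not shown here.

```python
INT_MAX = 2**31 - 1  # 2147483647, the largest 32-bit signed integer


def split_points(region_count):
    """Evenly spaced boundaries that divide [0, INT_MAX] into region_count ranges."""
    step = INT_MAX // region_count
    return [step * i for i in range(1, region_count)]


points = split_points(120)
print(len(points), points[0], points[-1])
```

With 120 regions on 20 region servers, each server starts with 6 regions.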


HBase Cluster Configuration

• each machine: 8core, 32G memory, SSD

• hfile.block.cache.size: 0.6

• hbase.hregion.memstore.flush.size: 128MB

• otherwise use default value from CDH 5.3.1

• s2graph rest server: 4core, 16G memory


Performance

1. Total # of edges: 100,000,000,000 (100,000,000 rows x 1000 columns)
2. Test environment
   a. Zookeeper server: 3
   b. HBase Master server: 2
   c. HBase Region server: 20
   d. App server: 4 cores, 16GB RAM
   e. Write traffic: 5K / second


Performance

2. Linear scalability

- Benchmark query: src.out("friend").limit(100).out("friend").limit(10)
- Total concurrency: 20 * # of app servers

# of app servers   QPS (queries per second)   Latency (ms)
1                  464                        43
2                  885                        45
4                  1,763                      45
8                  3,491                      46

[Figure: QPS and latency plotted against the number of app servers (1 to 8).]


Performance

3. Varying width of traverse (tested with a single server)

- Benchmark query: src.out("friend").limit(x).out("friend").limit(10)
- Total concurrency = 20 * 1 (# of app servers)

Limit on first step   QPS     Latency (ms)
20                    1,821   11
40                    1,023   19
80                    570     35
200                   237     84
400                   122     164
800                   61      327
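The chart values for this test are roughly what one would expect if each of the 20 concurrent clients sends its next query as soon as the previous one returns, in which case throughput is about concurrency divided by latency. The sketch below checks that relationship against the latency and QPS labels read from the chart; the closed-loop assumption is mine and is not stated on the slide.

```python
concurrency = 20  # single app server, as stated on the slide

# Values as read from the chart labels, ordered by first-step limit 20..800.
latency_ms = [11, 19, 35, 84, 164, 327]
reported_qps = [1821, 1023, 570, 237, 122, 61]

estimates = [concurrency / (lat / 1000) for lat in latency_ms]
for lat, est, qps in zip(latency_ms, estimates, reported_qps):
    print(f"{lat:>4} ms  estimated {est:7.0f} QPS  reported {qps}")
```

Each estimate is within a few percent of the reported figure, so doubling the first-step limit roughly doubles latency and halves QPS.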


Performance

4. Different query path (different I/O pattern)

- All queries touch 1000 edges.
- Each step's limit is on the x axis.
- Performance can be estimated from a given query's search space.

Limits on path              QPS     Latency (ms)
10 -> 100                   695     14
100 -> 10                   435.3   23
10 -> 10 -> 10              274.4   36
2 -> 5 -> 10 -> 10          292.1   34
2 -> 5 -> 2 -> 5 -> 10      307.5   32


Performance

5. Write throughput per operation on a single app server

Insert operation

[Figure: insert latency (axis 0 to 5) against requests per second (axis ticks 8000, 16000, 800000).]


Performance

6. Write throughput per operation on a single app server

Update (increment/update/delete) operation

[Figure: update latency (axis 0 to 8) against requests per second (axis ticks 2000, 4000, 6000).]


Stats

1. HBase cluster per IDC (2 IDCs)
   - 3 Zookeeper servers
   - 2 HBase Masters
   - (20 + 40) HBase Slaves

2. App servers per IDC
   - 10 servers for write only
   - 30 servers for query only

3. Real traffic
   - read: 10K ~ 20K requests per second
     - now mostly 2-step queries with limit 100 on the first step.
   - write: over 5K ~ 10K requests per second


Through S2Graph !


Now Available As an Open Source
- https://github.com/daumkakao/s2graph
- Finding contributors and mentors

Contact
- Doyoung Yoon : [email protected]