Realtime Analytics with Apache Cassandra Tom Wilkie Founder & CTO, Acunu Ltd @tom_wilkie


Jan 24, 2015


The latest version of my talk, as given at the NoSQL Roadshow Amsterdam, 29th Nov 2012
Transcript
Page 1: Realtime Analytics with Apache Cassandra

Realtime Analytics with Apache Cassandra

Tom Wilkie, Founder & CTO, Acunu Ltd

@tom_wilkie

Page 2: Realtime Analytics with Apache Cassandra

Analytics

101

• BigTable-style data model combined with Dynamo-style consistency

• Simple queries: put, get, range queries

• Multi-master architecture: no SPOF

• Tunable consistency, multi-DC aware

• Optimised for random writes & random range queries

• Atomic counters, wide rows, composite keys

Page 3: Realtime Analytics with Apache Cassandra

Analytics

Live & historical aggregates... Trends... Drill downs and roll ups

Combining “big” and “real-time” is hard


Page 4: Realtime Analytics with Apache Cassandra

Analytics

Solution Con

Scalability $$$

Not realtime

Spartan query semantics => complex, DIY solutions

Page 5: Realtime Analytics with Apache Cassandra

Analytics

Example I

e.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”

Batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data

http://blog.twitter.com/2011/03/numbers.html

Page 6: Realtime Analytics with Apache Cassandra

Analytics

Okay, so how are we going to do it?

For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters


Page 7: Realtime Analytics with Apache Cassandra

Analytics

Preparing the data

Step 1: Get a feed of the tweets

12:32:15 I like #trafficlights
12:33:43 Nobody expects...
12:33:49 I ate a #bee; woe is...
12:34:04 Man, @acunu rocks!

Step 2: Tokenise the tweet

Step 3: Increment counters in time buckets for each token

[1234, man] +1
[1234, acunu] +1
[1234, rock] +1
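The three preparation steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regex tokeniser, the `time_bucket` and `ingest` helpers, and the in-memory `Counter` (standing in for Cassandra counter columns) are all hypothetical. There is no stemming here, so the token comes out as "rocks" rather than "rock".

```python
import re
from collections import Counter

# Hypothetical tokeniser: lowercase alphanumeric runs, which also strips
# the leading '#' and '@' from hashtags and mentions.
TOKEN_RE = re.compile(r"[a-z0-9_]+")


def tokenise(text):
    return TOKEN_RE.findall(text.lower())


def time_bucket(timestamp):
    # "12:34:04" -> "1234": one bucket per minute.
    hours, minutes, _seconds = timestamp.split(":")
    return hours + minutes


def ingest(counters, timestamp, text):
    # Increment one counter per distinct token in this tweet's time bucket.
    bucket = time_bucket(timestamp)
    for token in set(tokenise(text)):
        counters[(bucket, token)] += 1


# An in-memory Counter stands in for the counter store (assumption).
counters = Counter()
ingest(counters, "12:32:15", "I like #trafficlights")
ingest(counters, "12:34:04", "Man, @acunu rocks!")
```

After these two calls, `counters[("1234", "acunu")]` holds the number of mentions of "acunu" in the 12:34 bucket.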

Page 8: Realtime Analytics with Apache Cassandra

Analytics

Querying

Step 1: Do a range query

start: [01/05/11, acunu]
end: [30/05/11, acunu]

Step 2: Result table

Key                       #Mentions
[01/05/11 00:01, acunu]   3
[01/05/11 00:02, acunu]   5
...                       ...

Step 3: Plot pretty graph

(Graph: y-axis 0 to 90; x-axis May, Jun, Jul, Aug, Sept, Oct, Nov)

Page 9: Realtime Analytics with Apache Cassandra

Analytics

(Diagram: keys k1, k2, k3, k4)

Cassandra keys are distributed based on a hash of the row key, i.e. randomly
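To see why hashing scatters neighbouring keys, here is a rough sketch that assumes an MD5-based partitioner. The node names and the modulo placement are simplifications for illustration, not how a real cluster assigns token ranges.

```python
import hashlib

NODES = ["node0", "node1", "node2", "node3"]  # hypothetical 4-node cluster


def node_for(row_key):
    # Hash the row key and map the digest onto a node. Modulo placement is a
    # simplification; the point is that placement ignores key order.
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest, "big") % len(NODES)]


# Keys that are adjacent in time order are placed independently of each other.
for key in ["[01/05/11 00:01, acunu]",
            "[01/05/11 00:02, acunu]",
            "[01/05/11 00:03, acunu]"]:
    print(key, "->", node_for(key))
```

Because placement depends only on the hash, a range of consecutive row keys cannot be read from one place, which is the problem the next slide addresses.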

Page 10: Realtime Analytics with Apache Cassandra

Analytics

Instead of this...

Key                       #Mentions
[01/05/11 00:01, acunu]   3
[01/05/11 00:02, acunu]   5
...                       ...

We do this

Key                 00:01   00:02   ...
[01/05/11, acunu]   3       5       ...
[02/05/11, acunu]   12      4       ...
...                 ...     ...

Row key is ‘big’ time bucket

Column key is ‘small’ time bucket
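A minimal sketch of this layout, assuming a dict of dicts as a stand-in for wide rows: the row key is (day, token) and the column key is the minute. The `increment` and `mentions_per_day` helpers are hypothetical names, not part of any Cassandra client.

```python
from collections import defaultdict
from datetime import date, timedelta

# rows[(day, token)][minute] -> count. The outer key plays the role of the
# row key ('big' bucket), the inner key the column key ('small' bucket).
rows = defaultdict(lambda: defaultdict(int))


def increment(day, minute, token, amount=1):
    rows[(day, token)][minute] += amount


def mentions_per_day(token, start, end):
    # One row lookup per day in [start, end]; each row already holds all of
    # that day's minute columns, so no scan across row keys is needed.
    result = {}
    day = start
    while day <= end:
        result[day] = sum(rows.get((day, token), {}).values())
        day += timedelta(days=1)
    return result


# Values taken from the table above.
increment(date(2011, 5, 1), "00:01", "acunu", 3)
increment(date(2011, 5, 1), "00:02", "acunu", 5)
increment(date(2011, 5, 2), "00:01", "acunu", 12)
increment(date(2011, 5, 2), "00:02", "acunu", 4)

per_day = mentions_per_day("acunu", date(2011, 5, 1), date(2011, 5, 2))
```

The query side only ever asks for known row keys (one per day), so random placement of rows is no longer a problem.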

Page 11: Realtime Analytics with Apache Cassandra

Analytics

Towards a more general solution...

(Example II)

Page 12: Realtime Analytics with Apache Cassandra

Analytics

count grouped by ... day

count distinct (session)

count ... geography

... browser

avg(duration)

Page 13: Realtime Analytics with Apache Cassandra

Analytics

21:00 all→1345 :00→45 :01→62 :02→87 ...

22:00 all→3221 :00→22 :01→19 :02→104 ...

... ...

UK all→228 user01→1 user14→12 user99→7 ...

US all→354 user01→4 user04→8 user56→17 ...

...

UK, 22:00 all→1904 ...

∅ all→87314 UK→238 US→354 ...

{
  cust_id: user01,
  session_id: 102,
  geography: UK,
  browser: IE,
  time: 22:02,
}

Page 14: Realtime Analytics with Apache Cassandra

Analytics

21:00 all→1345 :00→45 :01→62 :02→87 ...

22:00 all→3222 :00→22 :01→19 :02→105 ...

... ...

UK all→229 user01→2 user14→12 user99→7 ...

US all→354 user01→4 user04→8 user56→17 ...

...

UK, 22:00 all→1905 ...

∅ all→87315 UK→239 US→354 ...

{
  cust_id: user01,
  session_id: 102,
  geography: UK,
  browser: IE,
  time: 22:02,
}
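The update from the previous slide to this one (3221→3222, 104→105, 228→229 and so on) amounts to one event fanning out into several counter increments. A sketch, assuming an in-memory structure and a hypothetical `record` function; the empty tuple plays the role of the ∅ row.

```python
from collections import Counter, defaultdict

# counters[row_key][column] -> count. Row keys mirror the rows in the table.
counters = defaultdict(Counter)


def record(event):
    hour, minute = event["time"].split(":")
    hour_row = hour + ":00"
    geo = event["geography"]
    user = event["cust_id"]

    counters[hour_row]["all"] += 1          # e.g. 22:00 all
    counters[hour_row][":" + minute] += 1   # e.g. 22:00 :02
    counters[geo]["all"] += 1               # e.g. UK all
    counters[geo][user] += 1                # e.g. UK user01
    counters[(geo, hour_row)]["all"] += 1   # e.g. UK, 22:00 all
    counters[()]["all"] += 1                # ∅ all
    counters[()][geo] += 1                  # ∅ UK


event = {
    "cust_id": "user01",
    "session_id": 102,
    "geography": "UK",
    "browser": "IE",
    "time": "22:02",
}
record(event)
```

Each event touches seven counters here, matching the seven values that change between the two slides.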

Page 15: Realtime Analytics with Apache Cassandra

Analytics

21:00 all→1345 :00→45 :01→62 :02→87 ...

22:00 all→3221 :00→22 :01→19 :02→104 ...

... ...

UK all→228 user01→1 user14→12 user99→7 ...

US all→354 user01→4 user04→8 user56→17 ...

...

UK, 22:00 all→1904 ...

∅ all→87314 UK→238 US→354 ...


Page 16: Realtime Analytics with Apache Cassandra

Analytics

21:00 all→1345 :00→45 :01→62 :02→87 ...

22:00 all→3222 :00→22 :01→19 :02→105 ...

... ...

UK all→229 user01→2 user14→12 user99→7 ...

US all→354 user01→4 user04→8 user56→17 ...

...

UK, 22:00 all→1905 ...

∅ all→87315 UK→239 US→354 ...

where time 21:00-22:00, count(*)

Page 17: Realtime Analytics with Apache Cassandra

Analytics

21:00 all→1345 :00→45 :01→62 :02→87 ...

22:00 all→3222 :00→22 :01→19 :02→105 ...

... ...

UK all→229 user01→2 user14→12 user99→7 ...

US all→354 user01→4 user04→8 user56→17 ...

...

UK, 22:00 all→1905 ...

∅ all→87315 UK→239 US→354 ...

where time 21:00-22:00, count(*)

where time 22:00-23:00, group by minute

Page 18: Realtime Analytics with Apache Cassandra

Analytics

21:00 all→1345 :00→45 :01→62 :02→87 ...

22:00 all→3222 :00→22 :01→19 :02→105 ...

... ...

UK all→229 user01→2 user14→12 user99→7 ...

US all→354 user01→4 user04→8 user56→17 ...

...

UK, 22:00 all→1905 ...

∅ all→87315 UK→239 US→354 ...

where time 21:00-22:00, count(*)

where time 22:00-23:00, group by minute

where geography=UK, group all by user

Page 19: Realtime Analytics with Apache Cassandra

Analytics

21:00 all→1345 :00→45 :01→62 :02→87 ...

22:00 all→3222 :00→22 :01→19 :02→105 ...

... ...

UK all→229 user01→2 user14→12 user99→7 ...

US all→354 user01→4 user04→8 user56→17 ...

...

UK, 22:00 all→1905 ...

∅ all→87315 UK→239 US→354 ...

where time 21:00-22:00, count(*)

where time 22:00-23:00, group by minute

where geography=UK, group all by user

count all

Page 20: Realtime Analytics with Apache Cassandra

Analytics

21:00 all→1345 :00→45 :01→62 :02→87 ...

22:00 all→3222 :00→22 :01→19 :02→105 ...

... ...

UK all→229 user01→2 user14→12 user99→7 ...

US all→354 user01→4 user04→8 user56→17 ...

...

UK, 22:00 all→1905 ...

∅ all→87315 UK→239 US→354 ...

where time 21:00-22:00, count(*)

where time 22:00-23:00, group by minute

where geography=UK, group all by user

count all

group all by geo
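Each of these queries maps onto reading a single row of counters. A sketch using the values from the table, with hypothetical `total` and `grouped` helpers; entries shown as "..." on the slide are left out.

```python
# Counter values copied from the table; the empty tuple stands for the ∅ row.
counters = {
    "21:00": {"all": 1345, ":00": 45, ":01": 62, ":02": 87},
    "22:00": {"all": 3222, ":00": 22, ":01": 19, ":02": 105},
    "UK": {"all": 229, "user01": 2, "user14": 12, "user99": 7},
    "US": {"all": 354, "user01": 4, "user04": 8, "user56": 17},
    ("UK", "22:00"): {"all": 1905},
    (): {"all": 87315, "UK": 239, "US": 354},
}


def total(row_key):
    # The precomputed 'all' column answers a count over the whole row.
    return counters[row_key]["all"]


def grouped(row_key):
    # Every other column is one group of the breakdown.
    return {col: n for col, n in counters[row_key].items() if col != "all"}


count_21 = total("21:00")      # where time 21:00-22:00, count(*)
by_minute = grouped("22:00")   # where time 22:00-23:00, group by minute
uk_by_user = grouped("UK")     # where geography=UK, group all by user
count_all = total(())          # count all
by_geo = grouped(())           # group all by geo
```

No query scans raw events; the work was done at write time when the counters were incremented.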

Page 21: Realtime Analytics with Apache Cassandra

Analytics

What about more than just aggregates?

Page 22: Realtime Analytics with Apache Cassandra

Analytics

Approximate Analytics

Exact

Large Scale

Real-time

Page 23: Realtime Analytics with Apache Cassandra

Analytics

Count Distinct

Plan A: keep a list of all the things you’ve seen; count them at query time

Quick to update ... but at scale ...

Takes lots of space

Takes a long time to query

Page 24: Realtime Analytics with Apache Cassandra

Analytics

Approximate Distinct

item   hash             leading zeroes   max so far
x      00101001110...   2                2
y      11010100111...   0                2
z      00011101011...   3                3
...

max # leading zeroes seen so far

... to see a max of M takes about 2^M items
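The table can be reproduced with a few lines of Python. This sketch feeds in the hash bit strings from the slide directly; in practice each item would be hashed first. The function names are made up for illustration.

```python
def leading_zeroes(bits):
    # Number of '0' characters before the first '1'.
    return len(bits) - len(bits.lstrip("0"))


def max_leading_zeroes(hashes):
    # Track the running maximum, one entry per item seen.
    best = 0
    history = []
    for bits in hashes:
        best = max(best, leading_zeroes(bits))
        history.append(best)
    return history


# Hash prefixes for items x, y, z, as shown in the table.
hashes = ["00101001110", "11010100111", "00011101011"]
history = max_leading_zeroes(hashes)

# A max of M suggests roughly 2^M distinct items: a very rough estimate.
estimate = 2 ** history[-1]
```

Only one small integer is stored however many items go by, which is why this takes so much less space than keeping a list. The price is a high-variance estimate.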

Page 25: Realtime Analytics with Apache Cassandra

Analytics

Approximate Distinct

to reduce var, average over m = 2^k sub-streams

item   hash             index, zeroes   max so far
                                        0,0,0,0
x      00101001110...   0, 0            0,0,0,0
y      11010100111...   3, 1            0,0,0,1
z      00011101011...   0, 1            1,0,0,1
...

take the harmonic mean
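A sketch of the sub-stream version with k = 2, using the same bit strings as the table: the first k bits select a register and the leading zeroes are counted in the remaining bits. The slide only says to take the harmonic mean; scaling it by m and leaving out any bias-correction constant are assumptions made here, so treat `estimate` as illustrative rather than as a tuned estimator.

```python
def leading_zeroes(bits):
    return len(bits) - len(bits.lstrip("0"))


def update(registers, bits, k):
    # First k bits pick the sub-stream; the rest feed the zero count.
    index = int(bits[:k], 2)
    zeroes = leading_zeroes(bits[k:])
    registers[index] = max(registers[index], zeroes)


def estimate(registers):
    # Harmonic mean of the per-register estimates 2^r, scaled by the number
    # of registers. Uncorrected: no bias constant is applied (assumption).
    m = len(registers)
    harmonic_mean = m / sum(2.0 ** -r for r in registers)
    return m * harmonic_mean


k = 2
registers = [0] * (2 ** k)
snapshots = []
for bits in ["00101001110", "11010100111", "00011101011"]:
    update(registers, bits, k)
    snapshots.append(list(registers))
```

The `snapshots` list reproduces the "max so far" column row by row, and the harmonic mean damps the effect of a single register with an unusually long run of zeroes.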

Page 26: Realtime Analytics with Apache Cassandra

Analytics

Okay... now what?

Page 27: Realtime Analytics with Apache Cassandra

Analytics

• Aggregate incrementally, on the fly

• Store live + historical aggregates

events

counter updates

Acunu Analytics

Click stream

Sensor data

etc

Page 28: Realtime Analytics with Apache Cassandra

Analytics

10x vs MySQL...

Page 29: Realtime Analytics with Apache Cassandra

Analytics

Dashboard UI

Page 30: Realtime Analytics with Apache Cassandra

Analytics

“Up and running in about 4 hours”

“We found out a competitor was scraping our data”

“We keep discovering use cases we hadn’t thought of”

http://vimeo.com/54026096

Page 31: Realtime Analytics with Apache Cassandra

Analytics

"We're still finding new and interesting use cases, which just aren't possible with our current datastores."

"Quick, efficient and easy to get started"