Realtime Analytics with Apache Cassandra
Post on 24-Jan-2015
Tom Wilkie, Founder & CTO, Acunu Ltd
@tom_wilkie
Cassandra 101
• BigTable-style data model combined with Dynamo-style consistency
• Simple queries: put, get, range queries
• Multi-master architecture: no SPOF
• Tunable consistency, multi-DC aware
• Optimised for random writes & random range queries
• Atomic counters, wide rows, composite keys
Analytics
Live & historical aggregates... Trends... Drill downs and roll ups
Combining “big” and “real-time” is hard
Existing solutions: cons
• Scalability $$$
• Not realtime
• Spartan query semantics => complex, DIY solutions
Example I
e.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”
Batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data
http://blog.twitter.com/2011/03/numbers.html
Okay, so how are we going to do it?
For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters
Preparing the data
Step 1: Get a feed of the tweets
12:32:15 I like #trafficlights
12:33:43 Nobody expects...
12:33:49 I ate a #bee; woe is...
12:34:04 Man, @acunu rocks!
Step 2: Tokenise the tweet
Step 3: Increment counters in time buckets for each token
[1234, man] +1
[1234, acunu] +1
[1234, rock] +1
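The three steps above can be sketched in a few lines of Python. This is only an illustration, not the code behind the talk: an in-memory Counter stands in for Cassandra's atomic counters, and the function names (tokenise, minute_bucket, ingest) and the crude "strip a trailing s" stemming rule are assumptions made so that "rocks" becomes "rock" as on the slide.

```python
import re
from collections import Counter

# Stand-in for Cassandra counter columns: (time bucket, token) -> count.
counters = Counter()

def tokenise(text):
    """Lowercase, split on non-alphanumerics, crudely strip a plural 's'."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Hypothetical stemming rule, only good enough for this example.
    return [w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words]

def minute_bucket(timestamp):
    """Turn 'HH:MM:SS' into the 'HHMM' bucket used on the slide."""
    hours, minutes, _seconds = timestamp.split(":")
    return hours + minutes

def ingest(timestamp, text):
    """Increment one counter per token in the tweet's minute bucket."""
    bucket = minute_bucket(timestamp)
    for token in tokenise(text):
        counters[(bucket, token)] += 1

ingest("12:34:04", "Man, @acunu rocks!")
```

After the call, the counters keyed by ("1234", "man"), ("1234", "acunu") and ("1234", "rock") each hold 1, matching the increments listed above.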
Querying
Step 1: Do a range query
start: [01/05/11, acunu]
end: [30/05/11, acunu]
Step 2: Result table
Key                       #Mentions
[01/05/11 00:01, acunu]   3
[01/05/11 00:02, acunu]   5
...                       ...
Step 3: Plot pretty graph
[Graph: mentions per day, y-axis 0 to 90, x-axis May to Nov]
[Diagram: keys k1, k2, k3, k4 scattered across the cluster]
Cassandra keys are distributed based on a hash of the row key, i.e. randomly
Instead of this...
Key                       #Mentions
[01/05/11 00:01, acunu]   3
[01/05/11 00:02, acunu]   5
...                       ...
We do this
Key                 00:01   00:02   ...
[01/05/11, acunu]   3       5       ...
[02/05/11, acunu]   12      4       ...
...                 ...     ...     ...
Row key is ‘big’ time bucket
Column key is ‘small’ time bucket
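A rough sketch of the wide-row layout, assuming nested Python dicts in place of Cassandra rows and columns: the outer key is the row key (day, token) and the inner key is the minute column. The helper names increment and mentions_per_day are made up for illustration. A per-day query then reads one row per day instead of one key per minute.

```python
from collections import defaultdict
from datetime import date, timedelta

# rows[(day, token)][minute] -> count; one wide row per day and token.
rows = defaultdict(lambda: defaultdict(int))

def increment(day, minute, token, by=1):
    """Bump the minute column inside the (day, token) row."""
    rows[(day, token)][minute] += by

def mentions_per_day(token, start, end):
    """Sum the minute columns of each daily row between start and end."""
    totals = {}
    day = start
    while day <= end:
        # .get avoids creating empty rows for days with no data.
        totals[day] = sum(rows.get((day, token), {}).values())
        day += timedelta(days=1)
    return totals

# Values taken from the table above.
increment(date(2011, 5, 1), "00:01", "acunu", 3)
increment(date(2011, 5, 1), "00:02", "acunu", 5)
increment(date(2011, 5, 2), "00:01", "acunu", 12)
increment(date(2011, 5, 2), "00:02", "acunu", 4)

per_day = mentions_per_day("acunu", date(2011, 5, 1), date(2011, 5, 3))
```

With only the four cells shown on the slide loaded, per_day maps 1 May to 8, 2 May to 16 and 3 May to 0.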
Towards a more general solution...
(Example II)
Aggregates: count, count distinct (session), avg(duration), ...
grouped by: day, geography, browser, ...
21:00 all→1345 :00→45 :01→62 :02→87 ...
22:00 all→3221 :00→22 :01→19 :02→104 ...
... ...
UK all→228 user01→1 user14→12 user99→7 ...
US all→354 user01→4 user04→8 user56→17 ...
...
UK, 22:00 all→1904 ...
∅ all→87314 UK→238 US→354 ...
{
  cust_id: user01,
  session_id: 102,
  geography: UK,
  browser: IE,
  time: 22:02,
}
21:00 all→1345 :00→45 :01→62 :02→87 ...
22:00 all→3222 :00→22 :01→19 :02→105 ...
... ...
UK all→229 user01→2 user14→12 user99→7 ...
US all→354 user01→4 user04→8 user56→17 ...
...
UK, 22:00 all→1905 ...
∅ all→87315 UK→239 US→354 ...
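One way to read the two tables: the single UK event at 22:02 from user01 bumps seven counters by one. The sketch below is an interpretation of the slides, not Acunu's implementation. It assumes a Counter keyed by (row, column), and the function names updates_for and apply_event are hypothetical.

```python
from collections import Counter

ROOT = "\u2205"  # the empty-set row, which holds the grand totals

# Only the cells the event touches, with their values before it arrives.
counters = Counter({
    ("22:00", "all"): 3221, ("22:00", ":02"): 104,
    ("UK", "all"): 228, ("UK", "user01"): 1,
    ("UK, 22:00", "all"): 1904,
    (ROOT, "all"): 87314, (ROOT, "UK"): 238,
})

def updates_for(event):
    """List the (row, column) counters that one event increments."""
    hour, minute = event["time"].split(":")
    hour_row = hour + ":00"
    geo = event["geography"]
    return [
        (hour_row, "all"), (hour_row, ":" + minute),  # per hour, per minute
        (geo, "all"), (geo, event["cust_id"]),        # per geography, per user
        (geo + ", " + hour_row, "all"),               # geography and hour
        (ROOT, "all"), (ROOT, geo),                   # grand total, per geo
    ]

def apply_event(event):
    for key in updates_for(event):
        counters[key] += 1

apply_event({"cust_id": "user01", "session_id": 102,
             "geography": "UK", "browser": "IE", "time": "22:02"})
```

After apply_event, every touched cell matches the second table, for example 3222 for the 22:00 total and 239 for UK in the grand-total row.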
where time 21:00-22:00, count(*)
where time 22:00-23:00, group by minute
where geography=UK, group all by user
count all
group all by geo
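Read against the counter table, each of these queries appears to be a single row read: a count returns the row's "all" column and a group-by returns the row's other columns. The sketch below assumes that mapping. The dict layout and the helper names count and group_by are illustrative, and only the cells visible on the slide are included.

```python
ROOT = "\u2205"  # the empty-set row, which holds the grand totals

# Visible part of the counter table after the event has been applied.
table = {
    "21:00": {"all": 1345, ":00": 45, ":01": 62, ":02": 87},
    "22:00": {"all": 3222, ":00": 22, ":01": 19, ":02": 105},
    "UK": {"all": 229, "user01": 2, "user14": 12, "user99": 7},
    "US": {"all": 354, "user01": 4, "user04": 8, "user56": 17},
    "UK, 22:00": {"all": 1905},
    ROOT: {"all": 87315, "UK": 239, "US": 354},
}

def count(row=ROOT):
    """count(*) for the slice named by the row key."""
    return table[row]["all"]

def group_by(row=ROOT):
    """Per-column breakdown of the slice named by the row key."""
    return {col: n for col, n in table[row].items() if col != "all"}

hour_total = count("21:00")      # where time 21:00-22:00, count(*)
by_minute = group_by("22:00")    # where time 22:00-23:00, group by minute
by_user = group_by("UK")         # where geography=UK, group all by user
grand_total = count()            # count all
by_geo = group_by()              # group all by geo
```

No scan or aggregation happens at query time. The work was already done when the counters were incremented.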
What about more than just aggregates?
Approximate Analytics
[Diagram with three labels: Exact, Large Scale, Real-time]
Count Distinct
Plan A: keep a list of all the things you’ve seen, and count them at query time
Quick to update ... but at scale ...
• Takes lots of space
• Takes a long time to query
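Plan A amounts to one set per bucket. A minimal sketch, with hypothetical names record and count_distinct and in-memory sets standing in for stored lists:

```python
from collections import defaultdict

# One set of seen values per bucket: an update is a cheap add, but
# memory grows with the number of distinct values in each bucket.
seen = defaultdict(set)

def record(bucket, session_id):
    seen[bucket].add(session_id)

def count_distinct(bucket):
    """Exact answer, at the cost of storing every value ever seen."""
    return len(seen[bucket])

for session in [102, 103, 102, 104, 103]:
    record("22:00", session)

distinct_sessions = count_distinct("22:00")
```

Here distinct_sessions is 3. The set holds every distinct session, which is what makes this expensive in space and slow to query at scale.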
Approximate Distinct
item   hash             leading zeroes   max so far
x      00101001110...   2                2
y      11010100111...   0                2
z      00011101011...   3                3
...
Track the max # leading zeroes seen so far
... to see a max of M takes about 2^M items
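The table can be replayed directly. This sketch uses the truncated bit strings from the slide as literal inputs rather than hashing real items, and the function names are illustrative. The final 2^M figure is the rough single-counter estimate the slide describes, which has high variance.

```python
def leading_zeroes(bits):
    """Number of '0' characters before the first '1' in a bit string."""
    return len(bits) - len(bits.lstrip("0"))

def running_max(hashes):
    """Max number of leading zeroes seen after each item."""
    best = 0
    maxima = []
    for bits in hashes:
        best = max(best, leading_zeroes(bits))
        maxima.append(best)
    return maxima

# Hash prefixes of items x, y, z as shown on the slide.
slide_hashes = ["00101001110", "11010100111", "00011101011"]

maxima = running_max(slide_hashes)
rough_estimate = 2 ** maxima[-1]  # about 2^M items for a max of M
```

The running maxima come out as 2, 2, 3, matching the "max so far" column. Only the single integer M needs to be stored, however many items go by.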
Approximate Distinct
to reduce variance, average over m = 2^k sub-streams
item   hash             index, zeroes   max so far
x      00101001110...   0, 0            0,0,0,0
y      11010100111...   3, 1            0,0,0,1
z      00011101011...   0, 1            1,0,0,1
...
take the harmonic mean
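A sketch of the sub-stream version, assuming k = 2 for the slide's table: the first k bits of the hash pick a sub-stream and the leading zeroes of the remaining bits update that sub-stream's max. The slides do not give the estimator's constants, so estimate borrows the bias constant and rank convention from the published HyperLogLog algorithm as an assumption. SHA-1 is likewise an arbitrary choice of hash.

```python
import hashlib

def hash_bits(item):
    """First 64 bits of SHA-1 as a bit string (hash choice is arbitrary)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return format(int.from_bytes(digest[:8], "big"), "064b")

def split(bits, k):
    """First k bits pick the sub-stream; count leading zeroes of the rest."""
    rest = bits[k:]
    return int(bits[:k], 2), len(rest) - len(rest.lstrip("0"))

def update(registers, bits, k):
    index, zeroes = split(bits, k)
    registers[index] = max(registers[index], zeroes)

def estimate(registers):
    """Harmonic-mean estimate with HyperLogLog's constant (for m >= 128).

    Assumes every sub-stream has seen at least one item; real
    implementations add small- and large-range corrections.
    """
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)
    # HyperLogLog uses rank = leading zeroes + 1.
    return alpha * m * m / sum(2.0 ** -(r + 1) for r in registers)

# Replay the slide's table with k = 2, i.e. m = 4 sub-streams.
slide_registers = [0, 0, 0, 0]
for bits in ["00101001110", "11010100111", "00011101011"]:
    update(slide_registers, bits, 2)

# Larger run: 20,000 distinct items, each seen twice, m = 1024.
registers = [0] * 1024
for i in range(20000):
    for _ in range(2):
        update(registers, hash_bits("user%d" % i), 10)
approx_distinct = estimate(registers)
```

The slide registers end up as 1,0,0,1, as in the table. Duplicates leave the registers unchanged, so seeing each item twice does not affect approx_distinct.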
Okay... now what?
• Aggregate incrementally, on the fly
• Store live + historical aggregates
[Diagram: click stream, sensor data, etc. → events → Acunu Analytics → counter updates]
10x vs MySQL...
Dashboard UI
“Up and running in about 4 hours”
“We found out a competitor was scraping our data”
“We keep discovering use cases we hadn’t thought of”
http://vimeo.com/54026096
"We're still finding new and interesting use cases, which just aren't possible with our
current datastores."
"Quick, efficient and easy to get started"
Thanks!
Questions?
http://www.acunu.com/download
contact@acunu.com