Martin Zapletal @zapletal_martin Cake Solutions @cakesolutions #CassandraSummit Presented by Anirvan Chakraborty @anirvan_c

Cassandra as an event sourced journal for big data analytics Cassandra Summit 2015

Jan 12, 2017

Page 1: Cassandra as an event sourced journal for big data analytics Cassandra Summit 2015

Martin Zapletal @zapletal_martin

Cake Solutions @cakesolutions

#CassandraSummit

Presented by Anirvan Chakraborty @anirvan_c

Page 2:

● Introduction
● Event sourcing and CQRS
● An emerging technology stack to handle data
● A reference application and its architecture
● A few use cases of the reference application
● Conclusion

Page 3:

● Increasing importance of data analytics
● Current state
  ○ Destructive updates
  ○ Analytics tools with poor scalability and integration
  ○ Manual processes
  ○ Slow iterations
  ○ Not suitable for large amounts of data

Page 4:

● Whole lifecycle of data
● Data processing
● Data stores
● Integration and messaging
● Distributed computing primitives
● Cluster managers and task schedulers
● Deployment, configuration management and DevOps
● Data analytics and machine learning
● Spark, Mesos, Akka, Cassandra, Kafka (SMACK, Infinity)

Page 5:

ACID Mutable State

Page 6:

● Create, Read, Update, Delete
● Exposes mutable internal state
● Many read methods on repositories
● Mapping of data model and objects (impedance mismatch)
● No auditing
● No separation of concerns (read / write, command / event)
● Strongly consistent
● Difficult optimizations of reads / writes
● Difficult to scale
● Intent, behaviour and history are lost

[Figure: an Update Account operation replaces Balance = 5 with Balance = 10 on the Account]

Page 7:

[Figure [1]: four architecture diagrams: CQRS (Client, Command, Query, two DBs, Denormalise/Precompute), Kappa architecture, Batch pipeline, and Lambda architecture (Batch layer, Serving layer, Stream layer (fast), Query, Serving DB). Components labelled across the diagrams include "All your data", Kafka, Stream processor, Spark, NoSQL, SQL, Client, Views, Flume, Sqoop, Hive, Impala, Oozie and HDFS.]

Page 8:

[2, 3]

Page 9:

● Append only data store
● No updates or deletes (rewriting history)
● Immutable data model
● Decouples data model of the application and storage
● Current state not persisted, but derived. A sequence of updates that led to it.
● History, state known at any point in time
● Replayable
● Source of truth
● Optimisations possible
● Works well in distributed environment - easy partitioning, conflicts
● Helps avoid transactions
● Works well with DDD

Page 10:

Event journal

userId  date        change  event
1       10/10/2015  +300    balanceChanged
1       11/10/2015  -100    balanceChanged
1       23/10/2015  -200    balanceChanged
1       24/10/2015  +100    balanceChanged
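The journal on this slide can be replayed to derive state rather than storing it. The following is a minimal sketch in plain Scala (no Akka or Cassandra involved); the `BalanceChanged` case class and the `balance` / `balanceAfter` helpers are hypothetical names chosen for illustration, with the four rows taken from the slide.

```scala
object JournalReplay {
  // Hypothetical event type mirroring the balanceChanged rows on the slide.
  final case class BalanceChanged(userId: Int, date: String, change: Int)

  val journal: Vector[BalanceChanged] = Vector(
    BalanceChanged(1, "10/10/2015", 300),
    BalanceChanged(1, "11/10/2015", -100),
    BalanceChanged(1, "23/10/2015", -200),
    BalanceChanged(1, "24/10/2015", 100)
  )

  // Current state is never stored; it is a left fold over the events in journal order.
  def balance(events: Seq[BalanceChanged], userId: Int): Int =
    events.filter(_.userId == userId).foldLeft(0)((acc, e) => acc + e.change)

  // State at an earlier point in time: replay only the first n events.
  def balanceAfter(events: Seq[BalanceChanged], userId: Int, n: Int): Int =
    balance(events.take(n), userId)
}
```

Replaying the whole journal gives the current balance, and replaying a prefix gives the balance as it was at that point, which is what "state known at any point in time" relies on.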

Page 11:

● Command Query Responsibility Segregation
● Read and write logically and physically separated
● Reasoning about the application
● Clear separation of concerns (business logic)
● Often different technology, scalability
● Often lower consistency - eventual, causal

Page 12:

Command

● Write side
● Messages, requests to mutate state
● Behaviour, serialized method call essentially
● Don’t expose state
● Validated and may be rejected or emit one or more events (e.g. submitting a form)

Event

● Write side
● Immutable
● Indicating something that has happened
● Atomic record of state change
● Audit log

Query

● Read side
● Precomputed

Page 13:

Write: Command (userId = 1, updateBalance(+100)) → Command handler → Event → Event journal

userId  date        change  event
1       10/10/2015  +300    balanceChanged
1       11/10/2015  -100    balanceChanged
1       23/10/2015  -200    balanceChanged
1       24/10/2015  +100    balanceChanged

Read: Query (userId = 1) → precomputed view (userId = 1, balance = 100)

userId  balance
1       100
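The write and read paths on this slide can be sketched in a few lines of plain Scala. This is an illustration only: the `UpdateBalance` and `BalanceChanged` types, the `handle` and `project` functions, and the "no negative balance" validation rule are all assumptions made for the example, not taken from the reference application.

```scala
object BalanceCqrs {
  final case class UpdateBalance(userId: Int, change: Int)   // command: a request to mutate state
  final case class BalanceChanged(userId: Int, change: Int)  // event: a fact that has happened

  // Write side: validate the command against current state, then reject it or emit an event.
  // The overdraft rule is a hypothetical example of validation.
  def handle(currentBalance: Int, cmd: UpdateBalance): Either[String, BalanceChanged] =
    if (currentBalance + cmd.change < 0) Left("insufficient funds")
    else Right(BalanceChanged(cmd.userId, cmd.change))

  // Read side: a precomputed view (userId -> balance) updated from each event.
  def project(view: Map[Int, Int], e: BalanceChanged): Map[Int, Int] =
    view.updated(e.userId, view.getOrElse(e.userId, 0) + e.change)
}
```

With the journal shown above, the balance before the last command is 0, so `updateBalance(+100)` is accepted and the read-side view ends at `1 -> 100`.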

Page 14:

● Partial order of events for each entity
● Operation semantics, CRDTs

[Figure: UserNameUpdated(A) and UserNameUpdated(B) events, each appearing twice in the diagram]
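One way to make concurrent `UserNameUpdated` events converge regardless of arrival order is a last-writer-wins register, one of the simplest CRDTs. The sketch below is an assumption about how such a merge could look; the slides do not say which CRDT, if any, the reference application uses, and the `timestamp` and `nodeId` fields are hypothetical.

```scala
object LwwRegister {
  // Hypothetical event carrying the metadata needed to order concurrent updates.
  final case class UserNameUpdated(name: String, timestamp: Long, nodeId: String)

  // Keep the update with the larger (timestamp, nodeId) pair.
  // The merge is commutative, associative and idempotent, so replicas that
  // see the same updates in different orders end up with the same value.
  def merge(a: UserNameUpdated, b: UserNameUpdated): UserNameUpdated =
    if (a.timestamp > b.timestamp ||
        (a.timestamp == b.timestamp && a.nodeId.compareTo(b.nodeId) >= 0)) a
    else b
}
```

The node identifier is only a tie-breaker for equal timestamps; without it, two replicas could pick different winners.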

Page 15:

● Localization
● Conflicting concurrent histories
  ○ Resubmission
  ○ Deduplication
  ○ Replication
● Identifier
● Version
● Timestamp
● Vector clock
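Of the mechanisms listed, the vector clock is the one that can tell conflicting concurrent histories apart from ordered ones. A minimal sketch in plain Scala follows; the `VectorClock` alias, the `Relation` type and the node names are assumptions made for this example.

```scala
object VectorClocks {
  // One counter per node; a missing entry counts as 0.
  type VectorClock = Map[String, Long]

  sealed trait Relation
  case object Same extends Relation
  case object Before extends Relation
  case object After extends Relation
  case object Concurrent extends Relation

  // A node increments its own counter whenever it records an event.
  def tick(clock: VectorClock, node: String): VectorClock =
    clock.updated(node, clock.getOrElse(node, 0L) + 1)

  // a is Before b when every counter in a is <= the one in b and the clocks differ.
  // If neither clock dominates the other, the histories are concurrent (a conflict).
  def compare(a: VectorClock, b: VectorClock): Relation = {
    val nodes = a.keySet ++ b.keySet
    val aLeB = nodes.forall(n => a.getOrElse(n, 0L) <= b.getOrElse(n, 0L))
    val bLeA = nodes.forall(n => b.getOrElse(n, 0L) <= a.getOrElse(n, 0L))
    (aLeB, bLeA) match {
      case (true, true)   => Same
      case (true, false)  => Before
      case (false, true)  => After
      case (false, false) => Concurrent
    }
  }
}
```

Two updates that both start from the same clock but are ticked on different nodes compare as `Concurrent`, which is the signal that a conflict resolution strategy is needed.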

Page 16:
Page 17:
Page 18:

● Actor framework for truly concurrent and distributed systems
● Thread safe mutable state - consistency boundary
● Domain modelling, distributed state
● Simple programming model - asynchronously send messages, create new actors, change behaviour
● Supports CQRS/ES
● Fully distributed - asynchronous, delivery guarantees, failures, time and order, consistency, availability, communication patterns, data locality, persistence, durability, concurrent updates, conflicts, divergence, invariants, ...

Page 19:

[Figure: entities (UserId = 1, Name = Bob; BankAccountId = 1, Balance = 1000; UserId = 1, Name = Alice) with positions labelled ?, ?, ? + 1, ? + 1 and ? + 2]

Page 20:

● Distributed domain modelling
● In memory
● Ordering, consistency

[Figure label: id = 1]

Page 21:

● Actor backed by data store
● Immutable event sourced journal
● Supports CQRS (write and read side)
● Persistence, replay on failure, rebalance, at least once delivery
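At-least-once delivery means the same message may arrive more than once, so the receiving side has to deduplicate. The following plain-Scala sketch shows one common approach, tracking delivery identifiers that have already been applied; the `Delivery` and `State` types are hypothetical and this is not Akka's own at-least-once delivery API.

```scala
object IdempotentReceiver {
  final case class Delivery(deliveryId: Long, payload: String)
  final case class State(seen: Set[Long], applied: Vector[String])

  val empty: State = State(Set.empty, Vector.empty)

  // Apply a delivery only if its identifier has not been seen before;
  // a redelivered message leaves the state unchanged.
  def receive(state: State, d: Delivery): State =
    if (state.seen.contains(d.deliveryId)) state
    else State(state.seen + d.deliveryId, state.applied :+ d.payload)
}
```

In a long-running system the set of seen identifiers would need to be bounded, for example by tracking only the highest contiguous identifier per sender; that refinement is left out here.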

Page 22:

user1, event 2

user1, event 3

user1, event 4

user1, event 1

Page 23:

class UserActor extends PersistentActor {

  override def persistenceId: String = UserPersistenceId(self.path.name).persistenceId

  // Start in the notRegistered behaviour.
  override def receiveCommand: Receive = notRegistered(DistributedData(context.system).replicator)

  def notRegistered(distributedData: ActorRef): Receive = {
    case cmd: AccountCommand =>
      // Persist the event first; the callback runs once the event has been written to the journal.
      persist(AccountEvent(cmd.account)) { acc =>
        context.become(registered(acc))
        sender() ! \/-()
      }
  }

  def registered(account: Account): Receive = {
    case eres @ EntireResistanceExerciseSession(id, session, sets, examples, deviations) =>
      persist(eres)(data => sender() ! \/-(id))
  }

  // Rebuilds state by replaying journalled events (body elided on the slide).
  override def receiveRecover: Receive = { ... }
}

Page 24:

● Akka Persistence Cassandra journal
  ○ Globally distributed journal
  ○ Scalable, resilient, highly available
  ○ Performant, operational database
● Community plugins

akka {
  persistence {
    journal.plugin = "cassandra-journal"
    snapshot-store.plugin = "cassandra-snapshot-store"
  }
}

Page 25:

● Partition-size
● Events in each cluster partition ordered (persistenceId - partition pair)

CREATE TABLE IF NOT EXISTS ${tableName} (
  processor_id text,
  partition_nr bigint,
  sequence_nr bigint,
  marker text,
  message blob,
  PRIMARY KEY ((processor_id, partition_nr), sequence_nr, marker))
  WITH COMPACT STORAGE
  AND gc_grace_seconds = ${config.gc_grace_seconds}

processor_id  partition_nr  sequence_nr  marker  message
user-1        0             0            H       0x0a6643b334...
user-1        0             1            A       0x0ab2020801...
user-1        0             2            A       0x0a98020801...
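The partition size bounds how many events share one Cassandra partition. As a rough illustration of the idea, the sketch below assumes partitions are filled in sequence-number order with a fixed target size; the `partitionNr` function and the size of 5000 are assumptions for this example, and the plugin's actual formula may differ.

```scala
object JournalPartitioning {
  // Hypothetical mapping from a sequence number to a partition number,
  // assuming a fixed target partition size and in-order filling.
  def partitionNr(sequenceNr: Long, partitionSize: Long): Long =
    sequenceNr / partitionSize

  // The Cassandra partition key from the table above: (processor_id, partition_nr).
  def partitionKey(persistenceId: String, sequenceNr: Long, partitionSize: Long): (String, Long) =
    (persistenceId, partitionNr(sequenceNr, partitionSize))
}
```

Because `sequence_nr` is a clustering column, events within one `(processor_id, partition_nr)` pair are stored in order, and a full replay reads partitions 0, 1, 2, ... in turn.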

Page 26:

● Internal state, moment in time
● Read optimization

CREATE TABLE IF NOT EXISTS ${tableName} (
  processor_id text,
  sequence_nr bigint,
  timestamp bigint,
  snapshot blob,
  PRIMARY KEY (processor_id, sequence_nr))
  WITH CLUSTERING ORDER BY (sequence_nr DESC)

processor_id  sequence_nr  snapshot         timestamp
user-1        16           0x0400000001...  1441696908210
user-1        20           0x0400000001...  1441697587765
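A snapshot is a read optimization: recovery starts from the most recent snapshot and replays only the events recorded after it. The plain-Scala sketch below illustrates this with a balance as the state; the `Snapshot` and `Event` types are hypothetical, and sequence numbers are assumed to start at 1.

```scala
object SnapshotRecovery {
  final case class Snapshot(sequenceNr: Long, balance: Int)
  final case class Event(sequenceNr: Long, change: Int)

  def recover(snapshots: Seq[Snapshot], journal: Seq[Event]): Int = {
    // The newest snapshot is the one with the highest sequence number
    // (the table above keeps it first via CLUSTERING ORDER BY sequence_nr DESC).
    val latest = snapshots.sortBy(_.sequenceNr).lastOption
    val (fromSeqNr, base) = latest match {
      case Some(s) => (s.sequenceNr, s.balance)
      case None    => (0L, 0) // no snapshot: replay everything from an empty state
    }
    // Replay only the events recorded after the snapshot, in sequence order.
    journal
      .filter(_.sequenceNr > fromSeqNr)
      .sortBy(_.sequenceNr)
      .foldLeft(base)((acc, e) => acc + e.change)
  }
}
```

Recovery with and without a snapshot yields the same state; the snapshot only shortens the replay.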

Page 27:

● Uses Akka serialization

[Figure: the payload (Payload: T) is wrapped in a PersistentRepr and serialized by Akka.Serialization (Protobuf) into the message blob, e.g. 0x0a6643b334 …]

actor {
  serialization-bindings {
    "io.muvr.exercise.ExercisePlanDeviation" = kryo,
    "io.muvr.exercise.ResistanceExercise" = kryo,
  }
  serializers {
    java = "akka.serialization.JavaSerializer"
    kryo = "com.twitter.chill.akka.AkkaSerializer"
  }
}

Page 28:

class UserActorView(userId: String) extends PersistentView {

  override def persistenceId: String = UserPersistenceId(userId).persistenceId

  override def viewId: String = UserPersistenceId(userId).persistentViewId

  override def autoUpdateInterval: FiniteDuration = FiniteDuration(100, TimeUnit.MILLISECONDS)

  def receive: Receive = viewState(List.empty)

  def viewState(processedDeviations: List[ExercisePlanProcessedDeviation]): Receive = {
    case EntireResistanceExerciseSession(_, _, _, _, deviations) if isPersistent =>
      context.become(viewState(deviations.filter(condition).map(process) ::: processedDeviations))
    case GetProcessedDeviations => sender() ! processedDeviations
  }
}

Page 29:

● Akka 2.4
● Potentially infinite stream of data
● Ordered, replayable, resumable
● Aggregation, transformation, moving data
● EventsByPersistenceId
● AllPersistenceIds
● EventsByTag

Page 30:

val readJournal =
  PersistenceQuery(system).readJournalFor(CassandraJournal.Identifier)

val source = readJournal.query(
    EventsByPersistenceId(UserPersistenceId(name).persistenceId, 0, Long.MaxValue), NoRefresh)
  .map(_.event)
  .collect{ case s: EntireResistanceExerciseSession => s }
  .mapConcat(_.deviations)
  .filter(condition)
  .map(process)

implicit val mat = ActorMaterializer()

val result = source.runFold(List.empty[ExercisePlanDeviation])((x, y) => y :: x)

Page 31:

● Potentially infinite stream of events

Source[Any].map(process).filter(condition)

[Figure: a Publisher connected to a Subscriber through the process and condition stages, with backpressure]
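Backpressure means the subscriber controls how fast the publisher may emit. The sketch below models that idea with explicit demand in plain Scala; it is not the Reactive Streams or Akka Streams API, and the `Publisher` class plus the `process` and `condition` functions are hypothetical stand-ins for the stages named on the slide.

```scala
object DemandDrivenSource {
  // Hypothetical stages corresponding to .map(process).filter(condition).
  def process(x: Int): Int = x * 2
  def condition(x: Int): Boolean = x % 3 == 0

  // A publisher that emits only as many elements as the subscriber requests.
  final class Publisher[A](elements: Iterator[A]) {
    def request(n: Int): List[A] = {
      val out = List.newBuilder[A]
      var emitted = 0
      while (emitted < n && elements.hasNext) {
        out += elements.next()
        emitted += 1
      }
      out.result()
    }
  }

  // A potentially infinite source: nothing is computed until demand is signalled.
  def source(): Publisher[Int] =
    new Publisher(Iterator.from(1).map(process).filter(condition))
}
```

A slow subscriber simply requests fewer elements, so an infinite source never overruns it and no unbounded buffer is needed.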

Page 32:

● In Akka we have the read and write sides separated, in Cassandra we don’t
● Different data model
● Avoid using operational datastore
● Eventual consistency
● Streaming transformations to different format
● Unify journalled and other data

Page 33:

● Computations and analytics queries on the data
● Often iterative, complex, expensive computations
● Prepared and interactive queries
● Data from multiple sources, joins and transformations
● Often directly on a stream of data
● Whole history of events
● Historical behaviour
● Works retrospectively, can answer questions in the future that we don’t know exist yet
● Various data types from various sources
● Large amounts of fast data
● Automated analytics

Page 34:

● Cassandra 3.0 - user defined functions, functional indexes, aggregation functions, materialized views
● Server side denormalization
● Eventual consistency
● Copy of data with different partitioning

[Figure labels: userId, performance]

Page 35:

● In memory dataflow distributed data processing framework, streaming and batch
● Distributes computation using a higher level API
● Load balancing
● Moves computation to data
● Fault tolerant

Page 36:

● Resilient Distributed Datasets
● Fault tolerance
● Caching
● Serialization
● Transformations
  ○ Lazy, form the DAG
  ○ map, filter, flatMap, union, group, reduce, sort, join, repartition, cartesian, glom, ...
● Actions
  ○ Execute DAG, retrieve result
  ○ reduce, collect, count, first, take, foreach, saveAs…, min, max, ...
● Accumulators
● Broadcast Variables
● Integration
● Streaming
● Machine Learning
● Graph Processing
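The split between lazy transformations and eager actions can be illustrated without Spark, using a Scala collection view as a stand-in for an RDD. This is an analogy under that assumption, not Spark code: the view records the `map` and `filter` steps, and only the final `sum` (the "action") runs them.

```scala
object LazyPipeline {
  // Returns (elements evaluated before the action, result, elements evaluated after the action).
  def demo(): (Int, Int, Int) = {
    var evaluated = 0
    // "Transformations": nothing is computed yet, the view only records the steps.
    val transformed = (1 to 5).view
      .map { x => evaluated += 1; x * 2 }
      .filter(_ > 4)
    val beforeAction = evaluated
    // "Action": forces the recorded steps to run and produces a result.
    val result = transformed.sum
    (beforeAction, result, evaluated)
  }
}
```

Deferring the work is what lets a framework see the whole chain of steps before running it, which in Spark's case is the DAG that gets scheduled across the cluster.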

Page 37:

[Figure: textFile → map → map → reduceByKey → collect]

sc.textFile("counts")
  .map(line => line.split("\t"))
  .map(word => (word(0), word(1).toInt))
  .reduceByKey(_ + _)
  .collect()

[4]
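For comparison, the same computation on an in-memory collection looks like this in plain Scala (2.13 or later for `groupMapReduce`). The `sumCounts` name and the sample input are assumptions for the example; the input format, a key and an integer count separated by a tab, is taken from the Spark snippet.

```scala
object CountsByKey {
  // Local equivalent of map, map, reduceByKey(_ + _), collect on tab-separated "key<TAB>count" lines.
  def sumCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .map(_.split("\t"))
      .map(fields => (fields(0), fields(1).toInt))
      .groupMapReduce(_._1)(_._2)(_ + _)
}
```

The Spark version has the same shape, except that each step runs per partition and `reduceByKey` combines partial sums across the cluster.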

Page 38:

[Figure: Spark master, Spark worker, Cassandra]

Page 39:

● Cassandra can store
● Spark can process
● Gathering large amounts of heterogeneous data
● Queries
● Transformations
● Complex computations
● Machine learning, data mining, analytics
● Now possible
● Prepared and interactive queries

Page 40:

lazy val sparkConf: SparkConf =
  new SparkConf()
    .setAppName(...).setMaster(...).set("spark.cassandra.connection.host", "127.0.0.1")

val sc = new SparkContext(sparkConf)

val data = sc.cassandraTable[T]("keyspace", "table").select("columns")

val processedData = data.flatMap(...)...

processedData.saveToCassandra("keyspace", "table")

Page 41:

● Akka Analytics project
● Handles custom Akka serialization

case class JournalKey(persistenceId: String, partition: Long, sequenceNr: Long)

lazy val sparkConf: SparkConf =
  new SparkConf()
    .setAppName(...).setMaster(...).set("spark.cassandra.connection.host", "127.0.0.1")

val sc = new SparkContext(sparkConf)

val events: RDD[(JournalKey, Any)] = sc.eventTable()

events.sortByKey().map(...).filter(...).collect().foreach(println)

Page 42:

● Spark streaming
● Precomputing using Spark or replication, often aiming for a different data model

[Figure: Operational cluster and Analytics cluster, linked by precomputation / replication; integration with other data sources]

Page 43:

// Deviations extracted from the journalled sessions.
val exerciseDeviations = sc.eventTable().cache()
  .filterClass[EntireResistanceExerciseSession]
  .flatMap(_.deviations)

// Variant 1: SQL over a table named exerciseDeviations (its registration is not shown on the slide).
val deviationsFrequency = sqlContext.sql(
  """SELECT planned.exercise, hour(time), COUNT(1)
     FROM exerciseDeviations
     WHERE planned.exercise = 'bench press'
     GROUP BY planned.exercise, hour(time)""")

// Variant 2: DataFrame API (exerciseDeviationsDF is not defined on the slide).
val deviationsFrequency2 = exerciseDeviationsDF
  .where(exerciseDeviationsDF("planned.exercise") === "bench press")
  .groupBy(
    exerciseDeviationsDF("planned.exercise"),
    exerciseDeviationsDF("time"))
  .count()

// Variant 3: RDD API.
val deviationsFrequency3 = exerciseDeviations
  .filter(_.planned.exercise == "bench press")
  .groupBy(d => (d.planned.exercise, d.time.getHours))
  .map(d => (d._1, d._2.size))

Page 44:

// Feature vector used to cluster users.
def toVector(user: User): mllib.linalg.Vector =
  Vectors.dense(
    user.frequency, user.performanceIndex, user.improvementIndex)

val events: RDD[(JournalKey, Any)] = sc.eventTable().cache()

val users: RDD[User] = events.filterClass[User]

val kmeans = new KMeans()
  .setK(5)
  .set...

// toVector is a standalone function, so it is passed to map directly.
val clusters = kmeans.run(users.map(toVector))

Page 45:

val events: RDD[(JournalKey, Any)] = sc.eventTable().cache()

// One rating per (session, exercise) pair, weighted by how often the pair occurs.
val ratings = events
  .filterClass[EntireResistanceExerciseSession]
  .flatMap(session =>
    session.sets.flatMap(set =>
      set.sets.map(exercise => (session.id.id, exercise.exercise))))
  .groupBy(e => e)
  .map(g =>
    Rating(normalize(g._1._1), normalize(g._1._2),
      normalize(g._2.size)))

val model = new ALS().run(ratings)

val predictions = model.predict(recommend)

Rating matrix (columns: bench press, bicep curl, dead lift)

user 1 5 2
user 2 4 3
user 3 5 2
user 4 3 1

Page 46:

val events = sc.eventTable().cache().toDF()

val lr = new LinearRegression()

val pipeline = new Pipeline().setStages(Array(new UserFilter(), new ZScoreNormalizer(),
  new IntensityFeatureExtractor(), lr))

// build() turns the grid into the Array[ParamMap] that setEstimatorParamMaps expects.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.fitIntercept, Array(true, false))
  .build()

getEligibleUsers(events, sessionEndedBefore)
  .map { user =>
    val trainValidationSplit = new TrainValidationSplit()
      .setEstimator(pipeline)
      .setEvaluator(new RegressionEvaluator)
      .setEstimatorParamMaps(paramGrid)

    val model = trainValidationSplit.fit(
      events,
      ParamMap(ParamPair(userIdParam, user)))

    val testData = // Prepare test data.

    val predictions = model.transform(testData)

    submitResult(userId, predictions, config)
  }

Page 47:

val events: RDD[(JournalKey, Any)] = sc.eventTable().cache()

val connections = events.filterClass[Connections]

val vertices: RDD[(VertexId, Long)] =
  connections.map(c => (c.id, 1L))

val edges: RDD[Edge[Long]] = connections
  .flatMap(c => c.connections
    .map(Edge(c.id, _, 1L)))

val graph = Graph(vertices, edges)

val ranks = graph.pageRank(0.0001).vertices
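To show what the ranking step computes, here is a small textbook PageRank by power iteration in plain Scala, stopping once no rank changes by more than a tolerance. It is a sketch, not GraphX's implementation: GraphX scales ranks differently and runs distributed, and the `pageRank` signature, the damping factor default of 0.85 and the edge-list input are assumptions for this example. It expects a non-empty edge list.

```scala
object SimplePageRank {
  // edges: (source, target) pairs; tol: stop when the largest rank change is <= tol;
  // d: damping factor (0.85 is a conventional choice, assumed here).
  def pageRank(edges: Seq[(Long, Long)], tol: Double, d: Double = 0.85): Map[Long, Double] = {
    val nodes = edges.flatMap { case (s, t) => Seq(s, t) }.distinct
    val n = nodes.size
    val outDegree = edges.groupBy(_._1).map { case (s, es) => s -> es.size }
    var ranks: Map[Long, Double] = nodes.map(v => v -> 1.0 / n).toMap
    var delta = Double.MaxValue
    while (delta > tol) {
      // Each node splits its current rank evenly among its outgoing edges.
      val contributions = edges
        .map { case (s, t) => t -> ranks(s) / outDegree(s) }
        .groupBy(_._1)
        .map { case (t, cs) => t -> cs.map(_._2).sum }
      val next = nodes.map(v => v -> ((1 - d) / n + d * contributions.getOrElse(v, 0.0))).toMap
      delta = nodes.map(v => math.abs(next(v) - ranks(v))).max
      ranks = next
    }
    ranks
  }
}
```

In the connections graph above, a user with many incoming connections from well-connected users ends up with a higher rank than one nobody connects to.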

Page 48:
Page 49:
Page 50:
Page 51:

7 * Dumbbell Alternating Curl

Page 52:

[Figure: two pipelines, each Data → Preprocessing → Features; one ends in Training, the other in Testing, which reports Error %]

Page 53:

● Exercise domain as an example
● Analytics of both batch (offline) and streaming (online) data
● Analytics important in other areas (banking, stock market, network, cluster monitoring, business intelligence, commerce, internet of things, ...)
● Enabling value of data

Page 54:

● Event sourcing
● CQRS
● Technologies to handle the data
  ○ Spark
  ○ Mesos
  ○ Akka
  ○ Cassandra
  ○ Kafka
● Handling data
● Insights and analytics enable value in data

Page 55:
Page 56:

● Jobs at www.cakesolutions.net/careers
● Code at https://github.com/muvr
● Martin Zapletal @zapletal_martin
● Anirvan Chakraborty @anirvan_c

Page 57:

[1] http://www.benstopford.com/2015/04/28/elements-of-scale-composing-and-scaling-data-platforms/

[2] http://malteschwarzkopf.de/research/assets/google-stack.pdf

[3] http://malteschwarzkopf.de/research/assets/facebook-stack.pdf

[4] http://www.slideshare.net/LisaHua/spark-overview-37479609