What is FullStory?
• An Atlanta-based startup
• A bunch of former ATL Googlers
• Launched ~9 months ago
• A service for deeply understanding your users
• @fullstory
• www.fullstory.com
What is FullStory?
• “Capture everything” approach to understanding user behavior
• One script on page (think Google Analytics; no wiring events)
• Captures the full DOM, DOM changes, CSS, etc.
• A demo shows this way better than I can…
Analysis Pipeline
[Diagram: the analysis pipeline, running on Compute Engine]
Once a page is finished recording… (either an UNLOAD event or an idle timeout)
System Requirements
• Must be reliable
• Must be cheap (high scale)
• Must be scalable
• ~ 1:1 reads/writes (for now…)
Bundle Schema

CREATE TABLE event_bundles (
  org_id     ascii,
  user_id    bigint,
  session_id bigint,
  page_id    bigint,
  seqno      int,
  bundle     blob,
  PRIMARY KEY ((org_id, user_id, session_id, page_id), seqno)
);

• Partition key: (org_id, user_id, session_id, page_id)
• Clustering column: seqno
• bundle: a bundle of DOM events
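To replay a page, all of its bundles live in a single partition and come back ordered by seqno; a minimal sketch of the read path (the identifier values are illustrative, not real data):

SELECT seqno, bundle
  FROM event_bundles
 WHERE org_id = 'acme'
   AND user_id = 42
   AND session_id = 7
   AND page_id = 3;   -- one partition, rows returned in seqno order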
Bundle Schema
• One gotcha: occasionally we hit long tombstone scans (expired TTL’d cells linger as tombstones until compaction purges them, and a slice query has to read past them):
WARN [ReadStage:6] 2015-01-22 12:37:00,185 SliceQueryFilter.java (line 225) Read 0 live and 1048 tombstoned cells in fullstory.event_bundles (see tombstone_warn_threshold).
Our current usage
• 3 nodes (8 cores each)
• 1 TB network-attached SSDs each
• Cassandra v2.0.13
• RF=3, QUORUM reads and writes
• 4-week TTL on all bundles (write sketch below)
• ~0.5 TB new data per day
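For reference, a cqlsh-style sketch of a bundle write with the 4-week TTL attached (all values are illustrative):

CONSISTENCY QUORUM;   -- cqlsh command; drivers set consistency per request

INSERT INTO event_bundles (org_id, user_id, session_id, page_id, seqno, bundle)
VALUES ('acme', 42, 7, 3, 0, 0xCAFE)   -- hypothetical ids and payload
USING TTL 2419200;   -- 28 days in seconds; expired cells become tombstones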
Managing Repairs
• Manual for a long time
• Luckily our data is duplicate-insensitive (writes are keyed by seqno, so replays just overwrite)
• Now we have an App Engine cron job that calls forceRepairAsync() via JMX
• Switching to incremental repairs (introduced in v2.1) would probably help
Don’t redline the CPU
• CPU spikes happen:
• Compaction
• Read repair
• Node repair
• GC
• Unlucky reads (hitting lots of SSTables)
Don’t redline your disk either
• “Compaction needs free disk space to write the new SSTable, before the SSTables being compacted are removed.” [1]
• For SizeTiered compaction, the worst case is 50% of the disk: a major compaction can rewrite every SSTable into one, so the output can be nearly as large as all the data already on disk
• [1] http://www.slideshare.net/planetcassandra/201404-cluster-sizing
Future Plans
• DateTieredCompactionStrategy (see the sketch after this list)
• Committed by Björn Hegerfors (Spotify) Oct 2014
• Incremental Repairs
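Once we are on 2.1, the DTCS switch would be a one-line schema change; a hedged sketch (the option value is an illustrative guess, not a tuned setting):

ALTER TABLE event_bundles
WITH compaction = {
  'class': 'DateTieredCompactionStrategy',
  'max_sstable_age_days': '28'   -- illustrative: stop re-compacting data older than our TTL window
};

DTCS groups SSTables by write time, which fits an append-once, TTL-everything workload like ours: whole SSTables expire together and can be dropped without compacting them.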
Future Plans
• New C* cluster!
• The output of event processing is a set of JSON documents
• Currently we store these in Google Cloud Storage (GCS)
• GCS storage is cheap, but we are paying too much for upload requests
GcsMux
• Currently in development
• Idea: gather up multiple small files on disk, merge them into a single file to be uploaded
• How do you retrieve a file later? Store filename & location in Cassandra.
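A minimal sketch of that index table (hypothetical: the table and column names here are ours, for illustration, not the real schema):

CREATE TABLE gcs_file_index (
  filename   text,     -- logical name of the original small JSON document
  gcs_object text,     -- name of the merged GCS object it was packed into
  offset     bigint,   -- byte offset of the file within the merged object
  length     bigint,   -- byte length of the file
  PRIMARY KEY (filename)
);

A retrieval then becomes a Cassandra point read for (gcs_object, offset, length) followed by a single GCS range read.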