What is FullStory?
• An Atlanta-based startup
• A bunch of former ATL Googlers
• Launched ~9 months ago
• A service for deeply understanding your users
• @fullstory
• www.fullstory.com
What is FullStory?
• “Capture everything” approach to understanding user behavior
• One script on page (think Google Analytics; no wiring events)
• Captures the full DOM, DOM changes, CSS, etc.
• A demo shows this way better than I can…
Analysis Pipeline
[Diagram: the analysis pipeline, running on Compute Engine]
Once a page is finished recording… (either an UNLOAD event or an idle timeout)
System Requirements
• Must be reliable
• Must be cheap (high scale)
• Must be scalable
• ~ 1:1 reads/writes (for now…)
Bundle Schema

CREATE TABLE event_bundles (
  org_id     ascii,
  user_id    bigint,
  session_id bigint,
  page_id    bigint,
  seqno      int,
  bundle     blob,
  PRIMARY KEY ((org_id, user_id, session_id, page_id), seqno)
);

• Partition key: (org_id, user_id, session_id, page_id)
• Clustering column: seqno
• bundle: a bundle of DOM events
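To replay a page, all of its bundles live in a single partition and come back ordered by seqno; a minimal sketch of the read path (the identifier values are illustrative, not real data):

SELECT seqno, bundle
  FROM event_bundles
 WHERE org_id = 'acme'
   AND user_id = 42
   AND session_id = 7
   AND page_id = 3;   -- one partition, rows returned in seqno order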
Bundle Schema
• One gotcha: occasionally we hit long tombstone scans (expired TTL’d cells linger as tombstones until compaction purges them, and a slice query has to read past them):
WARN [ReadStage:6] 2015-01-22 12:37:00,185 SliceQueryFilter.java (line 225) Read 0 live and 1048 tombstoned cells in fullstory.event_bundles (see tombstone_warn_threshold).
Our current usage
• 3 nodes (8 cores each)
• 1 TB network-attached SSDs each
• Cassandra v2.0.13
• RF=3, QUORUM reads and writes
• 4-week TTL on all bundles (write sketch below)
• ~0.5 TB new data per day
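For reference, a cqlsh-style sketch of a bundle write with the 4-week TTL attached (all values are illustrative):

CONSISTENCY QUORUM;   -- cqlsh command; drivers set consistency per request

INSERT INTO event_bundles (org_id, user_id, session_id, page_id, seqno, bundle)
VALUES ('acme', 42, 7, 3, 0, 0xCAFE)   -- hypothetical ids and payload
USING TTL 2419200;   -- 28 days in seconds; expired cells become tombstones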
Managing Repairs
• Manual for a long time
• Luckily our data is duplicate-insensitive (writes are keyed by seqno, so replays just overwrite)
• Now we have an App Engine cron job that calls forceRepairAsync() via JMX
• Switching to incremental repairs (introduced in v2.1) would probably help
Don’t redline the CPU
• CPU spikes happen:
• Compaction
• Read repair
• Node repair
• GC
• Unlucky reads (hitting lots of SSTables)
Don’t redline your disk either
• “Compaction needs free disk space to write the new SSTable, before the SSTables being compacted are removed.” [1]
• For SizeTiered compaction, the worst case is 50% of the disk: a major compaction can rewrite every SSTable into one, so the output can be nearly as large as all the data already on disk
• [1] http://www.slideshare.net/planetcassandra/201404-cluster-sizing
Future Plans
• DateTieredCompactionStrategy (see the sketch after this list)
• Committed by Björn Hegerfors (Spotify) Oct 2014
• Incremental Repairs
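Once we are on 2.1, the DTCS switch would be a one-line schema change; a hedged sketch (the option value is an illustrative guess, not a tuned setting):

ALTER TABLE event_bundles
WITH compaction = {
  'class': 'DateTieredCompactionStrategy',
  'max_sstable_age_days': '28'   -- illustrative: stop re-compacting data older than our TTL window
};

DTCS groups SSTables by write time, which fits an append-once, TTL-everything workload like ours: whole SSTables expire together and can be dropped without compacting them.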
Future Plans
• New C* cluster!
• The output of event processing is a set of JSON documents
• Currently we store these in Google Cloud Storage (GCS)
• GCS storage is cheap, but we are paying too much for upload requests
GcsMux
• Currently in development
• Idea: gather up multiple small files on disk, merge them into a single file to be uploaded
• How do you retrieve a file later? Store filename & location in Cassandra.
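A minimal sketch of that index table (hypothetical: the table and column names here are ours, for illustration, not the real schema):

CREATE TABLE gcs_file_index (
  filename   text,     -- logical name of the original small JSON document
  gcs_object text,     -- name of the merged GCS object it was packed into
  offset     bigint,   -- byte offset of the file within the merged object
  length     bigint,   -- byte length of the file
  PRIMARY KEY (filename)
);

A retrieval then becomes a Cassandra point read for (gcs_object, offset, length) followed by a single GCS range read.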