About Me
• Thierry Schellenbach
• Founder / CTO Fashiolista
• Github/tschellenbach
• Feedly & Django Facebook
• Blog: mellowmorning.com
• @tschellenbach
Today
• Fashiolista’s growth
• Pre Cassandra feed systems
• Github/tschellenbach/Feedly
– Cassandra learnings
– Remaining challenges
Growth
2nd largest fashion community
• 1.5M members
• 17M loves/month
• 94M pageviews (Google Analytics)
Our Stack
• Django/Python
• PostgreSQL / PgBouncer
• Cassandra
• Redis
• Solr
• Celery / RabbitMQ
• AWS / Ubuntu
• Nginx / Gunicorn / Supervisor
• New Relic, Datadog & Sentry
Feed History
1. PostgreSQL
2. Redis – Feedly 0.1
3. Cassandra – Feedly 0.9
More details in this highscalability post:
http://bit.ly/hsfeedly
PostgreSQL - Pull
1. Smooth until we reached ~100M activities
2. Performance spikes caused by the query planner
Cassandra - Feedly 0.9
1. Few moving components
2. Supported by Datastax
3. Instagram
4. Easy to add capacity
5. Cost effective
We open sourced Feedly!
• Github/tschellenbach/Feedly
• Python library for building newsfeed and notification systems on Cassandra and/or Redis
Cassandra Challenges
1. Which Python library to choose?
• Pycassa
• CQLEngine (using the old CQL module)
• Python-Driver (beta)
• Fork CQLEngine to support Python-Driver
– Github/tbarbugli/cqlengine
Cassandra Challenges
2. Importing data (300M loves * 1000 followers = 300 billion activities)
• High CPU load
• Nodes going down
• Start with many nodes, scale down afterwards
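The load comes from fanning each love out to every follower's feed. A quick back-of-the-envelope check with the slide's numbers (the 50k writes/sec sustained rate is an assumption, only there to show the order of magnitude):

```python
# Fan-out on write: every activity is copied into each follower's feed.
loves = 300_000_000       # historical loves to import (slide figure)
avg_followers = 1_000     # average followers per user (slide figure)

rows_to_write = loves * avg_followers
print(rows_to_write)      # 300 billion rows to insert

# At an assumed 50k sustained writes/sec, the import alone takes ~69 days,
# which is why it pays to start with many nodes and scale down afterwards.
seconds = rows_to_write / 50_000
days = seconds / 86_400
print(round(days))
```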
Cassandra Challenges
3. Optimizing import speed (300M loves * 1000 followers = 300 billion activities)
• Python-Driver
• Batch queries
• Non-Atomic (unlogged) batch queries
• Prepared statements
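A minimal sketch of that import loop. The chunking helper is plain Python; the driver calls (`session.prepare`, `BatchStatement` with `BatchType.UNLOGGED`) follow the python-driver API, while the column list matches the timeline_flat schema and the batch size of 100 is an assumption to tune per cluster:

```python
from itertools import islice

def chunked(iterable, size):
    """Yield lists of at most `size` items; each list becomes one batch."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def import_feeds(session, activities, batch_size=100):
    # session is an open python-driver Session; batch_size is an assumption.
    from cassandra.query import BatchStatement, BatchType

    # Prepared once, so Cassandra only parses the statement a single time.
    insert = session.prepare(
        "INSERT INTO timeline_flat (feed_id, activity_id, actor, verb, object, time) "
        "VALUES (?, ?, ?, ?, ?, ?)"
    )
    for chunk in chunked(activities, batch_size):
        # UNLOGGED skips the atomic batch log: much cheaper, no atomicity.
        batch = BatchStatement(batch_type=BatchType.UNLOGGED)
        for row in chunk:
            batch.add(insert, row)
        session.execute(batch)
```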
Cassandra Challenges
4. Data model denormalization
CREATE TABLE fashiolista_feedly.timeline_flat (
  feed_id ascii,
  activity_id varint,
  actor int,
  extra_context blob,
  object int,
  target int,
  time timestamp,
  verb int,
  PRIMARY KEY (feed_id, activity_id)
) WITH CLUSTERING ORDER BY (activity_id ASC)
  AND bloom_filter_fp_chance=0.010000
  AND caching='KEYS_ONLY'
  AND dclocal_read_repair_chance=0.000000
  AND gc_grace_seconds=864000
  AND read_repair_chance=0.100000
  AND replicate_on_write='true'
  AND populate_io_cache_on_flush='false'
  AND compaction={'class': 'SizeTieredCompactionStrategy'}
  AND compression={'sstable_compression': 'LZ4Compressor'};
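With this model, one feed is a single partition ordered by activity_id, so reading a page of a feed is a single-partition slice (the feed_id value and page size here are illustrative):

```sql
-- Latest 25 activities for one feed; reversing the clustering order
-- in the query is allowed because activity_id is the clustering column.
SELECT activity_id, actor, verb, object, time
FROM fashiolista_feedly.timeline_flat
WHERE feed_id = 'feed:1234'
ORDER BY activity_id DESC
LIMIT 25;
```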
Evaluation
7 instances, m1.xlarge, 2.59 TB
Cassandra 2.0.0, CQL3, Python-driver
(Would have been one expensive Redis cluster)
Current Challenges
How do we limit the storage for feeds?
Trimming? (Not supported)
DELETE from timeline_flat WHERE activity_id < 5000
Use a TTL on the rows?
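A TTL would push the trimming into Cassandra itself: each row expires on its own, so feed storage stays bounded without client-side deletes. A sketch (the 30-day TTL and values are arbitrary examples; the trade-off is that old items also expire from rarely-read feeds):

```sql
-- Each activity expires 30 days (2592000 s) after being written.
INSERT INTO fashiolista_feedly.timeline_flat
    (feed_id, activity_id, actor, verb, object, time)
VALUES ('feed:1234', 42, 17, 1, 99, '2013-11-01 12:00:00')
USING TTL 2592000;
```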
Fork Feedly
This is our first time using Cassandra, so let us know how we can further speed up our implementation:
http://bit.ly/feedlycassandra