Put your thinking caps on!
• Let's design an e-com website which should
• capture all user interactions (every event)
• should be able to run analytics and come up with good recommendations
• have a stock ticker for the owner to monitor performance across categories
This is what it looks like!
Website with recommendations
category wise conversions ticker for the owner
Components?
UI for users
Queuing
Database
UI to monitor performance
Data fetcher
Analytics algorithms (R)
Correct way?
• Yes, as long as it works
• Build simple solutions with a shorter time to market
• Don’t run blind
Problems?
• Which queuing system to choose?
• How do I handle the load?
• How do I provide real time insights?
• How real time is data fetcher?
• When am I doomed?
Capture every stat! Monitor everything!
• Stats logging tools
• Graphite
• Ganglia
• OpenTSDB
• Monitoring tools
• Nagios
• Bosun
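A minimal sketch of logging one stat to Graphite, assuming a Carbon listener on its default plaintext port (2003). The metric name, host, and the helper names `format_metric` and `send_metric` are hypothetical, chosen only for illustration.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol: '<path> <value> <timestamp>'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="localhost", port=2003, timestamp=None):
    """Send a single metric over TCP. Host and port are assumptions about your setup."""
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Only prints the line, so no Graphite server is needed to try it.
    print(format_metric("shop.checkout.orders", 3, timestamp=1700000000), end="")
```

Calling `send_metric` from every service on each interesting event is one cheap way to "capture every stat"; alerting on those stats is then the job of a monitoring tool such as Nagios or Bosun.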
Graphite
Nagios
Service based architecture
Database
Service layer
UI for the user
UI for the owner
Queue
Analytics algo (R)
Scaling up service layer
• Load balancing + auto scaling
• stateless services - easier to scale
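Why stateless services are easier to scale can be sketched with a toy round-robin balancer. This is an illustration only: the instance names and the `handle_request` function are made up, and a real deployment would use a dedicated load balancer plus an auto-scaling group.

```python
import itertools


def handle_request(instance_name, request):
    """Stateless handler: the reply depends only on the request, not on earlier calls."""
    return f"{instance_name} served {request['path']}"


class RoundRobinBalancer:
    """Hands each incoming request to the next instance in turn."""

    def __init__(self, instances):
        self._cycle = itertools.cycle(instances)

    def dispatch(self, request):
        return handle_request(next(self._cycle), request)


balancer = RoundRobinBalancer(["svc-1", "svc-2", "svc-3"])
replies = [balancer.dispatch({"path": "/cart"}) for _ in range(4)]
for reply in replies:
    print(reply)
```

Because no instance remembers anything between requests, any instance can serve any request, so adding capacity is just adding names to the list.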
Scaling up app layer
• Distributed scheduler
• Map-Reduce jobs
• Storm
• Spark
• Kafka + storm for stream processing
• SQS
#mychoice?
• HBase, Mongo, neo4j are cool
• operational maturity
• expertise / skills
• MySQL / PostgreSQL
• Every computer engineer would have learnt this in college
• Start with a simple solution, capture right signals, know when to scale
Signals to capture
• Disk usage
• RAM usage
• size of indexes
• Disk / RAM ratio
• Slow logs
• Table crashes
• Box crashes
• Number of queries
• Locks? Lock wait timeouts?
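One way to turn these signals into a "know when to scale" check, as a rough sketch. The field names and thresholds below are assumptions for illustration, not recommended values; the numbers would come from whatever stats tool you already run.

```python
from dataclasses import dataclass


@dataclass
class DbSignals:
    disk_used_gb: float
    disk_total_gb: float
    ram_gb: float
    index_size_gb: float
    slow_queries_per_min: float
    lock_wait_timeouts_per_min: float


def scaling_warnings(s, max_disk_fraction=0.8, max_slow_per_min=10.0):
    """Return human-readable warnings; thresholds are hypothetical defaults."""
    warnings = []
    if s.disk_used_gb / s.disk_total_gb > max_disk_fraction:
        warnings.append("disk nearly full")
    if s.index_size_gb > s.ram_gb:
        warnings.append("indexes no longer fit in RAM")
    if s.slow_queries_per_min > max_slow_per_min:
        warnings.append("slow queries rising")
    if s.lock_wait_timeouts_per_min > 0:
        warnings.append("lock wait timeouts seen")
    return warnings


sample = DbSignals(900, 1000, 64, 80, 2, 0)
print(scaling_warnings(sample))
```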
Scaling up database layer
• Probably the hardest
• Inherently stateful!
• Replication is a must
• Large data-sets! - GBs, TBs, PBs - keeps growing
• fault tolerance harder
• “last mile” of complete web-stack scalability
Challenges for high volume MySQL
• Indexes don’t fit in memory any more!
• schema changes are harder / impossible
• frequent table crashes
• Reliable backup-restore
• locking issues
Sharding
• Scale out
• MySQL clustering
DB Service
Routing Table
Database   Database   Database
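A minimal sketch of what the DB service's routing table could look like, assuming hash-based sharding. The bucket count, shard names, and function names are hypothetical; the slides do not say which sharding scheme was actually used.

```python
import hashlib

# Hypothetical routing table: more buckets than shards, so re-sharding
# means re-pointing a few buckets rather than rehashing every key.
ROUTING_TABLE = {
    0: "db-shard-a", 1: "db-shard-a", 2: "db-shard-a",
    3: "db-shard-b", 4: "db-shard-b", 5: "db-shard-b",
    6: "db-shard-c", 7: "db-shard-c",
}
NUM_BUCKETS = len(ROUTING_TABLE)


def bucket_for(key):
    """Map a key to a stable bucket number using a hash of its string form."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def shard_for(key):
    """Look up which database holds the rows for this key."""
    return ROUTING_TABLE[bucket_for(key)]


if __name__ == "__main__":
    print(shard_for(842378947))
```

Callers only ever ask the DB service for data by key; which physical database answers is decided by the table, so the table can change without touching application code.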
Helps?
• Small databases are fast
• Bigger ones are slower
• keep them small and reap the benefits
• Run queries using parallel processing and collate the results
• Keep collecting stats!
• Re-shard when needed
• replication lag can result in lost transactions
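"Run queries using parallel processing and collate the results" can be sketched as a scatter-gather over shards. Here each shard is just an in-memory list of (category, conversions) rows standing in for a real database; the data and function names are invented for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Three toy shards, each holding part of the conversions data.
SHARDS = [
    [("books", 2), ("toys", 1)],
    [("books", 1), ("shoes", 4)],
    [("toys", 3)],
]


def conversions_on_shard(rows):
    """The per-shard query: total conversions per category on one shard."""
    totals = Counter()
    for category, conversions in rows:
        totals[category] += conversions
    return totals


def conversions_by_category(shards):
    """Run the query on every shard in parallel, then merge the partial results."""
    combined = Counter()
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        for partial in pool.map(conversions_on_shard, shards):
            combined.update(partial)  # Counter.update adds counts together
    return dict(combined)


print(conversions_by_category(SHARDS))
```

The same shape works for the owner's category-wise ticker: each shard answers for its own slice, and only small partial totals travel over the network.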
#NoSQL
• Johan Oskarsson
• In-Memory database
• eventual consistency
• no transactional support
• Typical NoSQL DBs
• Document databases
• key/value store
• Hybrid
• graph databases
• columnar databases
Criteria for choosing a DB
• ACID Properties
• Join support?
• Performance (inserts, updates, queries, deletes)
• Machine requirements -> TCO
• Community edition / enterprise edition / community support
• Schemaless?
• scalable?
• write-to-master-read-from-slave
• Always consistent / eventual consistency
• Business problem being solved
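Two of the criteria above, write-to-master-read-from-slave and eventual consistency, are easier to weigh with a toy model. The class below is a sketch only: plain dicts stand in for databases, and `replicate()` stands in for asynchronous replication catching up.

```python
import itertools


class ReplicatedStore:
    """Writes go to the master; reads are spread across replicas."""

    def __init__(self, replica_count=2):
        self.master = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))
        self._pending = []  # writes the replicas have not seen yet

    def write(self, key, value):
        self.master[key] = value
        self._pending.append((key, value))

    def read(self, key):
        replica = self.replicas[next(self._next_replica)]
        return replica.get(key)  # may be stale until replicate() runs

    def replicate(self):
        """Apply pending writes to every replica (simulates replication lag ending)."""
        for key, value in self._pending:
            for replica in self.replicas:
                replica[key] = value
        self._pending.clear()


store = ReplicatedStore()
store.write("cart:42", ["book"])
stale = store.read("cart:42")   # replicas have not caught up yet
store.replicate()
fresh = store.read("cart:42")
print(stale, fresh)
```

If the business problem cannot tolerate the stale read, that read has to go to the master, which gives back some of the scaling benefit.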
Document stores
{
  "customer_id": 842378947,
  "customer": {
    "name": "Harshad",
    "company": "Sokrati",
    "interestAreas": "Algorithms, Analytics"
  },
  "Address": { . . }
}
Column Family
Key-value stores
Key1 -> Value1
Key2 -> Value2
.
.
.
KeyN -> ValueN
Graph Databases
Let's re-design our solution for scale!
Problem statement
• Let's design an e-com website which should
• capture all user interactions (every event)
• should be able to run analytics and come up with good recommendations
• have a stock ticker for the owner to monitor performance across categories
UI (user)
UI (Owner)
Kafka
Data collector
DB Service
Database   Database   (MySQL)
ETL
Database   Database   (Columnar DB)
Service Analytics algorithms Mongo
Service
Data Collector
Graph DB
Collect every stat! Monitor every event!
Sokrati architecture evolution
Single MySQL server
Sharded MySQL Solution
HBase
Database as a service
Sharded Columnar DBs
DB As A Service
• We decided to build our DB warehouse as a service
• for it makes developers' lives easier
• for it makes schema modifications seamless
• for it makes database choice more flexible
• for it lets app teams focus exclusively on business logic
• One service to rule all data :-)
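The idea behind a DB service can be sketched as a thin facade over pluggable backends. Everything here is hypothetical (class names, method names, the in-memory backend); it is not Sokrati's implementation, just one shape such a service could take.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """What the service needs from any database it fronts."""

    @abstractmethod
    def put(self, table, key, row): ...

    @abstractmethod
    def get(self, table, key): ...


class InMemoryBackend(Backend):
    """Stand-in for MySQL, HBase, a columnar DB, and so on."""

    def __init__(self):
        self._tables = {}

    def put(self, table, key, row):
        self._tables.setdefault(table, {})[key] = dict(row)

    def get(self, table, key):
        return self._tables.get(table, {}).get(key)


class DataService:
    """Single entry point for app teams; backends can be swapped per table."""

    def __init__(self, default_backend):
        self._default = default_backend
        self._by_table = {}

    def use_backend(self, table, backend):
        self._by_table[table] = backend

    def _backend(self, table):
        return self._by_table.get(table, self._default)

    def save(self, table, key, row):
        self._backend(table).put(table, key, row)

    def load(self, table, key):
        return self._backend(table).get(table, key)


service = DataService(InMemoryBackend())
events_store = InMemoryBackend()
service.use_backend("events", events_store)  # e.g. move events to another DB
service.save("events", 1, {"type": "click"})
service.save("orders", 7, {"total": 250})
print(service.load("events", 1), service.load("orders", 7))
```

App code calls `save` and `load` the same way before and after a table moves, which is what keeps the database choice flexible.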
Take-aways
• All the databases are here to stay
• Your solution will have a combination of databases
• Choose the right one for your problem
• Business needs drive selection
• collect every stat, monitor every event!
• Be prepared for a failure