Post on 05-Dec-2014

Flipkart Website Architecture

Mistakes & Learnings

Siddhartha Reddy
Architect, Flipkart

June 2007

November 2007

December 2012

www.flipkart.com

• Started in 2007
• Current architecture from mid-2010
• Evolution of the architecture presented as…

Issue[1] → RCA[2] → Actions → Learnings

• [1] Issue: Website is “slow”
• [2] RCA = Root Cause Analysis

INFANCY (2007 – MID-2010)
Surviving & reacting to the environment

Website is “slow”!

RCA

• Why?
  – MySQL queries taking too long
• Why?
  – Too many queries
  – Many slow queries
  – Queries locking tables
• Why?
  – Capacity
• Hmm…

Fixing it

• Get beefier servers (the obvious)
• Separate master_db, slave_db
  – Writes go to master_db
  – Reads from slave_db
  – Critical reads from master_db

[Diagram: before, a single MySQL server handles both reads and writes; after, a MySQL master takes the writes and replicates to a MySQL slave that serves the reads.]

Learning from it

• Scale out database reads by distributing load across systems
• Isolate database writes from reads
  – Writes are (usually) more critical
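The master/slave split above can be sketched as a small routing rule. This is only an illustration, not Flipkart's code: the class name `ReadWriteRouter` and the `critical` flag are hypothetical, and the connections are represented by plain strings. It assumes that critical reads go to the master because a replica can lag behind it.

```python
import itertools


class ReadWriteRouter:
    """Picks a database for a statement: master for writes, a slave for reads."""

    def __init__(self, master, slaves):
        self.master = master
        # Round-robin over the slaves so read load is spread across them.
        self._slaves = itertools.cycle(slaves)

    def pick(self, sql, critical=False):
        verb = sql.lstrip().split(None, 1)[0].upper()
        # Anything that is not a plain SELECT is treated as a write.
        # Critical reads also go to the master to avoid stale replica data.
        if critical or verb != "SELECT":
            return self.master
        return next(self._slaves)


router = ReadWriteRouter(master="master_db", slaves=["slave_db"])
write_target = router.pick("INSERT INTO orders VALUES (1)")
read_target = router.pick("SELECT * FROM products")
critical_target = router.pick("SELECT stock FROM inventory", critical=True)
```

Adding more slaves to the list spreads reads further without touching the write path.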

Website is “slow”! (Again)

RCA

• Why?
  – MySQL queries taking too long (on slave_db)
• Why?
  – Too many queries
  – Many slow queries
• Why?
  – Queries from analytics / reporting and other backend jobs
• Urm…

Fixing it

• Analytics / reporting DB (archival_db)
  – Use MyISAM (optimized for reads)
  – Additional indexes for quicker reporting

[Diagram: before, the MySQL master takes website writes and replicates to a single MySQL slave that serves both website reads and analytics reads; after, the master replicates to Slave 1, which serves website reads, and to Slave 2, which serves analytics reads.]

Learning from it

• Isolate the databases serving website traffic from those used for analytics/reporting

• Isolate systems being used by production website from those being used for background processing

BABY (2010 – 2011)
Learning the basics

Website is “slow”!

RCA

• Why?
• How?
  – Instrumentation

RCA - 1

• Why?
  – Logging a lot
  – PHP processes blocking on writing logs

[Diagram: Request1 → Process1, Request2 → Process2 and Request3 → Process3 all write to the same log file; Process1 and Process2 are waiting while Process3 is writing.]

RCA - 2

• Why?
  – Service Oriented Architecture (SOA)
  – Too many calls to remote services per request
    • Creating fresh connection for each call
    • All the calls are made in serial order

[Diagram: Receive request → Connect to Service1 → Request Service1 → Connect to Service2 → Request Service2 → Send response]

RCA - 3

• Why?
  – Configurability
  – Fetch a lot of “config” from database for serving each request

[Diagram: Receive request → Fetch Config1 → Fetch Config2 → Fetch Config3 → Fetch Config4 → Send response; every fetch goes to the database.]

RCA – 1,2,3

• Why?
  – Logging a lot
  – SOA
  – Configurability
• Why?
  – PHP’s process model
• Argh!

Fixing it

• fk-w3-agent
  – Simple Java “middleware” daemon
  – Deployed on each web server
  – PHP communicates with it through a local socket
  – Hosts pluggable “handlers”

fk-w3-agent: LoggingHandler

[Diagram: before, Request1 → Process1, Request2 → Process2 and Request3 → Process3 each write directly to the log file; after, the three processes hand their log lines to fk-w3-agent, which writes to the log file (async / buffered).]
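The "async / buffered" idea can be sketched in a few lines. The slides describe a Java daemon reached over a local socket; the version below is only a same-process Python illustration of the principle, and the names `BufferedLogWriter` and `max_buffered` are hypothetical. Callers enqueue a line and return immediately, while a single background thread does the actual writing.

```python
import io
import queue
import threading


class BufferedLogWriter:
    """Accepts log lines without blocking callers; one thread does the writing."""

    def __init__(self, stream, max_buffered=10000):
        self._stream = stream
        self._queue = queue.Queue(maxsize=max_buffered)
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def log(self, line):
        """Enqueue a line; return False (dropping it) if the buffer is full."""
        try:
            self._queue.put_nowait(line)
            return True
        except queue.Full:
            return False

    def _drain(self):
        while True:
            line = self._queue.get()
            if line is None:  # sentinel placed by close()
                break
            self._stream.write(line + "\n")

    def close(self):
        """Write out everything still queued, then stop the writer thread."""
        self._queue.put(None)
        self._thread.join()
        self._stream.flush()


sink = io.StringIO()  # stand-in for the log file
writer = BufferedLogWriter(sink)
for i in range(3):
    writer.log(f"request {i} served")
writer.close()
```

Dropping lines when the buffer is full is one possible policy; it trades completeness of logs for never stalling a request.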

fk-w3-agent: ServiceHandler(s)

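[Diagram: before, the PHP process receives the request, connects to Service1, requests Service1, connects to Service2, requests Service2 and then sends the response; after, it receives the request, makes one call to fk-w3-agent, which talks to Service1 and Service2, and then sends the response.]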
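The slides do not spell out how the ServiceHandler works internally. One plausible reading, given that the RCA blamed serial calls, is that the agent fans the calls out concurrently. The sketch below illustrates only that part: `call_services` and `fake_service` are hypothetical names, the remote calls are simulated with `time.sleep`, and connection reuse is not shown.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_services(calls, pool):
    """Start every service call at once and collect the results by name."""
    futures = {name: pool.submit(fn) for name, fn in calls.items()}
    return {name: future.result() for name, future in futures.items()}


def fake_service(name, latency_s):
    """Stand-in for a remote call; sleeps to simulate network latency."""
    def call():
        time.sleep(latency_s)
        return f"{name}-response"
    return call


# A long-lived pool, created once rather than per request.
pool = ThreadPoolExecutor(max_workers=4)
started = time.monotonic()
results = call_services(
    {
        "service1": fake_service("service1", 0.3),
        "service2": fake_service("service2", 0.3),
    },
    pool,
)
elapsed = time.monotonic() - started
pool.shutdown()
```

With two 0.3 s calls, the serial version would take about 0.6 s, while the concurrent one takes roughly as long as the slowest call.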

fk-w3-agent: ConfigHandler

[Diagram: before, Receive request → Fetch Config1 → Fetch Config2 → Fetch Config3 → Fetch Config4 → Send response, with every fetch going to the database; after, Receive request → Fetch all config from fk-w3-agent → Send response, while fk-w3-agent polls the database and caches the config.]
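"Poll and cache" can be sketched as a background refresher that keeps a full copy of the config in memory. This is an illustration under assumptions, not the real ConfigHandler: the class name `PolledConfigCache`, the `interval_s` parameter and the dict standing in for the database are all hypothetical.

```python
import threading
import time


class PolledConfigCache:
    """Keeps an in-memory copy of all config and refreshes it on a timer."""

    def __init__(self, load_all, interval_s=30.0):
        self._load_all = load_all      # callable returning a dict of all config
        self._interval_s = interval_s
        self._snapshot = load_all()    # initial load so reads never see an empty cache
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._poll, daemon=True)
        self._thread.start()

    def _poll(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval_s):
            self._snapshot = self._load_all()  # swap in a whole new dict

    def get_all(self):
        """Served from memory; the request path never touches the database."""
        return self._snapshot

    def stop(self):
        self._stop.set()
        self._thread.join()


database = {"Config1": "old"}  # stand-in for the config tables
cache = PolledConfigCache(lambda: dict(database), interval_s=0.05)
before_update = cache.get_all()
database["Config1"] = "new"
time.sleep(0.3)                # give the poller time to pick up the change
after_update = cache.get_all()
cache.stop()
```

The cost of this design is staleness: a config change becomes visible only after the next poll, so the interval bounds how out of date a request can be.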

Learning from it

• PHP: good for frontend and templating
  – Gives a lot of agility
  – Limiting process model
    • Hurdle for high performance
• Java: stability and performance
• Horses for courses

Website is “slow”! (Again)

RCA

• Why?
  – PHP processes taking up too much time
  – PHP processes taking up too much CPU
• Why?
  – Product info deserialization taking up time/CPU
  – View construction taking up time/CPU

Fixing it

• Caching!
• Cache fully constructed pages
  – For a few minutes
  – Only for highly trafficked pages (Homepage)
• Cache PHP serialized Product objects
  – ~20 million objects
  – Memcache
• Yeah! But…
  – Add caching => add complexity

Caching: Complications (1)

• “Caching fully constructed pages”
• But parts of pages still need to be dynamic
  – Example: Logged-in user’s name
• Impossible to do effective bucket testing
  – Or at least makes it prohibitively complex

Caching: Complications (2)

• “Caching PHP serialized Product objects”
• Without caching:
  – getProductInfo() → Fetch from CMS
• With caching, cache hit:
  – getProductInfo() → Fetch from Cache
• With caching, cache miss:
  – getProductInfo() → Fetch from Cache → Fetch from CMS → Set in Cache
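The hit and miss paths above amount to the usual read-through pattern. A minimal sketch follows, assuming a plain dict in place of Memcache and `pickle` in place of PHP serialization; `fetch_from_cms` and the product fields are hypothetical stand-ins.

```python
import pickle

cache = {}        # stand-in for Memcache
cms_fetches = []  # records every trip to the CMS, for illustration only


def fetch_from_cms(product_id):
    """Stand-in for the slow call to the CMS."""
    cms_fetches.append(product_id)
    return {"id": product_id, "title": f"Product {product_id}"}


def get_product_info(product_id):
    key = f"product:{product_id}"
    blob = cache.get(key)
    if blob is not None:                  # cache hit: fetch from cache
        return pickle.loads(blob)
    product = fetch_from_cms(product_id)  # cache miss: fetch from CMS
    cache[key] = pickle.dumps(product)    # set in cache for the next caller
    return product


first = get_product_info(42)   # miss: goes to the CMS
second = get_product_info(42)  # hit: served from the cache
```

Every caller now carries the miss-handling logic, which is part of the added complexity the slide refers to.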

Caching: Complications (3)

• TTL: ∞ (i.e. no invalidation)
• Pro-actively repopulate products in the cache
  – Receive “notifications” about product updates
    • Notification Server: pushes notifications raised by CMS
• Use a persistent, distributed cache
  – Memcache => Membase, Couchbase
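With no TTL, freshness depends entirely on the update notifications. The sketch below shows that flow under assumptions: dicts stand in for the CMS and the cache, and `on_product_updated` is a hypothetical handler name for a notification pushed by the Notification Server.

```python
cms = {7: {"id": 7, "price": 100}}  # stand-in for the CMS
cache = {}                          # stand-in for the persistent cache


def on_product_updated(product_id):
    """Handle an update notification by overwriting the cached copy."""
    cache[f"product:{product_id}"] = dict(cms[product_id])


def get_product_info(product_id):
    # No TTL and no CMS fallback in this sketch: reads rely on the cache alone.
    return cache[f"product:{product_id}"]


on_product_updated(7)  # initial population
price_before = get_product_info(7)["price"]

cms[7]["price"] = 90   # the CMS changes, but no notification has arrived yet
price_stale = get_product_info(7)["price"]

on_product_updated(7)  # the notification arrives and the cache is repopulated
price_after = get_product_info(7)["price"]
```

Because reads in this sketch never fall back to the CMS, losing the cache would mean losing the data path, which is presumably why the slide pairs this approach with a persistent, distributed cache.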

Learning from it

• Caching is a powerful tool for performance optimization

• Caching adds complexities
  – Reduced by keeping cache close to data source
  – Think deeply about TTL, invalidation

• Use caching to go from “acceptable performance” to “awesome performance”
  – Don’t rely on it to get to “acceptable performance”

KID (2012)
Growing up

Website is “slow”!

RCA

• Why?
  – Search-service is slow (or Reviews-service is slow or Recommendations-service is slow)
• But why is the rest of the website slow?
  – Requests to the slow service are blocking processing threads
• Eh?!

Let’s do some math

• Let’s say
  – Mean (or median) response time: 100 ms
  – 8-core server
  – All requests are CPU bound
• Throughput: 80 requests per second (rps)
• Let’s also say
  – 95th percentile response time: 1000 ms
    • Call them “bad requests”
• 4 bad requests in a second
  – Throughput down to 44 rps
• 8 bad requests in a second?
  – Throughput down to 8 rps
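The arithmetic behind those figures can be checked with a small model. It assumes, as the slide does, that requests are CPU bound and that each core handles one request at a time; the function name `throughput_rps` and its parameters are hypothetical.

```python
def throughput_rps(cores, normal_ms, bad_ms, bad_per_second):
    """Requests completed per second when some of them are slow.

    Each core offers 1000 ms of work per second. Bad requests claim their
    share first; normal requests use whatever core time is left.
    """
    core_ms_available = cores * 1000
    core_ms_for_bad = min(bad_per_second * bad_ms, core_ms_available)
    bad_completed = core_ms_for_bad // bad_ms
    normal_completed = (core_ms_available - core_ms_for_bad) // normal_ms
    return bad_completed + normal_completed


healthy = throughput_rps(cores=8, normal_ms=100, bad_ms=1000, bad_per_second=0)
four_bad = throughput_rps(cores=8, normal_ms=100, bad_ms=1000, bad_per_second=4)
eight_bad = throughput_rps(cores=8, normal_ms=100, bad_ms=1000, bad_per_second=8)
```

Four bad requests tie up four cores for the whole second, leaving four cores at 10 rps each plus the four bad requests themselves, hence 44. Eight bad requests occupy every core, so only those eight complete.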

Fixing it

• Aggressive timeouts for all service calls
  – Confine the impact of a slow service to the pages that depend on it
• Very aggressive timeouts for non-critical services
  – Example: Recommendations
    • On a Product page, Search results page etc.
    • Not on My Recommendations page
• Load non-critical parts of pages through AJAX
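A timeout with a fallback value is one way to keep a slow, non-critical service from holding up a page. The sketch below is illustrative only: `call_with_timeout`, the simulated services and the timeout values are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Return fn's result, or the fallback if it does not finish in time."""
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        return fallback


def slow_recommendations():
    time.sleep(0.5)  # simulates a service that is having a bad day
    return ["rec1", "rec2"]


def fast_search():
    return ["result1"]


# Non-critical: very aggressive timeout, and the page renders without it.
recommendations = call_with_timeout(slow_recommendations, 0.05, fallback=[])
# Critical for this page: a more generous timeout.
search_results = call_with_timeout(fast_search, 1.0, fallback=[])
pool.shutdown(wait=False)
```

Note that the abandoned call keeps running in its worker thread until it finishes. In a real client the timeout would also need to be enforced on the connection itself so that the resources are actually released.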

Learning from it

• Isolate the impact of poorly performing services / systems

• Isolate the required from the good-to-have

Website is “slow”! (Again)

RCA

• Why?
  – Load average of web servers has spiked
• Why?
  – Requests per second has spiked
    • From 1000 rps to 1500 rps
• Why?
  – Large number of notifications of product information updates

Fixing it

• Separate cluster for receiving product info update notifications from the cluster that serves users

• Admission control: Don’t let a system receive more requests than it can handle
  – Throttling

• Batch the notifications
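Admission control and batching can each be sketched briefly. The version below limits the number of in-flight requests, which is only one of several ways to throttle; `AdmissionController`, `max_in_flight` and `batch` are hypothetical names.

```python
import threading


class AdmissionController:
    """Rejects work once a fixed number of requests are already in flight."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_admit(self):
        """Return True and take a slot, or False if the system is full."""
        return self._slots.acquire(blocking=False)

    def release(self):
        """Give the slot back once the request has been handled."""
        self._slots.release()


def batch(items, size):
    """Group notifications so that one request carries many updates."""
    return [items[i:i + size] for i in range(0, len(items), size)]


controller = AdmissionController(max_in_flight=2)
admitted = [controller.try_admit() for _ in range(3)]  # third one is rejected
controller.release()                                   # one request finishes
admitted_after_release = controller.try_admit()

batches = batch(list(range(7)), size=3)
```

Rejecting early keeps the accepted requests fast, and batching reduces the number of requests the receiving cluster has to admit in the first place.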

Learning from it

• Isolate the systems serving internal requests from those serving production traffic

• Admission control to ensure that a system is isolated from the over-enthusiasm of a client

• Look at the granularity at which we’re working

TEENAGER
Increasing complexity

THANK YOU

Mistake?

• Sub-optimal decision
  – Not all information/scenarios considered
  – Insufficient information
  – Built for a different scenario
• Due to focus on “functional” aspects
• A mistake is a mistake
  – … in retrospect
