Scaling Etsy: What Went Wrong, What Went Right

May 14, 2015

Ross Snyder

Slides for the talk given at Surge 2011.
Transcript
Page 1: Scaling Etsy: What Went Wrong, What Went Right

Ross Snyder, @beamrider9
Sept. 30, 2011

Page 2: Etsy is the world’s handmade marketplace (vintage and supplies, too).

Page 3: Etsy was founded in mid-2005 and is constantly growing.

[Chart: Gross Merchandise Sales ($MM)]

Page 4: From humble beginnings...

June 2005: Four employees, one web*, one db, founder’s apartment

* until getting slashdotted by a link from Boing Boing in Aug. 2005

Page 5: ... to today’s handmade juggernaut.

Sept. 2011: 250+ employees, multiple offices, billions of pageviews

(NYC Mayor Mike Bloomberg visited Etsy in June 2011)

Page 6: How’d we get here?

Page 7: Answer: with some difficulty.

“There is no education like adversity.” - Benjamin Disraeli

Pages 8-12: A few disclaimers

Hindsight is 20/20

“History is written by the victors”

Etsy thrives today because of what its early employees accomplished

Your narrator wasn’t present for most of the events covered in this talk

Pages 13-20: Etsy Architecture: 2007

[Logos not captured: Operating System, Database, Webserver, Languages]

Most business logic in Postgres stored procedures

Front end / database interaction = stored procedure calls wrapped with PHP functions

Some database partitioning by feature, but still with a large central DB

Site uptime = not great

“How do we scale?”

“Let’s write some middleware!”

(runners up: “Let’s rewrite the site in Java!” and “Let’s rewrite the site in Python!”)

Page 21: Conway’s Law

“Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.”

- Melvin Conway, 1968

Pages 22-26: Etsy Engineering: 2007

Dev / DBA / Ops

Devs write code

DBAs write SQL

Ops deploys code & touches prod

SILOS

Page 27: Etsy’s big bet: “Sprouter” (the Stored Procedure Router)

Pages 28-35: Sprouter

Web (PHP) -> Sprouter (Python) -> DB (Postgres)

Runs on each webserver, listens on port 8010

Maps name/arguments to a Postgres stored procedure, calls it, returns results

Caches things

Supports sharding (in theory)

Devs write PHP, DBAs write SQL, meet somewhere in the middle

SILOS

The hope: easier to scale Sprouter than to scale the database itself

(scaling the db when everything’s in stored procedures = somewhere between hard and impossible)
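The behaviour the Sprouter slides describe (map a name and arguments to a stored procedure, call it, return the results, cache things) can be sketched roughly as below. This is an illustration only, not Sprouter's actual code: the class name `ProcRouter`, the procedure `get_listing`, the TTL value, and the in-memory table standing in for Postgres are all assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict, Tuple

# Hypothetical stand-in for the database: maps a stored-procedure name to a
# Python callable. A real router would send the call to Postgres instead.
ProcTable = Dict[str, Callable[..., Any]]


class ProcRouter:
    """Toy stored-procedure router with a small TTL cache (illustrative only)."""

    def __init__(self, procs: ProcTable, ttl_seconds: float = 30.0) -> None:
        self._procs = procs
        self._ttl = ttl_seconds
        # Cache key is (procedure name, arguments); arguments must be hashable.
        self._cache: Dict[Tuple[str, Tuple[Any, ...]], Tuple[float, Any]] = {}
        self.db_calls = 0  # counts calls that actually reached the "database"

    def call(self, name: str, *args: Any) -> Any:
        if name not in self._procs:
            raise KeyError(f"unknown stored procedure: {name}")
        key = (name, args)
        now = time.monotonic()
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # served from cache, database untouched
        result = self._procs[name](*args)
        self.db_calls += 1
        self._cache[key] = (now, result)
        return result


def get_listing(listing_id: int) -> dict:
    # Hypothetical procedure body; stands in for SQL owned by the DBAs.
    return {"listing_id": listing_id, "title": f"listing {listing_id}"}


router = ProcRouter({"get_listing": get_listing})
first = router.call("get_listing", 42)
second = router.call("get_listing", 42)  # identical call, answered from cache
```

Even in a toy this small, the caller knows only the procedure name, so every new query needs a matching change on the database side before front-end code can use it. That coupling is what the later slides list under developer friction and deploy synchronization.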

Pages 36-37: Sprouter: Timeline

Fall ’07: Idea first discussed
Spring ’08: Alpha version debuts
Fall ’08: Released in production
Spring ’09: Sprouter deprecated

Page 38: What happened?

Pages 39-42: Sprouter: “Good” Parts

Forcibly centralizes database access

Hides data store implementation from caller

Opens the door for “clever” automatic caching

Prevents developers from writing SQL (?)

Pages 44-48: Sprouter: Not-As-Good Parts

Creates substantial developer friction

Homegrown daemon + dependencies for Ops to maintain

Lack of community support / provability

Complex synchronization required to deploy (due to tight coupling with Postgres)

Database remains single point of failure (sharding features never fully formed)

Pages 49-53: Sprouter: Summary

Extra barriers to development
+ Deploys even more painful
+ Requires extra Ops/Dev resources
+ Negligible (negative?) effect on site reliability
= [image not captured]

Page 54: How did attitudes change so quickly?

Page 55: Sprouter: Timeline

Fall ’07: Idea first discussed
Spring ’08: Alpha version debuts
Fall ’08: Released in production
Spring ’09: Sprouter deprecated

Pages 56-64: The Great Etsy Culture Shift

Just as Sprouter went live, many of its strongest proponents departed Etsy

Taking with them...

Devotion to Postgres stored procedures / types

Fear of developers writing SQL

Fear of developers touching prod

Infrequent / large deploys to production

“Not developed here”

[Diagram: “Then” and “Now”, divided at Fall ’08]

Pages 65-68: DevOps

Silos = bad

Trust, cooperation, transparency, shared responsibility = good

“We’re all in this together”

Pages 69-73: The Way Forward: Part 1

Stabilize the site

Improve metrics & monitoring

StatsD: http://github.com/etsy/statsd

Upgrade database hardware vertically as far as possible

Give developers production access to help troubleshoot problems
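As a rough illustration of why StatsD makes it cheap to add metrics, the sketch below builds StatsD-style datagrams of the form `<bucket>:<value>|<type>` and sends them over UDP. The bucket names `web.logins` and `web.checkout_time` are hypothetical, and the host and port are assumptions (8125 is the usual StatsD default); the helper functions are written for this example and are not part of any Etsy library.

```python
import socket


def format_metric(bucket: str, value: int, metric_type: str) -> bytes:
    """Build one StatsD datagram: "<bucket>:<value>|<type>"."""
    return f"{bucket}:{value}|{metric_type}".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget over UDP: no reply is awaited, so a missing
    collector does not block the code being measured."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical bucket names, chosen only for illustration.
counter = format_metric("web.logins", 1, "c")            # increment a counter by 1
timer = format_metric("web.checkout_time", 320, "ms")    # record a 320 ms timing

if __name__ == "__main__":
    send_metric(counter)
    send_metric(timer)
```

Because each metric is a single short datagram, instrumenting a code path costs one line, which fits the talk's push to improve metrics and monitoring before attempting larger architectural changes.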

Pages 74-80: The Way Forward: Part 2

Continuous Deployment

Any engineer can deploy to prod (generally happens 25+ times per day)

Deployinator: http://github.com/etsy/deployinator

One button that deploys the site

Small changesets, deployed frequently

Requires solid tests, good communication

Distributed developer-driven QA

Pages 81-85: The Way Forward: Part 3

Circumvent Sprouter

Object-Relational Mapping (ORM)

aka “The Vietnam of Computer Science” (Google it)

Front-end PHP talks directly to database via ORM (also written in PHP)

ORM can cache where appropriate (as can front end)
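The idea of an ORM that talks to the database directly and caches where appropriate can be sketched as a small read-through cache in the data-access layer. The slides describe a PHP ORM; this example uses Python with an in-memory SQLite database purely as a stand-in, and the `Shop` and `ShopFinder` names, the table layout, and the invalidate-on-write policy are all assumptions made for illustration.

```python
import sqlite3
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Shop:
    shop_id: int
    name: str


class ShopFinder:
    """Hypothetical data-access class: maps rows to objects and caches reads."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._cache: Dict[int, Shop] = {}
        self.queries = 0  # number of SELECTs that actually hit the database

    def find(self, shop_id: int) -> Optional[Shop]:
        if shop_id in self._cache:
            return self._cache[shop_id]  # cache hit, no SQL issued
        self.queries += 1
        row = self._conn.execute(
            "SELECT shop_id, name FROM shops WHERE shop_id = ?", (shop_id,)
        ).fetchone()
        if row is None:
            return None
        shop = Shop(shop_id=row[0], name=row[1])
        self._cache[shop_id] = shop
        return shop

    def rename(self, shop_id: int, name: str) -> None:
        self._conn.execute(
            "UPDATE shops SET name = ? WHERE shop_id = ?", (name, shop_id)
        )
        self._cache.pop(shop_id, None)  # invalidate so the next read is fresh


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shops (shop_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO shops VALUES (1, 'knit-things')")

finder = ShopFinder(conn)
a = finder.find(1)                  # first read goes to the database
b = finder.find(1)                  # second read is served from the cache
finder.rename(1, "knit-and-purl")   # write invalidates the cached object
c = finder.find(1)                  # read after the write goes to the database again
```

Unlike the router approach, the SQL lives in the same codebase and language as the caller, so a developer can change a query and its caching behaviour in one deploy.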

Pages 86-92: The Way Forward: Part 4

Database Sharding

Etsy has a lot of DNA from Flickr, including their DB sharding scheme

Based on MySQL

Battle-tested, well-known

Scales horizontally to infinity (or close enough)

No single points of failure (master-master replication)

Gradually phase out Sprouter, phase in ORM / sharded data
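The slides do not spell out how requests are routed to shards, so the following is only a sketch under stated assumptions: each user is pinned to a shard through a lookup table, and each shard is a pair of hosts that replicate to each other (master-master), so one host can serve traffic when its partner is down. The host names, the `ShardDirectory` class, and the "least loaded shard" placement rule are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical shard map: each shard is a master-master pair of hosts.
SHARDS: Dict[int, Tuple[str, str]] = {
    1: ("shard1a.example", "shard1b.example"),
    2: ("shard2a.example", "shard2b.example"),
}


class ShardDirectory:
    """Pins each user to a shard via a lookup table (illustrative only)."""

    def __init__(self, shards: Dict[int, Tuple[str, str]]) -> None:
        self._shards = shards
        self._user_to_shard: Dict[int, int] = {}

    def assign(self, user_id: int) -> int:
        """Place a new user on the shard that currently holds the fewest users."""
        if user_id in self._user_to_shard:
            return self._user_to_shard[user_id]
        counts = {sid: 0 for sid in self._shards}
        for sid in self._user_to_shard.values():
            counts[sid] += 1
        # Ties are broken by the lowest shard id so placement is deterministic.
        shard_id = min(counts, key=lambda sid: (counts[sid], sid))
        self._user_to_shard[user_id] = shard_id
        return shard_id

    def hosts_for(self, user_id: int, down: Tuple[str, ...] = ()) -> List[str]:
        """Return the hosts for a user's shard, skipping any marked as down."""
        shard_id = self._user_to_shard[user_id]
        return [h for h in self._shards[shard_id] if h not in down]


directory = ShardDirectory(SHARDS)
placements = [directory.assign(uid) for uid in (101, 102, 103)]
survivors = directory.hosts_for(101, down=("shard1a.example",))
```

With a lookup table, adding capacity means adding an entry to the shard map; existing users stay where they are and new users land on the emptier shard, which is one way to read the "scales horizontally" claim on the slide.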

Pages 93-94: Sprouter: Timeline

Fall ’07: Idea first discussed
Spring ’08: Alpha version debuts
Fall ’08: Released in production
Spring ’09: Sprouter deprecated
Spring ’11: Sprouter turned off

Page 96: Lessons Learned

Page 97: Etsy Architecture: 2007

[Logos not captured: Operating System, Database, Webserver, Languages]

Page 98: Etsy Architecture: 2011

[Logos not captured: Operating System, Database, Webserver, Languages]

Page 99: Open & trusting > closed & afraid (DevOps DevOps DevOps)

Page 100: Front end/database interaction is too critical to take chances on novel/untested solutions

Page 101: Side corollary: If you’re doing something “clever”, you’re probably doing it wrong

Page 102: The architectural decisions you make today will have large impact long after you’re gone

Page 103: No architectural hole is so deep that proven scaling strategies don’t exist for digging out

Page 104: We are probably making decisions today that will be the subject of a similar talk in 2015

Acknowledgement

Page 105: Learn More

http://codeascraft.etsy.com/
@codeascraft

Page 106: Etsy is hiring!

http://www.etsy.com/careers
@etsy