Inside GitHub

Talk given at the Gilt Groupe Experts Talk in NYC, December 2009.
Transcript
Page 1: Inside GitHub

Hello. Hi everyone.

Page 2: Inside GitHub

My name is Chris Wanstrath. I go by @defunkt online.

Page 3: Inside GitHub

inside github

And today I’m going to talk about GitHub.

Page 4: Inside GitHub

inside github

That’s me.

Page 5: Inside GitHub

GitHub is what we like to call “social coding.”

Page 6: Inside GitHub

You can see what your friends are doing from your dashboard or news feed

Page 7: Inside GitHub

Everyone has a profile showing off their code and activity

Page 8: Inside GitHub

And you can do things like leave comments on commits.

Page 9: Inside GitHub

But it wasn’t always like this.

Page 10: Inside GitHub

Originally we just wanted to make a git hosting site.

In fact, that was the first tagline.

Page 11: Inside GitHub

git repository hosting

git repository hosting.

That’s what we wanted to do: give us and our friends a place to share git repositories.

Page 12: Inside GitHub

It’s not easy to set up a git repository. It never was.

But back in 2007 I really wanted to.

Page 13: Inside GitHub

I had seen Torvalds’ talk on YouTube about git.

But it wasn’t really about git - it was more about distributed version control.

It answered many of my questions and clarified DVCS ideas.

I still wasn’t sold on the whole idea, and I had no idea what it was good for.

Page 14: Inside GitHub

CVS is stupid

But when Torvalds says “CVS is stupid”

Page 15: Inside GitHub

and so are you

“and so are you,” the natural reaction for me is...

Page 16: Inside GitHub

To start learning git.

Page 17: Inside GitHub

At the time the biggest and best free hosting site was repo.or.cz.

Page 18: Inside GitHub

Right after I had seen the Torvalds video, the god project was posted up on repo.or.cz

I was interested in the project so I finally got a chance to try it out with some other people.

Page 19: Inside GitHub

Namely this guy, Tom Preston-Werner.

Seen here in his famous “I put ketchup on my ketchup” shirt.

Page 20: Inside GitHub

I managed to make a few contributions to god before realizing that repo.or.cz was not different.

git was not different.

Just more of the same - centralized, inflexible code hosting.

Page 21: Inside GitHub

This is what I always imagined.

No rules. Project belongs to you, not the site. Share, fork, change - do what you want.

Give people tools and get out of their way. Less ceremony.

Page 22: Inside GitHub

So, we set off to create our own site.

A git hub - learning, code hosting, etc.

Page 23: Inside GitHub

We started with the code browsing and commit viewing...

Page 24: Inside GitHub

But once we added the current version of the dashboard, we knew this was different.

Page 25: Inside GitHub

And eventually “git repository hosting” gave way to “social coding”

Page 26: Inside GitHub

What’s special about GitHub is that people use the site in spite of git.

Many git haters use the site because of what it is - more than a place to host git repositories, but a place to share code with others.

Page 27: Inside GitHub

a brief history

So that’s how it all started.

Now I want to (briefly) cover some milestones and events.

Page 28: Inside GitHub

2007 october

The first commit was on a Friday night in October, around 10pm.

Page 29: Inside GitHub

2008 january

We launched the beta in January at Steff’s on 2nd street in San Francisco’s SOMA district.

The first non-github user was wycats, and the first project was merb-core.

They wanted to use the site for their refactoring and 0.9 branch.

Page 30: Inside GitHub

2008 april

A few short months after that we launched to the public.

Page 31: Inside GitHub

2009 january

In January of this year, we were awarded the “Best Bootstrapped Startup” by TechCrunch.

Page 32: Inside GitHub

2009 april

Then in April we were featured as some of the best young tech entrepreneurs in BusinessWeek.

(Finally something to show mom)

Page 33: Inside GitHub

2009 june

Our Firewall Install, something we’d been talking about since practically day one, was launched in June of 2009.

Page 34: Inside GitHub

2009 september

And in September we moved to Rackspace, our current hosting provider.

(Which some of you may have noticed.)

Page 35: Inside GitHub

Along the way we managed to pick up Scott Chacon, our VP of R&D

Page 36: Inside GitHub

Tekkub, our level 80 support druid

Page 37: Inside GitHub

Melissa Severini, who keeps us all in check

Page 38: Inside GitHub

Kyle Neath, who makes the site pretty

Page 39: Inside GitHub

And Ryan Tomayko, who helps keep the site running smoothly.

Page 40: Inside GitHub

Oh yeah, and the other founders: PJ and Tom.

Page 41: Inside GitHub

github.com

That’s where we’re at today.

So let’s talk about the technical details of the website: github.com

Page 42: Inside GitHub

.com as opposed to FI, which I’m not going to get into today.

You’ll have to invite PJ out if you want to hear about that.

Page 43: Inside GitHub

the web app

As everyone knows, a web “site” is really a bunch of different components.

Some of them generate and deliver HTML to you, but most of them don’t.

Either way, let’s start with the HTMLy parts.

Page 44: Inside GitHub

rails

We use Ruby on Rails 2.2.2 as our web framework.

It’s kept up to date with all the security patches and includes custom patches we’ve added ourselves, as well as patches we’ve cherry-picked from more recent versions of Rails.

Page 45: Inside GitHub

We found out Rails was moving to GitHub in March 2008, after we had reached out to them and they had turned us down.

So it was a bit of a surprise.

Page 46: Inside GitHub

rails

But there are entire presentations on Rails, so I’m not going to get further into it here.

As for whether it scales or not, we’ll let you know when we find out. Because so far it hasn’t come close to presenting a problem.

Page 47: Inside GitHub

rack

One of the big features in Rails 2.3 is Rack support.

Page 48: Inside GitHub

We badly wanted this, but didn’t want to invest the time upgrading.

So using a few open source libraries we’ve wrapped our Rails 2.2.2 instance in Rack.

Page 49: Inside GitHub

Now we can use awesome Rack middleware like Rack::Bug in GitHub
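
A minimal config.ru sketch of that idea (not GitHub’s actual configuration); Rails22Adapter here is a stand-in for whichever open source shim exposes the pre-Rack Rails 2.2 dispatcher as a Rack app:

```ruby
# config.ru -- illustrative only
require ::File.expand_path('../config/environment', __FILE__)

use Rack::ShowExceptions
use Rack::Bug                 # the Rack::Bug middleware mentioned above

run Rails22Adapter.new        # hypothetical shim around the Rails 2.2 dispatcher
```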

Page 50: Inside GitHub

In fact, the Coderack competition is about to open voting to the public this week.

Coders created and submitted dozens of Rack middleware for the competition.

I was a judge, so I got to see the submissions already. Some of my favorites were:

Page 51: Inside GitHub

nerdEd / rack-validate

Page 52: Inside GitHub

webficient / rack-tidy

Page 53: Inside GitHub

talison / rack-mobile-detect

sets the X_MOBILE_DEVICE header to the mobile device, if recognized

Page 54: Inside GitHub

unicorn

We use unicorn as our application server

- master / worker
- 16 workers
- preforking

Page 55: Inside GitHub

unicorn

- instant restart after kill
- hard 30s request timeouts
- control ram growth

Page 56: Inside GitHub

unicorn

- 0 downtime deploys
- protects against bad rails startup
- migrations handled old fashioned way
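
A minimal unicorn.rb sketch covering the behavior above; apart from the worker count, the values and paths are illustrative, not GitHub’s production config:

```ruby
# config/unicorn.rb -- a sketch, not production config
worker_processes 16                   # preforking master / worker, 16 workers
timeout 30                            # hard 30s request timeouts
preload_app true                      # app boots once in the master; workers fork instantly
listen "/tmp/github.sock", :backlog => 64

after_fork do |server, worker|
  # per-worker connections get reopened here; an external watcher can kill any
  # worker whose RAM grows too large, and the master replaces it instantly
end
```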

Page 57: Inside GitHub

nginx

For serving static content and slow clients, we use nginx

nginx is pretty much the greatest http server ever

it’s simple, fast, and has a great module system

Page 58: Inside GitHub

nginx: Limit Zone

Limit simultaneous connections from a client

Page 59: Inside GitHub

nginx: Limit Requests

Limit frequency of connections from a client

Anti-DDOS

Page 60: Inside GitHub

nginx

I see many people using Rack to do what the Limit modules do.

Don’t.

Page 61: Inside GitHub

nginx: memcached

memcached support

can serve directly from memcached

Page 62: Inside GitHub

nginx: Push Module

comet!

Page 63: Inside GitHub

git

The next major part of GitHub is git

Page 64: Inside GitHub

grit

We wrote an open source library called Grit, which lets us use git from Ruby

Page 65: Inside GitHub

mojombo / grit

you can get it here

it originally shelled out to git and just parsed the responses.

which worked well for a long time.
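
For flavor, a couple of lines of the Grit API from that era (the repository path is invented):

```ruby
require 'grit'

repo = Grit::Repo.new("/data/repositories/defunkt/resque.git")   # path is made up
repo.commits('master', 5).each do |commit|
  puts "#{commit.id[0, 7]}  #{commit.message.split("\n").first}"
end
```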

Page 66: Inside GitHub

grit: File.read()

Eventually we realized, however, that File.read() can be 100 times faster

Page 67: Inside GitHub

grit: system()

Than shelling out

Page 68: Inside GitHub

One of the first things Scott worked on was rewriting the core parts of Grit to be pure Ruby

Basically a Ruby implementation of Git

Page 69: Inside GitHub

mojombo / grit

And that’s what we run now

Page 70: Inside GitHub

smoke

Kinda.

Eventually we needed to move our git repositories off of our web servers

Today our HTTP servers are distinct from our git servers. The two communicate using smoke

Page 71: Inside GitHub

smoke

“Grit in the cloud”

Instead of reading and writing from the disk, Grit makes Smoke calls

The reading and writing then happens on our file servers

Page 72: Inside GitHub

bert-rpc

Rather than use Protocol Buffers or Thrift or JSON-RPC, Smoke uses BERT-RPC

Page 73: Inside GitHub

bert-rpc

bert : erlang :: json : javascript

BERT is an erlang-based protocol

BERT-RPC is really great at dealing with large binaries, which is a lot of what we do

Page 74: Inside GitHub

bert-rpc

we have four file servers, each running bert-rpc servers

our front ends and job queue make RPC calls to the backend servers
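
With the bertrpc gem, a call looks roughly like the sketch below; the host, port, and the repo.rev_parse interface are invented for illustration and aren’t Smoke’s real API:

```ruby
require 'bertrpc'

svc = BERTRPC::Service.new('fs1.internal', 8000)             # one of the file servers (made up)
sha = svc.call.repo.rev_parse('defunkt/resque.git', 'HEAD')   # illustrative module/function names
```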

Page 75: Inside GitHub

mojombo / bertrpc

You can grab bert-rpc on GitHub

Page 76: Inside GitHub

mojombo / bert

Or if you just want to play with BERT

Page 77: Inside GitHub

chimney

We have a proprietary library called chimney

It routes the smoke. I know, don’t blame me.

Page 78: Inside GitHub

chimney

All user routes are kept in Redis

Chimney is how our BERT-RPC clients know which server to hit

It falls back to a local cache and auto-detection if Redis is down
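
Conceptually the lookup is something like the sketch below; the key scheme, host, and fallback helper are invented, not Chimney’s real internals:

```ruby
require 'redis'

def route_for(user)
  redis = Redis.new(:host => 'redis.internal')   # made-up host
  redis.get("chimney:route:#{user}")             # e.g. "fs1.internal" -- made-up key scheme
rescue Errno::ECONNREFUSED
  cached_or_detected_route(user)                 # hypothetical local-cache / auto-detection fallback
end
```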

Page 79: Inside GitHub

chimney

It can also be told a backend is down.

We optimized it for the connection-refused case, but in reality that wasn’t the real problem.

Page 80: Inside GitHub

proxymachine

All anonymous git clones hit the front end machines

the git-daemon connects to proxymachine, which uses chimney to proxy your connection between the front end machine and the back end machine (which holds the actual git repository)

very fast, transparent to you
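
A proxymachine routing block looks roughly like this; the regex is simplified and the Chimney call is pseudocode:

```ruby
proxy do |data|
  if data =~ /git-upload-pack (\S+?)\x00/    # wait for the git protocol request line
    { :remote => Chimney.route_for($1) }     # e.g. "fs1.internal:9418" -- pseudocode lookup
  else
    { :noop => true }                        # not enough data yet; keep reading
  end
end
```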

Page 81: Inside GitHub

mojombo / proxymachine

proxymachine can be used to proxy any kind of tcp connection

open source

Page 82: Inside GitHub

ssh

Sometimes you need to access a repository over ssh

In those instances, you ssh to an fe and we tunnel your connection to the appropriate backend

To figure that out we use chimney

Page 83: Inside GitHub

jobs

We do a lot of work in the background at GitHub

Page 84: Inside GitHub

resque

Currently we use a system called Resque.

Page 85: Inside GitHub

defunkt / resque

You can grab it on GitHub

Page 86: Inside GitHub

resque

- dealing with pushes
- web hooks
- creating events in the database
- generating GitHub Pages
- clearing & warming caches
- search indexing
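
A hypothetical Resque job of the kind listed above; the class name, queue choice, and payload are made up:

```ruby
require 'resque'
require 'net/http'

class DeliverWebHook
  @queue = :high

  def self.perform(url, json_payload)
    Net::HTTP.post_form(URI.parse(url), 'payload' => json_payload)
  end
end

# enqueued from the app; a worker started with QUEUE=critical,high,low will pick it up
Resque.enqueue(DeliverWebHook, 'http://example.com/hook', '{"ref":"refs/heads/master"}')
```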

Page 87: Inside GitHub

queues

In Resque, a queue is used as both a priority and a localization technique

By localization I mean, “where your workers live”

Page 88: Inside GitHub

queues: critical, high, low

these three run on our front end servers

Resque processes them in this order

Page 89: Inside GitHub

queues: page

GitHub Pages are generated on their own machine using the `page` queue

Page 90: Inside GitHub

queues: archive

And tarball and zip downloads are created on the fly using the `archive` queue on our archiving machines

Page 91: Inside GitHub

search

On GitHub, you can search code, repositories, and people

Page 92: Inside GitHub

solr

Solr is basically an HTTP interface on top of Lucene. This makes it pretty simple to use in your code.

We use solr because of its ability to incrementally add documents to an index.
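
That incremental add is just an HTTP POST to Solr’s update handler, roughly like this (the field names and document are invented):

```ruby
require 'net/http'
require 'uri'

doc = '<add><doc>' \
      '<field name="id">code:defunkt/resque:README.markdown</field>' \
      '<field name="text">Resque is a Redis-backed queueing library...</field>' \
      '</doc></add>'

uri = URI.parse('http://localhost:8983/solr/update?commit=true')
Net::HTTP.start(uri.host, uri.port) do |http|
  http.post(uri.request_uri, doc, 'Content-Type' => 'text/xml')
end
```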

Page 93: Inside GitHub

Here I am searching for my name in source code

Page 94: Inside GitHub

solr

We’ve had some problems making it stable, but luckily the guys at Pivotal have given us some tips

Like bumping the Java heap size.

Whatever that means

Page 95: Inside GitHub

database

Our database story is pretty uninteresting

Page 96: Inside GitHub

mysql

We use mysql 5

Page 97: Inside GitHub

master / slave

All reads and writes go to the master

We use the slave for backups and failover

Page 98: Inside GitHub

caching

On the site we do a ton of caching using memcached

Page 99: Inside GitHub

fragments

We cache chunks of HTML all over

Usually they are invalidated by some action

Page 100: Inside GitHub

fragments

Formerly we invalidated most of our fragments using a generation scheme, where you put a number into a bunch of related keys and increment it when you want all those caches to be missed (thus creating new cache entries with fresh data)
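
In miniature, that generation scheme looks something like this (the key names are invented):

```ruby
def repo_fragment_key(repo, piece)
  gen = Rails.cache.fetch("repo:#{repo.id}:gen") { 1 }   # shared generation number
  "repo:#{repo.id}:gen#{gen}:#{piece}"                   # every related fragment key embeds it
end

def expire_repo_fragments(repo)
  gen = Rails.cache.fetch("repo:#{repo.id}:gen") { 1 }
  Rails.cache.write("repo:#{repo.id}:gen", gen + 1)      # bump it; the old fragments are never hit again
end
```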

Page 101: Inside GitHub

fragments

But we had high cache eviction due to low ram and hardware constraints, and found that scheme did more harm than good.

We also noticed some cached data we wanted to remain forever was being evicted due to the slabs with generational keys filling up fast

Page 102: Inside GitHub

page

We cache entire pages using nginx’s memcached module

Lots of HTML, but also other data which gets hit a lot and changes rarely:

Page 103: Inside GitHub

page

- network graph json
- participation graph data

Always looking to stick more into page caches

Page 104: Inside GitHub

object

We do basic object caching of ActiveRecord objects such as repositories and users all over the place

Caches are invalidated whenever the objects are saved

Page 105: Inside GitHub

associations

We also cache associations as arrays of IDs

Grab the array, then do a get_multi on its contents to get a list of objects

That way we don’t have to worry about caching stale objects
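
A sketch of that ID-list pattern; the method and key names are invented for illustration:

```ruby
def watched_repository_ids(user)
  Rails.cache.fetch("user:#{user.id}:watched_ids") { user.watched_repository_ids }
end

def watched_repositories(user)
  ids   = watched_repository_ids(user)
  keys  = ids.map { |id| "repository:#{id}" }
  found = Rails.cache.read_multi(*keys)                  # one round trip to memcached
  ids.map { |id| found["repository:#{id}"] || Repository.find(id) }
end
```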

Page 106: Inside GitHub

walker

We also have a proprietary caching library called Walker

Page 107: Inside GitHub

walker

It originally walked trees and cached them when someone pushed

But now it caches everything related to git:

Page 108: Inside GitHub

walker

- commits
- diffs
- commit listing
- branches
- tags
- everything

Page 109: Inside GitHub

Every git-related page load hits Walker a lot

Page 110: Inside GitHub

walker

For most big apps, you need to write a caching layer that knows your business domain

Generic, catch-all caching libraries probably won’t do

Page 111: Inside GitHub

events

An example of this is our events system

Page 112: Inside GitHub
Page 113: Inside GitHub

This is one fragment

Page 114: Inside GitHub

Each of these is a fragment

Page 115: Inside GitHub

They’re also cached as objects

Page 116: Inside GitHub

As well as a list of ids

Page 117: Inside GitHub

And that’s just for the dashboard...

Page 118: Inside GitHub

optimizations

So what other optimizations have we done

Page 119: Inside GitHub

asset servers

Well we do the common trick of serving assets from multiple subdomains

Page 120: Inside GitHub

asset servers: assets0.github.com, assets1.github.com

and so forth

Page 121: Inside GitHub

sha asset id

Instead of using timestamps for asset ids, which may end up hitting the disk multiple times on each request, we set the asset id to be the sha of the last commit which modified a javascript or css file

Page 122: Inside GitHub

sha asset id

/css/bundle.css?197d742e9fdec3f7

/js/bundle.js?197d742e9fdec3f7

Now simple code changes won’t force everyone to re-download the css or js bundles
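
Both tricks fit in a few lines of Rails 2.x configuration; the sketch below is illustrative (the subdomain count, paths, and git command are assumptions, not GitHub’s exact setup):

```ruby
# config/environments/production.rb -- illustrative
config.action_controller.asset_host = Proc.new do |source|
  "http://assets#{source.hash % 4}.github.com"     # spread requests across assets0..assets3
end

# sha of the last commit touching js/css becomes the asset id (computed once at deploy)
ENV['RAILS_ASSET_ID'] =
  `git log -1 --pretty=format:%H -- public/javascripts public/stylesheets`.strip
```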

Page 123: Inside GitHub

bundling

For bundling itself, we use

Page 124: Inside GitHub

bundling

yui’s compressor for css and

Page 125: Inside GitHub

bundling

google’s closure compiler for javascript

we don’t use the most aggressive setting because it means changing your javascript to appease the compression gods, which we haven’t committed to yet

Page 126: Inside GitHub

scripty 301

Again, for most of these tricks you need to really pay attention to your app.

One example is scriptaculous’ wiki

Page 127: Inside GitHub

scripty 301

When we changed our wiki URL structure, we set up dynamic 301 redirects for the old urls.

Scriptaculous’ old wiki was getting hit so much we put the redirect into nginx itself - this took strain off our web app and made the redirects happen almost instantly

Page 128: Inside GitHub

ajax loading

We also load data in via ajax in many places.

Sometimes a piece of information will just take too long to retrieve

In those instances, we usually load it in with ajax

Page 129: Inside GitHub
Page 130: Inside GitHub
Page 131: Inside GitHub

If Walker sees that it doesn’t have all the information it needs, it kicks off a job to stick that information in memcached.

Page 132: Inside GitHub

We then periodically hit a URL which checks if the information is in memcached or not. If it is, we get it and rewrite the page with the new information.
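
The polling endpoint can be as simple as the sketch below (the action name, cache key, and params are made up):

```ruby
def network_meta
  if data = Rails.cache.read("network:#{params[:user]}/#{params[:repo]}")
    render :json => data    # the background job has finished; rewrite the page with this
  else
    head :accepted          # 202 -- not ready yet, the client's javascript polls again shortly
  end
end
```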

Page 133: Inside GitHub

We use this same trick on the Network Graph

Page 134: Inside GitHub

Fork Queue

Page 135: Inside GitHub

ajax loading

and anywhere else it makes sense.

Page 136: Inside GitHub

comet loading

very soon this will all be comet, though

Page 137: Inside GitHub

monitoring

what do we use for monitoring?

Page 138: Inside GitHub

nagios

Our support team monitors the health of our machines and core services using nagios.

I don’t really touch the thing.

Page 139: Inside GitHub

Here’s a screenshot from my IE browser, complete with the ICQ plugin

Page 140: Inside GitHub

resque web

We monitor our queue using Resque’s included Sinatra app

Page 141: Inside GitHub
Page 142: Inside GitHub

haystack

We use an in-house app called Haystack to monitor arbitrary information, tracked as JSON.

Page 143: Inside GitHub

Here’s an example of Haystack’s “exceptions” view

Page 144: Inside GitHub

collectd

We also use collectd to monitor load, RAM usage, CPU usage, and other app-related metrics

Page 145: Inside GitHub

pingdom

pingdom sends us SMSes when the site is down

it’s nice

Page 146: Inside GitHub

tender

tender is what we use for customer support

Page 147: Inside GitHub

it works incredibly well, and they’re constantly improving it

Page 148: Inside GitHub

testing

Our testing setup is pretty standard

Page 149: Inside GitHub

test unit

We mostly use Ruby’s test/unit.

We’ve experimented with other libraries including test/spec, shoulda, and RSpec, but in the end we keep coming back to test/unit

Page 150: Inside GitHub

git fixtures

As many of our fixtures are git repositories, we specify in the test what sha we expect to be the HEAD of that fixture.

This means we can completely delete a git repository in one test, then have it back in pristine state in another. We plan to move all our fixtures to a similar git-based system in the future.
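
One way the sha-pinned fixture idea can look in test/unit (the paths and sha are invented):

```ruby
require 'test/unit'
require 'fileutils'

class GitFixtureTest < Test::Unit::TestCase
  FIXTURE  = "test/fixtures/simple.git"                    # working copy used by tests
  PRISTINE = "test/fixtures/pristine/simple.git"           # untouched copy kept alongside
  HEAD_SHA = "197d742e9fdec3f7aaaaaaaaaaaaaaaaaaaaaaaa"    # sha we expect at HEAD

  def setup
    head = File.directory?(FIXTURE) ? `git --git-dir=#{FIXTURE} rev-parse HEAD`.strip : nil
    unless head == HEAD_SHA                                # a previous test deleted or rewound it
      FileUtils.rm_rf(FIXTURE)
      FileUtils.cp_r(PRISTINE, FIXTURE)
    end
  end

  def test_head_is_pristine
    assert_equal HEAD_SHA, `git --git-dir=#{FIXTURE} rev-parse HEAD`.strip
  end
end
```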

Page 151: Inside GitHub

ci joe

We use ci joe, a continuous integration server, to run tests after each push.

He then notifies us if the tests fail.

Page 152: Inside GitHub
Page 153: Inside GitHub
Page 154: Inside GitHub

defunkt / cijoe

You can grab him at github

Page 155: Inside GitHub

staging

We also always deploy the current branch to staging

This means you can be working on your branch, someone else can be working on theirs, and you don’t need to worry about reconciling the two to test out a feature

One of the best parts of Git

Page 156: Inside GitHub

security

Page 157: Inside GitHub

github.com/security

having a security page really helps

Page 158: Inside GitHub

[email protected]

we get weekly emails to our security email (that people find on the security page)

and people are always grateful when we can reassure them or answer their question

Page 159: Inside GitHub

consultant

if you can, find a security consultant to poke your site for XSS vulnerabilities

having your target audience be developers helps, too

Page 160: Inside GitHub

backups

backups are incredibly important

don’t just make backups: ensure you can restore them, as well

Page 161: Inside GitHub

sql

we keep nightly, off-site backups of our sql databases

Page 162: Inside GitHub

git

and the same for all our git repositories

Page 163: Inside GitHub

Thanks.

thanks for coming