
High Performance P2P Web Caching


Jared Friedman

This was a presentation I did for Harvard's CS264. I think Erik and I had some really interesting ideas here. I don't have time to pursue them now, but I posted this in the hopes that it might inspire someone else.
Transcript
Page 1: High Performance P2P Web Caching

High Performance P2P Web Caching

Erik Garrison
Jared Friedman

CS264 Presentation
May 2, 2006

Page 2: High Performance P2P Web Caching

SETI@Home

● Basic Idea: people donate computer time to look for aliens

● Delivered more than 9 million CPU-years
● Guinness Book of World Records – largest computation ever
● Many other successful projects (BOINC, Google Compute)
● The point: many people are willing to donate computer resources for a good cause

Page 3: High Performance P2P Web Caching

Wikipedia

● About 200 servers required to keep the site live

● Hosting & hardware costs over $1M per year
● All revenue from donations
● Hard to make ends meet
● Other not-for-profit websites are in a similar situation

Page 4: High Performance P2P Web Caching

HelpWikipedia@Home

● What if people could donate idle computer resources to help host not-for-profit websites?

● They probably would!
● This is the goal of our project

Page 5: High Performance P2P Web Caching

Prior Work

● This doesn't exist
● But some things are similar

Content Distribution Networks (Akamai)
● Distributed web hosting for big companies

CoralCDN / CoDeeN
● P2P web caching, like our idea
● But a very different design
● Both have some problems

Page 6: High Performance P2P Web Caching

Akamai, the opportunity

● Internet traffic is 'bursty'
● Expensive to build infrastructure to handle flash crowds
● International audience, local servers
  Sites run slowly in other countries

Page 7: High Performance P2P Web Caching

Akamai, how it works

● Akamai put >10,000 servers around the globe

● Companies subscribe as Akamai clients
● Client content (mostly images and other media) is cached on Akamai's servers
● Tricks with DNS make viewers download content from nearby Akamai servers
● Result: the website runs fast everywhere, no worries about flash crowds
● But VERY expensive!

Page 8: High Performance P2P Web Caching

CoralCDN

● P2P web caching
● Probably the closest system to our goal
● Currently in late-stage testing on PlanetLab
● Uses an overlay and a 'distributed sloppy hash table'
● Very easy to use – just append '.nyud.net' to a URL and Coral handles it (example below)
● Unfortunately ...
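For example, a Coralized version of http://en.wikipedia.org/wiki/Cache would look something like http://en.wikipedia.org.nyud.net:8090/wiki/Cache (the exact port suffix varied across Coral deployments of that era); Coral's DNS resolves the .nyud.net name to a nearby Coral proxy, which fetches and caches the page on the site's behalf.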

Page 9: High Performance P2P Web Caching

Coral: Problems

● Currently very slow
  This might improve in later versions
  Or it might be due to the overlay structure

● Security: volunteer nodes can respond with fake data

● Any site can use Coral to help reduce load
  Just append .nyud.net to their internal links
● Decentralization makes optimization hard
  More on this later

Page 10: High Performance P2P Web Caching

Our Design Goals

● Fast: Akamai-level performance
● Secure: pages served are always genuine
● Fast updates possible
● Must greatly reduce demands on the main site
  But this cannot compromise the first 3

Page 11: High Performance P2P Web Caching

Our Design

● Node/supernode structure
  Takes advantage of extremely heterogeneous performance characteristics
● Custom DNS server redirects incoming requests to a nearby supernode
● Supernode forwards the request to a nearby ordinary node
● Node replies to the user

Page 12: High Performance P2P Web Caching

Our Design

User goes to wikipedia.org
DNS server resolves wikipedia.org to a supernode
Supernode forwards the request to an ordinary node that has the requested document
Node retrieves the document and sends it to the user (a code sketch of the supernode's routing step follows)
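The deck only shows this flow as a diagram; the TypeScript sketch below illustrates one way a supernode could pick a volunteer node for an incoming request: prefer nodes that already hold the document, sit in the user's region, and are not overloaded. All names, fields, and thresholds here are assumptions for illustration, not the authors' implementation.

    // Hypothetical supernode routing step, matching the flow above.
    interface VolunteerNode {
      id: string;
      region: string;          // coarse geographic region
      load: number;            // current requests per second
      capacity: number;        // sustainable requests per second
      docs: Set<string>;       // documents cached on this node
    }

    function pickNode(
      nodes: VolunteerNode[],
      docId: string,
      userRegion: string
    ): VolunteerNode | null {
      // Only consider nodes that hold the document and have spare capacity.
      const candidates = nodes.filter(
        (n) => n.docs.has(docId) && n.load / n.capacity < 0.9
      );
      if (candidates.length === 0) return null;  // fall back to the origin server

      // Prefer nodes in the user's region, then the least-loaded node.
      candidates.sort((a, b) => {
        const regionDiff =
          Number(b.region === userRegion) - Number(a.region === userRegion);
        return regionDiff !== 0
          ? regionDiff
          : a.load / a.capacity - b.load / b.capacity;
      });
      return candidates[0];
    }

In the full design the supernode would also try to send all objects for one page (HTML plus images) to the node it picks, as the Performance slide notes.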

Page 13: High Performance P2P Web Caching

Performance

● Requests are answered in only 2 hops
● DNS server resolves to a geographically close supernode
● Supernode avoids sending requests to slow or overloaded nodes
● All parts of a page (e.g., HTML and images) should be served by a single node

Page 14: High Performance P2P Web Caching

Security

● Have to check nodes' accuracy
● First line of defense: encrypt local content
● May delay attacks, but won't stop them

Page 15: High Performance P2P Web Caching

Security

● More serious defense: let users check the volunteer nodes!

● Add a JavaScript wrapper to the website that requests the pages using AJAX
● With some probability, the AJAX script will compute the MD5 of the page it got and send it to a trusted central node
● Central node kicks out nodes that frequently return pages with invalid MD5 sums
● Offload processing not just to nodes, but to users, with zero-install (a client-side sketch follows)
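As a concrete illustration of the probabilistic check described above, here is a minimal client-side sketch in TypeScript. The endpoint name and sampling probability are made up, and the browser's SubtleCrypto API has no MD5, so SHA-256 stands in for the MD5 hash mentioned on the slide.

    // Hypothetical client-side integrity check (illustrative only).
    const AUDIT_PROBABILITY = 0.05;   // audit roughly 5% of page loads
    const AUDIT_ENDPOINT = "https://central.example.org/report";  // made-up URL

    async function fetchWithAudit(url: string): Promise<string> {
      const response = await fetch(url);
      const body = await response.text();

      if (Math.random() < AUDIT_PROBABILITY) {
        // Hash the bytes actually received from the volunteer node.
        const digestBuf = await crypto.subtle.digest(
          "SHA-256",
          new TextEncoder().encode(body)
        );
        const digestHex = Array.from(new Uint8Array(digestBuf))
          .map((b) => b.toString(16).padStart(2, "0"))
          .join("");

        // Report (url, digest) to the trusted central node, which compares it
        // against the known-good hash and blacklists nodes that fail too often.
        navigator.sendBeacon(AUDIT_ENDPOINT, JSON.stringify({ url, digestHex }));
      }
      return body;
    }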

Page 16: High Performance P2P Web Caching

A Tricky Part

● Supernodes get requests and have to decide which node should answer which request
● Have to load-balance nodes – no overloading
● Popular documents should be replicated across many nodes
● But don't want to replicate unpopular documents much – conserve storage space
● Lots of conflicting goals!

Page 17: High Performance P2P Web Caching

On the plus side...

● Unlike Coral & CoDeeN, supernodes know a lot of nodes (maybe 100-1000?)

● They can track performance characteristics of each node

● Make object placement decisions from a central point

● Lots of opportunity to make really intelligent decisions
  Better use of resources
  Higher total system capacity
  Faster response times

Page 18: High Performance P2P Web Caching

Object Placement Problem

● This kind of problem is known as an object placement problem
  “Which nodes do we put which files on?”
● Also related to the request routing problem
  “Given the files currently on the nodes, which node do we send this particular request to?”
● These problems are basically unsolved for our scenario
● Analytical solutions exist only for very simplified, somewhat different cases
● We suspect a useful analytic solution is impossible here (a heuristic sketch follows)
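The later slides mention a heuristic object placement algorithm without spelling it out. One plausible greedy rule, sketched below in TypeScript (reusing the VolunteerNode type from the routing sketch; the replication threshold is an assumption), replicates a document onto another node only when demand outgrows its current replicas, so hot documents spread while unpopular ones stay cheap to store.

    // Hypothetical placement heuristic (illustrative, not the authors' algorithm).
    const REQUESTS_PER_REPLICA = 50;   // assumed demand one replica can absorb

    function maybeReplicate(
      nodes: VolunteerNode[],
      docId: string,
      recentRequests: number           // requests for docId in the last interval
    ): void {
      const holders = nodes.filter((n) => n.docs.has(docId));
      if (holders.length === 0) return;

      // Popular document: demand exceeds what current replicas can absorb.
      if (recentRequests / holders.length > REQUESTS_PER_REPLICA) {
        // Copy it onto the least-loaded node that does not yet hold it.
        const target = nodes
          .filter((n) => !n.docs.has(docId))
          .sort((a, b) => a.load / a.capacity - b.load / b.capacity)[0];
        if (target) target.docs.add(docId);
      }
      // Unpopular documents keep a single replica to conserve storage space.
    }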

Page 19: High Performance P2P Web Caching

Simulation

● Too hard to solve analytically, so do a simulation

● Goal is to explore different object placement algorithms under realistic scenarios

● Also want to model the performance of the whole system
  What cache hit ratios can we get?
  How does the number/quality of peers affect cache hit ratios?
  How is user latency affected?
● Built a pretty involved simulation in Erlang (a simplified sketch of the core loop follows)
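The authors' simulator was written in Erlang and is not included in the deck. Purely as an illustration of its general shape, the TypeScript toy below drives the routing and placement sketches from the earlier slides with a crude Zipf-like request stream and reports the resulting cache hit ratio; every parameter is invented.

    // Toy driver tying the routing and placement sketches together.
    function simulate(numNodes: number, numDocs: number, numRequests: number): number {
      const regions = ["us", "eu", "asia"];
      const nodes: VolunteerNode[] = Array.from({ length: numNodes }, (_, i) => ({
        id: `node-${i}`,
        region: regions[i % regions.length],
        load: 0,
        capacity: 100,
        docs: new Set<string>(),
      }));

      const recent = new Map<string, number>();   // per-interval request counts
      let hits = 0;

      for (let r = 0; r < numRequests; r++) {
        if (r % 1000 === 0) {                      // start a new measurement interval
          nodes.forEach((n) => (n.load = 0));
          recent.clear();
        }
        // Crude Zipf-like popularity: low document indices dominate the stream.
        const docId = `doc-${Math.floor(numDocs * Math.pow(Math.random(), 3))}`;
        const userRegion = regions[Math.floor(Math.random() * regions.length)];
        recent.set(docId, (recent.get(docId) ?? 0) + 1);

        const node = pickNode(nodes, docId, userRegion);
        if (node) {
          node.load += 1;
          hits += 1;
        } else {
          // Cache miss: the origin server answers and seeds one volunteer node.
          nodes[Math.floor(Math.random() * numNodes)].docs.add(docId);
        }
        maybeReplicate(nodes, docId, recent.get(docId) ?? 0);
      }
      return hits / numRequests;                   // overall cache hit ratio
    }

    console.log(`hit ratio: ${simulate(50, 10000, 100000).toFixed(3)}`);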

Page 20: High Performance P2P Web Caching

Simulation Results

● So far, encouraging!
● Main results use a heuristic object placement algorithm
● Can load-balance without creating hotspots up to about 90% of theoretical capacity
● Documents are rarely requested more than once from the central server
● Close to the theoretical optimum

Page 21: High Performance P2P Web Caching

Next Steps

● Add more detail to the simulation
  Node churn
  Better internet topology
● Explore update strategies
● Obviously, an actual implementation would be nice, but not likely to happen this week
● What do you think?