
arXiv:1602.03273v3 [cs.DC] 25 May 2016

YTrace: End-to-end Performance Diagnosis in Large Cloud and Content Providers

Partha Kanuparthy‡, Yuchen Dai†, Sudhir Pathak†, Sambit Samal†, Theophilus Benson§, Mojgan Ghasemi⊥, P. P. S. Narayan†

‡ Yahoo Research † Yahoo Inc. § Duke University ⊥ Princeton University

ABSTRACT

Content providers build serving stacks to deliver content to users. An important goal of a content provider is to ensure good user experience, since user experience has an impact on revenue. In this paper, we describe a system at Yahoo called YTrace that diagnoses bad user experience in near real time. We present the different components of YTrace for end-to-end multi-layer diagnosis (instrumentation, methods and backend system), and the system architecture for delivering diagnosis in near real time across all user sessions at Yahoo. YTrace diagnoses problems across service and network layers in the end-to-end path spanning user host, Internet, CDN and the datacenters, and has three diagnosis goals: detection, localization and root cause analysis (including cascading problems) of performance problems in user sessions with the cloud. The key component of the methods in YTrace is capturing and discovering causality, which we design based on a mix of instrumentation API, domain knowledge and blackbox methods. We show three case studies from production that span a large-scale distributed storage system, a datacenter-wide network, and an end-to-end video serving stack at Yahoo. We end by listing a number of open directions for performance diagnosis in cloud and content providers.

1. INTRODUCTION

Large content providers such as Yahoo, Google, Netflix and Facebook serve users from large-scale serving stacks in geographically distributed datacenters on the Internet. They can be modeled as cloud infrastructure that consists of multiple datacenters and a Content Distribution Network (CDN) (Figure 1). Users interact with the content provider by making RPCs (also called user sessions) to the CDN and the datacenters. The user experience of a user session with the provider depends on several factors from the serving stack, to the datacenter network and the Internet, to the content. Bad user experiences result in loss of users and revenue [1].

Content providers build for good user experience by building high-performance serving stacks and network infrastructure. Serving stacks are compositions of services, and services are usually large distributed systems comprising hundreds to thousands of hosts, running on top of the datacenter network and inter-datacenter wide area paths. Serving stacks include latency-tolerant distributed execution techniques such as parallelism and redundancy [11]. For example, a user request for a personalized web page could be served by “assembling” parts of the page, each generated by a service.¹ In order to do this, services (specifically, hosts in a service) make RPCs to each other over the underlying network paths.

Figure 1: Model of a large content provider showing the end-to-end path for a user session. Lower figure shows canonical execution graph to determine instrumentation; “S” and “NW” represent services and network respectively.

Due to the composition scale and heterogeneity of a serving stack, it is prone to performance problems that span multiple layers (from the infrastructure layer such as network and servers, to the higher layers such as the OS, containers and service processes within a server, to the distributed systems layer) and that are localized among nodes in the end-to-end path (Figure 1). Detecting and troubleshooting bad user experience is a complex and tedious problem at scale, since it often involves multiple services and layers, and hence coordination between multiple teams across service tiers and underlying layers. It is hence equally important to build systems that continuously monitor and diagnose bad user experiences. Such systems help engineers troubleshoot and quickly fix performance problems, and know where to allocate resources in the medium term. Further, near real time diagnosis as a service is a useful primitive to optimize existing systems against performance problems. Content providers have designed and deployed several systems in practice [9, 19, 20, 30, 32, 34, 38, 41]; however, these systems do not diagnose performance problems end-to-end and across layers.

¹ Such designs are also called service-oriented and microservices architectures.

We present YTrace, a system that we are building at Yahoo to diagnose end-to-end performance problems that impact user experience. YTrace has three components: instrumentation to collect data, diagnosis methods that run on the data, and an efficient backend to index the data and execute diagnosis queries (Figure 2). In this paper, we focus on the first two components and touch upon the third. We consider dynamic web content that is tailored for users, perhaps the most common content type on the Internet. Our definition of user experience depends on the content type: for web content, we estimate user experience as the page load time (the latency between the user’s content request and the JavaScript OnLoad event); for video streams, we consider the duration of rebuffering events. Our work can easily be extended to diagnose performance problems with other content types and definitions of user experience.

When building an end-to-end diagnosis system, there are key requirements for large content providers:

• Tie to user experience: Instrumentation and diagnoses should directly relate to user experience of real users.
• The diagnosis output should be general enough to help troubleshoot almost all performance problems, including cascading failures.
• Multi-layer: The diagnosis should span as many layers in the serving stack as possible. At a minimum, it should include all services, the host machines and the underlying network layer.
• Instrumentation should have low overhead, so it does not affect the user experience.
• Accuracy: Diagnosis should have low false positive and false negative rates for the use cases. It should be able to diagnose tail latency.

The key ideas behind YTrace rely on identifying concurrent event execution, both at the service level and in the network. Knowing the context of concurrency enables YTrace to compute the most important information for diagnoses: the critical path in the execution. In order to find concurrency, YTrace records and mines causal relationships between events in a user session at the service and network layers. It aggregates diagnoses across user sessions and renders an interactive dashboard geared towards troubleshooting.

2. PROBLEM STATEMENT

There are three broad classes of use cases of YTrace: troubleshooting, resource provisioning and service adaptation. Troubleshooting aims to fix performance problems that users face after the problems occur. It requires the system to deliver near real time, actionable insights into performance problems. Resource provisioning is a relatively longer-term task that involves querying the system for aggregate views of diagnosis to find where to add resources.² Service adaptation uses YTrace as a near real time diagnosis-as-a-service to build serving stacks that optimize for user experience. For example, the traffic engineering service at a CDN may route users to CDN nodes based on diagnoses of Internet paths; the rate adaptation module in a video player could make strategic rate choices if it had diagnoses. Since this involves pre-defined queries, the system may materialize such queries to minimize query times.³

Based on discussions with teams across Yahoo, we formulate a problem statement whose solution provides actionable input for the three use cases. YTrace has three goals for every user session:

Detection: Is a user session seeing a performance problem?

Localization: Where are the performance problems in the end-to-end path (and across all layers)?

Root cause analysis: Why are the performance problems occurring?

In addition to per-session diagnoses, the YTrace backend supports (and materializes views of) aggregate queries over multiple user sessions, such as clusters of users (e.g., ISP and geography), and over a service in the datacenter in a time window. Aggregate queries with such predicates enable statistically significant analyses while conditioning on confounding variables.

3. INSTRUMENTATION

The first step towards performance diagnosis of a user session is instrumentation of components that participate in the session. The instrumentation should not add significant latency to the session. The key is to determine necessary and sufficient instrumentation for diagnosis. We implement optimized libraries for instrumentation so that the instrumentation overhead is very low relative to end-to-end latency.

One way to determine instrumentation is by considering the canonical end-to-end user session graph, whose nodes are components (which impact user experience) that participate in user sessions and whose edges represent point-to-point communication between nodes; it should cover all components and layers that are necessary for diagnosis. Figure 1 shows the graph for large content providers that spans: the user end-host, the CDN, the serving stack spanning one or more datacenters, and the underlying network infrastructure.

² Resource provisioning also requires answers to “what-if” questions. This is outside the scope of our current work.
³ Large query delays can be detrimental to performance, e.g., in load balancing [24].

Figure 2: YTrace architecture and components. [Diagram labels: Content Users, Browser Instrumentation, CDN, Datacenters, Service, YTrace Library, Streaming Transport, ETL + Analytics, Trace Indexes, Analytics, Federation + caching, Data access, Dashboards.]

In order to diagnose performance problems with each node in the canonical graph, the necessary and sufficient instrumentation will include performance data from every node in the graph (necessary condition) and will not include redundant instrumentation between edges (sufficient condition). The necessary and sufficient instrumentation for root cause analysis at a node depends on the attributes YTrace needs to be able to fingerprint and match problem signatures (§4).

YTrace includes two forms of instrumentation: (1) synchronous instrumentation that is in-band with the user session, and (2) asynchronous instrumentation from components that cannot be modified for instrumentation (such as network devices). We implement synchronous instrumentation in the form of distributed tracing, which allows YTrace to tie performance of any component into the user experience. YTrace uses causal relationships in instrumentation data to diagnose performance problems.

3.1 Synchronous Instrumentation

User-side instrumentation. User end-host instrumentation enables YTrace to diagnose performance problems with events at the end-host (including the browser, any containers, and the content itself). In general, content providers cannot alter the browser (e.g., by introducing plugins), which leaves them with a limited set of user-side performance measurements.

The work of content in a browser can be modeled as a sequence of events spanning fetching resources (either via local cache or network), execution and rendering. The W3C Navigation Timing (NavTiming) recommendation [2] describes such an event model for origin content (the page HTML) and exposes it via a JavaScript API to the web page.⁴

⁴ A similar event API for the other resources that the page requires is supported by the W3C Resource Timing recommendation [3].

Figure 3: Inferring causality in RPC patterns. [Panels: Concurrency, with $e_1 \nrightarrow e_2$; Redundancy, with $r_1 \to r_0$ and $r_2 \nrightarrow r_0$. Nodes: Client, A, B, C. Marked timestamps: $C_f(e_2)$, $C_l(e_2)$, $C_f(r_2)$, $C_l(r_2)$.]

We use the NavTiming event model for user-side instrumentation for web content in YTrace. This enables us to break down the user experience (page load time) into timing of events for origin content. There is a causal relationship between all NavTiming events: for example, DNS lookup (if any) causes TCP connect to the CDN. In particular, all events measured by NavTiming and Resource Timing have well-defined causal relationships. Having causal relationships helps us understand the events that resulted in bad user experience, since they are necessary to construct the latency critical path of the session.

In the case of video streaming (which is a linearizable sequence of RPCs per-segment to the CDN), YTrace uses an event model that includes timing of per-segment events at the video player. These events span segment RPCs, and decoding and rendering of those segments to the screen.

Distributed tracing. YTrace synchronously traces user sessions (i.e., RPCs from user agent) through all execution nodes in the serving stack, including the CDN services and the user host (see user-side instrumentation above). Distributed tracing is a common monitoring primitive in large-scale serving stacks [9, 32], and involves two steps: assigning a globally unique ID to a user session, and propagating this ID through all nodes in the serving stack. The ID propagation is typically implemented by adding the ID to all RPCs during session execution, for example in the form of a serialized header. YTrace records the timing of events related to each RPC during session execution using node-local clocks.

When implementing distributed tracing, there are a few design constraints that arise from large-scale environments. Such environments are highly heterogeneous, not only in the platforms used, but also in runtime complexity such as RPC execution patterns (see Figure 3), serialization formats and protocols. We find two forms of RPC-level concurrency in distributed execution: parallelism and redundancy, both in the context of RPC “fanout” implementations. Parallelism includes parallel RPCs, the opposite of which is serialized RPC execution. Redundancy is a case where a service doing the fanout only waits for a few responses before sending back a response to the caller; this is typically used in search engine stacks.

Perhaps the most important requirement in distributed tracing implementations is to capture causality among all events in the end-to-end execution. Ideally, causality among events should be described by the services themselves (during tracing), and not inferred offline using tracing data (an approach adopted by prior work [5, 9]). The reason for this is dynamic behavior in web services: for example, the concurrent/serialized and redundant RPC execution patterns shown in Figure 3 can be triggered as functions of session attributes, performance history and runtime environment. Such dynamicity makes offline inference of causal relationships between RPC edges hard (without additional data that may not be within the scope of a tracing system).

YTrace captures causality between RPC “edges” (requests and responses), since causality in RPC execution patterns such as concurrency and redundancy exists between RPC edges. This RPC model is a key difference between YTrace and prior tracing systems such as Dapper (and Zipkin), which capture causality at the granularity of RPCs. YTrace captures causality in two forms: during tracing using service-level APIs designed to capture causality, and offline using (well-defined) happens-before relationships [21].

Services call the YTrace tracing API at each RPC edge. The API returns an immutable session context (passed as a handle) for each incoming RPC, until that RPC is fully served. The API times each RPC edge, and any annotations⁵ across the session. It consumes and returns all headers that are/should be serialized in RPCs.

    void *handle = create(/* String */ in_header);
    String out_header = sendtonext(handle);
    recvfromnext(handle, /* String */ in_header);
    String out_header = sendtoprev(handle);
    annotate(handle, /* String */ key, /* String */ val);
    close(handle);

⁵ The annotations are optional service-specific timestamped key-value pairs, such as lock events. They are used by developers to understand service-specific performance, such as the impact of lock contention on end-to-end performance.

Figure 4: YTrace diagnosis flow. [Diagram labels: Instrumentation (input to all blocks); Detection; Service localization (causality, critical paths); Code localization (profile matching); Root cause analysis (pathology rule matching); Network localization and root causes (syslog parsing and async. learning, syslog diagnosis, causal rules); User host and Internet diagnosis (TCP models). Edge labels: Sessions, Critical path RPCs, Critical paths.]

The API captures causality in two ways. First, the session context (handle) allows YTrace to capture parent-child relationships between RPC edges. Second, the headers that the API generates include an RPC ID and parent RPC ID (that are unique to the session), and these IDs record causality between the request and response(s) (if any) of an RPC. The parent-child RPC IDs also record global ordering of RPCs in the session.
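To make the header-based design more concrete, the following Python sketch mimics the call sequence of the tracing API. Only the six function names come from the API listing; the `Handle` type, the `session:rpc:parent` header layout, the ID scheme and the in-memory `LOG` are assumptions made for illustration and are not YTrace's actual wire format or implementation.

```python
import time
import uuid
from typing import NamedTuple


class Handle(NamedTuple):
    """Immutable per-RPC session context (hypothetical layout)."""
    session_id: str
    rpc_id: str
    parent_rpc_id: str


LOG = []  # stand-in for an asynchronous log sink


def _record(handle, event, **extra):
    # Timestamp each RPC edge with a node-local clock.
    LOG.append({"t": time.monotonic(), "session": handle.session_id,
                "rpc": handle.rpc_id, "event": event, **extra})


def _parse(header):
    session_id, rpc_id, parent_rpc_id = header.split(":")
    return session_id, rpc_id, parent_rpc_id


def create(in_header):
    """Incoming request edge: build the context from the header."""
    if in_header:
        handle = Handle(*_parse(in_header))
    else:  # first hop: start a new session (assumed behavior)
        handle = Handle(uuid.uuid4().hex, "root", "")
    _record(handle, "request_in")
    return handle


def sendtonext(handle):
    """Outgoing request edge: mint a child RPC ID under this RPC."""
    child_id = uuid.uuid4().hex[:8]
    _record(handle, "request_out", child=child_id)
    return f"{handle.session_id}:{child_id}:{handle.rpc_id}"


def recvfromnext(handle, in_header):
    """Incoming response edge: match it to the child RPC by its ID."""
    _, child_id, _ = _parse(in_header)
    _record(handle, "response_in", child=child_id)


def sendtoprev(handle):
    """Outgoing response edge: echo this RPC's own IDs to the caller."""
    _record(handle, "response_out")
    return f"{handle.session_id}:{handle.rpc_id}:{handle.parent_rpc_id}"


def annotate(handle, key, val):
    """Attach a timestamped key-value pair to the current RPC."""
    _record(handle, "annotation", key=key, val=val)


def close(handle):
    """Mark the incoming RPC as fully served."""
    _record(handle, "close")
```

In this sketch the handle is never mutated; everything a downstream service needs to reconstruct the parent-child relationship travels in the returned header string, which is the trade of a few extra bytes on the wire for no shared state in the library.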

We implement the YTrace tracing API as a userspace library that services call during execution. The library provides a 99th percentile runtime SLA of 3 µs per RPC. The low runtime overhead of the library is due to two reasons. First, the APIs are stateless, since the session state (handle) in a service is immutable. The state, instead, is passed over the network in the serialized headers; an example of such state is the child-parent RPC IDs in the session. This design choice trades off expensive state maintenance in the API against a few additional bytes on the network, and also avoids any need for session-level synchronization in a service. Second, all logging in YTrace is asynchronous.

The API captures a significant set of causal relationships in sessions, but it does not capture causality (or lack of it) in RPC execution patterns such as parallelism and redundancy in the execution graph [9, 11] (see Figure 3); we found that doing so makes the API complex (which slows down adoption). YTrace uses happens-before relationships to infer this causality (see §4).

CDN instrumentation. The wide area Internet path can be a significant source of performance problems that impact user experience. In order to diagnose these problems and their impact on user experience, we need to instrument the Internet path in isolation. Typical approaches explored in literature include active probing (which is asynchronous) or model-based methods that infer path performance from user-side and/or CDN-side measurements of the trace. Both approaches have limitations: the former adds network traffic and may not be causally related to the session (since it is asynchronous), while the latter methods are prone to user host problems (which are not uncommon).

In order to diagnose and isolate Internet performance problems, we take periodic snapshots of measurements that the TCP stack in the CDN kernel maintains for the TCP connection used by user host RPCs. Specifically, we snapshot tcp_info structures from the Linux kernel. The structure contains end-to-end statistics of the TCP connection that affect serving performance, such as packet retransmissions, reordering, RTT, sender and receiver windows, etc.; it hence measures the Internet path as the flow sampled it. YTrace uses the TCP connection statistics to localize throughput bottlenecks to sender (content generation), receiver and path-based limitations; and to diagnose download bottlenecks with user hosts. Note that we do not include diagnosis of CDN traffic engineering-related performance problems; in other words, YTrace’s diagnosis is conditioned on the traffic engineering decision for a user session (see discussion, §8).

Process profiling. We are adopting Continuous Profiling [6, 30] in Yahoo services for a small fraction of user sessions. At a high level, Continuous Profiling collects performance counters exposed by modern CPUs. Performance counters allow us to understand host-level bottlenecks and localize bad user experience down to code using the associated program counters (in conjunction with the process binaries).

At the end of each session, YTrace records a directed trace graph that includes: (i) the end-to-end execution graph with compute, serialization and RPC timings at each node and event causality, (ii) user-side event timings and event causality, and (iii) TCP-layer measurements of user-side Internet at the CDN.

3.2 Asynchronous Instrumentation

Despite service stack-level concurrency, the underlying network is a shared resource and typically under-provisioned (e.g., fat-tree datacenter topologies). The network can introduce performance problems in RPCs during session execution, which can impact user experience. YTrace collects asynchronous instrumentation from components in the end-to-end serving stack that cannot be modified to do tracing. Such components are typically in the underlying network layer, such as the datacenter network devices.

YTrace collects syslogs from all datacenter network devices. Syslogs include detailed and fine-grained state information of each device and the root cause. In conjunction with syslogs, YTrace uses network topology to localize user session performance problems to network devices. It collects network topology snapshots of: (i) the wide area Internet paths from the CDN to client clusters and to the datacenters using traceroutes, and (ii) each datacenter network using device configurations.

4. DIAGNOSIS

YTrace uses synchronous and asynchronous instrumentation to diagnose performance problems that impact user experience. In this section, we sketch the diagnosis methods. See Figure 4 for a flow overview.

Detection. The first step towards performance diagnosis is to detect performance problems. Since our focus is on user experience, we frame the detection problem around it: Is the page load time⁶ large for the user session? YTrace answers this question by estimating a baseline (normal behavior) for the page load time based on history, and finding if the session has a statistically significant deviation from the baseline. The page load time is measured at the user-side. Note that detection algorithms have to be aware of confounding variables such as the web page (session) attributes, the user attributes and time of day; YTrace conditions the detection based on domain knowledge of pre-defined confounding variables. YTrace also supports detection based on other definitions of user experience, or not based on user experience at all: for example, video quality of experience, abnormal service latencies or unusual execution graphs for a session. YTrace currently estimates a simple baseline as the historic inter-quartile range, since we are interested in understanding performance behind both low and high user experience metrics.
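A minimal sketch of baseline-based detection is shown here, assuming that "deviation from the baseline" simply means falling outside the historic inter-quartile range of sessions that share the same confounding variables. The grouping key, the minimum history size and the three output labels are assumptions for illustration; the text does not specify the statistical test that YTrace applies.

```python
from statistics import quantiles


def iqr_baseline(history_ms):
    """Return (Q1, Q3) of historic page load times in milliseconds."""
    q1, _median, q3 = quantiles(history_ms, n=4)
    return q1, q3


def detect(page_load_ms, history_by_group, group):
    """Classify one session against the baseline of its own group.

    `group` is a tuple of confounding variables, for example
    (page, time-of-day bucket), so that sessions are only compared
    with comparable history. Returns "low", "normal" or "high",
    or None when there is too little history to form a baseline.
    """
    history = history_by_group.get(group, [])
    if len(history) < 4:
        return None
    q1, q3 = iqr_baseline(history)
    if page_load_ms > q3:
        return "high"
    if page_load_ms < q1:
        return "low"
    return "normal"
```

Reporting both "low" and "high" reflects the stated interest in sessions on either side of the baseline, not only the slow ones.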

4.1 Content Diagnosis

For sessions that were detected as having performance problems, YTrace uses the user-side instrumentation to determine whether there were performance problems that were localized to the user agent (browser). To do this, it checks whether the latency of user-side events from the Navigation Timing API [2] (e.g., DOM processing and rendering) is significant relative to the OnLoad time for the page. Note that NavTiming events are causally related and the critical path of user-side events includes all events.
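The check described above can be sketched as a ratio test over Navigation Timing timestamps. The attribute names (`navigationStart`, `domLoading`, `domComplete`, `loadEventStart`) are those of the W3C Navigation Timing interface; the choice of the `domLoading` to `domComplete` interval as the "user-side" portion and the 0.5 threshold are assumptions for illustration, not values taken from YTrace.

```python
def dom_share_of_onload(t):
    """Fraction of OnLoad time spent between domLoading and domComplete.

    `t` maps Navigation Timing attribute names to millisecond timestamps.
    """
    onload_ms = t["loadEventStart"] - t["navigationStart"]
    if onload_ms <= 0:
        raise ValueError("loadEventStart must follow navigationStart")
    dom_ms = t["domComplete"] - t["domLoading"]
    return dom_ms / onload_ms


def localized_to_user_agent(t, threshold=0.5):
    """True when DOM processing dominates the page load time."""
    return dom_share_of_onload(t) >= threshold
```

Because the NavTiming events form a single causal chain, a plain share of the total is a reasonable first cut here; no critical path search is needed on the user side.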

In general, resources on the page are fetched (and executed) concurrently with the origin page, and the concurrency depends on the origin content, ordering of resource arrivals and execution latency (parsing, DOM construction, etc.). This dependency between resources leads to blocking periods; however, such analysis requires browser modifications [36]. YTrace currently treats all content (origin and resources) performance as independent (note that origin content is typically the dynamic, non-cacheable content in personalized pages). We are investigating dependency and blocking time measurement that avoids browser changes.

⁶ We can ask similar detection questions about user experience metrics for other content types, e.g., rebuffering in video.

4.2 Service Diagnosis

Service localization refers to the question of which services in the session execution graph caused a performance problem for sessions that were detected as having performance problems. YTrace localizes service-level problems by estimating the critical path(s), in terms of service latencies, in the session execution graph. In order to compute the critical path, YTrace needs context of concurrency in the execution graph, which is determined by causality between RPC edges.

YTrace tracks causality as follows. The synchronous instrumentation system tracks two forms of RPC causality during tracing: parent-child RPC relationships, and request-response causality. In order to keep the tracing API simple to use and reduce usage errors, we do not track (at the API-level) causality between sibling RPC edges at a single node. Causality between sibling RPC edges may (or may not) exist, depending on the RPC execution patterns used (see Figure 3). Since such patterns are typically dynamic, based on session and environment attributes, we cannot use blackbox methods that mine causal relationships from offline session trace data [5, 9].

The parallelism and redundancy RPC patterns lend themselves well to happens-before relationships (this follows directly from their definition). Consider outgoing RPC edges $e_1 \ldots e_n$ at a service node $A$ (Figure 3), and let the responses to the outgoing edges be the edges $r_1 \ldots r_n$ (incident on $A$). Denote the first byte and last byte timestamps ($A$'s wall clock) of an edge $e$ by $C_f(e)$ and $C_l(e)$. Timestamps are taken from user space. Note that $e_i$ causes $r_i$ for all $i$, denoted as $e_i \to r_i \; \forall i$.

Causality in non-parallelism: $r_i \to e_j : C_f(r_i) < C_f(e_j)$; $r_i \nrightarrow e_j$ otherwise.

Redundancy: If edge $r_0$ is $A$'s reply to the calling node, $r_i \to r_0 \; \forall r_i : C_f(r_i) < C_l(r_0)$; $r_i \nrightarrow r_0$ otherwise.
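The two happens-before rules can be applied mechanically to the timestamps recorded at node A. The sketch below is one way to do so; the `Edge` record and the function names are hypothetical, and the request and response lists are assumed to be aligned by index (the i-th response answers the i-th request).

```python
from typing import NamedTuple


class Edge(NamedTuple):
    name: str
    first_byte: float  # C_f(e), taken from A's clock
    last_byte: float   # C_l(e), taken from A's clock


def response_causes_request(r_i, e_j):
    """Non-parallelism rule: r_i -> e_j iff C_f(r_i) < C_f(e_j)."""
    return r_i.first_byte < e_j.first_byte


def response_causes_reply(r_i, r_0):
    """Redundancy rule: r_i -> r_0 iff C_f(r_i) < C_l(r_0)."""
    return r_i.first_byte < r_0.last_byte


def infer_sibling_causality(requests, responses, reply):
    """Return the inferred causal pairs (cause name, effect name) at A."""
    causal = []
    for i, r in enumerate(responses):
        for j, e in enumerate(requests):
            # Skip i == j: e_i causes r_i, so r_i cannot cause e_i.
            if i != j and response_causes_request(r, e):
                causal.append((r.name, e.name))
        if response_causes_reply(r, reply):
            causal.append((r.name, reply.name))
    return causal
```

For two serialized RPCs the first response is found to cause the second request; for two parallel RPCs no response-to-request pair is reported; and a response that arrives after A has already replied is not counted as a cause of the reply.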

YTrace uses edge causality to estimate the critical pathin the execution graph, defined as thecausalround trippath in the graph with the largest total (service and net-work) latency. Latency at a service with an incomingedgee0 and a causal outgoing edgee1 is the compu-tation time: ls01 = Cf (e1) − Cl(e0). The network la-tency is the RPC edge (de)serialization time at the callernode. Note that the critical path in the same executiongraph can change based on RPC causality: for example,if serviceA makes two RPCse1 ande2 toB, the criticalpath may include one or both ofe1 ande2 depending one1 − e2 causality:

[Figure: RPC timelines between services A and B showing the edges e0, e1, e2 and the responses r1, r2, r0, drawn for two different e1-e2 causality cases.]

YTrace estimates the contribution of a service as the sum of computation latencies for all incoming-causal outgoing edge pairs $l_{ij}$. It reports the service-level localization output as the top service contributors, and their fraction of end-to-end (user-side) latency, amongst services in the critical path.
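As an illustration of the critical path and service contribution estimates described above, the following Python sketch treats the causal execution graph as a DAG and finds the largest-latency causal path by memoized search. The tuple layout, the function names and the example latencies are assumptions made for this sketch; in particular, it charges the whole e-to-r round trip to the callee service, which simplifies the split between service and network latency.

```python
from collections import defaultdict
from functools import lru_cache


def critical_path(links, source, sink):
    """Largest-latency causal path from source to sink in an acyclic graph.

    links: iterable of (cause, effect, latency_ms, service) tuples, where
    service is the service charged with the latency, or None for network time.
    Returns (total_latency_ms, path) or None if sink is unreachable.
    """
    out = defaultdict(list)
    for link in links:
        out[link[0]].append(link)

    @lru_cache(maxsize=None)
    def best(node):
        if node == sink:
            return 0.0, ()
        options = []
        for link in out.get(node, ()):
            tail = best(link[1])
            if tail is not None:
                options.append((link[2] + tail[0], (link,) + tail[1]))
        return max(options, key=lambda o: o[0]) if options else None

    return best(source)


def service_contributions(path, end_to_end_ms):
    """Rank services on the path by summed latency and end-to-end fraction."""
    totals = defaultdict(float)
    for _cause, _effect, latency_ms, service in path:
        if service is not None:
            totals[service] += latency_ms
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(svc, ms, ms / end_to_end_ms) for svc, ms in ranked]


# A issues e1 and e2 in parallel after receiving e0 (made-up latencies, ms).
parallel = [
    ("e0", "e1", 2.0, "A"), ("e0", "e2", 2.5, "A"),
    ("e1", "r1", 10.0, "B"), ("e2", "r2", 4.0, "B"),
    ("r1", "r0", 1.0, "A"), ("r2", "r0", 1.5, "A"),
]
total_ms, path = critical_path(parallel, "e0", "r0")
print(total_ms, service_contributions(path, end_to_end_ms=20.0))
```

If $e_2$ instead depended on $r_1$ (an extra causal link from r1 to e2, and no link from e0 to e2), the same search would return a path through both RPCs, which is the sensitivity to causality noted above.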

4.3 Network Diagnosis

YTrace uses syslogs from network devices to diagnose datacenter network problems. It uses TCP stack measurements at the CDN to isolate problems on the user-to-CDN Internet path (note that we cannot instrument the Internet path). We first look at datacenter network problem diagnosis.

Datacenter network diagnosis. The critical path found during service localization represents the subset of execution that contributed to user latency, and it includes time spent by RPCs in the network. The network can degrade the performance of RPCs by inducing latency, packet losses and reordering, which increase the RPC time and reduce throughput (especially for RPCs with large payloads). Our goal in network diagnosis is to localize and find root causes of datacenter network problems. We are interested in localizing cascading problems and finding root causes that propagate across the network stack; for example, a hardware problem in a switch that cascades into problems in the connected router as both L2 and routing plane problems.

YTrace uses syslogs and the datacenter network topology to diagnose cascading problems. Each datacenter network device emits a stream of syslog messages, which are semi-structured text that include a timestamp, severity level and semantics of the problem (network interface, problem type and attributes, etc.). Our goal is to represent a problem as a structured graph that describes the causal activity (the cascade) in the problem. YTrace uses domain knowledge to preprocess syslogs: mapping them to structured "templates" (including equivalence classes of problem types), and extracting device attributes (if any). We leverage some of the prior work on template extraction [29]. The domain knowledge is a one-time input to YTrace and does not need changes unless the syslog templates change (e.g., due to vendor or major OS changes, which are infrequent).

Figure 5: Example network problem graphs from our datacenter.

The first step towards diagnosing session performance due to network problems is to find RPCs in the session critical path that impacted user experience. YTrace computes a candidate list of RPCs per session as follows. Consider two services $A$ and $B$, and an RPC $e$ from $A$. The local clock at node $A$ is $C^A_f$ (we consider timestamps at the start of each RPC request/response to avoid self-loading effects). For diagnosis, we are interested in the variable component of the one-way delay, typically queueing delays. YTrace estimates queueing delay during the RPC as $d_{AB}(e) = C^B_f(e) - C^A_f(e) - \min_\Delta(d_{AB})$, where the min is taken over the recent $\Delta$ time window$^7$; the min is an estimate of the constant components of the one-way delay. YTrace detects an RPC as having a performance problem if the queueing delay is significant relative to the end-to-end latency. It then computes a set of network devices that the RPC could have traversed.
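A minimal Python sketch of this queueing delay estimate follows. It assumes RPCs are observed in send-time order, interprets the min as the minimum raw one-way delay seen in the recent window, and uses a hypothetical ratio threshold for "significant relative to the end-to-end latency"; the class and function names are illustrative only. Any constant clock offset between A and B cancels out because it appears in both the raw delay and the window minimum.

```python
from collections import deque


class QueueingDelayEstimator:
    """Sliding-window estimate of queueing delay on the A -> B direction."""

    def __init__(self, window_s):
        self.window_s = window_s
        self.samples = deque()  # (send time at A, raw one-way delay)

    def observe(self, c_a_first, c_b_first):
        """Record one RPC and return its estimated queueing delay."""
        raw = c_b_first - c_a_first  # includes clock offset and fixed delays
        self.samples.append((c_a_first, raw))
        # Evict samples that fell out of the recent window.
        while self.samples[0][0] < c_a_first - self.window_s:
            self.samples.popleft()
        baseline = min(delay for _, delay in self.samples)
        return raw - baseline


def is_network_suspect(queueing_delay_s, end_to_end_s, ratio_threshold=0.2):
    """Flag an RPC whose queueing delay is a large share of its latency.

    ratio_threshold is a hypothetical tuning knob, not a value from the text.
    """
    return end_to_end_s > 0 and queueing_delay_s / end_to_end_s >= ratio_threshold


estimator = QueueingDelayEstimator(window_s=10.0)
estimator.observe(0.0, 100.002)       # first sample defines the baseline
d = estimator.observe(1.0, 101.005)   # about 3 ms above the baseline
print(d, is_network_suspect(d, end_to_end_s=0.010))
```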

The typical way to localize network-level performance problems is boolean network tomography [12] on end-to-end observations of RPC queueing delays. Tomography takes as input observations of paths, good and bad, that overlap with a problem path. It aims to isolate the part of the network that led to the bad observations. Tomography is not directly applicable in datacenter networks, since such networks use multi-path routing; hence, the path an RPC takes is not deterministic. This makes the problem combinatorial and expensive to solve. Syslogs provide a single network-wide solution that addresses both localization and root cause analysis.

YTrace's network diagnosis module has two components: real time problem graph mining that ingests all syslogs from a datacenter, and asynchronous low-volume learning that periodically generates causal rules as input for the real time component.

More formally, a problem graph is a directed graph of syslog templates, where an edge $T_i \to T_j$ implies that template $T_i$ caused template $T_j$. A problem graph could exist within a single device or span multiple devices. A causal rule connects two templates by a causal relationship $T_i \to T_j$; depending on whether $T_i$ and $T_j$ happen within a single device or different devices, a causal rule could be either intra-device or inter-device. For example, Figure 5 shows two instances of problems from a Yahoo datacenter (colors encode the layer in the protocol stack). The left graph shows a multi-layer problem that spans aggregate and top-of-rack tiers in the fat-tree network, and multiple layers in the protocol stack. It encodes a cascading problem: a module failure causes a link down event, which triggers a spanning tree protocol status change, which in turn causes an interface status change on a peering device. The right graph shows a problem within a top-of-rack device that is an Ethernet (L2) flapping issue.

7 The time window should be large enough to include queue dissipation (µs in datacenter networks) but smaller than clock drift timescales (minutes).

YTrace's diagnosis module mines problem graphs as follows. It divides the syslog timeline across the datacenter into small time windows. Within each time window, it maps syslog lines into templates and uses the corpus of causal rules to iteratively construct problem graphs, starting from intra-device edges and then adding inter-device edges. At any point of time, we typically have 100-200 causal rules. Hence, the runtime overhead of mining problem graphs in a small time window of syslog messages across the datacenter is relatively low (it can run on a single machine).
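The mining step above could look roughly like the following Python sketch. It is a simplification under stated assumptions: syslog lines are already mapped to (timestamp, device, template) tuples, causal rules carry an explicit "intra" or "inter" scope, inter-device rules apply only between topologically adjacent devices, and a cause must not be logged after its effect. The template names, the data layout and the function name are all hypothetical.

```python
from collections import defaultdict


def mine_problem_graphs(events, rules, neighbors, window_s):
    """Build per-window problem graphs as lists of causal edges.

    events: iterable of (timestamp_s, device, template).
    rules: set of (cause_template, effect_template, scope), scope being
           "intra" or "inter".
    neighbors: dict mapping a device to the set of adjacent devices.
    Returns {window_index: [((device, template), (device, template)), ...]}.
    """
    windows = defaultdict(list)
    for ts, device, template in events:
        windows[int(ts // window_s)].append((ts, device, template))

    graphs = {}
    for index, bucket in sorted(windows.items()):
        bucket.sort()
        edges = []
        # Intra-device edges first, then inter-device edges.
        for scope in ("intra", "inter"):
            for a in bucket:
                for b in bucket:
                    if a is b or a[0] > b[0]:
                        continue
                    if (a[2], b[2], scope) not in rules:
                        continue
                    same_device = a[1] == b[1]
                    adjacent = b[1] in neighbors.get(a[1], set())
                    if (scope == "intra" and same_device) or (
                        scope == "inter" and adjacent
                    ):
                        edges.append(((a[1], a[2]), (b[1], b[2])))
        if edges:
            graphs[index] = edges
    return graphs


events = [
    (0.0, "agg1", "MODULE_FAIL"),
    (0.1, "agg1", "LINK_DOWN"),
    (0.2, "tor1", "IF_STATUS_CHANGE"),
    (7.0, "tor2", "LINK_FLAP"),
]
rules = {
    ("MODULE_FAIL", "LINK_DOWN", "intra"),
    ("LINK_DOWN", "IF_STATUS_CHANGE", "inter"),
}
neighbors = {"agg1": {"tor1"}, "tor1": {"agg1"}}
print(mine_problem_graphs(events, rules, neighbors, window_s=5.0))
```

The pairwise scan is quadratic in the number of events per window, which is tolerable only because windows are small and the rule corpus is limited to a few hundred entries.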

The problem of mining causal relationships between syslog templates is relatively harder, since it is the problem of finding needles in a haystack of syslogs. In such cases, happens-before relationships result in significant false positive rates. We adopt statistical causality mining techniques to discover causal rules; in particular, we use Quasi Experimental Design (QED). First, we find (in a larger time window) template pairs that have a statistically significant correlation in their timeseries. For each template pair $T_i$, $T_j$ that is correlated, QED finds causality by testing the hypothesis that an element of the treated set is much more likely than an element of the untreated set. The treated set consists of instances when $T_i$ and $T_j$ exist together at any time, while the untreated set has instances when $T_i$ exists but not $T_j$. If the treated set is more likely, QED assigns a causal relationship $T_i \to T_j$.
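To make the treated/untreated comparison concrete, the sketch below counts both sets over a list of time windows and applies a one-sided exact binomial (sign) test. The choice of test, the significance level `alpha` and the function names are assumptions for illustration; the text above does not specify which statistical test YTrace's QED step uses, and the prior correlation filter is not shown.

```python
from math import comb


def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def qed_causal_rule(windows, t_i, t_j, alpha=0.01):
    """Decide whether to emit the causal rule t_i -> t_j.

    windows: list of sets, each holding the templates seen in one time window.
    Treated instances contain both t_i and t_j; untreated contain t_i only.
    The null hypothesis is that an instance of t_i is equally likely to be
    treated or untreated.
    """
    treated = sum(1 for w in windows if t_i in w and t_j in w)
    untreated = sum(1 for w in windows if t_i in w and t_j not in w)
    n = treated + untreated
    if n == 0:
        return False
    return binomial_tail(treated, n) < alpha


# Made-up window contents with hypothetical template names.
windows = (
    [{"LINK_DOWN", "STP_CHANGE"}] * 12
    + [{"LINK_DOWN"}]
    + [{"OTHER"}] * 5
)
print(qed_causal_rule(windows, "LINK_DOWN", "STP_CHANGE"))
```

Note that this simple test does not by itself fix the direction of the rule; a real pipeline would need temporal ordering or additional evidence to choose between $T_i \to T_j$ and $T_j \to T_i$.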

For each RPC in a session that is detected as having a performance problem, YTrace summarizes the set of problem graphs on devices that the RPC could have traversed. At this point, the network diagnosis in YTrace is meant to show possible problems in the network that impact an RPC, since these problems may not manifest as performance problems in all RPCs. We are working on methods to establish a causal relationship between a network problem and RPC performance. A limitation of syslog-based diagnosis is that it will only mine problems that syslogs can describe. We believe that our syslog-mining methodology can be applied on logs from any multi-layer distributed service. We refer the reader to our prior work [22] for details of the network diagnosis methods.

Internet path diagnosis. YTrace synchronously instruments the user-to-CDN (and user-to-datacenter) path. In the context of Internet path performance, it captures userspace RPC timing at the user host, and RPC timing at the CDN and TCP stack measurements of the RPC at the CDN node. In practice, we observed that a common source of performance problems is the user host. Hence, the measurements taken from the browser (or any container on top) include a mix of problems in the user host and the Internet path (even after we measure and account for CDN-side latencies). We use measurements from the TCP stack in the CDN host kernel as estimators of the Internet path performance (as sampled by TCP). The TCP measurements include path RTT, RTT variation, segment retransmissions, congestion windows and reordering. YTrace uses the TCP measurements to estimate the impact of the Internet path on RPCs from the user host, and to isolate Internet problems from the user host performance. We are looking into using tomography on the TCP measurements to localize bottlenecks on the Internet (in conjunction with topology measurements).

4.4 Ongoing Work

Process localization. A part of our localization goal is to simplify performance debugging by localizing user session performance problems to source code. One approach requires YTrace to track performance counters for processes within each service and associate the counters with code; and it has to be low-overhead. The performance counters provide a context for fingerprinting runtime behavior of code (for node-local root cause analysis), and include program counters that help associate with code. This is early-stage work.

Root cause analysis. Root cause analysis in operational practice typically relies on fingerprinting performance problems based on domain knowledge and experience. While YTrace includes root cause analysis of network problems using syslogs, an open question is how to incorporate service and network operator inputs (domain knowledge) to do service-level root cause analysis. The key to this is to provide a suitable model of performance problems that operators can input, using the following grammar:

    SYMPTOM symptom
    PATHOLOGY pathology DEF ( symptom | NOT symptom )
    symptom := symptom1 AND symptom2
    symptom := symptom1 OR symptom2
    symptom := ( symptom )
    PROCEDURE symptom funcname

We model a performance problem as a boolean-valued expression on one or more boolean-valued symptoms. A symptom is a function of instrumentation. For each detected problem, we evaluate matching problem expressions to identify the root cause(s). The root cause analysis is based on prior work on network root cause analysis [17].
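One way to picture the evaluation of such problem expressions is the Python sketch below. It encodes expressions as nested tuples instead of parsing the grammar text, and maps each symptom to a procedure over instrumentation data. The encoding, the symptom and pathology names, and the thresholds are hypothetical and chosen only to illustrate the boolean model.

```python
def evaluate(expr, procedures, instrumentation):
    """Evaluate a boolean symptom expression against instrumentation data.

    expr is one of: ("SYM", name), ("NOT", e), ("AND", e1, e2), ("OR", e1, e2).
    procedures maps a symptom name to a function of the instrumentation.
    """
    op = expr[0]
    if op == "SYM":
        return bool(procedures[expr[1]](instrumentation))
    if op == "NOT":
        return not evaluate(expr[1], procedures, instrumentation)
    if op == "AND":
        return evaluate(expr[1], procedures, instrumentation) and evaluate(
            expr[2], procedures, instrumentation
        )
    if op == "OR":
        return evaluate(expr[1], procedures, instrumentation) or evaluate(
            expr[2], procedures, instrumentation
        )
    raise ValueError(f"unknown operator: {op}")


def diagnose(pathologies, procedures, instrumentation):
    """Return the names of all pathologies whose expression holds."""
    return [
        name
        for name, expr in pathologies.items()
        if evaluate(expr, procedures, instrumentation)
    ]


# PROCEDURE bindings: symptom name -> function of instrumentation.
procedures = {
    "high_retx": lambda m: m["retx_rate"] > 0.02,
    "high_rtt": lambda m: m["rtt_ms"] > 200,
}
# PATHOLOGY definitions over symptoms.
pathologies = {
    "lossy_path": ("AND", ("SYM", "high_retx"), ("NOT", ("SYM", "high_rtt"))),
    "congested_path": ("AND", ("SYM", "high_retx"), ("SYM", "high_rtt")),
}
print(diagnose(pathologies, procedures, {"retx_rate": 0.05, "rtt_ms": 80}))
```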

5. YTRACE BACKEND

A key aspect of YTrace is a high-performance analytics backend that enables near real time and accurate diagnoses. Figure 2 shows an overview of the backend. The backend ingests YTrace instrumentation data (a timeseries of events) and runs statistical analyses and diagnosis on the event stream. The events and analyses are written to a persistent store that drives an interactive visualization system.

Data transport. The first step after instrumentation is to transport the data to the indexing and analysis systems. YTrace uses a publish-subscribe messaging system to transport instrumentation events.

Since the YTrace libraries and the transport system implement asynchronous write semantics, instrumentation events can incur delivery delays or be delivered out of order, be lost, or sometimes be duplicated. This is particularly the case for all tracing events in a session, where it is not always feasible to determine if all data for a session has arrived for analyses. Moreover, due to event asynchrony, there may be statistical biases in certain analytics leading to false diagnosis. In our implementation, we trigger analysis of an event after a delay $\delta$; $\delta$ is pre-computed as the minimum duration after which any event is delivered with a high likelihood.
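One plausible reading of the pre-computed delay $\delta$ is a high quantile of historically observed delivery delays. The Python sketch below uses a nearest-rank quantile; the function names, the 99% default and the sample delays are assumptions for illustration, not values from YTrace.

```python
import math


def analysis_delay(delivery_delays_s, likelihood_pct=99):
    """Smallest observed delay that covers likelihood_pct% of past events.

    Uses the nearest-rank quantile over historical delivery delays.
    """
    ordered = sorted(delivery_delays_s)
    rank = max(math.ceil(likelihood_pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


def ready_for_analysis(event_time_s, now_s, delta_s):
    """An event is analyzed only once delta has elapsed since it occurred."""
    return now_s - event_time_s >= delta_s


# Made-up delivery delays (seconds); one straggler arrives after 5 s.
delays = [0.1, 0.2, 0.2, 0.3, 5.0]
delta = analysis_delay(delays, likelihood_pct=80)
print(delta, ready_for_analysis(event_time_s=100.0, now_s=100.2, delta_s=delta))
```

Choosing a lower likelihood shortens the wait before analysis at the cost of more sessions being analyzed with missing events, which is the bias concern raised above.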

In order to find biases, the YTrace backend measures event volume as a function of service and datacenter, and uses it to estimate the expected volume at the current time. If there is a bias, it does not trigger analysis for that statistic. Inferring and avoiding bias is a part of our ongoing work.

Indexing. The indexing system provides a high-throughput write, low-latency read interface for structured data. Data in YTrace is a timeseries of graphs from the network (topology) and application layers (session traces). Since events for a session are transferred asynchronously, it is important that the writes are idempotent and session updates do not require any reads. The ETL process materializes a number of indices for sessions for common queries. We currently use an Apache HBase cluster as our persistent store. Domain-specific queries such as the network paths connecting two server hosts or the Internet path to a client host are processed by an API tier. Such queries are useful for diagnosis such as tomography.
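The requirement that writes be idempotent and read-free can be illustrated with a small in-memory stand-in for the persistent store. The sketch assumes each event carries a stable event identifier, so that a row key of (session id, event id) makes duplicate and out-of-order deliveries converge to the same state; this key design and the class name are hypothetical and are not the schema YTrace uses.

```python
class SessionStore:
    """In-memory stand-in for a key-value store with blind, idempotent puts."""

    def __init__(self):
        self.rows = {}

    def put_event(self, session_id, event_id, payload):
        # Blind overwrite keyed by (session, event): no read-modify-write,
        # so replaying the same event leaves the store unchanged.
        self.rows[(session_id, event_id)] = payload

    def session_events(self, session_id):
        """Collect all events recorded so far for one session."""
        return {
            event_id: payload
            for (sid, event_id), payload in self.rows.items()
            if sid == session_id
        }


store = SessionStore()
store.put_event("s1", "rpc-2", {"latency_ms": 12})  # arrives out of order
store.put_event("s1", "rpc-1", {"latency_ms": 3})
store.put_event("s1", "rpc-2", {"latency_ms": 12})  # duplicate delivery
print(store.session_events("s1"))
```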

Making it real time. In order to make the diagnosis near real time, we would need to: (1) minimize the latency between the end of a session and when the events in the session are analyzed, and (2) build a high-throughput analytics backend. A significant factor that contributes to the latency above is data movement from multiple datacenters into a central indexing system in a single datacenter. Note that, on the contrary, a central indexing system improves the analytics throughput; however, the latency induced by wide area data movement degrades performance more significantly, since wide area links have limited bandwidth and are shared resources.

Figure 6: Effect of network delays on about 2m user operations with the Sherpa data store from two Yahoo datacenters. (Plot: CDF of the network-e2e delay ratio.)

To reduce wide area data movement, we are working on a federated database that is partitioned across all datacenters. Each datacenter includes a local indexing system, and the data partitioning is based on the datacenter-locality of events in user sessions. Events inside a datacenter are transported within the datacenter; hence, the events for a session that is served by two datacenters will reside in two indexing systems. In the ideal case, the processing for a query would be done at the relevant indexing systems, and the aggregated output(s) returned to the federation layer; the aggregations are relatively low-volume. This is, however, not true of some queries such as joins, which may require inter-datacenter data movement.

We are also adding support for approximate queries to speed up query processing and reduce wide area data movement. The database has to be aware of data biases, both from transport and partitioning, in order to minimize statistical bias in query output. Our work builds on prior work in wide area and approximate query processing (e.g., the recent work on WANalytics [35] and BlinkDB [4]).

6. CASE STUDIES

In this section, we show some experiences and results using YTrace in production.

6.1 Distributed Storage

We consider a hosted large-scale, low-latency, distributed key-value storage system, Sherpa [10], that is used as a common storage backend in serving stacks at Yahoo (Figure 6). Sherpa aims for an SLO of 2ms for key reads. We look at a multi-layer analysis of latencies in Sherpa.

Operations with Sherpa traverse router nodes and are served by Storage Unit (SU) nodes, all connected by the datacenter network. We use YTrace data to look at the impact of round-trip network latencies$^8$ on the latency of two million Sherpa operations in two datacenters. Figure 6 shows that the network contributes a significant fraction of operation latency. The tail of the distribution (top 10%) includes operations that saw variable delays in the network (e.g., due to congestion or non-shortest path routing).

Using a simple model of network delay for a key read payload, we can show that the minimum delay for a read RPC to traverse the router and SU nodes and back is 0.5 to 0.9ms (depending on the number of round-trips TCP takes). Under per-hop queueing or non-shortest path routing conditions (a router-SU path normally traverses 1-2 ToR and/or one AGG device), the delay can be 0.7-1.3ms. Hence, in order to optimize for operation latency and maintain SLOs, the storage system could be designed to minimize the number of network hops traversed by RPCs.

6.2 Datacenter Network

We look at datacenter-wide problems from a single Yahoo datacenter using YTrace's network diagnosis output. The datacenter consists of a large fat-tree network topology with thousands of network devices. The topology is made of multiple "tiers": traversing bottom-up, Top-of-Rack (ToR) devices that connect servers (running services), multiple aggregation (AGG) tiers and a core tier that connects the datacenter to the Internet. RPCs between services within the datacenter typically traverse the ToR and AGG tiers; hence, any problems in the two tiers will impact a significant fraction of RPCs in the datacenter.

Figure 5 shows two examples of problem graphs from ToR and AGG tiers in the datacenter network; see §4.3 for details. Figure 7 shows the distribution of different problem classes across the three network tiers. Over 93% of the problems occur in the ToR switches (which dominate in number and are relatively low-cost devices). A large fraction of ToR and AGG problems occur in the lower layers (PHY and L2), and sometimes in higher layers such as the routing plane. On the other hand, middleboxes (that can be topologically placed anywhere in the network) show problems mostly in the higher layers (L3 and L4). Problem graphs last between a few seconds and hundreds of seconds, which makes the likelihood of RPCs being affected high. We refer the reader to our prior work [22] for more results and details of network diagnosis methods.

8 Latency computations in this study use a single clock. Network latency is: (router-SU RPC exchange at router) - (processing delay at SU).

Figure 7: Number of problem graphs for each network tier.

Figure 8: User-side RPC first-byte latency broken down by the dominant bottleneck (Internet or CDN). (Plot: CDF of user-side first-byte latency in ms, for Internet dominated and CDN+Backend dominated RPCs.)

6.3 Video Stack

Video serving stacks can be modeled as three-tier architectures, spanning the video player (user-side), the CDN and a backend store. A video playback is a sequence of RPCs by the player to the CDN; the CDN makes an RPC to the backend store if it does not have the response cached. We trace RPCs from the video player through the CDN, while synchronously instrumenting the TCP stack in the CDN kernel periodically over the course of the RPC. We use the TCP measurements as the source of truth for the Internet path performance, since the delays induced by the kernel space are relatively very low. We collect data for all user sessions over a course of two weeks for this case study.

We first look at the impact of backend and Internet performance on the user experience. We quantify the user experience as the first-byte delay for each RPC.

Figure 9: Network RTT variation of user and datacenter paths in 350m user sessions with the Yahoo CDN. (Plot: CDF of the RTTvar-RTT ratio, for user and backend connections.)

Figure 10: User-side stack latency estimates (lower bound). (Plot: CDF of user-side download stack latency, lower bound, in ms.)

Figure 8 shows the distribution of user-side latencies after dividing the set of RPCs into two parts: RPCs that are bottlenecked by the Internet path and RPCs that are bottlenecked by the CDN or backend. About 95% of the RPCs are bottlenecked by the Internet path, as we would expect. In the remaining 5% of RPCs that are bottlenecked by the CDN/backend, the cache miss rate is 40%, as compared to 2% overall. Further, the user experience degradation due to RPCs bottlenecked by the CDN/backend is tens of milliseconds higher than RPCs bottlenecked by the Internet. This shows that in order to troubleshoot or fix tail latencies, we should focus on the CDN/backend.

When there is a cache miss, the CDN makes an RPC to the backend. We look at the RTT of TCP connections at the Yahoo CDN, for RPCs from the user host and to the backend, in Figure 9. The RTT is the delay between a TCP data segment and the corresponding TCP acknowledgement at the CDN node's TCP stack. For 350m user sessions, we analyze the variation in RTT of each TCP connection, defined as the ratio of RTTvar and smoothed RTT in the kernel. We see that the RTT variation is significantly higher for user connections than datacenter connections; however, the tail RTT variation is dominated by the datacenter connections. This also makes the case for troubleshooting tail latency problems by looking at the CDN/backend.

Finally, we show that it is possible to estimate content download latencies in the user host by tracing instrumentation alone (i.e., timing at user and CDN hosts, and TCP kernel variables). The first-byte delay at the user host for video segment RPCs to the CDN ($\delta_{fb}$) includes an RTT ($\delta_{rtt}$) on the Internet, CDN and backend (if any) latencies ($\delta_{cdn}$ and $\delta_{be}$), and the download stack latency at the user host ($\delta_{ds}$). Considering the TCP Retransmission Timeout ($\delta_{rto}$) as a conservative estimate for $\delta_{rtt}$, we can estimate a lower bound for $\delta_{ds}$: $\delta_{ds} \geq \delta_{fb} - \delta_{cdn} - \delta_{be} - \delta_{rto}$. Figure 10 shows the distribution of the lower bound of download stack latencies across video sessions (truncated to positive real numbers). We see that the user host contributes tens to hundreds of milliseconds of latency when delivering data to the video player application running on the browser. Although user-side download stack delays are not under the control of a content provider, providers can avoid the effect of such problems by ordering RPCs to mask the problem. We refer the reader to our prior work [15] for more results on the video stack.
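As a worked example of the lower bound above, the short Python sketch below plugs in made-up latencies. The function name and the numbers are illustrative only; returning `None` for non-positive bounds is an assumption about how the truncation to positive values might be handled, since a non-positive bound carries no information about $\delta_{ds}$.

```python
def download_stack_lower_bound_ms(fb_ms, cdn_ms, be_ms, rto_ms):
    """Lower bound on user-side download stack latency, in milliseconds.

    Implements delta_ds >= delta_fb - delta_cdn - delta_be - delta_rto,
    where the TCP RTO stands in as a conservative estimate of the RTT.
    Returns None when the bound is not positive (uninformative).
    """
    bound = fb_ms - cdn_ms - be_ms - rto_ms
    return bound if bound > 0 else None


# Cache hit (no backend latency), first byte seen after 850 ms at the user.
print(download_stack_lower_bound_ms(fb_ms=850.0, cdn_ms=20.0, be_ms=0.0, rto_ms=300.0))
```

In this example the user host accounts for at least 530 ms of the first-byte delay; because the RTO overestimates the RTT, the true download stack latency can only be larger.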

7. RELATED WORK

Diagnosis systems are typically designed for diagnosing a subset of the end-to-end path or a specific layer of the stack. YTrace is an attempt to build an end-to-end multi-layer diagnosis system at web-scale, since performance troubleshooting activities typically rely on such insights. In doing so, it builds on some prior systems. We capture representative work in this section.

Distributed tracing systems. Distributed tracing is a common instrumentation primitive in content providers. Capturing, recording or mining causality between events in a distributed trace is necessary to make sense of session performance. Systems in prior work differ in the amount of instrumentation and trace analysis complexity; in fact, there is a tradeoff between instrumentation overhead and analysis complexity to do the same amount of diagnosis.

Systems implement causality synchronous with execution [13, 32] or mine it using historic traces [9]. For example, Dapper (and its derivative, Zipkin) capture causality between spans, which are combinations of requests and associated responses. While span-level causality is useful, it is not expressive enough to model RPC executions such as parallelism and redundancy. History-based causality mining helps minimize instrumentation overhead in production; it relies, however, on resources for offline mining of causal relationships. It works well in homogeneous environments, where there is a common RPC library and RPC execution patterns are predictable, but may not be feasible in heterogeneous and dynamic runtime environments due to runtime transitions to non-causality within a session. Magpie [8] lies on the instrumentation side of the spectrum: it captures detailed instrumentation, such as OS events and packet traces, to infer causality without needing offline analysis. While this yields detailed diagnoses, it may not be feasible in production. Project 5 [5], Mystery Machine [9] and Pinpoint [] lie on the analysis side of the spectrum: they require offline resources to mine causality. X-Trace is an experimental system that requires session tracing support from network devices; having such support helps do multi-layer causal discovery synchronously with the session (a limitation of YTrace), but network support may not be feasible in practice.

Network diagnosis. There has been significant research on network diagnosis methods. Sun et al. capture TCP variables at the CDN to localize performance bottlenecks [34]; they require OS kernel changes in the critical path, since they require TCP instrumentation outside of the tcp_info structure. WhyHigh [20] and LatLong [41] further discover client clusters with performance problems and diagnose user-to-CDN path problems at an aggregate level (instead of per-session). Yu et al. [38] and Ghasemi et al. [14] diagnose datacenter network performance using detailed instrumentation (e.g., socket logs and packet traces). At large serving rates, such logging may be infeasible. Network tomography techniques [12] localize bad performance to network interfaces; they assume that the path between two hosts is known, which is uncommon in datacenter networks. MonitorRank [19] uses similar tomography-based localization.

Log mining. Service and network log mining are common diagnosis methods. Distalyzer compares anomalous logs with known baseline logs [25] for diagnosis. Spectroscope compares two trace logs to understand differences between them [31]. Xu et al. mine log features [37]. Syslogs have been used to study network-specific failures in datacenters [16, 27, 28], but not for root cause analysis. Prior work has not explored causal discovery for log analysis; this becomes particularly necessary when looking for a small number of cascading problems in large log volumes.

Code and content localization. Binary profiling [6, 7, 30] and code mining methods [40] have been used to diagnose performance problems in single hosts down to code. These systems do not track code-level problems with user experience. More recently, Pivot Tracing [23] allows users to insert breakpoints in running code and log them while tracing (synchronously); it is mainly tailored towards debugging within a distributed system (as opposed to a content provider). Content diagnosis methods used browser modifications [36] and middleboxes [18]. We are exploring the feasibility of these methods in YTrace.

8. DISCUSSION AND CONCLUSION

In this paper, we presented the design of YTrace, a system for end-to-end multi-layer performance diagnosis in large content providers. We formulated a problem statement that covers diagnosis use cases, and presented instrumentation and methods for diagnosis. Our discussion opens several research questions that we cover next.

If an RPC is observed to have high latency in the datacenter, YTrace currently lists correlated network problems from devices that the RPC potentially traversed (based on syslogs). The longer the problem, the more likely that RPCs that traversed the devices will be impacted. Such correlations are useful towards troubleshooting (especially when looking at aggregated data). Going from correlation to causation, in other words, whether a network problem caused performance problems with RPCs, is an open question. It requires a priori knowledge of the devices the RPC traversed, e.g., using Netflow (since datacenter networks use multi-path routing), and inferring causal relationships between problems in those devices and RPC performance. One approach is to look for symptoms in RPC instrumentation that are caused by each network pathology.

YTrace diagnoses distributed root causes such as cascades in the datacenter network, but does not diagnose distributed root causes across services. Such problems create runtime dependencies between services that impact performance (despite RPC parallelism and redundancy). A common service-level cascading problem is backlog that builds up across services (typically calling services). Distinguishing these from backlogs that arise due to problems within the host requires appropriate instrumentation and diagnosis methods. We are looking into adopting causality-based joint mining of service logs and network syslogs to diagnose distributed root causes.

We assumed that Traffic Engineering (TE) at the CDN is a given: YTrace's diagnosis is conditioned on the TE for a user session. Diagnosing performance-sub-optimal TE for a session (i.e., whether a user was directed to a CDN node that caused bad user experience) requires knowledge of Internet path performance from the user to all CDN nodes at that time; accurately doing it is an open research direction. A related system, LatLong, diagnoses average latency [41].

Content providers may not have a complete view of the user end-host stack performance (hardware, OS environment, browser, etc.) as the content is parsed, executed and rendered. Analysis similar to WProf [36] that does not require browser modifications would help diagnose bottlenecks that reside in the user end-host, and could be exposed to the content provider similar to Navigation Timing [2].

In order to reduce overhead due to instrumentation, YTrace supports sampling a fraction of user sessions. Sampling leads to challenges in analyzing tail latency: it requires inversion of the sampled distribution of execution graphs to estimate high quantiles. For example, prior work on latency looked at estimating confidence intervals for latency quantiles [33].

In order to localize performance problems to networks and inter-domain links on the Internet, we are looking into adopting tomography methods that work on TCP measurement data. Tomography methods assume that the Internet path for a user IP address is known. In practice, provider networks may use multipath routing. Without additional active probing or data (e.g., Netflow) at the time of the RPC flow [26], it is challenging to find the sequence of Internet hops that a given RPC took.

Finally, Internet and transit providers can deploy traffic management mechanisms that do not follow conventional wisdom and can impact performance. For example, traffic shaping leads to changes in link capacity, which can impact long-running flows. YTrace currently diagnoses such mechanisms as a part of the user-CDN Internet path; diagnosing such root causes, however, is an open problem. Recent work on tomography shows that the methods can be used (under sufficient sample size) to find content discrimination [39], under assumptions of static routing.

Web-scale performance diagnosis requires rethinking from the ground up: the instrumentation design, algorithms and systems design to enable near real time diagnoses. There is an inherent tradeoff between how complex and detailed a diagnosis can be and the per-session instrumentation volume we can collect in production at scale. Traditional methods such as tomography and blackbox RPC causality learning are hard to apply in large-scale heterogeneous cloud environments. YTrace is an attempt to accomplish performance diagnosis at scale.

Acknowledgements

Ahmed Mansy led the TCP instrumentation effort. We would like to thank the many Yahoos with whom we had insightful discussions, and who contributed to YTrace instrumentation, deployment and backend: Jayanth Vijayaraghavan, Vahid Fatourehchi, Joshua Blatt, Christophe Doritis, Vidya Srinivasan, Srikanth Sampath Kumar, Arvind Murthy, Nishant Mishra, Jacob Cherackal, Powell Molleti, Antonia Kwok, Bhupendra Singh, Seema Datar, Evan Torrie, Rupesh Chhatrapati, Maurice Barnum, Amit Jain, Ramachandran Subramaniam, Amar Kamat, Tague Griffith, Archie Russell, Tim Miller, Scott Beardsley, Amotz Maimon, Ian Flint, Rick Hawes and Benoit Schillings. We thank Chen Liang (Duke) for syslog analysis and Jennifer Rexford (Princeton) for helpful discussions on the video stack.


9. REFERENCES

[1] Retail Web Site Performance: Consumer Reaction to a Poor Online Shopping Experience. Jupiter Research and Akamai. http://www.akamai.com/4seconds (2006).

[2] Navigation Timing, W3C Recommendation. http://www.w3.org/TR/navigation-timing/ (2012).

[3] Resource Timing, W3C Recommendation. http://www.w3.org/TR/resource-timing/ (2014).

[4] AGARWAL, S., MILNER, H., KLEINER, A., TALWALKAR, A., JORDAN, M. I., MADDEN, S., MOZAFARI, B., AND STOICA, I. Knowing when you’re wrong: Building fast and reliable approximate query processing systems. In SIGMOD (2014), ACM.

[5] AGUILERA, M. K., MOGUL, J. C., WIENER, J. L., REYNOLDS, P., AND MUTHITACHAROEN, A. Performance debugging for distributed systems of black boxes. In ACM SIGOPS Operating Systems Review (2003), vol. 37, ACM, pp. 74–89.

[6] ANDERSON, J. M., BERC, L. M., DEAN, J., GHEMAWAT, S., HENZINGER, M. R., LEUNG, S.-T. A., SITES, R. L., VANDEVOORDE, M. T., WALDSPURGER, C. A., AND WEIHL, W. E. Continuous profiling: where have all the cycles gone? ACM Transactions on Computer Systems 15, 4 (1997), 357–390.

[7] ATTARIYAN, M., CHOW, M., AND FLINN, J. X-ray: Automating Root-Cause Diagnosis of Performance Anomalies in Production Software. In USENIX OSDI (2012).

[8] BARHAM, P., DONNELLY, A., ISAACS, R., AND MORTIER, R. Using magpie for request extraction and workload modelling. In OSDI (2004), vol. 4, pp. 18–18.

[9] CHOW, M., MEISNER, D., FLINN, J., PEEK, D., AND WENISCH, T. F. The Mystery Machine: End-to-end performance analysis of large-scale Internet services. In USENIX OSDI (2014).

[10] COOPER, B. F., RAMAKRISHNAN, R., SRIVASTAVA, U., SILBERSTEIN, A., BOHANNON, P., JACOBSEN, H.-A., PUZ, N., WEAVER, D., AND YERNENI, R. PNUTS: Yahoo!’s hosted data serving platform. Proceedings of the VLDB Endowment 1, 2 (2008), 1277–1288.

[11] DEAN, J., AND BARROSO, L. A. The tail at scale. Communications of the ACM 56, 2 (2013), 74–80.

[12] DUFFIELD, N. Simple network performance tomography. In ACM SIGCOMM IMC (2003).

[13] FONSECA, R., PORTER, G., KATZ, R. H., SHENKER, S., AND STOICA, I. X-trace: A pervasive network tracing framework. In USENIX NSDI (2007).

[14] GHASEMI, M., BENSON, T., AND REXFORD, J. Real-time Diagnosis of TCP Performance in Clouds. In ACM CoNEXT Student Workshop (2013).

[15] GHASEMI, M., KANUPARTHY, P., MANSY, A., BENSON, T., AND REXFORD, J. Performance Characterization of a Commercial Video Streaming Service. arXiv e-prints 1605.04966 (May 2016).

[16] GILL, P., JAIN, N., AND NAGAPPAN, N. Understanding network failures in data centers: Measurement, analysis, and implications. In ACM SIGCOMM (2011).

[17] KANUPARTHY, P., AND DOVROLIS, C. Pythia: Diagnosing performance problems in wide area providers. In USENIX ATC (2014).

[18] KICIMAN, E., AND LIVSHITS, B. Ajaxscope: a platform for remotely monitoring the client-side behavior of web 2.0 applications. In ACM SOSP (2007).

[19] KIM, M., SUMBALY, R., AND SHAH, S. Root cause detection in a service-oriented architecture. In ACM SIGMETRICS (2013).

[20] KRISHNAN, R., MADHYASTHA, H. V., SRINIVASAN, S., JAIN, S., KRISHNAMURTHY, A., ANDERSON, T., AND GAO, J. Moving beyond end-to-end path information to optimize cdn performance. In ACM SIGCOMM IMC (2009).

[21] LAMPORT, L. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM 21, 7 (1978), 558–565.

[22] LIANG, C., BENSON, T., KANUPARTHY, P., AND HE, Y. Finding Needles in the Haystack: Harnessing Syslogs for Data Center Management. arXiv e-prints 1605.06150 (May 2016).

[23] MACE, J., ROELKE, R., AND FONSECA, R. Pivot tracing: dynamic causal monitoring for distributed systems. In ACM SOSP (2015).

[24] MITZENMACHER, M. How useful is old information? Parallel and Distributed Systems, IEEE Transactions on 11, 1 (2000), 6–20.

[25] NAGARAJ, K., KILLIAN, C., AND NEVILLE, J. Structured comparative analysis of systems logs to diagnose performance problems. In USENIX NSDI (2012).

[26] PAN, S., ZHANG, Z., YU, F., AND HU, G. End-to-end measurements for network tomography under multipath routing. IEEE Communications Letters (2014).

[27] POTHARAJU, R., AND JAIN, N. When the network crumbles: An empirical study of cloud network failures and their impact on services. In ACM SoCC (2013).

[28] POTHARAJU, R., AND JAIN, N. Demystifying the dark side of the middle: A field study of middlebox failures in datacenters. In ACM SIGCOMM IMC (2013).

[29] QIU, T., GE, Z., PEI, D., WANG, J., AND XU, J. What happened in my network: Mining network events from router syslogs. In ACM SIGCOMM IMC (2010).

[30] REN, G., TUNE, E., MOSELEY, T., SHI, Y., RUS, S., AND HUNDT, R. Google-wide profiling: A continuous profiling infrastructure for data centers. IEEE Micro 30, 4 (2010), 65–79.

[31] SAMBASIVAN, R. R., ZHENG, A. X., DE ROSA, M., KREVAT, E., WHITMAN, S., STROUCKEN, M., WANG, W., XU, L., AND GANGER, G. R. Diagnosing performance changes by comparing request flows. In USENIX NSDI (2011).

[32] SIGELMAN, B. H., BARROSO, L. A., BURROWS, M., STEPHENSON, P., PLAKAL, M., BEAVER, D., JASPAN, S., AND SHANBHAG, C. Dapper: a large-scale distributed systems tracing infrastructure. Google Tech Report (2010).

[33] SOMMERS, J., BARFORD, P., DUFFIELD, N., AND RON, A. Multiobjective monitoring for SLA compliance. IEEE/ACM Transactions on Networking (TON) 18, 2 (2010), 652–665.

[34] SUN, P., YU, M., FREEDMAN, M. J., AND REXFORD, J. Identifying performance bottlenecks in CDNs through TCP-level monitoring. In ACM SIGCOMM W-MUST (2011).

[35] VULIMIRI, A., CURINO, C., GODFREY, B., KARANASOS, K., AND VARGHESE, G. WANalytics: Analytics for a Geo-Distributed Data-Intensive World. In CIDR (2015).

[36] WANG, X. S., BALASUBRAMANIAN, A., KRISHNAMURTHY, A., AND WETHERALL, D. Demystifying Page Load Performance with WProf. In USENIX NSDI (2013).

[37] XU, W., HUANG, L., FOX, A., PATTERSON, D., AND JORDAN, M. I. Detecting large-scale system problems by mining console logs. In ACM SOSP (2009).

[38] YU, M., GREENBERG, A. G., MALTZ, D. A., REXFORD, J., YUAN, L., KANDULA, S., AND KIM, C. Profiling network performance for multi-tier data center applications. In USENIX NSDI (2011).

[39] ZHANG, Z., MARA, O., AND ARGYRAKI, K. Network neutrality inference. ACM SIGCOMM Computer Communication Review 44, 4 (Aug. 2014), 63–74.

[40] ZHAO, X., ZHANG, Y., LION, D., ULLAH, M. F., LUO, Y., YUAN, D., AND STUMM, M. lprof: A non-intrusive request flow profiler for distributed systems. In USENIX OSDI (2014).

[41] ZHU, Y., HELSLEY, B., REXFORD, J., SIGANPORIA, A., AND SRINIVASAN, S. Latlong: Diagnosing wide-area latency changes for cdns. Network and Service Management, IEEE Transactions on 9, 3 (2012), 333–345.