Page 1: American Astronomical Society Topical Conference Series: Exascale Radio Astronomy, Monterey, CA, March 30 – April 4, 2014

Foundations of data-intensive science: Technology and practice for high throughput, widely distributed, data management and analysis systems

W. Johnston, E. Dart, M. Ernst*, and B. Tierney
ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA
and
*Brookhaven National Laboratory, Upton, New York, USA

2

Data-Intensive Science in DOE's Office of Science
The US Department of Energy's Office of Science ("SC") supports about half of all civilian R&D in the US, with about $5B/year in funding (with the National Science Foundation (NSF) funding the other half)
– Funds some 25,000 PhDs and PostDocs in the university environment
– Operates ten National Laboratories and dozens of major scientific user facilities such as synchrotron light sources, neutron sources, particle accelerators, electron and atomic force microscopes, supercomputer centers, etc., that are all available to the US and global science research community, and many of which generate massive amounts of data and involve large, distributed collaborations
– Supports global large-scale science collaborations such as the LHC at CERN and the ITER fusion experiment in France
– www.science.doe.gov

3

DOE Office of Science and ESnet – the ESnet Mission
ESnet – the Energy Sciences Network – is an SC program whose primary mission is to enable the large-scale science of the Office of Science that depends on:
– Multi-institution, world-wide collaboration
– Data mobility: sharing of massive amounts of data
– Distributed data management and processing
– Distributed simulation, visualization, and computational steering
– Collaboration with the US and international research and education community
"Enabling large-scale science" means ensuring that the network can be used effectively to provide all mission-required access to data and computing
• ESnet connects the Office of Science National Laboratories and user facilities to each other and to collaborators worldwide
– Ames, Argonne, Brookhaven, Fermilab, Lawrence Berkeley, Oak Ridge, Pacific Northwest, Princeton Plasma Physics, SLAC, and Thomas Jefferson National Accelerator Facility, and embedded and detached user facilities

4

HEP as a Prototype for Data-Intensive Science
The history of high energy physics (HEP) data management and analysis anticipates many other science disciplines:
Each new generation of experimental science requires more complex instruments to ferret out more and more subtle aspects of the science
As the sophistication, size, and cost of the instruments increase, the number of such instruments becomes smaller, and the collaborations become larger and more widely distributed – and mostly international
– These new instruments are based on increasingly sophisticated sensors, which now are largely solid-state devices akin to CCDs
• In many ways the solid-state sensors follow Moore's law just as computer CPUs do: the number of transistors per unit area of silicon doubles every 18 months, and therefore the amount of data coming out per unit area doubles as well
– The data output of these increasingly sophisticated sensors has increased exponentially
• Large scientific instruments only differ from CPUs in that the time between science instrument refreshes is more like 10-20 years, and so the increase in data volume from instrument to instrument is huge

5

HEP as a Prototype for Data-Intensive Science

[Chart: HEP data volumes for leading experiments, with Belle-II estimates. Annotation: "LHC down for upgrade." Data courtesy of Harvey Newman, Caltech, and Richard Mount, SLAC, and the Belle II CHEP 2012 presentation.]

6

HEP as a Prototype for Data-Intensive Science
• What is the significance to the network of this increase in data?
• Historically, the use of the network by science has tracked the size of the data sets used by science
["HEP data collected" 2012 estimate (green line) in the previous slide]

7

HEP as a Prototype for Data-Intensive Science
As the instrument size and data volume have gone up, the methodology for analyzing the data has had to evolve:
– The data volumes from the early experiments were low enough that the data was analyzed locally
– As the collaborations grew to several institutions and the data analysis was shared among them, the data was distributed by shipping tapes around
– As the collaboration sizes grew and became intercontinental, the HEP community began to use networks to coordinate the collaborations and eventually to send the data around
The LHC data model assumed network transport of all data from the beginning (as opposed to shipping media)
Similar changes are occurring in most science disciplines

8

HEP as a Prototype for Data-Intensive Science
• Two major proton experiments (detectors) at the LHC: ATLAS and CMS
• ATLAS is designed to observe a billion (1x10^9) collisions/sec, with a data rate out of the detector of more than 1,000,000 gigabytes/sec (1 PBy/s)
• A set of hardware and software filters at the detector reduces the output data rate to about 25 Gb/s, which must be transported, managed, and analyzed to extract the science
– The output data rate for CMS is about the same, for a combined 50 Gb/s that is distributed to physics groups around the world, 7x24x~9mo/yr
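To put the trigger reduction in perspective, a rough back-of-the-envelope calculation (my numbers, assuming decimal units with 1 PByte = 10^15 bytes) shows the factor by which the filters cut the detector output down to the 25 Gb/s that actually leaves for analysis:

    # Rough size of the trigger/filter reduction described above
    # (assumption: 1 PByte = 1e15 bytes, decimal units).
    detector_rate_gbps = 1e15 * 8 / 1e9   # ~1 PByte/s out of the detector, in Gb/s
    network_rate_gbps = 25                # filtered ATLAS rate sent on for analysis
    print(f"{detector_rate_gbps / network_rate_gbps:,.0f}x reduction")  # ~320,000x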

The LHC data management model involves a world-wide collection of centers that store, manage, and analyze the data

[Diagram: A Network Centric View of the LHC (one of two detectors). The data path runs: detector → Level 1 and 2 triggers (O(1-10) meters, 1 PB/s – 8 Pb/s) → Level 3 trigger (O(10-100) meters) → CERN Computer Center / LHC Tier 0 (O(1) km) → the LHC Optical Private Network (LHCOPN), 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS), 500-10,000 km → LHC Tier 1 Data Centers (Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN) → the LHC Open Network Environment (LHCONE) → LHC Tier 2 Analysis Centers (universities / physics groups, shown many times to indicate that the physics groups now get their data wherever it is most readily available).
Tier 1 centers hold working data: tape 115 PBy, disk 60 PBy, 68,000 cores (WLCG 2012: 120 PBy, 175,000 cores). Tier 2 centers are data caches and analysis sites. Data outflow is 3x the inflow.]

10

Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC

[Charts: accumulated data volume on disk, rising to ~150 Petabytes over four years at 730 TBytes/day, and the counts of the two PanDA job types (scales of 0–100,000 and 0–50,000 simultaneous jobs) over one-year windows.]
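As a quick sanity check on the quoted numbers (assuming decimal terabytes, 1 TByte = 10^12 bytes), converting 730 TBytes/day to an average rate reproduces the ~68 Gb/s figure:

    # Sanity check: convert 730 TBytes/day into an average rate in Gb/s.
    bytes_per_day = 730e12
    gbps = bytes_per_day * 8 / 86400 / 1e9   # bits per day / seconds per day
    print(f"{gbps:.1f} Gb/s")                # ~67.6 Gb/s, matching the ~68 Gb/s above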

11

HEP as a Prototype for Data-Intensive Science
The capabilities required to support this scale of data movement involve hardware and software developments at all levels:
1. The underlying network
  1a. Optical signal transport
  1b. Network routers and switches
2. Data transport (TCP is a "fragile workhorse" but still the norm)
3. Network monitoring and testing
4. Operating system evolution
5. New site and network architectures
6. Data movement and management techniques and software
7. New network services
• Technology advances in these areas have resulted in today's state-of-the-art that makes it possible for the LHC experiments to routinely and continuously move data at ~150 Gb/s across three continents

12

HEP as a Prototype for Data-Intensive Science
• ESnet has been collecting requirements for all DOE science disciplines and instruments that rely on the network for distributed data management and analysis for more than a decade, and formally since 2007 [REQ]
In this process, certain issues are seen across essentially all science disciplines that rely on the network for significant data transfer, even if the quantities are modest compared to projects like the LHC experiments
Therefore, addressing the LHC issues is a useful exercise that can benefit a wide range of science disciplines

SKA data flow model is similar to the LHC

[Diagram, hypothetical and based on the LHC experience: receptors/sensors → correlator / data processor (93–168 Pb/s, ~200 km avg) → supercomputer (400 Tb/s, ~1000 km) → European distribution point (100 Gb/s, ~25,000 km Perth to London via USA, or ~13,000 km South Africa to London) → regional data centers → universities / astronomy groups (shown many times). Data rates are from the SKA RFI; these numbers are based on modeling prior to splitting the SKA between S. Africa and Australia.]

14

Foundations of data-intensive science
• This talk looks briefly at the nature of the advances in technologies, software, and methodologies that have enabled LHC data management and analysis
The points 1a and 1b on optical transport and router technology are included in the slides for completeness, but I will not talk about them. They were not really driven by the needs of the LHC, but they were opportunistically used by the LHC.
Much of the remainder of the talk is a tour through ESnet's network performance knowledge base (fasterdata.es.net)
– Also included are:
• the LHC ATLAS data management and analysis approach that generates and relies on very large network data utilization
• and an overview of how R&E networks have evolved to accommodate the LHC traffic

1) Underlying network issues
At the core of our ability to transport the volume of data that we must deal with today, and to accommodate future growth, are advances in optical transport technology and router technology

We face a continuous growth of data to transport:
ESnet has seen exponential growth in our traffic every year since 1990 (our traffic grows by a factor of 10 about once every 47 months)

[Chart: ESnet accepted traffic, roughly 0–15 Petabytes/month over 13 years.]
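Expressed as an annual factor (a simple conversion of the quoted rate, not a figure from the slides), that growth works out to roughly 1.8x per year:

    # The quoted growth -- 10x every 47 months -- expressed as an annual factor.
    annual_factor = 10 ** (12 / 47)
    print(f"~{annual_factor:.2f}x per year")   # roughly 1.8x/year sustained growth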

16

We face a continuous growth of data transport
• The LHC data volume is predicted to grow 10-fold over the next 10 years
New generations of instruments – for example the Square Kilometer Array radio telescope and ITER (the international fusion experiment) – will generate more data than the LHC
In response, ESnet and most large R&E networks have built 100 Gb/s (per optical channel) networks
– ESnet's new network – ESnet5 – is complete and provides 44 x 100 Gb/s (4.4 terabits/sec = 4400 gigabits/sec) in optical channels across the entire ESnet national footprint
– Initially, one of these 100 Gb/s channels is configured to replace the current 4 x 10 Gb/s IP network
• What has made this possible?

17

1a) Optical Network Technology
Modern optical transport systems (DWDM = dense wave division multiplexing) use a collection of technologies called "coherent optical" processing to achieve more sophisticated optical modulation, and therefore higher data density per signal transport unit (symbol), providing 100 Gb/s per wave (optical channel)
– Optical transport using dual polarization-quadrature phase shift keying (DP-QPSK) technology with coherent detection [OIF1]
• dual polarization
– two independent optical signals at the same frequency with orthogonal polarizations → reduces the symbol rate by half
• quadrature phase shift keying
– encodes data by changing the signal phase relative to the optical carrier → further reduces the symbol rate by half (sends twice as much data per symbol)
Together, DP and QPSK reduce the required symbol rate by a factor of 4
– allows the 100G payload (plus overhead) to fit into 50 GHz of spectrum
• The actual transmission rate is about 10% higher, to include FEC data
– This is a substantial simplification of the optical technology involved – see the TNC 2013 paper and Chris Tracy's NANOG talk for details [Tracy1] and [Rob1]
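A rough worked example (my numbers, assuming ~12% combined FEC and framing overhead, which is vendor dependent) shows how those two factor-of-two reductions let a 100G channel fit a standard 50 GHz grid slot:

    # Why DP-QPSK fits 100 GbE into a 50 GHz channel -- a rough illustration.
    # Assumption: ~12% combined FEC/framing overhead (vendor dependent).
    payload_rate = 100e9                      # 100 Gb/s client payload
    line_rate = payload_rate * 1.12           # add FEC + framing overhead
    bits_per_symbol = 2 * 2                   # 2 polarizations x 2 bits/symbol (QPSK)
    symbol_rate = line_rate / bits_per_symbol
    print(f"~{symbol_rate / 1e9:.0f} Gbaud")  # ~28 Gbaud, narrow enough for a 50 GHz slot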

18

Optical Network Technology
ESnet5's optical network uses Ciena's 6500 Packet-Optical Platform with WaveLogic™ to provide 100 Gb/s per wave
– 88 waves (optical channels), 100 Gb/s each
• wave capacity shared equally with Internet2
– ~13,000 miles / 21,000 km of lit fiber
– 280 optical amplifier sites
– 70 optical add/drop sites (where routers can be inserted)
• 46 100G add/drop transponders
• 22 100G re-gens across the wide area

[Map of the ESnet5 optical footprint, showing hubs and sites such as SEAT, SUNN, SACR, LOSA, LASV, ELPA, ALBU, PHOE, DENV, SALT, BOIS, KANS, STLO, LOUI, CHIC, CLEV, CINC, NASH, CHAT, EQCH, STAR, ATLA, WASH, NEWY, BOST, JACK, the labs NERSC, LBNL, JGI, SLAC, SNLL, PAIX, ANL, FNAL, BNL, ORNL, plus the Long Island MAN, the ANI Testbed, SC11, and Internet2. Geography is only representational.]

19

1b) Network routers and switches
ESnet5 routing (IP layer 3) is provided by Alcatel-Lucent 7750 routers with 100 Gb/s client interfaces
– 17 routers with 100G interfaces
• several more in a test environment
– 59 layer-3 100 GigE interfaces; 8 customer-owned 100G routers
– 7 100G interconnects with other R&E networks at Starlight (Chicago), MAN LAN (New York), and Sunnyvale (San Francisco)

20

The Energy Sciences Network ESnet5 (Fall 2013)

[Map: the ESnet5 national footprint, showing metro area circuits; ESnet routers and site routers; DOE labs and other sites (PNNL, INL, LLNL, SNLL, LBNL, JGI, SLAC, NERSC, GA, SDSC, LANL, SNLA, FNAL, ANL, AMES, ORNL, PPPL, PU Physics, BNL, JLAB, MIT/PSFC, LIGO, NREL, SREL); link capacities of 100G, 10-40G, and 1G or site-provided circuits; optical-only segments; the SUNN-STAR-AOFA 100G testbed (SF Bay Area, Chicago, New York, Amsterdam); and commercial, US R&E, and international R&E peerings. Geographical representation is approximate.]

2) Data transport: the limitations of TCP must be addressed for large, long-distance flows
Although there are other transport protocols available, TCP remains the workhorse of the Internet, including for data-intensive science
Using TCP to support the sustained, long distance, high data-rate flows of data-intensive science requires an error-free network
Why error-free?
TCP is a "fragile workhorse": it is very sensitive to packet loss (due to bit errors)
– Very small packet loss rates on these paths result in large decreases in performance
– A single bit error will cause the loss of a 1-9 KBy packet (depending on the MTU size), as there is no FEC at the IP level for error correction
• This puts TCP back into "slow start" mode, thus reducing throughput

22

Transport
• The reason for TCP's sensitivity to packet loss is the slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internet
– Packet loss is seen by TCP's congestion control algorithms as evidence of congestion, so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion, leading to network throughput collapse)
– Network link errors also cause packet loss, so these congestion avoidance algorithms come into play, with dramatic effect on throughput in the wide area network – hence the need for "error-free"

23

Transport: Impact of packet loss on TCP
On a 10 Gb/s LAN path, the impact of low packet loss rates is minimal
On a 10 Gb/s WAN path, the impact of low packet loss rates is enormous (~80x throughput reduction on a transatlantic path)
Implications: error-free paths are essential for high-volume, long-distance data transfers

[Chart: throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss. The y-axis is throughput in Mb/s (0–10,000); the x-axis is network round trip time in ms (the right end corresponds roughly to San Francisco to London). Curves: no packet loss, Reno (measured), Reno (theory), and H-TCP (measured). See http://fasterdata.es.net/performance-testing/perfsonar/troubleshooting/packet-loss]
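One hedged way to see why the WAN penalty is so severe (not from the slides; the classic Mathis et al. model of steady-state TCP Reno throughput, which ignores many real-world details) is to plug the chart's loss rate into rate ≈ (MSS/RTT) · 1.22/√p:

    import math

    # Mathis et al. model of steady-state TCP Reno throughput:
    #   rate ~ (MSS / RTT) * 1.22 / sqrt(loss_probability)
    # A back-of-the-envelope check, not the slide's measured data: it shows why
    # 0.0046% loss barely matters at LAN latency but cripples a long-RTT path.
    def reno_gbps(rtt_s, loss=4.6e-5, mss_bytes=1460):
        return (mss_bytes * 8 / rtt_s) * 1.22 / math.sqrt(loss) / 1e9

    print(f"LAN, 0.2 ms RTT: {min(reno_gbps(0.0002), 10):.1f} Gb/s (link-limited)")
    print(f"WAN,  88 ms RTT: {reno_gbps(0.088):.3f} Gb/s")   # a few tens of Mb/s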

24

Transport: Modern TCP stack
• A modern TCP stack (the kernel implementation of the TCP protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])
– This is done using mechanisms that more quickly increase back to full speed after an error forces a reset to low bandwidth

[Chart: "Binary Increase Congestion" (BIC) control algorithm impact. TCP results at RTT = 67 ms, plotted in Mbits/second (0–800) over 5-second time slots, for Linux 2.6 with BIC TCP, Linux 2.4, and Linux 2.6 with BIC off. BIC reaches maximum throughput much faster than the older algorithms. (From Linux 2.6.19, the default is CUBIC, a refined version of BIC designed for high bandwidth, long paths.)]
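For illustration (my sketch, Linux-specific, not something shown on the slides), a sender can also select the congestion control algorithm per socket, provided the kernel has the corresponding module available:

    import socket

    # Linux-specific sketch: pick a modern congestion control algorithm per socket.
    # Requires the chosen module (e.g. "cubic" or "htcp") to be present in the kernel;
    # system-wide defaults are normally set via sysctl instead.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"cubic")
    print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16))  # b'cubic...'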

25

Transport: Modern TCP stack
Even modern TCP stacks are only of some help in the face of packet loss on a long-path, high-speed network
• For a detailed analysis of the impact of packet loss on various TCP implementations, see "An Investigation into Transport Protocols and Data Transport Applications Over High Performance Networks", chapter 8 ("Systematic Tests of New-TCP Behaviour"), by Yee-Ting Li, University College London (PhD thesis), http://www.slac.stanford.edu/~ytl/thesis.pdf

[Chart: throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss (tail zoom, 0–1000 Mb/s). The x-axis is round trip time in ms (corresponds roughly to San Francisco to London). Curves: Reno (measured), Reno (theory), and H-TCP (CUBIC refinement, measured).]

26

3) Monitoring and testing
The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction
perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.)
• perfSONAR is a community effort to:
– define network management data exchange protocols, and
– standardize measurement data formats, gathering, and archiving
perfSONAR is deployed extensively throughout LHC-related networks and international networks, and at the end sites (see [fasterdata], [perfSONAR], and [NetSrv])
– There are now more than 1000 perfSONAR boxes installed in N. America and Europe

27

perfSONAR
The test and monitor functions can detect soft errors that limit throughput and can be hard to find (hard errors/faults are easily found and corrected)
Soft failure example:
• Observed end-to-end performance degradation due to soft failure of a single optical line card
[Chart: one month of throughput in Gb/s showing normal performance, then degrading performance, then a return to normal after the repair.]
• Why not just rely on "SNMP" interface stats for this sort of error detection?
• not all error conditions show up in SNMP interface statistics
• SNMP error statistics can be very noisy
• some devices lump different error counters into the same bucket, so it can be very challenging to figure out what errors to alarm on and what errors to ignore
• though ESnet's Spectrum monitoring system attempts to apply heuristics to do this
• many routers will silently drop packets – the only way to find that is to test through them and observe loss using devices other than the culprit device

28

perfSONAR
The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains
It provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe
– ESnet has perfSONAR testers installed at every PoP and all but the smallest user sites
– Internet2 is close to the same
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages

29

4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network
• Host TCP tuning
• Modern TCP stack (see above)
• Other issues (MTU, etc.)
• Data transfer tools and parallelism
• Other data transfer issues (firewalls, etc.)

30

4.1) System software tuning: Host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using, end-to-end
Default TCP buffer sizes are typically much too small for today's high speed networks
– Until recently, default TCP send/receive buffers were typically 64 KB
– Tuned buffer to fill a CA-to-NY 1 Gb/s path: 10 MB
• 150x bigger than the default buffer size
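The required buffer is simply the bandwidth-delay product of the path. A minimal worked example (assuming a ~80 ms coast-to-coast round trip time, which is not stated on the slide) reproduces the 10 MB figure:

    # Bandwidth-delay product: the TCP buffer needed to keep a path full.
    # Example: the ~1 Gb/s California-to-New York path mentioned above.
    bandwidth_bps = 1e9      # 1 Gb/s
    rtt_s = 0.08             # assumption: ~80 ms coast-to-coast round trip time
    bdp_bytes = bandwidth_bps / 8 * rtt_s
    print(f"{bdp_bytes / 1e6:.0f} MB")   # ~10 MB, vs. the historical 64 KB default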

31

System software tuning: Host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications
– How to tune is a function of the application and the path to the destination, so potentially a lot of special cases
Auto-tuning TCP connection buffer size within pre-configured limits helps
Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths

32

System software tuning: Host tuning – TCP

[Chart: throughput out to ~9000 km on a 10 Gb/s network, 32 MBy (auto-tuned) vs. 64 MBy (hand-tuned) TCP window size. The y-axis is throughput in Mb/s (0–10,000); the x-axis is round trip time in ms (path length; corresponds roughly to San Francisco to London). Curves: hand tuned to a 64 MBy window, and auto-tuned to a 32 MBy window.]
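The underlying relation is simply throughput ≈ window / RTT (ignoring loss). A small illustrative calculation, with RTT values chosen by me rather than read off the chart, shows why the larger hand-tuned window wins on long paths:

    # Window-limited TCP throughput: rate = window / RTT (ignoring loss).
    # Illustrates why a 32 MB auto-tuning cap runs out of headroom on long paths.
    for window_mb in (32, 64):
        for rtt_ms in (20, 88, 150):
            gbps = window_mb * 1e6 * 8 / (rtt_ms / 1e3) / 1e9
            print(f"{window_mb} MB window, {rtt_ms:3d} ms RTT: {gbps:4.1f} Gb/s")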

33

4.2) System software tuning: Data transfer tools
Parallelism is key in data transfer tools (see the sketch after this list)
– It is much easier to achieve a given performance level with multiple parallel connections than with one connection
• this is because the OS is very good at managing multiple threads and less good at sustained, maximum performance of a single thread (the same is true for disks)
– Several tools offer parallel transfers (see below)
Latency tolerance is critical
– Wide area data transfers have much higher latency than LAN transfers
– Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds)
examples: SCP/SFTP and HPSS mover protocols work very poorly in long-path networks
• Disk performance
– In general, a RAID array or parallel disks (as in FDT) are needed to get more than about 500 Mb/s
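As a minimal sketch of the parallelism idea (my illustration, not any of the tools named on the next slide; the URL and chunk layout are hypothetical, and the server must support HTTP range requests), split a file into byte ranges and fetch them over several TCP connections at once:

    import concurrent.futures
    import urllib.request

    URL = "https://data.example.org/dataset.bin"   # hypothetical data source
    CHUNK = 64 * 1024 * 1024                       # 64 MB per range request
    STREAMS = 8                                    # number of parallel TCP connections

    def fetch_range(offset):
        # Request one byte range of the file; each call uses its own connection.
        req = urllib.request.Request(
            URL, headers={"Range": f"bytes={offset}-{offset + CHUNK - 1}"})
        with urllib.request.urlopen(req) as resp:
            return offset, resp.read()

    def parallel_download(total_size, out_path):
        offsets = range(0, total_size, CHUNK)
        with open(out_path, "wb") as out, \
             concurrent.futures.ThreadPoolExecutor(max_workers=STREAMS) as pool:
            for offset, data in pool.map(fetch_range, offsets):
                out.seek(offset)      # reassemble chunks at their original offsets
                out.write(data)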

34

System software tuning: Data transfer tools
Using the right tool is very important
Sample results, Berkeley, CA to Argonne, IL (RTT = 53 ms, network capacity = 10 Gbps):
• scp: 140 Mbps
• patched scp (HPN): 1.2 Gbps
• ftp: 1.4 Gbps
• GridFTP, 4 streams: 5.4 Gbps
• GridFTP, 8 streams: 6.6 Gbps
Note that to get more than about 1 Gbps (125 MB/s) disk to disk requires using RAID technology
• PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSH
– http://www.psc.edu/networking/projects/hpn-ssh
– Significant performance increase
• this helps rsync too

35

System software tuning: Data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems
Parallel streams, buffer tuning, and help in getting through firewalls (open ports), ssh, etc.
The newer Globus Online incorporates all of these, plus small file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP

36

System software tuning: Data transfer tools
Also see Caltech's FDT (Faster Data Transfer) approach
– Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node
– Explicit parallel use of multiple disks
– Can fill 100 Gb/s paths
– See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT

37

4.4) System software tuning: Other issues
Firewalls are anathema to high-speed data flows
– many firewalls can't handle >1 Gb/s flows
• designed for large numbers of low bandwidth flows
• some firewalls even strip out TCP options that allow for TCP buffers > 64 KB
See Jason Zurawski's "Say Hello to your Frienemy – The Firewall"
Stateful firewalls have inherent problems that inhibit high throughput
• http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues
– Large MTUs (several issues)
– NIC tuning
• Defaults are usually fine for 1GE, but 10GE often requires additional tuning
– Other OS tuning knobs
– See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])

5) Site infrastructure to support data-intensive science: The Science DMZ
With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck
The site network (LAN) typically provides connectivity for local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science
– Therefore, a high performance interface between the wide area network and the local area site network is critical for large-scale data movement
Campus network infrastructure is typically not designed to handle the flows of large-scale science
– The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows
• firewalls, proxy servers, low-cost switches, and so forth
• none of which will allow high volume, high bandwidth, long distance data flows

39

The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large data volume, high round trip time (RTT) (international paths) of the wide area network (WAN) flows (see [DIS])
– otherwise the site will impose poor performance on the entire high speed data path, all the way back to the source

40

The Science DMZ
The Science DMZ concept:
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy:
Outside the site firewall – hence the term "Science DMZ"
With dedicated systems built and tuned for wide-area data transfer
With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below)
A security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.)
This is so important that it was a requirement for the last round of NSF CC-NIE grants

41

The Science DMZ

[Diagram: the Science DMZ architecture. The WAN connects through the border router to a Science DMZ router/switch (a WAN-capable device), providing a clean, high-bandwidth WAN data path to dedicated systems built and tuned for wide-area data transfer (high performance Data Transfer Nodes and a computing cluster), with network monitoring and testing and per-service security policy control points in the Science DMZ. Campus/site access to Science DMZ resources is via the site firewall, which also fronts the secured campus/site access to the Internet, the campus/site LAN, and the site DMZ (Web, DNS, Mail).]

(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)

42

6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites
In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers
• The Tier 2 sites get a comparable amount of data from the Tier 1s
– Host the physics groups that analyze the data and do the science
– Provide most of the compute resources for analysis
– Cache the data (though this is evolving to remote I/O)

43

Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management
– The resources and data movement are centrally managed
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations
– The system manages 10s of thousands of jobs a day
• coordinates data movement of hundreds of terabytes/day, and
• manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial
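To make the "pilot job" idea on the next slide concrete, here is a deliberately simplified, hypothetical sketch of the pull-based pattern: a pilot starts under the local batch system and then repeatedly asks a central server for matched work. None of the names or payload formats here are the real PanDA protocol.

    import time

    # Hypothetical sketch of the pull-based "pilot job" pattern (not the real
    # PanDA protocol): the pilot is launched by the local batch system, then
    # pulls work from the central task buffer until told to stop.
    def run_pilot(fetch_job, execute, report, poll_interval_s=30):
        while True:
            job = fetch_job()                # ask the central server for work matched to this site
            if job is None:
                time.sleep(poll_interval_s)  # nothing available yet; poll again later
                continue
            if job.get("shutdown"):
                break                        # central manager told the pilot to exit
            result = execute(job)            # run the analysis payload locally
            report(job["id"], result)        # send status / output location back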

44

The ATLAS PanDA "Production and Distributed Analysis" system uses distributed resources and layers of automation to manage several million jobs/day

[Diagram of the PanDA system. ATLAS production jobs, regional production jobs, and user/group analysis jobs enter the PanDA Server (task management) at CERN, which comprises a Task Buffer (job queue), Job Broker, Job Dispatcher, Data Service, and policy (job type, priority), with site status fed by a Site Capability Service. The Distributed Data Manager (a complex system in its own right, called DQ2) locates data and moves it between sites via DDM agents. Pilot jobs (PanDA job receivers running under the site-specific job manager, e.g. Condor, LSF, LCG; similar to the Condor glide-in approach) are dispatched by a grid scheduler when resources are available at the ATLAS analysis sites (e.g. 70 Tier 2 centers in Europe, North America, and SE Asia) and at the ATLAS Tier 1 data centers (11 sites scattered across Europe, North America, and Asia, which in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis). The CERN ATLAS detector feeds the Tier 0 Data Center (1 copy of all data – archival only).
The flow: 1) PanDA schedules jobs and initiates data movement; 2) DDM locates data and moves it to sites; 3) the pilot prepares the local resources to receive PanDA jobs; 4) jobs are dispatched when there are resources available and when the required data is in place at the site. The strategy: try to move the job to where the data is, else move data and job to where resources are available.]

Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)

45

Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC

[Charts: accumulated data volume on disk, rising to ~150 Petabytes over four years at 730 TBytes/day, and the counts of the two PanDA job types (scales of 0–100,000 and 0–50,000 simultaneous jobs) over one-year windows.]

46

Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges"
– Successful testing was required for sites to participate in LHC production

47

Ramp-up of LHC traffic in ESnet

[Chart: ESnet traffic over time, annotated with an estimate of "small" scale traffic, the LHC data system testing period, LHC turn-on, and LHC operation. The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.]

48

6 cont.) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN
– The LHCOPN is a collection of leased 10 Gb/s optical circuits
The role of the LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community
• The issues related to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance
The security issues were the primary ones, and were addressed by:
• Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
– that is, only LHC data and compute servers are connected to the OPN

50

The LHC OPN – Optical Private Network

[Diagrams: the LHCOPN physical topology (abbreviated) and the LHCOPN architecture, connecting CH-CERN to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]

51

The LHC OPN – Optical Private Network
NB:
• In 2005, the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits
Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below)
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
– (there are about 170 Tier 2 sites)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose

53

The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way, the LHC traffic will use circuits designated by the network engineers
– To ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC

54

LHCONE: a global infrastructure for the LHC Tier 1 data center – Tier 2 analysis center connectivity

[Map (April 2012) of the LHCONE VRF domains and their end sites: ESnet and Internet2 in the USA (BNL-T1, FNAL-T1, Harvard, MIT, Caltech, UFlorida, UNeb, PurU, UCSD, UWisc, UltraLight, UMich, SLAC, and regional nexus points); CANARIE in Canada (TRIUMF-T1, UVic, SimFraU, UAlb, UTor, McGill); GÉANT in Europe, fronting NORDUnet (NDGF-T1), DFN (DESY, GSI, DE-KIT-T1), GARR (INFN-Nap, CNAF-T1), RedIRIS (PIC-T1), SARA (NIKHEF-T1), and RENATER (GRIF-IN2P3, CC-IN2P3-T1, Sub-IN2P3, CEA); CERN (CERN-T1); CUDI in Mexico (UNAM); ASGC and TWAREN in Taiwan (ASGC-T1, NCU, NTU); KERONET2 and KISTI in Korea (KNU); and TIFR in India. End sites are LHC Tier 2 or Tier 3 unless indicated as Tier 1; regional R&E communication nexus points interconnect the VRF domains with data communication links of 10, 20, and 30 Gb/s. See http://lhcone.net for details.]

55

The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance
• See lhcone.net

LHCONE is one part of the network infrastructure that supports the LHC

CERN → T1 distances (miles, km):
France           350    565
Italy            570    920
UK               625   1000
Netherlands      625   1000
Germany          700   1185
Spain            850   1400
Nordic          1300   2100
USA – New York  3900   6300
USA – Chicago   4400   7100
Canada – BC     5200   8400
Taiwan          6100   9850

[Diagram: A Network Centric View of the LHC (as shown earlier): detector → Level 1 and 2 triggers (O(1-10) meters, 1 PB/s) → Level 3 trigger (O(10-100) meters) → CERN Computer Center (O(1) km) → the LHC Optical Private Network (LHCOPN) at 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS), 500-10,000 km → LHC Tier 1 Data Centers (Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN) → the LHC Open Network Environment (LHCONE) → LHC Tier 2 Analysis Centers (universities / physics groups), indicating that the physics groups now get their data wherever it is most readily available.]

57

7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to:
– Couple existing pockets of code, data, and expertise into "systems of systems"
– Break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
– See https://www.es.net/about/science-requirements/
A commonly identified need to support this is that networking must be provided as a "service":
Schedulable with guaranteed bandwidth – as is done with CPUs and disks
– Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
– Some network path characteristics may also be specified – e.g. diversity
– Available in the Web Services / Grid Services paradigm

58

Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
– This is typically done by using a "static" routing mechanism
• e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
– MPLS and OpenFlow are examples of this, and both can transport IP packets
– Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
– The virtual circuits can be directed to specific physical network paths when they are set up

59

Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead, Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service" in TERENA Networking Conference 2011, in the references
• OSCARS received a 2013 "R&D 100" award

60

End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part
• How are the circuits used?
– End system to end system, IP
• Almost never – very hard unless private address space is used
– Using public address space can result in leaking routes
– Using private address space with multi-homed hosts risks allowing backdoors into secure networks
– End system to end system, Ethernet (or other) over VLAN – a pseudowire
• Relatively common
• Interesting example: RDMA over VLAN is likely to be popular in the future
– The SC11 demo of 40G RDMA over WAN was very successful
– CPU load for RDMA is a small fraction of that of IP
– The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
– Point-to-point connection between routing instances – e.g. BGP at the end points
• Essentially this is how all current circuits are used, from one site router to another site router
– Typically site-to-site, or to advertise subnets that host clusters, e.g. LHC analysis or data management clusters

61

End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Service
• Large-scale science always involves institutions in multiple network domains (administrative units)
– For a circuit service to be useful, it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
– e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains

63

Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains

[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] through ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local InterDomain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GEANT – and the IDCs exchange topology and pass the VC setup request from domain to domain, with a data plane connection helper at each domain ingress/egress point.
1) The domains exchange topology information containing at least potential VC ingress and egress points. 2) A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved. 3) The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process.]

64

Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net

65

8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
– With each generation of network transport technology:
• 155 Mb/s was the norm for high speed networks in 1995
• 100 Gb/s – 650 times greater – is the norm today
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to:
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
• and then do the development necessary for applications to make use of the new capabilities
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s

66

Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical
Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone
The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations

67

The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations

68

The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments
But once this is done, international high-speed data management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

69

Infrastructure Critical to Science
• The combination of:
– New network architectures in the wide area
– New network services (such as guaranteed bandwidth virtual circuits)
– Cross-domain network error detection and correction
– Redesigning the site LAN to handle high data throughput
– Automation of data movement systems
– Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKA
The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
– A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository:
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
• high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigate against a single large data center

72

LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites

73

LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply:
There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
• It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE

74

LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded:
All high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
Re-engineering the site LAN/WAN architecture is critical: the Science DMZ
Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on

75

The Message
Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments
But once this is done, international high-speed data management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet, Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References

[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010.

(may be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

Page 2: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

2

Data-Intensive Science in DOErsquos Office of ScienceThe US Department of Energyrsquos Office of Science (ldquoSCrdquo)

supports about half of all civilian RampD in the US with about $5Byear in funding (with the National Science Foundation (NSF) funding the other half)ndash Funds some 25000 PhDs and PostDocs in the university

environmentndash Operates ten National Laboratories and dozens of major

scientific user facilities such as synchrotron light sources neutron sources particle accelerators electron and atomic force microscopes supercomputer centers etc that are all available to the US and Global science research community and many of which generate massive amounts of data and involve large distributed collaborations

ndash Supports global large-scale science collaborations such as the LHC at CERN and the ITER fusion experiment in France

ndash wwwsciencedoegov

3

DOE Office of Science and ESnet ndash the ESnet Mission ESnet - the Energy Sciences Network - is an

SC program whose primary mission is to enable the large-scale science of the Office of Science that depends onndash Multi-institution world-wide collaboration Data mobility sharing of massive amounts of datandash Distributed data management and processingndash Distributed simulation visualization and

computational steeringndash Collaboration with the US and International

Research and Education community ldquoEnabling large-scale sciencerdquo means ensuring

that the network can be used effectively to provide all mission required access to data and computing

bull ESnet connects the Office of Science National Laboratories and user facilities to each other and to collaborators worldwidendash Ames Argonne Brookhaven Fermilab

Lawrence Berkeley Oak Ridge Pacific Northwest Princeton Plasma Physics SLAC and Thomas Jefferson National Accelerator Facilityand embedded and detached user facilities

4

HEP as a Prototype for Data-Intensive ScienceThe history of high energy physics (HEP) data management

and analysis anticipates many other science disciplines Each new generation of experimental science requires more complex

instruments to ferret out more and more subtle aspects of the science As the sophistication size and cost of the instruments increase the

number of such instruments becomes smaller and the collaborations become larger and more widely distributed ndash and mostly international

ndash These new instruments are based on increasingly sophisticated sensors which now are largely solid-state devices akin to CCDs

bull In many ways the solid-state sensors follow Moorersquos law just as computer CPUs do The number of transistors doubles per unit area of silicon every 18 mo and therefore the amount of data coming out doubles per unit area

ndash the data output of these increasingly sophisticated sensors has increased exponentially

bull Large scientific instruments only differ from CPUs in that the time between science instrument refresh is more like 10-20 years and so the increase in data volume from instrument to instrument is huge

5

HEP as a Prototype for Data-Intensive Science

Data courtesy of Harvey Newman Caltech and Richard Mount SLAC and Belle II CHEP 2012 presentation

HEP data volumes for leading experimentswith Belle-II estimates

LHC down for upgrade

6

HEP as a Prototype for Data-Intensive Sciencebull What is the significance to the network of this increase in databull Historically the use of the network by science has tracked the

size of the data sets used by science

ldquoHEP data collectedrdquo 2012 estimate (green line) in previous

slide

7

HEP as a Prototype for Data-Intensive ScienceAs the instrument size and data volume have gone up the

methodology for analyzing the data has had to evolvendash The data volumes from the early experiments were low enough that

the data was analyzed locallyndash As the collaborations grew to several institutions and the data

analysis shared among them the data was distributed by shipping tapes around

ndash As the collaboration sizes grew and became intercontinental the HEP community began to use networks to coordinate the collaborations and eventually to send the data around

The LHC data model assumed network transport of all data from the beginning (as opposed to shipping media)

Similar changes are occurring in most science disciplines

8

HEP as a Prototype for Data-Intensive Sciencebull Two major proton experiments (detectors) at the LHC ATLAS

and CMSbull ATLAS is designed to observe a billion (1x109) collisionssec

with a data rate out of the detector of more than 1000000 Gigabytessec (1 PBys)

bull A set of hardware and software filters at the detector reduce the output data rate to about 25 Gbs that must be transported managed and analyzed to extract the sciencendash The output data rate for CMS is about the same for a combined

50 Gbs that is distributed to physics groups around the world 7x24x~9moyr

The LHC data management model involves a world-wide collection of centers that store manage and analyze the data

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs ndash 8 Pbs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC(one of two detectors)

LHC Tier 0

Taiwan Canada USA-Atlas USA-CMS

Nordic

UKNetherlands Germany Italy

Spain

FranceCERN

Tier 1 centers hold working data

Tape115 PBy

Disk60 PBy

Cores68000

Tier 2 centers are data caches and analysis sites

0

(WLCG

120 PBy

2012)

175000

3 X data outflow vs inflow

10

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe N America and SE Asia generate network

data movement of 730 TByday ~68Gbs

Accumulated data volume on disk

730 TBytesday

PanDA manages 120000ndash140000 simultaneous jobs    (PanDA manages two types of jobs that are shown separately here)

It is this scale of data movementgoing on 24 hrday 9+ monthsyr

that networks must support in order to enable the large-scale science of the LHC

0

50

100

150

Peta

byte

s

four years

0

50000

100000

type

2jo

bs

0

50000

type

1jo

bs

one year

one year

one year

11

HEP as a Prototype for Data-Intensive ScienceThe capabilities required to support this scale of data

movement involve hardware and software developments at all levels1 The underlying network

1a Optical signal transport1b Network routers and switches

2 Data transport (TCP is a ldquofragile workhorserdquo but still the norm)3 Network monitoring and testing4 Operating system evolution5 New site and network architectures6 Data movement and management techniques and software7 New network services

bull Technology advances in these areas have resulted in todayrsquos state-of-the-art that makes it possible for the LHC experiments to routinely and continuously move data at ~150 Gbs across three continents

12

HEP as a Prototype for Data-Intensive Sciencebull ESnet has been collecting requirements for all DOE science

disciplines and instruments that rely on the network for distributed data management and analysis for more than a decade and formally since 2007 [REQ] In this process certain issues are seen across essentially all science

disciplines that rely on the network for significant data transfer even if the quantities are modest compared to project like the LHC experiments

Therefore addressing the LHC issues is a useful exercise that can benefit a wide range of science disciplines

SKA data flow model is similar to the LHCreceptorssensors

correlator data processor

supercomputer

European distribution point

~200km avg

~1000 km

~25000 km(Perth to London via USA)

or~13000 km

(South Africa to London)

Regionaldata center

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

93 ndash 168 Pbs

400 Tbs

100 Gbs

from SKA RFI

Hypothetical(based on the

LHC experience)

These numbers are based on modeling prior to splitting the

SKA between S Africa and Australia)

Regionaldata center

Regionaldata center

14

Foundations of data-intensive sciencebull This talk looks briefly at the nature of the advances in

technologies software and methodologies that have enabled LHC data management and analysis The points 1a and 1b on optical transport and router technology are

included in the slides for completeness but I will not talk about them They were not really driven by the needs of the LHC but they were opportunistically used by the LHC

Much of the reminder of the talk is a tour through ESnetrsquos network performance knowledge base (fasterdataesnet)

ndash Also included arebull the LHC ATLAS data management and analysis approach that generates

and relies on very large network data utilizationbull and an overview of how RampE network have evolved to accommodate the

LHC traffic

1) Underlying network issuesAt the core of our ability to transport the volume of data

that we must deal with today and to accommodate future growth are advances in optical transport technology and

router technology

0

5

10

15

Peta

byte

sm

onth

13 years

We face a continuous growth of data to transport

ESnet has seen exponential growth in our traffic every year since 1990 (our traffic grows by factor of 10 about once every 47 months)

16

We face a continuous growth of data transportbull The LHC data volume is predicated to grow 10 fold over the

next 10 yearsNew generations of instruments ndash for example the Square

Kilometer Array radio telescope and ITER (the international fusion experiment) ndash will generate more data than the LHC

In response ESnet and most large RampE networks have built 100 Gbs (per optical channel) networksndash ESnets new network ndash ESnet5 ndash is complete and provides a 44 x

100Gbs (44 terabitssec - 4400 gigabitssec) in optical channels across the entire ESnet national footprint

ndash Initially one of these 100 Gbs channels is configured to replace the current 4 x 10 Gbs IP network

bull What has made this possible

17

1a) Optical Network TechnologyModern optical transport systems (DWDM = dense wave

division multiplexing) use a collection of technologies called ldquocoherent opticalrdquo processing to achieve more sophisticated optical modulation and therefore higher data density per signal transport unit (symbol) that provides 100Gbs per wave (optical channel)ndash Optical transport using dual polarization-quadrature phase shift keying

(DP-QPSK) technology with coherent detection [OIF1]bull dual polarization

ndash two independent optical signals same frequency orthogonal two polarizations rarr reduces the symbol rate by half

bull quadrature phase shift keying ndash encode data by changing the signal phase of the relative to the optical carrier further reduces the symbol rate by half (sends twice as much data symbol)

Together DP and QPSK reduce required rate by a factor of 4ndash allows 100G payload (plus overhead) to fit into 50GHz of spectrum

bull Actual transmission rate is about 10 higher to include FEC data

ndash This is a substantial simplification of the optical technology involved ndash see the TNC 2013 paper and Chris Tracyrsquos NANOG talk for details [Tracy1] and [Rob1]

18

Optical Network Technology ESnet5rsquos optical network uses Cienarsquos 6500 Packet-Optical Platform with

WaveLogictrade to provide 100Gbs wavendash 88 waves (optical channels) 100Gbs each

bull wave capacity shared equally with Internet2ndash ~13000 miles 21000 km lit fiberndash 280 optical amplifier sitesndash 70 optical adddrop sites (where routers can be inserted)

bull 46 100G adddrop transpondersbull 22 100G re-gens across wide-area

NEWG

SUNN

KANSDENV

SALT

BOIS

SEAT

SACR

WSAC

LOSA

LASV

ELPA

ALBU

ATLA

WASH

NEWY

BOST

SNLL

PHOE

PAIX

NERSC

LBNLJGI

SLAC

NASHCHAT

CLEV

EQCH

STA

R

ANLCHIC

BNL

ORNL

CINC

SC11

STLO

Internet2

LOUI

FNA

L

Long IslandMAN and

ANI Testbed

O

JACKGeography is

only representational

19

1b) Network routers and switchesESnet5 routing (IP layer 3) is provided by Alcatel-Lucent

7750 routers with 100 Gbs client interfacesndash 17 routers with 100G interfaces

bull several more in a test environment ndash 59 layer-3 100GigE interfaces 8 customer-owned 100G routersndash 7 100G interconnects with other RampE networks at Starlight (Chicago)

MAN LAN (New York) and Sunnyvale (San Francisco)

20

Metro area circuits

SNLL

PNNL

MIT

PSFC

AMES

LLNL

GA

JGI

LBNL

SLACNER

SC

ORNL

ANLFNAL

SALT

INL

PU Physics

SUNN

SEAT

STAR

CHIC

WASH

ATLA

HO

US

BOST

KANS

DENV

ALBQ

LASV

BOIS

SAC

R

ELP

A

SDSC

10

Geographical representation is

approximate

PPPL

CH

AT

10

SUNN STAR AOFA100G testbed

SF Bay Area Chicago New York AmsterdamAMST

US RampE peerings

NREL

Commercial peerings

ESnet routers

Site routers

100G

10-40G

1G Site provided circuits

LIGO

Optical only

SREL

100thinsp

Intrsquol RampE peerings

100thinsp

JLAB

10

10100thinsp

10

100thinsp100thinsp

1

10100thinsp

100thinsp1

100thinsp100thinsp

100thinsp

100thinsp

BNL

NEWY

AOFA

NASH

1

LANL

SNLA

10

10

1

10

10

100thinsp

100thinsp

100thinsp10

1010

100thinsp

100thinsp

10

10

100thinsp

100thinsp

100thinsp

100thinsp

100thinsp

100thinsp100thinsp

100thinsp

10

100thinsp

The Energy Sciences Network ESnet5 (Fall 2013)

2) Data transport The limitations of TCP must be addressed for large long-distance flows

Although there are other transport protocols available TCP remains the workhorse of the Internet including for data-

intensive scienceUsing TCP to support the sustained long distance high data-

rate flows of data-intensive science requires an error-free network

Why error-freeTCP is a ldquofragile workhorserdquo It is very sensitive to packet loss (due to bit errors)ndash Very small packet loss rates on these paths result in large decreases

in performance)ndash A single bit error will cause the loss of a 1-9 KBy packet (depending

on the MTU size) as there is no FEC at the IP level for error correctionbull This puts TCP back into ldquoslow startrdquo mode thus reducing throughput

22

Transportbull The reason for TCPrsquos sensitivity to packet loss is that the

slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internetndash Packet loss is seen by TCPrsquos congestion control algorithms as

evidence of congestion so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion leading to network throughput collapse)

ndash Network link errors also cause packet loss so these congestion avoidance algorithms come into play with dramatic effect on throughput in the wide area network ndash hence the need for ldquoerror-freerdquo

23

Transport Impact of packet loss on TCPOn a 10 Gbs LAN path the impact of low packet loss rates is

minimalOn a 10Gbs WAN path the impact of low packet loss rates is

enormous (~80X throughput reduction on transatlantic path)

Implications Error-free paths are essential for high-volume long-distance data transfers

Throughput vs increasing latency on a 10Gbs link with 00046 packet loss

Reno (measured)

Reno (theory)

H-TCP(measured)

No packet loss

(see httpfasterdataesnetperformance-testingperfso

nartroubleshootingpacket-loss)

Network round trip time ms (corresponds roughly to San Francisco to London)

10000

9000

8000

7000

6000

5000

4000

3000

2000

1000

0

Thro

ughp

ut M

bs

24

Transport Modern TCP stackbull A modern TCP stack (the kernel implementation of the TCP

protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])ndash This is done using mechanisms that more quickly increase back to full

speed after an error forces a reset to low bandwidth

TCP Results

0

100

200

300

400

500

600

700

800

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35time slot (5 second intervals)

Mbi

tss

econ

d

Linux 26 BIC TCPLinux 24Linux 26 BIC off

RTT = 67 ms

ldquoBinary Increase Congestionrdquo control algorithm impact

Note that BIC reaches max throughput much faster than older algorithms (from Linux 2619 the

default is CUBIC a refined version of BIC designed for high bandwidth

long paths)

25

Transport Modern TCP stackEven modern TCP stacks are only of some help in the face of

packet loss on a long path high-speed network

bull For a detailed analysis of the impact of packet loss on various TCP implementations see ldquoAn Investigation into Transport Protocols and Data Transport Applications Over High Performance Networksrdquo chapter 8 (ldquoSystematic Tests of New-TCP Behaviourrdquo) by Yee-Ting Li University College London (PhD thesis) httpwwwslacstanfordedu~ytlthesispdf

Reno (measured)

Reno (theory)

H-TCP (CUBIC refinement)(measured)

Throughput vs increasing latency on a 10Gbs link with 00046 packet loss(tail zoom)

Roundtrip time ms (corresponds roughly to San Francisco to London)

1000

900800700600500400300200100

0

Thro

ughp

ut M

bs

26

3) Monitoring and testingThe only way to keep multi-domain international scale networks error-free is to test and monitor continuously

end-to-end to detect soft errors and facilitate their isolation and correction

perfSONAR provides a standardize way to test measure export catalogue and access performance data from many different network domains (service providers campuses etc)

bull perfSONAR is a community effort tondash define network management data exchange protocols andndash standardized measurement data formats gathering and archiving

perfSONAR is deployed extensively throughout LHC related networks and international networks and at the end sites(See [fasterdata] [perfSONAR] and [NetSrv])

ndash There are now more than 1000 perfSONAR boxes installed in N America and Europe

27

perfSONARThe test and monitor functions can detect soft errors that limit

throughput and can be hard to find (hard errors faults are easily found and corrected)

Soft failure examplebull Observed end-to-end performance degradation due to soft failure of single optical line card

Gb

s

normal performance

degrading performance

repair

bull Why not just rely on ldquoSNMPrdquo interface stats for this sort of error detectionbull not all error conditions show up in SNMP interface statisticsbull SNMP error statistics can be very noisybull some devices lump different error counters into the same bucket so it can be very

challenging to figure out what errors to alarm on and what errors to ignorebull though ESnetrsquos Spectrum monitoring system attempts to apply heuristics to do this

bull many routers will silently drop packets - the only way to find that is to test through them and observe loss using devices other than the culprit device

one month

28

perfSONARThe value of perfSONAR increases dramatically as it is

deployed at more sites so that more of the end-to-end (app-to-app) path can characterized across multiple network domains provides the only widely deployed tool that can monitor circuits end-

to-end across the different networks from the US to Europendash ESnet has perfSONAR testers installed at every PoP and all but the

smallest user sites ndash Internet2 is close to the same

bull perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) and the protocol is implemented using SOAP XML messages

29

4) System software evolution and optimizationOnce the network is error-free there is still the issue of

efficiently moving data from the application running on a user system onto the network

bull Host TCP tuningbull Modern TCP stack (see above)bull Other issues (MTU etc)

bull Data transfer tools and parallelism

bull Other data transfer issues (firewalls etc)

30

41) System software tuning Host tuning ndash TCPbull ldquoTCP tuningrdquo commonly refers to the proper configuration of

TCP windowing buffers for the path lengthbull It is critical to use the optimal TCP send and receive socket

buffer sizes for the path (RTT) you are using end-to-endDefault TCP buffer sizes are typically much too small for

todayrsquos high speed networksndash Until recently default TCP sendreceive buffers were typically 64 KBndash Tuned buffer to fill CA to NY 1 Gbs path 10 MB

bull 150X bigger than the default buffer size

31

System software tuning Host tuning ndash TCPbull Historically TCP window size tuning parameters were host-

global with exceptions configured per-socket by applicationsndash How to tune is a function of the application and the path to the

destination so potentially a lot of special cases

Auto-tuning TCP connection buffer size within pre-configured limits helps

Auto-tuning however is not a panacea because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (eg international) paths

32

System software tuning Host tuning ndash TCP

Throughput out to ~9000 km on a 10Gbs network32 MBy (auto-tuned) vs 64 MBy (hand tuned) TCP window size

hand tuned to 64 MBy window

Roundtrip time ms (corresponds roughlyto San Francisco to London)

path length

10000900080007000600050004000300020001000

0

Thro

ughp

ut M

bs

auto tuned to 32 MBy window

33

42) System software tuning Data transfer toolsParallelism is key in data transfer tools

ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection

bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)

ndash Several tools offer parallel transfers (see below)

Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN

transfersndash Many tools and protocols assume latencies typical of a LAN

environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long

path networks

bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more

than about 500 Mbs

34

System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL

RTT = 53 ms network capacity = 10GbpsTool Throughput

bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology

bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase

bull this helps rsync too

35

System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-

performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open

ports) ssh etc The newer Globus Online incorporates all of these and small file

support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community

outside of HEP

36

System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach

ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node

ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and

httpmonalisacernchFDT

37

44) System software tuning Other issuesFirewalls are anathema to high-peed data flows

ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for

TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo

Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf

bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning

bull Defaults are usually fine for 1GE but 10GE often requires additional tuning

ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo

([HPBulk])

5) Site infrastructure to support data-intensive scienceThe Science DMZ

With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the

bottleneckThe site network (LAN) typically provides connectivity for local

resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network

and the local area site network is critical for large-scale data movement

Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks

for business and small data-flow purposes usually donrsquot work for large-scale data flows

bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data

flows

39

The Science DMZTo provide high data-rate access to local resources the site

LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high

speed data path all the way back to the source

40

The Science DMZThe ScienceDMZ concept

The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and

rapid fault isolation typically perfSONAR (see [perfSONAR] and below)

A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)

This is so important it was a requirement for last round of NSF CC-NIE grants

41

The Science DMZ

(See httpfasterdataesnetscience-dmz

and [SDMZ] for a much more complete

discussion of the various approaches)

campus siteLAN

high performanceData Transfer Node

computing cluster

cleanhigh-bandwidthWAN data path

campussiteaccess to

Science DMZresources is via the site firewall

secured campussiteaccess to Internet

border routerWAN

Science DMZrouterswitch

campus site

Science DMZ

Site DMZ WebDNS

Mail

network monitoring and testing

A WAN-capable device

per-servicesecurity policycontrol points

site firewall

dedicated systems built and

tuned for wide-area data transfer

42

6) Data movement and management techniquesAutomated data movement is critical for moving 500

terabytesday between 170 international sites In order to effectively move large amounts of data over the

network automated systems must be used to manage workflow and error recovery

bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers

bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)

43

Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the

analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates

compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day

bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10

petabytes of datayear in order to accomplish its science

bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial

44

DDMAgent

DDMAgent

ATLAS production

jobs

Regional production

jobs

User Group analysis jobs

Data Service

Task Buffer(job queue)

Job Dispatcher

PanDA Server(task management)

Job Broker

Policy(job type priority)

ATLA

S Ti

er 1

Data

Cen

ters

11 s

ites

scat

tere

d ac

ross

Euro

pe N

orth

Am

erica

and

Asia

in

aggr

egat

e ho

ld 1

copy

of a

ll dat

a an

d pr

ovide

the

work

ing

data

set f

or d

istrib

ution

to T

ier 2

cen

ters

for a

nalys

isDistributed

Data Manager

Pilot Job(Panda job

receiver running under the site-

specific job manager)

Grid Scheduler

Site Capability Service

CERNATLAS detector

Tier 0 Data Center(1 copy of all data ndash

archival only)

Job resource managerbull Dispatch a ldquopilotrdquo job manager - a

Panda job receiver - when resources are available at a site

bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA

bull Similar to the Condor Glide-in approach

Site status

ATLAS analysis sites(eg 70 Tier 2 Centers in

Europe North America and SE Asia)

DDMAgent

DDMAgent

1) Schedules jobs initiates data movement

2) DDM locates data and moves it to sites

This is a complex system in its own right called DQ2

3) Prepares the local resources to receive Panda jobs

4) Jobs are dispatched when there are resources available and when the required data is

in place at the site

Thanks to Michael Ernst US ATLAS technical lead for his assistance with this

diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)

The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday

CERN

Try to move the job to where the data is else move data and job to where

resources are available

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe N America and SE Asia generate network

data movement of 730 TByday ~68Gbs

Accumulated data volume on disk

730 TBytesday

PanDA manages 120000ndash140000 simultaneous jobs    (PanDA manages two types of jobs that are shown separately here)

It is this scale of data movementgoing on 24 hrday 9+ monthsyr

that networks must support in order to enable the large-scale science of the LHC

0

50

100

150

Peta

byte

s

four years

0

50000

100000

type

2jo

bs

0

50000

type

1jo

bs

one year

one year

one year

46

Building an LHC-scale production analysis system In order to debug and optimize the distributed system that

accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in

ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC

production

47

Ramp-up of LHC traffic in ESnet

(est of ldquosmallrdquo scale traffic)

LHC

turn

-on

LHC data systemtesting

LHC operationThe transition from testing to operation

was a smooth continuum due toat-scale testing ndash a process that took

more than 5 years

48

6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument

to data centers ndash a dedicated purpose-built infrastructure is needed

bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to

the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the

Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward

exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community

bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by

bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])

ndash that is only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (the throughput sketch below shows why)
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical: the Science DMZ
Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
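Why the "error-free" requirement is so stringent can be quantified with the well-known Mathis et al. bound for standard (Reno-style) TCP: throughput ≤ (MSS/RTT) × (1.22/√loss). The sketch below is illustrative only (the MSS, RTTs, and loss rate are assumptions, not slide data); it shows that a loss rate that is negligible on a LAN limits a single stream on a transatlantic-scale RTT to a few tens of Mb/s.

```python
# Back-of-the-envelope use of the Mathis et al. bound for standard (Reno-style) TCP:
#   throughput <= (MSS / RTT) * (C / sqrt(loss)),  with C ~ 1.22
# The MSS, RTTs, and loss rate below are illustrative assumptions, not slide data.
from math import sqrt

def mathis_throughput_gbps(mss_bytes=1500, rtt_s=0.150, loss_rate=1e-5, c=1.22):
    """Upper bound on single-stream TCP throughput, in Gb/s."""
    bps = (mss_bytes * 8 / rtt_s) * (c / sqrt(loss_rate))
    return bps / 1e9

# LAN-like path (1 ms RTT) vs. transatlantic-scale path (~150 ms RTT), same tiny loss rate.
for rtt in (0.001, 0.150):
    bound = mathis_throughput_gbps(rtt_s=rtt)
    print(f"RTT {rtt * 1000:5.0f} ms, loss 1e-5 -> at most {bound:6.2f} Gb/s per stream")
```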

75

The Message, Again …
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies, and much of the knowledge, from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References

[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0, 100G Ultra Long Haul DWDM Framework Document (June 2009), http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References

[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References

[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., and Laperle, C., Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

Page 3: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

3

DOE Office of Science and ESnet ndash the ESnet Mission ESnet - the Energy Sciences Network - is an

SC program whose primary mission is to enable the large-scale science of the Office of Science that depends onndash Multi-institution world-wide collaboration Data mobility sharing of massive amounts of datandash Distributed data management and processingndash Distributed simulation visualization and

computational steeringndash Collaboration with the US and International

Research and Education community ldquoEnabling large-scale sciencerdquo means ensuring

that the network can be used effectively to provide all mission required access to data and computing

bull ESnet connects the Office of Science National Laboratories and user facilities to each other and to collaborators worldwidendash Ames Argonne Brookhaven Fermilab

Lawrence Berkeley Oak Ridge Pacific Northwest Princeton Plasma Physics SLAC and Thomas Jefferson National Accelerator Facilityand embedded and detached user facilities

4

HEP as a Prototype for Data-Intensive ScienceThe history of high energy physics (HEP) data management

and analysis anticipates many other science disciplines Each new generation of experimental science requires more complex

instruments to ferret out more and more subtle aspects of the science As the sophistication size and cost of the instruments increase the

number of such instruments becomes smaller and the collaborations become larger and more widely distributed ndash and mostly international

ndash These new instruments are based on increasingly sophisticated sensors which now are largely solid-state devices akin to CCDs

bull In many ways the solid-state sensors follow Moorersquos law just as computer CPUs do The number of transistors doubles per unit area of silicon every 18 mo and therefore the amount of data coming out doubles per unit area

ndash the data output of these increasingly sophisticated sensors has increased exponentially

bull Large scientific instruments only differ from CPUs in that the time between science instrument refresh is more like 10-20 years and so the increase in data volume from instrument to instrument is huge

5

HEP as a Prototype for Data-Intensive Science

Data courtesy of Harvey Newman Caltech and Richard Mount SLAC and Belle II CHEP 2012 presentation

HEP data volumes for leading experimentswith Belle-II estimates

LHC down for upgrade

6

HEP as a Prototype for Data-Intensive Sciencebull What is the significance to the network of this increase in databull Historically the use of the network by science has tracked the

size of the data sets used by science

ldquoHEP data collectedrdquo 2012 estimate (green line) in previous

slide

7

HEP as a Prototype for Data-Intensive ScienceAs the instrument size and data volume have gone up the

methodology for analyzing the data has had to evolvendash The data volumes from the early experiments were low enough that

the data was analyzed locallyndash As the collaborations grew to several institutions and the data

analysis shared among them the data was distributed by shipping tapes around

ndash As the collaboration sizes grew and became intercontinental the HEP community began to use networks to coordinate the collaborations and eventually to send the data around

The LHC data model assumed network transport of all data from the beginning (as opposed to shipping media)

Similar changes are occurring in most science disciplines

8

HEP as a Prototype for Data-Intensive Sciencebull Two major proton experiments (detectors) at the LHC ATLAS

and CMSbull ATLAS is designed to observe a billion (1x109) collisionssec

with a data rate out of the detector of more than 1000000 Gigabytessec (1 PBys)

bull A set of hardware and software filters at the detector reduce the output data rate to about 25 Gbs that must be transported managed and analyzed to extract the sciencendash The output data rate for CMS is about the same for a combined

50 Gbs that is distributed to physics groups around the world 7x24x~9moyr

The LHC data management model involves a world-wide collection of centers that store manage and analyze the data

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs ndash 8 Pbs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC(one of two detectors)

LHC Tier 0

Taiwan Canada USA-Atlas USA-CMS

Nordic

UKNetherlands Germany Italy

Spain

FranceCERN

Tier 1 centers hold working data

Tape115 PBy

Disk60 PBy

Cores68000

Tier 2 centers are data caches and analysis sites

0

(WLCG

120 PBy

2012)

175000

3 X data outflow vs inflow

10

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe N America and SE Asia generate network

data movement of 730 TByday ~68Gbs

Accumulated data volume on disk

730 TBytesday

PanDA manages 120000ndash140000 simultaneous jobs    (PanDA manages two types of jobs that are shown separately here)

It is this scale of data movementgoing on 24 hrday 9+ monthsyr

that networks must support in order to enable the large-scale science of the LHC

0

50

100

150

Peta

byte

s

four years

0

50000

100000

type

2jo

bs

0

50000

type

1jo

bs

one year

one year

one year

11

HEP as a Prototype for Data-Intensive ScienceThe capabilities required to support this scale of data

movement involve hardware and software developments at all levels1 The underlying network

1a Optical signal transport1b Network routers and switches

2 Data transport (TCP is a ldquofragile workhorserdquo but still the norm)3 Network monitoring and testing4 Operating system evolution5 New site and network architectures6 Data movement and management techniques and software7 New network services

bull Technology advances in these areas have resulted in todayrsquos state-of-the-art that makes it possible for the LHC experiments to routinely and continuously move data at ~150 Gbs across three continents

12

HEP as a Prototype for Data-Intensive Sciencebull ESnet has been collecting requirements for all DOE science

disciplines and instruments that rely on the network for distributed data management and analysis for more than a decade and formally since 2007 [REQ] In this process certain issues are seen across essentially all science

disciplines that rely on the network for significant data transfer even if the quantities are modest compared to project like the LHC experiments

Therefore addressing the LHC issues is a useful exercise that can benefit a wide range of science disciplines

SKA data flow model is similar to the LHCreceptorssensors

correlator data processor

supercomputer

European distribution point

~200km avg

~1000 km

~25000 km(Perth to London via USA)

or~13000 km

(South Africa to London)

Regionaldata center

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

93 ndash 168 Pbs

400 Tbs

100 Gbs

from SKA RFI

Hypothetical(based on the

LHC experience)

These numbers are based on modeling prior to splitting the

SKA between S Africa and Australia)

Regionaldata center

Regionaldata center

14

Foundations of data-intensive sciencebull This talk looks briefly at the nature of the advances in

technologies software and methodologies that have enabled LHC data management and analysis The points 1a and 1b on optical transport and router technology are

included in the slides for completeness but I will not talk about them They were not really driven by the needs of the LHC but they were opportunistically used by the LHC

Much of the reminder of the talk is a tour through ESnetrsquos network performance knowledge base (fasterdataesnet)

ndash Also included arebull the LHC ATLAS data management and analysis approach that generates

and relies on very large network data utilizationbull and an overview of how RampE network have evolved to accommodate the

LHC traffic

1) Underlying network issuesAt the core of our ability to transport the volume of data

that we must deal with today and to accommodate future growth are advances in optical transport technology and

router technology

0

5

10

15

Peta

byte

sm

onth

13 years

We face a continuous growth of data to transport

ESnet has seen exponential growth in our traffic every year since 1990 (our traffic grows by factor of 10 about once every 47 months)

16

We face a continuous growth of data transportbull The LHC data volume is predicated to grow 10 fold over the

next 10 yearsNew generations of instruments ndash for example the Square

Kilometer Array radio telescope and ITER (the international fusion experiment) ndash will generate more data than the LHC

In response ESnet and most large RampE networks have built 100 Gbs (per optical channel) networksndash ESnets new network ndash ESnet5 ndash is complete and provides a 44 x

100Gbs (44 terabitssec - 4400 gigabitssec) in optical channels across the entire ESnet national footprint

ndash Initially one of these 100 Gbs channels is configured to replace the current 4 x 10 Gbs IP network

bull What has made this possible

17

1a) Optical Network TechnologyModern optical transport systems (DWDM = dense wave

division multiplexing) use a collection of technologies called ldquocoherent opticalrdquo processing to achieve more sophisticated optical modulation and therefore higher data density per signal transport unit (symbol) that provides 100Gbs per wave (optical channel)ndash Optical transport using dual polarization-quadrature phase shift keying

(DP-QPSK) technology with coherent detection [OIF1]bull dual polarization

ndash two independent optical signals same frequency orthogonal two polarizations rarr reduces the symbol rate by half

bull quadrature phase shift keying ndash encode data by changing the signal phase of the relative to the optical carrier further reduces the symbol rate by half (sends twice as much data symbol)

Together DP and QPSK reduce required rate by a factor of 4ndash allows 100G payload (plus overhead) to fit into 50GHz of spectrum

bull Actual transmission rate is about 10 higher to include FEC data

ndash This is a substantial simplification of the optical technology involved ndash see the TNC 2013 paper and Chris Tracyrsquos NANOG talk for details [Tracy1] and [Rob1]

18

Optical Network Technology ESnet5rsquos optical network uses Cienarsquos 6500 Packet-Optical Platform with

WaveLogictrade to provide 100Gbs wavendash 88 waves (optical channels) 100Gbs each

bull wave capacity shared equally with Internet2ndash ~13000 miles 21000 km lit fiberndash 280 optical amplifier sitesndash 70 optical adddrop sites (where routers can be inserted)

bull 46 100G adddrop transpondersbull 22 100G re-gens across wide-area

NEWG

SUNN

KANSDENV

SALT

BOIS

SEAT

SACR

WSAC

LOSA

LASV

ELPA

ALBU

ATLA

WASH

NEWY

BOST

SNLL

PHOE

PAIX

NERSC

LBNLJGI

SLAC

NASHCHAT

CLEV

EQCH

STA

R

ANLCHIC

BNL

ORNL

CINC

SC11

STLO

Internet2

LOUI

FNA

L

Long IslandMAN and

ANI Testbed

O

JACKGeography is

only representational

19

1b) Network routers and switchesESnet5 routing (IP layer 3) is provided by Alcatel-Lucent

7750 routers with 100 Gbs client interfacesndash 17 routers with 100G interfaces

bull several more in a test environment ndash 59 layer-3 100GigE interfaces 8 customer-owned 100G routersndash 7 100G interconnects with other RampE networks at Starlight (Chicago)

MAN LAN (New York) and Sunnyvale (San Francisco)

20

Metro area circuits

SNLL

PNNL

MIT

PSFC

AMES

LLNL

GA

JGI

LBNL

SLACNER

SC

ORNL

ANLFNAL

SALT

INL

PU Physics

SUNN

SEAT

STAR

CHIC

WASH

ATLA

HO

US

BOST

KANS

DENV

ALBQ

LASV

BOIS

SAC

R

ELP

A

SDSC

10

Geographical representation is

approximate

PPPL

CH

AT

10

SUNN STAR AOFA100G testbed

SF Bay Area Chicago New York AmsterdamAMST

US RampE peerings

NREL

Commercial peerings

ESnet routers

Site routers

100G

10-40G

1G Site provided circuits

LIGO

Optical only

SREL

100thinsp

Intrsquol RampE peerings

100thinsp

JLAB

10

10100thinsp

10

100thinsp100thinsp

1

10100thinsp

100thinsp1

100thinsp100thinsp

100thinsp

100thinsp

BNL

NEWY

AOFA

NASH

1

LANL

SNLA

10

10

1

10

10

100thinsp

100thinsp

100thinsp10

1010

100thinsp

100thinsp

10

10

100thinsp

100thinsp

100thinsp

100thinsp

100thinsp

100thinsp100thinsp

100thinsp

10

100thinsp

The Energy Sciences Network ESnet5 (Fall 2013)

2) Data transport The limitations of TCP must be addressed for large long-distance flows

Although there are other transport protocols available TCP remains the workhorse of the Internet including for data-

intensive scienceUsing TCP to support the sustained long distance high data-

rate flows of data-intensive science requires an error-free network

Why error-freeTCP is a ldquofragile workhorserdquo It is very sensitive to packet loss (due to bit errors)ndash Very small packet loss rates on these paths result in large decreases

in performance)ndash A single bit error will cause the loss of a 1-9 KBy packet (depending

on the MTU size) as there is no FEC at the IP level for error correctionbull This puts TCP back into ldquoslow startrdquo mode thus reducing throughput

22

Transportbull The reason for TCPrsquos sensitivity to packet loss is that the

slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internetndash Packet loss is seen by TCPrsquos congestion control algorithms as

evidence of congestion so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion leading to network throughput collapse)

ndash Network link errors also cause packet loss so these congestion avoidance algorithms come into play with dramatic effect on throughput in the wide area network ndash hence the need for ldquoerror-freerdquo

23

Transport Impact of packet loss on TCPOn a 10 Gbs LAN path the impact of low packet loss rates is

minimalOn a 10Gbs WAN path the impact of low packet loss rates is

enormous (~80X throughput reduction on transatlantic path)

Implications Error-free paths are essential for high-volume long-distance data transfers

Throughput vs increasing latency on a 10Gbs link with 00046 packet loss

Reno (measured)

Reno (theory)

H-TCP(measured)

No packet loss

(see httpfasterdataesnetperformance-testingperfso

nartroubleshootingpacket-loss)

Network round trip time ms (corresponds roughly to San Francisco to London)

10000

9000

8000

7000

6000

5000

4000

3000

2000

1000

0

Thro

ughp

ut M

bs

24

Transport Modern TCP stackbull A modern TCP stack (the kernel implementation of the TCP

protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])ndash This is done using mechanisms that more quickly increase back to full

speed after an error forces a reset to low bandwidth

TCP Results

0

100

200

300

400

500

600

700

800

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35time slot (5 second intervals)

Mbi

tss

econ

d

Linux 26 BIC TCPLinux 24Linux 26 BIC off

RTT = 67 ms

ldquoBinary Increase Congestionrdquo control algorithm impact

Note that BIC reaches max throughput much faster than older algorithms (from Linux 2619 the

default is CUBIC a refined version of BIC designed for high bandwidth

long paths)

25

Transport Modern TCP stackEven modern TCP stacks are only of some help in the face of

packet loss on a long path high-speed network

bull For a detailed analysis of the impact of packet loss on various TCP implementations see ldquoAn Investigation into Transport Protocols and Data Transport Applications Over High Performance Networksrdquo chapter 8 (ldquoSystematic Tests of New-TCP Behaviourrdquo) by Yee-Ting Li University College London (PhD thesis) httpwwwslacstanfordedu~ytlthesispdf

Reno (measured)

Reno (theory)

H-TCP (CUBIC refinement)(measured)

Throughput vs increasing latency on a 10Gbs link with 00046 packet loss(tail zoom)

Roundtrip time ms (corresponds roughly to San Francisco to London)

1000

900800700600500400300200100

0

Thro

ughp

ut M

bs

26

3) Monitoring and testingThe only way to keep multi-domain international scale networks error-free is to test and monitor continuously

end-to-end to detect soft errors and facilitate their isolation and correction

perfSONAR provides a standardize way to test measure export catalogue and access performance data from many different network domains (service providers campuses etc)

bull perfSONAR is a community effort tondash define network management data exchange protocols andndash standardized measurement data formats gathering and archiving

perfSONAR is deployed extensively throughout LHC related networks and international networks and at the end sites(See [fasterdata] [perfSONAR] and [NetSrv])

ndash There are now more than 1000 perfSONAR boxes installed in N America and Europe

27

perfSONARThe test and monitor functions can detect soft errors that limit

throughput and can be hard to find (hard errors faults are easily found and corrected)

Soft failure examplebull Observed end-to-end performance degradation due to soft failure of single optical line card

Gb

s

normal performance

degrading performance

repair

bull Why not just rely on ldquoSNMPrdquo interface stats for this sort of error detectionbull not all error conditions show up in SNMP interface statisticsbull SNMP error statistics can be very noisybull some devices lump different error counters into the same bucket so it can be very

challenging to figure out what errors to alarm on and what errors to ignorebull though ESnetrsquos Spectrum monitoring system attempts to apply heuristics to do this

bull many routers will silently drop packets - the only way to find that is to test through them and observe loss using devices other than the culprit device

one month

28

perfSONARThe value of perfSONAR increases dramatically as it is

deployed at more sites so that more of the end-to-end (app-to-app) path can characterized across multiple network domains provides the only widely deployed tool that can monitor circuits end-

to-end across the different networks from the US to Europendash ESnet has perfSONAR testers installed at every PoP and all but the

smallest user sites ndash Internet2 is close to the same

bull perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) and the protocol is implemented using SOAP XML messages

29

4) System software evolution and optimizationOnce the network is error-free there is still the issue of

efficiently moving data from the application running on a user system onto the network

bull Host TCP tuningbull Modern TCP stack (see above)bull Other issues (MTU etc)

bull Data transfer tools and parallelism

bull Other data transfer issues (firewalls etc)

30

41) System software tuning Host tuning ndash TCPbull ldquoTCP tuningrdquo commonly refers to the proper configuration of

TCP windowing buffers for the path lengthbull It is critical to use the optimal TCP send and receive socket

buffer sizes for the path (RTT) you are using end-to-endDefault TCP buffer sizes are typically much too small for

todayrsquos high speed networksndash Until recently default TCP sendreceive buffers were typically 64 KBndash Tuned buffer to fill CA to NY 1 Gbs path 10 MB

bull 150X bigger than the default buffer size

31

System software tuning Host tuning ndash TCPbull Historically TCP window size tuning parameters were host-

global with exceptions configured per-socket by applicationsndash How to tune is a function of the application and the path to the

destination so potentially a lot of special cases

Auto-tuning TCP connection buffer size within pre-configured limits helps

Auto-tuning however is not a panacea because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (eg international) paths

32

System software tuning Host tuning ndash TCP

Throughput out to ~9000 km on a 10Gbs network32 MBy (auto-tuned) vs 64 MBy (hand tuned) TCP window size

hand tuned to 64 MBy window

Roundtrip time ms (corresponds roughlyto San Francisco to London)

path length

10000900080007000600050004000300020001000

0

Thro

ughp

ut M

bs

auto tuned to 32 MBy window

33

42) System software tuning Data transfer toolsParallelism is key in data transfer tools

ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection

bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)

ndash Several tools offer parallel transfers (see below)

Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN

transfersndash Many tools and protocols assume latencies typical of a LAN

environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long

path networks

bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more

than about 500 Mbs

34

System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL

RTT = 53 ms network capacity = 10GbpsTool Throughput

bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology

bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase

bull this helps rsync too

35

System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-

performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open

ports) ssh etc The newer Globus Online incorporates all of these and small file

support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community

outside of HEP

36

System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach

ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node

ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and

httpmonalisacernchFDT

37

44) System software tuning Other issuesFirewalls are anathema to high-peed data flows

ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for

TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo

Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf

bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning

bull Defaults are usually fine for 1GE but 10GE often requires additional tuning

ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo

([HPBulk])

5) Site infrastructure to support data-intensive scienceThe Science DMZ

With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the

bottleneckThe site network (LAN) typically provides connectivity for local

resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network

and the local area site network is critical for large-scale data movement

Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks

for business and small data-flow purposes usually donrsquot work for large-scale data flows

bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data

flows

39

The Science DMZTo provide high data-rate access to local resources the site

LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high

speed data path all the way back to the source

40

The Science DMZThe ScienceDMZ concept

The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and

rapid fault isolation typically perfSONAR (see [perfSONAR] and below)

A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)

This is so important it was a requirement for last round of NSF CC-NIE grants

41

The Science DMZ

(See httpfasterdataesnetscience-dmz

and [SDMZ] for a much more complete

discussion of the various approaches)

campus siteLAN

high performanceData Transfer Node

computing cluster

cleanhigh-bandwidthWAN data path

campussiteaccess to

Science DMZresources is via the site firewall

secured campussiteaccess to Internet

border routerWAN

Science DMZrouterswitch

campus site

Science DMZ

Site DMZ WebDNS

Mail

network monitoring and testing

A WAN-capable device

per-servicesecurity policycontrol points

site firewall

dedicated systems built and

tuned for wide-area data transfer

42

6) Data movement and management techniquesAutomated data movement is critical for moving 500

terabytesday between 170 international sites In order to effectively move large amounts of data over the

network automated systems must be used to manage workflow and error recovery

bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers

bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)

43

Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the

analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates

compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day

bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10

petabytes of datayear in order to accomplish its science

bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial

44

DDMAgent

DDMAgent

ATLAS production

jobs

Regional production

jobs

User Group analysis jobs

Data Service

Task Buffer(job queue)

Job Dispatcher

PanDA Server(task management)

Job Broker

Policy(job type priority)

ATLA

S Ti

er 1

Data

Cen

ters

11 s

ites

scat

tere

d ac

ross

Euro

pe N

orth

Am

erica

and

Asia

in

aggr

egat

e ho

ld 1

copy

of a

ll dat

a an

d pr

ovide

the

work

ing

data

set f

or d

istrib

ution

to T

ier 2

cen

ters

for a

nalys

isDistributed

Data Manager

Pilot Job(Panda job

receiver running under the site-

specific job manager)

Grid Scheduler

Site Capability Service

CERNATLAS detector

Tier 0 Data Center(1 copy of all data ndash

archival only)

Job resource managerbull Dispatch a ldquopilotrdquo job manager - a

Panda job receiver - when resources are available at a site

bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA

bull Similar to the Condor Glide-in approach

Site status

ATLAS analysis sites(eg 70 Tier 2 Centers in

Europe North America and SE Asia)

DDMAgent

DDMAgent

1) Schedules jobs initiates data movement

2) DDM locates data and moves it to sites

This is a complex system in its own right called DQ2

3) Prepares the local resources to receive Panda jobs

4) Jobs are dispatched when there are resources available and when the required data is

in place at the site

Thanks to Michael Ernst US ATLAS technical lead for his assistance with this

diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)

The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday

CERN

Try to move the job to where the data is else move data and job to where

resources are available

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe N America and SE Asia generate network

data movement of 730 TByday ~68Gbs

Accumulated data volume on disk

730 TBytesday

PanDA manages 120000ndash140000 simultaneous jobs    (PanDA manages two types of jobs that are shown separately here)

It is this scale of data movementgoing on 24 hrday 9+ monthsyr

that networks must support in order to enable the large-scale science of the LHC

0

50

100

150

Peta

byte

s

four years

0

50000

100000

type

2jo

bs

0

50000

type

1jo

bs

one year

one year

one year

46

Building an LHC-scale production analysis system In order to debug and optimize the distributed system that

accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in

ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC

production

47

Ramp-up of LHC traffic in ESnet

(est of ldquosmallrdquo scale traffic)

LHC

turn

-on

LHC data systemtesting

LHC operationThe transition from testing to operation

was a smooth continuum due toat-scale testing ndash a process that took

more than 5 years

48

6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument

to data centers ndash a dedicated purpose-built infrastructure is needed

bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to

the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the

Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward

exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community

bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by

bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])

ndash that is only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKA: The similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated/sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.

71

LHC lessons of possible use to the SKA: The lessons
• The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
  – A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
  – However, the technical aspects of building and operating a centralized working data repository:
    • a large mass storage system with very large cache disks in order to satisfy current requests in an acceptable time, and
    • high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
  militate against a single large data center (a rough sizing argument is sketched below).
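A rough sizing argument makes the WAN part of this point concrete. The numbers are illustrative only: the 100 Gb/s figure comes from the hypothetical SKA flow model shown earlier in this material, and the factor of ~3 is the outflow-to-inflow ratio cited for the LHC data centers, not an SKA requirement.

```latex
% Illustrative only: R_in is the hypothetical 100 Gb/s SKA flow; k ~ 3 is the
% outflow-to-inflow ratio observed for the LHC data centers.
R_{\mathrm{in}} \approx 100~\mathrm{Gb/s}, \qquad
R_{\mathrm{out}} \approx k\,R_{\mathrm{in}} \approx 3 \times 100~\mathrm{Gb/s} = 300~\mathrm{Gb/s}
\quad\Rightarrow\quad
R_{\mathrm{WAN}} = R_{\mathrm{in}} + R_{\mathrm{out}} \approx 400~\mathrm{Gb/s}~\text{sustained}
```

Sustaining on the order of 400 Gb/s of WAN I/O at a single site, on top of the cache and mass-storage I/O needed to serve current requests, is exactly the load that the distributed regional-center model spreads across many sites and many trans-ocean links.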

72

LHC lessons of possible use to the SKA
• The LHC model of distributed data (multiple regional centers) has worked well:
  – It decentralizes costs and involves many countries directly in the telescope infrastructure.
  – It divides up the network load, especially on the expensive trans-ocean links.
  – It divides up the cache I/O load across distributed sites.

73

LHC lessons of possible use to the SKA
• Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
  – It might be that in the case of the SKA the T1 links would come to a centralized, distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
  – In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
  – In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.

74

LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (a rule-of-thumb calculation follows below).
  – New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
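The "fragile workhorse" point can be made quantitative with the widely used Mathis et al. rule of thumb for loss-limited, Reno-style TCP throughput. The numbers below are illustrative, using a standard 1460-byte MSS, a representative ~150 ms transatlantic RTT, and the 0.0046% loss rate used in the throughput plots earlier in this material:

```latex
\text{Throughput} \;\lesssim\; \frac{\mathrm{MSS}}{\mathrm{RTT}\,\sqrt{p}}
 \;=\; \frac{1460 \times 8~\text{bits}}{0.15~\mathrm{s} \times \sqrt{4.6\times 10^{-5}}}
 \;\approx\; 11~\mathrm{Mb/s}
```

That is, a loss rate of only 0.0046% reduces a single classic TCP stream on a nominally 10 Gb/s transatlantic path to on the order of 10 Mb/s, which is why error-free paths, continuous monitoring, and modern stacks (H-TCP, CUBIC) are all emphasized.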

• Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.
• Workflow management systems that automate the data movement will have to be designed and tested.
  – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on (a toy version of such a ramp-up test is sketched below).
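As a toy illustration of the "service challenge" idea, the sketch below steps a synthetic memory-to-memory load through increasing throughput targets and reports whether each step was met. It assumes iperf3 is installed and an iperf3 server is running at a test host you control (the hostname shown is hypothetical); a real service challenge would of course exercise the actual storage systems, transfer tools, and workflow software end to end, not just the network.

```python
import json
import subprocess

def measured_gbps(server: str, streams: int, seconds: int = 30) -> float:
    """Run a parallel iperf3 memory-to-memory test and return achieved Gb/s."""
    out = subprocess.run(
        ["iperf3", "-c", server, "-P", str(streams), "-t", str(seconds), "-J"],
        check=True, capture_output=True, text=True,
    ).stdout
    result = json.loads(out)
    return result["end"]["sum_received"]["bits_per_second"] / 1e9

def ramp_up(server: str, targets_gbps=(1, 5, 10, 20, 40)) -> None:
    """Step through increasing targets, adding parallel streams as the target grows."""
    for step, target in enumerate(targets_gbps, start=1):
        streams = max(1, int(target))   # crude heuristic: ~1 stream per Gb/s target
        achieved = measured_gbps(server, streams)
        status = "OK" if achieved >= 0.9 * target else "SHORTFALL"
        print(f"step {step}: target {target} Gb/s, {streams} streams, "
              f"achieved {achieved:.1f} Gb/s [{status}]")
        if status == "SHORTFALL":
            print("  -> stop and debug (loss? buffers? LAN bottleneck?) before scaling up")
            break

if __name__ == "__main__":
    ramp_up("dtn-test.example.org")   # hypothetical test endpoint
```

The design point mirrors the at-scale testing described above: each ramp step must be held and verified before moving to the next, so that problems are found and fixed well before the instrument produces real data.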

75

The Message
• Again … a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

• Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more/

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1–5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base: http://fasterdata.es.net topics
  – Network Architecture, including the Science DMZ model
  – Host Tuning
  – Network Tuning
  – Data Transfer Tools
  – Network Performance Testing
  – With special sections on:
    • Linux TCP Tuning
    • Cisco 6509 Tuning
    • perfSONAR Howto
    • Active perfSONAR Services
    • Globus overview
    • Say No to SCP
    • Data Transfer Nodes (DTN)
    • TCP Issues Explained

• fasterdata.es.net is a community project with contributions from several organizations.
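As a flavor of what the Linux TCP Tuning section covers, here is a sketch of host settings of the kind documented at fasterdata.es.net for long, fast paths. The sysctl key names are real Linux kernel parameters, but the values shown are illustrative only; the site's current recommendations should be used instead of these numbers.

```python
# Illustrative host-tuning values for a 10G-connected data transfer node.
# The sysctl key names are standard Linux kernel parameters; the values are
# examples only, and should be replaced by the current fasterdata.es.net guidance.
# The buffer ceilings need to be on the order of the bandwidth-delay product of
# the longest path the host uses (e.g. 10 Gb/s x 90 ms is roughly 110 MB).
tcp_tuning = {
    "net.core.rmem_max": 67108864,                   # max receive socket buffer (64 MB)
    "net.core.wmem_max": 67108864,                   # max send socket buffer (64 MB)
    "net.ipv4.tcp_rmem": "4096 87380 33554432",      # receive autotuning min/default/max
    "net.ipv4.tcp_wmem": "4096 65536 33554432",      # send autotuning min/default/max
    "net.ipv4.tcp_congestion_control": "htcp",       # a modern stack suited to long fat paths
    "net.ipv4.tcp_mtu_probing": 1,                   # helps when path MTU issues exist
}

# Emit the settings in /etc/sysctl.conf format for review before applying them.
for key, value in tcp_tuning.items():
    print(f"{key} = {value}")
```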

68

The Message
• A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

• Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

69

Infrastructure Critical to Science
• The combination of:
  – new network architectures in the wide area;
  – new network services (such as guaranteed bandwidth virtual circuits);
  – cross-domain network error detection and correction;
  – redesigning the site LAN to handle high data throughput;
  – automation of data movement systems;
  – use of appropriate operating system tuning and data transfer tools;
  now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.

• Other disciplines that involve data-intensive science will face most of these same issues.

70

LHC lessons of possible use to the SKA: the similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.

• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.

• The data is generated/sent to a single location and then distributed to science groups.

• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.

71

LHC lessons of possible use to the SKA: the lessons
• The science data product (the output of the supercomputer center in the SKA case) is likely too large to have the working data set in one location.
  – A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
  – The technical aspects of building and operating a centralized working data repository:
    • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time;
    • high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites;
  militate against a single large data center.
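A rough back-of-envelope sketch of why the WAN and cache load alone argue against a single working-data center. The 100 Gb/s inflow matches the hypothetical figure used elsewhere in this talk, and the number of consuming regions is an illustrative assumption, not an SKA design value.

```python
# Back-of-envelope sizing for a single centralized working-data repository.
# All numbers are illustrative assumptions, not SKA or LHC design figures.
inflow_gbps = 100          # continuous flow arriving from the telescope/processing site
consumer_regions = 5       # regional communities that each pull roughly a full stream
seconds_per_day = 86400

daily_volume_tb = inflow_gbps * 1e9 * seconds_per_day / 8 / 1e12
egress_gbps = inflow_gbps * consumer_regions

print(f"inflow volume: ~{daily_volume_tb:,.0f} TB/day (about 1 PB/day)")
print(f"sustained egress needed from one site: ~{egress_gbps} Gb/s")
# One site would have to cache, serve, and ship all of this continuously;
# regional centers split both the cache I/O load and the expensive
# trans-ocean bandwidth, which is the point of the distributed model.
```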

72

LHC lessons of possible use to the SKA
• The LHC model of distributed data (multiple regional centers) has worked well:
  – It decentralizes costs and involves many countries directly in the telescope infrastructure.
  – It divides up the network load, especially on the expensive trans-ocean links.
  – It divides up the cache I/O load across distributed sites.

73

LHC lessons of possible use to the SKA
• Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply. There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
  • It might be that in the case of the SKA the T1 links would come to a centralized, distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
  • In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).

• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
  – In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.

74

LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded. All high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
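To see why "error-free" matters so much, the classic Mathis et al. estimate of loss-limited throughput for standard TCP (rate is roughly (MSS/RTT) x 1.22/sqrt(p)) can be evaluated for an intercontinental path. The MSS, RTT, and loss rate below are illustrative choices, not measurements from this talk; modern stacks such as H-TCP or CUBIC recover faster than this Reno-style estimate but remain loss-sensitive.

```python
from math import sqrt

def mathis_throughput_mbps(mss_bytes, rtt_s, loss_rate):
    """Loss-limited throughput estimate for standard (Reno-style) TCP.

    Mathis et al.: rate <= (MSS / RTT) * (C / sqrt(p)), with C about 1.22.
    """
    return (mss_bytes * 8 / rtt_s) * (1.22 / sqrt(loss_rate)) / 1e6

# Illustrative inputs: MSS of 1460 bytes (1500-byte MTU), a loss rate of
# 0.0046% (4.6e-5), a 0.2 ms LAN RTT, and a 90 ms intercontinental RTT.
lan = mathis_throughput_mbps(1460, 0.0002, 4.6e-5)
wan = mathis_throughput_mbps(1460, 0.090, 4.6e-5)
print(f"LAN (0.2 ms RTT): ~{lan:,.0f} Mb/s  (above 10G line rate, so the link is the limit)")
print(f"WAN (90 ms RTT):  ~{wan:,.0f} Mb/s  (the same tiny loss rate cripples the long path)")
```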

• Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.

• Workflow management systems that automate the data movement will have to be designed and tested.
  – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" (simulated operation), building up to at-scale data movement well before instrument turn-on.
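As a sketch of what "building up to at-scale data movement" might look like in practice (the phase lengths, rate fractions, and function names here are hypothetical, not the LHC service challenge plan):

```python
# Hypothetical "service challenge" ramp: exercise the automated data movement
# system with synthetic data at an increasing fraction of the design rate.
DESIGN_RATE_GBPS = 100
RAMP_FRACTIONS = [0.10, 0.25, 0.50, 0.75, 1.00]
PHASE_DAYS = 14            # sustain each phase long enough to expose soft failures

def run_phase(target_gbps, days):
    # Placeholder: drive the workflow system with synthetic datasets at target_gbps
    # and record achieved throughput, error rates, and recovery behavior.
    print(f"sustain ~{target_gbps:.0f} Gb/s of synthetic data for {days} days")

for fraction in RAMP_FRACTIONS:
    run_phase(DESIGN_RATE_GBPS * fraction, PHASE_DAYS)
# Only after the full-rate phase runs cleanly would the system be considered
# ready for instrument turn-on.
```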

75

The Message: Again …
• A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

• Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References

[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0, "100G Ultra Long Haul DWDM Framework Document" (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References

[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References

[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

Page 5: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

5

HEP as a Prototype for Data-Intensive Science

Data courtesy of Harvey Newman Caltech and Richard Mount SLAC and Belle II CHEP 2012 presentation

HEP data volumes for leading experimentswith Belle-II estimates

LHC down for upgrade

6

HEP as a Prototype for Data-Intensive Sciencebull What is the significance to the network of this increase in databull Historically the use of the network by science has tracked the

size of the data sets used by science

ldquoHEP data collectedrdquo 2012 estimate (green line) in previous

slide

7

HEP as a Prototype for Data-Intensive ScienceAs the instrument size and data volume have gone up the

methodology for analyzing the data has had to evolvendash The data volumes from the early experiments were low enough that

the data was analyzed locallyndash As the collaborations grew to several institutions and the data

analysis shared among them the data was distributed by shipping tapes around

ndash As the collaboration sizes grew and became intercontinental the HEP community began to use networks to coordinate the collaborations and eventually to send the data around

The LHC data model assumed network transport of all data from the beginning (as opposed to shipping media)

Similar changes are occurring in most science disciplines

8

HEP as a Prototype for Data-Intensive Sciencebull Two major proton experiments (detectors) at the LHC ATLAS

and CMSbull ATLAS is designed to observe a billion (1x109) collisionssec

with a data rate out of the detector of more than 1000000 Gigabytessec (1 PBys)

bull A set of hardware and software filters at the detector reduce the output data rate to about 25 Gbs that must be transported managed and analyzed to extract the sciencendash The output data rate for CMS is about the same for a combined

50 Gbs that is distributed to physics groups around the world 7x24x~9moyr

The LHC data management model involves a world-wide collection of centers that store manage and analyze the data

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs ndash 8 Pbs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC(one of two detectors)

LHC Tier 0

Taiwan Canada USA-Atlas USA-CMS

Nordic

UKNetherlands Germany Italy

Spain

FranceCERN

Tier 1 centers hold working data

Tape115 PBy

Disk60 PBy

Cores68000

Tier 2 centers are data caches and analysis sites

0

(WLCG

120 PBy

2012)

175000

3 X data outflow vs inflow

10

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe N America and SE Asia generate network

data movement of 730 TByday ~68Gbs

Accumulated data volume on disk

730 TBytesday

PanDA manages 120000ndash140000 simultaneous jobs    (PanDA manages two types of jobs that are shown separately here)

It is this scale of data movementgoing on 24 hrday 9+ monthsyr

that networks must support in order to enable the large-scale science of the LHC

0

50

100

150

Peta

byte

s

four years

0

50000

100000

type

2jo

bs

0

50000

type

1jo

bs

one year

one year

one year

11

HEP as a Prototype for Data-Intensive ScienceThe capabilities required to support this scale of data

movement involve hardware and software developments at all levels1 The underlying network

1a Optical signal transport1b Network routers and switches

2 Data transport (TCP is a ldquofragile workhorserdquo but still the norm)3 Network monitoring and testing4 Operating system evolution5 New site and network architectures6 Data movement and management techniques and software7 New network services

bull Technology advances in these areas have resulted in todayrsquos state-of-the-art that makes it possible for the LHC experiments to routinely and continuously move data at ~150 Gbs across three continents

12

HEP as a Prototype for Data-Intensive Sciencebull ESnet has been collecting requirements for all DOE science

disciplines and instruments that rely on the network for distributed data management and analysis for more than a decade and formally since 2007 [REQ] In this process certain issues are seen across essentially all science

disciplines that rely on the network for significant data transfer even if the quantities are modest compared to project like the LHC experiments

Therefore addressing the LHC issues is a useful exercise that can benefit a wide range of science disciplines

SKA data flow model is similar to the LHCreceptorssensors

correlator data processor

supercomputer

European distribution point

~200km avg

~1000 km

~25000 km(Perth to London via USA)

or~13000 km

(South Africa to London)

Regionaldata center

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

93 ndash 168 Pbs

400 Tbs

100 Gbs

from SKA RFI

Hypothetical(based on the

LHC experience)

These numbers are based on modeling prior to splitting the

SKA between S Africa and Australia)

Regionaldata center

Regionaldata center

14

Foundations of data-intensive sciencebull This talk looks briefly at the nature of the advances in

technologies software and methodologies that have enabled LHC data management and analysis The points 1a and 1b on optical transport and router technology are

included in the slides for completeness but I will not talk about them They were not really driven by the needs of the LHC but they were opportunistically used by the LHC

Much of the reminder of the talk is a tour through ESnetrsquos network performance knowledge base (fasterdataesnet)

ndash Also included arebull the LHC ATLAS data management and analysis approach that generates

and relies on very large network data utilizationbull and an overview of how RampE network have evolved to accommodate the

LHC traffic

1) Underlying network issuesAt the core of our ability to transport the volume of data

that we must deal with today and to accommodate future growth are advances in optical transport technology and

router technology

0

5

10

15

Peta

byte

sm

onth

13 years

We face a continuous growth of data to transport

ESnet has seen exponential growth in our traffic every year since 1990 (our traffic grows by factor of 10 about once every 47 months)

16

We face a continuous growth of data transportbull The LHC data volume is predicated to grow 10 fold over the

next 10 yearsNew generations of instruments ndash for example the Square

Kilometer Array radio telescope and ITER (the international fusion experiment) ndash will generate more data than the LHC

In response ESnet and most large RampE networks have built 100 Gbs (per optical channel) networksndash ESnets new network ndash ESnet5 ndash is complete and provides a 44 x

100Gbs (44 terabitssec - 4400 gigabitssec) in optical channels across the entire ESnet national footprint

ndash Initially one of these 100 Gbs channels is configured to replace the current 4 x 10 Gbs IP network

bull What has made this possible

17

1a) Optical Network TechnologyModern optical transport systems (DWDM = dense wave

division multiplexing) use a collection of technologies called ldquocoherent opticalrdquo processing to achieve more sophisticated optical modulation and therefore higher data density per signal transport unit (symbol) that provides 100Gbs per wave (optical channel)ndash Optical transport using dual polarization-quadrature phase shift keying

(DP-QPSK) technology with coherent detection [OIF1]bull dual polarization

ndash two independent optical signals same frequency orthogonal two polarizations rarr reduces the symbol rate by half

bull quadrature phase shift keying ndash encode data by changing the signal phase of the relative to the optical carrier further reduces the symbol rate by half (sends twice as much data symbol)

Together DP and QPSK reduce required rate by a factor of 4ndash allows 100G payload (plus overhead) to fit into 50GHz of spectrum

bull Actual transmission rate is about 10 higher to include FEC data

ndash This is a substantial simplification of the optical technology involved ndash see the TNC 2013 paper and Chris Tracyrsquos NANOG talk for details [Tracy1] and [Rob1]

18

Optical Network Technology ESnet5rsquos optical network uses Cienarsquos 6500 Packet-Optical Platform with

WaveLogictrade to provide 100Gbs wavendash 88 waves (optical channels) 100Gbs each

bull wave capacity shared equally with Internet2ndash ~13000 miles 21000 km lit fiberndash 280 optical amplifier sitesndash 70 optical adddrop sites (where routers can be inserted)

bull 46 100G adddrop transpondersbull 22 100G re-gens across wide-area

NEWG

SUNN

KANSDENV

SALT

BOIS

SEAT

SACR

WSAC

LOSA

LASV

ELPA

ALBU

ATLA

WASH

NEWY

BOST

SNLL

PHOE

PAIX

NERSC

LBNLJGI

SLAC

NASHCHAT

CLEV

EQCH

STA

R

ANLCHIC

BNL

ORNL

CINC

SC11

STLO

Internet2

LOUI

FNA

L

Long IslandMAN and

ANI Testbed

O

JACKGeography is

only representational

19

1b) Network routers and switchesESnet5 routing (IP layer 3) is provided by Alcatel-Lucent

7750 routers with 100 Gbs client interfacesndash 17 routers with 100G interfaces

bull several more in a test environment ndash 59 layer-3 100GigE interfaces 8 customer-owned 100G routersndash 7 100G interconnects with other RampE networks at Starlight (Chicago)

MAN LAN (New York) and Sunnyvale (San Francisco)

20

Metro area circuits

SNLL

PNNL

MIT

PSFC

AMES

LLNL

GA

JGI

LBNL

SLACNER

SC

ORNL

ANLFNAL

SALT

INL

PU Physics

SUNN

SEAT

STAR

CHIC

WASH

ATLA

HO

US

BOST

KANS

DENV

ALBQ

LASV

BOIS

SAC

R

ELP

A

SDSC

10

Geographical representation is

approximate

PPPL

CH

AT

10

SUNN STAR AOFA100G testbed

SF Bay Area Chicago New York AmsterdamAMST

US RampE peerings

NREL

Commercial peerings

ESnet routers

Site routers

100G

10-40G

1G Site provided circuits

LIGO

Optical only

SREL

100thinsp

Intrsquol RampE peerings

100thinsp

JLAB

10

10100thinsp

10

100thinsp100thinsp

1

10100thinsp

100thinsp1

100thinsp100thinsp

100thinsp

100thinsp

BNL

NEWY

AOFA

NASH

1

LANL

SNLA

10

10

1

10

10

100thinsp

100thinsp

100thinsp10

1010

100thinsp

100thinsp

10

10

100thinsp

100thinsp

100thinsp

100thinsp

100thinsp

100thinsp100thinsp

100thinsp

10

100thinsp

The Energy Sciences Network ESnet5 (Fall 2013)

2) Data transport The limitations of TCP must be addressed for large long-distance flows

Although there are other transport protocols available TCP remains the workhorse of the Internet including for data-

intensive scienceUsing TCP to support the sustained long distance high data-

rate flows of data-intensive science requires an error-free network

Why error-freeTCP is a ldquofragile workhorserdquo It is very sensitive to packet loss (due to bit errors)ndash Very small packet loss rates on these paths result in large decreases

in performance)ndash A single bit error will cause the loss of a 1-9 KBy packet (depending

on the MTU size) as there is no FEC at the IP level for error correctionbull This puts TCP back into ldquoslow startrdquo mode thus reducing throughput

22

Transportbull The reason for TCPrsquos sensitivity to packet loss is that the

slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internetndash Packet loss is seen by TCPrsquos congestion control algorithms as

evidence of congestion so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion leading to network throughput collapse)

ndash Network link errors also cause packet loss so these congestion avoidance algorithms come into play with dramatic effect on throughput in the wide area network ndash hence the need for ldquoerror-freerdquo

23

Transport Impact of packet loss on TCPOn a 10 Gbs LAN path the impact of low packet loss rates is

minimalOn a 10Gbs WAN path the impact of low packet loss rates is

enormous (~80X throughput reduction on transatlantic path)

Implications Error-free paths are essential for high-volume long-distance data transfers

Throughput vs increasing latency on a 10Gbs link with 00046 packet loss

Reno (measured)

Reno (theory)

H-TCP(measured)

No packet loss

(see httpfasterdataesnetperformance-testingperfso

nartroubleshootingpacket-loss)

Network round trip time ms (corresponds roughly to San Francisco to London)

10000

9000

8000

7000

6000

5000

4000

3000

2000

1000

0

Thro

ughp

ut M

bs

24

Transport Modern TCP stackbull A modern TCP stack (the kernel implementation of the TCP

protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])ndash This is done using mechanisms that more quickly increase back to full

speed after an error forces a reset to low bandwidth

TCP Results

0

100

200

300

400

500

600

700

800

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35time slot (5 second intervals)

Mbi

tss

econ

d

Linux 26 BIC TCPLinux 24Linux 26 BIC off

RTT = 67 ms

ldquoBinary Increase Congestionrdquo control algorithm impact

Note that BIC reaches max throughput much faster than older algorithms (from Linux 2619 the

default is CUBIC a refined version of BIC designed for high bandwidth

long paths)

25

Transport Modern TCP stackEven modern TCP stacks are only of some help in the face of

packet loss on a long path high-speed network

bull For a detailed analysis of the impact of packet loss on various TCP implementations see ldquoAn Investigation into Transport Protocols and Data Transport Applications Over High Performance Networksrdquo chapter 8 (ldquoSystematic Tests of New-TCP Behaviourrdquo) by Yee-Ting Li University College London (PhD thesis) httpwwwslacstanfordedu~ytlthesispdf

Reno (measured)

Reno (theory)

H-TCP (CUBIC refinement)(measured)

Throughput vs increasing latency on a 10Gbs link with 00046 packet loss(tail zoom)

Roundtrip time ms (corresponds roughly to San Francisco to London)

1000

900800700600500400300200100

0

Thro

ughp

ut M

bs

26

3) Monitoring and testingThe only way to keep multi-domain international scale networks error-free is to test and monitor continuously

end-to-end to detect soft errors and facilitate their isolation and correction

perfSONAR provides a standardize way to test measure export catalogue and access performance data from many different network domains (service providers campuses etc)

bull perfSONAR is a community effort tondash define network management data exchange protocols andndash standardized measurement data formats gathering and archiving

perfSONAR is deployed extensively throughout LHC related networks and international networks and at the end sites(See [fasterdata] [perfSONAR] and [NetSrv])

ndash There are now more than 1000 perfSONAR boxes installed in N America and Europe

27

perfSONARThe test and monitor functions can detect soft errors that limit

throughput and can be hard to find (hard errors faults are easily found and corrected)

Soft failure examplebull Observed end-to-end performance degradation due to soft failure of single optical line card

Gb

s

normal performance

degrading performance

repair

bull Why not just rely on ldquoSNMPrdquo interface stats for this sort of error detectionbull not all error conditions show up in SNMP interface statisticsbull SNMP error statistics can be very noisybull some devices lump different error counters into the same bucket so it can be very

challenging to figure out what errors to alarm on and what errors to ignorebull though ESnetrsquos Spectrum monitoring system attempts to apply heuristics to do this

bull many routers will silently drop packets - the only way to find that is to test through them and observe loss using devices other than the culprit device

one month

28

perfSONARThe value of perfSONAR increases dramatically as it is

deployed at more sites so that more of the end-to-end (app-to-app) path can characterized across multiple network domains provides the only widely deployed tool that can monitor circuits end-

to-end across the different networks from the US to Europendash ESnet has perfSONAR testers installed at every PoP and all but the

smallest user sites ndash Internet2 is close to the same

bull perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) and the protocol is implemented using SOAP XML messages

29

4) System software evolution and optimizationOnce the network is error-free there is still the issue of

efficiently moving data from the application running on a user system onto the network

bull Host TCP tuningbull Modern TCP stack (see above)bull Other issues (MTU etc)

bull Data transfer tools and parallelism

bull Other data transfer issues (firewalls etc)

30

41) System software tuning Host tuning ndash TCPbull ldquoTCP tuningrdquo commonly refers to the proper configuration of

TCP windowing buffers for the path lengthbull It is critical to use the optimal TCP send and receive socket

buffer sizes for the path (RTT) you are using end-to-endDefault TCP buffer sizes are typically much too small for

todayrsquos high speed networksndash Until recently default TCP sendreceive buffers were typically 64 KBndash Tuned buffer to fill CA to NY 1 Gbs path 10 MB

bull 150X bigger than the default buffer size

31

System software tuning Host tuning ndash TCPbull Historically TCP window size tuning parameters were host-

global with exceptions configured per-socket by applicationsndash How to tune is a function of the application and the path to the

destination so potentially a lot of special cases

Auto-tuning TCP connection buffer size within pre-configured limits helps

Auto-tuning however is not a panacea because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (eg international) paths

32

System software tuning Host tuning ndash TCP

Throughput out to ~9000 km on a 10Gbs network32 MBy (auto-tuned) vs 64 MBy (hand tuned) TCP window size

hand tuned to 64 MBy window

Roundtrip time ms (corresponds roughlyto San Francisco to London)

path length

10000900080007000600050004000300020001000

0

Thro

ughp

ut M

bs

auto tuned to 32 MBy window

33

42) System software tuning Data transfer toolsParallelism is key in data transfer tools

ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection

bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)

ndash Several tools offer parallel transfers (see below)

Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN

transfersndash Many tools and protocols assume latencies typical of a LAN

environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long

path networks

bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more

than about 500 Mbs

34

System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL

RTT = 53 ms network capacity = 10GbpsTool Throughput

bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology

bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase

bull this helps rsync too

35

System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-

performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open

ports) ssh etc The newer Globus Online incorporates all of these and small file

support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community

outside of HEP

36

System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach

ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node

ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and

httpmonalisacernchFDT

37

44) System software tuning Other issuesFirewalls are anathema to high-peed data flows

ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for

TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo

Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf

bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning

bull Defaults are usually fine for 1GE but 10GE often requires additional tuning

ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo

([HPBulk])

5) Site infrastructure to support data-intensive scienceThe Science DMZ

With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the

bottleneckThe site network (LAN) typically provides connectivity for local

resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network

and the local area site network is critical for large-scale data movement

Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks

for business and small data-flow purposes usually donrsquot work for large-scale data flows

bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data

flows

39

The Science DMZTo provide high data-rate access to local resources the site

LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high

speed data path all the way back to the source

40

The Science DMZThe ScienceDMZ concept

The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and

rapid fault isolation typically perfSONAR (see [perfSONAR] and below)

A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)

This is so important it was a requirement for last round of NSF CC-NIE grants

41

The Science DMZ

(See httpfasterdataesnetscience-dmz

and [SDMZ] for a much more complete

discussion of the various approaches)

campus siteLAN

high performanceData Transfer Node

computing cluster

cleanhigh-bandwidthWAN data path

campussiteaccess to

Science DMZresources is via the site firewall

secured campussiteaccess to Internet

border routerWAN

Science DMZrouterswitch

campus site

Science DMZ

Site DMZ WebDNS

Mail

network monitoring and testing

A WAN-capable device

per-servicesecurity policycontrol points

site firewall

dedicated systems built and

tuned for wide-area data transfer

42

6) Data movement and management techniquesAutomated data movement is critical for moving 500

terabytesday between 170 international sites In order to effectively move large amounts of data over the

network automated systems must be used to manage workflow and error recovery

bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers

bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)

43

Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the

analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates

compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day

bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10

petabytes of datayear in order to accomplish its science

bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial

44

DDMAgent

DDMAgent

ATLAS production

jobs

Regional production

jobs

User Group analysis jobs

Data Service

Task Buffer(job queue)

Job Dispatcher

PanDA Server(task management)

Job Broker

Policy(job type priority)

ATLA

S Ti

er 1

Data

Cen

ters

11 s

ites

scat

tere

d ac

ross

Euro

pe N

orth

Am

erica

and

Asia

in

aggr

egat

e ho

ld 1

copy

of a

ll dat

a an

d pr

ovide

the

work

ing

data

set f

or d

istrib

ution

to T

ier 2

cen

ters

for a

nalys

isDistributed

Data Manager

Pilot Job(Panda job

receiver running under the site-

specific job manager)

Grid Scheduler

Site Capability Service

CERNATLAS detector

Tier 0 Data Center(1 copy of all data ndash

archival only)

Job resource managerbull Dispatch a ldquopilotrdquo job manager - a

Panda job receiver - when resources are available at a site

bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA

bull Similar to the Condor Glide-in approach

Site status

ATLAS analysis sites(eg 70 Tier 2 Centers in

Europe North America and SE Asia)

DDMAgent

DDMAgent

1) Schedules jobs initiates data movement

2) DDM locates data and moves it to sites

This is a complex system in its own right called DQ2

3) Prepares the local resources to receive Panda jobs

4) Jobs are dispatched when there are resources available and when the required data is

in place at the site

Thanks to Michael Ernst US ATLAS technical lead for his assistance with this

diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)

The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday

CERN

Try to move the job to where the data is else move data and job to where

resources are available

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe N America and SE Asia generate network

data movement of 730 TByday ~68Gbs

Accumulated data volume on disk

730 TBytesday

PanDA manages 120000ndash140000 simultaneous jobs    (PanDA manages two types of jobs that are shown separately here)

It is this scale of data movementgoing on 24 hrday 9+ monthsyr

that networks must support in order to enable the large-scale science of the LHC

0

50

100

150

Peta

byte

s

four years

0

50000

100000

type

2jo

bs

0

50000

type

1jo

bs

one year

one year

one year

46

Building an LHC-scale production analysis system In order to debug and optimize the distributed system that

accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in

ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC

production

47

Ramp-up of LHC traffic in ESnet

(est of ldquosmallrdquo scale traffic)

LHC

turn

-on

LHC data systemtesting

LHC operationThe transition from testing to operation

was a smooth continuum due toat-scale testing ndash a process that took

more than 5 years

48

6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument

to data centers ndash a dedicated purpose-built infrastructure is needed

bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to

the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the

Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward

exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community

bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by

bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])

ndash that is only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations

67

The knowledge base (http://fasterdata.es.net) topics:
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations (a small host-tuning example in the spirit of these pages follows below)
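As an example of what the host-tuning material covers: long-RTT, high-rate paths mostly need large TCP socket buffers. The snippet below checks two standard Linux sysctls against values in the range typically suggested for 10 Gb/s-class wide area transfers; the exact numbers are illustrative rather than the knowledge base's official recommendations.

# Check current Linux TCP buffer limits against values commonly suggested
# for long-RTT, 10 Gb/s-class paths (illustrative numbers only).
import pathlib

RECOMMENDED = {
    "net/core/rmem_max": 67108864,   # 64 MB max receive socket buffer
    "net/core/wmem_max": 67108864,   # 64 MB max send socket buffer
}

for key, want in RECOMMENDED.items():
    path = pathlib.Path("/proc/sys") / key
    have = int(path.read_text().split()[0]) if path.exists() else None
    status = "OK" if have is not None and have >= want else "increase"
    print(f"{key}: current={have} recommended>={want} -> {status}")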

68

The Message
• A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
• Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

69

Infrastructure Critical to Science
• The combination of
– New network architectures in the wide area
– New network services (such as guaranteed bandwidth virtual circuits)
– Cross-domain network error detection and correction
– Redesigning the site LAN to handle high data throughput
– Automation of data movement systems
– Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKA: The similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKA: The lessons
• The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one location
– A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository –
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
• high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
– argue against a single large data center

72

LHC lessons of possible use to the SKA
• The LHC model of distributed data (multiple regional centers) has worked well
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites

73

LHC lessons of possible use to the SKA
• Regardless of distributed vs. centralized working data repository, all of the attendant network lessons will apply
• There will have to be an LHCOPN-like network from the data source to Tier 1 site(s)
• It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue (a rough sizing sketch follows below)
• In any event the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
– In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, e.g., are implementing LHCONE
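A back-of-the-envelope calculation shows the scale such a distribution node would have to handle; the assumption of ten Tier 1 centers is purely illustrative.

# Rough sizing of the hypothetical SKA distribution node (illustrative numbers only).
rate_gbps = 100                     # sustained flow from the telescope site
seconds_per_day = 86_400
tb_per_day = rate_gbps / 8 * seconds_per_day / 1000   # GB/s * s/day -> TB/day
print(f"~{tb_per_day:,.0f} TB/day (~{tb_per_day/1000:.1f} PB/day) arriving")

tier1_count = 10                    # assumed number of Tier 1 centers
print(f"~{rate_gbps / tier1_count:.0f} Gb/s sustained per Tier 1 if split evenly")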

74

LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
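The reason error-free paths matter so much can be seen from the standard loss-based approximation of TCP throughput (Mathis et al.), in which the achievable rate falls off as 1/sqrt(loss) and 1/RTT. The sketch below plugs in a roughly trans-Atlantic RTT; the loss rate chosen is just an example.

# Loss-limited TCP throughput approximation (Mathis et al.):
#   rate <~ (MSS / RTT) * (1 / sqrt(loss))
import math

def tcp_bound_mbps(mss_bytes=1460, rtt_s=0.150, loss=1e-5):
    return (mss_bytes * 8 / rtt_s) * (1 / math.sqrt(loss)) / 1e6

# 150 ms RTT (roughly trans-Atlantic), one packet in 100,000 lost:
print(f"~{tcp_bound_mbps():.0f} Mb/s upper bound")   # far below a 10 Gb/s path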

• Re-engineering the site LAN-WAN architecture is critical: the Science DMZ
• Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on

75

The Message
• Again … a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
• Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations

[OIF1] OIF-FD-100G-DWDM-01.0, 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010.

(may be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide R&D, consulting, and a knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
  – With each generation of network transport technology:
    • 155 Mb/s was the norm for high speed networks in 1995
    • 100 Gb/s – about 650 times greater – is the norm today
  – R&D groups involving hardware engineers, computer scientists, and application specialists worked to
    • first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
    • and then do the development necessary for applications to make use of the new capabilities
  – Examples of how this methodology drove toward today's capabilities include:
    • experiments in the 1990s using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths (see the sketch after this list)
    • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
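The arithmetic below illustrates why those 1990s experiments needed parallelism: a single window-limited TCP stream delivers roughly window/RTT, so several streams (fed by several disks) were aggregated to approach the OC12 rate. The buffer size and RTT are assumed, illustrative values, not figures from the experiments themselves.

def single_stream_mbps(window_bytes, rtt_ms):
    """Approximate rate of one window-limited TCP stream: window / round-trip time."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1e6

if __name__ == "__main__":
    window = 512 * 1024          # assumed 512 KB socket buffer
    rtt = 70                     # assumed ~70 ms US cross-country round-trip time
    one = single_stream_mbps(window, rtt)
    print(f" 1 stream : {one:6.1f} Mb/s")
    for n in (2, 4, 8, 10):
        print(f"{n:2d} streams: {n * one:6.1f} Mb/s   (OC12 = 622 Mb/s line rate)")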

66

Provide R&D, consulting, and a knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations

67

The knowledge base: http://fasterdata.es.net topics
  – Network Architecture, including the Science DMZ model
  – Host Tuning
  – Network Tuning
  – Data Transfer Tools
  – Network Performance Testing
  – With special sections on:
    • Linux TCP Tuning
    • Cisco 6509 Tuning
    • perfSONAR Howto
    • Active perfSONAR Services
    • Globus overview
    • Say No to SCP
    • Data Transfer Nodes (DTN)
    • TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations

68

The Message
• A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments
• But once this is done, international high-speed data management can be done on a routine basis
• Many of the technologies and the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

69

Infrastructure Critical to Science
• The combination of
  – new network architectures in the wide area
  – new network services (such as guaranteed bandwidth virtual circuits)
  – cross-domain network error detection and correction
  – redesigning the site LAN to handle high data throughput
  – automation of data movement systems
  – use of appropriate operating system tuning and data transfer tools
  now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKA – the similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time because the instruments take data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKA – the lessons
• The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
  – A deep archive (tape only) copy is probably practical in one location (e.g., the SKA supercomputer center), and this is done at CERN for the LHC
  – The technical burden of building and operating a centralized working data repository:
    • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
    • high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
  argues against a single large data center

72

LHC lessons of possible use to the SKA
• The LHC model of distributed data (multiple regional centers) has worked well
  – It decentralizes costs and involves many countries directly in the telescope infrastructure
  – It divides up the network load, especially on the expensive trans-ocean links
  – It divides up the cache I/O load across distributed sites

73

LHC lessons of possible use to the SKA
• Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply
  – There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
    • It might be that, in the case of the SKA, the T1 links would come to a centralized, distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue (see the back-of-envelope sketch after this list)
    • In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
  – If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
    • In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks, e.g. in the US, are implementing LHCONE
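The back-of-envelope calculation below shows what a sustained 100 Gb/s flow from the telescope site implies for the Tier 1 distribution. The utilization factor and the Tier 1 counts are assumptions for illustration only, not SKA planning numbers.

def pb_per_day(gbps, utilization=0.8):
    """Petabytes per day delivered at a given line rate and an assumed average utilization."""
    return gbps * 1e9 * utilization * 86400 / 8 / 1e15

if __name__ == "__main__":
    print(f"100 Gb/s at 80% utilization is about {pb_per_day(100):.2f} PB/day in total")
    for n_t1 in (5, 10):                     # hypothetical Tier 1 counts
        print(f"  split evenly across {n_t1:2d} Tier 1 centers: "
              f"about {pb_per_day(100) / n_t1:.2f} PB/day ({100 / n_t1:.0f} Gb/s) each")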

74

LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (see the sketch after this list)
  – New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
• Re-engineering the site LAN/WAN architecture is critical: the Science DMZ
• Workflow management systems that automate the data movement will have to be designed and tested
  – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
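The loss sensitivity behind the "fragile workhorse" observation can be made concrete with the well-known Mathis et al. approximation, throughput roughly bounded by MSS / (RTT · sqrt(loss)), with the constant factor taken as 1. The numbers below are illustrative assumptions (the 0.0046% loss rate echoes the example used earlier in the talk), not measurements of any particular path.

from math import sqrt

def mathis_mbps(mss_bytes, rtt_ms, loss_rate):
    """Mathis et al. upper bound on standard (Reno-like) TCP throughput, in Mb/s."""
    return (mss_bytes * 8) / ((rtt_ms / 1000) * sqrt(loss_rate)) / 1e6

if __name__ == "__main__":
    mss, loss = 1460, 0.000046               # assumed MSS and a 0.0046% packet loss rate
    for rtt in (1, 10, 90):                  # LAN, regional, and trans-Atlantic RTTs (ms)
        print(f"RTT {rtt:3d} ms -> throughput bound ~ {mathis_mbps(mss, rtt, loss):8.0f} Mb/s")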

75

The Message – again
• A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments
• But once this is done, international high-speed data management can be done on a routine basis
• Many of the technologies and the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
Also see http://www.perfsonar.net and http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

  – Examples of how this methodology drove toward today's capabilities include:
    • experiments in the 1990s that used parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths (see the sketch below)
    • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
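The parallel disk and network I/O technique referred to above can be sketched roughly as follows: split a file into byte ranges and send each range over its own TCP stream, which is the idea that tools such as GridFTP and FDT implement in far more sophisticated form. The host name, port, and stream count are placeholder values, and a matching receiver that reassembles ranges by offset is assumed.

```python
# Simplified sketch of parallel network I/O: one file, N byte ranges, N TCP streams.
# Host/port/stream values are arbitrary examples; a cooperating receiver is assumed.

import os
import socket
import threading

def send_range(path, offset, length, host, port):
    """Send one byte range of the file over its own TCP connection."""
    with socket.create_connection((host, port)) as sock, open(path, "rb") as f:
        f.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = f.read(min(4 * 1024 * 1024, remaining))   # 4 MB reads
            if not chunk:
                break
            sock.sendall(chunk)
            remaining -= len(chunk)

def parallel_send(path, host="dtn.example.org", port=5000, streams=8):
    size = os.path.getsize(path)
    per_stream = (size + streams - 1) // streams
    threads = []
    for i in range(streams):
        offset = i * per_stream
        length = min(per_stream, size - offset)
        if length <= 0:
            break
        t = threading.Thread(target=send_range, args=(path, offset, length, host, port))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

# parallel_send("/data/example/dataset.bin")   # receiver must reassemble by offset
```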

66

Provide R&D consulting and knowledge base
• Providing consulting on the problems that data-intensive projects are having in effectively using the network is critical
• Using the knowledge gained from that problem solving to build a community knowledge base benefits everyone
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations

67

The knowledge base: http://fasterdata.es.net topics
  – Network Architecture, including the Science DMZ model
  – Host Tuning (see the buffer-sizing example below)
  – Network Tuning
  – Data Transfer Tools
  – Network Performance Testing
  – With special sections on:
    • Linux TCP Tuning
    • Cisco 6509 Tuning
    • perfSONAR Howto
    • Active perfSONAR Services
    • Globus overview
    • Say No to SCP
    • Data Transfer Nodes (DTN)
    • TCP Issues Explained

• fasterdata.es.net is a community project, with contributions from several organizations
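As one concrete example of the host-tuning material collected there: TCP send and receive buffers must cover the bandwidth-delay product of the path (for a 10 Gb/s path with a 100 ms round-trip time, 10^10 b/s × 0.1 s ≈ 10^9 bits ≈ 125 MB), which is far larger than typical OS defaults. The snippet below shows the per-socket form of this tuning with illustrative numbers; on Linux the system-wide limits (e.g. net.core.rmem_max/wmem_max and net.ipv4.tcp_rmem/tcp_wmem, covered in the Linux TCP Tuning section) must also be raised or the kernel will clamp the request.

```python
# Per-socket TCP buffer tuning sized to the bandwidth-delay product (BDP).
# Illustrative values; kernel-wide limits must also permit buffers this large.

import socket

def bdp_bytes(bandwidth_bps, rtt_seconds):
    """Bandwidth-delay product: the amount of data 'in flight' on the path."""
    return int(bandwidth_bps * rtt_seconds / 8)

# Example: 10 Gb/s path, ~100 ms round-trip time (roughly US coast to Europe)
buf = bdp_bytes(10e9, 0.100)          # ~125 MB

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buf)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buf)
print("requested", buf, "byte buffers; kernel granted",
      sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
# The kernel clamps these to its configured maximums, which is why the
# system-wide tuning described on fasterdata.es.net matters for long fat paths.
```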

68

The Message
• A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
• Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

69

Infrastructure Critical to Science
• The combination of:
  – new network architectures in the wide area
  – new network services (such as guaranteed bandwidth virtual circuits)
  – cross-domain network error detection and correction
  – redesigning the site LAN to handle high data throughput
  – automation of data movement systems
  – use of appropriate operating system tuning and data transfer tools
  now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKA: the similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instruments take data continuously
• The data is generated at, or sent to, a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKA: the lessons
• The science data product (the output of the supercomputer center, in the SKA case) is likely too large to keep the working data set in one location
  – A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
  – The technical aspects of building and operating a centralized working data repository:
    • a large mass storage system with very large disk caches in order to satisfy current requests in an acceptable time
    • high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
    militate against a single large data center

72

LHC lessons of possible use to the SKA
• The LHC model of distributed data (multiple regional centers) has worked well:
  – It decentralizes costs and involves many countries directly in the telescope infrastructure
  – It divides up the network load, especially on the expensive trans-ocean links
  – It divides up the cache I/O load across distributed sites

73

LHC lessons of possible use to the SKA
• Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply:
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
  – It might be that, in the case of the SKA, the T1 links would come to a centralized distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
  – In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
  – In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks (in the US, for example) are implementing LHCONE

74

LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks providing parts of the path, etc. New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
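One way to quantify why error-free paths matter so much is the widely used Mathis et al. estimate for loss-limited, Reno-style TCP throughput (a rule of thumb rather than a model of modern stacks):

\[
\text{throughput} \;\lesssim\; \frac{\mathrm{MSS}}{\mathrm{RTT}\cdot\sqrt{p}}
\]

For example, with MSS = 1460 bytes, RTT = 150 ms, and a packet loss rate p = 10⁻⁵, this bounds a single stream to roughly (1460 × 8 bits / 0.15 s) / √(10⁻⁵) ≈ 25 Mb/s, regardless of whether the underlying path is 10 Gb/s or 100 Gb/s.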

• Re-engineering the site LAN/WAN architecture is critical: the Science DMZ

• Workflow management systems that automate the data movement will have to be designed and tested
  – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on (a minimal sketch follows)
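A minimal sketch of what such automation involves is shown below: queued transfers, end-to-end verification, and automatic retry with back-off, exercised first against synthetic datasets in a ramp-up that mimics a service challenge. The submit/verify functions are placeholders for whatever transfer tool and catalogue the real workflow system would drive; nothing here is a real API.

```python
# Minimal sketch of an automated transfer workflow with error recovery.
# submit_transfer()/verify_checksum() are placeholders, not a real transfer API.

import random
import time

def submit_transfer(dataset, destination):
    """Pretend transfer; a real system would invoke the transfer tool here."""
    return random.random() > 0.2          # ~20% simulated soft failures

def verify_checksum(dataset, destination):
    return True                           # placeholder for end-to-end verification

def move_with_retry(dataset, destination, max_attempts=5, backoff_s=0.5):
    for attempt in range(1, max_attempts + 1):
        if submit_transfer(dataset, destination) and verify_checksum(dataset, destination):
            return True
        time.sleep(backoff_s * attempt)   # back off and retry on soft failure
    print(f"{dataset} -> {destination}: FAILED after {max_attempts} attempts")
    return False

# "Service challenge" style ramp-up with synthetic datasets before turn-on
for day, n_datasets in enumerate([5, 25, 100], start=1):
    ok = sum(move_with_retry(f"synthetic-{day}-{i:04d}", "tier1.example.org")
             for i in range(n_datasets))
    print(f"day {day}: {ok}/{n_datasets} synthetic datasets delivered")
```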

75

The Message – again
• A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
• Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

Page 8: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

8

HEP as a Prototype for Data-Intensive Sciencebull Two major proton experiments (detectors) at the LHC ATLAS

and CMSbull ATLAS is designed to observe a billion (1x109) collisionssec

with a data rate out of the detector of more than 1000000 Gigabytessec (1 PBys)

bull A set of hardware and software filters at the detector reduce the output data rate to about 25 Gbs that must be transported managed and analyzed to extract the sciencendash The output data rate for CMS is about the same for a combined

50 Gbs that is distributed to physics groups around the world 7x24x~9moyr

The LHC data management model involves a world-wide collection of centers that store manage and analyze the data

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs ndash 8 Pbs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC(one of two detectors)

LHC Tier 0

Taiwan Canada USA-Atlas USA-CMS

Nordic

UKNetherlands Germany Italy

Spain

FranceCERN

Tier 1 centers hold working data

Tape115 PBy

Disk60 PBy

Cores68000

Tier 2 centers are data caches and analysis sites

0

(WLCG

120 PBy

2012)

175000

3 X data outflow vs inflow

10

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe N America and SE Asia generate network

data movement of 730 TByday ~68Gbs

Accumulated data volume on disk

730 TBytesday

PanDA manages 120000ndash140000 simultaneous jobs    (PanDA manages two types of jobs that are shown separately here)

It is this scale of data movementgoing on 24 hrday 9+ monthsyr

that networks must support in order to enable the large-scale science of the LHC

0

50

100

150

Peta

byte

s

four years

0

50000

100000

type

2jo

bs

0

50000

type

1jo

bs

one year

one year

one year

11

HEP as a Prototype for Data-Intensive ScienceThe capabilities required to support this scale of data

movement involve hardware and software developments at all levels1 The underlying network

1a Optical signal transport1b Network routers and switches

2 Data transport (TCP is a ldquofragile workhorserdquo but still the norm)3 Network monitoring and testing4 Operating system evolution5 New site and network architectures6 Data movement and management techniques and software7 New network services

bull Technology advances in these areas have resulted in todayrsquos state-of-the-art that makes it possible for the LHC experiments to routinely and continuously move data at ~150 Gbs across three continents

12

HEP as a Prototype for Data-Intensive Sciencebull ESnet has been collecting requirements for all DOE science

disciplines and instruments that rely on the network for distributed data management and analysis for more than a decade and formally since 2007 [REQ] In this process certain issues are seen across essentially all science

disciplines that rely on the network for significant data transfer even if the quantities are modest compared to project like the LHC experiments

Therefore addressing the LHC issues is a useful exercise that can benefit a wide range of science disciplines

SKA data flow model is similar to the LHCreceptorssensors

correlator data processor

supercomputer

European distribution point

~200km avg

~1000 km

~25000 km(Perth to London via USA)

or~13000 km

(South Africa to London)

Regionaldata center

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

93 ndash 168 Pbs

400 Tbs

100 Gbs

from SKA RFI

Hypothetical(based on the

LHC experience)

These numbers are based on modeling prior to splitting the

SKA between S Africa and Australia)

Regionaldata center

Regionaldata center

14

Foundations of data-intensive sciencebull This talk looks briefly at the nature of the advances in

technologies software and methodologies that have enabled LHC data management and analysis The points 1a and 1b on optical transport and router technology are

included in the slides for completeness but I will not talk about them They were not really driven by the needs of the LHC but they were opportunistically used by the LHC

Much of the reminder of the talk is a tour through ESnetrsquos network performance knowledge base (fasterdataesnet)

ndash Also included arebull the LHC ATLAS data management and analysis approach that generates

and relies on very large network data utilizationbull and an overview of how RampE network have evolved to accommodate the

LHC traffic

1) Underlying network issuesAt the core of our ability to transport the volume of data

that we must deal with today and to accommodate future growth are advances in optical transport technology and

router technology

0

5

10

15

Peta

byte

sm

onth

13 years

We face a continuous growth of data to transport

ESnet has seen exponential growth in our traffic every year since 1990 (our traffic grows by factor of 10 about once every 47 months)

16

We face a continuous growth of data transportbull The LHC data volume is predicated to grow 10 fold over the

next 10 yearsNew generations of instruments ndash for example the Square

Kilometer Array radio telescope and ITER (the international fusion experiment) ndash will generate more data than the LHC

In response ESnet and most large RampE networks have built 100 Gbs (per optical channel) networksndash ESnets new network ndash ESnet5 ndash is complete and provides a 44 x

100Gbs (44 terabitssec - 4400 gigabitssec) in optical channels across the entire ESnet national footprint

ndash Initially one of these 100 Gbs channels is configured to replace the current 4 x 10 Gbs IP network

bull What has made this possible

17

1a) Optical Network TechnologyModern optical transport systems (DWDM = dense wave

division multiplexing) use a collection of technologies called ldquocoherent opticalrdquo processing to achieve more sophisticated optical modulation and therefore higher data density per signal transport unit (symbol) that provides 100Gbs per wave (optical channel)ndash Optical transport using dual polarization-quadrature phase shift keying

(DP-QPSK) technology with coherent detection [OIF1]bull dual polarization

ndash two independent optical signals same frequency orthogonal two polarizations rarr reduces the symbol rate by half

bull quadrature phase shift keying ndash encode data by changing the signal phase of the relative to the optical carrier further reduces the symbol rate by half (sends twice as much data symbol)

Together DP and QPSK reduce required rate by a factor of 4ndash allows 100G payload (plus overhead) to fit into 50GHz of spectrum

bull Actual transmission rate is about 10 higher to include FEC data

ndash This is a substantial simplification of the optical technology involved ndash see the TNC 2013 paper and Chris Tracyrsquos NANOG talk for details [Tracy1] and [Rob1]

18

Optical Network Technology ESnet5rsquos optical network uses Cienarsquos 6500 Packet-Optical Platform with

WaveLogictrade to provide 100Gbs wavendash 88 waves (optical channels) 100Gbs each

bull wave capacity shared equally with Internet2ndash ~13000 miles 21000 km lit fiberndash 280 optical amplifier sitesndash 70 optical adddrop sites (where routers can be inserted)

bull 46 100G adddrop transpondersbull 22 100G re-gens across wide-area

NEWG

SUNN

KANSDENV

SALT

BOIS

SEAT

SACR

WSAC

LOSA

LASV

ELPA

ALBU

ATLA

WASH

NEWY

BOST

SNLL

PHOE

PAIX

NERSC

LBNLJGI

SLAC

NASHCHAT

CLEV

EQCH

STA

R

ANLCHIC

BNL

ORNL

CINC

SC11

STLO

Internet2

LOUI

FNA

L

Long IslandMAN and

ANI Testbed

O

JACKGeography is

only representational

19

1b) Network routers and switchesESnet5 routing (IP layer 3) is provided by Alcatel-Lucent

7750 routers with 100 Gbs client interfacesndash 17 routers with 100G interfaces

bull several more in a test environment ndash 59 layer-3 100GigE interfaces 8 customer-owned 100G routersndash 7 100G interconnects with other RampE networks at Starlight (Chicago)

MAN LAN (New York) and Sunnyvale (San Francisco)

20

Metro area circuits

SNLL

PNNL

MIT

PSFC

AMES

LLNL

GA

JGI

LBNL

SLACNER

SC

ORNL

ANLFNAL

SALT

INL

PU Physics

SUNN

SEAT

STAR

CHIC

WASH

ATLA

HO

US

BOST

KANS

DENV

ALBQ

LASV

BOIS

SAC

R

ELP

A

SDSC

10

Geographical representation is

approximate

PPPL

CH

AT

10

SUNN STAR AOFA100G testbed

SF Bay Area Chicago New York AmsterdamAMST

US RampE peerings

NREL

Commercial peerings

ESnet routers

Site routers

100G

10-40G

1G Site provided circuits

LIGO

Optical only

SREL

100thinsp

Intrsquol RampE peerings

100thinsp

JLAB

10

10100thinsp

10

100thinsp100thinsp

1

10100thinsp

100thinsp1

100thinsp100thinsp

100thinsp

100thinsp

BNL

NEWY

AOFA

NASH

1

LANL

SNLA

10

10

1

10

10

100thinsp

100thinsp

100thinsp10

1010

100thinsp

100thinsp

10

10

100thinsp

100thinsp

100thinsp

100thinsp

100thinsp

100thinsp100thinsp

100thinsp

10

100thinsp

The Energy Sciences Network ESnet5 (Fall 2013)

2) Data transport The limitations of TCP must be addressed for large long-distance flows

Although there are other transport protocols available TCP remains the workhorse of the Internet including for data-

intensive scienceUsing TCP to support the sustained long distance high data-

rate flows of data-intensive science requires an error-free network

Why error-freeTCP is a ldquofragile workhorserdquo It is very sensitive to packet loss (due to bit errors)ndash Very small packet loss rates on these paths result in large decreases

in performance)ndash A single bit error will cause the loss of a 1-9 KBy packet (depending

on the MTU size) as there is no FEC at the IP level for error correctionbull This puts TCP back into ldquoslow startrdquo mode thus reducing throughput

22

Transportbull The reason for TCPrsquos sensitivity to packet loss is that the

slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internetndash Packet loss is seen by TCPrsquos congestion control algorithms as

evidence of congestion so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion leading to network throughput collapse)

ndash Network link errors also cause packet loss so these congestion avoidance algorithms come into play with dramatic effect on throughput in the wide area network ndash hence the need for ldquoerror-freerdquo

23

Transport Impact of packet loss on TCPOn a 10 Gbs LAN path the impact of low packet loss rates is

minimalOn a 10Gbs WAN path the impact of low packet loss rates is

enormous (~80X throughput reduction on transatlantic path)

Implications Error-free paths are essential for high-volume long-distance data transfers

Throughput vs increasing latency on a 10Gbs link with 00046 packet loss

Reno (measured)

Reno (theory)

H-TCP(measured)

No packet loss

(see httpfasterdataesnetperformance-testingperfso

nartroubleshootingpacket-loss)

Network round trip time ms (corresponds roughly to San Francisco to London)

10000

9000

8000

7000

6000

5000

4000

3000

2000

1000

0

Thro

ughp

ut M

bs

24

Transport Modern TCP stackbull A modern TCP stack (the kernel implementation of the TCP

protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])ndash This is done using mechanisms that more quickly increase back to full

speed after an error forces a reset to low bandwidth

TCP Results

0

100

200

300

400

500

600

700

800

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35time slot (5 second intervals)

Mbi

tss

econ

d

Linux 26 BIC TCPLinux 24Linux 26 BIC off

RTT = 67 ms

ldquoBinary Increase Congestionrdquo control algorithm impact

Note that BIC reaches max throughput much faster than older algorithms (from Linux 2619 the

default is CUBIC a refined version of BIC designed for high bandwidth

long paths)

25

Transport Modern TCP stackEven modern TCP stacks are only of some help in the face of

packet loss on a long path high-speed network

bull For a detailed analysis of the impact of packet loss on various TCP implementations see ldquoAn Investigation into Transport Protocols and Data Transport Applications Over High Performance Networksrdquo chapter 8 (ldquoSystematic Tests of New-TCP Behaviourrdquo) by Yee-Ting Li University College London (PhD thesis) httpwwwslacstanfordedu~ytlthesispdf

Reno (measured)

Reno (theory)

H-TCP (CUBIC refinement)(measured)

Throughput vs increasing latency on a 10Gbs link with 00046 packet loss(tail zoom)

Roundtrip time ms (corresponds roughly to San Francisco to London)

1000

900800700600500400300200100

0

Thro

ughp

ut M

bs

26

3) Monitoring and testingThe only way to keep multi-domain international scale networks error-free is to test and monitor continuously

end-to-end to detect soft errors and facilitate their isolation and correction

perfSONAR provides a standardize way to test measure export catalogue and access performance data from many different network domains (service providers campuses etc)

bull perfSONAR is a community effort tondash define network management data exchange protocols andndash standardized measurement data formats gathering and archiving

perfSONAR is deployed extensively throughout LHC related networks and international networks and at the end sites(See [fasterdata] [perfSONAR] and [NetSrv])

ndash There are now more than 1000 perfSONAR boxes installed in N America and Europe

27

perfSONARThe test and monitor functions can detect soft errors that limit

throughput and can be hard to find (hard errors faults are easily found and corrected)

Soft failure examplebull Observed end-to-end performance degradation due to soft failure of single optical line card

Gb

s

normal performance

degrading performance

repair

bull Why not just rely on ldquoSNMPrdquo interface stats for this sort of error detectionbull not all error conditions show up in SNMP interface statisticsbull SNMP error statistics can be very noisybull some devices lump different error counters into the same bucket so it can be very

challenging to figure out what errors to alarm on and what errors to ignorebull though ESnetrsquos Spectrum monitoring system attempts to apply heuristics to do this

bull many routers will silently drop packets - the only way to find that is to test through them and observe loss using devices other than the culprit device

one month

28

perfSONARThe value of perfSONAR increases dramatically as it is

deployed at more sites so that more of the end-to-end (app-to-app) path can characterized across multiple network domains provides the only widely deployed tool that can monitor circuits end-

to-end across the different networks from the US to Europendash ESnet has perfSONAR testers installed at every PoP and all but the

smallest user sites ndash Internet2 is close to the same

bull perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) and the protocol is implemented using SOAP XML messages

29

4) System software evolution and optimizationOnce the network is error-free there is still the issue of

efficiently moving data from the application running on a user system onto the network

bull Host TCP tuningbull Modern TCP stack (see above)bull Other issues (MTU etc)

bull Data transfer tools and parallelism

bull Other data transfer issues (firewalls etc)

30

41) System software tuning Host tuning ndash TCPbull ldquoTCP tuningrdquo commonly refers to the proper configuration of

TCP windowing buffers for the path lengthbull It is critical to use the optimal TCP send and receive socket

buffer sizes for the path (RTT) you are using end-to-endDefault TCP buffer sizes are typically much too small for

todayrsquos high speed networksndash Until recently default TCP sendreceive buffers were typically 64 KBndash Tuned buffer to fill CA to NY 1 Gbs path 10 MB

bull 150X bigger than the default buffer size

31

System software tuning Host tuning ndash TCPbull Historically TCP window size tuning parameters were host-

global with exceptions configured per-socket by applicationsndash How to tune is a function of the application and the path to the

destination so potentially a lot of special cases

Auto-tuning TCP connection buffer size within pre-configured limits helps

Auto-tuning however is not a panacea because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (eg international) paths

32

System software tuning Host tuning ndash TCP

Throughput out to ~9000 km on a 10Gbs network32 MBy (auto-tuned) vs 64 MBy (hand tuned) TCP window size

hand tuned to 64 MBy window

Roundtrip time ms (corresponds roughlyto San Francisco to London)

path length

10000900080007000600050004000300020001000

0

Thro

ughp

ut M

bs

auto tuned to 32 MBy window

33

42) System software tuning Data transfer toolsParallelism is key in data transfer tools

ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection

bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)

ndash Several tools offer parallel transfers (see below)

Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN

transfersndash Many tools and protocols assume latencies typical of a LAN

environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long

path networks

bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more

than about 500 Mbs

34

System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL

RTT = 53 ms network capacity = 10GbpsTool Throughput

bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology

bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase

bull this helps rsync too

35

System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-

performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open

ports) ssh etc The newer Globus Online incorporates all of these and small file

support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community

outside of HEP

36

System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach

ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node

ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and

httpmonalisacernchFDT

37

44) System software tuning Other issuesFirewalls are anathema to high-peed data flows

ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for

TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo

Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf

bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning

bull Defaults are usually fine for 1GE but 10GE often requires additional tuning

ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo

([HPBulk])

5) Site infrastructure to support data-intensive scienceThe Science DMZ

With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the

bottleneckThe site network (LAN) typically provides connectivity for local

resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network

and the local area site network is critical for large-scale data movement

Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks

for business and small data-flow purposes usually donrsquot work for large-scale data flows

bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data

flows

39

The Science DMZTo provide high data-rate access to local resources the site

LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high

speed data path all the way back to the source

40

The Science DMZThe ScienceDMZ concept

The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and

rapid fault isolation typically perfSONAR (see [perfSONAR] and below)

A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)

This is so important it was a requirement for last round of NSF CC-NIE grants

41

The Science DMZ

(See httpfasterdataesnetscience-dmz

and [SDMZ] for a much more complete

discussion of the various approaches)

campus siteLAN

high performanceData Transfer Node

computing cluster

cleanhigh-bandwidthWAN data path

campussiteaccess to

Science DMZresources is via the site firewall

secured campussiteaccess to Internet

border routerWAN

Science DMZrouterswitch

campus site

Science DMZ

Site DMZ WebDNS

Mail

network monitoring and testing

A WAN-capable device

per-servicesecurity policycontrol points

site firewall

dedicated systems built and

tuned for wide-area data transfer

42

6) Data movement and management techniquesAutomated data movement is critical for moving 500

terabytesday between 170 international sites In order to effectively move large amounts of data over the

network automated systems must be used to manage workflow and error recovery

bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers

bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)

43

Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the

analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates

compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day

bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10

petabytes of datayear in order to accomplish its science

bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial

44

DDMAgent

DDMAgent

ATLAS production

jobs

Regional production

jobs

User Group analysis jobs

Data Service

Task Buffer(job queue)

Job Dispatcher

PanDA Server(task management)

Job Broker

Policy(job type priority)

ATLA

S Ti

er 1

Data

Cen

ters

11 s

ites

scat

tere

d ac

ross

Euro

pe N

orth

Am

erica

and

Asia

in

aggr

egat

e ho

ld 1

copy

of a

ll dat

a an

d pr

ovide

the

work

ing

data

set f

or d

istrib

ution

to T

ier 2

cen

ters

for a

nalys

isDistributed

Data Manager

Pilot Job(Panda job

receiver running under the site-

specific job manager)

Grid Scheduler

Site Capability Service

CERNATLAS detector

Tier 0 Data Center(1 copy of all data ndash

archival only)

Job resource managerbull Dispatch a ldquopilotrdquo job manager - a

Panda job receiver - when resources are available at a site

bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA

bull Similar to the Condor Glide-in approach

Site status

ATLAS analysis sites(eg 70 Tier 2 Centers in

Europe North America and SE Asia)

DDMAgent

DDMAgent

1) Schedules jobs initiates data movement

2) DDM locates data and moves it to sites

This is a complex system in its own right called DQ2

3) Prepares the local resources to receive Panda jobs

4) Jobs are dispatched when there are resources available and when the required data is

in place at the site

Thanks to Michael Ernst US ATLAS technical lead for his assistance with this

diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)

The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday

CERN

Try to move the job to where the data is else move data and job to where

resources are available

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe N America and SE Asia generate network

data movement of 730 TByday ~68Gbs

Accumulated data volume on disk

730 TBytesday

PanDA manages 120000ndash140000 simultaneous jobs    (PanDA manages two types of jobs that are shown separately here)

It is this scale of data movementgoing on 24 hrday 9+ monthsyr

that networks must support in order to enable the large-scale science of the LHC

0

50

100

150

Peta

byte

s

four years

0

50000

100000

type

2jo

bs

0

50000

type

1jo

bs

one year

one year

one year

46

Building an LHC-scale production analysis system In order to debug and optimize the distributed system that

accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in

ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC

production

47

Ramp-up of LHC traffic in ESnet

(est of ldquosmallrdquo scale traffic)

LHC

turn

-on

LHC data systemtesting

LHC operationThe transition from testing to operation

was a smooth continuum due toat-scale testing ndash a process that took

more than 5 years

48

6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument

to data centers ndash a dedicated purpose-built infrastructure is needed

bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to

the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the

Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward

exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community

bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by

bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])

ndash that is only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base – http://fasterdata.es.net – topics:
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
  • Linux TCP Tuning (see the sketch after this list)
  • Cisco 6509 Tuning
  • perfSONAR Howto
  • Active perfSONAR Services
  • Globus overview
  • Say No to SCP
  • Data Transfer Nodes (DTN)
  • TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
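As an example of the host-tuning material collected there: the TCP buffer must be at least the bandwidth-delay product (BDP) of the path, and on Linux the auto-tuning ceiling is the third field of net.ipv4.tcp_rmem. A small sketch follows; the 10 Gb/s / 90 ms path and the fallback value are illustrative assumptions, not recommendations.

```python
def bdp_bytes(bandwidth_gbps, rtt_ms):
    """Bandwidth-delay product: the TCP window needed to keep the pipe full."""
    return int(bandwidth_gbps * 1e9 / 8 * rtt_ms / 1e3)

def max_tcp_rmem():
    """Ceiling (bytes) of the Linux TCP receive-buffer auto-tuner."""
    with open("/proc/sys/net/ipv4/tcp_rmem") as f:
        return int(f.read().split()[2])        # fields are "min default max"

need = bdp_bytes(10, 90)                       # assumed: 10 Gb/s path, ~90 ms RTT
try:
    have = max_tcp_rmem()
except OSError:                                # not on Linux, or not readable
    have = 6 * 2**20                           # a common distro default, for illustration
print(f"BDP ~{need / 2**20:.0f} MiB, tcp_rmem max ~{have / 2**20:.0f} MiB")
if have < need:
    print("Receive buffer ceiling is too small for this path -- see Host Tuning")
```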

68

The Message
• A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
• Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

69

Infrastructure Critical to Science
• The combination of:
– New network architectures in the wide area
– New network services (such as guaranteed bandwidth virtual circuits)
– Cross-domain network error detection and correction
– Redesigning the site LAN to handle high data throughput
– Automation of data movement systems
– Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKA – the similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKA – the lessons
• The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
– A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository:
  • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
  • high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
militate against a single large data center (a back-of-the-envelope sketch follows)
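A back-of-the-envelope sketch of the load on a single centralized repository, assuming (hypothetically) a sustained 100 Gb/s stream from the telescope site and an LHC-like pattern in which roughly three times the ingest volume flows back out to analysis sites; both numbers are assumptions for illustration.

```python
def tb_per_day(gbps):
    """Convert a sustained rate in Gb/s to terabytes per day."""
    return gbps / 8 * 86400 / 1000     # Gb/s -> GB/s -> GB/day -> TB/day

ingest_gbps = 100                      # assumed telescope-to-repository rate
outflow_factor = 3                     # assumed LHC-like egress to analysis sites
egress_gbps = ingest_gbps * outflow_factor

print(f"ingest: {tb_per_day(ingest_gbps):6.0f} TB/day")     # ~1080 TB/day
print(f"egress: {tb_per_day(egress_gbps):6.0f} TB/day")     # ~3240 TB/day
print(f"WAN   : {ingest_gbps + egress_gbps} Gb/s sustained, plus headroom")
# The disk cache must sustain roughly ingest + egress in aggregate I/O,
# continuously -- which is the argument for spreading the load across
# multiple regional centers instead of one site.
```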

72

LHC lessons of possible use to the SKA
• The LHC model of distributed data (multiple regional centers) has worked well
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites

73

LHC lessons of possible use to the SKA
• Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
– It might be that in the case of the SKA the T1 links would come to a centralized, distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
– In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE

74

LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (the sketch after this list illustrates why)
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
• Re-engineering the site LAN-WAN architecture is critical: the Science DMZ
• Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
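The sketch below quantifies the "fragile workhorse" point using the well-known Mathis et al. approximation for steady-state TCP throughput, rate ≲ (MSS/RTT)·(1.22/√p) for loss probability p; the MSS, RTT, and loss rates chosen are illustrative only.

```python
from math import sqrt

def mathis_limit_mbps(mss_bytes=1500, rtt_ms=90.0, loss_rate=1e-5):
    """Upper bound on a single standard-TCP flow (Mathis et al. approximation)."""
    rate_bps = (mss_bytes * 8 / (rtt_ms / 1e3)) * (1.22 / sqrt(loss_rate))
    return rate_bps / 1e6

for p in (1e-7, 1e-5, 1e-3):
    print(f"loss rate {p:g}: <= {mathis_limit_mbps(loss_rate=p):8.0f} Mb/s on a 90 ms path")
# Even a tiny residual loss rate caps a single Reno-like flow far below 10 Gb/s,
# which is why long paths must be kept essentially error-free and monitored
# continuously so that soft failures are found and fixed quickly.
```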

75

The Message (again)
• A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
• Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more/

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, Robertson, D., Thompson, M., Lee, J., Tierney, B., Johnston, W., Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C., Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf


Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Science
• The combination of:
  – new network architectures in the wide area
  – new network services (such as guaranteed bandwidth virtual circuits)
  – cross-domain network error detection and correction
  – redesigning the site LAN to handle high data throughput
  – automation of data movement systems
  – use of appropriate operating system tuning and data transfer tools
  now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.

70

LHC lessons of possible use to the SKA: the similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated at / sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.

71

LHC lessons of possible use to the SKA: the lessons
• The science data product (the output of the supercomputer center, in the SKA case) is likely too large to keep the working data set in one location.
  – A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is what is done at CERN for the LHC.
  – The technical aspects of building and operating a centralized working data repository –
    • a large mass storage system with very large cache disks in order to satisfy current requests in an acceptable time, and
    • high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites –
    argue against a single large data center.

72

LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
  – It decentralizes costs and involves many countries directly in the telescope infrastructure.
  – It divides up the network load, especially on the expensive trans-ocean links.
  – It divides up the cache I/O load across distributed sites.

73

LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
  – It might be that in the case of the SKA the T1 links would come to a centralized, distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
  – In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
  – In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.

74

LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, and so on.
  – New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
• Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.
• Workflow management systems that automate the data movement will have to be designed and tested.
  – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.

75

The Message
Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf


10

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBy/day (~68 Gb/s).

PanDA manages 120,000–140,000 simultaneous jobs. (PanDA manages two types of jobs; they are shown separately here.)

It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.

[Figure: accumulated data volume on disk, growing to roughly 150 petabytes over four years, and the two PanDA job types, each reaching on the order of 50,000–100,000 concurrent jobs over one-year spans.]

11

HEP as a Prototype for Data-Intensive Science
The capabilities required to support this scale of data movement involve hardware and software developments at all levels:
  1. The underlying network
     1a. Optical signal transport
     1b. Network routers and switches
  2. Data transport (TCP is a "fragile workhorse" but still the norm)
  3. Network monitoring and testing
  4. Operating system evolution
  5. New site and network architectures
  6. Data movement and management techniques and software
  7. New network services
• Technology advances in these areas have resulted in today's state-of-the-art that makes it possible for the LHC experiments to routinely and continuously move data at ~150 Gb/s across three continents.

12

HEP as a Prototype for Data-Intensive Science
• ESnet has been collecting requirements from all DOE science disciplines and instruments that rely on the network for distributed data management and analysis for more than a decade, and formally since 2007 [REQ].
• In this process certain issues are seen across essentially all science disciplines that rely on the network for significant data transfer, even if the quantities are modest compared to projects like the LHC experiments.
• Therefore, addressing the LHC issues is a useful exercise that can benefit a wide range of science disciplines.

SKA data flow model is similar to the LHC

[Diagram (hypothetical, based on the LHC experience; the data rates are from the SKA RFI and from modeling done prior to splitting the SKA between S. Africa and Australia): receptors/sensors, ~200 km average distance, feed the correlator / data processor at 93–168 Pb/s; the correlator feeds the supercomputer (~1000 km away) at 400 Tb/s; the supercomputer sends 100 Gb/s to a European distribution point (~25,000 km Perth to London via the USA, or ~13,000 km South Africa to London); regional data centers then serve many university astronomy groups.]

14

Foundations of data-intensive science
• This talk looks briefly at the nature of the advances in technologies, software, and methodologies that have enabled LHC data management and analysis.
  – Points 1a and 1b on optical transport and router technology are included in the slides for completeness, but I will not talk about them. They were not really driven by the needs of the LHC, but they were opportunistically used by the LHC.
• Much of the remainder of the talk is a tour through ESnet's network performance knowledge base (fasterdata.es.net).
  – Also included are:
    • the LHC ATLAS data management and analysis approach that generates and relies on very large network data utilization, and
    • an overview of how R&E networks have evolved to accommodate the LHC traffic.

1) Underlying network issues
At the core of our ability to transport the volume of data that we must deal with today, and to accommodate future growth, are advances in optical transport technology and router technology.

We face a continuous growth of data to transport: ESnet has seen exponential growth in its traffic every year since 1990 (our traffic grows by a factor of 10 about once every 47 months).

[Figure: ESnet accepted traffic, petabytes/month, over 13 years.]

16

We face a continuous growth of data transport
• The LHC data volume is predicted to grow 10-fold over the next 10 years.
• New generations of instruments – for example the Square Kilometer Array radio telescope and ITER (the international fusion experiment) – will generate more data than the LHC.
• In response, ESnet and most large R&E networks have built 100 Gb/s (per optical channel) networks.
  – ESnet's new network – ESnet5 – is complete and provides 44 x 100 Gb/s (4.4 terabits/sec = 4400 gigabits/sec) in optical channels across the entire ESnet national footprint.
  – Initially, one of these 100 Gb/s channels is configured to replace the current 4 x 10 Gb/s IP network.
• What has made this possible?

17

1a) Optical Network Technology
Modern optical transport systems (DWDM = dense wave division multiplexing) use a collection of technologies called "coherent optical" processing to achieve more sophisticated optical modulation, and therefore higher data density per signal transport unit (symbol), providing 100 Gb/s per wave (optical channel).
  – Optical transport uses dual polarization-quadrature phase shift keying (DP-QPSK) technology with coherent detection [OIF1].
    • dual polarization – two independent optical signals on the same frequency with orthogonal polarizations → reduces the symbol rate by half
    • quadrature phase shift keying – encodes data by changing the signal phase relative to the optical carrier → further reduces the symbol rate by half (sends twice as much data per symbol)
  – Together, DP and QPSK reduce the required symbol rate by a factor of 4, which allows the 100G payload (plus overhead) to fit into 50 GHz of spectrum. (A back-of-envelope check follows below.)
    • The actual transmission rate is about 10% higher, to include FEC data.
  – This is a substantial simplification of the optical technology involved – see the TNC 2013 paper and Chris Tracy's NANOG talk for details [Tracy1] and [Rob1].
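A quick arithmetic sketch of the factor-of-4 claim above (the ~10% FEC/framing overhead is the figure stated in the slide; the exact line rate of a given product will differ):

# Back-of-envelope check of the DP-QPSK numbers: 2 polarizations x 2 bits/symbol
# (QPSK) = 4 bits per symbol, so a ~110 Gb/s line rate needs ~28 Gbaud,
# which fits within a 50 GHz ITU grid channel.
payload_rate = 100e9              # b/s delivered to the client
line_rate = payload_rate * 1.10   # add ~10% FEC/framing overhead (assumed)
bits_per_symbol = 2 * 2
symbol_rate = line_rate / bits_per_symbol
print(f"symbol rate ~ {symbol_rate / 1e9:.0f} Gbaud")
print("fits in a 50 GHz channel:", symbol_rate < 50e9)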

18

Optical Network Technology
ESnet5's optical network uses Ciena's 6500 Packet-Optical Platform with WaveLogic™ to provide 100 Gb/s per wave:
  – 88 waves (optical channels), 100 Gb/s each
    • wave capacity shared equally with Internet2
  – ~13,000 miles / 21,000 km of lit fiber
  – 280 optical amplifier sites
  – 70 optical add/drop sites (where routers can be inserted)
    • 46 100G add/drop transponders
    • 22 100G re-gens across the wide area

[Map: the ESnet5 optical footprint – the national fiber ring, optical add/drop sites, connected DOE labs (e.g. LBNL/NERSC/JGI, SLAC, SNLL, ANL, FNAL, BNL, ORNL), the Long Island MAN, and the ANI Testbed; geography is only representational.]

19

1b) Network routers and switches
ESnet5 routing (IP layer 3) is provided by Alcatel-Lucent 7750 routers with 100 Gb/s client interfaces:
  – 17 routers with 100G interfaces (several more in a test environment)
  – 59 layer-3 100 GigE interfaces; 8 customer-owned 100G routers
  – 7 100G interconnects with other R&E networks at Starlight (Chicago), MAN LAN (New York), and Sunnyvale (San Francisco).

20

The Energy Sciences Network: ESnet5 (Fall 2013)

[Map: the ESnet5 national topology – ESnet routers and site routers at the DOE labs and user facilities (e.g. PNNL, INL, LBNL, NERSC, JGI, SLAC, SNLL, LLNL, SDSC, GA, LANL, SNLA, AMES, ANL, FNAL, ORNL, NREL, BNL, PPPL, JLAB, MIT/PSFC, LIGO), metro area circuits, 100G / 10-40G / 1G site-provided circuits, the SUNN–STAR–AOFA–AMST 100G testbed (SF Bay Area, Chicago, New York, Amsterdam), and commercial, US R&E, and international R&E peerings; geographical representation is approximate.]

2) Data transport: the limitations of TCP must be addressed for large, long-distance flows
Although there are other transport protocols available, TCP remains the workhorse of the Internet, including for data-intensive science.
Using TCP to support the sustained, long-distance, high data-rate flows of data-intensive science requires an error-free network.
Why error-free? TCP is a "fragile workhorse": it is very sensitive to packet loss (e.g. due to bit errors).
  – Very small packet loss rates on these paths result in large decreases in performance.
  – A single bit error will cause the loss of a 1-9 KBy packet (depending on the MTU size), as there is no FEC at the IP level for error correction.
    • This puts TCP back into "slow start" mode, thus reducing throughput.

22

Transport
• The reason for TCP's sensitivity to packet loss is the slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internet.
  – Packet loss is seen by TCP's congestion control algorithms as evidence of congestion, so they activate to slow down and prevent the synchronization of the senders (which would perpetuate and amplify the congestion, leading to network throughput collapse).
  – Network link errors also cause packet loss, so these congestion avoidance algorithms come into play, with dramatic effect on throughput in the wide area network – hence the need for "error-free."

23

Transport: impact of packet loss on TCP
On a 10 Gb/s LAN path the impact of low packet loss rates is minimal.
On a 10 Gb/s WAN path the impact of low packet loss rates is enormous (~80x throughput reduction on a transatlantic path).

Implication: error-free paths are essential for high-volume, long-distance data transfers. (A rough loss-vs-throughput estimate follows below.)

[Figure: throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss – measured and theoretical TCP Reno, measured H-TCP, and the no-loss case; throughput falls from ~10,000 Mb/s toward a small fraction of that as the round trip time grows to values corresponding roughly to San Francisco to London. See http://fasterdata.es.net/performance-testing/perfsonar/troubleshooting/packet-loss]
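The scale of this effect can be estimated with the well-known Mathis et al. approximation for loss-limited, Reno-style TCP throughput (not taken from the slides; the MSS and RTT values below are assumptions chosen to match the plot's scenario):

# Mathis approximation: rate ~ (MSS / RTT) * C / sqrt(loss), C ~ 1.22.
from math import sqrt

def mathis_rate(mss_bytes, rtt_s, loss):
    return 1.22 * mss_bytes * 8 / (rtt_s * sqrt(loss))

loss = 0.0046 / 100                 # the 0.0046% loss rate from the figure
for rtt_ms in (1, 10, 88):          # LAN, regional, roughly San Francisco - London
    rate = mathis_rate(1460, rtt_ms / 1000, loss)
    print(f"RTT {rtt_ms:3d} ms -> ~{min(rate, 10e9) / 1e6:7.0f} Mb/s (capped at 10 Gb/s)")

At 1 ms the loss barely matters; at ~88 ms the same loss rate limits a Reno-style flow to a few tens of Mb/s, which is the roughly 80x reduction quoted above.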

24

Transport: modern TCP stack
• A modern TCP stack (the kernel implementation of the TCP protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk]).
  – This is done using mechanisms that more quickly increase back to full speed after an error forces a reset to low bandwidth.

[Figure: "Binary Increase Congestion" (BIC) control algorithm impact – throughput (Mbits/second) over 5-second time slots at RTT = 67 ms for Linux 2.6 with BIC TCP, Linux 2.4, and Linux 2.6 with BIC off. BIC reaches maximum throughput much faster than the older algorithms. (From Linux 2.6.19 the default is CUBIC, a refined version of BIC designed for high bandwidth, long paths.)]

25

Transport: modern TCP stack
Even modern TCP stacks are only of some help in the face of packet loss on a long-path, high-speed network. (A sketch of selecting the congestion control algorithm per socket follows below.)
• For a detailed analysis of the impact of packet loss on various TCP implementations, see "An Investigation into Transport Protocols and Data Transport Applications Over High Performance Networks," chapter 8 ("Systematic Tests of New-TCP Behaviour"), by Yee-Ting Li, University College London (PhD thesis): http://www.slac.stanford.edu/~ytl/thesis.pdf

[Figure: tail zoom of the throughput vs. latency plot on a 10 Gb/s link with 0.0046% packet loss – measured Reno, theoretical Reno, and measured H-TCP (a CUBIC refinement); at round trip times corresponding roughly to San Francisco to London the loss-affected curves sit far below the clean-path throughput.]
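As a hedged illustration of "using a modern TCP stack" from an application's point of view (Linux only; which algorithms are available depends on the kernel modules loaded on the host):

# Minimal sketch: select a modern congestion control algorithm per socket.
import socket

TCP_CONGESTION = getattr(socket, "TCP_CONGESTION", 13)  # option number 13 on Linux

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
for algorithm in (b"htcp", b"cubic"):        # try H-TCP first, fall back to CUBIC
    try:
        s.setsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, algorithm)
        break
    except OSError:
        continue
print("congestion control in use:", s.getsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, 16))
s.close()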

26

3) Monitoring and testing
The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction.

perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.).
• perfSONAR is a community effort to:
  – define network management data exchange protocols, and
  – standardize measurement data formats, gathering, and archiving.
• perfSONAR is deployed extensively throughout LHC-related networks and international networks, and at the end sites (see [fasterdata], [perfSONAR], and [NetServ]).
  – There are now more than 1000 perfSONAR boxes installed in N. America and Europe.

27

perfSONAR
The test and monitor functions can detect soft errors that limit throughput and can be hard to find (hard errors / faults are easily found and corrected).

Soft failure example:
• Observed end-to-end performance degradation due to soft failure of a single optical line card.
[Figure: one month of throughput measurements (Gb/s) showing normal performance, a period of degrading performance, and recovery after the repair.]
• Why not just rely on "SNMP" interface stats for this sort of error detection?
  – not all error conditions show up in SNMP interface statistics
  – SNMP error statistics can be very noisy
  – some devices lump different error counters into the same bucket, so it can be very challenging to figure out which errors to alarm on and which to ignore (though ESnet's Spectrum monitoring system attempts to apply heuristics to do this)
  – many routers will silently drop packets – the only way to find that is to test through them and observe loss using devices other than the culprit device.

28

perfSONAR
The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains. It provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe. (A toy active-testing sketch follows below.)
  – ESnet has perfSONAR testers installed at every PoP and at all but the smallest user sites; Internet2 is close to the same.
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages.
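The essential idea perfSONAR embodies – regularly scheduled active tests whose archived history makes soft failures visible – can be caricatured in a few lines. This is purely an illustration, not perfSONAR or its API; the hostnames are placeholders and the output parsing assumes Linux ping.

# Toy periodic loss/latency probe between measurement hosts (illustration only).
import subprocess, time

TARGETS = ["perfsonar.example-site-a.org", "perfsonar.example-site-b.org"]

while True:
    for host in TARGETS:
        result = subprocess.run(["ping", "-c", "20", "-q", host],
                                capture_output=True, text=True)
        # On Linux, the last two lines of "ping -q" hold the loss and RTT summary.
        print(host, *result.stdout.strip().splitlines()[-2:], sep="\n  ")
    time.sleep(3600)   # in practice: repeat on a schedule and archive the results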

29

4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network:
• host TCP tuning
• a modern TCP stack (see above)
• other issues (MTU, etc.)
• data transfer tools and parallelism
• other data transfer issues (firewalls, etc.)

30

4.1) System software tuning: host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length.
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using end-to-end.
• Default TCP buffer sizes are typically much too small for today's high speed networks.
  – Until recently, default TCP send/receive buffers were typically 64 KB.
  – The tuned buffer needed to fill a CA-to-NY 1 Gb/s path is 10 MB – 150x bigger than the default buffer size. (The arithmetic is sketched below.)
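The required buffer is just the bandwidth-delay product of the path; the ~10 MB figure above falls out directly (the RTT values below are round-number assumptions):

# Bandwidth-delay product: bytes of TCP buffer needed to keep a path full.
def bdp_bytes(bandwidth_bps, rtt_ms):
    return bandwidth_bps / 8 * (rtt_ms / 1000)

for gbps, rtt in [(1, 80), (10, 80), (10, 150)]:   # CA-NY at 1G and 10G, intercontinental 10G
    print(f"{gbps:3d} Gb/s, {rtt:3d} ms RTT -> {bdp_bytes(gbps * 1e9, rtt) / 1e6:6.1f} MB buffer")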

31

System software tuning: host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications.
  – How to tune is a function of the application and the path to the destination, so there are potentially a lot of special cases.
• Auto-tuning the TCP connection buffer size within pre-configured limits helps.
• Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths.

32

System software tuning: host tuning – TCP

[Figure: throughput out to ~9000 km path length on a 10 Gb/s network, comparing a 32 MBy (auto-tuned) and a 64 MBy (hand-tuned) TCP window size; the hand-tuned window sustains substantially higher throughput at round trip times corresponding roughly to San Francisco to London.]

33

4.2) System software tuning: data transfer tools
Parallelism is key in data transfer tools. (A minimal parallel-stream sketch follows this list.)
  – It is much easier to achieve a given performance level with multiple parallel connections than with one connection, because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (the same is true for disks).
  – Several tools offer parallel transfers (see below).
Latency tolerance is critical.
  – Wide area data transfers have much higher latency than LAN transfers.
  – Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds); for example, SCP/SFTP and the HPSS mover protocols work very poorly on long-path networks.
• Disk performance:
  – In general you need a RAID array or parallel disks (as in FDT) to get more than about 500 Mb/s.
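A minimal sketch of why parallel streams help: N workers each pull a separate byte range of the same file over its own TCP connection. HTTP Range requests are used here purely for illustration (GridFTP and FDT implement parallelism natively); the URL and file size are placeholders.

# Parallel byte-range download over multiple TCP connections (illustration).
import concurrent.futures, urllib.request

URL = "http://data-node.example.org/dataset/file.bin"   # placeholder
SIZE = 8 * 1024**3                                       # assumed size: 8 GiB
STREAMS = 8

def fetch_range(index):
    chunk = SIZE // STREAMS
    start, end = index * chunk, (index + 1) * chunk - 1
    req = urllib.request.Request(URL, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as r, open(f"part{index:02d}", "wb") as out:
        while block := r.read(4 * 1024 * 1024):
            out.write(block)
    return index

with concurrent.futures.ThreadPoolExecutor(max_workers=STREAMS) as pool:
    list(pool.map(fetch_range, range(STREAMS)))   # reassemble the parts afterwards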

34

System software tuning: data transfer tools
Using the right tool is very important.
Sample results, Berkeley, CA to Argonne, IL (RTT = 53 ms, network capacity = 10 Gbps):
  Tool                       Throughput
  • scp                      140 Mbps
  • patched scp (HPN)        1.2 Gbps
  • ftp                      1.4 Gbps
  • GridFTP, 4 streams       5.4 Gbps
  • GridFTP, 8 streams       6.6 Gbps
Note that getting more than about 1 Gbps (125 MB/s) disk to disk requires using RAID technology.
• PSC (Pittsburgh Supercomputing Center) has a patch set that fixes problems with SSH – http://www.psc.edu/networking/projects/hpn-ssh – giving a significant performance increase (this helps rsync too).

35

System software tuning: data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems.
  – Parallel streams and buffer tuning help in getting through firewalls (open ports), ssh, etc.
  – The newer Globus Online incorporates all of these, plus small-file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP.

36

System software tuning: data transfer tools
Also see Caltech's FDT (Fast Data Transfer) approach.
  – Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node.
  – Explicit parallel use of multiple disks.
  – Can fill 100 Gb/s paths.
  – See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT

37

4.4) System software tuning: other issues
Firewalls are anathema to high-speed data flows.
  – Many firewalls can't handle >1 Gb/s flows:
    • they are designed for large numbers of low-bandwidth flows
    • some firewalls even strip out the TCP options that allow for TCP buffers >64 KB
  – See Jason Zurawski's "Say Hello to your Frienemy – The Firewall."
  – Stateful firewalls have inherent problems that inhibit high throughput: http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues:
  – large MTUs (several issues)
  – NIC tuning (defaults are usually fine for 1GE, but 10GE often requires additional tuning)
  – other OS tuning knobs
  – see fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk]).

5) Site infrastructure to support data-intensive science: the Science DMZ
With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck.
The site network (LAN) typically provides connectivity for the local resources – compute, data, instruments, collaboration systems, etc. – needed by data-intensive science.
  – Therefore a high performance interface between the wide area network and the local area site network is critical for large-scale data movement.
Campus network infrastructure is typically not designed to handle the flows of large-scale science.
  – The devices and configurations typically deployed to build LANs for business and small data-flow purposes usually don't work for large-scale data flows: firewalls, proxy servers, low-cost switches, and so forth – none of which will allow high-volume, high-bandwidth, long-distance data flows.

39

The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high bandwidth, large data volume, and high round trip time (RTT) (international paths) of the wide area network (WAN) flows (see [DIS]).
  – Otherwise the site will impose poor performance on the entire high speed data path, all the way back to the source.

40

The Science DMZ
The Science DMZ concept: the compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy.
  – Outside the site firewall – hence the term "Science DMZ."
  – With dedicated systems built and tuned for wide-area data transfer.
  – With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below).
  – A security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. supporting access control lists, private address space, etc.). (A toy ACL sketch follows below.)
This is so important that it was a requirement for the last round of NSF CC-NIE grants.
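To make the "tailored security policy" idea concrete: instead of a stateful firewall in the data path, the Science DMZ switch/router applies a short, explicit ACL for the data transfer service. The rules below are invented for illustration (documentation address ranges; GridFTP control port 2811 and OWAMP port 861 are just examples of science-service ports).

# Toy ACL check for a Data Transfer Node (illustration only).
import ipaddress

ALLOWED = [
    ("198.51.100.0/24", 2811),   # collaborator DTNs -> GridFTP control
    ("198.51.100.0/24", 443),    # Globus / HTTPS
    ("203.0.113.7/32",  861),    # perfSONAR measurement host -> OWAMP
]

def permitted(src_ip, dst_port):
    return any(ipaddress.ip_address(src_ip) in ipaddress.ip_network(net)
               and dst_port == port for net, port in ALLOWED)

print(permitted("198.51.100.20", 2811))   # True
print(permitted("192.0.2.9", 22))         # False - everything else is dropped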

41

The Science DMZ

(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)

[Diagram: the WAN connects through the border router to a Science DMZ router/switch (a WAN-capable device) that serves dedicated Data Transfer Nodes built and tuned for wide-area transfer, network monitoring and testing systems, and a high performance computing cluster over a clean, high-bandwidth WAN data path, with per-service security policy control points. The campus/site LAN, the site DMZ (Web, DNS, mail), and secured campus/site access to the Internet sit behind the site firewall, and campus/site access to Science DMZ resources is via the site firewall.]

42

6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
• In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers.
• The Tier 2 sites get a comparable amount of data from the Tier 1s. They:
  – host the physics groups that analyze the data and do the science
  – provide most of the compute resources for analysis
  – cache the data (though this is evolving to remote I/O).

43

Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management.
  – The resources and data movement are centrally managed.
  – Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations.
  – The system manages tens of thousands of jobs a day:
    • it coordinates data movement of hundreds of terabytes/day, and
    • it manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science.
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial.

44

[Diagram: the ATLAS PanDA "Production and Distributed Analysis" system, which uses distributed resources and layers of automation to manage several million jobs/day. ATLAS production jobs, regional production jobs, and user/group analysis jobs enter the PanDA Server (task management) through the Task Buffer (job queue); the Job Broker applies policy (job type, priority) and the Job Dispatcher sends work to sites. The CERN ATLAS detector and Tier 0 data center hold one archival copy of all data; the 11 ATLAS Tier 1 data centers, scattered across Europe, North America, and Asia, in aggregate hold one copy of all data and provide the working data set for distribution to Tier 2 centers for analysis; the ATLAS analysis sites include ~70 Tier 2 centers in Europe, North America, and SE Asia.
  1) PanDA schedules jobs and initiates data movement.
  2) The Distributed Data Manager (a complex system in its own right, called DQ2) locates data and moves it to sites.
  3) The job resource manager dispatches a "pilot" job manager – a PanDA job receiver – when resources are available at a site; pilots run under the local site job manager (e.g. Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA, similar to the Condor glide-in approach.
  4) Jobs are dispatched when there are resources available and when the required data is in place at the site.
The broker tries to move the job to where the data is, else it moves data and job to where resources are available. (A toy version of this rule follows below.)
(Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point; both are at Brookhaven National Lab.)]
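The brokering rule in the caption can be caricatured in a few lines. Everything here – site names, slot counts, dataset names – is invented for illustration; it is not the PanDA/DQ2 implementation.

# Toy sketch: run where the data is, else replicate the data to free resources.
SITES = {
    "BNL":  {"free_slots": 120, "datasets": {"data12_8TeV.A"}},
    "CNAF": {"free_slots": 0,   "datasets": {"data12_8TeV.B"}},
    "KIT":  {"free_slots": 300, "datasets": set()},
}

def broker(job):
    # 1) prefer a site that already holds the input dataset and has capacity
    for name, site in SITES.items():
        if job["dataset"] in site["datasets"] and site["free_slots"] > 0:
            return name, "run where the data is"
    # 2) otherwise pick the site with the most free slots and replicate the data
    name = max(SITES, key=lambda s: SITES[s]["free_slots"])
    return name, "replicate data, then run"

print(broker({"dataset": "data12_8TeV.A"}))   # ('BNL', 'run where the data is')
print(broker({"dataset": "data12_8TeV.B"}))   # ('KIT', 'replicate data, then run')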

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBy/day (~68 Gb/s).

PanDA manages 120,000–140,000 simultaneous jobs. (PanDA manages two types of jobs; they are shown separately here.)

It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.

[Figure: accumulated data volume on disk, growing to roughly 150 petabytes over four years, and the two PanDA job types, each reaching on the order of 50,000–100,000 concurrent jobs over one-year spans.]

46

Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure.
  – Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges."
  – Successful testing was required for sites to participate in LHC production.

47

Ramp-up of LHC traffic in ESnet

[Figure: ESnet traffic over time, showing the (estimated) "small" scale traffic during LHC data system testing, the LHC turn-on, and LHC operation. The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.]

48

6 cont) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN.
  – The LHCOPN is a collection of leased 10 Gb/s optical circuits.
  – The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
    • In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.

49

The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.
• The issues related to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance.
  – The security issues were the primary ones, and they were addressed by using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec]) – that is, only LHC data and compute servers are connected to the OPN.

50

The LHC OPN ndash Optical Private Network

[Diagram: abbreviated LHCOPN physical topology and architecture – CH-CERN (Tier 0) connected to the Tier 1 centers CA-TRIUMF, US-T1-BNL, US-FNAL-CMS, TW-ASGC, UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, and IT-INFN-CNAF.]

51

The LHC OPN – Optical Private Network
N.B.:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
  – The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
  – However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.

Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
  – In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic (and there are about 170 Tier 2 sites).
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – that is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.

53

The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems.
  – The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronting" for the NRENs), Internet2 (fronting for the US universities), etc.).
  – The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).
In this way the LHC traffic will use circuits designated by the network engineers.
  – This ensures continued good performance for the LHC and ensures that other traffic is not impacted – critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.

54

[Diagram: LHCONE – a global infrastructure for LHC Tier 1 data center to Tier 2 analysis center connectivity (April 2012). LHCONE VRF domains are shown for ESnet and Internet2 (USA), CANARIE (Canada), GÉANT (Europe), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), CUDI (Mexico), CERN, and networks in Korea, Taiwan, and India, interconnected through regional R&E communication nexuses such as Seattle, Chicago, New York, Washington, Amsterdam, and Geneva. End sites are LHC Tier 2 or Tier 3 centers unless indicated as Tier 1 (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1); data communication links are 10, 20, and 30 Gb/s. See http://lhcone.net for details.]

55

The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because:
  – the VRF technology is a standard capability in most core routers, and
  – there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See lhcone.net

LHCONE is one part of the network infrastructure that supports the LHC

CERN → Tier 1 distances:
  France            350 miles /   565 km
  Italy             570 miles /   920 km
  UK                625 miles /  1000 km
  Netherlands       625 miles /  1000 km
  Germany           700 miles /  1185 km
  Spain             850 miles /  1400 km
  Nordic           1300 miles /  2100 km
  USA – New York   3900 miles /  6300 km
  USA – Chicago    4400 miles /  7100 km
  Canada – BC      5200 miles /  8400 km
  Taiwan           6100 miles /  9850 km

[Diagram: a network-centric view of the LHC. The detector (1 PB/s) feeds the Level 1 and 2 triggers over O(1-10) meters, then the Level 3 trigger over O(10-100) meters, then the CERN computer center at O(1) km. The LHC Optical Private Network (LHCOPN) carries 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) over 500-10,000 km to the LHC Tier 1 data centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France). The LHC Open Network Environment (LHCONE) connects the Tier 1 centers to the LHC Tier 2 analysis centers and university physics groups – indicating that the physics groups now get their data wherever it is most readily available.]

57

7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed application systems in order to:
  – couple existing pockets of code, data, and expertise into "systems of systems"
  – break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
  – see https://www.es.net/about/science-requirements/
A commonly identified need to support this is that networking must be provided as a "service":
  – schedulable, with guaranteed bandwidth – as is done with CPUs and disks
  – traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
  – some network path characteristics may also be specified – e.g. diversity
  – available in the Web Services / Grid Services paradigm.
(A sketch of what such a request might contain follows.)
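To make "the network as a schedulable service" concrete, the essential contents of a guaranteed-bandwidth circuit request can be written down as a small data structure. The field names and endpoint identifiers below are invented for illustration; they are not the OSCARS or NSI schema.

# Hypothetical circuit-reservation request (illustration only).
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class CircuitRequest:
    src_endpoint: str        # e.g. site A border router port + VLAN
    dst_endpoint: str        # e.g. site B border router port + VLAN
    bandwidth_mbps: int      # guaranteed rate
    start: datetime
    end: datetime

start = datetime(2014, 4, 1, 2, 0)
req = CircuitRequest(
    src_endpoint="site-a.example.net:ge-1/0/0:vlan3001",
    dst_endpoint="site-b.example.net:xe-2/1/0:vlan3001",
    bandwidth_mbps=5000,
    start=start,
    end=start + timedelta(hours=6),
)
print(req)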

58

Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
  – This is typically done by using a "static" routing mechanism, e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path.
  – MPLS and OpenFlow are examples of this, and both can transport IP packets.
  – Most modern Internet routers have this type of functionality.
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage and optimize the use of available network resources and to keep big data flows separate from general traffic.
  – The virtual circuits can be directed to specific physical network paths when they are set up (see the toy label-table sketch below).
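A toy model of the "static routing" idea behind virtual circuits: each switch forwards on a pre-installed label table set up when the circuit is created, rather than making a per-packet routing decision. This is entirely illustrative (node names and labels are made up), not MPLS or OpenFlow itself.

# Follow a circuit through pre-installed (node, in_label) -> (next_node, out_label) entries.
FORWARDING = {
    ("ingress", 100): ("core-1", 210),
    ("core-1", 210): ("core-2", 355),
    ("core-2", 355): ("egress", 40),
}

def follow_circuit(node, label):
    path = [node]
    while (node, label) in FORWARDING:
        node, label = FORWARDING[(node, label)]
        path.append(node)
    return path

print(follow_circuit("ingress", 100))   # ['ingress', 'core-1', 'core-2', 'egress']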

59

Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.

60

End User View of Circuits – How They Use Them
• Who are the "users"?
  – Sites, for the most part.
• How are the circuits used?
  – End system to end system, IP:
    • almost never – very hard unless private address space is used
    • using public address space can result in leaking routes
    • using private address space with multi-homed hosts risks allowing backdoors into secure networks
  – End system to end system, Ethernet (or other) over VLAN – a pseudowire:
    • relatively common
    • interesting example: RDMA over VLAN is likely to be popular in the future
      – the SC11 demo of 40G RDMA over the WAN was very successful
      – the CPU load for RDMA is a small fraction of that for IP
      – the guaranteed network characteristics of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fit nicely with circuit services (RDMA performs very poorly on best-effort networks)
  – Point-to-point connection between routing instances – e.g. BGP at the end points:
    • essentially this is how all current circuits are used, from one site router to another site router
    • typically site-to-site, or to advertise subnets that host clusters, e.g. LHC analysis or data management clusters.

61

End User View of Circuits – How They Use Them
• When are the circuits used?
  – Mostly to solve a specific problem that the general infrastructure cannot.
  – Most circuits are used for a guarantee of bandwidth or for user traffic engineering.

62

Cross-Domain Virtual Circuit Service
• Large-scale science always involves institutions in multiple network domains (administrative units).
  – For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits.
  – E.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.

63

Inter-Domain Control Protocol
• There are two realms involved:
  1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
  2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.

[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] through ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]; each domain runs a local inter-domain controller (IDC), e.g. OSCARS in ESnet and AutoBAHN in GÉANT.
  1) The domains exchange topology information containing at least the potential VC ingress and egress points.
  2) A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
  3) The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process at each domain ingress/egress point.]

64

Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
  – Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system.
• Multi-domain circuit setup is not yet a robust production service, but progress is being made – see lhcone.net.

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply
There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
• It might be that in the case of the SKA the T1 links would come to a centralized, distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE

74

LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded
All high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
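Why the error-free requirement is so stringent can be made concrete with the widely used Mathis et al. throughput model for loss-based TCP, rate ≈ (MSS / RTT) * (C / sqrt(p)), with C ≈ 1.22 for Reno-style stacks. The sketch below is only an illustration of that formula, not something from the talk's material; the path parameters are assumptions chosen to resemble a trans-Atlantic link.

```python
# Illustration of why long-RTT paths must be kept nearly error-free:
# the Mathis et al. model for loss-based (Reno-style) TCP,
#   rate ~ (MSS / RTT) * (C / sqrt(loss_probability)),  C ~ 1.22
from math import sqrt

def mathis_rate_gbps(mss_bytes=1460, rtt_s=0.150, loss=1e-5, c=1.22):
    """Approximate steady-state single-flow TCP throughput in Gb/s."""
    return (mss_bytes * 8 / rtt_s) * (c / sqrt(loss)) / 1e9

# A 150 ms (roughly trans-Atlantic) path: even a 0.001% loss rate caps a
# single Reno-style flow far below 10 Gb/s, consistent with the loss
# measurements shown earlier in this talk.
for loss in (1e-7, 1e-6, 1e-5, 1e-4):
    print(f"loss={loss:.0e}  ->  ~{mathis_rate_gbps(loss=loss):.2f} Gb/s")
```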

Re-engineering the site LAN-WAN architecture is critical: the Science DMZ

Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
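As a purely illustrative sketch (the milestones and rates here are assumptions, not SKA or LHC planning figures, and ramp_schedule is a hypothetical helper), a service-challenge ramp can be laid out as a schedule of sustained-rate goals that doubles from one challenge to the next and finishes above the at-scale rate before turn-on. The LHC collaborations followed essentially this kind of progression over several years of data challenges.

```python
# Minimal sketch (assumptions only; not an SKA or LHC planning tool):
# lay out a geometric "service challenge" ramp so that synthetic-data
# transfers reach the full at-scale rate well before instrument turn-on.

def ramp_schedule(target_gbps, n_challenges, headroom=1.0):
    """Return per-challenge sustained-rate goals, ending above target*headroom."""
    goals = []
    rate = target_gbps * headroom
    for _ in range(n_challenges):
        goals.append(round(rate, 1))
        rate /= 2.0  # each earlier challenge runs at half the next one's rate
    return list(reversed(goals))

# e.g. six challenges building to 120% of a 100 Gb/s at-scale flow
print(ramp_schedule(100, 6, headroom=1.2))
# [3.8, 7.5, 15.0, 30.0, 60.0, 120.0]
```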

75

The Message, Again …
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments
But once this is done, international high-speed data management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-010.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, Robertson, D., Thompson, M., Lee, J., Tierney, B., Johnston, W., Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C., Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010.

(may be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

  • Foundations of data-intensive science: Technology and practice
  • Data-Intensive Science in DOE's Office of Science
  • DOE Office of Science and ESnet – the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport: The limitations of TCP must be addressed for
  • Transport
  • Transport: Impact of packet loss on TCP
  • Transport: Modern TCP stack
  • Transport: Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 4.1) System software tuning: Host tuning – TCP
  • System software tuning: Host tuning – TCP
  • System software tuning: Host tuning – TCP
  • 4.2) System software tuning: Data transfer tools
  • System software tuning: Data transfer tools
  • System software tuning: Data transfer tools (2)
  • System software tuning: Data transfer tools (3)
  • 4.4) System software tuning: Other issues
  • 5) Site infrastructure to support data-intensive science: The Science DMZ
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN – Optical Private Network
  • The LHC OPN – Optical Private Network (2)
  • The LHC OPN – Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHC's Open Network Environment – LHCONE
  • Slide 54
  • The LHC's Open Network Environment – LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits – How They Use Them
  • End User View of Circuits – How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide R&D consulting and knowledge base
  • Provide R&D consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 11: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

11

HEP as a Prototype for Data-Intensive ScienceThe capabilities required to support this scale of data

movement involve hardware and software developments at all levels1 The underlying network

1a Optical signal transport1b Network routers and switches

2 Data transport (TCP is a ldquofragile workhorserdquo but still the norm)3 Network monitoring and testing4 Operating system evolution5 New site and network architectures6 Data movement and management techniques and software7 New network services

bull Technology advances in these areas have resulted in todayrsquos state-of-the-art that makes it possible for the LHC experiments to routinely and continuously move data at ~150 Gbs across three continents

12

HEP as a Prototype for Data-Intensive Sciencebull ESnet has been collecting requirements for all DOE science

disciplines and instruments that rely on the network for distributed data management and analysis for more than a decade and formally since 2007 [REQ] In this process certain issues are seen across essentially all science

disciplines that rely on the network for significant data transfer even if the quantities are modest compared to project like the LHC experiments

Therefore addressing the LHC issues is a useful exercise that can benefit a wide range of science disciplines

SKA data flow model is similar to the LHCreceptorssensors

correlator data processor

supercomputer

European distribution point

~200km avg

~1000 km

~25000 km(Perth to London via USA)

or~13000 km

(South Africa to London)

Regionaldata center

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

Universitiesastronomy

groups

93 ndash 168 Pbs

400 Tbs

100 Gbs

from SKA RFI

Hypothetical(based on the

LHC experience)

These numbers are based on modeling prior to splitting the

SKA between S Africa and Australia)

Regionaldata center

Regionaldata center

14

Foundations of data-intensive sciencebull This talk looks briefly at the nature of the advances in

technologies software and methodologies that have enabled LHC data management and analysis The points 1a and 1b on optical transport and router technology are

included in the slides for completeness but I will not talk about them They were not really driven by the needs of the LHC but they were opportunistically used by the LHC

Much of the reminder of the talk is a tour through ESnetrsquos network performance knowledge base (fasterdataesnet)

ndash Also included arebull the LHC ATLAS data management and analysis approach that generates

and relies on very large network data utilizationbull and an overview of how RampE network have evolved to accommodate the

LHC traffic

1) Underlying network issuesAt the core of our ability to transport the volume of data

that we must deal with today and to accommodate future growth are advances in optical transport technology and

router technology

0

5

10

15

Peta

byte

sm

onth

13 years

We face a continuous growth of data to transport

ESnet has seen exponential growth in our traffic every year since 1990 (our traffic grows by factor of 10 about once every 47 months)

16

We face a continuous growth of data transportbull The LHC data volume is predicated to grow 10 fold over the

next 10 yearsNew generations of instruments ndash for example the Square

Kilometer Array radio telescope and ITER (the international fusion experiment) ndash will generate more data than the LHC

In response ESnet and most large RampE networks have built 100 Gbs (per optical channel) networksndash ESnets new network ndash ESnet5 ndash is complete and provides a 44 x

100Gbs (44 terabitssec - 4400 gigabitssec) in optical channels across the entire ESnet national footprint

ndash Initially one of these 100 Gbs channels is configured to replace the current 4 x 10 Gbs IP network

bull What has made this possible

17

1a) Optical Network TechnologyModern optical transport systems (DWDM = dense wave

division multiplexing) use a collection of technologies called ldquocoherent opticalrdquo processing to achieve more sophisticated optical modulation and therefore higher data density per signal transport unit (symbol) that provides 100Gbs per wave (optical channel)ndash Optical transport using dual polarization-quadrature phase shift keying

(DP-QPSK) technology with coherent detection [OIF1]bull dual polarization

ndash two independent optical signals same frequency orthogonal two polarizations rarr reduces the symbol rate by half

bull quadrature phase shift keying ndash encode data by changing the signal phase of the relative to the optical carrier further reduces the symbol rate by half (sends twice as much data symbol)

Together DP and QPSK reduce required rate by a factor of 4ndash allows 100G payload (plus overhead) to fit into 50GHz of spectrum

bull Actual transmission rate is about 10 higher to include FEC data

ndash This is a substantial simplification of the optical technology involved ndash see the TNC 2013 paper and Chris Tracyrsquos NANOG talk for details [Tracy1] and [Rob1]

18

Optical Network Technology ESnet5rsquos optical network uses Cienarsquos 6500 Packet-Optical Platform with

WaveLogictrade to provide 100Gbs wavendash 88 waves (optical channels) 100Gbs each

bull wave capacity shared equally with Internet2ndash ~13000 miles 21000 km lit fiberndash 280 optical amplifier sitesndash 70 optical adddrop sites (where routers can be inserted)

bull 46 100G adddrop transpondersbull 22 100G re-gens across wide-area

NEWG

SUNN

KANSDENV

SALT

BOIS

SEAT

SACR

WSAC

LOSA

LASV

ELPA

ALBU

ATLA

WASH

NEWY

BOST

SNLL

PHOE

PAIX

NERSC

LBNLJGI

SLAC

NASHCHAT

CLEV

EQCH

STA

R

ANLCHIC

BNL

ORNL

CINC

SC11

STLO

Internet2

LOUI

FNA

L

Long IslandMAN and

ANI Testbed

O

JACKGeography is

only representational

19

1b) Network routers and switchesESnet5 routing (IP layer 3) is provided by Alcatel-Lucent

7750 routers with 100 Gbs client interfacesndash 17 routers with 100G interfaces

bull several more in a test environment ndash 59 layer-3 100GigE interfaces 8 customer-owned 100G routersndash 7 100G interconnects with other RampE networks at Starlight (Chicago)

MAN LAN (New York) and Sunnyvale (San Francisco)

20

Metro area circuits

SNLL

PNNL

MIT

PSFC

AMES

LLNL

GA

JGI

LBNL

SLACNER

SC

ORNL

ANLFNAL

SALT

INL

PU Physics

SUNN

SEAT

STAR

CHIC

WASH

ATLA

HO

US

BOST

KANS

DENV

ALBQ

LASV

BOIS

SAC

R

ELP

A

SDSC

10

Geographical representation is

approximate

PPPL

CH

AT

10

SUNN STAR AOFA100G testbed

SF Bay Area Chicago New York AmsterdamAMST

US RampE peerings

NREL

Commercial peerings

ESnet routers

Site routers

100G

10-40G

1G Site provided circuits

LIGO

Optical only

SREL

100thinsp

Intrsquol RampE peerings

100thinsp

JLAB

10

10100thinsp

10

100thinsp100thinsp

1

10100thinsp

100thinsp1

100thinsp100thinsp

100thinsp

100thinsp

BNL

NEWY

AOFA

NASH

1

LANL

SNLA

10

10

1

10

10

100thinsp

100thinsp

100thinsp10

1010

100thinsp

100thinsp

10

10

100thinsp

100thinsp

100thinsp

100thinsp

100thinsp

100thinsp100thinsp

100thinsp

10

100thinsp

The Energy Sciences Network ESnet5 (Fall 2013)

2) Data transport The limitations of TCP must be addressed for large long-distance flows

Although there are other transport protocols available TCP remains the workhorse of the Internet including for data-

intensive scienceUsing TCP to support the sustained long distance high data-

rate flows of data-intensive science requires an error-free network

Why error-freeTCP is a ldquofragile workhorserdquo It is very sensitive to packet loss (due to bit errors)ndash Very small packet loss rates on these paths result in large decreases

in performance)ndash A single bit error will cause the loss of a 1-9 KBy packet (depending

on the MTU size) as there is no FEC at the IP level for error correctionbull This puts TCP back into ldquoslow startrdquo mode thus reducing throughput

22

Transportbull The reason for TCPrsquos sensitivity to packet loss is that the

slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internetndash Packet loss is seen by TCPrsquos congestion control algorithms as

evidence of congestion so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion leading to network throughput collapse)

ndash Network link errors also cause packet loss so these congestion avoidance algorithms come into play with dramatic effect on throughput in the wide area network ndash hence the need for ldquoerror-freerdquo

23

Transport Impact of packet loss on TCPOn a 10 Gbs LAN path the impact of low packet loss rates is

minimalOn a 10Gbs WAN path the impact of low packet loss rates is

enormous (~80X throughput reduction on transatlantic path)

Implications Error-free paths are essential for high-volume long-distance data transfers

Throughput vs increasing latency on a 10Gbs link with 00046 packet loss

Reno (measured)

Reno (theory)

H-TCP(measured)

No packet loss

(see httpfasterdataesnetperformance-testingperfso

nartroubleshootingpacket-loss)

Network round trip time ms (corresponds roughly to San Francisco to London)

10000

9000

8000

7000

6000

5000

4000

3000

2000

1000

0

Thro

ughp

ut M

bs

24

Transport Modern TCP stackbull A modern TCP stack (the kernel implementation of the TCP

protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])ndash This is done using mechanisms that more quickly increase back to full

speed after an error forces a reset to low bandwidth

TCP Results

0

100

200

300

400

500

600

700

800

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35time slot (5 second intervals)

Mbi

tss

econ

d

Linux 26 BIC TCPLinux 24Linux 26 BIC off

RTT = 67 ms

ldquoBinary Increase Congestionrdquo control algorithm impact

Note that BIC reaches max throughput much faster than older algorithms (from Linux 2619 the

default is CUBIC a refined version of BIC designed for high bandwidth

long paths)

25

Transport Modern TCP stackEven modern TCP stacks are only of some help in the face of

packet loss on a long path high-speed network

bull For a detailed analysis of the impact of packet loss on various TCP implementations see ldquoAn Investigation into Transport Protocols and Data Transport Applications Over High Performance Networksrdquo chapter 8 (ldquoSystematic Tests of New-TCP Behaviourrdquo) by Yee-Ting Li University College London (PhD thesis) httpwwwslacstanfordedu~ytlthesispdf

Reno (measured)

Reno (theory)

H-TCP (CUBIC refinement)(measured)

Throughput vs increasing latency on a 10Gbs link with 00046 packet loss(tail zoom)

Roundtrip time ms (corresponds roughly to San Francisco to London)

1000

900800700600500400300200100

0

Thro

ughp

ut M

bs

26

3) Monitoring and testingThe only way to keep multi-domain international scale networks error-free is to test and monitor continuously

end-to-end to detect soft errors and facilitate their isolation and correction

perfSONAR provides a standardize way to test measure export catalogue and access performance data from many different network domains (service providers campuses etc)

bull perfSONAR is a community effort tondash define network management data exchange protocols andndash standardized measurement data formats gathering and archiving

perfSONAR is deployed extensively throughout LHC related networks and international networks and at the end sites(See [fasterdata] [perfSONAR] and [NetSrv])

ndash There are now more than 1000 perfSONAR boxes installed in N America and Europe

27

perfSONARThe test and monitor functions can detect soft errors that limit

throughput and can be hard to find (hard errors faults are easily found and corrected)

Soft failure examplebull Observed end-to-end performance degradation due to soft failure of single optical line card

Gb

s

normal performance

degrading performance

repair

bull Why not just rely on ldquoSNMPrdquo interface stats for this sort of error detectionbull not all error conditions show up in SNMP interface statisticsbull SNMP error statistics can be very noisybull some devices lump different error counters into the same bucket so it can be very

challenging to figure out what errors to alarm on and what errors to ignorebull though ESnetrsquos Spectrum monitoring system attempts to apply heuristics to do this

bull many routers will silently drop packets - the only way to find that is to test through them and observe loss using devices other than the culprit device

one month

28

perfSONARThe value of perfSONAR increases dramatically as it is

deployed at more sites so that more of the end-to-end (app-to-app) path can characterized across multiple network domains provides the only widely deployed tool that can monitor circuits end-

to-end across the different networks from the US to Europendash ESnet has perfSONAR testers installed at every PoP and all but the

smallest user sites ndash Internet2 is close to the same

bull perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) and the protocol is implemented using SOAP XML messages

29

4) System software evolution and optimizationOnce the network is error-free there is still the issue of

efficiently moving data from the application running on a user system onto the network

bull Host TCP tuningbull Modern TCP stack (see above)bull Other issues (MTU etc)

bull Data transfer tools and parallelism

bull Other data transfer issues (firewalls etc)

30

41) System software tuning Host tuning ndash TCPbull ldquoTCP tuningrdquo commonly refers to the proper configuration of

TCP windowing buffers for the path lengthbull It is critical to use the optimal TCP send and receive socket

buffer sizes for the path (RTT) you are using end-to-endDefault TCP buffer sizes are typically much too small for

todayrsquos high speed networksndash Until recently default TCP sendreceive buffers were typically 64 KBndash Tuned buffer to fill CA to NY 1 Gbs path 10 MB

bull 150X bigger than the default buffer size

31

System software tuning Host tuning ndash TCPbull Historically TCP window size tuning parameters were host-

global with exceptions configured per-socket by applicationsndash How to tune is a function of the application and the path to the

destination so potentially a lot of special cases

Auto-tuning TCP connection buffer size within pre-configured limits helps

Auto-tuning however is not a panacea because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (eg international) paths

32

System software tuning Host tuning ndash TCP

Throughput out to ~9000 km on a 10Gbs network32 MBy (auto-tuned) vs 64 MBy (hand tuned) TCP window size

hand tuned to 64 MBy window

Roundtrip time ms (corresponds roughlyto San Francisco to London)

path length

10000900080007000600050004000300020001000

0

Thro

ughp

ut M

bs

auto tuned to 32 MBy window

33

42) System software tuning Data transfer toolsParallelism is key in data transfer tools

ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection

bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)

ndash Several tools offer parallel transfers (see below)

Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN

transfersndash Many tools and protocols assume latencies typical of a LAN

environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long

path networks

bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more

than about 500 Mbs

34

System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL

RTT = 53 ms network capacity = 10GbpsTool Throughput

bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology

bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase

bull this helps rsync too

35

System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-

performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open

ports) ssh etc The newer Globus Online incorporates all of these and small file

support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community

outside of HEP

36

System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach

ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node

ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and

httpmonalisacernchFDT

37

44) System software tuning Other issuesFirewalls are anathema to high-peed data flows

ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for

TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo

Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf

bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning

bull Defaults are usually fine for 1GE but 10GE often requires additional tuning

ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo

([HPBulk])

5) Site infrastructure to support data-intensive scienceThe Science DMZ

With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the

bottleneckThe site network (LAN) typically provides connectivity for local

resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network

and the local area site network is critical for large-scale data movement

Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks

for business and small data-flow purposes usually donrsquot work for large-scale data flows

bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data

flows

39

The Science DMZTo provide high data-rate access to local resources the site

LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high

speed data path all the way back to the source

40

The Science DMZThe ScienceDMZ concept

The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and

rapid fault isolation typically perfSONAR (see [perfSONAR] and below)

A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)

This is so important it was a requirement for last round of NSF CC-NIE grants

41

The Science DMZ

(See httpfasterdataesnetscience-dmz

and [SDMZ] for a much more complete

discussion of the various approaches)

campus siteLAN

high performanceData Transfer Node

computing cluster

cleanhigh-bandwidthWAN data path

campussiteaccess to

Science DMZresources is via the site firewall

secured campussiteaccess to Internet

border routerWAN

Science DMZrouterswitch

campus site

Science DMZ

Site DMZ WebDNS

Mail

network monitoring and testing

A WAN-capable device

per-servicesecurity policycontrol points

site firewall

dedicated systems built and

tuned for wide-area data transfer

42

6) Data movement and management techniquesAutomated data movement is critical for moving 500

terabytesday between 170 international sites In order to effectively move large amounts of data over the

network automated systems must be used to manage workflow and error recovery

bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers

bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)

43

Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the

analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates

compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day

bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10

petabytes of datayear in order to accomplish its science

bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial

44

DDMAgent

DDMAgent

ATLAS production

jobs

Regional production

jobs

User Group analysis jobs

Data Service

Task Buffer(job queue)

Job Dispatcher

PanDA Server(task management)

Job Broker

Policy(job type priority)

ATLA

S Ti

er 1

Data

Cen

ters

11 s

ites

scat

tere

d ac

ross

Euro

pe N

orth

Am

erica

and

Asia

in

aggr

egat

e ho

ld 1

copy

of a

ll dat

a an

d pr

ovide

the

work

ing

data

set f

or d

istrib

ution

to T

ier 2

cen

ters

for a

nalys

isDistributed

Data Manager

Pilot Job(Panda job

receiver running under the site-

specific job manager)

Grid Scheduler

Site Capability Service

CERNATLAS detector

Tier 0 Data Center(1 copy of all data ndash

archival only)

Job resource managerbull Dispatch a ldquopilotrdquo job manager - a

Panda job receiver - when resources are available at a site

bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA

bull Similar to the Condor Glide-in approach

Site status

ATLAS analysis sites(eg 70 Tier 2 Centers in

Europe North America and SE Asia)

DDMAgent

DDMAgent

1) Schedules jobs initiates data movement

2) DDM locates data and moves it to sites

This is a complex system in its own right called DQ2

3) Prepares the local resources to receive Panda jobs

4) Jobs are dispatched when there are resources available and when the required data is

in place at the site

Thanks to Michael Ernst US ATLAS technical lead for his assistance with this

diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)

The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday

CERN

Try to move the job to where the data is else move data and job to where

resources are available

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe N America and SE Asia generate network

data movement of 730 TByday ~68Gbs

Accumulated data volume on disk

730 TBytesday

PanDA manages 120000ndash140000 simultaneous jobs    (PanDA manages two types of jobs that are shown separately here)

It is this scale of data movementgoing on 24 hrday 9+ monthsyr

that networks must support in order to enable the large-scale science of the LHC

0

50

100

150

Peta

byte

s

four years

0

50000

100000

type

2jo

bs

0

50000

type

1jo

bs

one year

one year

one year

46

Building an LHC-scale production analysis system In order to debug and optimize the distributed system that

accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in

ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC

production

47

Ramp-up of LHC traffic in ESnet

(est of ldquosmallrdquo scale traffic)

LHC

turn

-on

LHC data systemtesting

LHC operationThe transition from testing to operation

was a smooth continuum due toat-scale testing ndash a process that took

more than 5 years

48

6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument

to data centers ndash a dedicated purpose-built infrastructure is needed

bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to

the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the

Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward

exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community

bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by

bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])

ndash that is only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ

Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using

synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on

75

The MessageAgain hellipA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data

management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

76

References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston

Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz

[fasterdata] See httpfasterdataesnetfasterdataperfSONAR

[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more

[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory

[LHCONE] httplhconenet

[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo

[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

[OIF1] OIF-FD-100G-DWDM-010 - 100G Ultra Long Haul DWDM Framework Document (June 2009) httpwwwoiforumcompublicdocumentsOIF-FD-100G-DWDM-010pdf

77

References

[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006 – 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References

[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

  • Foundations of data-intensive science: Technology and practice
  • Data-Intensive Science in DOE's Office of Science
  • DOE Office of Science and ESnet – the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 4.1) System software tuning: Host tuning – TCP
  • System software tuning: Host tuning – TCP
  • System software tuning: Host tuning – TCP
  • 4.2) System software tuning: Data transfer tools
  • System software tuning: Data transfer tools
  • System software tuning: Data transfer tools (2)
  • System software tuning: Data transfer tools (3)
  • 4.4) System software tuning: Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN – Optical Private Network
  • The LHC OPN – Optical Private Network (2)
  • The LHC OPN – Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHC's Open Network Environment – LHCONE
  • Slide 54
  • The LHC's Open Network Environment – LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits – How They Use Them
  • End User View of Circuits – How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide R&D consulting and knowledge base
  • Provide R&D consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 13: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

SKA data flow model is similar to the LHC
[Diagram: hypothetical SKA data flow, based on the LHC experience; rates are from the SKA RFI and reflect modeling done prior to splitting the SKA between S. Africa and Australia. Receptors/sensors (9.3 – 16.8 Pb/s aggregate) feed a correlator / data processor (~200 km average distance), which sends 400 Tb/s to a supercomputer (~1000 km away). The supercomputer output, 100 Gb/s, goes to a European distribution point (~25,000 km, Perth to London via the USA, or ~13,000 km, South Africa to London), and from there to regional data centers that serve many university astronomy groups.]

14

Foundations of data-intensive science
• This talk looks briefly at the nature of the advances in technologies, software, and methodologies that have enabled LHC data management and analysis.
• The points 1a and 1b on optical transport and router technology are included in the slides for completeness, but I will not talk about them. They were not really driven by the needs of the LHC, but they were opportunistically used by the LHC.
• Much of the remainder of the talk is a tour through ESnet's network performance knowledge base (fasterdata.es.net).
  – Also included are:
    • the LHC ATLAS data management and analysis approach that generates, and relies on, very large network data utilization
    • and an overview of how R&E networks have evolved to accommodate the LHC traffic

1) Underlying network issues
At the core of our ability to transport the volume of data that we must deal with today, and to accommodate future growth, are advances in optical transport technology and router technology.

[Figure: ESnet accepted traffic in Petabytes/month over 13 years, showing exponential growth.]

We face a continuous growth of data to transport

ESnet has seen exponential growth in its traffic every year since 1990 (traffic grows by a factor of 10 about once every 47 months).
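
To put that growth rate in more familiar terms, a quick back-of-the-envelope conversion (a simple arithmetic sketch, not an ESnet projection):

    # Equivalent annual growth implied by "a factor of 10 about every 47 months"
    annual_factor = 10 ** (12 / 47)
    print(f"~{annual_factor:.2f}x per year")        # roughly 1.8x every year, sustained
    print(f"~{annual_factor ** 10:.0f}x per decade")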

16

We face a continuous growth of data transport
• The LHC data volume is predicted to grow 10-fold over the next 10 years.
• New generations of instruments – for example the Square Kilometer Array radio telescope and ITER (the international fusion experiment) – will generate more data than the LHC.
• In response, ESnet and most large R&E networks have built 100 Gb/s (per optical channel) networks.
  – ESnet's new network – ESnet5 – is complete and provides 44 x 100 Gb/s (4.4 terabits/sec – 4400 gigabits/sec) in optical channels across the entire ESnet national footprint.
  – Initially, one of these 100 Gb/s channels is configured to replace the current 4 x 10 Gb/s IP network.
• What has made this possible?

17

1a) Optical Network Technology
Modern optical transport systems (DWDM = dense wave division multiplexing) use a collection of technologies called "coherent optical" processing to achieve more sophisticated optical modulation, and therefore higher data density per signal transport unit (symbol), which provides 100 Gb/s per wave (optical channel).
  – Optical transport uses dual polarization-quadrature phase shift keying (DP-QPSK) technology with coherent detection [OIF1]:
    • dual polarization – two independent optical signals on the same frequency with orthogonal polarizations → reduces the symbol rate by half
    • quadrature phase shift keying – encodes data by changing the phase of the signal relative to the optical carrier → further reduces the symbol rate by half (sends twice as much data per symbol)
  – Together, DP and QPSK reduce the required symbol rate by a factor of 4, which allows the 100G payload (plus overhead) to fit into 50 GHz of spectrum.
    • The actual transmission rate is about 10% higher, to include FEC data.
  – This is a substantial simplification of the optical technology involved – see the TNC 2013 paper and Chris Tracy's NANOG talk for details: [Tracy1] and [Rob1].
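
As a back-of-the-envelope check on the factor-of-4 reduction, the implied symbol rate can be worked out directly (an illustrative sketch only; the ~10% overhead figure is the round number from the slide, and exact framing overheads vary by vendor):

    # Symbol-rate arithmetic for a 100G DP-QPSK wave (illustrative values).
    line_rate_gbps = 100.0          # payload rate per optical channel
    overhead = 0.10                 # approximate FEC + framing overhead (from the slide)
    bits_per_symbol = 2 * 2         # 2 polarizations x 2 bits/symbol (QPSK)

    gross_rate_gbps = line_rate_gbps * (1 + overhead)
    symbol_rate_gbaud = gross_rate_gbps / bits_per_symbol
    print(f"gross line rate: {gross_rate_gbps:.0f} Gb/s")
    print(f"symbol rate:     {symbol_rate_gbaud:.1f} Gbaud")   # ~27.5 Gbaud, which fits a 50 GHz channel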

18

Optical Network Technology
• ESnet5's optical network uses Ciena's 6500 Packet-Optical Platform with WaveLogic™ to provide 100 Gb/s waves:
  – 88 waves (optical channels), 100 Gb/s each
    • wave capacity shared equally with Internet2
  – ~13,000 miles / 21,000 km of lit fiber
  – 280 optical amplifier sites
  – 70 optical add/drop sites (where routers can be inserted)
    • 46 100G add/drop transponders
    • 22 100G re-gens across the wide area

[Map: ESnet5 optical add/drop and regeneration sites across the national footprint (hubs such as SEAT, SUNN, SACR, LOSA, ELPA, ALBU, DENV, SALT, KANS, CHIC/STAR, NASH, ATLA, WASH, NEWY, BOST, and lab sites including LBNL, NERSC, JGI, SLAC, SNLL, ANL, FNAL, BNL, ORNL), plus the Long Island MAN and ANI Testbed. Geography is only representational.]

19

1b) Network routers and switches
• ESnet5 routing (IP layer 3) is provided by Alcatel-Lucent 7750 routers with 100 Gb/s client interfaces:
  – 17 routers with 100G interfaces (several more in a test environment)
  – 59 layer-3 100GigE interfaces; 8 customer-owned 100G routers
  – 7 100G interconnects with other R&E networks at Starlight (Chicago), MAN LAN (New York), and Sunnyvale (San Francisco)

20

[Map: The Energy Sciences Network, ESnet5 (Fall 2013) – ESnet routers and site routers connected by 100G, 10-40G, and 1G site-provided circuits and metro area circuits, with optical-only segments, commercial peerings, US and international R&E peerings, and the SUNN–STAR–AOFA–AMST 100G testbed (SF Bay Area, Chicago, New York, Amsterdam). Connected sites include the DOE labs (LBNL, NERSC, JGI, SLAC, SNLL, LLNL, LANL, SNLA, GA, PNNL, INL, AMES, ANL, FNAL, ORNL, BNL, PPPL, JLAB, NREL, SREL) and others such as MIT/PSFC, LIGO, PU Physics, and SDSC. Geographical representation is approximate.]

2) Data transport: The limitations of TCP must be addressed for large, long-distance flows
• Although there are other transport protocols available, TCP remains the workhorse of the Internet, including for data-intensive science.
• Using TCP to support the sustained, long-distance, high data-rate flows of data-intensive science requires an error-free network.
• Why error-free? TCP is a "fragile workhorse": it is very sensitive to packet loss (due to bit errors).
  – Very small packet loss rates on these paths result in large decreases in performance.
  – A single bit error will cause the loss of a 1-9 KB packet (depending on the MTU size), as there is no FEC at the IP level for error correction.
    • This puts TCP back into "slow start" mode, thus reducing throughput.

22

Transport
• The reason for TCP's sensitivity to packet loss is the slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internet.
  – Packet loss is seen by TCP's congestion control algorithms as evidence of congestion, so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion, leading to network throughput collapse).
  – Network link errors also cause packet loss, so these congestion avoidance algorithms come into play, with dramatic effect on throughput in the wide area network – hence the need for "error-free".

23

Transport: Impact of packet loss on TCP
• On a 10 Gb/s LAN path the impact of low packet loss rates is minimal.
• On a 10 Gb/s WAN path the impact of low packet loss rates is enormous (~80X throughput reduction on a transatlantic path).
• Implication: error-free paths are essential for high-volume, long-distance data transfers.

[Figure: Throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss, for Reno (measured), Reno (theory), H-TCP (measured), and the no-packet-loss case. X-axis: network round trip time in ms (the maximum corresponds roughly to San Francisco to London); Y-axis: throughput, 0 – 10,000 Mb/s. See http://fasterdata.es.net/performance-testing/perfsonar/troubleshooting/packet-loss]
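
The shape of those curves can be approximated with the well-known Mathis et al. loss-rate bound, throughput ≤ (MSS/RTT)·(1/√p). The short sketch below evaluates it for the loss rate in the figure at a few RTTs; this is an analytic, Reno-style estimate for illustration only, not a reproduction of the measured data (modern stacks such as H-TCP and CUBIC recover faster, but show the same scaling with RTT):

    # TCP loss sensitivity, illustrated with the Mathis bound: rate <= (MSS/RTT) * 1/sqrt(p)
    from math import sqrt

    MSS_BITS = 1460 * 8           # assumes a standard 1500-byte MTU
    LOSS = 0.0046 / 100           # the 0.0046% packet loss rate from the figure

    def mathis_bound_gbps(rtt_ms):
        return (MSS_BITS / (rtt_ms / 1000.0)) / sqrt(LOSS) / 1e9

    for rtt_ms in (0.2, 10, 88):  # LAN, regional, and roughly transatlantic RTTs
        print(f"RTT {rtt_ms:5.1f} ms -> <= {mathis_bound_gbps(rtt_ms):6.2f} Gb/s")
    # ~8.6 Gb/s at LAN RTT, but only tens of Mb/s at transatlantic RTT for the same loss rate.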

24

Transport: Modern TCP stack
• A modern TCP stack (the kernel implementation of the TCP protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk]).
  – This is done using mechanisms that more quickly increase back to full speed after an error forces a reset to low bandwidth.

[Figure: "Binary Increase Congestion" (BIC) control algorithm impact – TCP results for Linux 2.6 with BIC TCP, Linux 2.4, and Linux 2.6 with BIC off; throughput in Mbits/second vs. time slot (5-second intervals), RTT = 67 ms.]
Note that BIC reaches max throughput much faster than older algorithms. (From Linux 2.6.19 the default is CUBIC, a refined version of BIC designed for high bandwidth, long paths.)
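
A minimal sketch for checking which congestion control algorithm a given Linux host is actually using (the /proc paths are the standard Linux interface; the output will vary by kernel):

    # Report the congestion control algorithm in use and the ones available on this host.
    def read(path):
        with open(path) as f:
            return f.read().strip()

    print("in use   :", read("/proc/sys/net/ipv4/tcp_congestion_control"))
    print("available:", read("/proc/sys/net/ipv4/tcp_available_congestion_control"))
    # On most modern kernels the default reported here is "cubic"; htcp is often available as well.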

25

Transport: Modern TCP stack
• Even modern TCP stacks are only of some help in the face of packet loss on a long-path, high-speed network.
• For a detailed analysis of the impact of packet loss on various TCP implementations, see "An Investigation into Transport Protocols and Data Transport Applications Over High Performance Networks," chapter 8 ("Systematic Tests of New-TCP Behaviour"), by Yee-Ting Li, University College London (PhD thesis): http://www.slac.stanford.edu/~ytl/thesis.pdf

[Figure: Throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss (tail zoom) – Reno (measured), Reno (theory), and H-TCP (CUBIC refinement, measured); round-trip time in ms (the maximum corresponds roughly to San Francisco to London) vs. throughput, 0 – 1000 Mb/s.]

26

3) Monitoring and testing
• The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction.
• perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.).
• perfSONAR is a community effort to:
  – define network management data exchange protocols, and
  – standardize measurement data formats, gathering, and archiving.
• perfSONAR is deployed extensively throughout LHC-related networks and international networks, and at the end sites. (See [fasterdata], [perfSONAR], and [NetServ].)
  – There are now more than 1000 perfSONAR boxes installed in N. America and Europe.
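
The regularly scheduled tests that perfSONAR runs between measurement hosts can be approximated by hand; the rough sketch below drives an iperf3 throughput test from Python (the hostname is a placeholder, and this is not the perfSONAR toolchain itself, just the same kind of active measurement):

    # On-demand throughput test between two hosts, similar in spirit to a scheduled perfSONAR test.
    import json, subprocess

    def throughput_gbps(server, seconds=20, streams=4):
        out = subprocess.run(
            ["iperf3", "-c", server, "-t", str(seconds), "-P", str(streams), "-J"],
            capture_output=True, text=True, check=True).stdout
        return json.loads(out)["end"]["sum_received"]["bits_per_second"] / 1e9

    print(f"{throughput_gbps('perfsonar.example.org'):.2f} Gb/s")   # placeholder test host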

27

perfSONAR
• The test and monitor functions can detect soft errors that limit throughput and can be hard to find (hard errors/faults are easily found and corrected).
• Soft failure example: observed end-to-end performance degradation due to soft failure of a single optical line card.
[Figure: throughput in Gb/s over one month, showing normal performance, then degrading performance, then repair.]
• Why not just rely on "SNMP" interface stats for this sort of error detection?
  – not all error conditions show up in SNMP interface statistics
  – SNMP error statistics can be very noisy
  – some devices lump different error counters into the same bucket, so it can be very challenging to figure out what errors to alarm on and what errors to ignore
    • though ESnet's Spectrum monitoring system attempts to apply heuristics to do this
  – many routers will silently drop packets – the only way to find that is to test through them and observe loss using devices other than the culprit device

28

perfSONAR
• The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains.
  – It provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe.
  – ESnet has perfSONAR testers installed at every PoP and all but the smallest user sites – Internet2 is close to the same.
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages.

29

4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network:
• Host TCP tuning
• Modern TCP stack (see above)
• Other issues (MTU, etc.)
• Data transfer tools and parallelism
• Other data transfer issues (firewalls, etc.)

30

4.1) System software tuning: Host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length.
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using end-to-end.
• Default TCP buffer sizes are typically much too small for today's high speed networks.
  – Until recently, default TCP send/receive buffers were typically 64 KB.
  – Tuned buffer to fill a CA to NY 1 Gb/s path: 10 MB
    • 150X bigger than the default buffer size
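
The 10 MB figure is just the bandwidth-delay product of that path; a small worked check (the ~80 ms RTT is an assumed, illustrative value for a CA–NY path):

    # Bandwidth-delay product: the amount of in-flight data the TCP window must cover.
    bandwidth_bps = 1e9         # 1 Gb/s path
    rtt_s = 0.080               # ~80 ms coast-to-coast round trip time (assumed)

    bdp_bytes = bandwidth_bps * rtt_s / 8
    print(f"BDP ~ {bdp_bytes / 1e6:.0f} MB")                       # ~10 MB to keep the path full
    print(f"vs. 64 KB default: {bdp_bytes / 65536:.0f}x larger")   # ~150x, as on the slide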

31

System software tuning: Host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications.
  – How to tune is a function of the application and the path to the destination, so there are potentially a lot of special cases.
• Auto-tuning the TCP connection buffer size within pre-configured limits helps.
• Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths.
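
On Linux, those auto-tuning upper limits live in a handful of sysctl keys; a minimal sketch for inspecting them follows (the keys are standard, but the 64 MB "target" shown in the comments is only an illustrative value in the spirit of the fasterdata.es.net guidance, not an authoritative recommendation):

    # Inspect the Linux TCP autotuning limits that cap the achievable window size.
    KEYS = ["net.ipv4.tcp_rmem", "net.ipv4.tcp_wmem",
            "net.core.rmem_max", "net.core.wmem_max"]

    def sysctl(key):
        with open("/proc/sys/" + key.replace(".", "/")) as f:
            return f.read().strip()

    for key in KEYS:
        print(f"{key:20s} = {sysctl(key)}")

    # To raise the ceiling for long international paths, one would set something like
    # (in /etc/sysctl.conf, as root; 64 MB here is an illustrative value):
    #   net.ipv4.tcp_rmem = 4096 87380 67108864
    #   net.ipv4.tcp_wmem = 4096 65536 67108864
    #   net.core.rmem_max = 67108864
    #   net.core.wmem_max = 67108864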

32

System software tuning: Host tuning – TCP
[Figure: Throughput out to a ~9000 km path length on a 10 Gb/s network, comparing a 32 MB (auto-tuned) and a 64 MB (hand-tuned) TCP window size; round trip time in ms (the maximum corresponds roughly to San Francisco to London) vs. throughput, 0 – 10,000 Mb/s.]

33

4.2) System software tuning: Data transfer tools
• Parallelism is key in data transfer tools.
  – It is much easier to achieve a given performance level with multiple parallel connections than with one connection.
    • This is because the OS is very good at managing multiple threads and less good at sustained, maximum performance of a single thread (the same is true for disks).
  – Several tools offer parallel transfers (see below).
• Latency tolerance is critical.
  – Wide area data transfers have much higher latency than LAN transfers.
  – Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds); for example, SCP/SFTP and the HPSS mover protocols work very poorly in long-path networks.
• Disk performance
  – In general, a RAID array or parallel disks (as in FDT) are needed to get more than about 500 Mb/s.

34

System software tuning: Data transfer tools
• Using the right tool is very important.
• Sample results, Berkeley, CA to Argonne, IL (RTT = 53 ms, network capacity = 10 Gbps):
  – scp: 140 Mbps
  – patched scp (HPN): 1.2 Gbps
  – ftp: 1.4 Gbps
  – GridFTP, 4 streams: 5.4 Gbps
  – GridFTP, 8 streams: 6.6 Gbps
• Note that to get more than about 1 Gbps (125 MB/s) disk-to-disk requires using RAID technology.
• PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSH:
  – http://www.psc.edu/networking/projects/hpn-ssh
  – Significant performance increase – this helps rsync too.

35

System software tuning: Data transfer tools
• Globus GridFTP is the basis of most modern high-performance data movement systems.
  – Parallel streams and buffer tuning help in getting through firewalls (open ports), ssh, etc.
  – The newer Globus Online incorporates all of these plus small file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP.
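
As a concrete illustration of the parallel-stream and buffer options discussed above, a GridFTP transfer might be driven as follows; the endpoints are placeholders, and -p (parallel streams) and -tcp-bs (per-stream TCP buffer size) are the relevant globus-url-copy options:

    # Sketch of a parallel GridFTP transfer between two (hypothetical) data transfer nodes.
    import subprocess

    src = "gsiftp://dtn1.example.org/data/run1234.tar"       # placeholder endpoints
    dst = "gsiftp://dtn2.example.org/scratch/run1234.tar"

    subprocess.run(
        ["globus-url-copy",
         "-p", "8",                         # 8 parallel TCP streams
         "-tcp-bs", str(32 * 1024 * 1024),  # 32 MB TCP buffer per stream
         "-vb",                             # show transfer performance
         src, dst],
        check=True)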

36

System software tuning: Data transfer tools
• Also see Caltech's FDT (Fast Data Transfer) approach:
  – Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node.
  – Explicit parallel use of multiple disks.
  – Can fill 100 Gb/s paths.
  – See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT

37

4.4) System software tuning: Other issues
• Firewalls are anathema to high-speed data flows.
  – Many firewalls can't handle >1 Gb/s flows:
    • they are designed for large numbers of low bandwidth flows;
    • some firewalls even strip out TCP options that allow for TCP buffers > 64 KB.
  – See Jason Zurawski's "Say Hello to your Frienemy – The Firewall".
  – Stateful firewalls have inherent problems that inhibit high throughput.
    • http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues:
  – Large MTUs (several issues)
  – NIC tuning
    • Defaults are usually fine for 1GE, but 10GE often requires additional tuning
  – Other OS tuning knobs
  – See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])
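
Two of the items above – window scaling being stripped by a middlebox, and MTU/jumbo-frame configuration – can be sanity-checked on a Linux host from a couple of standard files; a minimal sketch (the interface name "eth0" is a placeholder):

    # Quick host-side checks related to the issues above (standard Linux paths).
    def read(path):
        with open(path) as f:
            return f.read().strip()

    # Window scaling must be enabled (and survive the path) for TCP windows > 64 KB.
    print("tcp_window_scaling:", read("/proc/sys/net/ipv4/tcp_window_scaling"))

    # Data transfer nodes commonly run jumbo frames (MTU 9000); check the local setting.
    print("eth0 MTU:          ", read("/sys/class/net/eth0/mtu"))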

5) Site infrastructure to support data-intensive science: The Science DMZ
• With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck.
• The site network (LAN) typically provides connectivity for local resources – compute, data, instruments, collaboration systems, etc. – needed by data-intensive science.
  – Therefore a high performance interface between the wide area network and the local area site network is critical for large-scale data movement.
• Campus network infrastructure is typically not designed to handle the flows of large-scale science.
  – The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows:
    • firewalls, proxy servers, low-cost switches, and so forth,
    • none of which will allow high volume, high bandwidth, long distance data flows.

39

The Science DMZ
• To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large data volume, high round trip time (RTT) (international paths) character of the wide area network (WAN) flows (see [DIS]).
  – Otherwise the site will impose poor performance on the entire high speed data path, all the way back to the source.

40

The Science DMZ
The Science DMZ concept: the compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy:
• outside the site firewall – hence the term "Science DMZ";
• with dedicated systems built and tuned for wide-area data transfer;
• with test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below);
• with a security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.).
This is so important that it was a requirement for the last round of NSF CC-NIE grants.

41

The Science DMZ
[Diagram: Science DMZ architecture. The WAN connects to the site border router, which feeds a Science DMZ router/switch (a WAN-capable device) outside the site firewall. The Science DMZ hosts a high performance Data Transfer Node (dedicated systems built and tuned for wide-area data transfer), network monitoring and testing, and per-service security policy control points, with a clean, high-bandwidth WAN data path through to a computing cluster. Campus/site access to Science DMZ resources is via the site firewall, which also provides secured campus/site access to the Internet and fronts the site DMZ (Web, DNS, Mail) and the campus/site LAN. See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.]

42

6) Data movement and management techniques
• Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
  – In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers.
• The Tier 2 sites get a comparable amount of data from the Tier 1s. They:
  – host the physics groups that analyze the data and do the science;
  – provide most of the compute resources for analysis;
  – cache the data (though this is evolving to remote I/O).

Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management.
  – The resources and data movement are centrally managed.
  – Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations (the basic brokering idea is sketched below).
  – The system manages tens of thousands of jobs a day:
    • it coordinates data movement of hundreds of terabytes/day, and
    • it manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science.
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial.
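
A toy illustration of that brokering principle – run the job where the data already is if a slot is free, otherwise send the data to where the compute is. This is purely for illustration; the names and structures here are hypothetical, and PanDA's real system is far richer:

    # Toy "jobs to data, else data to jobs" broker (hypothetical structures, not PanDA's interfaces).
    from dataclasses import dataclass

    @dataclass
    class Site:
        name: str
        free_slots: int
        datasets: set

    def broker(job_dataset, sites):
        """Return (chosen_site, needs_transfer) for a job that reads job_dataset."""
        # 1) Prefer a site that already holds the dataset and has free compute slots.
        for site in sites:
            if job_dataset in site.datasets and site.free_slots > 0:
                return site, False
        # 2) Otherwise pick the site with the most free slots and schedule a data transfer.
        return max(sites, key=lambda s: s.free_slots), True

    sites = [Site("Tier2-A", 0, {"data15"}), Site("Tier2-B", 500, {"data12"})]
    site, transfer = broker("data15", sites)
    print(site.name, "needs transfer:", transfer)    # Tier2-B needs transfer: True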

44

[Diagram: The ATLAS PanDA "Production and Distributed Analysis" system uses distributed resources and layers of automation to manage several million jobs/day. ATLAS production jobs, regional production jobs, and user/group analysis jobs enter the PanDA Server (task management) at CERN, which comprises a Data Service, Task Buffer (job queue), Job Dispatcher, Job Broker, and Policy (job type, priority) modules. The Distributed Data Manager (a complex system in its own right, called DQ2) and its DDM agents locate data and move it among sites. The CERN ATLAS detector feeds the Tier 0 Data Center (which keeps 1 copy of all data – archival only); the ATLAS Tier 1 Data Centers (11 sites scattered across Europe, North America, and Asia) in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis; the ATLAS analysis sites (e.g. 70 Tier 2 Centers in Europe, North America, and SE Asia) run the jobs.
Job resource management: PanDA dispatches a "pilot" job manager – a PanDA job receiver – when resources are available at a site. Pilots run under the local, site-specific job manager (e.g. Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA, similar to the Condor Glide-in approach; a Grid Scheduler, Site Capability Service, and site status information support this.
The workflow: 1) PanDA schedules jobs and initiates data movement; 2) DDM locates data and moves it to sites; 3) the pilot prepares the local resources to receive PanDA jobs; 4) jobs are dispatched when there are resources available and when the required data is in place at the site. The system tries to move the job to where the data is, else it moves data and job to where resources are available.
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]

45

Scale of ATLAS analysis driven data movement
• The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s.
• PanDA manages 120,000 – 140,000 simultaneous jobs. (PanDA manages two types of jobs, which are shown separately here.)
• It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
[Figures: accumulated data volume on disk, 0 – 150 petabytes over four years (at 730 TBytes/day); and the number of concurrent PanDA jobs of the two types over one year.]
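
A quick unit check of the stated rate (simple arithmetic, using the numbers from the slide):

    # 730 TBytes/day expressed as an average bit rate.
    tbytes_per_day = 730
    gbps = tbytes_per_day * 1e12 * 8 / 86400 / 1e9
    print(f"~{gbps:.0f} Gb/s sustained")    # ~68 Gb/s, matching the figure quoted above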

46

Building an LHC-scale production analysis system
• In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure.
  – Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges".
  – Successful testing was required for sites to participate in LHC production.

47

Ramp-up of LHC traffic in ESnet
[Figure: ESnet traffic over time, annotated with the LHC data system testing period (with an estimate of "small"-scale traffic), LHC turn-on, and LHC operation. The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.]

48

6, cont.) Evolution of network architectures
• For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN.
  – The LHCOPN is a collection of leased 10 Gb/s optical circuits.
  – The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
    • In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.

49

The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.
• The issues related to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance.
  – The security issues were the primary ones, and were addressed by:
    • using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec])
      – that is, only LHC data and compute servers are connected to the OPN.

50

The LHC OPN – Optical Private Network
[Diagram: abbreviated LHCOPN physical topology and architecture – CH-CERN connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]

51

The LHC OPN – Optical Private Network
• N.B.: in 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
  – The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
  – However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.

Managing large-scale science traffic in a shared infrastructure
• The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
  – In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic.
  – (There are about 170 Tier 2 sites.)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.

53

The LHC's Open Network Environment – LHCONE
• LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
• The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems.
  – The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.).
  – The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).
• In this way the LHC traffic will use circuits designated by the network engineers,
  – to ensure continued good performance for the LHC and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.

54

[Map: LHCONE – a global infrastructure for LHC Tier 1 data center to Tier 2 analysis center connectivity (April 2012). LHCONE VRF domains include ESnet and Internet2 (USA), CANARIE (Canada), GÉANT (Europe), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), CUDI (Mexico), ASGC and TWAREN (Taiwan), KREONET2 and KISTI (Korea), and TIFR (India), meeting at regional R&E communication nexus points such as Seattle, Chicago, New York, Washington, Amsterdam, and Geneva. End sites are LHC Tier 2 or Tier 3 unless indicated as Tier 1 (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1a/c, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1). Data communication links are 10, 20, and 30 Gb/s. See http://lhcone.net for details.]

55

The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because:
  – the VRF technology is a standard capability in most core routers, and
  – there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See lhcone.net.

LHCONE is one part of the network infrastructure that supports the LHC

CERN → T1          miles     kms
France               350      565
Italy                570      920
UK                   625     1000
Netherlands          625     1000
Germany              700     1185
Spain                850     1400
Nordic              1300     2100
USA – New York      3900     6300
USA – Chicago       4400     7100
Canada – BC         5200     8400
Taiwan              6100     9850

[Diagram: A Network Centric View of the LHC. The detector (1 PB/s) feeds the Level 1 and 2 triggers over O(1-10) meters, then the Level 3 trigger over O(10-100) meters, then the CERN Computer Center at O(1) km, which exports 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) over the LHC Optical Private Network (LHCOPN), spanning 500 – 10,000 km, to the LHC Tier 1 Data Centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN). The LHC Open Network Environment (LHCONE) connects the Tier 1s to the LHC Tier 2 Analysis Centers and the many university physics groups – this is intended to indicate that the physics groups now get their data wherever it is most readily available.]

57

7) New network services: Point-to-Point Virtual Circuit Service
• Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to:
  – couple existing pockets of code, data, and expertise into "systems of systems";
  – break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites;
  – see https://www.es.net/about/science-requirements/
• A commonly identified need to support this is that networking must be provided as a "service":
  – schedulable, with guaranteed bandwidth – as is done with CPUs and disks;
  – traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure;
  – some network path characteristics may also be specified – e.g. diversity;
  – available in the Web Services / Grid Services paradigm (a toy sketch of such a reservation request follows below).
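
A minimal sketch of the kind of parameters such a guaranteed-bandwidth reservation carries, purely for illustration – the field names below are hypothetical and are not the OSCARS or NSI schema:

    # Hypothetical reservation request for a schedulable, guaranteed-bandwidth circuit.
    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class CircuitRequest:
        src_endpoint: str       # source domain/port identifier (placeholder naming)
        dst_endpoint: str
        bandwidth_mbps: int     # guaranteed rate
        start: datetime
        end: datetime
        vlan: int = 0           # 0 = let the provider choose; real systems negotiate VLANs

    start = datetime(2014, 4, 1, 8, 0)
    req = CircuitRequest("esnet:bnl-dtn-1", "geant:desy-dtn-2",
                         bandwidth_mbps=5000, start=start, end=start + timedelta(hours=12))
    print(req)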

58

Point-to-Point Virtual Circuit Service
• The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
  – This is typically done by using a "static" routing mechanism,
    • e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path.
  – MPLS and OpenFlow are examples of this, and both can transport IP packets.
  – Most modern Internet routers have this type of functionality.
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic.
  – The virtual circuits can be directed to specific physical network paths when they are set up.

59

Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference, 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.

60

End User View of Circuits – How They Use Them
• Who are the "users"?
  – Sites, for the most part.
• How are the circuits used?
  – End system to end system, IP:
    • Almost never – very hard unless private address space is used.
      – Using public address space can result in leaking routes.
      – Using private address space with multi-homed hosts risks allowing backdoors into secure networks.
  – End system to end system, Ethernet (or other) over VLAN – a pseudowire:
    • Relatively common.
    • Interesting example: RDMA over VLAN is likely to be popular in the future.
      – The SC11 demo of 40G RDMA over WAN was very successful.
      – The CPU load for RDMA is a small fraction of that of IP.
      – The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks).
  – Point-to-point connection between routing instances – e.g. BGP at the end points:
    • Essentially this is how all current circuits are used, from one site router to another site router.
    • Typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters.

61

End User View of Circuits – How They Use Them
• When are the circuits used?
  – Mostly to solve a specific problem that the general infrastructure cannot.
  – Most circuits are used for a guarantee of bandwidth or for user traffic engineering.

62

Cross-Domain Virtual Circuit Service
• Large-scale science always involves institutions in multiple network domains (administrative units).
  – For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration, to provide end-to-end circuits.
  – e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.

63

Inter-Domain Control Protocol
• There are two realms involved:
  1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
  2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.
[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US], through ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany], to a user destination at DESY (AS1754) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GÉANT – which exchange topology information and pass the VC setup request from domain to domain, with a data plane connection helper at each domain ingress/egress point.]
1. The domains exchange topology information containing at least the potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process.

64

Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
  – Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system.
• Multi-domain circuit setup is not yet a robust production service, but progress is being made.
  – See lhcone.net.

65

8) Provide R&D, consulting, and a knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science.
  – With each generation of network transport technology:
    • 155 Mb/s was the norm for high speed networks in 1995;
    • 100 Gb/s – 650 times greater – is the norm today.
  – R&D groups involving hardware engineers, computer scientists, and application specialists worked:
    • first to demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
    • and then to do the development necessary for applications to make use of the new capabilities.
  – Examples of how this methodology drove toward today's capabilities include:
    • experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths;
    • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s.

66

Provide R&D, consulting, and a knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.

67

The knowledge base
• http://fasterdata.es.net topics:
  – Network Architecture, including the Science DMZ model
  – Host Tuning
  – Network Tuning
  – Data Transfer Tools
  – Network Performance Testing
  – With special sections on:
    • Linux TCP Tuning
    • Cisco 6509 Tuning
    • perfSONAR Howto
    • Active perfSONAR Services
    • Globus overview
    • Say No to SCP
    • Data Transfer Nodes (DTN)
    • TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations.

68

The Message
• A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
• Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

69

Infrastructure Critical to Science
• The combination of:
  – new network architectures in the wide area,
  – new network services (such as guaranteed bandwidth virtual circuits),
  – cross-domain network error detection and correction,
  – redesigning the site LAN to handle high data throughput,
  – automation of data movement systems, and
  – use of appropriate operating system tuning and data transfer tools
  now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.

70

LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated/sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.

71

LHC lessons of possible use to the SKA
The lessons:
• The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
  – A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
  – The technical aspects of building and operating a centralized working data repository –
    • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time, and
    • high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites –
    militate against a single large data center.

72

LHC lessons of possible use to the SKA
• The LHC model of distributed data (multiple regional centers) has worked well:
  – It decentralizes costs and involves many countries directly in the telescope infrastructure.
  – It divides up the network load, especially on the expensive trans-ocean links.
  – It divides up the cache I/O load across distributed sites.

73

LHC lessons of possible use to the SKA
• Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
  – There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
    • It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
    • In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
  – If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
    • In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.

74

LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded.
  – All high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring and close cooperation of the R&E networks involved in providing parts of the path, etc.
  – New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
• Re-engineering the site LAN-WAN architecture is critical: the Science DMZ.
• Workflow management systems that automate the data movement will have to be designed and tested.
  – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.

75

The Message
• Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
• Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document".

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C., Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010.
(May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf, and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 14: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

14

Foundations of data-intensive sciencebull This talk looks briefly at the nature of the advances in

technologies software and methodologies that have enabled LHC data management and analysis The points 1a and 1b on optical transport and router technology are

included in the slides for completeness but I will not talk about them They were not really driven by the needs of the LHC but they were opportunistically used by the LHC

Much of the reminder of the talk is a tour through ESnetrsquos network performance knowledge base (fasterdataesnet)

ndash Also included arebull the LHC ATLAS data management and analysis approach that generates

and relies on very large network data utilizationbull and an overview of how RampE network have evolved to accommodate the

LHC traffic

1) Underlying network issuesAt the core of our ability to transport the volume of data

that we must deal with today and to accommodate future growth are advances in optical transport technology and

router technology

0

5

10

15

Peta

byte

sm

onth

13 years

We face a continuous growth of data to transport

ESnet has seen exponential growth in our traffic every year since 1990 (our traffic grows by factor of 10 about once every 47 months)

16

We face a continuous growth of data transportbull The LHC data volume is predicated to grow 10 fold over the

next 10 yearsNew generations of instruments ndash for example the Square

Kilometer Array radio telescope and ITER (the international fusion experiment) ndash will generate more data than the LHC

In response ESnet and most large RampE networks have built 100 Gbs (per optical channel) networksndash ESnets new network ndash ESnet5 ndash is complete and provides a 44 x

100Gbs (44 terabitssec - 4400 gigabitssec) in optical channels across the entire ESnet national footprint

ndash Initially one of these 100 Gbs channels is configured to replace the current 4 x 10 Gbs IP network

bull What has made this possible

17

1a) Optical Network TechnologyModern optical transport systems (DWDM = dense wave

division multiplexing) use a collection of technologies called ldquocoherent opticalrdquo processing to achieve more sophisticated optical modulation and therefore higher data density per signal transport unit (symbol) that provides 100Gbs per wave (optical channel)ndash Optical transport using dual polarization-quadrature phase shift keying

(DP-QPSK) technology with coherent detection [OIF1]bull dual polarization

ndash two independent optical signals same frequency orthogonal two polarizations rarr reduces the symbol rate by half

bull quadrature phase shift keying ndash encode data by changing the signal phase of the relative to the optical carrier further reduces the symbol rate by half (sends twice as much data symbol)

Together DP and QPSK reduce required rate by a factor of 4ndash allows 100G payload (plus overhead) to fit into 50GHz of spectrum

bull Actual transmission rate is about 10 higher to include FEC data

ndash This is a substantial simplification of the optical technology involved ndash see the TNC 2013 paper and Chris Tracyrsquos NANOG talk for details [Tracy1] and [Rob1]

18

Optical Network Technology
ESnet5's optical network uses Ciena's 6500 Packet-Optical Platform with WaveLogic™ to provide 100 Gb/s waves
– 88 waves (optical channels), 100 Gb/s each
• wave capacity shared equally with Internet2
– ~13,000 miles / 21,000 km of lit fiber
– 280 optical amplifier sites
– 70 optical add/drop sites (where routers can be inserted)
• 46 100G add/drop transponders
• 22 100G re-gens across the wide area

[Map: ESnet5 optical network node locations (hubs and DOE lab sites), including the Long Island MAN and ANI Testbed. Geography is only representational.]

19

1b) Network routers and switches
ESnet5 routing (IP layer 3) is provided by Alcatel-Lucent 7750 routers with 100 Gb/s client interfaces
– 17 routers with 100G interfaces
• several more in a test environment
– 59 layer-3 100GigE interfaces; 8 customer-owned 100G routers
– 7 100G interconnects with other R&E networks at Starlight (Chicago), MAN LAN (New York), and Sunnyvale (San Francisco)

20

[Map: The Energy Sciences Network – ESnet5 (Fall 2013). Shows ESnet routers, site routers, metro area circuits, 100G / 10-40G / 1G and site-provided circuits, optical-only segments, the SUNN–STAR–AOFA–AMST 100G testbed (SF Bay Area, Chicago, New York, Amsterdam), and commercial, US R&E, and international R&E peerings. Geographical representation is approximate.]

2) Data transport: The limitations of TCP must be addressed for large, long-distance flows
Although there are other transport protocols available, TCP remains the workhorse of the Internet, including for data-intensive science.
Using TCP to support the sustained, long distance, high data-rate flows of data-intensive science requires an error-free network.
Why error-free?
TCP is a "fragile workhorse": it is very sensitive to packet loss (due to bit errors)
– Very small packet loss rates on these paths result in large decreases in performance
– A single bit error will cause the loss of a 1-9 KBy packet (depending on the MTU size), as there is no FEC at the IP level for error correction
• This puts TCP back into "slow start" mode, thus reducing throughput
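One quantitative way to see why this matters (not from the slides, but a widely used model consistent with the figure two slides below) is the Mathis et al. approximation for steady-state throughput of a single TCP Reno stream, where MSS is the maximum segment size, RTT the round trip time, and p the packet loss rate:

```latex
% Mathis et al. approximation for a single TCP Reno stream
\[
  \text{throughput} \;\lesssim\; \frac{MSS}{RTT}\cdot\frac{1}{\sqrt{p}}
\]
% Throughput falls as 1/RTT and as 1/sqrt(p): a loss rate that is invisible
% on a low-RTT LAN is crippling on an 80-90 ms intercontinental path.
```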

22

Transport
• The reason for TCP's sensitivity to packet loss is the slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internet
– Packet loss is seen by TCP's congestion control algorithms as evidence of congestion, so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion, leading to network throughput collapse)
– Network link errors also cause packet loss, so these congestion avoidance algorithms come into play, with dramatic effect on throughput in the wide area network – hence the need for "error-free"

23

Transport: Impact of packet loss on TCP
On a 10 Gb/s LAN path the impact of low packet loss rates is minimal.
On a 10 Gb/s WAN path the impact of low packet loss rates is enormous (~80X throughput reduction on a transatlantic path).
Implications: Error-free paths are essential for high-volume, long-distance data transfers.

[Graph: Throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss. Curves: no packet loss; Reno (measured); Reno (theory); H-TCP (measured). X-axis: network round trip time in ms (the right end corresponds roughly to San Francisco to London); Y-axis: throughput, 0–10,000 Mb/s. See http://fasterdata.es.net/performance-testing/perfsonar/troubleshooting/packet-loss]
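Evaluating the model above at the figure's 0.0046% loss rate for a short and a long path (a sketch only; the MSS and RTT values are illustrative assumptions, not measurements from the slides) shows the same collapse the graph illustrates:

```python
# Evaluate the Mathis et al. bound  throughput <= (MSS/RTT) * 1/sqrt(p)
# for the 0.0046% loss rate in the figure.  MSS and RTT values are
# illustrative assumptions.
import math

MSS_BITS = 1460 * 8          # payload of a standard 1500-byte MTU, in bits
LOSS = 0.0046 / 100          # 0.0046% packet loss
LINK_MBPS = 10_000           # 10 Gb/s link capacity

for label, rtt in [("LAN path, ~0.1 ms RTT", 0.0001),
                   ("Transatlantic path, ~88 ms RTT", 0.088)]:
    bound_mbps = (MSS_BITS / rtt) / math.sqrt(LOSS) / 1e6
    print(f"{label}: model bound ~{min(bound_mbps, LINK_MBPS):,.0f} Mb/s")

# The LAN path is still limited only by the 10 Gb/s link, while the
# transatlantic path collapses to roughly 20 Mb/s for a single Reno stream.
```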

24

Transport: Modern TCP stack
• A modern TCP stack (the kernel implementation of the TCP protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])
– This is done using mechanisms that more quickly increase back to full speed after an error forces a reset to low bandwidth

[Graph: "TCP Results" – throughput (0–800 Mbits/second) vs. time slot (5 second intervals) for Linux 2.6 BIC TCP, Linux 2.4, and Linux 2.6 with BIC off; RTT = 67 ms.]
"Binary Increase Congestion" (BIC) control algorithm impact: note that BIC reaches max throughput much faster than older algorithms (from Linux 2.6.19 the default is CUBIC, a refined version of BIC designed for high bandwidth, long paths).

25

Transport: Modern TCP stack
Even modern TCP stacks are only of some help in the face of packet loss on a long-path, high-speed network.
• For a detailed analysis of the impact of packet loss on various TCP implementations, see "An Investigation into Transport Protocols and Data Transport Applications Over High Performance Networks", chapter 8 ("Systematic Tests of New-TCP Behaviour") by Yee-Ting Li, University College London (PhD thesis): http://www.slac.stanford.edu/~ytl/thesis.pdf

[Graph (tail zoom): Throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss. Curves: Reno (measured), Reno (theory), H-TCP (CUBIC refinement, measured). X-axis: round trip time in ms (corresponds roughly to San Francisco to London); Y-axis: throughput, 0–1000 Mb/s.]

26

3) Monitoring and testing
The only way to keep multi-domain, international scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction.
perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.)
• perfSONAR is a community effort to:
– define network management data exchange protocols, and
– standardize measurement data formats, gathering, and archiving
perfSONAR is deployed extensively throughout LHC-related networks and international networks, and at the end sites (see [fasterdata], [perfSONAR], and [NetServ])
– There are now more than 1000 perfSONAR boxes installed in N. America and Europe

27

perfSONAR
The test and monitor functions can detect soft errors that limit throughput and can be hard to find (hard errors / faults are easily found and corrected).
Soft failure example:
• Observed end-to-end performance degradation due to soft failure of a single optical line card
[Graph: throughput in Gb/s over one month, showing normal performance, then degrading performance, then repair.]
• Why not just rely on SNMP interface stats for this sort of error detection?
• not all error conditions show up in SNMP interface statistics
• SNMP error statistics can be very noisy
• some devices lump different error counters into the same bucket, so it can be very challenging to figure out what errors to alarm on and what errors to ignore
• though ESnet's Spectrum monitoring system attempts to apply heuristics to do this
• many routers will silently drop packets – the only way to find that is to test through them and observe loss using devices other than the culprit device

28

perfSONAR
The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains. It provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe
– ESnet has perfSONAR testers installed at every PoP and all but the smallest user sites – Internet2 is close to the same
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages

29

4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network:
• Host TCP tuning
• Modern TCP stack (see above)
• Other issues (MTU, etc.)
• Data transfer tools and parallelism
• Other data transfer issues (firewalls, etc.)

30

4.1) System software tuning: Host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length.
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using end-to-end.
Default TCP buffer sizes are typically much too small for today's high speed networks
– Until recently, default TCP send/receive buffers were typically 64 KB
– Tuned buffer to fill a CA to NY 1 Gb/s path: 10 MB (see the sketch below)
• 150X bigger than the default buffer size
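The 10 MB figure is just the bandwidth-delay product (BDP) of the path. A minimal sketch of the calculation (the ~80 ms CA–NY RTT is an assumption), and of how an application can request matching per-socket buffers:

```python
# Bandwidth-delay product: the data "in flight" that the TCP window must
# cover to keep a path full.  Path values below are illustrative assumptions.
import socket

def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> int:
    """TCP buffer size (bytes) needed to fill the given path."""
    return int(bandwidth_bps * rtt_seconds / 8)

print(bdp_bytes(1e9, 0.080))   # 1 Gb/s x ~80 ms CA-NY path -> 10,000,000 B (~10 MB)

# Applications can request larger per-socket buffers; the kernel caps the
# request at its configured maximums (see the auto-tuning limits below).
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes(1e9, 0.080))
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes(1e9, 0.080))
```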

31

System software tuning: Host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications
– How to tune is a function of the application and the path to the destination, so potentially a lot of special cases
Auto-tuning TCP connection buffer size within pre-configured limits helps.
Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths. (A quick check of those limits is sketched below.)
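One way to see whether a host's auto-tuning ceiling is adequate (a Linux-specific sketch; the 10 Gb/s × 150 ms example path is an assumption) is to compare the kernel's maximum receive buffer against the path's bandwidth-delay product:

```python
# Compare the Linux receive-buffer auto-tuning ceiling (third field of
# net.ipv4.tcp_rmem) with the BDP of a long international path.
# The example path (10 Gb/s, ~150 ms RTT) is an illustrative assumption.

def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> int:
    return int(bandwidth_bps * rtt_seconds / 8)

needed = bdp_bytes(10e9, 0.150)   # ~188 MB for a 10 Gb/s, 150 ms path

with open("/proc/sys/net/ipv4/tcp_rmem") as f:
    _rmem_min, _rmem_default, rmem_max = (int(v) for v in f.read().split())

print(f"BDP needed: {needed / 1e6:.0f} MB; auto-tuning ceiling: {rmem_max / 1e6:.0f} MB")
if rmem_max < needed:
    print("Ceiling too low for this path -- raise net.ipv4.tcp_rmem / tcp_wmem maximums")
```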

32

System software tuning: Host tuning – TCP
[Graph: Throughput out to ~9,000 km on a 10 Gb/s network, 32 MBy (auto-tuned) vs. 64 MBy (hand-tuned) TCP window size. X-axis: round trip time in ms, i.e. path length (corresponds roughly to San Francisco to London); Y-axis: throughput, 0–10,000 Mb/s.]

33

4.2) System software tuning: Data transfer tools
Parallelism is key in data transfer tools
– It is much easier to achieve a given performance level with multiple parallel connections than with one connection
• this is because the OS is very good at managing multiple threads and less good at sustained, maximum performance of a single thread (the same is true for disks)
– Several tools offer parallel transfers (see below, and the sketch after this list)
Latency tolerance is critical
– Wide area data transfers have much higher latency than LAN transfers
– Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds)
• examples: SCP/SFTP and HPSS mover protocols work very poorly in long path networks
• Disk performance
– In general, need a RAID array or parallel disks (like FDT) to get more than about 500 Mb/s
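A sketch of the parallelism idea (illustrative only; the URL, file size, and stream count are hypothetical placeholders, and production tools such as GridFTP or FDT do this far more robustly): split one large transfer into byte ranges fetched over independent connections, so that no single TCP stream has to sustain the whole rate.

```python
# Sketch: fetch one large file as N byte ranges over N parallel HTTP
# connections.  URL, size, and stream count are hypothetical placeholders.
import shutil
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

URL = "https://data.example.org/large-dataset.bin"   # hypothetical source
SIZE = 8 * 1024**3                                    # assume the 8 GiB size is known
STREAMS = 8
OUT = "large-dataset.bin"

def fetch_range(index: int) -> None:
    start = index * SIZE // STREAMS
    end = (index + 1) * SIZE // STREAMS - 1
    req = Request(URL, headers={"Range": f"bytes={start}-{end}"})
    with urlopen(req) as resp, open(OUT, "r+b") as out:
        out.seek(start)                    # each stream writes its own region
        shutil.copyfileobj(resp, out)

with open(OUT, "wb") as f:
    f.truncate(SIZE)                       # pre-allocate the output file

with ThreadPoolExecutor(max_workers=STREAMS) as pool:
    list(pool.map(fetch_range, range(STREAMS)))
```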

34

System software tuning: Data transfer tools
Using the right tool is very important.
Sample results: Berkeley, CA to Argonne, IL (RTT = 53 ms, network capacity = 10 Gbps)

Tool                     Throughput
scp                      140 Mbps
patched scp (HPN)        1.2 Gbps
ftp                      1.4 Gbps
GridFTP, 4 streams       5.4 Gbps
GridFTP, 8 streams       6.6 Gbps

Note that to get more than about 1 Gbps (125 MB/s) disk to disk requires using RAID technology.
• PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSH
– http://www.psc.edu/networking/projects/hpn-ssh
– Significant performance increase
• this helps rsync too

35

System software tuning: Data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems
• Parallel streams and buffer tuning help in getting through firewalls (open ports), ssh, etc.
• The newer Globus Online incorporates all of these and adds small-file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP

36

System software tuning: Data transfer tools
Also see Caltech's FDT (Fast Data Transfer) approach
– Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node
– Explicit parallel use of multiple disks
– Can fill 100 Gb/s paths
– See SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT

37

4.4) System software tuning: Other issues
Firewalls are anathema to high-speed data flows
– many firewalls can't handle >1 Gb/s flows
• designed for large numbers of low bandwidth flows
• some firewalls even strip out TCP options that allow for TCP buffers > 64 KB
See Jason Zurawski's "Say Hello to your Frienemy – The Firewall"
Stateful firewalls have inherent problems that inhibit high throughput
• http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues
– Large MTUs (several issues)
– NIC tuning
• Defaults are usually fine for 1GE, but 10GE often requires additional tuning
– Other OS tuning knobs
– See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])

5) Site infrastructure to support data-intensive science: The Science DMZ
With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck.
The site network (LAN) typically provides connectivity for local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science
– Therefore a high performance interface between the wide area network and the local area site network is critical for large-scale data movement
Campus network infrastructure is typically not designed to handle the flows of large-scale science
– The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows
• firewalls, proxy servers, low-cost switches, and so forth
• none of which will allow high volume, high bandwidth, long distance data flows

39

The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large data volume, high round trip time (RTT) (international paths) of the wide area network (WAN) flows (see [DIS])
– otherwise the site will impose poor performance on the entire high speed data path, all the way back to the source

40

The Science DMZ
The Science DMZ concept:
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy:
• Outside the site firewall – hence the term "Science DMZ"
• With dedicated systems built and tuned for wide-area data transfer
• With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below)
• A security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.)
This is so important it was a requirement for the last round of NSF CC-NIE grants.

41

The Science DMZ
(See http://fasterdata.es.net/science-dmz and [SDMZ] for a much more complete discussion of the various approaches.)
[Diagram: A Science DMZ attached to the WAN border router on a clean, high-bandwidth WAN data path, with a WAN-capable Science DMZ router/switch, a high performance Data Transfer Node (dedicated systems built and tuned for wide-area data transfer), network monitoring and testing, and per-service security policy control points. The campus/site LAN, computing cluster, and Site DMZ (Web, DNS, Mail) sit behind the site firewall; campus/site access to Science DMZ resources is via the site firewall, and secured campus/site access to the Internet is unchanged.]

42

6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers
• The Tier 2 sites get a comparable amount of data from the Tier 1s
– Host the physics groups that analyze the data and do the science
– Provide most of the compute resources for analysis
– Cache the data (though this is evolving to remote IO)

Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management
– The resources and data movement are centrally managed
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations
– The system manages 10s of thousands of jobs a day
• coordinates data movement of hundreds of terabytes/day, and
• manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial

44

[Diagram: The ATLAS PanDA "Production and Distributed Analysis" system, which uses distributed resources and layers of automation to manage several million jobs/day.
– ATLAS production jobs, regional production jobs, and user/group analysis jobs enter the PanDA Server (task management) through its task buffer (job queue), job dispatcher, job broker, and policy (job type priority) components, supported by a Data Service and Distributed Data Manager (DDM agents).
– CERN hosts the ATLAS detector and the Tier 0 Data Center (one copy of all data – archival only). The 11 ATLAS Tier 1 Data Centers, scattered across Europe, North America, and Asia, in aggregate hold one copy of all data and provide the working data set for distribution to Tier 2 centers for analysis. ATLAS analysis sites include, e.g., 70 Tier 2 Centers in Europe, North America, and SE Asia.
– Job resource manager: dispatch a "pilot" job manager – a PanDA job receiver – when resources are available at a site. Pilots run under the local site job manager (e.g. Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA (similar to the Condor glide-in approach).
1) PanDA schedules jobs and initiates data movement.
2) DDM locates data and moves it to sites (this is a complex system in its own right, called DQ2).
3) The pilot prepares the local resources to receive PanDA jobs.
4) Jobs are dispatched when there are resources available and when the required data is in place at the site.
The strategy: try to move the job to where the data is, else move data and job to where resources are available.
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]

45

Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s.
[Charts: accumulated data volume on disk, 0–150 petabytes over four years (730 TBytes/day), and PanDA-managed jobs – 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, shown separately, each over one year).]
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
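For reference, the ~68 Gb/s figure is simply the slide's daily volume expressed as a sustained rate:

```latex
\[
  \frac{730\ \text{TBytes/day} \times 8\ \text{bits/Byte}}{86{,}400\ \text{s/day}}
  \;\approx\; 6.8\times10^{10}\ \text{b/s} \;\approx\; 68\ \text{Gb/s}
\]
```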

46

Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges"
– Successful testing was required for sites to participate in LHC production

47

Ramp-up of LHC traffic in ESnet
[Chart: LHC-related traffic in ESnet over time, showing an estimate of "small" scale traffic, LHC data system testing, LHC turn-on, and LHC operation.]
The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.

48

6 cont.) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN
– The LHCOPN is a collection of leased 10 Gb/s optical circuits
– The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community
• The issues related to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance
• The security issues were the primary ones, and were addressed by using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec])
– that is, only LHC data and compute servers are connected to the OPN

50

The LHC OPN – Optical Private Network
[Diagram: LHCOPN physical topology (abbreviated) and LHCOPN architecture – CH-CERN connected to the Tier 1 centers: UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, IT-INFN-CNAF.]

51

The LHC OPN – Optical Private Network
NB:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact this is what ESnet's OSCARS virtual circuit system was originally designed for (see below)
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic
– In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
– (there are about 170 Tier 2 sites)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose

53

The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineers
– To ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC

54

[Map: LHCONE – a global infrastructure for LHC Tier 1 data center to Tier 2 analysis center connectivity (April 2012). Shows LHCONE VRF domains (ESnet and Internet2 in the USA, CANARIE in Canada, GÉANT in Europe, NORDUnet, DFN, GARR, RedIRIS, SARA, RENATER, CUDI, TWAREN, ASGC, KERONET2, KISTI, and others), interconnected at regional R&E communication nexuses such as Seattle, Chicago, New York, Washington, Geneva, and Amsterdam, with end sites (LHC Tier 2 or Tier 3 unless indicated as Tier 1) and data communication links of 10, 20, and 30 Gb/s. See http://lhcone.net for details.]

55

The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because
– The VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance
• See LHCONE.net

LHCONE is one part of the network infrastructure that supports the LHC

CERN → T1         miles     kms
France              350      565
Italy               570      920
UK                  625     1000
Netherlands         625     1000
Germany             700     1185
Spain               850     1400
Nordic             1300     2100
USA – New York     3900     6300
USA – Chicago      4400     7100
Canada – BC        5200     8400
Taiwan             6100     9850

[Diagram: A network-centric view of the LHC. The detector (1 PB/s) feeds the Level 1 and 2 triggers over O(1-10) meters, then the Level 3 trigger over O(10-100) meters, then the CERN Computer Center over O(1) km at 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS). The LHC Optical Private Network (LHCOPN) carries data 500-10,000 km to the LHC Tier 1 Data Centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN), and the LHC Open Network Environment (LHCONE) connects them to the LHC Tier 2 Analysis Centers (universities and physics groups). This is intended to indicate that the physics groups now get their data wherever it is most readily available.]

57

7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to
– Couple existing pockets of code, data, and expertise into "systems of systems"
– Break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
– See https://www.es.net/about/science-requirements
A commonly identified need to support this is that networking must be provided as a "service":
– Schedulable with guaranteed bandwidth – as is done with CPUs and disks
– Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
– Some network path characteristics may also be specified – e.g. diversity
– Available in the Web Services / Grid Services paradigm

58

Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
– This is typically done by using a "static" routing mechanism
• E.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
– MPLS and OpenFlow are examples of this, and both can transport IP packets
– Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
– The virtual circuits can be directed to specific physical network paths when they are set up
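As an illustration only (these names and fields are hypothetical, not the OSCARS or NSI API), the kind of information a schedulable, guaranteed-bandwidth circuit request has to carry can be sketched as:

```python
# Hypothetical sketch of a point-to-point virtual circuit reservation request.
# Field names are illustrative; real systems (OSCARS, NSI) define their own schemas.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CircuitRequest:
    src_endpoint: str                 # ingress router port / VLAN at site A
    dst_endpoint: str                 # egress router port / VLAN at site B
    bandwidth_mbps: int               # guaranteed bandwidth, scheduled like CPU/disk
    start: datetime                   # reservation window
    end: datetime
    path_constraints: tuple = ()      # e.g. ("diverse-from:circuit-42",)

req = CircuitRequest(
    src_endpoint="siteA-rtr:xe-0/1/0.1234",       # hypothetical identifiers
    dst_endpoint="siteB-rtr:xe-2/0/3.1234",
    bandwidth_mbps=20_000,
    start=datetime(2014, 4, 1, 2, 0),
    end=datetime(2014, 4, 1, 14, 0),
)
# A domain controller would check such a request against available capacity on
# a chosen path for the time window, commit the reservation, and later install
# the static forwarding (e.g. MPLS or OpenFlow) state that realizes the circuit.
```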

59

Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead, Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service", TERENA Networking Conference 2011, in the references
• OSCARS received a 2013 "R&D 100" award

60

End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part
• How are the circuits used?
– End system to end system, IP:
• Almost never – very hard unless private address space is used
– Using public address space can result in leaking routes
– Using private address space with multi-homed hosts risks allowing backdoors into secure networks
– End system to end system, Ethernet (or other) over VLAN – a pseudowire:
• Relatively common
• Interesting example: RDMA over VLAN, likely to be popular in the future
– SC11 demo of 40G RDMA over WAN was very successful
– CPU load for RDMA is a small fraction of that of IP
– The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
– Point-to-point connection between routing instances – e.g. BGP at the end points:
• Essentially this is how all current circuits are used, from one site router to another site router
– Typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters

61

End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Service
• Large-scale science always involves institutions in multiple network domains (administrative units)
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
– e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains

63

Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains
[Diagram: An end-to-end virtual circuit from a user source at FNAL (AS3152) [US] through ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local inter-domain controller (IDC) – OSCARS in ESnet, AutoBAHN in GEANT – with a data plane connection helper at each domain ingress/egress point. Topology exchange and VC setup requests pass from domain to domain.
1. The domains exchange topology information containing at least potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process.]

64

Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net

65

8) Provide R&D consulting and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
– With each generation of network transport technology:
• 155 Mb/s was the norm for high speed networks in 1995
• 100 Gb/s – 650 times greater – is the norm today
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
• and then do the development necessary for applications to make use of the new capabilities
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s

66

Provide R&D consulting and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.

67

The knowledge base
http://fasterdata.es.net topics:
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations

68

The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

69

Infrastructure Critical to Science
• The combination of:
– New network architectures in the wide area
– New network services (such as guaranteed bandwidth virtual circuits)
– Cross-domain network error detection and correction
– Redesigning the site LAN to handle high data throughput
– Automation of data movement systems
– Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKA
The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
– A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository:
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
• high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
militate against a single large data center

72

LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites

73

LHC lessons of possible use to the SKA
Regardless of distributed vs. centralized working data repository, all of the attendant network lessons will apply:
There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
• It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
– In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, e.g., are implementing LHCONE

74

LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
Re-engineering the site LAN/WAN architecture is critical: the Science DMZ
Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on

75

The Message
Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach", Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer", Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"

[NetServ] "Network Services for High Performance Distributed Computing and Data Management", W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References (2)
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System", Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management", W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service", William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References (3)
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework", B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing", Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C., Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 15: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

1) Underlying network issuesAt the core of our ability to transport the volume of data

that we must deal with today and to accommodate future growth are advances in optical transport technology and

router technology

0

5

10

15

Peta

byte

sm

onth

13 years

We face a continuous growth of data to transport

ESnet has seen exponential growth in our traffic every year since 1990 (our traffic grows by factor of 10 about once every 47 months)

16

We face a continuous growth of data transportbull The LHC data volume is predicated to grow 10 fold over the

next 10 yearsNew generations of instruments ndash for example the Square

Kilometer Array radio telescope and ITER (the international fusion experiment) ndash will generate more data than the LHC

In response ESnet and most large RampE networks have built 100 Gbs (per optical channel) networksndash ESnets new network ndash ESnet5 ndash is complete and provides a 44 x

100Gbs (44 terabitssec - 4400 gigabitssec) in optical channels across the entire ESnet national footprint

ndash Initially one of these 100 Gbs channels is configured to replace the current 4 x 10 Gbs IP network

bull What has made this possible

17

1a) Optical Network TechnologyModern optical transport systems (DWDM = dense wave

division multiplexing) use a collection of technologies called ldquocoherent opticalrdquo processing to achieve more sophisticated optical modulation and therefore higher data density per signal transport unit (symbol) that provides 100Gbs per wave (optical channel)ndash Optical transport using dual polarization-quadrature phase shift keying

(DP-QPSK) technology with coherent detection [OIF1]bull dual polarization

ndash two independent optical signals same frequency orthogonal two polarizations rarr reduces the symbol rate by half

bull quadrature phase shift keying ndash encode data by changing the signal phase of the relative to the optical carrier further reduces the symbol rate by half (sends twice as much data symbol)

Together DP and QPSK reduce required rate by a factor of 4ndash allows 100G payload (plus overhead) to fit into 50GHz of spectrum

bull Actual transmission rate is about 10 higher to include FEC data

ndash This is a substantial simplification of the optical technology involved ndash see the TNC 2013 paper and Chris Tracyrsquos NANOG talk for details [Tracy1] and [Rob1]

18

Optical Network Technology ESnet5rsquos optical network uses Cienarsquos 6500 Packet-Optical Platform with

WaveLogictrade to provide 100Gbs wavendash 88 waves (optical channels) 100Gbs each

bull wave capacity shared equally with Internet2ndash ~13000 miles 21000 km lit fiberndash 280 optical amplifier sitesndash 70 optical adddrop sites (where routers can be inserted)

bull 46 100G adddrop transpondersbull 22 100G re-gens across wide-area

NEWG

SUNN

KANSDENV

SALT

BOIS

SEAT

SACR

WSAC

LOSA

LASV

ELPA

ALBU

ATLA

WASH

NEWY

BOST

SNLL

PHOE

PAIX

NERSC

LBNLJGI

SLAC

NASHCHAT

CLEV

EQCH

STA

R

ANLCHIC

BNL

ORNL

CINC

SC11

STLO

Internet2

LOUI

FNA

L

Long IslandMAN and

ANI Testbed

O

JACKGeography is

only representational

19

1b) Network routers and switchesESnet5 routing (IP layer 3) is provided by Alcatel-Lucent

7750 routers with 100 Gbs client interfacesndash 17 routers with 100G interfaces

bull several more in a test environment ndash 59 layer-3 100GigE interfaces 8 customer-owned 100G routersndash 7 100G interconnects with other RampE networks at Starlight (Chicago)

MAN LAN (New York) and Sunnyvale (San Francisco)

20

Metro area circuits

SNLL

PNNL

MIT

PSFC

AMES

LLNL

GA

JGI

LBNL

SLACNER

SC

ORNL

ANLFNAL

SALT

INL

PU Physics

SUNN

SEAT

STAR

CHIC

WASH

ATLA

HO

US

BOST

KANS

DENV

ALBQ

LASV

BOIS

SAC

R

ELP

A

SDSC

10

Geographical representation is

approximate

PPPL

CH

AT

10

SUNN STAR AOFA100G testbed

SF Bay Area Chicago New York AmsterdamAMST

US RampE peerings

NREL

Commercial peerings

ESnet routers

Site routers

100G

10-40G

1G Site provided circuits

LIGO

Optical only

SREL

100thinsp

Intrsquol RampE peerings

100thinsp

JLAB

10

10100thinsp

10

100thinsp100thinsp

1

10100thinsp

100thinsp1

100thinsp100thinsp

100thinsp

100thinsp

BNL

NEWY

AOFA

NASH

1

LANL

SNLA

10

10

1

10

10

100thinsp

100thinsp

100thinsp10

1010

100thinsp

100thinsp

10

10

100thinsp

100thinsp

100thinsp

100thinsp

100thinsp

100thinsp100thinsp

100thinsp

10

100thinsp

The Energy Sciences Network ESnet5 (Fall 2013)

2) Data transport The limitations of TCP must be addressed for large long-distance flows

Although there are other transport protocols available TCP remains the workhorse of the Internet including for data-

intensive scienceUsing TCP to support the sustained long distance high data-

rate flows of data-intensive science requires an error-free network

Why error-freeTCP is a ldquofragile workhorserdquo It is very sensitive to packet loss (due to bit errors)ndash Very small packet loss rates on these paths result in large decreases

in performance)ndash A single bit error will cause the loss of a 1-9 KBy packet (depending

on the MTU size) as there is no FEC at the IP level for error correctionbull This puts TCP back into ldquoslow startrdquo mode thus reducing throughput

22

Transportbull The reason for TCPrsquos sensitivity to packet loss is that the

slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internetndash Packet loss is seen by TCPrsquos congestion control algorithms as

evidence of congestion so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion leading to network throughput collapse)

ndash Network link errors also cause packet loss so these congestion avoidance algorithms come into play with dramatic effect on throughput in the wide area network ndash hence the need for ldquoerror-freerdquo

23

Transport Impact of packet loss on TCPOn a 10 Gbs LAN path the impact of low packet loss rates is

minimalOn a 10Gbs WAN path the impact of low packet loss rates is

enormous (~80X throughput reduction on transatlantic path)

Implications Error-free paths are essential for high-volume long-distance data transfers

Throughput vs increasing latency on a 10Gbs link with 00046 packet loss

Reno (measured)

Reno (theory)

H-TCP(measured)

No packet loss

(see httpfasterdataesnetperformance-testingperfso

nartroubleshootingpacket-loss)

Network round trip time ms (corresponds roughly to San Francisco to London)

10000

9000

8000

7000

6000

5000

4000

3000

2000

1000

0

Thro

ughp

ut M

bs

24

Transport Modern TCP stackbull A modern TCP stack (the kernel implementation of the TCP

protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])ndash This is done using mechanisms that more quickly increase back to full

speed after an error forces a reset to low bandwidth

TCP Results

0

100

200

300

400

500

600

700

800

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35time slot (5 second intervals)

Mbi

tss

econ

d

Linux 26 BIC TCPLinux 24Linux 26 BIC off

RTT = 67 ms

ldquoBinary Increase Congestionrdquo control algorithm impact

Note that BIC reaches max throughput much faster than older algorithms (from Linux 2619 the

default is CUBIC a refined version of BIC designed for high bandwidth

long paths)

25

Transport Modern TCP stackEven modern TCP stacks are only of some help in the face of

packet loss on a long path high-speed network

bull For a detailed analysis of the impact of packet loss on various TCP implementations see ldquoAn Investigation into Transport Protocols and Data Transport Applications Over High Performance Networksrdquo chapter 8 (ldquoSystematic Tests of New-TCP Behaviourrdquo) by Yee-Ting Li University College London (PhD thesis) httpwwwslacstanfordedu~ytlthesispdf

Reno (measured)

Reno (theory)

H-TCP (CUBIC refinement)(measured)

Throughput vs increasing latency on a 10Gbs link with 00046 packet loss(tail zoom)

Roundtrip time ms (corresponds roughly to San Francisco to London)

1000

900800700600500400300200100

0

Thro

ughp

ut M

bs

26

3) Monitoring and testing
• The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction (an illustrative loss monitor is sketched below).
• perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.).
• perfSONAR is a community effort to
– define network management data exchange protocols, and
– standardize measurement data formats, gathering, and archiving.
• perfSONAR is deployed extensively throughout LHC-related networks and international networks, and at the end sites. (See [fasterdata], [perfSONAR], and [NetSrv].)
– There are now more than 1000 perfSONAR boxes installed in North America and Europe.
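perfSONAR itself is the right tool for this job; purely to illustrate the "test continuously for soft errors" idea, the sketch below runs a periodic loss probe with the system ping and flags loss rates that would cripple long-path TCP. The hostname and threshold are invented.

```python
import re
import subprocess
import time

TARGETS = ["perfsonar.example.net"]   # hypothetical test points
LOSS_ALARM_PCT = 0.001                # even tiny loss matters on long-RTT paths

def probe_loss(host, count=100):
    """Return the packet loss percentage reported by the system ping."""
    out = subprocess.run(["ping", "-c", str(count), "-i", "0.2", host],
                         capture_output=True, text=True).stdout
    m = re.search(r"([\d.]+)% packet loss", out)
    return float(m.group(1)) if m else 100.0

while True:
    for host in TARGETS:
        loss = probe_loss(host)
        if loss > LOSS_ALARM_PCT:
            print(f"ALARM: {host} showing {loss}% loss - possible soft failure")
    time.sleep(300)   # repeat every 5 minutes, indefinitely
```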

27

perfSONAR
• The test and monitor functions can detect soft errors that limit throughput and can be hard to find (hard errors / faults are easily found and corrected).
• Soft failure example: observed end-to-end performance degradation due to soft failure of a single optical line card.
[Figure: throughput in Gb/s over one month, showing normal performance, degrading performance, and recovery after repair]
• Why not just rely on SNMP interface stats for this sort of error detection?
– not all error conditions show up in SNMP interface statistics
– SNMP error statistics can be very noisy
– some devices lump different error counters into the same bucket, so it can be very challenging to figure out which errors to alarm on and which to ignore
• though ESnet's Spectrum monitoring system attempts to apply heuristics to do this
– many routers will silently drop packets; the only way to find that is to test through them and observe loss using devices other than the culprit device

28

perfSONAR
• The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains.
• It provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe.
– ESnet has perfSONAR testers installed at every PoP and at all but the smallest user sites; Internet2 is close to the same.
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages.

29

4) System software evolution and optimization
• Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network:
• Host TCP tuning
• Modern TCP stack (see above)
• Other issues (MTU, etc.)
• Data transfer tools and parallelism
• Other data transfer issues (firewalls, etc.)

30

4.1) System software tuning: Host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length.
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using end-to-end.
• Default TCP buffer sizes are typically much too small for today's high-speed networks.
– Until recently, default TCP send/receive buffers were typically 64 KB.
– Tuned buffer to fill a CA-to-NY 1 Gb/s path: 10 MB
• 150X bigger than the default buffer size
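The 10 MB figure comes from the bandwidth-delay product (BDP = bandwidth × RTT), i.e. the amount of data that must be "in flight" to keep the path full. A small worked sketch; the RTT value is an assumed figure for a CA-to-NY path:

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: data in flight the TCP window must cover."""
    return bandwidth_bps * rtt_s / 8.0

# ~1 Gb/s CA-to-NY path with an assumed ~75 ms round trip time
print(f"{bdp_bytes(1e9, 0.075) / 1e6:.1f} MB")    # ~9.4 MB, i.e. roughly 10 MB

# The same RTT at 10 Gb/s needs roughly 10x the buffer.
print(f"{bdp_bytes(10e9, 0.075) / 1e6:.1f} MB")   # ~94 MB
```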

31

System software tuning: Host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications.
– How to tune is a function of the application and the path to the destination, so there are potentially a lot of special cases.
• Auto-tuning the TCP connection buffer size within pre-configured limits helps.
• Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths.
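On Linux the auto-tuning limits live in the net.ipv4.tcp_rmem / tcp_wmem sysctls (min, default, max). A minimal sketch that compares the configured maximum against the BDP of an intended path and shows an explicit per-socket buffer request; the path numbers are illustrative:

```python
import socket

def tcp_autotune_max(path="/proc/sys/net/ipv4/tcp_rmem"):
    """Third field of tcp_rmem/tcp_wmem is the auto-tuning upper limit (bytes)."""
    with open(path) as f:
        return int(f.read().split()[2])

bdp = int(10e9 * 0.088 / 8)   # 10 Gb/s x 88 ms path: ~110 MB
print("receive auto-tune max:", tcp_autotune_max(), "bytes; path BDP:", bdp)

# Explicit per-socket request (capped by net.core.rmem_max / wmem_max);
# note that setting SO_RCVBUF/SO_SNDBUF disables auto-tuning for this socket.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 32 * 1024 * 1024)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 32 * 1024 * 1024)
print("granted rcvbuf:", s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
s.close()
```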

32

System software tuning: Host tuning – TCP
[Figure: Throughput out to ~9000 km on a 10 Gb/s network, 32 MBy (auto-tuned) vs. 64 MBy (hand-tuned) TCP window size. X axis: round trip time, ms (corresponds roughly to San Francisco to London path length); Y axis: throughput, Mb/s, 0–10000. Series: hand-tuned to 64 MBy window; auto-tuned to 32 MBy window.]

33

4.2) System software tuning: Data transfer tools
• Parallelism is key in data transfer tools (a minimal illustration follows this list).
– It is much easier to achieve a given performance level with multiple parallel connections than with one connection.
• This is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (the same is true for disks).
– Several tools offer parallel transfers (see below).
• Latency tolerance is critical.
– Wide area data transfers have much higher latency than LAN transfers.
– Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds); for example, SCP/SFTP and the HPSS mover protocols work very poorly in long-path networks.
• Disk performance
– In general, a RAID array or parallel disks (as in FDT) are needed to get more than about 500 Mb/s.
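A minimal illustration of the parallelism point (not any particular tool): fetch a large file as several byte ranges over parallel HTTP connections and reassemble it. The URL is hypothetical, and the server is assumed to honor Range requests and report Content-Length.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "https://data.example.org/big-dataset.tar"   # hypothetical source
STREAMS = 8                                        # parallel connections

def fetch_range(start, end):
    req = urllib.request.Request(URL, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Determine the file size, then split it into STREAMS byte ranges.
size = int(urllib.request.urlopen(
    urllib.request.Request(URL, method="HEAD")).headers["Content-Length"])
chunk = size // STREAMS
ranges = [(i * chunk, size - 1 if i == STREAMS - 1 else (i + 1) * chunk - 1)
          for i in range(STREAMS)]

with ThreadPoolExecutor(max_workers=STREAMS) as pool:
    parts = list(pool.map(lambda r: fetch_range(*r), ranges))

with open("big-dataset.tar", "wb") as out:
    for part in parts:
        out.write(part)
```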

34

System software tuning: Data transfer tools
• Using the right tool is very important.
• Sample results: Berkeley, CA to Argonne, IL (RTT = 53 ms, network capacity = 10 Gbps)

  Tool                    Throughput
  scp                     140 Mbps
  patched scp (HPN)       1.2 Gbps
  ftp                     1.4 Gbps
  GridFTP, 4 streams      5.4 Gbps
  GridFTP, 8 streams      6.6 Gbps

• Note that getting more than about 1 Gbps (125 MB/s) disk-to-disk requires using RAID technology.
• PSC (Pittsburgh Supercomputing Center) has a patch set that fixes problems with SSH:
– http://www.psc.edu/networking/projects/hpn-ssh
– Significant performance increase
• this helps rsync too

35

System software tuning: Data transfer tools
• Globus GridFTP is the basis of most modern high-performance data movement systems.
– Parallel streams, buffer tuning, help in getting through firewalls (open ports), ssh, etc.
– The newer Globus Online incorporates all of these plus small-file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP.
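For completeness, a sketch of driving a Globus (formerly Globus Online) transfer from Python using the globus_sdk package. The token, endpoint UUIDs, and paths are placeholders, and the exact constructor arguments vary between SDK versions.

```python
import globus_sdk

TOKEN = "..."         # placeholder: a transfer-scoped access token
SRC = "ddd59af0-..."  # placeholder: source endpoint/collection UUID
DST = "a1b2c3d4-..."  # placeholder: destination endpoint/collection UUID

tc = globus_sdk.TransferClient(authorizer=globus_sdk.AccessTokenAuthorizer(TOKEN))

# Build a transfer task; the service handles parallel streams, retries on
# transient faults, checksums, and notification.
task = globus_sdk.TransferData(source_endpoint=SRC, destination_endpoint=DST,
                               label="example bulk transfer", sync_level="checksum")
task.add_item("/source/path/dataset/", "/dest/path/dataset/", recursive=True)

result = tc.submit_transfer(task)
print("submitted task:", result["task_id"])
```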

36

System software tuning: Data transfer tools
• Also see Caltech's FDT (Fast Data Transfer) approach.
– Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node.
– Explicit parallel use of multiple disks
– Can fill 100 Gb/s paths
– See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT

37

4.4) System software tuning: Other issues
• Firewalls are anathema to high-speed data flows.
– many firewalls can't handle >1 Gb/s flows
• designed for large numbers of low-bandwidth flows
• some firewalls even strip out the TCP options that allow for TCP buffers > 64 KB
– See Jason Zurawski's "Say Hello to your Frienemy – The Firewall"
• Stateful firewalls have inherent problems that inhibit high throughput.
– http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues:
– Large MTUs (several issues)
– NIC tuning
• Defaults are usually fine for 1GE, but 10GE often requires additional tuning
– Other OS tuning knobs
– See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])

5) Site infrastructure to support data-intensive science: The Science DMZ
• With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck.
• The site network (LAN) typically provides connectivity for the local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science.
– Therefore a high-performance interface between the wide area network and the local area site network is critical for large-scale data movement.
• Campus network infrastructure is typically not designed to handle the flows of large-scale science.
– The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows:
• firewalls, proxy servers, low-cost switches, and so forth,
• none of which will allow high-volume, high-bandwidth, long-distance data flows.

39

The Science DMZ
• To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large-data-volume, high round trip time (RTT) (international paths) characteristics of the wide area network (WAN) flows (see [DIS]).
– Otherwise the site will impose poor performance on the entire high-speed data path, all the way back to the source.

40

The Science DMZ
The Science DMZ concept:
• The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy.
– Outside the site firewall – hence the term "Science DMZ"
– With dedicated systems built and tuned for wide-area data transfer
– With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below)
– With a security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.)
• This is so important that it was a requirement for the last round of NSF CC-NIE grants.

41

The Science DMZ
(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)
[Diagram: the border router connects the WAN to a Science DMZ router/switch (a WAN-capable device) over a clean, high-bandwidth WAN data path. The Science DMZ hosts a high-performance Data Transfer Node, network monitoring and testing, and per-service security policy control points – dedicated systems built and tuned for wide-area data transfer. The campus/site LAN, the computing cluster, and the site DMZ (Web/DNS/Mail) sit behind the site firewall; campus/site access to Science DMZ resources is via the site firewall, and secured campus/site access to the Internet is separate.]

42

6) Data movement and management techniques
• Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
• In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery (a minimal sketch of this idea follows below).
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers.
• The Tier 2 sites get a comparable amount of data from the Tier 1s.
– They host the physics groups that analyze the data and do the science.
– They provide most of the compute resources for analysis.
– They cache the data (though this is evolving to remote I/O).
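A toy illustration of the "automation with error recovery" requirement (not the ATLAS machinery): a queue of transfer requests that are retried with exponential backoff until they succeed. The do_transfer() function is a placeholder for a real mover.

```python
import time
from collections import deque

def do_transfer(src, dst):
    """Placeholder for a real mover call (e.g. GridFTP/Globus); returns success."""
    print(f"transferring {src} -> {dst}")
    return True

def run_queue(requests, max_attempts=5):
    queue = deque((src, dst, 0) for src, dst in requests)
    while queue:
        src, dst, attempts = queue.popleft()
        if do_transfer(src, dst):
            print(f"done: {src} -> {dst}")
        elif attempts + 1 < max_attempts:
            time.sleep(min(60 * 2 ** attempts, 3600))   # exponential backoff
            queue.append((src, dst, attempts + 1))      # requeue for retry
        else:
            print(f"FAILED after {max_attempts} attempts: {src} -> {dst}")

run_queue([("tier0:/raw/run123", "tier1-bnl:/raw/run123")])
```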

43

Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management.
– The resources and data movement are centrally managed.
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations (a toy illustration of this placement rule follows below).
– The system manages tens of thousands of jobs a day:
• it coordinates data movement of hundreds of terabytes/day, and
• it manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science.
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial.
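A toy version of the placement rule described on the next slide ("try to move the job to where the data is, else move data and job to where resources are available"); the site names and numbers are invented.

```python
# Hypothetical site state: free job slots and locally cached datasets.
sites = {
    "BNL":    {"free_slots": 120, "datasets": {"data15_13TeV.A"}},
    "CNAF":   {"free_slots": 0,   "datasets": {"data15_13TeV.B"}},
    "TRIUMF": {"free_slots": 40,  "datasets": set()},
}

def place_job(dataset):
    """Return (site, data_move_needed) for a job that reads `dataset`."""
    # 1) Prefer a site that already holds the data and has free slots.
    for name, s in sites.items():
        if dataset in s["datasets"] and s["free_slots"] > 0:
            return name, False
    # 2) Otherwise pick the site with the most free slots and move the data there.
    name = max(sites, key=lambda n: sites[n]["free_slots"])
    return name, True

site, move = place_job("data15_13TeV.A")
print(site, "move data first" if move else "data already on site")
```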

44

[Diagram: The ATLAS PanDA "Production and Distributed Analysis" system, which uses distributed resources and layers of automation to manage several million jobs/day.
• ATLAS production jobs, regional production jobs, and user/group analysis jobs enter the PanDA Server (task management) through a Task Buffer (job queue), supported by a Job Broker, a Policy module (job type priority), a Data Service, and a Job Dispatcher.
• 1) PanDA schedules jobs and initiates data movement. 2) The Distributed Data Manager (DDM agents; a complex system in its own right, called DQ2) locates data and moves it to sites. 3) The Grid Scheduler and Site Capability Service prepare the local resources to receive PanDA jobs. 4) Jobs are dispatched when resources are available and when the required data is in place at the site.
• Job resource manager: a "pilot" job manager (a PanDA job receiver) is dispatched when resources are available at a site; pilots run under the local site job manager (e.g. Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA – similar to the Condor Glide-in approach.
• The CERN ATLAS detector feeds the Tier 0 Data Center (1 copy of all data – archival only). The ATLAS Tier 1 Data Centers – 11 sites scattered across Europe, North America, and Asia – in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis. The ATLAS analysis sites (e.g. 70 Tier 2 centers in Europe, North America, and SE Asia) run the analysis jobs.
• Placement rule: try to move the job to where the data is; else move data and job to where resources are available.
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]

45

Scale of ATLAS analysis driven data movement
• The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s.
• PanDA manages 120,000–140,000 simultaneous jobs. (PanDA manages two types of jobs, which are shown separately here.)
• It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
[Figures: accumulated data volume on disk (petabytes, 0–150, growing over four years at 730 TBytes/day), and the number of simultaneous PanDA jobs of each of the two types (roughly 0–100,000 and 0–50,000) over one year.]

46

Building an LHC-scale production analysis system
• In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure.
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges."
– Successful testing was required for sites to participate in LHC production.

47

Ramp-up of LHC traffic in ESnet
[Figure: LHC traffic in ESnet over time (with an estimate of the "small"-scale traffic), showing the LHC data system testing period, LHC turn-on, and LHC operation.]
• The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.

48

6 cont.) Evolution of network architectures
• For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called the LHCOPN.
– The LHCOPN is a collection of leased 10 Gb/s optical circuits.
– The role of the LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.

49

The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.
• The issues related to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance. The security issues were the primary ones, and were addressed by:
• using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec])
– that is, only LHC data and compute servers are connected to the OPN.

50

The LHC OPN – Optical Private Network
[Diagram: LHCOPN physical topology (abbreviated) and LHCOPN architecture – CH-CERN connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]

51

The LHC OPN – Optical Private Network
NB:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.

Managing large-scale science traffic in a shared infrastructure

• The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic.
– (There are about 170 Tier 2 sites.)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavyweight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.

53

The LHC's Open Network Environment – LHCONE
• LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
• The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems.
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.).
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).
• In this way the LHC traffic will use circuits designated by the network engineers,
– to ensure continued good performance for the LHC and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.

54

[Diagram: "LHCONE: A global infrastructure for the LHC Tier 1 data center – Tier 2 analysis center connectivity" (April 2012). LHCONE VRF domains – ESnet (USA), Internet2 (USA), CANARIE (Canada), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), GÉANT (Europe), TWAREN and ASGC (Taiwan), KREONET2 (Korea), CUDI (Mexico) – interconnect Tier 1 centers (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1a/c, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1) and many Tier 2/Tier 3 end sites through regional R&E communication nexus points (Seattle, Chicago, New York, Washington, Amsterdam, Geneva, ...). Data communication links are 10, 20, and 30 Gb/s. End sites are LHC Tier 2 or Tier 3 unless indicated as Tier 1. See http://lhcone.net for details.]

55

The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See lhcone.net

LHCONE is one part of the network infrastructure that supports the LHC

  CERN -> T1        miles    kms
  France              350     565
  Italy               570     920
  UK                  625    1000
  Netherlands         625    1000
  Germany             700    1185
  Spain               850    1400
  Nordic             1300    2100
  USA - New York     3900    6300
  USA - Chicago      4400    7100
  Canada - BC        5200    8400
  Taiwan             6100    9850

[Diagram: "A Network Centric View of the LHC." The detector produces ~1 PB/s into the Level 1 and 2 triggers (O(1–10) meters away), then the Level 3 trigger (O(10–100) meters), then the CERN Computer Center (O(1) km) at 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS). The LHC Optical Private Network (LHCOPN) carries the data 500–10,000 km to the LHC Tier 1 data centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN). The LHC Open Network Environment (LHCONE) connects the Tier 1 centers to the LHC Tier 2 analysis centers and the university physics groups; this is intended to indicate that the physics groups now get their data wherever it is most readily available.]

57

7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed application systems in order to
– couple existing pockets of code, data, and expertise into "systems of systems,"
– break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites,
– see https://www.es.net/about/science-requirements
• A commonly identified need to support this is that networking must be provided as a "service":
– schedulable, with guaranteed bandwidth – as is done with CPUs and disks
– traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
– some network path characteristics may also be specified – e.g. diversity
– available in the Web Services / Grid Services paradigm

58

Point-to-Point Virtual Circuit Service
• The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
– This is typically done by using a "static" routing mechanism.
• E.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path.
– MPLS and OpenFlow are examples of this, and both can transport IP packets.
– Most modern Internet routers have this type of functionality.
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage and optimize the use of available network resources and to keep big data flows separate from general traffic.
– The virtual circuits can be directed to specific physical network paths when they are set up (a hypothetical reservation request is sketched below).
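To make the "network as a schedulable service" idea concrete, here is a sketch of what a bandwidth-reservation request to a circuit controller might look like. The endpoint URL, field names, and REST style are hypothetical and do not correspond to the actual OSCARS or NSI message formats.

```python
import json
import urllib.request

# Hypothetical reservation: 5 Gb/s between two site VLANs for a 6-hour window.
reservation = {
    "src": "fnal-rtr-1:xe-0/0/0:vlan=3012",
    "dst": "bnl-rtr-2:xe-1/2/0:vlan=3012",
    "bandwidth_mbps": 5000,
    "start": "2014-04-01T02:00:00Z",
    "end":   "2014-04-01T08:00:00Z",
    "description": "LHC Tier1-Tier1 bulk transfer",
}

req = urllib.request.Request(
    "https://idc.example.net/reservations",   # hypothetical controller endpoint
    data=json.dumps(reservation).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print("reservation id:", json.load(resp).get("id"))
```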

59

Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.

60

End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part.
• How are the circuits used?
– End system to end system, IP:
• Almost never – very hard unless private address space is used.
– Using public address space can result in leaking routes.
– Using private address space with multi-homed hosts risks allowing backdoors into secure networks.
– End system to end system, Ethernet (or other) over VLAN – a pseudowire:
• Relatively common.
• Interesting example: RDMA over VLAN is likely to be popular in the future.
– The SC11 demo of 40G RDMA over the WAN was very successful.
– The CPU load for RDMA is a small fraction of that for IP.
– The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks).
– Point-to-point connection between routing instances – e.g. BGP at the end points:
• Essentially this is how all current circuits are used, from one site router to another site router.
– Typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters.

61

End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot.
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering.

62

Cross-Domain Virtual Circuit Service
• Large-scale science always involves institutions in multiple network domains (administrative units).
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits.
– E.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.

63

Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.

[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] through ESnet (AS293) [US], GÉANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local inter-domain controller (IDC) – OSCARS in ESnet, AutoBAHN in GÉANT – with a data plane connection helper at each domain ingress/egress point.
1. The domains exchange topology information containing at least the potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. an Ethernet VLAN-to-VLAN connection) is facilitated by a helper process.]
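A toy rendering of the two-realm structure just described: each domain controller reserves its own segment, and the inter-domain setup simply chains the per-domain reservations along the path, releasing earlier segments if any domain refuses. The domain names follow the diagram; the capacities and logic are invented for illustration.

```python
class DomainController:
    """Stand-in for a per-domain controller such as OSCARS or AutoBAHN."""
    def __init__(self, name, capacity_mbps):
        self.name, self.capacity = name, capacity_mbps

    def reserve_segment(self, mbps):
        if mbps <= self.capacity:        # commit local resources
            self.capacity -= mbps
            return True
        return False

def setup_end_to_end(path, mbps):
    """Pass the VC setup request domain to domain; roll back on any refusal."""
    committed = []
    for dc in path:
        if dc.reserve_segment(mbps):
            committed.append(dc)
        else:
            for done in committed:       # release already-reserved segments
                done.capacity += mbps
            return False
    return True

path = [DomainController("ESnet", 10000), DomainController("GEANT", 10000),
        DomainController("DFN", 2000)]
print("circuit up:", setup_end_to_end(path, 5000))   # False: the DFN segment refuses
```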

64

Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grid framework, so that computing, data access, and data movement can all work together as a predictable system.
• Multi-domain circuit setup is not yet a robust production service, but progress is being made.
• See lhcone.net

65

8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science.
– With each generation of network transport technology:
• 155 Mb/s was the norm for high-speed networks in 1995;
• 100 Gb/s – 650 times greater – is the norm today.
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
• and then do the development necessary for applications to make use of the new capabilities.
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths;
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s.

66

Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.

67

The knowledge base
• http://fasterdata.es.net topics:
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations.

68

The Message
• A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
• Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

69

Infrastructure Critical to Science
• The combination of
– new network architectures in the wide area,
– new network services (such as guaranteed bandwidth virtual circuits),
– cross-domain network error detection and correction,
– redesigning the site LAN to handle high data throughput,
– automation of data movement systems, and
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.

70

LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated / sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.

71

LHC lessons of possible use to the SKA
The lessons:
• The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
– A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
– The technical aspects of building and operating a centralized working data repository –
• a large mass storage system with very large cache disks in order to satisfy current requests in an acceptable time, and
• high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites –
militate against a single large data center.

72

LHC lessons of possible use to the SKA
• The LHC model of distributed data (multiple regional centers) has worked well.
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.

73

LHC lessons of possible use to the SKA
• Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.

74

LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
• Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.
• Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.

75

The Message
Again ...
• A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
• Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

Page 16: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

16

We face a continuous growth of data transportbull The LHC data volume is predicated to grow 10 fold over the

next 10 yearsNew generations of instruments ndash for example the Square

Kilometer Array radio telescope and ITER (the international fusion experiment) ndash will generate more data than the LHC

In response ESnet and most large RampE networks have built 100 Gbs (per optical channel) networksndash ESnets new network ndash ESnet5 ndash is complete and provides a 44 x

100Gbs (44 terabitssec - 4400 gigabitssec) in optical channels across the entire ESnet national footprint

ndash Initially one of these 100 Gbs channels is configured to replace the current 4 x 10 Gbs IP network

bull What has made this possible

17

1a) Optical Network TechnologyModern optical transport systems (DWDM = dense wave

division multiplexing) use a collection of technologies called ldquocoherent opticalrdquo processing to achieve more sophisticated optical modulation and therefore higher data density per signal transport unit (symbol) that provides 100Gbs per wave (optical channel)ndash Optical transport using dual polarization-quadrature phase shift keying

(DP-QPSK) technology with coherent detection [OIF1]bull dual polarization

ndash two independent optical signals same frequency orthogonal two polarizations rarr reduces the symbol rate by half

bull quadrature phase shift keying ndash encode data by changing the signal phase of the relative to the optical carrier further reduces the symbol rate by half (sends twice as much data symbol)

Together DP and QPSK reduce required rate by a factor of 4ndash allows 100G payload (plus overhead) to fit into 50GHz of spectrum

bull Actual transmission rate is about 10 higher to include FEC data

ndash This is a substantial simplification of the optical technology involved ndash see the TNC 2013 paper and Chris Tracyrsquos NANOG talk for details [Tracy1] and [Rob1]

18

Optical Network Technology ESnet5rsquos optical network uses Cienarsquos 6500 Packet-Optical Platform with

WaveLogictrade to provide 100Gbs wavendash 88 waves (optical channels) 100Gbs each

bull wave capacity shared equally with Internet2ndash ~13000 miles 21000 km lit fiberndash 280 optical amplifier sitesndash 70 optical adddrop sites (where routers can be inserted)

bull 46 100G adddrop transpondersbull 22 100G re-gens across wide-area

NEWG

SUNN

KANSDENV

SALT

BOIS

SEAT

SACR

WSAC

LOSA

LASV

ELPA

ALBU

ATLA

WASH

NEWY

BOST

SNLL

PHOE

PAIX

NERSC

LBNLJGI

SLAC

NASHCHAT

CLEV

EQCH

STA

R

ANLCHIC

BNL

ORNL

CINC

SC11

STLO

Internet2

LOUI

FNA

L

Long IslandMAN and

ANI Testbed

O

JACKGeography is

only representational

19

1b) Network routers and switchesESnet5 routing (IP layer 3) is provided by Alcatel-Lucent

7750 routers with 100 Gbs client interfacesndash 17 routers with 100G interfaces

bull several more in a test environment ndash 59 layer-3 100GigE interfaces 8 customer-owned 100G routersndash 7 100G interconnects with other RampE networks at Starlight (Chicago)

MAN LAN (New York) and Sunnyvale (San Francisco)

20

Metro area circuits

SNLL

PNNL

MIT

PSFC

AMES

LLNL

GA

JGI

LBNL

SLACNER

SC

ORNL

ANLFNAL

SALT

INL

PU Physics

SUNN

SEAT

STAR

CHIC

WASH

ATLA

HO

US

BOST

KANS

DENV

ALBQ

LASV

BOIS

SAC

R

ELP

A

SDSC

10

Geographical representation is

approximate

PPPL

CH

AT

10

SUNN STAR AOFA100G testbed

SF Bay Area Chicago New York AmsterdamAMST

US RampE peerings

NREL

Commercial peerings

ESnet routers

Site routers

100G

10-40G

1G Site provided circuits

LIGO

Optical only

SREL

100thinsp

Intrsquol RampE peerings

100thinsp

JLAB

10

10100thinsp

10

100thinsp100thinsp

1

10100thinsp

100thinsp1

100thinsp100thinsp

100thinsp

100thinsp

BNL

NEWY

AOFA

NASH

1

LANL

SNLA

10

10

1

10

10

100thinsp

100thinsp

100thinsp10

1010

100thinsp

100thinsp

10

10

100thinsp

100thinsp

100thinsp

100thinsp

100thinsp

100thinsp100thinsp

100thinsp

10

100thinsp

The Energy Sciences Network ESnet5 (Fall 2013)

2) Data transport The limitations of TCP must be addressed for large long-distance flows

Although there are other transport protocols available TCP remains the workhorse of the Internet including for data-

intensive scienceUsing TCP to support the sustained long distance high data-

rate flows of data-intensive science requires an error-free network

Why error-freeTCP is a ldquofragile workhorserdquo It is very sensitive to packet loss (due to bit errors)ndash Very small packet loss rates on these paths result in large decreases

in performance)ndash A single bit error will cause the loss of a 1-9 KBy packet (depending

on the MTU size) as there is no FEC at the IP level for error correctionbull This puts TCP back into ldquoslow startrdquo mode thus reducing throughput

22

Transportbull The reason for TCPrsquos sensitivity to packet loss is that the

slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internetndash Packet loss is seen by TCPrsquos congestion control algorithms as

evidence of congestion so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion leading to network throughput collapse)

ndash Network link errors also cause packet loss so these congestion avoidance algorithms come into play with dramatic effect on throughput in the wide area network ndash hence the need for ldquoerror-freerdquo

23

Transport Impact of packet loss on TCPOn a 10 Gbs LAN path the impact of low packet loss rates is

minimalOn a 10Gbs WAN path the impact of low packet loss rates is

enormous (~80X throughput reduction on transatlantic path)

Implications Error-free paths are essential for high-volume long-distance data transfers

Throughput vs increasing latency on a 10Gbs link with 00046 packet loss

Reno (measured)

Reno (theory)

H-TCP(measured)

No packet loss

(see httpfasterdataesnetperformance-testingperfso

nartroubleshootingpacket-loss)

Network round trip time ms (corresponds roughly to San Francisco to London)

10000

9000

8000

7000

6000

5000

4000

3000

2000

1000

0

Thro

ughp

ut M

bs

24

Transport Modern TCP stackbull A modern TCP stack (the kernel implementation of the TCP

protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])ndash This is done using mechanisms that more quickly increase back to full

speed after an error forces a reset to low bandwidth

TCP Results

0

100

200

300

400

500

600

700

800

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35time slot (5 second intervals)

Mbi

tss

econ

d

Linux 26 BIC TCPLinux 24Linux 26 BIC off

RTT = 67 ms

ldquoBinary Increase Congestionrdquo control algorithm impact

Note that BIC reaches max throughput much faster than older algorithms (from Linux 2619 the

default is CUBIC a refined version of BIC designed for high bandwidth

long paths)

25

Transport Modern TCP stackEven modern TCP stacks are only of some help in the face of

packet loss on a long path high-speed network

bull For a detailed analysis of the impact of packet loss on various TCP implementations see ldquoAn Investigation into Transport Protocols and Data Transport Applications Over High Performance Networksrdquo chapter 8 (ldquoSystematic Tests of New-TCP Behaviourrdquo) by Yee-Ting Li University College London (PhD thesis) httpwwwslacstanfordedu~ytlthesispdf

Reno (measured)

Reno (theory)

H-TCP (CUBIC refinement)(measured)

Throughput vs increasing latency on a 10Gbs link with 00046 packet loss(tail zoom)

Roundtrip time ms (corresponds roughly to San Francisco to London)

1000

900800700600500400300200100

0

Thro

ughp

ut M

bs

26

3) Monitoring and testingThe only way to keep multi-domain international scale networks error-free is to test and monitor continuously

end-to-end to detect soft errors and facilitate their isolation and correction

perfSONAR provides a standardize way to test measure export catalogue and access performance data from many different network domains (service providers campuses etc)

bull perfSONAR is a community effort tondash define network management data exchange protocols andndash standardized measurement data formats gathering and archiving

perfSONAR is deployed extensively throughout LHC related networks and international networks and at the end sites(See [fasterdata] [perfSONAR] and [NetSrv])

ndash There are now more than 1000 perfSONAR boxes installed in N America and Europe

27

perfSONARThe test and monitor functions can detect soft errors that limit

throughput and can be hard to find (hard errors faults are easily found and corrected)

Soft failure examplebull Observed end-to-end performance degradation due to soft failure of single optical line card

Gb

s

normal performance

degrading performance

repair

bull Why not just rely on ldquoSNMPrdquo interface stats for this sort of error detectionbull not all error conditions show up in SNMP interface statisticsbull SNMP error statistics can be very noisybull some devices lump different error counters into the same bucket so it can be very

challenging to figure out what errors to alarm on and what errors to ignorebull though ESnetrsquos Spectrum monitoring system attempts to apply heuristics to do this

bull many routers will silently drop packets - the only way to find that is to test through them and observe loss using devices other than the culprit device

one month

28

perfSONARThe value of perfSONAR increases dramatically as it is

deployed at more sites so that more of the end-to-end (app-to-app) path can characterized across multiple network domains provides the only widely deployed tool that can monitor circuits end-

to-end across the different networks from the US to Europendash ESnet has perfSONAR testers installed at every PoP and all but the

smallest user sites ndash Internet2 is close to the same

bull perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) and the protocol is implemented using SOAP XML messages

29

4) System software evolution and optimizationOnce the network is error-free there is still the issue of

efficiently moving data from the application running on a user system onto the network

bull Host TCP tuningbull Modern TCP stack (see above)bull Other issues (MTU etc)

bull Data transfer tools and parallelism

bull Other data transfer issues (firewalls etc)

30

41) System software tuning Host tuning ndash TCPbull ldquoTCP tuningrdquo commonly refers to the proper configuration of

TCP windowing buffers for the path lengthbull It is critical to use the optimal TCP send and receive socket

buffer sizes for the path (RTT) you are using end-to-endDefault TCP buffer sizes are typically much too small for

todayrsquos high speed networksndash Until recently default TCP sendreceive buffers were typically 64 KBndash Tuned buffer to fill CA to NY 1 Gbs path 10 MB

bull 150X bigger than the default buffer size

31

System software tuning Host tuning ndash TCPbull Historically TCP window size tuning parameters were host-

global with exceptions configured per-socket by applicationsndash How to tune is a function of the application and the path to the

destination so potentially a lot of special cases

Auto-tuning TCP connection buffer size within pre-configured limits helps

Auto-tuning however is not a panacea because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (eg international) paths

32

System software tuning Host tuning ndash TCP

Throughput out to ~9000 km on a 10Gbs network32 MBy (auto-tuned) vs 64 MBy (hand tuned) TCP window size

hand tuned to 64 MBy window

Roundtrip time ms (corresponds roughlyto San Francisco to London)

path length

10000900080007000600050004000300020001000

0

Thro

ughp

ut M

bs

auto tuned to 32 MBy window

33

42) System software tuning Data transfer toolsParallelism is key in data transfer tools

ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection

bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)

ndash Several tools offer parallel transfers (see below)

Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN

transfersndash Many tools and protocols assume latencies typical of a LAN

environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long

path networks

bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more

than about 500 Mbs

34

System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL

RTT = 53 ms network capacity = 10GbpsTool Throughput

bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology

bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase

bull this helps rsync too

35

System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-

performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open

ports) ssh etc The newer Globus Online incorporates all of these and small file

support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community

outside of HEP

36

System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach

ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node

ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and

httpmonalisacernchFDT

37

4.4) System software tuning: Other issues

Firewalls are anathema to high-speed data flows
– many firewalls can't handle >1 Gb/s flows
• designed for large numbers of low bandwidth flows
• some firewalls even strip out TCP options that allow for TCP buffers > 64 KB
– See Jason Zurawski's "Say Hello to your Frienemy – The Firewall"
– Stateful firewalls have inherent problems that inhibit high throughput
• http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf

• Many other issues
– Large MTUs (several issues)
– NIC tuning
• Defaults are usually fine for 1GE, but 10GE often requires additional tuning (see the host-check sketch below)
– Other OS tuning knobs
– See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])
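Since several of these knobs (MTU, congestion control algorithm, socket buffer ceilings) live in the host OS, a quick way to see where a data transfer host stands is to read them directly from /proc and /sys, as in the minimal sketch below. Linux is assumed, the interface name is a placeholder, and the "right" values depend on the path, per the fasterdata guidance.

```python
# Minimal sketch (Linux assumed): report NIC MTU and the TCP settings that
# matter most for long, fast paths. "eth0" is a placeholder interface name.
from pathlib import Path

def read(path: str) -> str:
    p = Path(path)
    return p.read_text().strip() if p.exists() else "n/a"

iface = "eth0"  # substitute the DTN's 10GE interface name
print("MTU:                ", read(f"/sys/class/net/{iface}/mtu"))   # 9000 = jumbo frames
print("congestion control: ", read("/proc/sys/net/ipv4/tcp_congestion_control"))
print("net.core.rmem_max:  ", read("/proc/sys/net/core/rmem_max"))
print("net.ipv4.tcp_rmem:  ", read("/proc/sys/net/ipv4/tcp_rmem"))   # min default max
```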

5) Site infrastructure to support data-intensive science: The Science DMZ

With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck

The site network (LAN) typically provides connectivity for local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science
– Therefore a high performance interface between the wide area network and the local area site network is critical for large-scale data movement

Campus network infrastructure is typically not designed to handle the flows of large-scale science
– The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows
• firewalls, proxy servers, low-cost switches, and so forth
• none of which will allow high volume, high bandwidth, long distance data flows

39

The Science DMZ

To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large data volume, high round trip time (RTT) (international paths) character of the wide area network (WAN) flows (see [DIS])
– otherwise the site will impose poor performance on the entire high speed data path, all the way back to the source

40

The Science DMZ

The Science DMZ concept:
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy
– Outside the site firewall – hence the term "Science DMZ"
– With dedicated systems built and tuned for wide-area data transfer
– With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below)
– A security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.)

This is so important it was a requirement for the last round of NSF CC-NIE grants

41

The Science DMZ

(See http://fasterdata.es.net/science-dmz and [SDMZ] for a much more complete discussion of the various approaches)

[Diagram: a typical Science DMZ. A WAN-capable border router / Science DMZ router-switch sits between the WAN and the campus/site LAN. A clean, high-bandwidth WAN data path connects it to dedicated systems built and tuned for wide-area data transfer (high performance Data Transfer Nodes), to network monitoring and testing systems, and to a computing cluster, with per-service security policy control points. Secured campus/site access to the Internet, and campus/site access to Science DMZ resources, is via the site firewall; the site DMZ (Web/DNS/Mail) remains behind it.]

42

6) Data movement and management techniques

Automated data movement is critical for moving 500 terabytes/day between 170 international sites
– In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery

• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers

• The Tier 2 sites get a comparable amount of data from the Tier 1s
– Host the physics groups that analyze the data and do the science
– Provide most of the compute resources for analysis
– Cache the data (though this is evolving to remote I/O)

43

Highly distributed and highly automated workflow systems

• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management
– The resources and data movement are centrally managed
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations
– The system manages tens of thousands of jobs a day
• coordinates data movement of hundreds of terabytes/day, and
• manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science

• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial (a minimal scheduling sketch follows)
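The essence of that coordination, described in the PanDA diagram on the next slide, is a matchmaking loop between jobs, dataset locations, and free compute slots. The sketch below is a deliberately tiny, hypothetical illustration of the "run the job where the data is, else move the data" idea; it is not PanDA code, and all names in it are invented.

```python
# Minimal sketch of "move the job to the data, else move the data to the
# resources". NOT PanDA's actual code; all names here are hypothetical.
from collections import namedtuple

Job = namedtuple("Job", "id dataset")
Site = namedtuple("Site", "name datasets free_slots")

def place(job, sites):
    # Prefer a site that already holds the dataset and has free slots.
    for s in sites:
        if job.dataset in s.datasets and s.free_slots > 0:
            return s.name, None                      # run the job where the data is
    # Otherwise pick any site with free slots and schedule a data transfer.
    for s in sites:
        if s.free_slots > 0:
            return s.name, f"transfer {job.dataset} -> {s.name}"
    return None, None                                # no resources: queue for later

sites = [Site("BNL", {"dsetA"}, 0), Site("DESY", {"dsetB"}, 5)]
print(place(Job(1, "dsetA"), sites))   # ('DESY', 'transfer dsetA -> DESY')
```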

44

[Diagram: the ATLAS PanDA ("Production and Distributed Analysis") system, which uses distributed resources and layers of automation to manage several million jobs/day. Main elements:
– CERN: the ATLAS detector and the Tier 0 Data Center (1 copy of all data – archival only)
– ATLAS Tier 1 Data Centers: 11 sites scattered across Europe, North America and Asia that in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis
– ATLAS analysis sites (e.g. 70 Tier 2 Centers in Europe, North America and SE Asia)
– The PanDA Server (task management), with Task Buffer (job queue), Job Broker, Policy (job type priority), Job Dispatcher, and Data Service, fed by ATLAS production jobs, regional production jobs, and user/group analysis jobs
– The Distributed Data Manager (a complex system in its own right, called DQ2) and DDM agents
– The job resource manager: it dispatches a "pilot" job manager – a PanDA job receiver – when resources are available at a site; pilots run under the local site job manager (e.g. Condor, LSF, LCG, …) and accept jobs in a standard format from PanDA (similar to the Condor Glide-in approach); a Grid Scheduler and Site Capability Service track site status
Workflow: 1) PanDA schedules jobs and initiates data movement; 2) the DDM locates data and moves it to sites; 3) pilots prepare the local resources to receive PanDA jobs; 4) jobs are dispatched when resources are available and the required data is in place at the site. The system tries to move the job to where the data is, else moves data and job to where resources are available.
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe, N. America and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s

PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately here)

It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC

[Charts: accumulated data volume on disk (petabytes, 0–150 over four years, growing at 730 TBytes/day) and the number of simultaneous PanDA jobs of each of the two types (roughly 0–100,000 and 0–50,000, each over one year).]

46

Building an LHC-scale production analysis system

In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges"
– Successful testing was required for sites to participate in LHC production

47

Ramp-up of LHC traffic in ESnet

[Chart: ESnet traffic over time, with an estimate of "small" scale traffic, annotated with the LHC data system testing period, LHC turn-on, and LHC operation.]

The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years

48

6, cont.) Evolution of network architectures

For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed

• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN
– The LHCOPN is a collection of leased 10 Gb/s optical circuits
– The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN – Optical Private Network

• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community

• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance
– The security issues were the primary ones, and were addressed by using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec])
– that is, only LHC data and compute servers are connected to the OPN

50

The LHC OPN – Optical Private Network

[Diagram: LHCOPN physical topology (abbreviated) and LHCOPN architecture. CH-CERN is connected to the Tier 1 centers: UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, IT-INFN-CNAF.]

51

The LHC OPN – Optical Private Network

NB:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits

• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact this is what ESnet's OSCARS virtual circuit system was originally designed for (see below)
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
– (there are about 170 Tier 2 sites)

• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism

• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose

53

The LHC's Open Network Environment – LHCONE

LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineers
– To ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC

54

[Map (April 2012): LHCONE – a global infrastructure for LHC Tier 1 data center to Tier 2 analysis center connectivity. The map shows LHCONE VRF domains (ESnet and Internet2 in the USA, CANARIE in Canada, GÉANT in Europe, NORDUnet in the Nordic countries, DFN in Germany, GARR in Italy, RedIRIS in Spain, SARA in the Netherlands, RENATER in France, CUDI in Mexico, and networks in Taiwan, Korea and India), the Tier 1 centers and the Tier 2/Tier 3 end sites attached to them, regional R&E communication nexus points (e.g. Seattle, Chicago, New York, Washington, Amsterdam, Geneva), and data communication links of 10, 20 and 30 Gb/s. See http://lhcone.net for details.]

55

The LHC's Open Network Environment – LHCONE

• LHCONE could be set up relatively "quickly" because
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic

• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance

• See lhcone.net

LHCONE is one part of the network infrastructure that supports the LHC

[Diagram: a network-centric view of the LHC. At the detector, the Level 1 and 2 triggers (O(1-10) meters from the detector, ~1 PB/s) feed the Level 3 trigger (O(10-100) meters), which sends ~50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) over O(1) km to the CERN Computer Center. From there the LHC Optical Private Network (LHCOPN) carries data 500-10,000 km to the LHC Tier 1 Data Centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France), and the LHC Open Network Environment (LHCONE) connects the Tier 1 centers to the LHC Tier 2 Analysis Centers and the university physics groups. This is intended to indicate that the physics groups now get their data wherever it is most readily available.

Distance from CERN to the Tier 1 centers:
France 350 mi / 565 km; Italy 570 mi / 920 km; UK 625 mi / 1000 km; Netherlands 625 mi / 1000 km; Germany 700 mi / 1185 km; Spain 850 mi / 1400 km; Nordic 1300 mi / 2100 km; USA - New York 3900 mi / 6300 km; USA - Chicago 4400 mi / 7100 km; Canada - BC 5200 mi / 8400 km; Taiwan 6100 mi / 9850 km.]

57

7) New network services: Point-to-Point Virtual Circuit Service

Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to
– Couple existing pockets of code, data and expertise into "systems of systems"
– Break up the task of massive data analysis and use data, compute and storage resources that are located at the collaborators' sites
– See https://www.es.net/about/science-requirements

A commonly identified need to support this is that networking must be provided as a "service":
– Schedulable with guaranteed bandwidth – as is done with CPUs and disks
– Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
– Some network path characteristics may also be specified – e.g. diversity
– Available in the Web Services / Grid Services paradigm

58

Point-to-Point Virtual Circuit Service

The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
– This is typically done by using a "static" routing mechanism
• e.g. some variation of label based switching, with the static switch tables set up in advance to define the circuit path
– MPLS and OpenFlow are examples of this, and both can transport IP packets
– Most modern Internet routers have this type of functionality

• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
– The virtual circuits can be directed to specific physical network paths when they are set up (a hypothetical reservation request is sketched below)
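To make the "network as a service" idea concrete, the sketch below builds the kind of reservation request a guaranteed-bandwidth circuit service might accept: two endpoints, a bandwidth guarantee, and a time window. The field names, endpoint syntax, and service behavior are hypothetical illustrations only; they are not the actual OSCARS or NSI interfaces.

```python
# Purely illustrative sketch of a guaranteed-bandwidth circuit reservation
# request. Field names and endpoints are hypothetical and do NOT reflect the
# real OSCARS or NSI APIs.
import json
from datetime import datetime, timedelta, timezone

start = datetime.now(timezone.utc) + timedelta(hours=1)
request = {
    "source_endpoint": "site-a.example.org:eth7/1",       # hypothetical edge port
    "destination_endpoint": "site-b.example.org:eth3/2",  # hypothetical edge port
    "bandwidth_mbps": 5000,               # guaranteed bandwidth for the transfer
    "start_time": start.isoformat(),
    "end_time": (start + timedelta(hours=12)).isoformat(),
    "vlan": "any",                        # let the service choose the VLAN tag
}
print(json.dumps(request, indent=2))      # body that would be submitted to the service
```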

59

Point-to-Point Virtual Circuit Service

• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead: Chin Guok, chin@es.net)

• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service" in TERENA Networking Conference, 2011, in the references

• OSCARS received a 2013 "R&D 100" award

60

End User View of Circuits – How They Use Them

• Who are the "users"?
– Sites, for the most part

• How are the circuits used?
– End system to end system, IP
• Almost never – very hard unless private address space is used
– Using public address space can result in leaking routes
– Using private address space with multi-homed hosts risks allowing backdoors into secure networks
– End system to end system, Ethernet (or other) over VLAN – a pseudowire
• Relatively common
• Interesting example: RDMA over VLAN is likely to be popular in the future
– SC11 demo of 40G RDMA over WAN was very successful
– CPU load for RDMA is a small fraction of that of IP
– The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
– Point-to-point connection between routing instances – e.g. BGP at the end points
• Essentially this is how all current circuits are used: from one site router to another site router
– Typically site-to-site, or advertise subnets that host clusters, e.g. LHC analysis or data management clusters

61

End User View of Circuits – How They Use Them

• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot

• Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Service

• Large-scale science always involves institutions in multiple network domains (administrative units)
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
– e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains

63

Inter-Domain Control Protocol

• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling and resource commitment within network domains
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains

[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] through ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GEANT – and a data plane connection helper sits at each domain ingress/egress point.
1. The domains exchange topology information containing at least potential VC ingress and egress points
2. The VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved
3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process]

64

Point-to-Point Virtual Circuit Service

• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)

• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net

65

8) Provide R&D consulting and knowledge base

• R&D drove most of the advances that make it possible for the network to support data-intensive science
– With each generation of network transport technology:
• 155 Mb/s was the norm for high speed networks in 1995
• 100 Gb/s – 650 times greater – is the norm today
• R&D groups involving hardware engineers, computer scientists and application specialists worked to
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
• and then do the development necessary for applications to make use of the new capabilities
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s

66

Provide R&D consulting and knowledge base

• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical

Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone

The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations

67

The knowledge base: http://fasterdata.es.net topics

– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained

• fasterdata.es.net is a community project with contributions from several organizations

68

The Message

A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments
– But once this is done, international high-speed data management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

69

Infrastructure Critical to Science

• The combination of
– New network architectures in the wide area
– New network services (such as guaranteed bandwidth virtual circuits)
– Cross-domain network error detection and correction
– Redesigning the site LAN to handle high data throughput
– Automation of data movement systems
– Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

• Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKA

The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKA

The lessons:
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one location
– A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository:
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
• high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
militate against a single large data center

72

LHC lessons of possible use to the SKA

The LHC model of distributed data (multiple regional centers) has worked well
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKA

All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded
– All high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration

Re-engineering the site LAN/WAN architecture is critical: the Science DMZ

Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on

75

The Message

Again …
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments
But once this is done, international high-speed data management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References

[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References

[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References

[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
Also see http://www.perfsonar.net and http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements

[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C., Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

Page 17: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

17

1a) Optical Network TechnologyModern optical transport systems (DWDM = dense wave

division multiplexing) use a collection of technologies called ldquocoherent opticalrdquo processing to achieve more sophisticated optical modulation and therefore higher data density per signal transport unit (symbol) that provides 100Gbs per wave (optical channel)ndash Optical transport using dual polarization-quadrature phase shift keying

(DP-QPSK) technology with coherent detection [OIF1]bull dual polarization

ndash two independent optical signals same frequency orthogonal two polarizations rarr reduces the symbol rate by half

bull quadrature phase shift keying ndash encode data by changing the signal phase of the relative to the optical carrier further reduces the symbol rate by half (sends twice as much data symbol)

Together DP and QPSK reduce required rate by a factor of 4ndash allows 100G payload (plus overhead) to fit into 50GHz of spectrum

bull Actual transmission rate is about 10 higher to include FEC data

ndash This is a substantial simplification of the optical technology involved ndash see the TNC 2013 paper and Chris Tracyrsquos NANOG talk for details [Tracy1] and [Rob1]

18

Optical Network Technology ESnet5rsquos optical network uses Cienarsquos 6500 Packet-Optical Platform with

WaveLogictrade to provide 100Gbs wavendash 88 waves (optical channels) 100Gbs each

bull wave capacity shared equally with Internet2ndash ~13000 miles 21000 km lit fiberndash 280 optical amplifier sitesndash 70 optical adddrop sites (where routers can be inserted)

bull 46 100G adddrop transpondersbull 22 100G re-gens across wide-area

NEWG

SUNN

KANSDENV

SALT

BOIS

SEAT

SACR

WSAC

LOSA

LASV

ELPA

ALBU

ATLA

WASH

NEWY

BOST

SNLL

PHOE

PAIX

NERSC

LBNLJGI

SLAC

NASHCHAT

CLEV

EQCH

STA

R

ANLCHIC

BNL

ORNL

CINC

SC11

STLO

Internet2

LOUI

FNA

L

Long IslandMAN and

ANI Testbed

O

JACKGeography is

only representational

19

1b) Network routers and switchesESnet5 routing (IP layer 3) is provided by Alcatel-Lucent

7750 routers with 100 Gbs client interfacesndash 17 routers with 100G interfaces

bull several more in a test environment ndash 59 layer-3 100GigE interfaces 8 customer-owned 100G routersndash 7 100G interconnects with other RampE networks at Starlight (Chicago)

MAN LAN (New York) and Sunnyvale (San Francisco)

20

Metro area circuits

SNLL

PNNL

MIT

PSFC

AMES

LLNL

GA

JGI

LBNL

SLACNER

SC

ORNL

ANLFNAL

SALT

INL

PU Physics

SUNN

SEAT

STAR

CHIC

WASH

ATLA

HO

US

BOST

KANS

DENV

ALBQ

LASV

BOIS

SAC

R

ELP

A

SDSC

10

Geographical representation is

approximate

PPPL

CH

AT

10

SUNN STAR AOFA100G testbed

SF Bay Area Chicago New York AmsterdamAMST

US RampE peerings

NREL

Commercial peerings

ESnet routers

Site routers

100G

10-40G

1G Site provided circuits

LIGO

Optical only

SREL

100thinsp

Intrsquol RampE peerings

100thinsp

JLAB

10

10100thinsp

10

100thinsp100thinsp

1

10100thinsp

100thinsp1

100thinsp100thinsp

100thinsp

100thinsp

BNL

NEWY

AOFA

NASH

1

LANL

SNLA

10

10

1

10

10

100thinsp

100thinsp

100thinsp10

1010

100thinsp

100thinsp

10

10

100thinsp

100thinsp

100thinsp

100thinsp

100thinsp

100thinsp100thinsp

100thinsp

10

100thinsp

The Energy Sciences Network ESnet5 (Fall 2013)

2) Data transport The limitations of TCP must be addressed for large long-distance flows

Although there are other transport protocols available TCP remains the workhorse of the Internet including for data-

intensive scienceUsing TCP to support the sustained long distance high data-

rate flows of data-intensive science requires an error-free network

Why error-freeTCP is a ldquofragile workhorserdquo It is very sensitive to packet loss (due to bit errors)ndash Very small packet loss rates on these paths result in large decreases

in performance)ndash A single bit error will cause the loss of a 1-9 KBy packet (depending

on the MTU size) as there is no FEC at the IP level for error correctionbull This puts TCP back into ldquoslow startrdquo mode thus reducing throughput

22

Transportbull The reason for TCPrsquos sensitivity to packet loss is that the

slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internetndash Packet loss is seen by TCPrsquos congestion control algorithms as

evidence of congestion so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion leading to network throughput collapse)

ndash Network link errors also cause packet loss so these congestion avoidance algorithms come into play with dramatic effect on throughput in the wide area network ndash hence the need for ldquoerror-freerdquo

23

Transport Impact of packet loss on TCPOn a 10 Gbs LAN path the impact of low packet loss rates is

minimalOn a 10Gbs WAN path the impact of low packet loss rates is

enormous (~80X throughput reduction on transatlantic path)

Implications Error-free paths are essential for high-volume long-distance data transfers

Throughput vs increasing latency on a 10Gbs link with 00046 packet loss

Reno (measured)

Reno (theory)

H-TCP(measured)

No packet loss

(see httpfasterdataesnetperformance-testingperfso

nartroubleshootingpacket-loss)

Network round trip time ms (corresponds roughly to San Francisco to London)

10000

9000

8000

7000

6000

5000

4000

3000

2000

1000

0

Thro

ughp

ut M

bs

24

Transport Modern TCP stackbull A modern TCP stack (the kernel implementation of the TCP

protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])ndash This is done using mechanisms that more quickly increase back to full

speed after an error forces a reset to low bandwidth

TCP Results

0

100

200

300

400

500

600

700

800

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35time slot (5 second intervals)

Mbi

tss

econ

d

Linux 26 BIC TCPLinux 24Linux 26 BIC off

RTT = 67 ms

ldquoBinary Increase Congestionrdquo control algorithm impact

Note that BIC reaches max throughput much faster than older algorithms (from Linux 2619 the

default is CUBIC a refined version of BIC designed for high bandwidth

long paths)

25

Transport Modern TCP stackEven modern TCP stacks are only of some help in the face of

packet loss on a long path high-speed network

bull For a detailed analysis of the impact of packet loss on various TCP implementations see ldquoAn Investigation into Transport Protocols and Data Transport Applications Over High Performance Networksrdquo chapter 8 (ldquoSystematic Tests of New-TCP Behaviourrdquo) by Yee-Ting Li University College London (PhD thesis) httpwwwslacstanfordedu~ytlthesispdf

Reno (measured)

Reno (theory)

H-TCP (CUBIC refinement)(measured)

Throughput vs increasing latency on a 10Gbs link with 00046 packet loss(tail zoom)

Roundtrip time ms (corresponds roughly to San Francisco to London)

1000

900800700600500400300200100

0

Thro

ughp

ut M

bs

26

3) Monitoring and testingThe only way to keep multi-domain international scale networks error-free is to test and monitor continuously

end-to-end to detect soft errors and facilitate their isolation and correction

perfSONAR provides a standardize way to test measure export catalogue and access performance data from many different network domains (service providers campuses etc)

bull perfSONAR is a community effort tondash define network management data exchange protocols andndash standardized measurement data formats gathering and archiving

perfSONAR is deployed extensively throughout LHC related networks and international networks and at the end sites(See [fasterdata] [perfSONAR] and [NetSrv])

ndash There are now more than 1000 perfSONAR boxes installed in N America and Europe

27

perfSONARThe test and monitor functions can detect soft errors that limit

throughput and can be hard to find (hard errors faults are easily found and corrected)

Soft failure examplebull Observed end-to-end performance degradation due to soft failure of single optical line card

Gb

s

normal performance

degrading performance

repair

bull Why not just rely on ldquoSNMPrdquo interface stats for this sort of error detectionbull not all error conditions show up in SNMP interface statisticsbull SNMP error statistics can be very noisybull some devices lump different error counters into the same bucket so it can be very

challenging to figure out what errors to alarm on and what errors to ignorebull though ESnetrsquos Spectrum monitoring system attempts to apply heuristics to do this

bull many routers will silently drop packets - the only way to find that is to test through them and observe loss using devices other than the culprit device

one month

28

perfSONARThe value of perfSONAR increases dramatically as it is

deployed at more sites so that more of the end-to-end (app-to-app) path can characterized across multiple network domains provides the only widely deployed tool that can monitor circuits end-

to-end across the different networks from the US to Europendash ESnet has perfSONAR testers installed at every PoP and all but the

smallest user sites ndash Internet2 is close to the same

bull perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) and the protocol is implemented using SOAP XML messages

29

4) System software evolution and optimizationOnce the network is error-free there is still the issue of

efficiently moving data from the application running on a user system onto the network

bull Host TCP tuningbull Modern TCP stack (see above)bull Other issues (MTU etc)

bull Data transfer tools and parallelism

bull Other data transfer issues (firewalls etc)

30

41) System software tuning Host tuning ndash TCPbull ldquoTCP tuningrdquo commonly refers to the proper configuration of

TCP windowing buffers for the path lengthbull It is critical to use the optimal TCP send and receive socket

buffer sizes for the path (RTT) you are using end-to-endDefault TCP buffer sizes are typically much too small for

todayrsquos high speed networksndash Until recently default TCP sendreceive buffers were typically 64 KBndash Tuned buffer to fill CA to NY 1 Gbs path 10 MB

bull 150X bigger than the default buffer size

31

System software tuning Host tuning ndash TCPbull Historically TCP window size tuning parameters were host-

global with exceptions configured per-socket by applicationsndash How to tune is a function of the application and the path to the

destination so potentially a lot of special cases

Auto-tuning TCP connection buffer size within pre-configured limits helps

Auto-tuning however is not a panacea because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (eg international) paths

32

System software tuning Host tuning ndash TCP

Throughput out to ~9000 km on a 10Gbs network32 MBy (auto-tuned) vs 64 MBy (hand tuned) TCP window size

hand tuned to 64 MBy window

Roundtrip time ms (corresponds roughlyto San Francisco to London)

path length

10000900080007000600050004000300020001000

0

Thro

ughp

ut M

bs

auto tuned to 32 MBy window

33

42) System software tuning Data transfer toolsParallelism is key in data transfer tools

ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection

bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)

ndash Several tools offer parallel transfers (see below)

Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN

transfersndash Many tools and protocols assume latencies typical of a LAN

environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long

path networks

bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more

than about 500 Mbs

34

System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL

RTT = 53 ms network capacity = 10GbpsTool Throughput

bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology

bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase

bull this helps rsync too

35

System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-

performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open

ports) ssh etc The newer Globus Online incorporates all of these and small file

support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community

outside of HEP

36

System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach

ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node

ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and

httpmonalisacernchFDT

37

44) System software tuning Other issuesFirewalls are anathema to high-peed data flows

ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for

TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo

Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf

bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning

bull Defaults are usually fine for 1GE but 10GE often requires additional tuning

ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo

([HPBulk])

5) Site infrastructure to support data-intensive scienceThe Science DMZ

With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the

bottleneckThe site network (LAN) typically provides connectivity for local

resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network

and the local area site network is critical for large-scale data movement

Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks

for business and small data-flow purposes usually donrsquot work for large-scale data flows

bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data

flows

39

The Science DMZTo provide high data-rate access to local resources the site

LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high

speed data path all the way back to the source

40

The Science DMZThe ScienceDMZ concept

The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and

rapid fault isolation typically perfSONAR (see [perfSONAR] and below)

A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)

This is so important it was a requirement for last round of NSF CC-NIE grants

41

The Science DMZ

(See httpfasterdataesnetscience-dmz

and [SDMZ] for a much more complete

discussion of the various approaches)

campus siteLAN

high performanceData Transfer Node

computing cluster

cleanhigh-bandwidthWAN data path

campussiteaccess to

Science DMZresources is via the site firewall

secured campussiteaccess to Internet

border routerWAN

Science DMZrouterswitch

campus site

Science DMZ

Site DMZ WebDNS

Mail

network monitoring and testing

A WAN-capable device

per-servicesecurity policycontrol points

site firewall

dedicated systems built and

tuned for wide-area data transfer

42

6) Data movement and management techniquesAutomated data movement is critical for moving 500

terabytesday between 170 international sites In order to effectively move large amounts of data over the

network automated systems must be used to manage workflow and error recovery

bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers

bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)

43

Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the

analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates

compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day

bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10

petabytes of datayear in order to accomplish its science

bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial

44

DDMAgent

DDMAgent

ATLAS production

jobs

Regional production

jobs

User Group analysis jobs

Data Service

Task Buffer(job queue)

Job Dispatcher

PanDA Server(task management)

Job Broker

Policy(job type priority)

ATLA

S Ti

er 1

Data

Cen

ters

11 s

ites

scat

tere

d ac

ross

Euro

pe N

orth

Am

erica

and

Asia

in

aggr

egat

e ho

ld 1

copy

of a

ll dat

a an

d pr

ovide

the

work

ing

data

set f

or d

istrib

ution

to T

ier 2

cen

ters

for a

nalys

isDistributed

Data Manager

Pilot Job(Panda job

receiver running under the site-

specific job manager)

Grid Scheduler

Site Capability Service

CERNATLAS detector

Tier 0 Data Center(1 copy of all data ndash

archival only)

Job resource managerbull Dispatch a ldquopilotrdquo job manager - a

Panda job receiver - when resources are available at a site

bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA

bull Similar to the Condor Glide-in approach

Site status

ATLAS analysis sites(eg 70 Tier 2 Centers in

Europe North America and SE Asia)

DDMAgent

DDMAgent

1) Schedules jobs initiates data movement

2) DDM locates data and moves it to sites

This is a complex system in its own right called DQ2

3) Prepares the local resources to receive Panda jobs

4) Jobs are dispatched when there are resources available and when the required data is

in place at the site

Thanks to Michael Ernst US ATLAS technical lead for his assistance with this

diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)

The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday

CERN

Try to move the job to where the data is else move data and job to where

resources are available

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe N America and SE Asia generate network

data movement of 730 TByday ~68Gbs

Accumulated data volume on disk

730 TBytesday

PanDA manages 120000ndash140000 simultaneous jobs    (PanDA manages two types of jobs that are shown separately here)

It is this scale of data movementgoing on 24 hrday 9+ monthsyr

that networks must support in order to enable the large-scale science of the LHC

0

50

100

150

Peta

byte

s

four years

0

50000

100000

type

2jo

bs

0

50000

type

1jo

bs

one year

one year

one year

46

Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges"
– Successful testing was required for sites to participate in LHC production

47

Ramp-up of LHC traffic in ESnet

[Chart: ESnet traffic ramp-up over time, annotated with an estimate of "small" scale traffic, the LHC data system testing period, LHC turn-on, and LHC operation.]
The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.

48

6 cont.) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN
– The LHCOPN is a collection of leased 10 Gb/s optical circuits
– The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community
• The issues related to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance
– The security issues were the primary ones, and they were addressed by using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec])
– that is, only LHC data and compute servers are connected to the OPN

50

The LHC OPN – Optical Private Network

[Diagrams: the LHCOPN physical topology (abbreviated) and the LHCOPN architecture. CH-CERN connects to the Tier 1 centers: UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]

51

The LHC OPN – Optical Private Network
NB:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below)
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
– (there are about 170 Tier 2 sites)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose
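(For scale – simple arithmetic, not on the slide: with about 170 Tier 2 sites, the potential Tier 2 – Tier 2 mesh is 170 × 170 ≈ 29,000 site pairs, far too many to provision as individual circuits.)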

53

The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to the subnets that are used by LHC systems
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GEANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineers
– to ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC

54

[Map (April 2012): "LHCONE: A global infrastructure for the LHC Tier 1 data center – Tier 2 analysis center connectivity." The LHCONE VRF domains shown are ESnet (USA), Internet2 (USA), CANARIE (Canada), GÉANT (Europe), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), TWAREN and ASGC (Taiwan), KREONET2 and KISTI (Korea), TIFR (India), and CUDI (Mexico), interconnected at regional R&E communication nexuses (Seattle, Chicago, New York, Washington, Amsterdam, Geneva, etc.) by data communication links of 10, 20, and 30 Gb/s. End sites are LHC Tier 2 or Tier 3 centers unless indicated as Tier 1 (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1a/c, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1). See http://lhcone.net for details.]

55

The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance
• See lhcone.net

LHCONE is one part of the network infrastructure that supports the LHC

[Diagram: "A Network Centric View of the LHC." The detector (1 PB/s output) feeds the Level 1 and 2 triggers at O(1-10) meters, the Level 3 trigger at O(10-100) meters, and the CERN Computer Center at O(1) km. From CERN, 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) flows over the LHC Optical Private Network (LHCOPN) to the LHC Tier 1 Data Centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN) 500-10,000 km away, and from there over the LHC Open Network Environment (LHCONE) to the LHC Tier 2 analysis centers at the universities / physics groups. This is intended to indicate that the physics groups now get their data wherever it is most readily available.]

CERN → T1          miles     km
France               350     565
Italy                570     920
UK                   625    1000
Netherlands          625    1000
Germany              700    1185
Spain                850    1400
Nordic              1300    2100
USA – New York      3900    6300
USA – Chicago       4400    7100
Canada – BC         5200    8400
Taiwan              6100    9850
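These distances translate directly into round-trip times, which in turn drive the TCP buffer sizes and the error-free requirement discussed earlier in this talk. A rough, back-of-the-envelope sketch (my own illustration, assuming a propagation speed in fiber of about 2/3 c and ignoring routing detours and equipment delays):

```python
# Rough minimum round-trip time from fiber path length (illustrative only).
C_FIBER_KM_PER_MS = 200.0  # ~2/3 of the speed of light, expressed in km per millisecond

def min_rtt_ms(path_km: float) -> float:
    """Lower bound on RTT: the light has to go there and back in the fiber."""
    return 2.0 * path_km / C_FIBER_KM_PER_MS

for name, km in [("CERN-Netherlands", 1000), ("CERN-New York", 6300), ("CERN-Taiwan", 9850)]:
    print(f"{name}: >= {min_rtt_ms(km):.0f} ms")
# prints roughly 10 ms, 63 ms, and 98 ms; measured RTTs are larger because
# real paths are longer than great-circle distances
```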

57

7) New network services
Point-to-Point Virtual Circuit Service

Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed application systems in order to
– couple existing pockets of code, data, and expertise into "systems of systems"
– break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
– see https://www.es.net/about/science-requirements/
• A commonly identified need to support this is that networking must be provided as a "service":
– schedulable with guaranteed bandwidth – as is done with CPUs and disks
– traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
– some network path characteristics may also be specified – e.g. diversity
– available in the Web Services / Grid Services paradigm
(A sketch of what such a service request might look like follows below.)
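To make the "network as a service" idea concrete, here is a minimal sketch of the kind of request such a service accepts. This is a hypothetical data structure for illustration only, not the actual OSCARS or NSI interface, and all field names are assumptions:

```python
# Hypothetical bandwidth-reservation request (illustration only; not the OSCARS/NSI API).
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class CircuitRequest:
    src_endpoint: str          # ingress point, e.g. a site router port or VLAN
    dst_endpoint: str          # egress point at the far-end site
    bandwidth_mbps: int        # guaranteed bandwidth for the reservation
    start: datetime            # circuits are schedulable, like CPUs and disks
    end: datetime
    path_constraint: str = ""  # e.g. "diverse-from:circuit-1234"

req = CircuitRequest(
    src_endpoint="site-A:ge-1/0/0:vlan-3001",
    dst_endpoint="site-B:xe-2/1/0:vlan-3001",
    bandwidth_mbps=5000,
    start=datetime(2014, 4, 1, 2, 0),
    end=datetime(2014, 4, 1, 2, 0) + timedelta(hours=6),
)
print(f"Reserve {req.bandwidth_mbps} Mb/s, {req.start} -> {req.end}")
```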

58

Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
– This is typically done by using a "static" routing mechanism
• e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
– MPLS and OpenFlow are examples of this, and both can transport IP packets
– Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
– The virtual circuits can be directed to specific physical network paths when they are set up
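As a toy illustration of the "static switch tables set up in advance" idea (my own sketch, not any particular router's implementation), label-based forwarding amounts to a pre-provisioned lookup like this:

```python
# Toy illustration of static label switching, the idea behind MPLS-style virtual circuits.
# The "switch table" is provisioned in advance to define the circuit path.
FORWARDING_TABLE = {
    # (ingress port, incoming label) -> (egress port, outgoing label)
    ("port1", 100): ("port7", 210),
    ("port7", 210): ("port3", 340),
}

def forward(ingress_port: str, label: int):
    """Swap the label and send the packet out the pre-provisioned egress port."""
    egress_port, new_label = FORWARDING_TABLE[(ingress_port, label)]
    return egress_port, new_label

print(forward("port1", 100))  # ('port7', 210): the packet follows the circuit, not IP routing
```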

59

Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead: Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference, 2011, in the references
• OSCARS received a 2013 "R&D 100" award

60

End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part
• How are the circuits used?
– End system to end system, IP:
• Almost never – very hard unless private address space is used
– Using public address space can result in leaking routes
– Using private address space with multi-homed hosts risks allowing backdoors into secure networks
– End system to end system, Ethernet (or other) over VLAN – a pseudowire:
• Relatively common
• Interesting example: RDMA over VLAN is likely to be popular in the future
– The SC11 demo of 40G RDMA over the WAN was very successful
– The CPU load for RDMA is a small fraction of that for IP
– The guaranteed network characteristics of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fit nicely with circuit services (RDMA performs very poorly on best-effort networks)
– Point-to-point connection between routing instances – e.g. BGP at the end points:
• Essentially this is how all current circuits are used: from one site router to another site router
– Typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters

61

End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Service
• Large-scale science always involves institutions in multiple network domains (administrative units)
– For a circuit service to be useful, it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
– e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains

63

Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains

[Diagram: an end-to-end virtual circuit between a user source at DESY (AS1754) [Germany] and a user destination at FNAL (AS3152) [US], crossing DFN (AS680) [Germany], GEANT (AS20965) [Europe], and ESnet (AS293) [US]. Each domain runs a local inter-domain controller (IDC) – OSCARS in ESnet, AutoBAHN in GEANT – plus a data plane connection helper at each domain ingress/egress point; topology exchange and VC setup requests pass from domain to domain.]

1. The domains exchange topology information containing at least the potential VC ingress and egress points
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved
3. The data plane connection (e.g. an Ethernet VLAN-to-VLAN connection) is facilitated by a helper process
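As a minimal sketch of step 2 above – hypothetical code, not the real IDC/NSI protocol, with made-up domain objects – the setup request can be thought of as a chain of per-domain reservations that either all succeed or are rolled back:

```python
# Hypothetical sketch of a chained inter-domain VC setup (not the real IDC/NSI protocol).
class MockDomain:
    """Stand-in for a domain controller such as OSCARS or AutoBAHN (invented for this sketch)."""
    def __init__(self, name, capacity_mbps):
        self.name, self.capacity = name, capacity_mbps
    def reserve_segment(self, mbps):
        return {"domain": self.name, "mbps": mbps} if mbps <= self.capacity else None
    def release(self, segment):
        pass
    def connect_data_plane(self, segment):
        print(f"{self.name}: data plane connected for {segment['mbps']} Mb/s")

def setup_circuit(domains, bandwidth_mbps):
    """Reserve a segment in each domain along the path; roll back on any failure."""
    reserved = []
    for domain in domains:                       # the request passes from domain to domain
        segment = domain.reserve_segment(bandwidth_mbps)
        if segment is None:                      # no capacity or not authorized
            for dom, seg in reserved:            # release everything reserved so far
                dom.release(seg)
            return False
        reserved.append((domain, segment))
    for dom, seg in reserved:                    # commit: stitch the data plane end to end
        dom.connect_data_plane(seg)              # e.g. VLAN-to-VLAN at each ingress/egress
    return True

path = [MockDomain("DFN", 10000), MockDomain("GEANT", 40000), MockDomain("ESnet", 100000)]
print(setup_circuit(path, 5000))  # True: all three domains reserve and connect
```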

64

Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
– Testing is being coordinated in GLIF (the Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services – e.g. CPU and storage scheduling – in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system
• Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net

65

8) Provide R&D consulting and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
– With each generation of network transport technology:
• 155 Mb/s was the norm for high speed networks in 1995
• 100 Gb/s – 650 times greater – is the norm today
• R&D groups involving hardware engineers, computer scientists, and application specialists worked
• first to demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
• and then to do the development necessary for applications to make use of the new capabilities
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s

66

Provide R&D consulting and knowledge base
• Providing consulting on the problems that data-intensive projects are having in effectively using the network is critical
• Using the knowledge gained from that problem solving to build a community knowledge base benefits everyone
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations

67

The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning (see the short example below)
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
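As a flavor of what the Host Tuning / Linux TCP Tuning material covers, the sketch below computes the bandwidth-delay product that the TCP buffers must accommodate and reads the current Linux buffer limits. The sysctl names are the standard Linux ones; the recommended target values differ by path and should be taken from fasterdata.es.net itself.

```python
# Bandwidth-delay product: the TCP window needed to keep a long path full.
from pathlib import Path

def bdp_bytes(gbits_per_s: float, rtt_ms: float) -> int:
    """Buffer needed to fill a path: bandwidth x round-trip time."""
    return int(gbits_per_s * 1e9 / 8 * rtt_ms / 1e3)

print(bdp_bytes(10, 90) / 1e6, "MB needed for 10 Gb/s at 90 ms RTT")  # ~112.5 MB

# Current Linux limits on TCP socket buffers (read-only here; tuning means raising
# these via sysctl so the autotuned window can grow large enough for the path).
for knob in ("net.core.rmem_max", "net.core.wmem_max",
             "net.ipv4.tcp_rmem", "net.ipv4.tcp_wmem"):
    path = Path("/proc/sys") / knob.replace(".", "/")
    if path.exists():
        print(knob, "=", path.read_text().strip())
```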

68

The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

69

Infrastructure Critical to Science
• The combination of
– new network architectures in the wide area
– new network services (such as guaranteed bandwidth virtual circuits)
– cross-domain network error detection and correction
– redesigning the site LAN to handle high data throughput
– automation of data movement systems
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to that of the LHC: a large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKA
The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
– A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository:
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
• high speed WAN connections to accept all data from the telescope site and then send all of that data out to the science sites
weigh against a single large data center

72

LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites

73

LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply:
There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
• It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
– In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE

74

LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
Re-engineering the site LAN/WAN architecture is critical: the Science DMZ
Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on

75

The Message
Again … A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet, Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010.
(may be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates

compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day

bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10

petabytes of datayear in order to accomplish its science

bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial

44

DDMAgent

DDMAgent

ATLAS production

jobs

Regional production

jobs

User Group analysis jobs

Data Service

Task Buffer(job queue)

Job Dispatcher

PanDA Server(task management)

Job Broker

Policy(job type priority)

ATLA

S Ti

er 1

Data

Cen

ters

11 s

ites

scat

tere

d ac

ross

Euro

pe N

orth

Am

erica

and

Asia

in

aggr

egat

e ho

ld 1

copy

of a

ll dat

a an

d pr

ovide

the

work

ing

data

set f

or d

istrib

ution

to T

ier 2

cen

ters

for a

nalys

isDistributed

Data Manager

Pilot Job(Panda job

receiver running under the site-

specific job manager)

Grid Scheduler

Site Capability Service

CERNATLAS detector

Tier 0 Data Center(1 copy of all data ndash

archival only)

Job resource managerbull Dispatch a ldquopilotrdquo job manager - a

Panda job receiver - when resources are available at a site

bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA

bull Similar to the Condor Glide-in approach

Site status

ATLAS analysis sites(eg 70 Tier 2 Centers in

Europe North America and SE Asia)

DDMAgent

DDMAgent

1) Schedules jobs initiates data movement

2) DDM locates data and moves it to sites

This is a complex system in its own right called DQ2

3) Prepares the local resources to receive Panda jobs

4) Jobs are dispatched when there are resources available and when the required data is

in place at the site

Thanks to Michael Ernst US ATLAS technical lead for his assistance with this

diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)

The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday

CERN

Try to move the job to where the data is else move data and job to where

resources are available

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe N America and SE Asia generate network

data movement of 730 TByday ~68Gbs

Accumulated data volume on disk

730 TBytesday

PanDA manages 120000ndash140000 simultaneous jobs    (PanDA manages two types of jobs that are shown separately here)

It is this scale of data movementgoing on 24 hrday 9+ monthsyr

that networks must support in order to enable the large-scale science of the LHC

0

50

100

150

Peta

byte

s

four years

0

50000

100000

type

2jo

bs

0

50000

type

1jo

bs

one year

one year

one year

46

Building an LHC-scale production analysis system In order to debug and optimize the distributed system that

accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in

ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC

production

47

Ramp-up of LHC traffic in ESnet

(est of ldquosmallrdquo scale traffic)

LHC

turn

-on

LHC data systemtesting

LHC operationThe transition from testing to operation

was a smooth continuum due toat-scale testing ndash a process that took

more than 5 years

48

6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument

to data centers ndash a dedicated purpose-built infrastructure is needed

bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to

the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the

Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward

exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community

bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by

bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])

ndash that is only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet:
  – This is typically done by using a "static" routing mechanism
    • e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
  – MPLS and OpenFlow are examples of this, and both can transport IP packets
  – Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic:
  – The virtual circuits can be directed to specific physical network paths when they are set up

59

Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service" in TERENA Networking Conference 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.

60

End User View of Circuits – How They Use Them
• Who are the "users"?
  – Sites, for the most part
• How are the circuits used?
  – End system to end system, IP:
    • Almost never – very hard unless private address space is used
      – Using public address space can result in leaking routes
      – Using private address space with multi-homed hosts risks allowing backdoors into secure networks
  – End system to end system, Ethernet (or other) over VLAN – a pseudowire:
    • Relatively common
    • Interesting example: RDMA over VLAN, likely to be popular in the future
      – SC11 demo of 40G RDMA over WAN was very successful
      – CPU load for RDMA is a small fraction of that of IP
      – The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks)
  – Point-to-point connection between routing instances – e.g. BGP at the end points:
    • Essentially this is how all current circuits are used, from one site router to another site router
    • Typically site-to-site, or to advertise subnets that host clusters, e.g. LHC analysis or data management clusters

61

End User View of Circuits – How They Use Them
• When are the circuits used?
  – Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Service
• Large-scale science always involves institutions in multiple network domains (administrative units):
  – For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
  – e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains

63

Inter-Domain Control Protocol
• There are two realms involved:
  1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
  2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.

[Figure: An end-to-end virtual circuit from a user source at FNAL (AS3152) [US] through ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local InterDomain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GÉANT – and the controllers exchange topology information and pass the VC setup request from domain to domain, with a data plane connection helper at each domain ingress/egress point.]

1. The domains exchange topology information containing at least potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process.
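The setup sequence above can be pictured with a toy model: each domain controller reserves its own segment and forwards the request to the next domain until the far end is reached. The class and the chain below are illustrative only – the real IDC/NSI protocol carries much more state (topology, authorization, scheduling).

    # Toy model of the inter-domain circuit setup chain described above.
    # Domain names follow the figure; the "reserve" logic is illustrative,
    # not the IDC/NSI protocol.
    class DomainController:
        def __init__(self, name, next_domain=None):
            self.name = name
            self.next_domain = next_domain      # next controller along the path
            self.reservations = []

        def setup(self, circuit_id, bandwidth_mbps):
            # 1) authorize and reserve the segment within this domain
            self.reservations.append((circuit_id, bandwidth_mbps))
            print(f"{self.name}: segment reserved for {circuit_id} at {bandwidth_mbps} Mb/s")
            # 2) pass the request to the next domain until the far end is reached
            if self.next_domain:
                self.next_domain.setup(circuit_id, bandwidth_mbps)
            else:
                print(f"{self.name}: far end reached - end-to-end circuit {circuit_id} is up")

    # Example chain matching the figure: FNAL -> ESnet -> GEANT -> DFN -> DESY
    desy  = DomainController("DESY (AS1754)")
    dfn   = DomainController("DFN (AS680)", desy)
    geant = DomainController("GEANT (AS20965)", dfn)
    esnet = DomainController("ESnet (AS293)", geant)
    fnal  = DomainController("FNAL (AS3152)", esnet)
    fnal.setup("vc-001", 10000)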

64

Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group:
  – Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system.
Multi-domain circuit setup is not yet a robust production service, but progress is being made.
• See lhcone.net

65

8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science:
  – With each generation of network transport technology:
    • 155 Mb/s was the norm for high speed networks in 1995
    • 100 Gb/s – 650 times greater – is the norm today
  – R&D groups involving hardware engineers, computer scientists, and application specialists worked to:
    • first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
    • and then do the development necessary for applications to make use of the new capabilities
  – Examples of how this methodology drove toward today's capabilities include:
    • experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
    • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s

66

Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.

67

The knowledge base: http://fasterdata.es.net topics:
  – Network Architecture, including the Science DMZ model
  – Host Tuning
  – Network Tuning
  – Data Transfer Tools
  – Network Performance Testing
  – With special sections on:
    • Linux TCP Tuning
    • Cisco 6509 Tuning
    • perfSONAR Howto
    • Active perfSONAR Services
    • Globus overview
    • Say No to SCP
    • Data Transfer Nodes (DTN)
    • TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations

68

The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

69

Infrastructure Critical to Science
• The combination of:
  – New network architectures in the wide area
  – New network services (such as guaranteed bandwidth virtual circuits)
  – Cross-domain network error detection and correction
  – Redesigning the site LAN to handle high data throughput
  – Automation of data movement systems
  – Use of appropriate operating system tuning and data transfer tools
  now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.

70

LHC lessons of possible use to the SKA: The similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated/sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.

71

LHC lessons of possible use to the SKA: The lessons
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location:
  – A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
  – The technical aspects of building and operating a centralized working data repository:
    • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time, and
    • high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites,
  militate against a single large data center.

72

LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
  – It decentralizes costs and involves many countries directly in the telescope infrastructure.
  – It divides up the network load, especially on the expensive trans-ocean links.
  – It divides up the cache I/O load across distributed sites.

73

LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply:
  – There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
    • It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
    • In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
  – If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
    • In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.

74

LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
  – New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
Re-engineering the site LAN-WAN architecture is critical: the Science DMZ.
Workflow management systems that automate the data movement will have to be designed and tested:
  – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.

75

The Message
Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach", Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer", Brian Tierney and Joe Metzger, ESnet, Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"

[NetServ] "Network Services for High Performance Distributed Computing and Data Management", W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System", Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management", W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service", William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework", B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net
http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing", Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C., Ciena Corp., Ottawa, ON, Canada. Communications Magazine, IEEE, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOE's Office of Science
  • DOE Office of Science and ESnet – the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 4.1) System software tuning: Host tuning – TCP
  • System software tuning: Host tuning – TCP
  • System software tuning: Host tuning – TCP
  • 4.2) System software tuning: Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 4.4) System software tuning: Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN – Optical Private Network
  • The LHC OPN – Optical Private Network (2)
  • The LHC OPN – Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHC's Open Network Environment – LHCONE
  • Slide 54
  • The LHC's Open Network Environment – LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits – How They Use Them
  • End User View of Circuits – How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide R&D consulting and knowledge base
  • Provide R&D consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 19: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

19

1b) Network routers and switches
ESnet5 routing (IP layer 3) is provided by Alcatel-Lucent 7750 routers with 100 Gb/s client interfaces:
  – 17 routers with 100G interfaces
    • several more in a test environment
  – 59 layer-3 100GigE interfaces; 8 customer-owned 100G routers
  – 7 100G interconnects with other R&E networks at Starlight (Chicago), MAN LAN (New York), and Sunnyvale (San Francisco)

20

The Energy Sciences Network ESnet5 (Fall 2013)

[Map (geographical representation is approximate): the ESnet5 100G backbone connecting hubs such as Seattle, Sunnyvale, Sacramento, Boise, Salt Lake City, Las Vegas, Albuquerque, El Paso, Denver, Kansas City, Houston, Nashville, Chicago, Starlight, Atlanta, Washington, New York, AOFA, and Boston, with DOE laboratory and user facility sites attached at 1, 10-40, and 100 Gb/s (e.g. LBNL, SLAC, NERSC, JGI, LLNL, SNLL, GA, SDSC, LANL, SNLA, INL, NREL, PNNL, AMES, ANL, FNAL, ORNL, SREL, JLAB, PPPL, PU Physics, MIT/PSFC, LIGO, BNL), plus metro area circuits, site-provided circuits, commercial peerings, US and international R&E peerings, and the SUNN-STAR-AOFA-AMST 100G testbed (SF Bay Area, Chicago, New York, Amsterdam). Legend: ESnet routers, site routers, optical-only nodes, 100G / 10-40G / 1G links.]

2) Data transport: The limitations of TCP must be addressed for large, long-distance flows

Although there are other transport protocols available, TCP remains the workhorse of the Internet, including for data-intensive science.
Using TCP to support the sustained, long distance, high data-rate flows of data-intensive science requires an error-free network.

Why error-free? TCP is a "fragile workhorse": it is very sensitive to packet loss (due to bit errors).
  – Very small packet loss rates on these paths result in large decreases in performance.
  – A single bit error will cause the loss of a 1-9 KBy packet (depending on the MTU size), as there is no FEC at the IP level for error correction.
    • This puts TCP back into "slow start" mode, thus reducing throughput.

22

Transport
• The reason for TCP's sensitivity to packet loss is the slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internet:
  – Packet loss is seen by TCP's congestion control algorithms as evidence of congestion, so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion, leading to network throughput collapse).
  – Network link errors also cause packet loss, so these congestion avoidance algorithms come into play, with dramatic effect on throughput in the wide area network – hence the need for "error-free".

23

Transport: Impact of packet loss on TCP
On a 10 Gb/s LAN path the impact of low packet loss rates is minimal.
On a 10 Gb/s WAN path the impact of low packet loss rates is enormous (~80X throughput reduction on a transatlantic path).

Implications: error-free paths are essential for high-volume, long-distance data transfers.

[Chart: Throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss, for Reno (measured), Reno (theory), H-TCP (measured), and no packet loss; throughput in Mb/s (0-10,000) vs. network round trip time in ms (corresponds roughly to San Francisco to London). See http://fasterdata.es.net/performance-testing/perfsonar/troubleshooting/packet-loss]
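A rough sense of why such small loss rates matter comes from the Mathis et al. estimate for loss-limited, Reno-style TCP throughput, rate ≈ (MSS/RTT) · C/√p. The sketch below evaluates it for the loss rate in the chart at a few RTTs; it is a back-of-the-envelope model, not a substitute for the measured curves.

    # Back-of-the-envelope estimate of loss-limited TCP throughput using the
    # Mathis et al. model:  rate ~ (MSS / RTT) * C / sqrt(p).
    # A rough Reno-style model, shown only to illustrate why tiny loss rates
    # matter so much on long-RTT paths.
    from math import sqrt

    def mathis_rate_mbps(mss_bytes=1460, rtt_ms=88.0, loss_rate=4.6e-5, c=1.0):
        rtt_s = rtt_ms / 1000.0
        rate_bps = (mss_bytes * 8 / rtt_s) * c / sqrt(loss_rate)
        return rate_bps / 1e6

    for rtt in (1, 10, 50, 100):
        print(f"RTT {rtt:>3} ms, loss 0.0046%: ~{mathis_rate_mbps(rtt_ms=rtt):8.1f} Mb/s")
    # The same loss rate that is harmless on a 1 ms LAN path limits a
    # ~100 ms intercontinental path to a tiny fraction of 10 Gb/s.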

24

Transport: Modern TCP stack
• A modern TCP stack (the kernel implementation of the TCP protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk]):
  – This is done using mechanisms that more quickly increase back to full speed after an error forces a reset to low bandwidth.

[Chart: "Binary Increase Congestion" (BIC) control algorithm impact. TCP results for Linux 2.6 BIC TCP, Linux 2.4, and Linux 2.6 with BIC off, in Mbits/second over 5-second time slots; RTT = 67 ms. Note that BIC reaches max throughput much faster than older algorithms (from Linux 2.6.19 the default is CUBIC, a refined version of BIC designed for high bandwidth, long paths).]
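On Linux hosts, the congestion control algorithm in use can be checked through /proc (these paths are standard on modern kernels); the sketch below simply reports the current and available algorithms. It is an illustration of where the setting lives, not a tuning recommendation.

    # Read the Linux TCP congestion control settings mentioned above.
    # The /proc paths are Linux-specific; on other OSes this just reports
    # that the files are not present.
    from pathlib import Path

    def read_proc(name):
        p = Path("/proc/sys/net/ipv4") / name
        return p.read_text().strip() if p.exists() else "(not available)"

    print("current congestion control :", read_proc("tcp_congestion_control"))
    print("available algorithms       :", read_proc("tcp_available_congestion_control"))
    # Switching (as root) would be e.g.:  sysctl -w net.ipv4.tcp_congestion_control=htcp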

25

Transport: Modern TCP stack
Even modern TCP stacks are only of some help in the face of packet loss on a long-path, high-speed network.
• For a detailed analysis of the impact of packet loss on various TCP implementations, see "An Investigation into Transport Protocols and Data Transport Applications Over High Performance Networks", chapter 8 ("Systematic Tests of New-TCP Behaviour"), by Yee-Ting Li, University College London (PhD thesis): http://www.slac.stanford.edu/~ytl/thesis.pdf

[Chart (tail zoom): Throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss, 0-1000 Mb/s, for Reno (measured), Reno (theory), and H-TCP (CUBIC refinement, measured); round trip time in ms (corresponds roughly to San Francisco to London).]

26

3) Monitoring and testing
The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction.

perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.).
• perfSONAR is a community effort to:
  – define network management data exchange protocols, and
  – standardize measurement data formats, gathering, and archiving.
perfSONAR is deployed extensively throughout LHC-related networks and international networks, and at the end sites (see [fasterdata], [perfSONAR], and [NetServ]).
  – There are now more than 1000 perfSONAR boxes installed in N. America and Europe.
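For illustration only, the sketch below is a toy memory-to-memory throughput probe between two hosts – the kind of active end-to-end test that perfSONAR toolkits schedule, run, and archive properly. It is not a perfSONAR component; the port number and transfer size are arbitrary assumptions.

    # Toy memory-to-memory throughput probe (illustrative only; use the real
    # perfSONAR toolkit for production monitoring).
    import socket, sys, time

    PORT = 5201          # arbitrary test port (assumption)
    CHUNK = 1 << 20      # 1 MiB send/receive buffer

    def server():
        with socket.create_server(("", PORT)) as srv:
            conn, addr = srv.accept()
            with conn:
                total = 0
                while True:
                    data = conn.recv(CHUNK)
                    if not data:
                        break
                    total += len(data)
                print(f"received {total/1e6:.1f} MB from {addr[0]}")

    def client(host, mib=500):
        payload = b"\0" * CHUNK
        start = time.time()
        with socket.create_connection((host, PORT)) as s:
            for _ in range(mib):
                s.sendall(payload)
        secs = time.time() - start
        bits = mib * CHUNK * 8
        print(f"sent {mib} MiB in {secs:.1f} s  (~{bits/secs/1e9:.2f} Gb/s)")

    if __name__ == "__main__":
        if len(sys.argv) > 1 and sys.argv[1] == "server":
            server()
        elif len(sys.argv) > 1:
            client(sys.argv[1])
        else:
            print("usage: probe.py server | probe.py <server-host>")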

27

perfSONAR
The test and monitor functions can detect soft errors that limit throughput and can be hard to find (hard errors/faults are easily found and corrected).

Soft failure example:
• Observed end-to-end performance degradation due to soft failure of a single optical line card.
[Chart: roughly one month of Gb/s measurements showing normal performance, then degrading performance, then recovery after the repair.]

• Why not just rely on "SNMP" interface stats for this sort of error detection?
  • not all error conditions show up in SNMP interface statistics
  • SNMP error statistics can be very noisy
  • some devices lump different error counters into the same bucket, so it can be very challenging to figure out what errors to alarm on and what errors to ignore
    • though ESnet's Spectrum monitoring system attempts to apply heuristics to do this
  • many routers will silently drop packets – the only way to find that is to test through them and observe loss using devices other than the culprit device

28

perfSONAR
The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains:
  – It provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe.
  – ESnet has perfSONAR testers installed at every PoP and all but the smallest user sites – Internet2 is close to the same.
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages.

29

4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network:
• Host TCP tuning
  • Modern TCP stack (see above)
  • Other issues (MTU, etc.)
• Data transfer tools and parallelism
• Other data transfer issues (firewalls, etc.)

30

4.1) System software tuning: Host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length.
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using end-to-end.
Default TCP buffer sizes are typically much too small for today's high speed networks:
  – Until recently, default TCP send/receive buffers were typically 64 KB.
  – Tuned buffer to fill a CA to NY 1 Gb/s path: 10 MB (see the buffer arithmetic sketched below)
    • 150X bigger than the default buffer size.
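The 10 MB figure is just the bandwidth-delay product (BDP) of the path – buffer ≈ bandwidth × RTT. A minimal sketch of the arithmetic (the ~80 ms RTT is an assumed value for a CA-NY path):

    # Bandwidth-delay product: the TCP buffer needed to keep a path full.
    def bdp_bytes(bandwidth_gbps, rtt_ms):
        return bandwidth_gbps * 1e9 / 8 * (rtt_ms / 1000.0)

    # ~CA to NY at 1 Gb/s (the example above): about 10 MB
    print(f"1 Gb/s, 80 ms RTT : {bdp_bytes(1, 80)/1e6:5.1f} MB")
    # the same RTT at 10 Gb/s needs ~100 MB of buffer
    print(f"10 Gb/s, 80 ms RTT: {bdp_bytes(10, 80)/1e6:5.1f} MB")
    # Linux socket-buffer ceilings are controlled by sysctls such as
    # net.core.rmem_max / net.core.wmem_max and net.ipv4.tcp_rmem / tcp_wmem;
    # see fasterdata.es.net for current recommended values.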

31

System software tuning: Host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications:
  – How to tune is a function of the application and the path to the destination, so potentially a lot of special cases.
Auto-tuning TCP connection buffer size within pre-configured limits helps.
Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths.
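On Linux, those auto-tuning ceilings live in net.ipv4.tcp_rmem / tcp_wmem (min, default, max, in bytes). The sketch below only reads them; recommended values for long, fast paths are maintained on fasterdata.es.net.

    # Inspect the Linux receive/send buffer auto-tuning limits discussed above.
    # Each file holds "min default max" in bytes; the max is what bounds
    # auto-tuning on long, fast paths. (Linux-specific paths.)
    from pathlib import Path

    for name in ("tcp_rmem", "tcp_wmem"):
        p = Path("/proc/sys/net/ipv4") / name
        if p.exists():
            mn, default, mx = p.read_text().split()
            print(f"{name}: min={mn} default={default} max={mx} bytes")
        else:
            print(f"{name}: not available on this OS")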

32

System software tuning: Host tuning – TCP

[Chart: Throughput out to ~9000 km on a 10 Gb/s network, 32 MBy (auto-tuned) vs. 64 MBy (hand-tuned) TCP window size; throughput in Mb/s (0-10,000) vs. round trip time in ms (corresponds roughly to San Francisco to London). The hand-tuned 64 MBy window sustains higher throughput than the 32 MBy auto-tuned window as the path length increases.]

33

4.2) System software tuning: Data transfer tools
Parallelism is key in data transfer tools:
  – It is much easier to achieve a given performance level with multiple parallel connections than with one connection:
    • this is because the OS is very good at managing multiple threads and less good at sustained, maximum performance of a single thread (the same is true for disks).
  – Several tools offer parallel transfers (see below).
Latency tolerance is critical:
  – Wide area data transfers have much higher latency than LAN transfers.
  – Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds):
    • examples: SCP/SFTP and HPSS mover protocols work very poorly in long path networks.
• Disk performance:
  – In general, a RAID array or parallel disks (as in FDT) are needed to get more than about 500 Mb/s.
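A minimal local illustration of the parallelism point: several workers each copy one slice of a file, which is the same idea GridFTP and FDT apply to parallel TCP streams and parallel disks. This is a sketch, not a transfer tool; the file names are placeholders.

    # Minimal illustration of parallel, range-based copying: each worker copies
    # one slice of the source file. Real tools (e.g. GridFTP, FDT) apply the
    # same idea to parallel TCP streams and parallel disks.
    import os
    from concurrent.futures import ThreadPoolExecutor

    def copy_range(src, dst, offset, length, chunk=4 << 20):
        with open(src, "rb") as fin, open(dst, "r+b") as fout:
            fin.seek(offset)
            fout.seek(offset)
            remaining = length
            while remaining > 0:
                data = fin.read(min(chunk, remaining))
                if not data:
                    break
                fout.write(data)
                remaining -= len(data)

    def parallel_copy(src, dst, streams=4):
        size = os.path.getsize(src)
        with open(dst, "wb") as f:
            f.truncate(size)          # pre-allocate so workers can seek independently
        step = (size + streams - 1) // streams
        with ThreadPoolExecutor(max_workers=streams) as pool:
            futures = [pool.submit(copy_range, src, dst, i * step,
                                   min(step, size - i * step))
                       for i in range(streams)]
            for f in futures:
                f.result()            # propagate any worker errors

    # Example (placeholder file names):
    # parallel_copy("input.dat", "output.dat", streams=8)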

34

System software tuning: Data transfer tools
Using the right tool is very important.

Sample results, Berkeley, CA to Argonne, IL (RTT = 53 ms, network capacity = 10 Gb/s):

Tool                  Throughput
scp                   140 Mb/s
patched scp (HPN)     1.2 Gb/s
ftp                   1.4 Gb/s
GridFTP, 4 streams    5.4 Gb/s
GridFTP, 8 streams    6.6 Gb/s

Note that to get more than about 1 Gb/s (125 MB/s) disk to disk requires using RAID technology.

• PSC (Pittsburgh Supercomputing Center) has a patch set that fixes problems with SSH:
  – http://www.psc.edu/networking/projects/hpn-ssh
  – Significant performance increase
    • this helps rsync too

35

System software tuning: Data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems:
  – Parallel streams, buffer tuning, help in getting through firewalls (open ports), ssh, etc.
  – The newer Globus Online incorporates all of these and adds small file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP.

36

System software tuning: Data transfer tools
Also see Caltech's FDT (Faster Data Transfer) approach:
  – Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node.
  – Explicit parallel use of multiple disks.
  – Can fill 100 Gb/s paths.
  – See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT

37

4.4) System software tuning: Other issues
Firewalls are anathema to high-speed data flows:
  – many firewalls can't handle >1 Gb/s flows
    • designed for large numbers of low bandwidth flows
    • some firewalls even strip out TCP options that allow for TCP buffers > 64 KB
  – See Jason Zurawski's "Say Hello to your Frienemy – The Firewall"
  – Stateful firewalls have inherent problems that inhibit high throughput
    • http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues:
  – Large MTUs (several issues)
  – NIC tuning
    • Defaults are usually fine for 1GE, but 10GE often requires additional tuning
  – Other OS tuning knobs
  – See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])

5) Site infrastructure to support data-intensive science: The Science DMZ

With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck.
The site network (LAN) typically provides connectivity for local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science:
  – Therefore a high performance interface between the wide area network and the local area site network is critical for large-scale data movement.
Campus network infrastructure is typically not designed to handle the flows of large-scale science:
  – The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows:
    • firewalls, proxy servers, low-cost switches, and so forth,
    • none of which will allow high volume, high bandwidth, long distance data flows.

39

The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large data volume, high round trip time (RTT) (international paths) of the wide area network (WAN) flows (see [DIS]):
  – otherwise the site will impose poor performance on the entire high speed data path, all the way back to the source.

40

The Science DMZ
The Science DMZ concept: the compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy:
  – Outside the site firewall – hence the term "Science DMZ".
  – With dedicated systems built and tuned for wide-area data transfer.
  – With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below).
  – A security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.).
This is so important it was a requirement for the last round of NSF CC-NIE grants.

41

The Science DMZ

(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)

[Figure: Science DMZ architecture. The WAN connects to the site border router; a clean, high-bandwidth WAN data path runs to the Science DMZ router/switch (a WAN-capable device), which serves dedicated systems built and tuned for wide-area data transfer (a high performance Data Transfer Node), network monitoring and testing, and per-service security policy control points. Campus/site access to Science DMZ resources is via the site firewall, while secured campus/site access to the Internet and the site DMZ (Web, DNS, Mail) also passes through the site firewall to the campus/site LAN and computing cluster.]

42

6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers.
• The Tier 2 sites get a comparable amount of data from the Tier 1s:
  – Host the physics groups that analyze the data and do the science
  – Provide most of the compute resources for analysis
  – Cache the data (though this is evolving to remote I/O)

43

Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management:
  – The resources and data movement are centrally managed.
  – Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations.
  – The system manages tens of thousands of jobs a day:
    • coordinates data movement of hundreds of terabytes/day, and
    • manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science.
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial. (A simplified sketch of the dispatch logic follows; the full system is shown in the next figure.)
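The sketch below caricatures the dispatch logic described above (and shown in the figure on the next slide): a job is sent to a site that already has the required dataset and free slots, otherwise the data is replicated to a free site first. Site names, capacities, and dataset names are invented for illustration; this is not PanDA code.

    # Highly simplified sketch of the pilot-job / brokerage pattern used by
    # systems like PanDA: dispatch a queued job only to a site that has free
    # slots and already holds (or has been sent) the required dataset.
    from collections import deque

    sites = {
        "site-A": {"free_slots": 2, "datasets": {"dataset-42"}},
        "site-B": {"free_slots": 1, "datasets": set()},
    }
    job_queue = deque([
        {"id": 1, "dataset": "dataset-42"},
        {"id": 2, "dataset": "dataset-42"},
        {"id": 3, "dataset": "dataset-7"},
    ])

    def replicate(dataset, site):
        print(f"  data manager: replicating {dataset} -> {site}")
        sites[site]["datasets"].add(dataset)

    while job_queue:
        job = job_queue.popleft()
        # prefer a site that already has the data; otherwise move data to a free site
        target = next((s for s, st in sites.items()
                       if st["free_slots"] and job["dataset"] in st["datasets"]), None)
        if target is None:
            target = next((s for s, st in sites.items() if st["free_slots"]), None)
            if target is None:
                print(f"job {job['id']}: no free resources, requeue later")
                break
            replicate(job["dataset"], target)
        sites[target]["free_slots"] -= 1
        print(f"job {job['id']}: dispatched to pilot at {target}")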

44

[Figure: The ATLAS PanDA "Production and Distributed Analysis" system, which uses distributed resources and layers of automation to manage several million jobs/day. The PanDA Server (task management) accepts ATLAS production jobs, regional production jobs, and user/group analysis jobs into a Task Buffer (job queue), supported by a Job Broker, Policy (job type, priority), Data Service, and Job Dispatcher. A Distributed Data Manager (a complex system in its own right, called DQ2) and DDM agents handle data placement. Pilot jobs (PanDA job receivers running under the site-specific job manager) run at the sites, with a Grid Scheduler and a Site Capability Service / site status feeding the dispatcher. The CERN ATLAS detector and Tier 0 Data Center hold 1 copy of all data (archival only). The ATLAS Tier 1 Data Centers (11 sites scattered across Europe, North America, and Asia) in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis; the ATLAS analysis sites (e.g. 70 Tier 2 centers in Europe, North America, and SE Asia) run the analysis.

Job resource manager: dispatch a "pilot" job manager – a PanDA job receiver – when resources are available at a site; pilots run under the local site job manager (e.g. Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA – similar to the Condor Glide-in approach.

Workflow: 1) PanDA schedules jobs and initiates data movement; 2) DDM locates data and moves it to sites; 3) the pilot prepares the local resources to receive PanDA jobs; 4) jobs are dispatched when there are resources available and when the required data is in place at the site. The general strategy: try to move the job to where the data is, else move data and job to where resources are available.]

Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s.

[Charts, covering four years of operation: accumulated data volume on disk (0-150 petabytes); the two types of jobs PanDA manages, shown separately (tens of thousands of simultaneous jobs of each type); and daily data movement at the 730 TBytes/day level, with one-year spans marked. PanDA manages 120,000-140,000 simultaneous jobs.]

It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
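A quick arithmetic check of the quoted rate (values from the slide):

    # 730 TB/day of analysis-driven data movement corresponds to roughly
    # 68 Gb/s sustained.
    tb_per_day = 730
    bits = tb_per_day * 1e12 * 8
    print(f"{bits / 86400 / 1e9:.1f} Gb/s")   # ~67.6 Gb/s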

46

Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure:
  – Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges".
  – Successful testing was required for sites to participate in LHC production.

47

Ramp-up of LHC traffic in ESnet

[Chart: growth of LHC traffic in ESnet (estimate of "small" scale traffic), with markers for LHC turn-on, LHC data system testing, and LHC operation. The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.]

48

6 cont) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called the LHCOPN:
  – The LHCOPN is a collection of leased 10 Gb/s optical circuits.
  – The role of the LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
    • In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.

49

The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.
• The issues related to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance. The security issues were the primary ones, and were addressed by:
  • using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec])
    – that is, only LHC data and compute servers are connected to the OPN.

50

The LHC OPN – Optical Private Network

[Figure: LHCOPN physical topology (abbreviated) and LHCOPN architecture – CH-CERN at the center, connected to the Tier 1 centers: UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]

51

The LHC OPN – Optical Private Network
N.B.:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays:
  – The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
  – However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic:
  – In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic.
  – (There are about 170 Tier 2 sites.)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.

53

The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems:
  – The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.).
  – The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).
In this way the LHC traffic will use circuits designated by the network engineers:
  – To ensure continued good performance for the LHC and to ensure that other traffic is not impacted – this is critical, because apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.

54

LHCONE: A global infrastructure for the LHC Tier 1 data center – Tier 2 analysis center connectivity (April 2012)

[Map: LHCONE VRF domains and their end sites (LHC Tier 2 or Tier 3 unless indicated as Tier 1), interconnected at regional R&E communication nexus points (e.g. Seattle, Chicago, New York, Washington, Amsterdam, Geneva) by data communication links of 10, 20, and 30 Gb/s. Domains and sites include ESnet USA (BNL-T1, FNAL-T1, SLAC), Internet2 USA (Harvard, MIT, Caltech, UFlorida, UNeb, PurU, UCSD, UWisc, UltraLight, UMich, and regional aggregations such as GLakes, NE, MidW, SoW), CANARIE Canada (TRIUMF-T1, UVic, SimFraU, UAlb, UTor, McGilU), NORDUnet Nordic (NDGF-T1a, NDGF-T1c), DFN Germany (DESY, GSI, DE-KIT-T1), GARR Italy (INFN-Nap, CNAF-T1), RedIRIS Spain (PIC-T1), SARA Netherlands (NIKHEF-T1), RENATER France (GRIF-IN2P3, CC-IN2P3-T1, Sub-IN2P3, CEA), CERN Geneva (CERN-T1), GÉANT Europe, ASGC Taiwan (ASGC-T1, NCU, NTU), TWAREN Taiwan, KREONET2 Korea (KNU, KISTI), TIFR India, and CUDI Mexico (UNAM). See http://lhcone.net for details.]

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKA
The lessons
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
– A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
– The technical aspects of building and operating a centralized working data repository:
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
• high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
militate against a single large data center (see the sizing sketch below).
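To make the WAN and cache I/O point concrete, here is a back-of-the-envelope sketch of the sustained load a single centralized working repository would carry if it ingests the full instrument output and re-exports essentially all of it to analysis sites. The 100 Gb/s rate and the site count are assumptions chosen only for illustration, not SKA design figures.

    # Rough sizing sketch; all input numbers are illustrative assumptions.
    def central_repository_load(instrument_gbps: float, n_analysis_sites: int,
                                fraction_reexported: float = 1.0) -> dict:
        """Sustained WAN and disk-cache load on a single central repository."""
        ingest = instrument_gbps                        # telescope -> center
        export = instrument_gbps * fraction_reexported  # center -> analysis sites
        return {
            "wan_in_gbps": ingest,
            "wan_out_gbps": export,
            "wan_total_gbps": ingest + export,          # sustained, both directions
            "cache_io_gbps": ingest + export,           # each byte written once, read at least once
            "avg_per_site_gbps": export / n_analysis_sites,
        }

    print(central_repository_load(instrument_gbps=100, n_analysis_sites=10))

An LHC-style set of regional (Tier 1) centers splits the export and cache I/O terms across sites instead of concentrating them at one data center.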

72

LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.

73

LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that, in the case of the SKA, the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, e.g., are implementing LHCONE.

74

LHC lessons of possible use to the SKA
All of the lessons from the ‘TCP is a “fragile workhorse”’ observation must be heeded.
– All high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (The sketch following this slide illustrates why even tiny loss rates matter on long paths.)
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
Re-engineering the site LAN–WAN architecture is critical: the Science DMZ.
Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and “service challenges” – simulated operation – building up to at-scale data movement well before instrument turn-on.
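As a rough quantitative companion to the “fragile workhorse” lesson, the sketch below uses the well-known Mathis et al. approximation for loss-limited TCP (Reno-style) throughput, rate ≈ (MSS/RTT)·(C/√p). The MSS, RTT, and loss-rate values are assumptions chosen only to contrast a LAN-like path with a transatlantic one.

    # Sketch: why the same small loss rate is harmless on a LAN but
    # devastating on a long-RTT path (Mathis et al. approximation).
    from math import sqrt

    def mathis_throughput_mbps(mss_bytes: int, rtt_ms: float,
                               loss_rate: float, c: float = 1.22) -> float:
        """Approximate loss-limited TCP throughput in Mb/s."""
        rate_bps = (mss_bytes * 8 / (rtt_ms / 1000.0)) * c / sqrt(loss_rate)
        return rate_bps / 1e6

    # Assumed 1460-byte MSS and 1e-5 packet loss rate on both paths:
    print(mathis_throughput_mbps(1460, rtt_ms=1, loss_rate=1e-5))    # ~4500 Mb/s (LAN-like RTT)
    print(mathis_throughput_mbps(1460, rtt_ms=150, loss_rate=1e-5))  # ~30 Mb/s (transatlantic RTT)

The two-orders-of-magnitude gap between the two results is the quantitative reason for keeping long paths error-free and continuously monitored.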

75

The Message, Again …
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References
[DIS] “Infrastructure for Data Intensive Science – a bottom-up approach,” Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] “High Performance Bulk Data Transfer,” Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see “LHCOPN security policy document.”

[NetServ] “Network Services for High Performance Distributed Computing and Data Management,” W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] “Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System,” Chin Guok, Robertson, D., Thompson, M., Lee, J., Tierney, B., Johnston, W., Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, 2006 – IEEE, 1–5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

“Network Services for High Performance Distributed Computing and Data Management,” W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

“Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service,” William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See “perfSONAR: Instantiating a Global Network Measurement Framework,” B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] “100G and beyond with digital coherent signal processing,” Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C., Ciena Corp., Ottawa, ON, Canada. Communications Magazine, IEEE, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See “Achieving a Science DMZ” at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOE's Office of Science
  • DOE Office of Science and ESnet – the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 4.1) System software tuning: Host tuning – TCP
  • System software tuning: Host tuning – TCP
  • System software tuning: Host tuning – TCP
  • 4.2) System software tuning: Data transfer tools
  • System software tuning: Data transfer tools
  • System software tuning: Data transfer tools (2)
  • System software tuning: Data transfer tools (3)
  • 4.4) System software tuning: Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN – Optical Private Network
  • The LHC OPN – Optical Private Network (2)
  • The LHC OPN – Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHC's Open Network Environment – LHCONE
  • Slide 54
  • The LHC's Open Network Environment – LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits – How They Use Them
  • End User View of Circuits – How They Use Them (2)
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide R&D consulting and knowledge base
  • Provide R&D consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)

Page 21: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

2) Data transport: The limitations of TCP must be addressed for large, long-distance flows

Although there are other transport protocols available, TCP remains the workhorse of the Internet, including for data-intensive science.

Using TCP to support the sustained, long-distance, high data-rate flows of data-intensive science requires an error-free network.

Why error-free? TCP is a "fragile workhorse": it is very sensitive to packet loss (due to bit errors).
– Very small packet loss rates on these paths result in large decreases in performance.
– A single bit error will cause the loss of a 1–9 KB packet (depending on the MTU size), as there is no FEC at the IP level for error correction.
• This puts TCP back into "slow start" mode, thus reducing throughput.

22

Transport
• The reason for TCP's sensitivity to packet loss is the slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internet.
– Packet loss is seen by TCP's congestion control algorithms as evidence of congestion, so they activate to slow down the senders and prevent their synchronization (which perpetuates and amplifies the congestion, leading to network throughput collapse).
– Network link errors also cause packet loss, so these congestion avoidance algorithms come into play, with dramatic effect on throughput in the wide area network – hence the need for "error-free".

23

Transport: Impact of packet loss on TCP
On a 10 Gb/s LAN path the impact of low packet loss rates is minimal.
On a 10 Gb/s WAN path the impact of low packet loss rates is enormous (~80X throughput reduction on a transatlantic path).

Implication: Error-free paths are essential for high-volume, long-distance data transfers.

[Figure: Throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss. Series: Reno (measured), Reno (theory), H-TCP (measured), and no packet loss. X-axis: network round trip time, ms (corresponds roughly to San Francisco to London); Y-axis: throughput, Mb/s. See http://fasterdata.es.net/performance-testing/perfsonar/troubleshooting/packet-loss]
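Not part of the original slides: as a rough illustration of why this loss rate is so damaging at long RTT, the Mathis et al. model bounds loss-limited, Reno-style TCP throughput at roughly MSS/(RTT·sqrt(loss)). A minimal sketch using the 0.0046% loss figure from the plot above (illustrative numbers only; modern stacks such as H-TCP/CUBIC do somewhat better but show the same trend):

    # Sketch: Mathis et al. model for loss-limited TCP throughput,
    #   rate <= (MSS / RTT) * (1 / sqrt(loss_probability))
    # Illustrative only -- real stacks (CUBIC, H-TCP) do better, but the trend is the same.
    import math

    MSS = 1460 * 8          # bits per segment (1500-byte MTU minus headers)
    LOSS = 0.000046         # 0.0046% packet loss, as in the plot above
    LINK = 10e9             # 10 Gb/s link capacity

    def mathis_throughput_bps(rtt_seconds: float, loss: float = LOSS) -> float:
        """Upper bound on Reno-style TCP throughput for a given RTT and loss rate."""
        return (MSS / rtt_seconds) * (1.0 / math.sqrt(loss))

    for rtt_ms, label in [(0.1, "LAN"), (10.0, "metro/regional"),
                          (88.0, "SF to London"), (150.0, "trans-Pacific")]:
        bps = min(mathis_throughput_bps(rtt_ms / 1000.0), LINK)
        print(f"{label:16s} RTT={rtt_ms:6.1f} ms  ->  ~{bps / 1e6:8.1f} Mb/s")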

24

Transport: Modern TCP stack
• A modern TCP stack (the kernel implementation of the TCP protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk]).
– This is done using mechanisms that more quickly increase back to full speed after an error forces a reset to low bandwidth.

[Figure: "TCP Results" – throughput (Mbits/second) vs. time slot (5-second intervals) for Linux 2.6 with BIC TCP, Linux 2.4, and Linux 2.6 with BIC off; RTT = 67 ms.]

"Binary Increase Congestion" control algorithm impact: note that BIC reaches max throughput much faster than older algorithms. (From Linux 2.6.19 the default is CUBIC, a refined version of BIC designed for high-bandwidth, long paths.)
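Since the congestion control algorithm matters this much, it is worth verifying what a transfer host is actually running. A minimal sketch, assuming a Linux host (the /proc/sys paths are the standard Linux sysctl locations; not part of the original slides):

    # Sketch: report the Linux TCP congestion control configuration.
    from pathlib import Path

    def read_sysctl(name: str) -> str:
        path = Path("/proc/sys/net/ipv4") / name
        return path.read_text().strip() if path.exists() else "(not available)"

    print("current algorithm :", read_sysctl("tcp_congestion_control"))
    print("available         :", read_sysctl("tcp_available_congestion_control"))
    # On most modern kernels the current algorithm is "cubic" (the BIC refinement
    # discussed above); changing it requires root, e.g.
    #   sysctl -w net.ipv4.tcp_congestion_control=htcp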

25

Transport: Modern TCP stack
Even modern TCP stacks are only of some help in the face of packet loss on a long-path, high-speed network.

• For a detailed analysis of the impact of packet loss on various TCP implementations, see "An Investigation into Transport Protocols and Data Transport Applications Over High Performance Networks," chapter 8 ("Systematic Tests of New-TCP Behaviour"), by Yee-Ting Li, University College London (PhD thesis): http://www.slac.stanford.edu/~ytl/thesis.pdf

[Figure: Throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss (tail zoom). Series: Reno (measured), Reno (theory), and H-TCP (CUBIC refinement, measured). X-axis: round trip time, ms (corresponds roughly to San Francisco to London); Y-axis: throughput, Mb/s.]

26

3) Monitoring and testing
The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction.

perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.).
• perfSONAR is a community effort to:
– define network management data exchange protocols, and
– standardize measurement data formats, gathering, and archiving.
perfSONAR is deployed extensively throughout LHC-related networks and international networks, and at the end sites. (See [fasterdata], [perfSONAR], and [NetServ].)
– There are now more than 1000 perfSONAR boxes installed in N. America and Europe.
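Where a full perfSONAR deployment is not (yet) available, even a simple scheduled active test in the same spirit can surface soft errors. A rough sketch, assuming the standard iperf3 client is installed and a cooperating iperf3 server runs at the hypothetical host test.example.org (the JSON field names are those emitted by iperf3 -J; the thresholds are illustrative):

    # Sketch: a minimal scheduled throughput/retransmit test, in the spirit of
    # perfSONAR's active measurements. Not a perfSONAR API; host is hypothetical.
    import json, subprocess, time

    SERVER = "test.example.org"       # hypothetical cooperating measurement host

    def run_throughput_test() -> dict:
        """One 10-second TCP test; returns Gb/s achieved and sender retransmit count."""
        out = subprocess.run(["iperf3", "-c", SERVER, "-t", "10", "-J"],
                             capture_output=True, text=True, check=True)
        end = json.loads(out.stdout)["end"]    # field names as emitted by iperf3 -J
        return {"gbps": end["sum_received"]["bits_per_second"] / 1e9,
                "retransmits": end["sum_sent"].get("retransmits", 0)}

    # In practice this would run from cron/systemd on a schedule; thresholds are illustrative.
    sample = run_throughput_test()
    print(time.strftime("%F %T"), sample)
    if sample["gbps"] < 5.0 or sample["retransmits"] > 100:
        print("  -> possible soft failure on the path; investigate before users notice")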

27

perfSONAR
The test and monitor functions can detect soft errors that limit throughput and can be hard to find (hard errors/faults are easily found and corrected).

Soft failure example:
• Observed end-to-end performance degradation due to soft failure of a single optical line card.
[Figure: one month of Gb/s measurements showing normal performance, degrading performance, and recovery after repair.]

• Why not just rely on "SNMP" interface stats for this sort of error detection?
• not all error conditions show up in SNMP interface statistics
• SNMP error statistics can be very noisy
• some devices lump different error counters into the same bucket, so it can be very challenging to figure out which errors to alarm on and which to ignore
• though ESnet's Spectrum monitoring system attempts to apply heuristics to do this
• many routers will silently drop packets – the only way to find that is to test through them and observe loss using devices other than the culprit device

28

perfSONAR
The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains. It provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe.
– ESnet has perfSONAR testers installed at every PoP and all but the smallest user sites – Internet2 is close to the same.

• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages.

29

4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network.

• Host TCP tuning
• Modern TCP stack (see above)
• Other issues (MTU, etc.)

• Data transfer tools and parallelism

• Other data transfer issues (firewalls, etc.)

30

4.1) System software tuning: Host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length.
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using end-to-end.
• Default TCP buffer sizes are typically much too small for today's high-speed networks.
– Until recently, default TCP send/receive buffers were typically 64 KB.
– Tuned buffer to fill a CA to NY 1 Gb/s path: 10 MB (see the sketch below).
• 150X bigger than the default buffer size.
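The buffer sizes quoted above follow from the bandwidth-delay product (BDP = bandwidth × RTT): the sender must be able to keep a full BDP of data in flight. A small sketch (the ~80 ms CA-to-NY and ~88 ms San Francisco-to-London RTTs are assumed values, chosen to be consistent with the figures used in these slides):

    # Sketch: TCP buffer sizing from the bandwidth-delay product (BDP).
    # RTT values below are assumptions consistent with the slide's 10 MB example.

    def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
        """Bytes in flight needed to keep a path of this bandwidth and RTT full."""
        return bandwidth_bps * rtt_seconds / 8.0

    examples = [
        ("1 Gb/s, CA to NY (~80 ms RTT)",        1e9,  0.080),
        ("10 Gb/s, San Francisco to London",     10e9, 0.088),
        ("10 Gb/s, trans-Pacific (~150 ms RTT)", 10e9, 0.150),
    ]
    for label, bw, rtt in examples:
        mb = bdp_bytes(bw, rtt) / 1e6
        print(f"{label:40s} -> TCP window of at least {mb:6.1f} MB")

At 10 Gb/s and ~88 ms the BDP is already ~110 MB, which is why even the 64 MB hand-tuned window shown two slides below cannot fill the path at long RTT.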

31

System software tuning: Host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications.
– How to tune is a function of the application and the path to the destination, so potentially a lot of special cases.

Auto-tuning TCP connection buffer size within pre-configured limits helps.

Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths.
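On Linux, those auto-tuning upper limits are the third field of the net.ipv4.tcp_rmem and tcp_wmem sysctls. A minimal check, assuming a Linux host (not part of the original slides; the 64 MB threshold is illustrative):

    # Sketch: inspect the Linux auto-tuning limits for TCP socket buffers.
    # tcp_rmem / tcp_wmem hold "min default max"; max is the auto-tuning ceiling.
    from pathlib import Path

    for name in ("tcp_rmem", "tcp_wmem"):
        fields = Path(f"/proc/sys/net/ipv4/{name}").read_text().split()
        min_b, default_b, max_b = (int(x) for x in fields)
        print(f"{name}: min={min_b}  default={default_b}  max={max_b / 1e6:.1f} MB")
        if max_b < 64 * 1024 * 1024:
            print("  -> max below 64 MB: likely too small for 10 Gb/s international paths")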

32

System software tuning: Host tuning – TCP

[Figure: Throughput out to ~9000 km path length on a 10 Gb/s network, 32 MB (auto-tuned) vs. 64 MB (hand-tuned) TCP window size. Series: hand-tuned to 64 MB window; auto-tuned to 32 MB window. X-axis: round trip time, ms (corresponds roughly to San Francisco to London); Y-axis: throughput, Mb/s.]

33

4.2) System software tuning: Data transfer tools
Parallelism is key in data transfer tools.
– It is much easier to achieve a given performance level with multiple parallel connections than with one connection (see the sketch after this list).
• This is because the OS is very good at managing multiple threads and less good at sustained, maximum performance of a single thread (the same is true for disks).
– Several tools offer parallel transfers (see below).

Latency tolerance is critical.
– Wide area data transfers have much higher latency than LAN transfers.
– Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds).
• Examples: SCP/SFTP and HPSS mover protocols work very poorly in long-path networks.

• Disk performance
– In general, a RAID array or parallel disks (as in FDT) are needed to get more than about 500 Mb/s.
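As a concrete illustration of stream parallelism (purely a sketch, not any particular tool's implementation; the Data Transfer Node host, port, and framing below are hypothetical and a matching receiver that reassembles the byte ranges is assumed), one large file can be split into ranges sent over several TCP connections, so a loss event on one stream only slows that stream's share of the aggregate:

    # Sketch: move one large file over N parallel TCP streams, in the spirit of
    # GridFTP/FDT-style parallelism. Host, port, and wire format are hypothetical.
    import os, socket
    from concurrent.futures import ThreadPoolExecutor

    HOST, PORT = "dtn.example.org", 5000   # hypothetical Data Transfer Node
    FILE, STREAMS = "dataset.tar", 8

    def send_slice(offset: int, length: int) -> int:
        with socket.create_connection((HOST, PORT)) as sock, open(FILE, "rb") as f:
            f.seek(offset)
            # simple header: which byte range this stream carries
            sock.sendall(offset.to_bytes(8, "big") + length.to_bytes(8, "big"))
            remaining = length
            while remaining > 0:
                chunk = f.read(min(1 << 20, remaining))   # 1 MB application chunks
                sock.sendall(chunk)
                remaining -= len(chunk)
        return length

    size = os.path.getsize(FILE)
    slice_size = -(-size // STREAMS)                      # ceiling division
    ranges = [(i * slice_size, min(slice_size, size - i * slice_size))
              for i in range(STREAMS) if i * slice_size < size]

    with ThreadPoolExecutor(max_workers=STREAMS) as pool:
        sent = sum(pool.map(lambda r: send_slice(*r), ranges))
    print(f"sent {sent} of {size} bytes over {len(ranges)} parallel streams")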

34

System software tuning: Data transfer tools
Using the right tool is very important.

Sample results: Berkeley, CA to Argonne, IL; RTT = 53 ms, network capacity = 10 Gbps.
• scp: 140 Mbps
• patched scp (HPN): 1.2 Gbps
• ftp: 1.4 Gbps
• GridFTP, 4 streams: 5.4 Gbps
• GridFTP, 8 streams: 6.6 Gbps
Note that getting more than about 1 Gbps (125 MB/s) disk-to-disk requires using RAID technology.

• PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSH:
– http://www.psc.edu/networking/projects/hpn-ssh
– Significant performance increase
• this helps rsync too
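To put these measured rates in perspective, a back-of-the-envelope conversion (illustrative arithmetic only, using the rates listed above) of what each one means for moving a 1 TB dataset:

    # Sketch: transfer time for 1 TB at the measured rates above (arithmetic only).
    TERABYTE_BITS = 1e12 * 8

    for tool, gbps in [("scp", 0.14), ("patched scp (HPN)", 1.2), ("ftp", 1.4),
                       ("GridFTP, 4 streams", 5.4), ("GridFTP, 8 streams", 6.6)]:
        hours = TERABYTE_BITS / (gbps * 1e9) / 3600
        print(f"{tool:22s} {gbps:4.2f} Gb/s  ->  ~{hours:5.1f} hours per TB")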

35

System software tuning: Data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems.
– Parallel streams and buffer tuning help in getting through firewalls (open ports), ssh, etc.
– The newer Globus Online incorporates all of these plus small-file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP.

36

System software tuning: Data transfer tools
Also see Caltech's FDT (Fast Data Transfer) approach.
– Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node.
– Explicit parallel use of multiple disks.
– Can fill 100 Gb/s paths.
– See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT

37

4.4) System software tuning: Other issues
Firewalls are anathema to high-speed data flows.
– Many firewalls can't handle >1 Gb/s flows:
• designed for a large number of low-bandwidth flows;
• some firewalls even strip out TCP options that allow for TCP buffers > 64 KB.
– See Jason Zurawski's "Say Hello to your Frienemy – The Firewall":
• Stateful firewalls have inherent problems that inhibit high throughput.
• http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf

• Many other issues:
– Large MTUs (several issues; see the sketch after this list)
– NIC tuning
• Defaults are usually fine for 1GE, but 10GE often requires additional tuning
– Other OS tuning knobs
– See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])
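As a small aid for the MTU point above: jumbo frames (9000-byte MTU) only help when every hop on the path carries them, so a quick per-host check is a reasonable first step. A minimal sketch, assuming a Linux host (the sysfs paths are standard; interface names will differ):

    # Sketch: report per-interface MTU on a Linux host via sysfs.
    # Mismatched MTUs plus broken path-MTU discovery show up as mysterious stalls.
    from pathlib import Path

    for iface in sorted(p.name for p in Path("/sys/class/net").iterdir()):
        mtu = int(Path(f"/sys/class/net/{iface}/mtu").read_text())
        note = "jumbo" if mtu >= 9000 else ("standard" if mtu == 1500 else "")
        print(f"{iface:12s} MTU {mtu:5d}  {note}")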

5) Site infrastructure to support data-intensive science: The Science DMZ

With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck.

The site network (LAN) typically provides connectivity for the local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science.
– Therefore a high-performance interface between the wide area network and the local area site network is critical for large-scale data movement.

Campus network infrastructure is typically not designed to handle the flows of large-scale science.
– The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows:
• firewalls, proxy servers, low-cost switches, and so forth,
• none of which will allow high-volume, high-bandwidth, long-distance data flows.

39

The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high bandwidth, large data volume, and high round trip time (RTT) (international paths) of the wide area network (WAN) flows. (See [DIS].)
– Otherwise the site will impose poor performance on the entire high-speed data path, all the way back to the source.

40

The Science DMZ
The Science DMZ concept:
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy.
– Outside the site firewall – hence the term "Science DMZ".
– With dedicated systems built and tuned for wide-area data transfer.
– With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below).
– A security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.).
This is so important that it was a requirement for the last round of NSF CC-NIE grants.

41

The Science DMZ

(See http://fasterdata.es.net/science-dmz and [SDMZ] for a much more complete discussion of the various approaches.)

[Diagram: Science DMZ architecture. The border router connects the WAN to a Science DMZ router/switch (a WAN-capable device) that provides a clean, high-bandwidth WAN data path to the Science DMZ resources: a high-performance Data Transfer Node (dedicated systems built and tuned for wide-area data transfer), a computing cluster, network monitoring and testing, and per-service security policy control points. The campus/site LAN and the site DMZ (Web, DNS, Mail) sit behind the site firewall: campus/site access to Science DMZ resources is via the site firewall, and secured campus/site access to the Internet also passes through it, while the Science DMZ data path itself does not.]

42

6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.

• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers.

• The Tier 2 sites get a comparable amount of data from the Tier 1s.
– They host the physics groups that analyze the data and do the science.
– They provide most of the compute resources for analysis.
– They cache the data (though this is evolving to remote I/O).

43

Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management.
– The resources and data movement are centrally managed.
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations.
– The system manages tens of thousands of jobs a day:
• it coordinates data movement of hundreds of terabytes/day, and
• manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science.

• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial.
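To illustrate the brokering idea described above (a toy sketch only, not PanDA's actual implementation; the site names, slot counts, and dataset names are hypothetical): prefer sending the job to a site that already holds the dataset and has free slots, otherwise move both data and job to wherever resources are available.

    # Toy sketch of data-locality-aware brokering -- NOT PanDA's actual implementation.
    SITES = {   # hypothetical site state: free job slots and resident datasets
        "BNL":    {"free_slots": 120, "datasets": {"dsA", "dsB"}},
        "DE-KIT": {"free_slots": 40,  "datasets": {"dsC"}},
        "RAL":    {"free_slots": 0,   "datasets": {"dsA"}},
    }

    def broker(job_dataset: str) -> tuple[str, bool]:
        """Return (site, needs_data_transfer) for a job that reads job_dataset."""
        with_data = [s for s, st in SITES.items()
                     if job_dataset in st["datasets"] and st["free_slots"] > 0]
        if with_data:                      # prefer moving the job to the data
            return max(with_data, key=lambda s: SITES[s]["free_slots"]), False
        # otherwise move data and job to the site with the most free resources
        return max(SITES, key=lambda s: SITES[s]["free_slots"]), True

    print(broker("dsA"))   # ('BNL', False)    data already at a site with free slots
    print(broker("dsC"))   # ('DE-KIT', False)
    print(broker("dsX"))   # ('BNL', True)     no site has dsX; ship data and job there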

44

[Diagram: The ATLAS PanDA "Production and Distributed Analysis" system, which uses distributed resources and layers of automation to manage several million jobs/day. ATLAS production jobs, regional production jobs, and user/group analysis jobs enter the PanDA Server (task management) at CERN, which comprises a task buffer (job queue), job broker, job dispatcher, data service, and policy module (job type priority). A Distributed Data Manager (a complex system in its own right, called DQ2, with DDM agents) handles dataset location and movement. The CERN ATLAS detector feeds the Tier 0 Data Center (1 copy of all data – archival only); the 11 ATLAS Tier 1 Data Centers, scattered across Europe, North America, and Asia, in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis; the ATLAS analysis sites (e.g. 70 Tier 2 Centers in Europe, North America, and SE Asia) run the jobs.
Workflow: 1) PanDA schedules jobs and initiates data movement; 2) DDM (DQ2) locates data and moves it to sites; 3) the local resources are prepared to receive PanDA jobs – a "pilot" job manager (a PanDA job receiver) is dispatched when resources are available at a site, runs under the local site job manager (e.g. Condor, LSF, LCG, ...), and accepts jobs in a standard format from PanDA (similar to the Condor glide-in approach), with a site capability/status service informing the broker; 4) jobs are dispatched when resources are available and when the required data is in place at the site. The overall strategy: try to move the job to where the data is, else move data and job to where resources are available.
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s.

[Figure: Accumulated data volume on disk (0–150 petabytes over four years) and PanDA jobs over one year for each of two job types (up to ~100,000 and ~50,000 jobs respectively). PanDA manages 120,000–140,000 simultaneous jobs (the two types of jobs that PanDA manages are shown separately); the corresponding data movement is 730 TBytes/day.]

It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.

46

Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure.
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges".
– Successful testing was required for sites to participate in LHC production.

47

Ramp-up of LHC traffic in ESnet

[Figure: ESnet traffic over time, marking the LHC data system testing period, LHC turn-on, and LHC operation (with an estimate of the "small"-scale traffic). The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.]

48

6 cont) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.

• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN.
– The LHCOPN is a collection of leased 10 Gb/s optical circuits.
– The role of the LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.

49

The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.
• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance.
– The security issues were the primary ones, and were addressed by using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec]).
– That is, only LHC data and compute servers are connected to the OPN.

50

The LHC OPN – Optical Private Network

[Diagram: LHCOPN physical topology (abbreviated) and LHCOPN architecture. CH-CERN connects to the Tier 1 centers: UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]

51

The LHC OPN – Optical Private Network
NB:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic.
– (There are about 170 Tier 2 sites.)

• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – that is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.

53

The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).

The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to the subnets that are used by LHC systems.
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.).
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).

In this way the LHC traffic will use circuits designated by the network engineers.
– This is to ensure continued good performance for the LHC and to ensure that other traffic is not impacted – critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.

54

[Map: "LHCONE: A global infrastructure for the LHC Tier 1 data center – Tier 2 analysis center connectivity" (April 2012). The map shows LHCONE VRF domains (ESnet and Internet2 in the USA, CANARIE in Canada, GÉANT in Europe, NORDUnet for the Nordic countries, DFN in Germany, GARR in Italy, RedIRIS in Spain, SARA in the Netherlands, RENATER in France, CUDI in Mexico, TWAREN and ASGC in Taiwan, KERONET2 and KISTI in Korea, and connections to India), regional R&E communication nexus points (Seattle, Chicago, New York, Washington, Amsterdam, Geneva, ...), and end sites – LHC Tier 2 or Tier 3 unless indicated as Tier 1 (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1, plus many universities and labs such as Harvard, MIT, Caltech, SLAC, UMich, DESY, GSI, and TIFR). Data communication links are 10, 20, and 30 Gb/s. See http://lhcone.net for details.]

55

The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because:
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.

• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.

• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.

• See lhcone.net

LHCONE is one part of the network infrastructure that supports the LHC

Distances from CERN to the Tier 1 centers:

CERN → T1           miles     km
France                350     565
Italy                 570     920
UK                    625    1000
Netherlands           625    1000
Germany               700    1185
Spain                 850    1400
Nordic               1300    2100
USA – New York       3900    6300
USA – Chicago        4400    7100
Canada – BC          5200    8400
Taiwan               6100    9850

[Diagram: "A Network Centric View of the LHC." Labels include: detector; Level 1 and 2 triggers (O(1-10) meters); Level 3 trigger (O(10-100) meters); 1 PB/s; CERN Computer Center (O(1) km); 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS); the LHC Optical Private Network (LHCOPN), 500-10,000 km, to the LHC Tier 1 Data Centers (Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN); and the LHC Open Network Environment (LHCONE) connecting to the LHC Tier 2 Analysis Centers (universities/physics groups) – this is intended to indicate that the physics groups now get their data wherever it is most readily available.]

57

7) New network services: Point-to-Point Virtual Circuit Service

Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to:
– couple existing pockets of code, data, and expertise into "systems of systems";
– break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites.
– See https://www.es.net/about/science-requirements

A commonly identified need to support this is that networking must be provided as a "service":
– schedulable, with guaranteed bandwidth – as is done with CPUs and disks;
– traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure;
– some network path characteristics may also be specified – e.g. diversity;
– available in the Web Services / Grid Services paradigm.

58

Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
– This is typically done by using a "static" routing mechanism.
• E.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path.
– MPLS and OpenFlow are examples of this, and both can transport IP packets.
– Most modern Internet routers have this type of functionality.

• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic.
– The virtual circuits can be directed to specific physical network paths when they are set up.

59

Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.

60

End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part.

• How are the circuits used?
– End system to end system, IP:
• Almost never – very hard unless private address space is used.
– Using public address space can result in leaking routes.
– Using private address space with multi-homed hosts risks allowing backdoors into secure networks.
– End system to end system, Ethernet (or other) over VLAN – a pseudowire:
• Relatively common.
• Interesting example: RDMA over VLAN is likely to be popular in the future.
– The SC11 demo of 40G RDMA over the WAN was very successful.
– CPU load for RDMA is a small fraction of that of IP.
– The guaranteed behavior of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks).
– Point-to-point connection between routing instances – e.g. BGP at the end points:
• Essentially this is how all current circuits are used, from one site router to another site router.
– Typically site-to-site, or to advertise subnets that host clusters, e.g. LHC analysis or data management clusters.

61

End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot.
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering.

62

Cross-Domain Virtual Circuit Service
• Large-scale science always involves institutions in multiple network domains (administrative units).
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits.
– E.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.

63

Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.

[Diagram: An end-to-end virtual circuit from a user source at FNAL (AS3152) [US], through ESnet (AS293) [US] and GEANT (AS20965) [Europe], to DFN (AS680) [Germany] and a user destination at DESY (AS1754) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – e.g. OSCARS in ESnet and AutoBAHN in GEANT – with a data plane connection helper at each domain ingress/egress point.
1. The domains exchange topology information containing at least the potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process.]

64

Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).

• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system.

Multi-domain circuit setup is not yet a robust production service, but progress is being made.
• See lhcone.net

65

8) Provide R&D consulting and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science.
– With each generation of network transport technology:
• 155 Mb/s was the norm for high speed networks in 1995;
• 100 Gb/s – 650 times greater – is the norm today.
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to:
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
• and then do the development necessary for applications to make use of the new capabilities.
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths;
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s.

66

Provide R&D consulting and knowledge base
• Providing consulting on the problems that data-intensive projects are having in effectively using the network is critical.
Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.

67

The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained

• fasterdata.es.net is a community project with contributions from several organizations.

68

The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

69

Infrastructure Critical to Science
• The combination of:
– new network architectures in the wide area;
– new network services (such as guaranteed bandwidth virtual circuits);
– cross-domain network error detection and correction;
– redesigning the site LAN to handle high data throughput;
– automation of data movement systems;
– use of appropriate operating system tuning and data transfer tools;
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.

• Other disciplines that involve data-intensive science will face most of these same issues.

70

LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated at / sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.

71

LHC lessons of possible use to the SKA
The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
– A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
– The technical aspects of building and operating a centralized working data repository –
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time, and
• high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites –
militate against a single large data center.

72

LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well.
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ

Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using

synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on

75

The Message
Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

76

References

[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References

[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations

78

References

[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 22: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

22

Transportbull The reason for TCPrsquos sensitivity to packet loss is that the

slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internetndash Packet loss is seen by TCPrsquos congestion control algorithms as

evidence of congestion so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion leading to network throughput collapse)

ndash Network link errors also cause packet loss so these congestion avoidance algorithms come into play with dramatic effect on throughput in the wide area network ndash hence the need for ldquoerror-freerdquo

23

Transport Impact of packet loss on TCPOn a 10 Gbs LAN path the impact of low packet loss rates is

minimalOn a 10Gbs WAN path the impact of low packet loss rates is

enormous (~80X throughput reduction on transatlantic path)

Implications Error-free paths are essential for high-volume long-distance data transfers

Throughput vs increasing latency on a 10Gbs link with 00046 packet loss

Reno (measured)

Reno (theory)

H-TCP(measured)

No packet loss

(see httpfasterdataesnetperformance-testingperfso

nartroubleshootingpacket-loss)

Network round trip time ms (corresponds roughly to San Francisco to London)

10000

9000

8000

7000

6000

5000

4000

3000

2000

1000

0

Thro

ughp

ut M

bs

24

Transport Modern TCP stackbull A modern TCP stack (the kernel implementation of the TCP

protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])ndash This is done using mechanisms that more quickly increase back to full

speed after an error forces a reset to low bandwidth

TCP Results

0

100

200

300

400

500

600

700

800

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35time slot (5 second intervals)

Mbi

tss

econ

d

Linux 26 BIC TCPLinux 24Linux 26 BIC off

RTT = 67 ms

ldquoBinary Increase Congestionrdquo control algorithm impact

Note that BIC reaches max throughput much faster than older algorithms (from Linux 2619 the

default is CUBIC a refined version of BIC designed for high bandwidth

long paths)

25

Transport Modern TCP stackEven modern TCP stacks are only of some help in the face of

packet loss on a long path high-speed network

bull For a detailed analysis of the impact of packet loss on various TCP implementations see ldquoAn Investigation into Transport Protocols and Data Transport Applications Over High Performance Networksrdquo chapter 8 (ldquoSystematic Tests of New-TCP Behaviourrdquo) by Yee-Ting Li University College London (PhD thesis) httpwwwslacstanfordedu~ytlthesispdf

Reno (measured)

Reno (theory)

H-TCP (CUBIC refinement)(measured)

Throughput vs increasing latency on a 10Gbs link with 00046 packet loss(tail zoom)

Roundtrip time ms (corresponds roughly to San Francisco to London)

1000

900800700600500400300200100

0

Thro

ughp

ut M

bs

26

3) Monitoring and testingThe only way to keep multi-domain international scale networks error-free is to test and monitor continuously

end-to-end to detect soft errors and facilitate their isolation and correction

perfSONAR provides a standardize way to test measure export catalogue and access performance data from many different network domains (service providers campuses etc)

bull perfSONAR is a community effort tondash define network management data exchange protocols andndash standardized measurement data formats gathering and archiving

perfSONAR is deployed extensively throughout LHC related networks and international networks and at the end sites(See [fasterdata] [perfSONAR] and [NetSrv])

ndash There are now more than 1000 perfSONAR boxes installed in N America and Europe

27

perfSONARThe test and monitor functions can detect soft errors that limit

throughput and can be hard to find (hard errors faults are easily found and corrected)

Soft failure examplebull Observed end-to-end performance degradation due to soft failure of single optical line card

Gb

s

normal performance

degrading performance

repair

bull Why not just rely on ldquoSNMPrdquo interface stats for this sort of error detectionbull not all error conditions show up in SNMP interface statisticsbull SNMP error statistics can be very noisybull some devices lump different error counters into the same bucket so it can be very

challenging to figure out what errors to alarm on and what errors to ignorebull though ESnetrsquos Spectrum monitoring system attempts to apply heuristics to do this

bull many routers will silently drop packets - the only way to find that is to test through them and observe loss using devices other than the culprit device

one month

28

perfSONARThe value of perfSONAR increases dramatically as it is

deployed at more sites so that more of the end-to-end (app-to-app) path can characterized across multiple network domains provides the only widely deployed tool that can monitor circuits end-

to-end across the different networks from the US to Europendash ESnet has perfSONAR testers installed at every PoP and all but the

smallest user sites ndash Internet2 is close to the same

bull perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) and the protocol is implemented using SOAP XML messages

29

4) System software evolution and optimizationOnce the network is error-free there is still the issue of

efficiently moving data from the application running on a user system onto the network

bull Host TCP tuningbull Modern TCP stack (see above)bull Other issues (MTU etc)

bull Data transfer tools and parallelism

bull Other data transfer issues (firewalls etc)

30

41) System software tuning Host tuning ndash TCPbull ldquoTCP tuningrdquo commonly refers to the proper configuration of

TCP windowing buffers for the path lengthbull It is critical to use the optimal TCP send and receive socket

buffer sizes for the path (RTT) you are using end-to-endDefault TCP buffer sizes are typically much too small for

todayrsquos high speed networksndash Until recently default TCP sendreceive buffers were typically 64 KBndash Tuned buffer to fill CA to NY 1 Gbs path 10 MB

bull 150X bigger than the default buffer size

31

System software tuning Host tuning ndash TCPbull Historically TCP window size tuning parameters were host-

global with exceptions configured per-socket by applicationsndash How to tune is a function of the application and the path to the

destination so potentially a lot of special cases

Auto-tuning TCP connection buffer size within pre-configured limits helps

Auto-tuning however is not a panacea because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (eg international) paths

32

System software tuning Host tuning ndash TCP

Throughput out to ~9000 km on a 10Gbs network32 MBy (auto-tuned) vs 64 MBy (hand tuned) TCP window size

hand tuned to 64 MBy window

Roundtrip time ms (corresponds roughlyto San Francisco to London)

path length

10000900080007000600050004000300020001000

0

Thro

ughp

ut M

bs

auto tuned to 32 MBy window

33

42) System software tuning Data transfer toolsParallelism is key in data transfer tools

ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection

bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)

ndash Several tools offer parallel transfers (see below)

Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN

transfersndash Many tools and protocols assume latencies typical of a LAN

environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long

path networks

bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more

than about 500 Mbs

34

System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL

RTT = 53 ms network capacity = 10GbpsTool Throughput

bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology

bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase

bull this helps rsync too

35

System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-

performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open

ports) ssh etc The newer Globus Online incorporates all of these and small file

support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community

outside of HEP

36

System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach

ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node

ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and

httpmonalisacernchFDT

37

44) System software tuning: Other issues

Firewalls are anathema to high-speed data flows
– many firewalls can't handle >1 Gb/s flows
• they are designed for large numbers of low-bandwidth flows
• some firewalls even strip out TCP options that allow for TCP buffers > 64 KB
– See Jason Zurawski's "Say Hello to your Frienemy – The Firewall"
– Stateful firewalls have inherent problems that inhibit high throughput
• http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf

• Many other issues
– Large MTUs (several issues)
– NIC tuning
• Defaults are usually fine for 1GE, but 10GE often requires additional tuning
– Other OS tuning knobs
– See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])

5) Site infrastructure to support data-intensive science: The Science DMZ

With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck.

The site network (LAN) typically provides connectivity for local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science
– Therefore a high performance interface between the wide area network and the local area site network is critical for large-scale data movement

Campus network infrastructure is typically not designed to handle the flows of large-scale science
– The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows
• firewalls, proxy servers, low-cost switches, and so forth,
• none of which will allow high volume, high bandwidth, long distance data flows

39

The Science DMZ

To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large data volume, high round trip time (RTT) (international paths) character of the wide area network (WAN) flows (see [DIS])
– otherwise the site will impose poor performance on the entire high speed data path, all the way back to the source

40

The Science DMZ

The Science DMZ concept:
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy:
– Outside the site firewall – hence the term "Science DMZ"
– With dedicated systems built and tuned for wide-area data transfer
– With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below)
– A security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.)

This is so important that it was a requirement for the last round of NSF CC-NIE grants.

41

The Science DMZ

(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)

[Diagram: the site's border router connects the WAN to a Science DMZ router/switch (a WAN-capable device) over a clean, high-bandwidth WAN data path. The Science DMZ hosts a high-performance Data Transfer Node, a computing cluster, and network monitoring and testing systems – dedicated systems built and tuned for wide-area data transfer – with per-service security policy control points. The campus/site LAN and the site DMZ (Web/DNS/Mail) sit behind the site firewall; secured campus/site access to the Internet, and campus/site access to Science DMZ resources, go via the site firewall.]

42

6) Data movement and management techniques

Automated data movement is critical for moving 500 terabytes/day between 170 international sites
– In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery (a minimal sketch of such retry logic follows this list)

• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers

• The Tier 2 sites get a comparable amount of data from the Tier 1s
– Host the physics groups that analyze the data and do the science
– Provide most of the compute resources for analysis
– Cache the data (though this is evolving to remote IO)
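At this scale, transfer failures are routine rather than exceptional, so the workflow layer must retry and reroute without human intervention. The toy sketch below shows that pattern; the transfer function, replica names, and backoff policy are all hypothetical stand-ins for what systems such as PanDA/DQ2 or Globus do in production.

```python
import random
import time

def transfer(source_site, dest_site, dataset):
    """Placeholder for a real transfer (e.g. a GridFTP third-party copy)."""
    if random.random() < 0.2:                 # simulate a 20% failure rate
        raise IOError(f"transfer of {dataset} from {source_site} failed")
    return True

def robust_transfer(dataset, replicas, dest_site, max_attempts=5):
    """Retry with exponential backoff, falling back to alternate replicas."""
    delay = 10.0                              # seconds; hypothetical policy
    for attempt in range(max_attempts):
        source = replicas[attempt % len(replicas)]   # rotate through replicas
        try:
            return transfer(source, dest_site, dataset)
        except IOError as err:
            print(f"attempt {attempt + 1}: {err}; retrying in {delay:.0f}s")
            time.sleep(delay)
            delay *= 2                        # back off instead of hammering a sick link
    raise RuntimeError(f"{dataset}: all {max_attempts} attempts failed")

# robust_transfer("AOD.12345", ["BNL", "TRIUMF", "RAL"], "MWT2")
```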

43

Highly distributed and highly automated workflow systems

• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management
– The resources and data movement are centrally managed
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations
– The system manages 10s of thousands of jobs a day
• coordinates data movement of hundreds of terabytes/day, and
• manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science

• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial
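The brokering rule at the heart of such a system – try to move the job to where the data is, else move the data and the job to where resources are available (see the PanDA diagram on the next slide) – can be captured in a few lines. This is only a cartoon of the decision, not PanDA's actual implementation, and the site and dataset structures are invented for illustration.

```python
def broker(job, sites):
    """Prefer sites that already hold the input data and have free CPU slots;
    otherwise pick any site with free slots and schedule data movement to it."""
    with_data = [s for s in sites
                 if job["dataset"] in s["datasets"] and s["free_slots"] > 0]
    if with_data:
        site = max(with_data, key=lambda s: s["free_slots"])
        return site["name"], None                      # no data movement needed
    candidates = [s for s in sites if s["free_slots"] > 0]
    if not candidates:
        return None, None                              # job stays queued
    site = max(candidates, key=lambda s: s["free_slots"])
    return site["name"], job["dataset"]                # move this dataset first

sites = [
    {"name": "BNL",  "free_slots": 0,   "datasets": {"data12.AOD.r1"}},
    {"name": "MWT2", "free_slots": 800, "datasets": set()},
    {"name": "SLAC", "free_slots": 150, "datasets": {"data12.AOD.r1"}},
]
print(broker({"dataset": "data12.AOD.r1"}, sites))     # ('SLAC', None)
print(broker({"dataset": "mc12.HITS.r2"}, sites))      # ('MWT2', 'mc12.HITS.r2')
```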

44

[Diagram: The ATLAS PanDA "Production and Distributed Analysis" system uses distributed resources and layers of automation to manage several million jobs/day.

Components shown: ATLAS production jobs, regional production jobs, and user/group analysis jobs feed a Task Buffer (job queue) in the PanDA Server (task management) at CERN, together with a Data Service, Job Dispatcher, Job Broker, and Policy (job type, priority) modules; a Distributed Data Manager with DDM Agents; a Grid Scheduler, a Site Capability Service, and site status feeds; the CERN ATLAS detector and the Tier 0 Data Center (1 copy of all data – archival only); the ATLAS Tier 1 Data Centers (11 sites scattered across Europe, North America and Asia, which in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis); and the ATLAS analysis sites (e.g. 70 Tier 2 centers in Europe, North America and SE Asia), where Pilot Jobs (PanDA job receivers running under the site-specific job manager) accept work.

Job resource manager:
• Dispatch a "pilot" job manager – a PanDA job receiver – when resources are available at a site
• Pilots run under the local site job manager (e.g. Condor, LSF, LCG, …) and accept jobs in a standard format from PanDA
• Similar to the Condor Glide-in approach

Workflow:
1) PanDA schedules jobs and initiates data movement
2) The DDM locates data and moves it to sites (this is a complex system in its own right, called DQ2)
3) The pilot infrastructure prepares the local resources to receive PanDA jobs
4) Jobs are dispatched when there are resources available and when the required data is in place at the site

The strategy: try to move the job to where the data is, else move data and job to where resources are available.

Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe, N. America and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s.

PanDA manages 120,000–140,000 simultaneous jobs. (PanDA manages two types of jobs, which are shown separately here.)

[Plots: accumulated data volume on disk, rising to ~150 petabytes over four years at 730 TBytes/day; and the two job types over one-year windows, one running at up to ~100,000 simultaneous jobs and the other at up to ~50,000.]

It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
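As a sanity check on the headline numbers, 730 TBytes/day does correspond to roughly 68 Gb/s of continuous network load:

```python
tbytes_per_day = 730
bits_per_day = tbytes_per_day * 1e12 * 8        # terabytes -> bits
seconds_per_day = 24 * 3600
print(f"{bits_per_day / seconds_per_day / 1e9:.1f} Gb/s")   # -> 67.6, i.e. ~68 Gb/s
```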

46

Building an LHC-scale production analysis system

In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges"
– Successful testing was required for sites to participate in LHC production

47

Ramp-up of LHC traffic in ESnet

[Figure: ESnet traffic growth through the LHC data system testing period (with an estimate of the "small"-scale testing traffic), LHC turn-on, and LHC operation. The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.]

48

6 cont.) Evolution of network architectures

For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.

• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN
– The LHCOPN is a collection of leased 10 Gb/s optical circuits
– The role of the LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN – Optical Private Network

• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community

• The issues related to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance
– The security issues were the primary ones, and were addressed by:
• Using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec])
– that is, only LHC data and compute servers are connected to the OPN

50

The LHC OPN – Optical Private Network

[Diagram: LHCOPN physical topology (abbreviated) and LHCOPN architecture – CH-CERN connected to the Tier 1 centers: UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, IT-INFN-CNAF.]

51

The LHC OPN – Optical Private Network

NB:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below)
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
– (there are about 170 Tier 2 sites)

• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism

• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose

53

The LHC's Open Network Environment – LHCONE

LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).

The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineers
– To ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC

54

LHCONE: A global infrastructure for the LHC Tier 1 data center – Tier 2 analysis center connectivity (April 2012)

[Map: LHCONE VRF domains – ESnet (USA), Internet2 (USA), CANARIE (Canada), GÉANT (Europe), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), CUDI (Mexico), TWAREN and ASGC (Taiwan), KERONET2 and KISTI (Korea), TIFR (India) – interconnected at regional R&E communication nexus points such as Seattle, Chicago, New York, Washington, Amsterdam and Geneva. End sites include the Tier 1 centers (BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1a/c, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1) and many Tier 2/Tier 3 sites (e.g. Harvard, MIT, Caltech, UFlorida, UNeb, PurU, UCSD, UWisc, UltraLight, UMich, SLAC, UVic, SimFraU, UAlb, UTor, McGill, NCU, NTU, KNU, DESY, GSI, INFN-Nap, GRIF-IN2P3, Sub-IN2P3, CEA, UNAM). End sites are LHC Tier 2 or Tier 3 unless indicated as Tier 1; data communication links are 10, 20 and 30 Gb/s. See http://lhcone.net for details.]

55

The LHC's Open Network Environment – LHCONE

• LHCONE could be set up relatively "quickly" because
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic

• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance

• See lhcone.net

LHCONE is one part of the network infrastructure that supports the LHC

Approximate CERN → Tier 1 path lengths:

CERN → T1          miles     km
France               350      565
Italy                570      920
UK                   625     1000
Netherlands          625     1000
Germany              700     1185
Spain                850     1400
Nordic              1300     2100
USA – New York      3900     6300
USA – Chicago       4400     7100
Canada – BC         5200     8400
Taiwan              6100     9850

[Figure: A Network Centric View of the LHC. The detector feeds the Level 1 and 2 triggers (O(1-10) meters, ~1 PB/s), then the Level 3 trigger (O(10-100) meters), then the CERN Computer Center / Tier 0 (O(1) km). About 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) flows over the LHC Optical Private Network (LHCOPN), spanning 500-10,000 km, to the LHC Tier 1 Data Centers (Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN). The LHC Open Network Environment (LHCONE) connects the Tier 1s to the LHC Tier 2 Analysis Centers (universities and physics groups). This is intended to indicate that the physics groups now get their data wherever it is most readily available.]

57

7) New network services: Point-to-Point Virtual Circuit Service

Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to
– Couple existing pockets of code, data and expertise into "systems of systems"
– Break up the task of massive data analysis and use data, compute and storage resources that are located at the collaborators' sites
– See https://www.es.net/about/science-requirements/

A commonly identified need to support this is that networking must be provided as a "service":
– Schedulable with guaranteed bandwidth – as is done with CPUs and disks
– Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
– Some network path characteristics may also be specified – e.g. diversity
– Available in the Web Services / Grid Services paradigm

58

Point-to-Point Virtual Circuit Service

The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
– This is typically done by using a "static" routing mechanism
• e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
– MPLS and OpenFlow are examples of this, and both can transport IP packets
– Most modern Internet routers have this type of functionality

• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
– The virtual circuits can be directed to specific physical network paths when they are set up
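A toy model of the "static switch tables set up in advance" idea: each switch forwards on a label it was pre-provisioned with, so the traffic follows the engineered path regardless of what IP routing would have chosen. This is only a conceptual sketch of label switching, not MPLS or OpenFlow code, and the switch names and label values are invented.

```python
# Pre-provisioned label-forwarding tables, one per switch on the circuit:
# label_in -> (next_switch, label_out).  A controller (e.g. a virtual circuit
# system such as OSCARS) would install these before any traffic flows.
TABLES = {
    "sw-berkeley": {101: ("sw-chicago", 202)},
    "sw-chicago":  {202: ("sw-newyork", 303)},
    "sw-newyork":  {303: ("egress", None)},
}

def forward(switch, label, payload):
    """Follow the pre-installed circuit hop by hop."""
    while switch != "egress":
        next_switch, next_label = TABLES[switch][label]
        print(f"{switch}: label {label} -> {next_switch} (label {next_label})")
        switch, label = next_switch, next_label
    print(f"delivered: {payload!r}")

forward("sw-berkeley", 101, b"big science data")
```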

59

Point-to-Point Virtual Circuit Service

• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead, Chin Guok, chin@es.net)

• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references

• OSCARS received a 2013 "R&D 100" award

60

End User View of Circuits – How They Use Them

• Who are the "users"?
– Sites, for the most part

• How are the circuits used?
– End system to end system, IP
• Almost never – very hard unless private address space is used
– Using public address space can result in leaking routes
– Using private address space with multi-homed hosts risks allowing backdoors into secure networks
– End system to end system, Ethernet (or other) over VLAN – a pseudowire
• Relatively common
• Interesting example: RDMA over VLAN, likely to be popular in the future
– the SC11 demo of 40G RDMA over the WAN was very successful
– CPU load for RDMA is a small fraction of that for IP
– The guaranteed circuit characteristics (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fit nicely with circuit services (RDMA performs very poorly on best-effort networks)
– Point-to-point connection between routing instances – e.g. BGP at the end points
• Essentially this is how all current circuits are used, from one site router to another site router
– Typically site-to-site, or to advertise subnets that host clusters, e.g. LHC analysis or data management clusters

61

End User View of Circuits – How They Use Them

• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot

• Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Service

• Large-scale science always involves institutions in multiple network domains (administrative units)
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
– e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains

63

Inter-Domain Control Protocol

• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling and resource commitment within network domains
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains

[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US], across ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany], to a user destination at DESY (AS1754) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GÉANT – which exchanges topology information and passes the VC setup request along, with a data plane connection helper at each domain ingress/egress point.]

1. The domains exchange topology information containing at least the potential VC ingress and egress points
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved
3. The data plane connection (e.g. an Ethernet VLAN-to-VLAN connection) is facilitated by a helper process
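The daisy-chained setup in steps 1-3 can be sketched as each domain's controller reserving its own segment and then passing the request to the next domain. Everything here (class shape, method names, request fields) is schematic and invented for illustration; it is not the actual IDC or NSI protocol.

```python
class DomainController:
    """Stand-in for a per-domain controller such as OSCARS or AutoBAHN."""
    def __init__(self, name, next_hop=None):
        self.name, self.next_hop = name, next_hop

    def setup(self, request):
        # 1) authorize and reserve the segment crossing this domain
        print(f"{self.name}: reserved {request['bandwidth_gbps']} Gb/s segment")
        # 2) pass the request on toward the destination domain
        if self.next_hop is not None:
            return [self.name] + self.next_hop.setup(request)
        return [self.name]   # last domain: all circuit segments reserved

# Chain of domains between the two end sites (cf. the FNAL-ESnet-GEANT-DFN-DESY example)
desy  = DomainController("DESY")
dfn   = DomainController("DFN", desy)
geant = DomainController("GEANT", dfn)
esnet = DomainController("ESnet", geant)
fnal  = DomainController("FNAL", esnet)

path = fnal.setup({"bandwidth_gbps": 10, "vlan": 3002})
print("end-to-end circuit across:", " -> ".join(path))
```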

64

Point-to-Point Virtual Circuit Service

• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)

• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access and data movement can all work together as a predictable system

• Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net

65

8) Provide R&D, consulting, and a knowledge base

• R&D drove most of the advances that make it possible for the network to support data-intensive science
– With each generation of network transport technology:
• 155 Mb/s was the norm for high speed networks in 1995
• 100 Gb/s – 650 times greater – is the norm today
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
• and then to do the development necessary for applications to make use of the new capabilities
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk IO and parallel network IO together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s

66

Provide R&D, consulting, and a knowledge base

• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.

Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.

The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.

67

The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained

• fasterdata.es.net is a community project with contributions from several organizations

68

The Message

A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

69

Infrastructure Critical to Science

• The combination of
– New network architectures in the wide area
– New network services (such as guaranteed bandwidth virtual circuits)
– Cross-domain network error detection and correction
– Redesigning the site LAN to handle high data throughput
– Automation of data movement systems
– Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.

• Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKA

The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKA

The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
– A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository, namely
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time, and
• high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites,
mitigate against a single large data center

72

LHC lessons of possible use to the SKA

The LHC model of distributed data (multiple regional centers) has worked well
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKA

Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply
– There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
• It might be that in the case of the SKA the T1 links would come to a centralized, data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
– If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
– In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE

74

LHC lessons of possible use to the SKA

All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded
– All high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical: the Science DMZ

Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on

75

The Message

Again … A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References

[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet, Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 - 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References

[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References

[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C., Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

23

Transport Impact of packet loss on TCPOn a 10 Gbs LAN path the impact of low packet loss rates is

minimalOn a 10Gbs WAN path the impact of low packet loss rates is

enormous (~80X throughput reduction on transatlantic path)

Implications Error-free paths are essential for high-volume long-distance data transfers

Throughput vs increasing latency on a 10Gbs link with 00046 packet loss

Reno (measured)

Reno (theory)

H-TCP(measured)

No packet loss

(see httpfasterdataesnetperformance-testingperfso

nartroubleshootingpacket-loss)

Network round trip time ms (corresponds roughly to San Francisco to London)

10000

9000

8000

7000

6000

5000

4000

3000

2000

1000

0

Thro

ughp

ut M

bs

24

Transport Modern TCP stackbull A modern TCP stack (the kernel implementation of the TCP

protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])ndash This is done using mechanisms that more quickly increase back to full

speed after an error forces a reset to low bandwidth

TCP Results

0

100

200

300

400

500

600

700

800

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35time slot (5 second intervals)

Mbi

tss

econ

d

Linux 26 BIC TCPLinux 24Linux 26 BIC off

RTT = 67 ms

ldquoBinary Increase Congestionrdquo control algorithm impact

Note that BIC reaches max throughput much faster than older algorithms (from Linux 2619 the

default is CUBIC a refined version of BIC designed for high bandwidth

long paths)

25

Transport Modern TCP stackEven modern TCP stacks are only of some help in the face of

packet loss on a long path high-speed network

bull For a detailed analysis of the impact of packet loss on various TCP implementations see ldquoAn Investigation into Transport Protocols and Data Transport Applications Over High Performance Networksrdquo chapter 8 (ldquoSystematic Tests of New-TCP Behaviourrdquo) by Yee-Ting Li University College London (PhD thesis) httpwwwslacstanfordedu~ytlthesispdf

Reno (measured)

Reno (theory)

H-TCP (CUBIC refinement)(measured)

Throughput vs increasing latency on a 10Gbs link with 00046 packet loss(tail zoom)

Roundtrip time ms (corresponds roughly to San Francisco to London)

1000

900800700600500400300200100

0

Thro

ughp

ut M

bs

26

3) Monitoring and testingThe only way to keep multi-domain international scale networks error-free is to test and monitor continuously

end-to-end to detect soft errors and facilitate their isolation and correction

perfSONAR provides a standardize way to test measure export catalogue and access performance data from many different network domains (service providers campuses etc)

bull perfSONAR is a community effort tondash define network management data exchange protocols andndash standardized measurement data formats gathering and archiving

perfSONAR is deployed extensively throughout LHC related networks and international networks and at the end sites(See [fasterdata] [perfSONAR] and [NetSrv])

ndash There are now more than 1000 perfSONAR boxes installed in N America and Europe

27

perfSONARThe test and monitor functions can detect soft errors that limit

throughput and can be hard to find (hard errors faults are easily found and corrected)

Soft failure examplebull Observed end-to-end performance degradation due to soft failure of single optical line card

Gb

s

normal performance

degrading performance

repair

bull Why not just rely on ldquoSNMPrdquo interface stats for this sort of error detectionbull not all error conditions show up in SNMP interface statisticsbull SNMP error statistics can be very noisybull some devices lump different error counters into the same bucket so it can be very

challenging to figure out what errors to alarm on and what errors to ignorebull though ESnetrsquos Spectrum monitoring system attempts to apply heuristics to do this

bull many routers will silently drop packets - the only way to find that is to test through them and observe loss using devices other than the culprit device

one month

28

perfSONARThe value of perfSONAR increases dramatically as it is

deployed at more sites so that more of the end-to-end (app-to-app) path can characterized across multiple network domains provides the only widely deployed tool that can monitor circuits end-

to-end across the different networks from the US to Europendash ESnet has perfSONAR testers installed at every PoP and all but the

smallest user sites ndash Internet2 is close to the same

bull perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) and the protocol is implemented using SOAP XML messages

29

4) System software evolution and optimizationOnce the network is error-free there is still the issue of

efficiently moving data from the application running on a user system onto the network

bull Host TCP tuningbull Modern TCP stack (see above)bull Other issues (MTU etc)

bull Data transfer tools and parallelism

bull Other data transfer issues (firewalls etc)

30

41) System software tuning Host tuning ndash TCPbull ldquoTCP tuningrdquo commonly refers to the proper configuration of

TCP windowing buffers for the path lengthbull It is critical to use the optimal TCP send and receive socket

buffer sizes for the path (RTT) you are using end-to-endDefault TCP buffer sizes are typically much too small for

todayrsquos high speed networksndash Until recently default TCP sendreceive buffers were typically 64 KBndash Tuned buffer to fill CA to NY 1 Gbs path 10 MB

bull 150X bigger than the default buffer size

31

System software tuning Host tuning ndash TCPbull Historically TCP window size tuning parameters were host-

global with exceptions configured per-socket by applicationsndash How to tune is a function of the application and the path to the

destination so potentially a lot of special cases

Auto-tuning TCP connection buffer size within pre-configured limits helps

Auto-tuning however is not a panacea because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (eg international) paths

32

System software tuning Host tuning ndash TCP

Throughput out to ~9000 km on a 10Gbs network32 MBy (auto-tuned) vs 64 MBy (hand tuned) TCP window size

hand tuned to 64 MBy window

Roundtrip time ms (corresponds roughlyto San Francisco to London)

path length

10000900080007000600050004000300020001000

0

Thro

ughp

ut M

bs

auto tuned to 32 MBy window

33

42) System software tuning Data transfer toolsParallelism is key in data transfer tools

ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection

bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)

ndash Several tools offer parallel transfers (see below)

Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN

transfersndash Many tools and protocols assume latencies typical of a LAN

environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long

path networks

bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more

than about 500 Mbs

34

System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL

RTT = 53 ms network capacity = 10GbpsTool Throughput

bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology

bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase

bull this helps rsync too

35

System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-

performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open

ports) ssh etc The newer Globus Online incorporates all of these and small file

support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community

outside of HEP

36

System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach

ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node

ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and

httpmonalisacernchFDT

37

44) System software tuning Other issuesFirewalls are anathema to high-peed data flows

ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for

TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo

Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf

bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning

bull Defaults are usually fine for 1GE but 10GE often requires additional tuning

ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo

([HPBulk])

5) Site infrastructure to support data-intensive scienceThe Science DMZ

With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the

bottleneckThe site network (LAN) typically provides connectivity for local

resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network

and the local area site network is critical for large-scale data movement

Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks

for business and small data-flow purposes usually donrsquot work for large-scale data flows

bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data

flows

39

The Science DMZTo provide high data-rate access to local resources the site

LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high

speed data path all the way back to the source

40

The Science DMZThe ScienceDMZ concept

The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and

rapid fault isolation typically perfSONAR (see [perfSONAR] and below)

A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)

This is so important it was a requirement for last round of NSF CC-NIE grants

41

The Science DMZ

(See httpfasterdataesnetscience-dmz

and [SDMZ] for a much more complete

discussion of the various approaches)

campus siteLAN

high performanceData Transfer Node

computing cluster

cleanhigh-bandwidthWAN data path

campussiteaccess to

Science DMZresources is via the site firewall

secured campussiteaccess to Internet

border routerWAN

Science DMZrouterswitch

campus site

Science DMZ

Site DMZ WebDNS

Mail

network monitoring and testing

A WAN-capable device

per-servicesecurity policycontrol points

site firewall

dedicated systems built and

tuned for wide-area data transfer

42

6) Data movement and management techniquesAutomated data movement is critical for moving 500

terabytesday between 170 international sites In order to effectively move large amounts of data over the

network automated systems must be used to manage workflow and error recovery

bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers

bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)

43

Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the

analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates

compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day

bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10

petabytes of datayear in order to accomplish its science

bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial

44

DDMAgent

DDMAgent

ATLAS production

jobs

Regional production

jobs

User Group analysis jobs

Data Service

Task Buffer(job queue)

Job Dispatcher

PanDA Server(task management)

Job Broker

Policy(job type priority)

ATLA

S Ti

er 1

Data

Cen

ters

11 s

ites

scat

tere

d ac

ross

Euro

pe N

orth

Am

erica

and

Asia

in

aggr

egat

e ho

ld 1

copy

of a

ll dat

a an

d pr

ovide

the

work

ing

data

set f

or d

istrib

ution

to T

ier 2

cen

ters

for a

nalys

isDistributed

Data Manager

Pilot Job(Panda job

receiver running under the site-

specific job manager)

Grid Scheduler

Site Capability Service

CERNATLAS detector

Tier 0 Data Center(1 copy of all data ndash

archival only)

Job resource managerbull Dispatch a ldquopilotrdquo job manager - a

Panda job receiver - when resources are available at a site

bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA

bull Similar to the Condor Glide-in approach

Site status

ATLAS analysis sites(eg 70 Tier 2 Centers in

Europe North America and SE Asia)

DDMAgent

DDMAgent

1) Schedules jobs initiates data movement

2) DDM locates data and moves it to sites

This is a complex system in its own right called DQ2

3) Prepares the local resources to receive Panda jobs

4) Jobs are dispatched when there are resources available and when the required data is

in place at the site

Thanks to Michael Ernst US ATLAS technical lead for his assistance with this

diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)

The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday

CERN

Try to move the job to where the data is else move data and job to where

resources are available

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe N America and SE Asia generate network

data movement of 730 TByday ~68Gbs

Accumulated data volume on disk

730 TBytesday

PanDA manages 120000ndash140000 simultaneous jobs    (PanDA manages two types of jobs that are shown separately here)

It is this scale of data movementgoing on 24 hrday 9+ monthsyr

that networks must support in order to enable the large-scale science of the LHC

0

50

100

150

Peta

byte

s

four years

0

50000

100000

type

2jo

bs

0

50000

type

1jo

bs

one year

one year

one year

46

Building an LHC-scale production analysis system In order to debug and optimize the distributed system that

accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in

ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC

production

47

Ramp-up of LHC traffic in ESnet

(est of ldquosmallrdquo scale traffic)

LHC

turn

-on

LHC data systemtesting

LHC operationThe transition from testing to operation

was a smooth continuum due toat-scale testing ndash a process that took

more than 5 years

48

6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument

to data centers ndash a dedicated purpose-built infrastructure is needed

bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to

the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the

Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward

exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community

bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by

bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])

ndash that is only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

[Diagram: LHCONE – a global infrastructure for the LHC Tier 1 data center to Tier 2 analysis center connectivity (April 2012). LHCONE VRF domains – ESnet (USA), Internet2 (USA), CANARIE (Canada), GÉANT (Europe), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), TWAREN and ASGC (Taiwan), KERONET2 and KISTI (Korea), CUDI (Mexico), India, and others – interconnect at exchange points (Seattle, Chicago, New York, Washington, Amsterdam, Geneva, ...) and front the Tier 1 centers (BNL-T1, FNAL-T1, TRIUMF-T1, ASGC-T1, NDGF-T1a/c, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, CERN-T1) as well as many Tier 2/Tier 3 end sites (Harvard, MIT, Caltech, UFlorida, UNeb, PurU, UCSD, UWisc, UltraLight, UMich, SLAC, UVic, SimFraU, UAlb, UTor, McGilU, NCU, NTU, KNU, UNAM, DESY, GSI, INFN-Nap, GRIF-IN2P3, Sub-IN2P3, CEA, TIFR, ...). Legend: LHCONE VRF domain; end sites are LHC Tier 2 or Tier 3 unless indicated as Tier 1; regional R&E communication nexus; data communication links of 10, 20, and 30 Gb/s. See http://lhcone.net for details.]

55

The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because
 – the VRF technology is a standard capability in most core routers, and
 – there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance
• See lhcone.net

LHCONE is one part of the network infrastructure that supports the LHC

CERN → T1          miles    kms
France               350     565
Italy                570     920
UK                   625    1000
Netherlands          625    1000
Germany              700    1185
Spain                850    1400
Nordic              1300    2100
USA – New York      3900    6300
USA – Chicago       4400    7100
Canada – BC         5200    8400
Taiwan              6100    9850

[Diagram: A Network Centric View of the LHC. The detector feeds the Level 1 and 2 triggers (O(1-10) meters) at 1 PB/s, then the Level 3 trigger (O(10-100) meters), then the CERN Computer Center (O(1) km) at 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS). The LHC Optical Private Network (LHCOPN) carries the data 500-10,000 km from CERN to the LHC Tier 1 data centers (Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN), and the LHC Open Network Environment (LHCONE) connects the Tier 1 centers to the LHC Tier 2 analysis centers (universities and physics groups). This is intended to indicate that the physics groups now get their data wherever it is most readily available.]

57

7) New network services: Point-to-Point Virtual Circuit Service

Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to
 – Couple existing pockets of code, data and expertise into "systems of systems"
 – Break up the task of massive data analysis and use data, compute and storage resources that are located at the collaborators' sites
 – See https://www.es.net/about/science-requirements/
• A commonly identified need to support this is that networking must be provided as a "service" (a sketch of such a request follows below):
 – Schedulable with guaranteed bandwidth – as is done with CPUs and disks
 – Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
 – Some network path characteristics may also be specified – e.g. diversity
 – Available in a Web Services / Grid Services paradigm
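To make the "network as a service" idea concrete, the following is a minimal, hypothetical sketch (not the OSCARS or NSI request schema – all field names and endpoints are invented for illustration) of the information such a schedulable, guaranteed-bandwidth circuit request has to carry:

```python
# Hypothetical illustration only: roughly what a schedulable, guaranteed-bandwidth
# circuit request must specify. Field names and endpoints are invented and do not
# correspond to the OSCARS or NSI schemas.
from datetime import datetime, timedelta

start = datetime(2014, 4, 1, 2, 0)
circuit_request = {
    "src_endpoint": "site-a-dtn.example.org",      # hypothetical edge host/port
    "dst_endpoint": "site-b-dtn.example.org",
    "bandwidth_mbps": 5000,                        # guaranteed rate
    "start_time": start,                           # schedulable, like CPUs and disks
    "end_time": start + timedelta(hours=12),
    "path_constraints": {"diverse_from": None},    # e.g. request path diversity
    "vlan_id": 3012,                               # data-plane identifier for isolation
}
print(circuit_request["bandwidth_mbps"], "Mb/s requested starting", start)
```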

58

Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
 – This is typically done by using a "static" routing mechanism
   • e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
 – MPLS and OpenFlow are examples of this, and both can transport IP packets
 – Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
 – The virtual circuits can be directed to specific physical network paths when they are set up (see the sketch below)
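The "static switch tables set up in advance" idea can be illustrated with a small sketch (a generic model of label switching, not MPLS or OpenFlow code; switch and port names are invented):

```python
# A generic model of the pre-installed label-switching tables that define a
# virtual circuit: each switch maps (incoming port, label) to (outgoing port,
# label), so every packet carrying that label follows the same engineered path.
FORWARDING_TABLES = {
    "seattle": {("port1", 100): ("port7", 210)},
    "chicago": {("port3", 210): ("port5", 310)},
    "newyork": {("port2", 310): ("port9", None)},  # None: pop label, deliver locally
}

def forward(switch: str, in_port: str, label: int):
    """Return the (out_port, out_label) that the provisioned circuit dictates."""
    return FORWARDING_TABLES[switch][(in_port, label)]

# A packet entering Seattle on port1 with label 100 deterministically follows
# the Seattle -> Chicago -> New York path chosen when the circuit was set up.
print(forward("seattle", "port1", 100))   # ('port7', 210)
```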

59

Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead, Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references
• OSCARS received a 2013 "R&D 100" award

60

End User View of Circuits – How They Use Them
• Who are the "users"?
 – Sites, for the most part
• How are the circuits used?
 – End system to end system, IP
   • Almost never – very hard unless private address space is used
   • Using public address space can result in leaking routes
   • Using private address space with multi-homed hosts risks allowing backdoors into secure networks
 – End system to end system, Ethernet (or other) over VLAN – a pseudowire
   • Relatively common
   • Interesting example: RDMA over VLAN, likely to be popular in the future
     – SC11 demo of 40G RDMA over WAN was very successful
     – CPU load for RDMA is a small fraction of that of IP
     – The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks)
 – Point-to-point connection between routing instances – e.g. BGP at the end points
   • Essentially this is how all current circuits are used: from one site router to another site router
   • Typically site-to-site, or to advertise subnets that host clusters, e.g. LHC analysis or data management clusters

61

End User View of Circuits – How They Use Them
• When are the circuits used?
 – Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Service
• Large-scale science always involves institutions in multiple network domains (administrative units)
 – For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
 – e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains

63

Inter-Domain Control Protocol
• There are two realms involved:
 1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
 2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains

[Diagram: inter-domain virtual circuit setup from a user source at FNAL (AS3152) [US] across ESnet (AS293) [US] and GEANT (AS20965) [Europe] to DFN (AS680) [Germany] and a user destination at DESY (AS1754) [Germany]. Each domain runs a local InterDomain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GEANT – with a data plane connection helper at each domain ingress/egress point.
1. The domains exchange topology information containing at least potential VC ingress and egress points
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved
3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process
The result is the end-to-end virtual circuit.]
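The figure's control flow can be summarized in a few lines of illustrative Python (a sketch of the chaining idea only – this is not the IDC or NSI protocol, and the reservation logic is invented):

```python
# Sketch of inter-domain circuit setup: the request is passed from domain to
# domain, each domain controller authorizes and reserves its own segment, and
# the end-to-end circuit exists only if every domain along the path agrees.
DOMAIN_PATH = ["FNAL", "ESnet", "GEANT", "DFN", "DESY"]   # as in the figure

def reserve_segment(domain: str, bandwidth_mbps: int) -> bool:
    # Stand-in for each domain's local scheduling/authorization (e.g. OSCARS, AutoBAHN).
    print(f"{domain}: reserving a {bandwidth_mbps} Mb/s segment")
    return True

def setup_circuit(bandwidth_mbps: int) -> bool:
    reserved = []
    for domain in DOMAIN_PATH:                  # setup request chained along the path
        if not reserve_segment(domain, bandwidth_mbps):
            for d in reversed(reserved):        # a refusal unwinds earlier reservations
                print(f"{d}: releasing segment")
            return False
        reserved.append(domain)
    return True                                 # all segments reserved end-to-end

setup_circuit(5000)
```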

64

Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
 – Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system
• Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net

65

8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
 – With each generation of network transport technology:
   • 155 Mb/s was the norm for high speed networks in 1995
   • 100 Gb/s – 650 times greater – is the norm today
   • R&D groups involving hardware engineers, computer scientists, and application specialists worked to
     – first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
     – and then do the development necessary for applications to make use of the new capabilities
 – Examples of how this methodology drove toward today's capabilities include:
   • experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
   • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s

66

Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations

67

The knowledge base: http://fasterdata.es.net topics
 – Network Architecture, including the Science DMZ model
 – Host Tuning
 – Network Tuning
 – Data Transfer Tools
 – Network Performance Testing
 – With special sections on:
   • Linux TCP Tuning
   • Cisco 6509 Tuning
   • perfSONAR Howto
   • Active perfSONAR Services
   • Globus overview
   • Say No to SCP
   • Data Transfer Nodes (DTN)
   • TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations

68

The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

69

Infrastructure Critical to Science
• The combination of
 – New network architectures in the wide area
 – New network services (such as guaranteed bandwidth virtual circuits)
 – Cross-domain network error detection and correction
 – Redesigning the site LAN to handle high data throughput
 – Automation of data movement systems
 – Use of appropriate operating system tuning and data transfer tools
 now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as is the stored data volume, which grows with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKA
The lessons:
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one location
 – A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
 – The technical aspects of building and operating a centralized working data repository –
   • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
   • high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
 – militate against a single large data center

72

LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well
 – It decentralizes costs and involves many countries directly in the telescope infrastructure
 – It divides up the network load, especially on the expensive trans-ocean links
 – It divides up the cache I/O load across distributed sites

73

LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
 – It might be that in the case of the SKA the T1 links would come to a centralized, data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue
 – In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
 – In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, e.g., are implementing LHCONE

74

LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded
 – All high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
 – New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical: the Science DMZ
Workflow management systems that automate the data movement will have to be designed and tested
 – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on

75

The Message
Again ... A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C., Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf


24

Transport: Modern TCP stack
• A modern TCP stack (the kernel implementation of the TCP protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])
 – This is done using mechanisms that more quickly increase back to full speed after an error forces a reset to low bandwidth

[Plot: TCP results – throughput (Mbits/second, 0 to 800) vs. time slot (5 second intervals) for Linux 2.6 with BIC TCP, Linux 2.4, and Linux 2.6 with BIC off; RTT = 67 ms. "Binary Increase Congestion" control algorithm impact: note that BIC reaches max throughput much faster than older algorithms. (From Linux 2.6.19 the default is CUBIC, a refined version of BIC designed for high bandwidth, long paths.)]
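Whether a host is actually running one of these modern congestion control algorithms is easy to check; a minimal sketch for a Linux data transfer host (the /proc entries are the standard Linux ones; this is Linux-specific):

```python
# Report which TCP congestion control algorithm the Linux kernel is using;
# on recent kernels the default is typically "cubic", the BIC refinement
# mentioned above.
def read_proc(path: str) -> str:
    with open(path) as f:
        return f.read().strip()

print("in use:   ", read_proc("/proc/sys/net/ipv4/tcp_congestion_control"))
print("available:", read_proc("/proc/sys/net/ipv4/tcp_available_congestion_control"))
```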

25

Transport: Modern TCP stack
Even modern TCP stacks are only of some help in the face of packet loss on a long path, high-speed network
• For a detailed analysis of the impact of packet loss on various TCP implementations, see "An Investigation into Transport Protocols and Data Transport Applications Over High Performance Networks," chapter 8 ("Systematic Tests of New-TCP Behaviour"), by Yee-Ting Li, University College London (PhD thesis): http://www.slac.stanford.edu/~ytl/thesis.pdf

[Plot: throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss (tail zoom). Throughput (Mb/s, 0 to 1000) vs. round-trip time (ms; the longest times correspond roughly to San Francisco to London) for Reno (measured), Reno (theory), and H-TCP (CUBIC refinement, measured).]
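The scaling behind this plot can be made explicit with the well-known Mathis et al. approximation for loss-limited, Reno-style TCP throughput (a standard rule of thumb, not taken from the slides): rate ≈ (MSS/RTT) · 1.22/√loss.

```python
# Mathis et al. approximation for Reno-style, loss-limited TCP throughput.
# It shows why a loss rate that is harmless on a LAN is crippling once the
# round-trip time is large. The RTT values below are illustrative.
from math import sqrt

def mathis_mbps(mss_bytes: int, rtt_s: float, loss: float) -> float:
    return (mss_bytes * 8 / rtt_s) * (1.22 / sqrt(loss)) / 1e6

MSS = 1460            # bytes, typical Ethernet payload
LOSS = 0.000046       # the 0.0046% packet loss rate of the plot above
for rtt_ms in (10, 50, 88):                    # ~88 ms: San Francisco to London
    print(f"RTT {rtt_ms:3d} ms -> ~{mathis_mbps(MSS, rtt_ms / 1000, LOSS):5.0f} Mb/s")
```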

26

3) Monitoring and testing
The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction
• perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.)
• perfSONAR is a community effort to
 – define network management data exchange protocols, and
 – standardize measurement data formats, gathering, and archiving
• perfSONAR is deployed extensively throughout LHC-related networks and international networks, and at the end sites (see [fasterdata], [perfSONAR], and [NetServ])
 – There are now more than 1000 perfSONAR boxes installed in N. America and Europe

27

perfSONAR
The test and monitor functions can detect soft errors that limit throughput and can be hard to find (hard errors / faults are easily found and corrected)

Soft failure example:
• Observed end-to-end performance degradation due to soft failure of a single optical line card
[Plot: throughput in Gb/s over one month – normal performance, then degrading performance, then recovery after repair]
• Why not just rely on "SNMP" interface stats for this sort of error detection?
 – not all error conditions show up in SNMP interface statistics
 – SNMP error statistics can be very noisy
 – some devices lump different error counters into the same bucket, so it can be very challenging to figure out what errors to alarm on and what errors to ignore
   • though ESnet's Spectrum monitoring system attempts to apply heuristics to do this
 – many routers will silently drop packets – the only way to find that is to test through them and observe loss using devices other than the culprit device

28

perfSONAR
The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains
• It provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe
 – ESnet has perfSONAR testers installed at every PoP and all but the smallest user sites – Internet2 is close to the same
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages

29

4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network
• Host TCP tuning
• Modern TCP stack (see above)
• Other issues (MTU, etc.)
• Data transfer tools and parallelism
• Other data transfer issues (firewalls, etc.)

30

4.1) System software tuning: Host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using end-to-end
• Default TCP buffer sizes are typically much too small for today's high speed networks
 – Until recently, default TCP send/receive buffers were typically 64 KB
 – Tuned buffer to fill a CA to NY, 1 Gb/s path: 10 MB (see the worked example below)
   • 150X bigger than the default buffer size
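The 10 MB figure is just the bandwidth-delay product (BDP) of the path: the sender must be able to keep a full round trip's worth of data in flight. A small worked example (the RTT values are illustrative assumptions, not measurements from the slides):

```python
# Bandwidth-delay product: the TCP window needed to fill a path is
# (path rate) x (round-trip time), expressed here in megabytes.
def bdp_mbytes(rate_bps: float, rtt_s: float) -> float:
    return rate_bps * rtt_s / 8 / 1e6

print(f"{bdp_mbytes(1e9, 0.075):6.1f} MB   # 1 Gb/s, CA to NY (~75 ms RTT): the ~10 MB above")
print(f"{bdp_mbytes(10e9, 0.088):6.1f} MB   # 10 Gb/s, SF to London (~88 ms RTT)")
# On Linux, the auto-tuning ceiling is the third field of net.ipv4.tcp_rmem /
# net.ipv4.tcp_wmem; it has to be raised to at least the BDP for paths like these.
```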

31

System software tuning: Host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications
 – How to tune is a function of the application and the path to the destination, so potentially a lot of special cases
• Auto-tuning the TCP connection buffer size within pre-configured limits helps
• Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths

32

System software tuning: Host tuning – TCP

[Plot: throughput out to ~9000 km on a 10 Gb/s network, 32 MBy (auto-tuned) vs. 64 MBy (hand-tuned) TCP window size. Throughput (Mb/s, 0 to 10000) vs. path length (round-trip time in ms; the longest corresponds roughly to San Francisco to London), with one curve hand-tuned to a 64 MBy window and one auto-tuned to a 32 MBy window.]

33

4.2) System software tuning: Data transfer tools
Parallelism is key in data transfer tools
 – It is much easier to achieve a given performance level with multiple parallel connections than with one connection
   • this is because the OS is very good at managing multiple threads and less good at sustained, maximum performance of a single thread (the same is true for disks)
 – Several tools offer parallel transfers (see below; a minimal sketch follows this slide)
Latency tolerance is critical
 – Wide area data transfers have much higher latency than LAN transfers
 – Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds); examples: SCP/SFTP and HPSS mover protocols work very poorly in long path networks
• Disk performance
 – In general, a RAID array or parallel disks (as in FDT) are needed to get more than about 500 Mb/s
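As a sketch of why parallel streams help (this is not GridFTP or FDT – just a minimal standard-library example, and the URL and chunk layout are hypothetical): several TCP connections each carry part of the transfer, so one loss-limited stream does not cap the aggregate rate.

```python
# Minimal illustration of parallel-stream data transfer: fetch byte ranges of a
# large object over several concurrent TCP connections and reassemble them.
# Assumes an HTTP server that honors Range requests; URL and sizes are made up.
import concurrent.futures
import urllib.request

URL = "https://data.example.org/large-dataset.bin"   # hypothetical data source
NUM_STREAMS = 8                                       # cf. "GridFTP, 8 streams" below
CHUNK = 64 * 1024 * 1024                              # 64 MB per range request

def fetch_range(offset: int) -> bytes:
    hdr = {"Range": f"bytes={offset}-{offset + CHUNK - 1}"}
    with urllib.request.urlopen(urllib.request.Request(URL, headers=hdr)) as resp:
        return resp.read()

def parallel_fetch(total_size: int) -> bytes:
    offsets = range(0, total_size, CHUNK)
    with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_STREAMS) as pool:
        return b"".join(pool.map(fetch_range, offsets))   # streams run concurrently
```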

34

System software tuning: Data transfer tools
Using the right tool is very important
Sample results: Berkeley, CA to Argonne, IL; RTT = 53 ms, network capacity = 10 Gbps

  Tool                     Throughput
  scp                      140 Mbps
  patched scp (HPN)        1.2 Gbps
  ftp                      1.4 Gbps
  GridFTP, 4 streams       5.4 Gbps
  GridFTP, 8 streams       6.6 Gbps

Note that to get more than about 1 Gbps (125 MB/s) disk to disk requires using RAID technology
• PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSH
 – http://www.psc.edu/networking/projects/hpn-ssh
 – Significant performance increase
   • this helps rsync too

35

System software tuning: Data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems
 – Parallel streams and buffer tuning help in getting through firewalls (open ports), ssh, etc.
 – The newer Globus Online incorporates all of these plus small file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP

36

System software tuning: Data transfer tools
Also see Caltech's FDT (Faster Data Transfer) approach
 – Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node
 – Explicit parallel use of multiple disks
 – Can fill 100 Gb/s paths
 – See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT

37

4.4) System software tuning: Other issues
• Firewalls are anathema to high-speed data flows
 – many firewalls can't handle >1 Gb/s flows
   • designed for large numbers of low bandwidth flows
   • some firewalls even strip out TCP options that allow for TCP buffers > 64 KB
 – See Jason Zurawski's "Say Hello to your Frienemy – The Firewall"
 – Stateful firewalls have inherent problems that inhibit high throughput
   • http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues
 – Large MTUs (several issues)
 – NIC tuning
   • Defaults are usually fine for 1GE, but 10GE often requires additional tuning
 – Other OS tuning knobs
 – See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])

5) Site infrastructure to support data-intensive science: The Science DMZ
With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck
The site network (LAN) typically provides connectivity for local resources – compute, data, instruments, collaboration systems, etc. – needed by data-intensive science
 – Therefore a high performance interface between the wide area network and the local area site network is critical for large-scale data movement
Campus network infrastructure is typically not designed to handle the flows of large-scale science
 – The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows
   • firewalls, proxy servers, low-cost switches, and so forth
   • none of which will allow high volume, high bandwidth, long distance data flows

39

The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high bandwidth, large data volume, and high round trip time (RTT) (international paths) of the wide area network (WAN) flows (see [DIS])
 – otherwise the site will impose poor performance on the entire high speed data path, all the way back to the source

40

The Science DMZ
The Science DMZ concept:
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy:
 – Outside the site firewall – hence the term "Science DMZ"
 – With dedicated systems built and tuned for wide-area data transfer
 – With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below)
 – A security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.)
This is so important it was a requirement for the last round of NSF CC-NIE grants

41

The Science DMZ

[Diagram: Science DMZ architecture. The border router connects the WAN to a Science DMZ router/switch (a WAN-capable device) that sits outside the site firewall and provides a clean, high-bandwidth WAN data path to dedicated systems built and tuned for wide-area data transfer (high performance Data Transfer Nodes) and to network monitoring and testing systems, with per-service security policy control points. The campus/site LAN and computing cluster, and the site DMZ (web, DNS, mail), sit behind the site firewall, which also provides secured campus/site access to the Internet; campus/site access to Science DMZ resources is via the site firewall. See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.]

42

6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites
• In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers
• The Tier 2 sites get a comparable amount of data from the Tier 1s
 – They host the physics groups that analyze the data and do the science
 – They provide most of the compute resources for analysis
 – They cache the data (though this is evolving to remote I/O)

43

Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management
 – The resources and data movement are centrally managed
 – Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations
 – The system manages tens of thousands of jobs a day
   • coordinates data movement of hundreds of terabytes/day, and
   • manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial

44

[Diagram: the ATLAS PanDA "Production and Distributed Analysis" system uses distributed resources and layers of automation to manage several million jobs/day.
ATLAS production jobs, regional production jobs, and user/group analysis jobs enter the PanDA Server (task management) through the Task Buffer (job queue), subject to policy (job type priority); the Job Broker matches jobs with resources and dataset locations, and the Job Dispatcher sends them out. The strategy is to try to move the job to where the data is, else move data and job to where resources are available.
1) PanDA schedules jobs and initiates data movement. 2) The Distributed Data Manager (DDM agents – a complex system in its own right, called DQ2) locates data and moves it to sites. 3) The local resources are prepared to receive PanDA jobs: a "pilot" job manager – a PanDA job receiver – is dispatched when resources are available at a site; pilots run under the local site job manager (e.g. Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA (similar to the Condor glide-in approach), with site status and site capability services consulted via the grid scheduler. 4) Jobs are dispatched when there are resources available and when the required data is in place at the site.
The CERN ATLAS detector feeds the Tier 0 Data Center (1 copy of all data – archival only). The ATLAS Tier 1 Data Centers – 11 sites scattered across Europe, North America and Asia – in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis. The ATLAS analysis sites are, e.g., 70 Tier 2 centers in Europe, North America and SE Asia.
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]
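The brokering idea in the figure ("try to move the job to where the data is, else move data and job to where resources are available") can be sketched in a few lines. This is deliberately simplified, is not PanDA code, and all names and numbers are invented:

```python
# Simplified sketch of data-aware job brokering: prefer a site that already
# holds the input dataset and has free slots; otherwise pick the site with the
# most capacity and ship the data there (stand-in for a DDM/DQ2 transfer).
def broker(dataset: str, sites: dict) -> tuple:
    for name, s in sites.items():
        if dataset in s["datasets"] and s["free_slots"] > 0:
            return name, False                      # run the job where the data is
    name = max(sites, key=lambda n: sites[n]["free_slots"])
    sites[name]["datasets"].add(dataset)            # move the data to the resources
    return name, True

sites = {
    "T1-A": {"free_slots": 0,   "datasets": {"datasetX.AOD"}},
    "T2-B": {"free_slots": 250, "datasets": set()},
}
print(broker("datasetX.AOD", sites))   # ('T2-B', True): the data follows the job
```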

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s

[Plots spanning four years: the accumulated data volume on disk grows to roughly 150 petabytes, increasing by about 730 TBytes/day, and PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately).]

It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC
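A quick sanity check of the headline number (simple arithmetic, not new data): 730 TBytes/day sustained is indeed about 68 Gb/s averaged around the clock.

```python
# 730 TBytes/day expressed as an average bit rate.
tbytes_per_day = 730
bits_per_day = tbytes_per_day * 1e12 * 8
print(f"{bits_per_day / 86400 / 1e9:.1f} Gb/s")   # ~67.6 Gb/s
```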

46

Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure
 – Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges"
 – Successful testing was required for sites to participate in LHC production

47

Ramp-up of LHC traffic in ESnet

[Plot: ESnet traffic growth over time (with an estimate of "small"-scale traffic) through the LHC data system testing period, LHC turn-on, and LHC operation. The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.]

48

6 cont.) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN
 – The LHCOPN is a collection of leased 10 Gb/s optical circuits
 – The role of the LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward

exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community

bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by

bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])

ndash that is only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKA: the lessons

The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location:

– A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.

– The technical aspects of building and operating a centralized working data repository:

• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

• high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites (the quick calculation after this slide gives a sense of the sustained rates involved)

militate against a single large data center.
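A back-of-the-envelope check, using the roughly 730 TB/day of analysis-driven ATLAS data movement quoted earlier in this talk purely as an illustrative number, shows the kind of sustained WAN rate a single working repository would have to both absorb and re-export:

# Sustained rate implied by a daily data volume (illustration only; the
# 730 TB/day figure is the ATLAS example used earlier in this talk).
tb_per_day = 730
gbps = tb_per_day * 1e12 * 8 / 86400 / 1e9
print("%.1f Gb/s sustained, around the clock" % gbps)   # about 68 Gb/s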

72

LHC lessons of possible use to the SKA: the lessons

The LHC model of distributed data (multiple regional centers) has worked well:

– It decentralizes costs and involves many countries directly in the telescope infrastructure.

– It divides up the network load, especially on the expensive trans-ocean links.

– It divides up the cache I/O load across distributed sites.

73

LHC lessons of possible use to the SKA: the lessons

Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply:

There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).

• It might be that in the case of the SKA the T1 links would come to a centralized, data-distribution-only node (say, in the UK) that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.

• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).

If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.

– In fact, it might well be that the SKA could use the LHCONE infrastructure; that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.

74

LHC lessons of possible use to the SKA: the lessons

All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (a rough quantitative illustration follows this slide). New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.

Re-engineering the site LAN-WAN architecture is critical: the Science DMZ.

Workflow management systems that automate the data movement will have to be designed and tested.

– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" (simulated operation), building up to at-scale data movement well before instrument turn-on.
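The quantitative illustration referred to above is a small calculation using the well-known Mathis et al. approximation for loss-limited, Reno-style TCP: throughput is bounded by roughly MSS/RTT times 1.22 over the square root of the packet loss probability. The RTT and loss values below are arbitrary examples; the point is how quickly even tiny loss rates cap single-stream throughput on long paths.

# Why long-RTT paths must be kept essentially error-free: the Mathis
# approximation for Reno-style TCP, rate <= (MSS/RTT) * (1.22 / sqrt(p)),
# where p is the packet loss probability. Values below are examples only.
from math import sqrt

MSS_BITS = 1460 * 8    # bits per full-size segment payload

def reno_limit_gbps(rtt_ms, loss):
    return (MSS_BITS / (rtt_ms / 1e3)) * (1.22 / sqrt(loss)) / 1e9

for rtt_ms in (1, 10, 100):                  # LAN, regional, intercontinental-ish
    for loss in (1e-7, 1e-5, 1e-3):
        print("RTT %3d ms, loss %.0e  ->  at most %7.3f Gb/s"
              % (rtt_ms, loss, reno_limit_gbps(rtt_ms, loss)))

Modern stacks such as H-TCP or CUBIC recover faster than Reno, as the throughput-versus-RTT plot earlier in the talk showed, but the qualitative conclusion is the same: keep the path clean.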

75

The Message

Again: a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References

[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0, "100G Ultra Long Haul DWDM Framework Document" (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References

[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 October 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References

[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOE's Office of Science
  • DOE Office of Science and ESnet – the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 4.1) System software tuning: Host tuning – TCP
  • System software tuning: Host tuning – TCP
  • System software tuning: Host tuning – TCP
  • 4.2) System software tuning: Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 4.4) System software tuning: Other issues
  • 5) Site infrastructure to support data-intensive science: The Science DMZ
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN – Optical Private Network
  • The LHC OPN – Optical Private Network (2)
  • The LHC OPN – Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHC's Open Network Environment – LHCONE
  • Slide 54
  • The LHC's Open Network Environment – LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits – How They Use Them
  • End User View of Circuits – How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide R&D consulting and knowledge base
  • Provide R&D consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 25: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

25

Transport Modern TCP stackEven modern TCP stacks are only of some help in the face of

packet loss on a long path high-speed network

bull For a detailed analysis of the impact of packet loss on various TCP implementations see ldquoAn Investigation into Transport Protocols and Data Transport Applications Over High Performance Networksrdquo chapter 8 (ldquoSystematic Tests of New-TCP Behaviourrdquo) by Yee-Ting Li University College London (PhD thesis) httpwwwslacstanfordedu~ytlthesispdf

Reno (measured)

Reno (theory)

H-TCP (CUBIC refinement)(measured)

Throughput vs increasing latency on a 10Gbs link with 00046 packet loss(tail zoom)

Roundtrip time ms (corresponds roughly to San Francisco to London)

1000

900800700600500400300200100

0

Thro

ughp

ut M

bs

26

3) Monitoring and testingThe only way to keep multi-domain international scale networks error-free is to test and monitor continuously

end-to-end to detect soft errors and facilitate their isolation and correction

perfSONAR provides a standardize way to test measure export catalogue and access performance data from many different network domains (service providers campuses etc)

bull perfSONAR is a community effort tondash define network management data exchange protocols andndash standardized measurement data formats gathering and archiving

perfSONAR is deployed extensively throughout LHC related networks and international networks and at the end sites(See [fasterdata] [perfSONAR] and [NetSrv])

ndash There are now more than 1000 perfSONAR boxes installed in N America and Europe

27

perfSONARThe test and monitor functions can detect soft errors that limit

throughput and can be hard to find (hard errors faults are easily found and corrected)

Soft failure examplebull Observed end-to-end performance degradation due to soft failure of single optical line card

Gb

s

normal performance

degrading performance

repair

bull Why not just rely on ldquoSNMPrdquo interface stats for this sort of error detectionbull not all error conditions show up in SNMP interface statisticsbull SNMP error statistics can be very noisybull some devices lump different error counters into the same bucket so it can be very

challenging to figure out what errors to alarm on and what errors to ignorebull though ESnetrsquos Spectrum monitoring system attempts to apply heuristics to do this

bull many routers will silently drop packets - the only way to find that is to test through them and observe loss using devices other than the culprit device

one month

28

perfSONARThe value of perfSONAR increases dramatically as it is

deployed at more sites so that more of the end-to-end (app-to-app) path can characterized across multiple network domains provides the only widely deployed tool that can monitor circuits end-

to-end across the different networks from the US to Europendash ESnet has perfSONAR testers installed at every PoP and all but the

smallest user sites ndash Internet2 is close to the same

bull perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) and the protocol is implemented using SOAP XML messages

29

4) System software evolution and optimizationOnce the network is error-free there is still the issue of

efficiently moving data from the application running on a user system onto the network

bull Host TCP tuningbull Modern TCP stack (see above)bull Other issues (MTU etc)

bull Data transfer tools and parallelism

bull Other data transfer issues (firewalls etc)

30

41) System software tuning Host tuning ndash TCPbull ldquoTCP tuningrdquo commonly refers to the proper configuration of

TCP windowing buffers for the path lengthbull It is critical to use the optimal TCP send and receive socket

buffer sizes for the path (RTT) you are using end-to-endDefault TCP buffer sizes are typically much too small for

todayrsquos high speed networksndash Until recently default TCP sendreceive buffers were typically 64 KBndash Tuned buffer to fill CA to NY 1 Gbs path 10 MB

bull 150X bigger than the default buffer size

31

System software tuning Host tuning ndash TCPbull Historically TCP window size tuning parameters were host-

global with exceptions configured per-socket by applicationsndash How to tune is a function of the application and the path to the

destination so potentially a lot of special cases

Auto-tuning TCP connection buffer size within pre-configured limits helps

Auto-tuning however is not a panacea because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (eg international) paths

32

System software tuning Host tuning ndash TCP

Throughput out to ~9000 km on a 10Gbs network32 MBy (auto-tuned) vs 64 MBy (hand tuned) TCP window size

hand tuned to 64 MBy window

Roundtrip time ms (corresponds roughlyto San Francisco to London)

path length

10000900080007000600050004000300020001000

0

Thro

ughp

ut M

bs

auto tuned to 32 MBy window

33

42) System software tuning Data transfer toolsParallelism is key in data transfer tools

ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection

bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)

ndash Several tools offer parallel transfers (see below)

Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN

transfersndash Many tools and protocols assume latencies typical of a LAN

environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long

path networks

bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more

than about 500 Mbs

34

System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL

RTT = 53 ms network capacity = 10GbpsTool Throughput

bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology

bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase

bull this helps rsync too

35

System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-

performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open

ports) ssh etc The newer Globus Online incorporates all of these and small file

support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community

outside of HEP

36

System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach

ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node

ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and

httpmonalisacernchFDT

37

44) System software tuning Other issuesFirewalls are anathema to high-peed data flows

ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for

TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo

Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf

bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning

bull Defaults are usually fine for 1GE but 10GE often requires additional tuning

ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo

([HPBulk])

5) Site infrastructure to support data-intensive scienceThe Science DMZ

With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the

bottleneckThe site network (LAN) typically provides connectivity for local

resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network

and the local area site network is critical for large-scale data movement

Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks

for business and small data-flow purposes usually donrsquot work for large-scale data flows

bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data

flows

39

The Science DMZTo provide high data-rate access to local resources the site

LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high

speed data path all the way back to the source

40

The Science DMZThe ScienceDMZ concept

The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and

rapid fault isolation typically perfSONAR (see [perfSONAR] and below)

A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)

This is so important it was a requirement for last round of NSF CC-NIE grants

41

The Science DMZ

(See httpfasterdataesnetscience-dmz

and [SDMZ] for a much more complete

discussion of the various approaches)

campus siteLAN

high performanceData Transfer Node

computing cluster

cleanhigh-bandwidthWAN data path

campussiteaccess to

Science DMZresources is via the site firewall

secured campussiteaccess to Internet

border routerWAN

Science DMZrouterswitch

campus site

Science DMZ

Site DMZ WebDNS

Mail

network monitoring and testing

A WAN-capable device

per-servicesecurity policycontrol points

site firewall

dedicated systems built and

tuned for wide-area data transfer

42

6) Data movement and management techniquesAutomated data movement is critical for moving 500

terabytesday between 170 international sites In order to effectively move large amounts of data over the

network automated systems must be used to manage workflow and error recovery

bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers

bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)

43

Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the

analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates

compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day

bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10

petabytes of datayear in order to accomplish its science

bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial

44

DDMAgent

DDMAgent

ATLAS production

jobs

Regional production

jobs

User Group analysis jobs

Data Service

Task Buffer(job queue)

Job Dispatcher

PanDA Server(task management)

Job Broker

Policy(job type priority)

ATLA

S Ti

er 1

Data

Cen

ters

11 s

ites

scat

tere

d ac

ross

Euro

pe N

orth

Am

erica

and

Asia

in

aggr

egat

e ho

ld 1

copy

of a

ll dat

a an

d pr

ovide

the

work

ing

data

set f

or d

istrib

ution

to T

ier 2

cen

ters

for a

nalys

isDistributed

Data Manager

Pilot Job(Panda job

receiver running under the site-

specific job manager)

Grid Scheduler

Site Capability Service

CERNATLAS detector

Tier 0 Data Center(1 copy of all data ndash

archival only)

Job resource managerbull Dispatch a ldquopilotrdquo job manager - a

Panda job receiver - when resources are available at a site

bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA

bull Similar to the Condor Glide-in approach

Site status

ATLAS analysis sites(eg 70 Tier 2 Centers in

Europe North America and SE Asia)

DDMAgent

DDMAgent

1) Schedules jobs initiates data movement

2) DDM locates data and moves it to sites

This is a complex system in its own right called DQ2

3) Prepares the local resources to receive Panda jobs

4) Jobs are dispatched when there are resources available and when the required data is

in place at the site

Thanks to Michael Ernst US ATLAS technical lead for his assistance with this

diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)

The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday

CERN

Try to move the job to where the data is else move data and job to where

resources are available

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe N America and SE Asia generate network

data movement of 730 TByday ~68Gbs

Accumulated data volume on disk

730 TBytesday

PanDA manages 120000ndash140000 simultaneous jobs    (PanDA manages two types of jobs that are shown separately here)

It is this scale of data movementgoing on 24 hrday 9+ monthsyr

that networks must support in order to enable the large-scale science of the LHC

0

50

100

150

Peta

byte

s

four years

0

50000

100000

type

2jo

bs

0

50000

type

1jo

bs

one year

one year

one year

46

Building an LHC-scale production analysis system In order to debug and optimize the distributed system that

accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in

ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC

production

47

Ramp-up of LHC traffic in ESnet

(est of ldquosmallrdquo scale traffic)

LHC

turn

-on

LHC data systemtesting

LHC operationThe transition from testing to operation

was a smooth continuum due toat-scale testing ndash a process that took

more than 5 years

48

6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument

to data centers ndash a dedicated purpose-built infrastructure is needed

bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to

the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the

Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward

exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community

bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by

bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])

ndash that is only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ

Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using

synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on

75

The MessageAgain hellipA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data

management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

76

References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston

Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz

[fasterdata] See httpfasterdataesnetfasterdataperfSONAR

[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more

[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory

[LHCONE] httplhconenet

[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo

[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

[OIF1] OIF-FD-100G-DWDM-010 - 100G Ultra Long Haul DWDM Framework Document (June 2009) httpwwwoiforumcompublicdocumentsOIF-FD-100G-DWDM-010pdf

77

References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

78

References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations

httpwwwperfsonarnet

httppspsperfsonarnet

[REQ] httpswwwesnetaboutscience-requirements

[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010

(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )

[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223

[Tracy1] httpwwwnanogorgmeetingsnanog55presentationsTuesdayTracypdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 26: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

26

3) Monitoring and testingThe only way to keep multi-domain international scale networks error-free is to test and monitor continuously

end-to-end to detect soft errors and facilitate their isolation and correction

perfSONAR provides a standardize way to test measure export catalogue and access performance data from many different network domains (service providers campuses etc)

bull perfSONAR is a community effort tondash define network management data exchange protocols andndash standardized measurement data formats gathering and archiving

perfSONAR is deployed extensively throughout LHC related networks and international networks and at the end sites(See [fasterdata] [perfSONAR] and [NetSrv])

ndash There are now more than 1000 perfSONAR boxes installed in N America and Europe

27

perfSONARThe test and monitor functions can detect soft errors that limit

throughput and can be hard to find (hard errors faults are easily found and corrected)

Soft failure examplebull Observed end-to-end performance degradation due to soft failure of single optical line card

Gb

s

normal performance

degrading performance

repair

bull Why not just rely on ldquoSNMPrdquo interface stats for this sort of error detectionbull not all error conditions show up in SNMP interface statisticsbull SNMP error statistics can be very noisybull some devices lump different error counters into the same bucket so it can be very

challenging to figure out what errors to alarm on and what errors to ignorebull though ESnetrsquos Spectrum monitoring system attempts to apply heuristics to do this

bull many routers will silently drop packets - the only way to find that is to test through them and observe loss using devices other than the culprit device

one month

28

perfSONARThe value of perfSONAR increases dramatically as it is

deployed at more sites so that more of the end-to-end (app-to-app) path can characterized across multiple network domains provides the only widely deployed tool that can monitor circuits end-

to-end across the different networks from the US to Europendash ESnet has perfSONAR testers installed at every PoP and all but the

smallest user sites ndash Internet2 is close to the same

bull perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) and the protocol is implemented using SOAP XML messages

29

4) System software evolution and optimizationOnce the network is error-free there is still the issue of

efficiently moving data from the application running on a user system onto the network

bull Host TCP tuningbull Modern TCP stack (see above)bull Other issues (MTU etc)

bull Data transfer tools and parallelism

bull Other data transfer issues (firewalls etc)

30

4.1) System software tuning: Host tuning – TCP

• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length.
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using end-to-end.
• Default TCP buffer sizes are typically much too small for today's high-speed networks:
– Until recently, default TCP send/receive buffers were typically 64 KB.
– Tuned buffer to fill a CA-to-NY 1 Gb/s path: 10 MB
• 150X bigger than the default buffer size (the sketch below works through the arithmetic).
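The required buffer is the bandwidth-delay product of the path. A minimal Python sketch, assuming an RTT of roughly 80 ms for the CA-to-NY example (the host is left out; only the buffer arithmetic and socket options are shown):

```python
import socket

def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> int:
    """Bandwidth-delay product: bytes that must be 'in flight' to keep the path full."""
    return int(bandwidth_bps * rtt_seconds / 8)

# CA -> NY example from the slide: 1 Gb/s path, RTT assumed ~80 ms
buf = bdp_bytes(1e9, 0.080)            # ~10 MB, vs. the old 64 KB default
print(f"required window: {buf / 2**20:.1f} MiB")

# Request matching socket buffers (the OS may still cap this at its own limits)
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buf)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buf)
```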

31

System software tuning: Host tuning – TCP

• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications.
– How to tune is a function of the application and the path to the destination, so potentially a lot of special cases.

Auto-tuning the TCP connection buffer size within pre-configured limits helps.

Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g., international) paths.
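On Linux, the auto-tuning ceilings live in the net.ipv4.tcp_rmem / tcp_wmem sysctls. A small, Linux-specific sketch that checks whether those ceilings are adequate for a given path; the 10 Gb/s / 200 ms example figures (roughly a trans-Atlantic path) are assumptions for illustration:

```python
"""Check whether Linux TCP autotuning ceilings are large enough for a given path."""

def sysctl_max(name: str) -> int:
    # tcp_rmem / tcp_wmem hold three values: min, default, max (bytes)
    with open(f"/proc/sys/net/ipv4/{name}") as f:
        return int(f.read().split()[2])

bdp = int(10e9 * 0.200 / 8)   # e.g. 10 Gb/s path with ~200 ms RTT -> ~250 MB in flight
for name in ("tcp_rmem", "tcp_wmem"):
    ceiling = sysctl_max(name)
    verdict = "adequate" if ceiling >= bdp else "too small for this path"
    print(f"{name} max = {ceiling / 2**20:.0f} MiB "
          f"({verdict}; path needs ~{bdp / 2**20:.0f} MiB)")
```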

32

System software tuning: Host tuning – TCP

[Chart: throughput (Mb/s, 0–10,000) vs. round-trip time, out to ~9000 km path length on a 10 Gb/s network, comparing a 32 MB (auto-tuned) and a 64 MB (hand-tuned) TCP window. The hand-tuned 64 MB window sustains higher throughput at long RTTs; the longest RTT shown corresponds roughly to San Francisco to London.]

33

4.2) System software tuning: Data transfer tools

Parallelism is key in data transfer tools:
– It is much easier to achieve a given performance level with multiple parallel connections than with one connection.
• This is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (the same is true for disks).
– Several tools offer parallel transfers (see below, and the toy example following this slide).

Latency tolerance is critical:
– Wide area data transfers have much higher latency than LAN transfers.
– Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds); for example, SCP/SFTP and the HPSS mover protocols work very poorly in long-path networks.

• Disk performance:
– In general, a RAID array or parallel disks (like FDT) are needed to get more than about 500 Mb/s.
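The toy sketch below illustrates the parallelism point: split a transfer into byte ranges and fetch them over several TCP connections at once. The URL is a placeholder and HTTP range requests stand in for what real tools such as GridFTP or FDT do far more completely.

```python
"""Toy parallel downloader: several TCP streams each fetch a byte range,
illustrating why parallel connections help on long, high-latency paths."""

from concurrent.futures import ThreadPoolExecutor
import urllib.request

URL = "https://data.example.org/dataset.bin"   # placeholder source
STREAMS = 8

def total_size(url: str) -> int:
    with urllib.request.urlopen(urllib.request.Request(url, method="HEAD")) as r:
        return int(r.headers["Content-Length"])

def fetch_range(args):
    url, start, end = args
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as r:
        return start, r.read()

size = total_size(URL)
chunk = size // STREAMS
ranges = [(URL, i * chunk, size - 1 if i == STREAMS - 1 else (i + 1) * chunk - 1)
          for i in range(STREAMS)]

# Each worker thread drives its own TCP connection; the OS keeps them all busy.
with ThreadPoolExecutor(max_workers=STREAMS) as pool, open("dataset.bin", "wb") as out:
    for start, data in pool.map(fetch_range, ranges):
        out.seek(start)
        out.write(data)
```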

34

System software tuning: Data transfer tools

Using the right tool is very important.

Sample results, Berkeley, CA to Argonne, IL (RTT = 53 ms, network capacity = 10 Gbps):

  Tool                   Throughput
  • scp                  140 Mbps
  • patched scp (HPN)    1.2 Gbps
  • ftp                  1.4 Gbps
  • GridFTP, 4 streams   5.4 Gbps
  • GridFTP, 8 streams   6.6 Gbps

Note that to get more than about 1 Gbps (125 MB/s) disk to disk requires using RAID technology.

• PSC (Pittsburgh Supercomputing Center) has a patch set that fixes problems with SSH:
– http://www.psc.edu/networking/projects/hpn-ssh
– Significant performance increase
• this helps rsync too

35

System software tuning: Data transfer tools

Globus GridFTP is the basis of most modern high-performance data movement systems:
– Parallel streams, buffer tuning, help in getting through firewalls (open ports), ssh, etc.
– The newer Globus Online incorporates all of these plus small-file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP.
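If the classic globus-url-copy client is installed, a transfer with parallel streams and an explicit per-stream TCP buffer might be driven as sketched below. The endpoints are placeholders, and the option values would be tuned per path; this is an illustration, not a recommended production setup.

```python
"""Sketch: drive a GridFTP transfer with parallel streams from Python.
Assumes the globus-url-copy client is installed; endpoints are placeholders."""

import subprocess

SRC = "gsiftp://dtn1.example-lab.gov/data/run123.tar"
DST = "gsiftp://dtn2.example-univ.edu/ingest/run123.tar"

cmd = [
    "globus-url-copy",
    "-p", "8",                 # 8 parallel TCP streams
    "-tcp-bs", "16777216",     # per-stream TCP buffer size, in bytes (16 MB)
    SRC, DST,
]
subprocess.run(cmd, check=True)
```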

36

System software tuning: Data transfer tools

Also see Caltech's FDT (Fast Data Transfer) approach:
– Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node.
– Explicit parallel use of multiple disks.
– Can fill 100 Gb/s paths.
– See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT/

37

4.4) System software tuning: Other issues

Firewalls are anathema to high-speed data flows:
– many firewalls can't handle >1 Gb/s flows
• designed for a large number of low-bandwidth flows
• some firewalls even strip out the TCP options that allow for TCP buffers > 64 KB
– See Jason Zurawski's "Say Hello to your Frienemy – The Firewall": stateful firewalls have inherent problems that inhibit high throughput.
• http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf

• Many other issues:
– Large MTUs (several issues)
– NIC tuning
• defaults are usually fine for 1GE, but 10GE often requires additional tuning
– Other OS tuning knobs
– See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])

5) Site infrastructure to support data-intensive science: The Science DMZ

With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck.

The site network (LAN) typically provides connectivity for the local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science.
– Therefore a high-performance interface between the wide area network and the local area site network is critical for large-scale data movement.

Campus network infrastructure is typically not designed to handle the flows of large-scale science:
– The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows:
• firewalls, proxy servers, low-cost switches, and so forth,
• none of which will allow high-volume, high-bandwidth, long-distance data flows.

39

The Science DMZ

To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large-data-volume, high round-trip-time (RTT) (international paths) character of the wide area network (WAN) flows (see [DIS]);
– otherwise the site will impose poor performance on the entire high-speed data path, all the way back to the source.

40

The Science DMZ

The Science DMZ concept:

The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy:
• outside the site firewall – hence the term "Science DMZ";
• with dedicated systems built and tuned for wide-area data transfer;
• with test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below);
• a security policy tailored for science traffic and implemented using appropriately capable hardware (e.g., that supports access control lists, private address space, etc.).

This is so important that it was a requirement for the last round of NSF CC-NIE grants.

41

The Science DMZ

(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)

[Diagram: Science DMZ architecture. A border router connects the WAN to a Science DMZ router/switch (a WAN-capable device) that sits outside the site firewall and provides a clean, high-bandwidth WAN data path to dedicated systems built and tuned for wide-area data transfer – high-performance Data Transfer Nodes and a computing cluster – together with network monitoring and test systems and per-service security policy control points. Campus/site access to the Science DMZ resources is via the site firewall, as is the secured campus/site access to the Internet (site DMZ with Web/DNS/Mail and the campus/site LAN).]

42

6) Data movement and management techniques

Automated data movement is critical for moving 500 terabytes/day between 170 international sites. In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.

• The filtered ATLAS data rate of about 2.5 Gb/s is sent to 10 national Tier 1 data centers.
• The Tier 2 sites get a comparable amount of data from the Tier 1s:
– host the physics groups that analyze the data and do the science;
– provide most of the compute resources for analysis;
– cache the data (though this is evolving to remote I/O).

43

Highly distributed and highly automated workflow systems

• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management:
– The resources and data movement are centrally managed.
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations.
– The system manages tens of thousands of jobs a day:
• coordinates data movement of hundreds of terabytes/day, and
• manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science.
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial.

44

[Diagram: The ATLAS PanDA ("Production and Distributed Analysis") system, which uses distributed resources and layers of automation to manage several million jobs/day. ATLAS production jobs, regional production jobs, and user/group analysis jobs enter the PanDA Server at CERN (task management: task buffer/job queue, job broker, job dispatcher, data service, and policy by job type and priority). A Distributed Data Manager – a complex system in its own right, called DQ2 – locates data and moves it among the Tier 0 Data Center at CERN (one archival-only copy of all data), the 11 ATLAS Tier 1 data centers scattered across Europe, North America, and Asia (which in aggregate hold one copy of all data and provide the working data set for distribution to Tier 2 centers for analysis), and the ATLAS analysis sites (e.g., 70 Tier 2 centers in Europe, North America, and SE Asia). The job resource manager dispatches a "pilot" job manager – a PanDA job receiver – when resources are available at a site; pilots run under the local site job manager (e.g., Condor, LSF, LCG, …) and accept jobs in a standard format from PanDA, similar to the Condor glide-in approach.

The workflow: 1) PanDA schedules jobs and initiates data movement; 2) DDM locates data and moves it to sites; 3) pilots prepare the local resources to receive PanDA jobs; 4) jobs are dispatched when there are resources available and when the required data is in place at the site. PanDA tries to move the job to where the data is, else moves data and job to where resources are available.

(Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. Both are at Brookhaven National Lab.)]

45

Scale of ATLAS analysis-driven data movement

The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBytes/day, ~68 Gb/s. PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately in the charts).

It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.

[Charts: accumulated data volume on disk growing to ~150 petabytes over four years at 730 TBytes/day, and, over one year, the two types of PanDA jobs, each with tens of thousands running simultaneously.]
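The quoted figures are easy to sanity-check: 730 TBytes/day spread evenly over 24 hours is roughly 68 Gb/s of sustained network load, as the short calculation below confirms.

```python
# Sanity check: 730 TBytes/day expressed as a sustained bit rate
tbytes_per_day = 730
bits_per_day = tbytes_per_day * 1e12 * 8        # terabytes -> bits
gbps = bits_per_day / (24 * 3600) / 1e9
print(f"{gbps:.1f} Gb/s")                        # ~67.6 Gb/s, i.e. the ~68 Gb/s quoted above
```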

46

Building an LHC-scale production analysis system

In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure:
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges".
– Successful testing was required for sites to participate in LHC production.

47

Ramp-up of LHC traffic in ESnet

[Chart: growth of LHC traffic in ESnet (with an estimate of the "small"-scale traffic), annotated with the LHC data system testing period, LHC turn-on, and LHC operation.]

The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.

48

6, cont.) Evolution of network architectures

For sustained high data-rate transfers – e.g., from instrument to data centers – a dedicated, purpose-built infrastructure is needed.

• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN.
– The LHCOPN is a collection of leased 10 Gb/s optical circuits.
– The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.

49

The LHC OPN – Optical Private Network

• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.
• The issues related to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance. The security issues were the primary ones, and they were addressed by:
• using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec])
– that is, only LHC data and compute servers are connected to the OPN.

50

The LHC OPN – Optical Private Network

[Diagram: LHCOPN physical topology (abbreviated) and LHCOPN architecture – CH-CERN (Tier 0) connected to the Tier 1 centers: UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]

51

The LHC OPN – Optical Private Network

N.B.:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic.
– (There are about 170 Tier 2 sites.)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.

53

The LHC's Open Network Environment – LHCONE

LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).

The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems.
– The clouds are mostly local to a network domain (e.g., one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.).
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).

In this way the LHC traffic will use circuits designated by the network engineers:
– to ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.

54

[Map, April 2012: LHCONE – a global infrastructure for LHC Tier 1 data center to Tier 2 analysis center connectivity. LHCONE VRF domains (ESnet and Internet2 in the USA, CANARIE in Canada, GÉANT in Europe with NORDUnet, DFN, GARR, RedIRIS, SARA, and RENATER, plus TWAREN and ASGC in Taiwan, KREONET2 and KISTI in Korea, CUDI in Mexico, and networks in India) interconnect end sites – LHC Tier 2 or Tier 3 unless indicated as Tier 1 (e.g., BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1) and many university sites (Harvard, MIT, Caltech, UMich, UWisc, UCSD, UFlorida, UVic, SimFraU, DESY, GSI, INFN-Nap, GRIF-IN2P3, UNAM, TIFR, KNU, NCU, NTU, and others) – via regional R&E communication nexus points and data communication links of 10, 20, and 30 Gb/s. See http://lhcone.net for details.]

55

The LHC's Open Network Environment – LHCONE

• LHCONE could be set up relatively "quickly" because:
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See lhcone.net.

LHCONE is one part of the network infrastructure that supports the LHC

CERN → T1          miles     km
France               350      565
Italy                570      920
UK                   625     1000
Netherlands          625     1000
Germany              700     1185
Spain                850     1400
Nordic              1300     2100
USA – New York      3900     6300
USA – Chicago       4400     7100
Canada – BC         5200     8400
Taiwan              6100     9850

[Diagram: A Network-Centric View of the LHC. The detector (1 PB/s) feeds the Level 1 and 2 triggers (O(1–10) meters away), then the Level 3 trigger (O(10–100) meters), then the CERN Computer Center (O(1) km) at ~5.0 Gb/s (2.5 Gb/s ATLAS, 2.5 Gb/s CMS). The LHC Optical Private Network (LHCOPN), spanning 500–10,000 km, connects CERN to the LHC Tier 1 Data Centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN). The LHC Open Network Environment (LHCONE) connects the Tier 1s to the LHC Tier 2 Analysis Centers – the many university physics groups – indicating that the physics groups now get their data wherever it is most readily available.]

57

7) New network services: Point-to-Point Virtual Circuit Service

Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed application systems in order to:
– couple existing pockets of code, data, and expertise into "systems of systems";
– break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites;
– see https://www.es.net/about/science-requirements/

A commonly identified need to support this is that networking must be provided as a "service":
– schedulable, with guaranteed bandwidth – as is done with CPUs and disks;
– traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure;
– some network path characteristics may also be specified – e.g., diversity;
– available in the Web Services / Grid Services paradigm.

58

Point-to-Point Virtual Circuit Service

The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
– This is typically done by using a "static" routing mechanism:
• e.g., some variation of label-based switching, with the static switch tables set up in advance to define the circuit path.
– MPLS and OpenFlow are examples of this, and both can transport IP packets.
– Most modern Internet routers have this type of functionality.
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic.
– The virtual circuits can be directed to specific physical network paths when they are set up.
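Conceptually, a request for such a circuit names the two endpoints, the guaranteed bandwidth, and the time window, and the domain controller handles path computation and setup of the label-switched path. The sketch below is purely illustrative of that shape: the field names, port identifiers, and the submit_reservation helper are hypothetical and are not the OSCARS or NSI API.

```python
"""Illustrative only: the general shape of a point-to-point virtual circuit request.
Field names and submit_reservation() are hypothetical, not a real API."""

from datetime import datetime, timedelta

now = datetime.utcnow()
reservation = {
    "src_endpoint": "domainA:router1:port7",        # hypothetical port identifiers
    "dst_endpoint": "domainB:router4:port2",
    "bandwidth_mbps": 5000,                          # guaranteed bandwidth
    "start": now,
    "end": now + timedelta(hours=12),                # schedulable, like CPUs and disks
    "vlan": "any",                                   # data-plane attachment detail
}

def submit_reservation(req: dict) -> str:
    """Placeholder for what a domain controller (e.g. OSCARS) would do:
    path computation, admission control, and setup of the circuit path."""
    raise NotImplementedError

# circuit_id = submit_reservation(reservation)
```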

59

Point-to-Point Virtual Circuit Service

• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.

60

End User View of Circuits – How They Use Them

• Who are the "users"?
– Sites, for the most part.
• How are the circuits used?
– End system to end system, IP:
• almost never – very hard unless private address space is used;
• using public address space can result in leaking routes;
• using private address space with multi-homed hosts risks allowing backdoors into secure networks.
– End system to end system, Ethernet (or other) over VLAN – a pseudowire:
• relatively common;
• interesting example: RDMA over VLAN, likely to be popular in the future
– the SC11 demo of 40G RDMA over the WAN was very successful;
– CPU load for RDMA is a small fraction of that for IP;
– the guaranteed network behavior of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks).
– Point-to-point connection between routing instances – e.g., BGP at the end points:
• essentially this is how all current circuits are used, from one site router to another site router;
• typically site-to-site, or advertising subnets that host clusters, e.g., LHC analysis or data management clusters.

61

End User View of Circuits – How They Use Them

• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot.
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering.

62

Cross-Domain Virtual Circuit Service

• Large-scale science always involves institutions in multiple network domains (administrative units).
– For a circuit service to be useful, it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits.
– E.g., ESnet and Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.

63

Inter-Domain Control Protocol

• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.

[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US], across ESnet (AS293) [US], GÉANT (AS20965) [Europe], and DFN (AS680) [Germany], to a user destination at DESY (AS1754) [Germany]. Each domain runs a local inter-domain controller (IDC) – OSCARS in ESnet, AutoBAHN in GÉANT – and a data-plane connection helper at each domain ingress/egress point; topology exchange and VC setup requests pass from domain to domain.]

1. The domains exchange topology information containing at least the potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g., an Ethernet VLAN-to-VLAN connection) is facilitated by a helper process at each domain boundary.

64

Point-to-Point Virtual Circuit Service

• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g., CPU and storage scheduling, in a Web Services / Grid framework, so that computing, data access, and data movement can all work together as a predictable system.

Multi-domain circuit setup is not yet a robust production service, but progress is being made.
• See lhcone.net.

65

8) Provide R&D, consulting, and a knowledge base

• R&D drove most of the advances that make it possible for the network to support data-intensive science.
– With each generation of network transport technology:
• 155 Mb/s was the norm for high-speed networks in 1995;
• 100 Gb/s – 650 times greater – is the norm today.
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to:
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
• and then do the development necessary for applications to make use of the new capabilities.
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC-12 (622 Mb/s) wide area network paths;
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s.

66

Provide R&D, consulting, and a knowledge base

• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.

67

The knowledge base: http://fasterdata.es.net topics

– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained

• fasterdata.es.net is a community project with contributions from several organizations.

68

The Message

A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

69

Infrastructure Critical to Science

• The combination of:
– new network architectures in the wide area,
– new network services (such as guaranteed-bandwidth virtual circuits),
– cross-domain network error detection and correction,
– redesigning the site LAN to handle high data throughput,
– automation of data movement systems, and
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.

70

LHC lessons of possible use to the SKA

The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated at, or sent to, a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.

71

LHC lessons of possible use to the SKA

The lessons:

The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
– A deep archive (tape-only) copy is probably practical in one location (e.g., the SKA supercomputer center), and this is done at CERN for the LHC.
– The technical burden of building and operating a centralized working data repository:
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time, and
• high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites,
argues against a single large data center.

72

LHC lessons of possible use to the SKA

The LHC model of distributed data (multiple regional centers) has worked well:
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.

73

LHC lessons of possible use to the SKA

Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.

There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).

If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.

74

LHC lessons of possible use to the SKA

All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.

Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.

Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.

75

The Message

Again… a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References

[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References

[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," C. Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References

[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 27: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

27

perfSONARThe test and monitor functions can detect soft errors that limit

throughput and can be hard to find (hard errors faults are easily found and corrected)

Soft failure examplebull Observed end-to-end performance degradation due to soft failure of single optical line card

Gb

s

normal performance

degrading performance

repair

bull Why not just rely on ldquoSNMPrdquo interface stats for this sort of error detectionbull not all error conditions show up in SNMP interface statisticsbull SNMP error statistics can be very noisybull some devices lump different error counters into the same bucket so it can be very

challenging to figure out what errors to alarm on and what errors to ignorebull though ESnetrsquos Spectrum monitoring system attempts to apply heuristics to do this

bull many routers will silently drop packets - the only way to find that is to test through them and observe loss using devices other than the culprit device

one month

28

perfSONARThe value of perfSONAR increases dramatically as it is

deployed at more sites so that more of the end-to-end (app-to-app) path can characterized across multiple network domains provides the only widely deployed tool that can monitor circuits end-

to-end across the different networks from the US to Europendash ESnet has perfSONAR testers installed at every PoP and all but the

smallest user sites ndash Internet2 is close to the same

bull perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) and the protocol is implemented using SOAP XML messages

29

4) System software evolution and optimizationOnce the network is error-free there is still the issue of

efficiently moving data from the application running on a user system onto the network

bull Host TCP tuningbull Modern TCP stack (see above)bull Other issues (MTU etc)

bull Data transfer tools and parallelism

bull Other data transfer issues (firewalls etc)

30

41) System software tuning Host tuning ndash TCPbull ldquoTCP tuningrdquo commonly refers to the proper configuration of

TCP windowing buffers for the path lengthbull It is critical to use the optimal TCP send and receive socket

buffer sizes for the path (RTT) you are using end-to-endDefault TCP buffer sizes are typically much too small for

todayrsquos high speed networksndash Until recently default TCP sendreceive buffers were typically 64 KBndash Tuned buffer to fill CA to NY 1 Gbs path 10 MB

bull 150X bigger than the default buffer size

31

System software tuning Host tuning ndash TCPbull Historically TCP window size tuning parameters were host-

global with exceptions configured per-socket by applicationsndash How to tune is a function of the application and the path to the

destination so potentially a lot of special cases

Auto-tuning TCP connection buffer size within pre-configured limits helps

Auto-tuning however is not a panacea because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (eg international) paths

32

System software tuning Host tuning ndash TCP

Throughput out to ~9000 km on a 10Gbs network32 MBy (auto-tuned) vs 64 MBy (hand tuned) TCP window size

hand tuned to 64 MBy window

Roundtrip time ms (corresponds roughlyto San Francisco to London)

path length

10000900080007000600050004000300020001000

0

Thro

ughp

ut M

bs

auto tuned to 32 MBy window

33

42) System software tuning Data transfer toolsParallelism is key in data transfer tools

ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection

bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)

ndash Several tools offer parallel transfers (see below)

Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN

transfersndash Many tools and protocols assume latencies typical of a LAN

environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long

path networks

bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more

than about 500 Mbs

34

System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL

RTT = 53 ms network capacity = 10GbpsTool Throughput

bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology

bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase

bull this helps rsync too

35

System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-

performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open

ports) ssh etc The newer Globus Online incorporates all of these and small file

support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community

outside of HEP

36

System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach

ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node

ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and

httpmonalisacernchFDT

37

44) System software tuning Other issuesFirewalls are anathema to high-peed data flows

ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for

TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo

Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf

bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning

bull Defaults are usually fine for 1GE but 10GE often requires additional tuning

ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo

([HPBulk])

5) Site infrastructure to support data-intensive scienceThe Science DMZ

With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the

bottleneckThe site network (LAN) typically provides connectivity for local

resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network

and the local area site network is critical for large-scale data movement

Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks

for business and small data-flow purposes usually donrsquot work for large-scale data flows

bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data

flows

39

The Science DMZTo provide high data-rate access to local resources the site

LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high

speed data path all the way back to the source

40

The Science DMZThe ScienceDMZ concept

The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and

rapid fault isolation typically perfSONAR (see [perfSONAR] and below)

A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)

This is so important it was a requirement for last round of NSF CC-NIE grants

41

The Science DMZ

(See httpfasterdataesnetscience-dmz

and [SDMZ] for a much more complete

discussion of the various approaches)

campus siteLAN

high performanceData Transfer Node

computing cluster

cleanhigh-bandwidthWAN data path

campussiteaccess to

Science DMZresources is via the site firewall

secured campussiteaccess to Internet

border routerWAN

Science DMZrouterswitch

campus site

Science DMZ

Site DMZ WebDNS

Mail

network monitoring and testing

A WAN-capable device

per-servicesecurity policycontrol points

site firewall

dedicated systems built and

tuned for wide-area data transfer

42

6) Data movement and management techniquesAutomated data movement is critical for moving 500

terabytesday between 170 international sites In order to effectively move large amounts of data over the

network automated systems must be used to manage workflow and error recovery

bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers

bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)

43

Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the

analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates

compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day

bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10

petabytes of datayear in order to accomplish its science

bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial

44

DDMAgent

DDMAgent

ATLAS production

jobs

Regional production

jobs

User Group analysis jobs

Data Service

Task Buffer(job queue)

Job Dispatcher

PanDA Server(task management)

Job Broker

Policy(job type priority)

ATLA

S Ti

er 1

Data

Cen

ters

11 s

ites

scat

tere

d ac

ross

Euro

pe N

orth

Am

erica

and

Asia

in

aggr

egat

e ho

ld 1

copy

of a

ll dat

a an

d pr

ovide

the

work

ing

data

set f

or d

istrib

ution

to T

ier 2

cen

ters

for a

nalys

isDistributed

Data Manager

Pilot Job(Panda job

receiver running under the site-

specific job manager)

Grid Scheduler

Site Capability Service

CERNATLAS detector

Tier 0 Data Center(1 copy of all data ndash

archival only)

Job resource managerbull Dispatch a ldquopilotrdquo job manager - a

Panda job receiver - when resources are available at a site

bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA

bull Similar to the Condor Glide-in approach

Site status

ATLAS analysis sites(eg 70 Tier 2 Centers in

Europe North America and SE Asia)

DDMAgent

DDMAgent

1) Schedules jobs initiates data movement

2) DDM locates data and moves it to sites

This is a complex system in its own right called DQ2

3) Prepares the local resources to receive Panda jobs

4) Jobs are dispatched when there are resources available and when the required data is

in place at the site

Thanks to Michael Ernst US ATLAS technical lead for his assistance with this

diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)

The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday

CERN

Try to move the job to where the data is else move data and job to where

resources are available

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe N America and SE Asia generate network

data movement of 730 TByday ~68Gbs

Accumulated data volume on disk

730 TBytesday

PanDA manages 120000ndash140000 simultaneous jobs    (PanDA manages two types of jobs that are shown separately here)

It is this scale of data movementgoing on 24 hrday 9+ monthsyr

that networks must support in order to enable the large-scale science of the LHC

0

50

100

150

Peta

byte

s

four years

0

50000

100000

type

2jo

bs

0

50000

type

1jo

bs

one year

one year

one year

46

Building an LHC-scale production analysis system In order to debug and optimize the distributed system that

accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in

ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC

production

47

Ramp-up of LHC traffic in ESnet

(est of ldquosmallrdquo scale traffic)

LHC

turn

-on

LHC data systemtesting

LHC operationThe transition from testing to operation

was a smooth continuum due toat-scale testing ndash a process that took

more than 5 years

48

6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument

to data centers ndash a dedicated purpose-built infrastructure is needed

bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to

the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the

Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward

exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community

bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by

bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])

ndash that is only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems (sketched in the example below)
  – The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
  – The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineers
  – To ensure continued good performance for the LHC and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC
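The routed-cloud idea can be sketched in a few lines: destinations inside the restricted LHC address space are carried by the overlay, everything else stays on the general R&E IP service. All prefixes, domain names, and the lookup logic below are invented for illustration; they are not actual LHCONE configuration.

    import ipaddress

    # Hypothetical LHC subnets announced into the overlay by each VRF "cloud"
    lhcone_vrfs = {
        "ESnet":     ["192.0.2.0/25"],      # example: a US lab analysis-cluster subnet
        "GEANT":     ["198.51.100.0/24"],   # example: European Tier 2 subnets
        "Internet2": ["203.0.113.0/26"],    # example: US university Tier 2 subnets
    }

    def path_for(dst_ip: str) -> str:
        """Decide which infrastructure carries traffic toward dst_ip."""
        dst = ipaddress.ip_address(dst_ip)
        for vrf, prefixes in lhcone_vrfs.items():
            if any(dst in ipaddress.ip_network(p) for p in prefixes):
                return f"LHCONE overlay (via the {vrf} VRF)"
        return "general R&E IP service"

    print(path_for("198.51.100.17"))   # LHCONE overlay (via the GEANT VRF)
    print(path_for("8.8.8.8"))         # general R&E IP service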

54

[Figure: LHCONE – A global infrastructure for the LHC Tier 1 data center – Tier 2 analysis center connectivity (April 2012). The map shows LHCONE VRF domains (ESnet and Internet2 in the USA, CANARIE in Canada, GÉANT in Europe fronting NORDUnet, DFN, GARR, RedIRIS, SARA, and RENATER, plus ASGC, TWAREN, KREONET2, KISTI, TIFR, and CUDI) interconnected at regional communication nexuses such as Seattle, Chicago, New York, Washington, Amsterdam, and Geneva. End sites are LHC Tier 2 or Tier 3 centers unless indicated as Tier 1 (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, CERN-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, NDGF-T1, CC-IN2P3-T1, ASGC-T1). Data communication links are 10, 20, and 30 Gb/s. See http://lhcone.net for details.]

55

The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because
  – the VRF technology is a standard capability in most core routers, and
  – there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance
• See lhcone.net

LHCONE is one part of the network infrastructure that supports the LHC

CERN → T1           miles      km
France                350      565
Italy                 570      920
UK                    625     1000
Netherlands           625     1000
Germany               700     1185
Spain                 850     1400
Nordic               1300     2100
USA – New York       3900     6300
USA – Chicago        4400     7100
Canada – BC          5200     8400
Taiwan               6100     9850

[Figure: A Network Centric View of the LHC – from the detector (1 PB/s) through the Level 1 and 2 triggers (O(1–10) m), the Level 3 trigger (O(10–100) m), and the CERN computer center (O(1) km), across the LHC Optical Private Network (LHCOPN) at 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) to the LHC Tier 1 data centers in Taiwan, Canada, USA (ATLAS and CMS), the Nordic countries, the UK, the Netherlands, Germany, Italy, Spain, and France, and then over the LHC Open Network Environment (LHCONE), spanning 500–10,000 km, to the LHC Tier 2 analysis centers (university physics groups). The figure is intended to indicate that the physics groups now get their data wherever it is most readily available.]

57

7) New network services: Point-to-Point Virtual Circuit Service

Why a Circuit Service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed application systems in order to
  – couple existing pockets of code, data, and expertise into "systems of systems"
  – break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
  – see https://www.es.net/about/science-requirements/
• A commonly identified need to support this is that networking must be provided as a "service":
  – schedulable with guaranteed bandwidth – as is done with CPUs and disks (see the sketch below)
  – traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
  – some network path characteristics may also be specified – e.g. diversity
  – available in the Web Services / Grid Services paradigm
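To make "schedulable with guaranteed bandwidth" concrete, here is a minimal sketch of the kind of admission check a reservation system has to perform: accept a request only if, over the whole requested time window, the already-committed reservations plus the new one stay within the link capacity. This illustrates the idea only; it is not OSCARS code, and the capacity and request values are invented.

    # Minimal sketch of bandwidth-calendar admission control on one link.
    LINK_CAPACITY_GBPS = 100

    # existing commitments: (start_hour, end_hour, gbps) -- hypothetical values
    reservations = [(0, 6, 40), (4, 12, 30)]

    def committed(hour):
        """Bandwidth already reserved during the given hour."""
        return sum(g for s, e, g in reservations if s <= hour < e)

    def admit(start, end, gbps):
        """Accept the request only if capacity holds for every hour of the window."""
        if all(committed(h) + gbps <= LINK_CAPACITY_GBPS for h in range(start, end)):
            reservations.append((start, end, gbps))
            return True
        return False

    print(admit(2, 8, 20))   # True:  the busiest hour reaches 40+30+20 = 90 <= 100
    print(admit(2, 8, 20))   # False: a second 20 Gb/s circuit would push that hour to 110 > 100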

58

Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
  – This is typically done by using a "static" routing mechanism
    • e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
  – MPLS and OpenFlow are examples of this, and both can transport IP packets
  – Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
  – The virtual circuits can be directed to specific physical network paths when they are set up (see the sketch below)
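The following toy sketch illustrates the "static switch tables set up in advance" idea: a circuit is provisioned by installing label-forwarding entries along a fixed path, and data then follows that path with no per-hop routing decision. It is an illustration of the label-switching concept only – switch names, ports, and label values are all invented, and real MPLS/OpenFlow provisioning is of course far more involved.

    # Toy model of a label-switched virtual circuit (pseudowire).
    class Switch:
        def __init__(self, name):
            self.name = name
            self.label_table = {}                 # in_label -> (out_port, out_label)

        def install(self, in_label, out_port, out_label):
            """Static table entry, installed at circuit-setup time."""
            self.label_table[in_label] = (out_port, out_label)

        def forward(self, label, payload):
            """Forward by label lookup only -- no dynamic routing decision."""
            out_port, out_label = self.label_table[label]
            print(f"{self.name}: label {label} -> port {out_port}, new label {out_label}")
            return out_label, payload

    def provision_circuit(hops, labels):
        """Install the static entries along the whole path before any data flows."""
        for (switch, out_port), (in_label, out_label) in zip(hops, zip(labels, labels[1:])):
            switch.install(in_label, out_port, out_label)

    # Hypothetical 3-hop circuit from a site edge, across a core switch, to the far edge.
    a, b, c = Switch("siteA-edge"), Switch("core-1"), Switch("siteB-edge")
    provision_circuit([(a, 1), (b, 7), (c, 2)], labels=[100, 200, 300, 999])

    label, payload = 100, b"detector-data"
    for switch in (a, b, c):                      # data plane: follow the circuit
        label, payload = switch.forward(label, payload)

A service such as OSCARS automates the provisioning step (and adds scheduling, authorization, and path selection); the data plane then behaves like the loop at the end.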

59

Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead, Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference, 2011, in the references
• OSCARS received a 2013 "R&D 100" award

60

End User View of Circuits – How They Use Them
• Who are the "users"?
  – Sites, for the most part
• How are the circuits used?
  – End system to end system, IP
    • Almost never – very hard unless private address space is used
      – Using public address space can result in leaking routes
      – Using private address space with multi-homed hosts risks allowing backdoors into secure networks
  – End system to end system, Ethernet (or other) over VLAN – a pseudowire
    • Relatively common
    • Interesting example: RDMA over VLAN is likely to be popular in the future
      – The SC11 demo of 40G RDMA over the WAN was very successful
      – CPU load for RDMA is a small fraction of that for IP
      – The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks)
  – Point-to-point connection between routing instances – e.g. BGP at the end points
    • Essentially this is how all current circuits are used: from one site router to another site router
    • Typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters

61

End User View of Circuits – How They Use Them
• When are the circuits used?
  – Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Service
• Large-scale science always involves institutions in multiple network domains (administrative units)
  – For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
  – e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains

63

Inter-Domain Control Protocol
• There are two realms involved:
  1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
  2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains

[Figure: The end-to-end virtual circuit – an inter-domain circuit from a user source at FNAL (AS3152) [US] across ESnet (AS293) [US], GÉANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GÉANT – and the IDCs exchange topology information and pass the VC setup request from domain to domain, with a data plane connection helper at each domain ingress/egress point.]
1. The domains exchange topology information containing at least the potential VC ingress and egress points
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved (sketched in the example below)
3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process
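A minimal sketch of the chained, domain-by-domain setup described in step 2: each domain's controller checks and reserves its own segment, then forwards the request downstream; the circuit exists end-to-end only if every domain along the path agrees. Domain names, capacities, and the setup() logic are all invented for illustration – this is not the actual IDC/NSI protocol or the OSCARS interface.

    # Toy model of inter-domain virtual circuit setup by chained requests.
    class DomainController:
        def __init__(self, name, capacity_gbps):
            self.name = name
            self.available = capacity_gbps        # free capacity toward the next domain
            self.next_domain = None               # set when the chain is built

        def setup(self, gbps):
            """Reserve a segment locally, then pass the request downstream."""
            if gbps > self.available:
                print(f"{self.name}: rejected ({gbps} Gb/s > {self.available} available)")
                return False
            self.available -= gbps
            print(f"{self.name}: segment reserved ({gbps} Gb/s)")
            if self.next_domain is None:          # last domain: circuit is complete
                return True
            if self.next_domain.setup(gbps):
                return True
            self.available += gbps                # downstream failed: release our segment
            return False

    # Hypothetical chain following the figure: FNAL -> ESnet -> GEANT -> DFN (-> DESY)
    fnal, esnet, geant, dfn = (DomainController(n, c) for n, c in
                               [("FNAL", 10), ("ESnet", 40), ("GEANT", 20), ("DFN", 10)])
    fnal.next_domain, esnet.next_domain, geant.next_domain = esnet, geant, dfn

    print("end-to-end circuit up:", fnal.setup(8))    # True: every domain can carry 8 Gb/s
    print("end-to-end circuit up:", fnal.setup(8))    # False: FNAL has only 2 Gb/s left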

64

Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
  – Testing is being coordinated in GLIF (the Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grid framework, so that computing, data access, and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net

65

8) Provide R&D, consulting, and a knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
  – With each generation of network transport technology:
    • 155 Mb/s was the norm for high-speed networks in 1995
    • 100 Gb/s – 650 times greater – is the norm today
  – R&D groups involving hardware engineers, computer scientists, and application specialists worked to
    • first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
    • and then do the development necessary for applications to make use of the new capabilities
  – Examples of how this methodology drove toward today's capabilities include:
    • experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
    • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s

66

Provide R&D, consulting, and a knowledge base
• Providing consulting on the problems that data-intensive projects are having in effectively using the network is critical
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations

67

The knowledge base: http://fasterdata.es.net topics
  – Network architecture, including the Science DMZ model
  – Host tuning
  – Network tuning
  – Data transfer tools
  – Network performance testing
  – With special sections on:
    • Linux TCP tuning
    • Cisco 6509 tuning
    • perfSONAR howto
    • Active perfSONAR services
    • Globus overview
    • Say No to SCP
    • Data Transfer Nodes (DTN)
    • TCP issues explained
• fasterdata.es.net is a community project with contributions from several organizations

68

The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

69

Infrastructure Critical to Science
• The combination of
  – new network architectures in the wide area
  – new network services (such as guaranteed bandwidth virtual circuits)
  – cross-domain network error detection and correction
  – redesigning the site LAN to handle high data throughput
  – automation of data movement systems
  – use of appropriate operating system tuning and data transfer tools
  now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKA
The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
  – A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
  – The technical aspects of building and operating a centralized working data repository
    • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
    • high-speed WAN connections to accept all data from the telescope site and then send all of that data out to the science sites
  argue against a single large data center

72

LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well
  – It decentralizes costs and involves many countries directly in the telescope infrastructure
  – It divides up the network load, especially on the expensive trans-ocean links
  – It divides up the cache I/O load across distributed sites

73

LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply
  There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
  • It might be that in the case of the SKA the T1 links would come to a centralized, distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue
  • In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
  If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
  – In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE

74

LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded
  All high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
  New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
Re-engineering the site LAN/WAN architecture is critical: the Science DMZ
Workflow management systems that automate the data movement will have to be designed and tested
  – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on

75

The Message
Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1–5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf


bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to

the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the

Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward

exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community

bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by

bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])

ndash that is only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical: the Science DMZ

Workflow management systems that automate the data movement will have to be designed and tested (a minimal sketch follows)
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
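The following is a minimal, hypothetical sketch of the kind of automation such a workflow system needs at its core: queue a transfer, verify it end-to-end, and retry on failure. It is not the PanDA/DQ2 design nor any SKA system; transfer() and remote_checksum() are stand-ins for whatever movement tool (e.g. a GridFTP-based client) and integrity service a real deployment would use.

# Minimal illustration of automated data movement with verification and retry.
# transfer() and remote_checksum() are hypothetical placeholders, not a real
# LHC or SKA API; plug in the site's actual tools.
import hashlib
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("mover")

def local_checksum(path: str) -> str:
    """Checksum of the local file, used to verify the copy end-to-end."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def transfer(path: str, destination: str) -> None:
    """Placeholder for the real movement tool (e.g. a GridFTP client call)."""
    raise NotImplementedError("plug in the site's transfer tool here")

def remote_checksum(path: str, destination: str) -> str:
    """Placeholder for asking the destination for its checksum of the file."""
    raise NotImplementedError("plug in the destination's checksum service here")

def move_with_retry(path: str, destination: str, attempts: int = 3) -> bool:
    """Try the transfer up to `attempts` times, verifying each attempt."""
    expected = local_checksum(path)
    for attempt in range(1, attempts + 1):
        try:
            transfer(path, destination)
            if remote_checksum(path, destination) == expected:
                log.info("ok: %s -> %s (attempt %d)", path, destination, attempt)
                return True
            log.warning("checksum mismatch for %s, retrying", path)
        except Exception as exc:
            log.warning("attempt %d failed for %s: %s", attempt, path, exc)
        time.sleep(min(60, 2 ** attempt))   # simple backoff between attempts
    log.error("giving up on %s after %d attempts", path, attempts)
    return False

In a production system this loop would be driven by a dataset catalog and job scheduler and exercised at scale with synthetic data during the service challenges described above.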

75

The Message, Again …

A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References

[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References

[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References

[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010.

(may be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOE's Office of Science
  • DOE Office of Science and ESnet – the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 4.1) System software tuning Host tuning – TCP
  • System software tuning Host tuning – TCP
  • System software tuning Host tuning – TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science: The Science DMZ
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN – Optical Private Network
  • The LHC OPN – Optical Private Network (2)
  • The LHC OPN – Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHC's Open Network Environment – LHCONE
  • Slide 54
  • The LHC's Open Network Environment – LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits – How They Use Them
  • End User View of Circuits – How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide R&D consulting and knowledge base
  • Provide R&D consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 30: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

30

41) System software tuning Host tuning ndash TCPbull ldquoTCP tuningrdquo commonly refers to the proper configuration of

TCP windowing buffers for the path lengthbull It is critical to use the optimal TCP send and receive socket

buffer sizes for the path (RTT) you are using end-to-endDefault TCP buffer sizes are typically much too small for

todayrsquos high speed networksndash Until recently default TCP sendreceive buffers were typically 64 KBndash Tuned buffer to fill CA to NY 1 Gbs path 10 MB

bull 150X bigger than the default buffer size

31

System software tuning Host tuning ndash TCPbull Historically TCP window size tuning parameters were host-

global with exceptions configured per-socket by applicationsndash How to tune is a function of the application and the path to the

destination so potentially a lot of special cases

Auto-tuning TCP connection buffer size within pre-configured limits helps

Auto-tuning however is not a panacea because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (eg international) paths

32

System software tuning Host tuning ndash TCP

Throughput out to ~9000 km on a 10Gbs network32 MBy (auto-tuned) vs 64 MBy (hand tuned) TCP window size

hand tuned to 64 MBy window

Roundtrip time ms (corresponds roughlyto San Francisco to London)

path length

10000900080007000600050004000300020001000

0

Thro

ughp

ut M

bs

auto tuned to 32 MBy window

33

42) System software tuning Data transfer toolsParallelism is key in data transfer tools

ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection

bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)

ndash Several tools offer parallel transfers (see below)

Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN

transfersndash Many tools and protocols assume latencies typical of a LAN

environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long

path networks

bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more

than about 500 Mbs

34

System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL

RTT = 53 ms network capacity = 10GbpsTool Throughput

bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology

bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase

bull this helps rsync too

35

System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-

performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open

ports) ssh etc The newer Globus Online incorporates all of these and small file

support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community

outside of HEP

36

System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach

ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node

ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and

httpmonalisacernchFDT

37

44) System software tuning Other issuesFirewalls are anathema to high-peed data flows

ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for

TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo

Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf

bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning

bull Defaults are usually fine for 1GE but 10GE often requires additional tuning

ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo

([HPBulk])

5) Site infrastructure to support data-intensive scienceThe Science DMZ

With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the

bottleneckThe site network (LAN) typically provides connectivity for local

resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network

and the local area site network is critical for large-scale data movement

Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks

for business and small data-flow purposes usually donrsquot work for large-scale data flows

bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data

flows

39

The Science DMZTo provide high data-rate access to local resources the site

LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high

speed data path all the way back to the source

40

The Science DMZThe ScienceDMZ concept

The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and

rapid fault isolation typically perfSONAR (see [perfSONAR] and below)

A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)

This is so important it was a requirement for last round of NSF CC-NIE grants

41

The Science DMZ

(See httpfasterdataesnetscience-dmz

and [SDMZ] for a much more complete

discussion of the various approaches)

campus siteLAN

high performanceData Transfer Node

computing cluster

cleanhigh-bandwidthWAN data path

campussiteaccess to

Science DMZresources is via the site firewall

secured campussiteaccess to Internet

border routerWAN

Science DMZrouterswitch

campus site

Science DMZ

Site DMZ WebDNS

Mail

network monitoring and testing

A WAN-capable device

per-servicesecurity policycontrol points

site firewall

dedicated systems built and

tuned for wide-area data transfer

42

6) Data movement and management techniquesAutomated data movement is critical for moving 500

terabytesday between 170 international sites In order to effectively move large amounts of data over the

network automated systems must be used to manage workflow and error recovery

bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers

bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)

43

Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the

analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates

compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day

bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10

petabytes of datayear in order to accomplish its science

bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial

44

[Diagram: The ATLAS PanDA "Production and Distributed Analysis" system uses distributed resources and layers of automation to manage several million jobs/day.
• PanDA Server (task management): Task Buffer (job queue), Job Broker, Job Dispatcher, Data Service, and Policy (job type, priority); it accepts ATLAS production jobs, regional production jobs, and user/group analysis jobs, and uses site status information from a Site Capability Service.
• Distributed Data Manager (DDM agents): locates data and moves it to sites – a complex system in its own right, called DQ2.
• Job resource manager: dispatches a "pilot" job manager – a PanDA job receiver – via a Grid Scheduler when resources are available at a site; pilots run under the local site job manager (e.g. Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA; similar to the Condor Glide-in approach.
• Data sources and sites: the CERN ATLAS detector feeds the Tier 0 Data Center (1 copy of all data – archival only); the ATLAS Tier 1 Data Centers (11 sites scattered across Europe, North America and Asia) in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis; the ATLAS analysis sites (e.g. 70 Tier 2 Centers in Europe, North America and SE Asia) run the analysis jobs.
• Workflow: 1) PanDA schedules jobs and initiates data movement; 2) DDM locates data and moves it to sites; 3) pilots prepare the local resources to receive PanDA jobs; 4) jobs are dispatched when there are resources available and when the required data is in place at the site. The general strategy: try to move the job to where the data is, else move data and job to where resources are available.
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBytes/day, ~68 Gb/s
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately in the charts)
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC
[Charts: accumulated data volume on disk – 730 TBytes/day – plotted in Petabytes (0–150) over four years, and the counts of the two types of PanDA-managed jobs (0–100,000 and 0–50,000) over one year.]
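The quoted average rate follows directly from the daily volume; a quick back-of-the-envelope check:

```python
# 730 TBytes/day expressed as an average bit rate.
bytes_per_day = 730e12
gbps = bytes_per_day * 8 / 86400 / 1e9   # 86400 seconds per day
print(round(gbps, 1))                    # ~67.6, i.e. roughly 68 Gb/s sustained
```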

46

Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges"
– Successful testing was required for sites to participate in LHC production

47

Ramp-up of LHC traffic in ESnet
[Chart: ESnet traffic over time, annotated with the estimate of "small" scale traffic, the LHC turn-on, the LHC data system testing period, and LHC operation.]
The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years

48

6 cont.) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN
  – The LHCOPN is a collection of leased 10 Gb/s optical circuits
  – The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community
• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance
  – The security issues were the primary ones, and were addressed by using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec])
  – that is, only LHC data and compute servers are connected to the OPN

50

The LHC OPN – Optical Private Network
[Diagram: LHCOPN physical topology (abbreviated) and LHCOPN architecture – CH-CERN connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASCG, and IT-NFN-CNAF.]

51

The LHC OPN – Optical Private Network
NB:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays
  – The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below)
  – However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic
– In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
– (there are about 170 Tier 2 sites)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose

53

The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
• The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems
  – The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GEANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
  – The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
• In this way the LHC traffic will use circuits designated by the network engineers
  – To ensure continued good performance for the LHC and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC

54

LHCONE: A global infrastructure for the LHC Tier 1 data center – Tier 2 analysis center connectivity
[Map (April 2012): LHCONE VRF domains – ESnet (USA), Internet2 (USA), CANARIE (Canada), GÉANT (Europe), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), CUDI (Mexico), TWAREN and ASGC (Taiwan), KERONET2 and KISTI (Korea), TIFR (India) – interconnected at regional R&E communication nexus points (Seattle, Chicago, New York, Washington, Amsterdam, Geneva, ...) by data communication links of 10, 20, and 30 Gb/s. End sites are LHC Tier 2 or Tier 3 unless indicated as Tier 1 (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, ASGC-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, CERN-T1). See http://lhcone.net for details.]

55

The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because
  – the VRF technology is a standard capability in most core routers, and
  – there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance
• See lhcone.net

LHCONE is one part of the network infrastructure that supports the LHC

CERN → T1          miles    kms
France               350     565
Italy                570     920
UK                   625    1000
Netherlands          625    1000
Germany              700    1185
Spain                850    1400
Nordic              1300    2100
USA – New York      3900    6300
USA – Chicago       4400    7100
Canada – BC         5200    8400
Taiwan              6100    9850

[Diagram: A Network Centric View of the LHC. The detector feeds the Level 1 and 2 triggers (O(1-10) meters away) and the Level 3 trigger (O(10-100) meters) at ~1 PB/s, which feeds the CERN Computer Center (O(1) km). The LHC Optical Private Network (LHCOPN) carries 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) over 500-10,000 km to the LHC Tier 1 Data Centers (Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN), and the LHC Open Network Environment (LHCONE) connects the Tier 1 centers to the LHC Tier 2 Analysis Centers (universities and physics groups). This is intended to indicate that the physics groups now get their data wherever it is most readily available.]

57

7) New network services
Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to
  – Couple existing pockets of code, data, and expertise into "systems of systems"
  – Break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
  – See https://www.es.net/about/science-requirements/
• A commonly identified need to support this is that networking must be provided as a "service":
  – Schedulable with guaranteed bandwidth – as is done with CPUs and disks
  – Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
  – Some network path characteristics may also be specified – e.g. diversity
  – Available in a Web Services / Grid Services paradigm (a sketch of what such a request might look like follows below)
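To make "network as a service" concrete, the sketch below shows what a schedulable, guaranteed-bandwidth circuit request might look like against a web-service style interface. The endpoint URL, field names, and values are invented for illustration; real systems (e.g. OSCARS, NSI) define their own schemas.

```python
# Hypothetical circuit-reservation request (illustrative only).
import json
import urllib.request

reservation = {
    "source":         "site-a-rtr:xe-0/1/0",    # assumed endpoint identifiers
    "destination":    "site-b-rtr:xe-3/0/1",
    "bandwidth_mbps": 5000,                     # guaranteed bandwidth
    "start":          "2014-04-01T00:00:00Z",   # schedulable, like CPUs and disks
    "end":            "2014-04-03T00:00:00Z",
    "vlan":           3012,                     # traffic isolation on a pseudowire
}

req = urllib.request.Request(
    "https://circuit-service.example.net/reservations",  # hypothetical URL
    data=json.dumps(reservation).encode(),
    headers={"Content-Type": "application/json"})
# urllib.request.urlopen(req)  # would actually submit the request
print(json.dumps(reservation, indent=2))
```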

58

Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
– This is typically done by using a "static" routing mechanism
  • e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
  • MPLS and OpenFlow are examples of this, and both can transport IP packets
  • Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
  – The virtual circuits can be directed to specific physical network paths when they are set up

59

Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead, Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference, 2011, in the references
• OSCARS received a 2013 "R&D 100" award

60

End User View of Circuits – How They Use Them
• Who are the "users"?
  – Sites, for the most part
• How are the circuits used?
  – End system to end system, IP
    • Almost never – very hard unless private address space is used
      – Using public address space can result in leaking routes
      – Using private address space with multi-homed hosts risks allowing backdoors into secure networks
  – End system to end system, Ethernet (or other) over VLAN – a pseudowire
    • Relatively common
    • Interesting example: RDMA over VLAN is likely to be popular in the future
      – SC11 demo of 40G RDMA over WAN was very successful
      – CPU load for RDMA is a small fraction of that of IP
      – The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks)
  – Point-to-point connection between routing instances – e.g. BGP at the end points
    • Essentially this is how all current circuits are used: from one site router to another site router
    • Typically site-to-site, or advertise subnets that host clusters, e.g. LHC analysis or data management clusters

61

End User View of Circuits – How They Use Them
• When are the circuits used?
  – Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Service
• Large-scale science always involves institutions in multiple network domains (administrative units)
  – For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
  – e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains

63

Inter-Domain Control Protocol
• There are two realms involved:
  1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
  2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains
[Diagram: the end-to-end virtual circuit runs from a user source at FNAL (AS3152) [US] through ESnet (AS293) [US] and GEANT (AS20965) [Europe] to DFN (AS680) [Germany] and a user destination at DESY (AS1754) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GEANT – with a data plane connection helper at each domain ingress/egress point.
1. The domains exchange topology information containing at least potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process.]
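A toy model of step 2 – the VC setup request passed from domain to domain as segments are authorized and reserved – is sketched below; the domain chain comes from the diagram, but the controller logic is purely illustrative.

```python
# Toy model of inter-domain VC setup: each domain controller (IDC) reserves its
# own segment, then the request moves to the next domain. Real IDCs (OSCARS,
# AutoBAHN) also handle topology exchange, authorization, and scheduling.

DOMAIN_CHAIN = ["FNAL", "ESnet", "GEANT", "DFN", "DESY"]

def reserve_segment(domain, request):
    # Stand-in for per-domain authorization and resource commitment.
    print(f"{domain}: reserved {request['bandwidth_gbps']} Gb/s segment")
    return True

def setup_circuit(request):
    reserved = []
    for domain in DOMAIN_CHAIN:
        if not reserve_segment(domain, request):
            for d in reversed(reserved):          # roll back on failure
                print(f"{d}: released segment")
            return False
        reserved.append(domain)
    print("End-to-end virtual circuit is up")
    return True

setup_circuit({"bandwidth_gbps": 10, "vlan": 3012})
```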

64

Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
  – Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system
• Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net

65

8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
  – With each generation of network transport technology:
    • 155 Mb/s was the norm for high speed networks in 1995
    • 100 Gb/s – 650 times greater – is the norm today
  – R&D groups involving hardware engineers, computer scientists, and application specialists worked to
    • first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
    • and then do the development necessary for applications to make use of the new capabilities
  – Examples of how this methodology drove toward today's capabilities include:
    • experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
    • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s

66

Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations

67

The knowledge base
http://fasterdata.es.net topics:
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
  • Linux TCP Tuning (see the illustrative check below)
  • Cisco 6509 Tuning
  • perfSONAR Howto
  • Active perfSONAR Services
  • Globus overview
  • Say No to SCP
  • Data Transfer Nodes (DTN)
  • TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
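As a flavor of the host-tuning material there, the sketch below inspects the Linux TCP buffer limits that matter most on long, fast paths. The knob names are the standard Linux sysctl names; the "suggested" values are illustrative assumptions for a ~100 ms, 10 Gb/s path, not an authoritative recommendation – see fasterdata.es.net for current guidance.

```python
from pathlib import Path

# Illustrative check of Linux TCP buffer limits (target values are assumptions).
KNOBS = {
    "net.core.rmem_max": "67108864",              # max receive socket buffer (64 MB)
    "net.core.wmem_max": "67108864",              # max send socket buffer (64 MB)
    "net.ipv4.tcp_rmem": "4096 87380 67108864",   # min / default / max receive
    "net.ipv4.tcp_wmem": "4096 65536 67108864",   # min / default / max send
}

for knob, suggested in KNOBS.items():
    path = Path("/proc/sys") / knob.replace(".", "/")
    current = path.read_text().split() if path.exists() else ["(n/a)"]
    print(f"{knob}: current={' '.join(current)}  suggested~{suggested}")
```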

68

The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

69

Infrastructure Critical to Science
• The combination of:
  – New network architectures in the wide area
  – New network services (such as guaranteed bandwidth virtual circuits)
  – Cross-domain network error detection and correction
  – Redesigning the site LAN to handle high data throughput
  – Automation of data movement systems
  – Use of appropriate operating system tuning and data transfer tools
  now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKA
The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
– A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository –
  • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
  • high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
  – militate against a single large data center

72

LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (see the sketch after this list)
  – New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
• Re-engineering the site LAN/WAN architecture is critical: the Science DMZ
• Workflow management systems that automate the data movement will have to be designed and tested
  – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
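A rough sense of why error-free paths matter so much comes from the well-known Mathis et al. approximation for loss-limited TCP (Reno-style) throughput, rate ≈ (MSS/RTT) · 1.22/√p. The RTT and loss-rate values below are illustrative assumptions.

```python
import math

def mathis_gbps(mss_bytes=1500, rtt_ms=88.0, loss_rate=1e-5):
    """Approximate loss-limited TCP Reno throughput (Mathis et al.)."""
    rate_bps = (mss_bytes * 8 / (rtt_ms / 1000)) * 1.22 / math.sqrt(loss_rate)
    return rate_bps / 1e9

# 88 ms is roughly a US-Europe round trip time (illustrative).
print(round(mathis_gbps(loss_rate=1e-5), 3))  # ~0.053 Gb/s with 1 loss in 10^5 packets
print(round(mathis_gbps(loss_rate=1e-9), 2))  # ~5.26 Gb/s on a nearly loss-free path
```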

75

The Message
Again … a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, Robertson, D., Thompson, M., Lee, J., Tierney, B., Johnston, W., Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, 2006 – IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C., Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

31

System software tuning Host tuning ndash TCPbull Historically TCP window size tuning parameters were host-

global with exceptions configured per-socket by applicationsndash How to tune is a function of the application and the path to the

destination so potentially a lot of special cases

Auto-tuning TCP connection buffer size within pre-configured limits helps

Auto-tuning however is not a panacea because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (eg international) paths

32

System software tuning Host tuning ndash TCP

Throughput out to ~9000 km on a 10Gbs network32 MBy (auto-tuned) vs 64 MBy (hand tuned) TCP window size

hand tuned to 64 MBy window

Roundtrip time ms (corresponds roughlyto San Francisco to London)

path length

10000900080007000600050004000300020001000

0

Thro

ughp

ut M

bs

auto tuned to 32 MBy window

33

42) System software tuning Data transfer toolsParallelism is key in data transfer tools

ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection

bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)

ndash Several tools offer parallel transfers (see below)

Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN

transfersndash Many tools and protocols assume latencies typical of a LAN

environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long

path networks

bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more

than about 500 Mbs

34

System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL

RTT = 53 ms network capacity = 10GbpsTool Throughput

bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology

bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase

bull this helps rsync too

35

System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-

performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open

ports) ssh etc The newer Globus Online incorporates all of these and small file

support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community

outside of HEP

36

System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach

ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node

ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and

httpmonalisacernchFDT

37

44) System software tuning Other issuesFirewalls are anathema to high-peed data flows

ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for

TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo

Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf

bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning

bull Defaults are usually fine for 1GE but 10GE often requires additional tuning

ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo

([HPBulk])

5) Site infrastructure to support data-intensive scienceThe Science DMZ

With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the

bottleneckThe site network (LAN) typically provides connectivity for local

resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network

and the local area site network is critical for large-scale data movement

Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks

for business and small data-flow purposes usually donrsquot work for large-scale data flows

bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data

flows

39

The Science DMZTo provide high data-rate access to local resources the site

LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high

speed data path all the way back to the source

40

The Science DMZThe ScienceDMZ concept

The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and

rapid fault isolation typically perfSONAR (see [perfSONAR] and below)

A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)

This is so important it was a requirement for last round of NSF CC-NIE grants

41

The Science DMZ

(See httpfasterdataesnetscience-dmz

and [SDMZ] for a much more complete

discussion of the various approaches)

campus siteLAN

high performanceData Transfer Node

computing cluster

cleanhigh-bandwidthWAN data path

campussiteaccess to

Science DMZresources is via the site firewall

secured campussiteaccess to Internet

border routerWAN

Science DMZrouterswitch

campus site

Science DMZ

Site DMZ WebDNS

Mail

network monitoring and testing

A WAN-capable device

per-servicesecurity policycontrol points

site firewall

dedicated systems built and

tuned for wide-area data transfer

42

6) Data movement and management techniquesAutomated data movement is critical for moving 500

terabytesday between 170 international sites In order to effectively move large amounts of data over the

network automated systems must be used to manage workflow and error recovery

bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers

bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)

43

Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the

analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates

compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day

bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10

petabytes of datayear in order to accomplish its science

bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial

44

DDMAgent

DDMAgent

ATLAS production

jobs

Regional production

jobs

User Group analysis jobs

Data Service

Task Buffer(job queue)

Job Dispatcher

PanDA Server(task management)

Job Broker

Policy(job type priority)

ATLA

S Ti

er 1

Data

Cen

ters

11 s

ites

scat

tere

d ac

ross

Euro

pe N

orth

Am

erica

and

Asia

in

aggr

egat

e ho

ld 1

copy

of a

ll dat

a an

d pr

ovide

the

work

ing

data

set f

or d

istrib

ution

to T

ier 2

cen

ters

for a

nalys

isDistributed

Data Manager

Pilot Job(Panda job

receiver running under the site-

specific job manager)

Grid Scheduler

Site Capability Service

CERNATLAS detector

Tier 0 Data Center(1 copy of all data ndash

archival only)

Job resource managerbull Dispatch a ldquopilotrdquo job manager - a

Panda job receiver - when resources are available at a site

bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA

bull Similar to the Condor Glide-in approach

Site status

ATLAS analysis sites(eg 70 Tier 2 Centers in

Europe North America and SE Asia)

DDMAgent

DDMAgent

1) Schedules jobs initiates data movement

2) DDM locates data and moves it to sites

This is a complex system in its own right called DQ2

3) Prepares the local resources to receive Panda jobs

4) Jobs are dispatched when there are resources available and when the required data is

in place at the site

Thanks to Michael Ernst US ATLAS technical lead for his assistance with this

diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)

The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday

CERN

Try to move the job to where the data is else move data and job to where

resources are available

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe N America and SE Asia generate network

data movement of 730 TByday ~68Gbs

Accumulated data volume on disk

730 TBytesday

PanDA manages 120000ndash140000 simultaneous jobs    (PanDA manages two types of jobs that are shown separately here)

It is this scale of data movementgoing on 24 hrday 9+ monthsyr

that networks must support in order to enable the large-scale science of the LHC

0

50

100

150

Peta

byte

s

four years

0

50000

100000

type

2jo

bs

0

50000

type

1jo

bs

one year

one year

one year

46

Building an LHC-scale production analysis system In order to debug and optimize the distributed system that

accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in

ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC

production

47

Ramp-up of LHC traffic in ESnet

(est of ldquosmallrdquo scale traffic)

LHC

turn

-on

LHC data systemtesting

LHC operationThe transition from testing to operation

was a smooth continuum due toat-scale testing ndash a process that took

more than 5 years

48

6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument

to data centers ndash a dedicated purpose-built infrastructure is needed

bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to

the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the

Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward

exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community

bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by

bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])

ndash that is only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations. (An illustrative host-tuning check is sketched below.)
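As one illustration of the kind of host tuning documented there, the sketch below reads a Linux host's current TCP buffer limits from /proc and compares them with large-buffer values of the sort recommended for long, fast paths. The "suggested" numbers are illustrative assumptions, not official guidance; consult fasterdata.es.net for current recommendations.

# Print a Linux host's current TCP buffer limits and compare them with
# illustrative large-buffer values of the kind recommended for high-RTT,
# high-bandwidth paths. The values below are examples, not official guidance.

from pathlib import Path

suggested = {
    "net.core.rmem_max": "67108864",                # 64 MB max receive buffer (example)
    "net.core.wmem_max": "67108864",                # 64 MB max send buffer (example)
    "net.ipv4.tcp_rmem": "4096 87380 67108864",     # min/default/max receive (example)
    "net.ipv4.tcp_wmem": "4096 65536 67108864",     # min/default/max send (example)
}

for key, want in suggested.items():
    path = Path("/proc/sys") / key.replace(".", "/")
    try:
        current = path.read_text().split()
    except OSError:
        current = ["<not available on this system>"]
    print(f"{key}:")
    print(f"  current:   {' '.join(current)}")
    print(f"  suggested: {want}")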

68

The Message
• A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
• Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

69

Infrastructure Critical to Science
• The combination of:
– new network architectures in the wide area,
– new network services (such as guaranteed-bandwidth virtual circuits),
– cross-domain network error detection and correction,
– redesigning the site LAN to handle high data throughput,
– automation of data movement systems, and
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.

70

LHC lessons of possible use to the SKA: the similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated and sent to a single location, and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.

71

LHC lessons of possible use to the SKA: the lessons
• The science data product (the output of the supercomputer center, in the SKA case) is likely too large to keep the working data set in one location.
– A deep archive (tape-only) copy is probably practical in one location (e.g., the SKA supercomputer center), and this is done at CERN for the LHC.
– The technical aspects of building and operating a centralized working data repository –
• a large mass storage system with very large cache disks in order to satisfy current requests in an acceptable time, and
• high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites –
militate against a single large data center.

72

LHC lessons of possible use to the SKA
• The LHC model of distributed data (multiple regional centers) has worked well:
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.

73

LHC lessons of possible use to the SKA
• Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
– It might be that, in the case of the SKA, the T1 links would come to a centralized data-distribution-only node – say, in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue. (A back-of-the-envelope sketch of the division follows below.)
– In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
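A rough illustration of what dividing such a flow implies for the distribution node and for each Tier 1 link: only the 100 Gb/s figure comes from the slide; the number of Tier 1 centers used below is a hypothetical parameter, not an SKA design value.

# Back-of-the-envelope division of a 100 Gb/s instrument flow among Tier 1 centers.

flow_gbps = 100                      # aggregate flow from the telescope site (from the slide)
n_tier1 = 10                         # hypothetical number of Tier 1 data centers

per_t1_gbps = flow_gbps / n_tier1
daily_tb = flow_gbps / 8 * 86400 / 1000   # Gb/s -> GB/s -> GB/day -> TB/day

print(f"Per-Tier-1 average rate: {per_t1_gbps:.0f} Gb/s")
print(f"Data volume per day at {flow_gbps} Gb/s: {daily_tb:,.0f} TB/day (~{daily_tb/1000:.1f} PB/day)")
print(f"Distribution node must carry ~{2*flow_gbps} Gb/s (in from the site, out to the T1s)")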

74

LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (The sketch below shows why even tiny loss rates matter on such paths.)
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
• Re-engineering the site LAN-WAN architecture is critical: the Science DMZ.
• Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
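The reason such paths must be kept error-free is captured by the well-known Mathis et al. bound on standard TCP throughput, rate ≤ (MSS/RTT)·(C/√loss). The sketch below evaluates it for an illustrative 150 ms intercontinental round-trip time; the MSS, RTT, and loss values are assumptions chosen only to show the scale of the effect.

# Mathis et al. bound on standard (Reno-like) TCP throughput:
#   rate <= (MSS / RTT) * (C / sqrt(loss)),  with C a constant of order one.
# We take C = 1 for a rough estimate; RTT and loss values are illustrative.

from math import sqrt

def tcp_bound_mbps(mss_bytes=1500, rtt_ms=150, loss=1e-5, c=1.0):
    rate_bps = (mss_bytes * 8) / (rtt_ms / 1000.0) * (c / sqrt(loss))
    return rate_bps / 1e6

# ~150 ms RTT is typical of a long intercontinental path.
for loss in (1e-3, 1e-4, 1e-5, 1e-6):
    print(f"loss rate {loss:7.0e}: single-stream TCP limited to ~{tcp_bound_mbps(loss=loss):7.1f} Mb/s")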

75

The Message
Again …
• A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
• Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

  • Foundations of data-intensive science: Technology and practice
  • Data-Intensive Science in DOE's Office of Science
  • DOE Office of Science and ESnet – the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport: The limitations of TCP must be addressed for
  • Transport
  • Transport: Impact of packet loss on TCP
  • Transport: Modern TCP stack
  • Transport: Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 4.1) System software tuning: Host tuning – TCP
  • System software tuning: Host tuning – TCP
  • System software tuning: Host tuning – TCP
  • 4.2) System software tuning: Data transfer tools
  • System software tuning: Data transfer tools
  • System software tuning: Data transfer tools (2)
  • System software tuning: Data transfer tools (3)
  • 4.4) System software tuning: Other issues
  • 5) Site infrastructure to support data-intensive science: The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN – Optical Private Network
  • The LHC OPN – Optical Private Network (2)
  • The LHC OPN – Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHC's Open Network Environment – LHCONE
  • Slide 54
  • The LHC's Open Network Environment – LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits – How They Use Them
  • End User View of Circuits – How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide R&D consulting and knowledge base
  • Provide R&D consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 33: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

33

42) System software tuning Data transfer toolsParallelism is key in data transfer tools

ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection

bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)

ndash Several tools offer parallel transfers (see below)

Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN

transfersndash Many tools and protocols assume latencies typical of a LAN

environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long

path networks

bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more

than about 500 Mbs

34

System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL

RTT = 53 ms network capacity = 10GbpsTool Throughput

bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology

bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase

bull this helps rsync too

35

System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-

performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open

ports) ssh etc The newer Globus Online incorporates all of these and small file

support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community

outside of HEP

36

System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach

ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node

ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and

httpmonalisacernchFDT

37

44) System software tuning Other issuesFirewalls are anathema to high-peed data flows

ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for

TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo

Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf

bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning

bull Defaults are usually fine for 1GE but 10GE often requires additional tuning

ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo

([HPBulk])

5) Site infrastructure to support data-intensive scienceThe Science DMZ

With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the

bottleneckThe site network (LAN) typically provides connectivity for local

resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network

and the local area site network is critical for large-scale data movement

Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks

for business and small data-flow purposes usually donrsquot work for large-scale data flows

bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data

flows

39

The Science DMZTo provide high data-rate access to local resources the site

LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high

speed data path all the way back to the source

40

The Science DMZThe ScienceDMZ concept

The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and

rapid fault isolation typically perfSONAR (see [perfSONAR] and below)

A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)

This is so important it was a requirement for last round of NSF CC-NIE grants

41

The Science DMZ

(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches)

[Diagram] The Science DMZ model: the border router connects the WAN to a Science DMZ router/switch (a WAN-capable device) that provides a clean, high-bandwidth WAN data path to dedicated systems built and tuned for wide-area data transfer – the high-performance Data Transfer Node(s) and a computing cluster – together with network monitoring and testing; security is applied at per-service security policy control points; campus/site access to Science DMZ resources, the campus/site LAN, the Site DMZ (Web, DNS, Mail), and secured campus/site access to the Internet are via the site firewall.

42

6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites. In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery (see the retry sketch below)
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers
• The Tier 2 sites get a comparable amount of data from the Tier 1s
– Host the physics groups that analyze the data and do the science
– Provide most of the compute resources for analysis
– Cache the data (though this is evolving to remote I/O)
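As a small illustration of the automated error recovery mentioned above, the retry sketch below wraps a transfer step in an exponential-backoff loop. It is purely illustrative: do_transfer() is a placeholder, not any real data-movement tool, and a production workflow system would also track state, re-queue work, and alert operators.

# Illustrative only: automated retry with exponential backoff around a transfer step.
import random
import time

def do_transfer(src, dst):
    """Placeholder transfer step; pretend it fails 30% of the time."""
    if random.random() < 0.3:
        raise RuntimeError("transfer failed")

def transfer_with_retry(src, dst, max_attempts=5):
    delay = 1.0                                   # short delay for the demo; real systems wait longer
    for attempt in range(1, max_attempts + 1):
        try:
            do_transfer(src, dst)
            return True                           # success
        except RuntimeError as err:
            print(f"attempt {attempt} failed ({err}); retrying in {delay:.0f} s")
            time.sleep(delay)
            delay *= 2                            # exponential backoff
    return False                                  # give up and flag for an operator

transfer_with_retry("site-a:/data/file", "site-b:/data/file")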

43

Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management
– The resources and data movement are centrally managed
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations
– The system manages tens of thousands of jobs a day
• coordinates data movement of hundreds of terabytes/day, and
• manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial

44

DDMAgent

DDMAgent

ATLAS production

jobs

Regional production

jobs

User Group analysis jobs

Data Service
Task Buffer (job queue)
Job Dispatcher
PanDA Server (task management)
Job Broker
Policy (job type, priority)

ATLAS Tier 1 Data Centers: 11 sites scattered across Europe, North America, and Asia; in aggregate they hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis

Distributed Data Manager

Pilot Job (PanDA job receiver, running under the site-specific job manager)

Grid Scheduler

Site Capability Service

CERN ATLAS detector
Tier 0 Data Center (1 copy of all data – archival only)

Job resource manager:
• Dispatch a "pilot" job manager – a PanDA job receiver – when resources are available at a site
• Pilots run under the local site job manager (e.g. Condor, LSF, LCG…) and accept jobs in a standard format from PanDA
• Similar to the Condor Glide-in approach

Site status

ATLAS analysis sites (e.g. 70 Tier 2 Centers in Europe, North America, and SE Asia)

DDMAgent

DDMAgent

1) Schedules jobs, initiates data movement
2) DDM locates data and moves it to sites (this is a complex system in its own right, called DQ2)
3) Prepares the local resources to receive PanDA jobs
4) Jobs are dispatched when there are resources available and when the required data is in place at the site

Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)

The ATLAS PanDA "Production and Distributed Analysis" system uses distributed resources and layers of automation to manage several million jobs/day

CERN

Try to move the job to where the data is; else move data and job to where resources are available
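The placement rule in the note above ("move the job to the data, else move the data to where resources are available") can be sketched in a few lines. This is an illustrative toy, not PanDA or DQ2 code; the site names, replica catalog, and job structure are all hypothetical.

# Toy brokerage sketch: prefer sites that already hold the dataset and have free
# CPU slots; otherwise pick the least-loaded site and queue a data transfer to it.
sites = {                                  # hypothetical site state: free job slots
    "SiteA": {"free_slots": 120},
    "SiteB": {"free_slots": 0},
    "SiteC": {"free_slots": 45},
}
replica_catalog = {                        # hypothetical dataset -> sites holding a replica
    "dataset-42": {"SiteB", "SiteC"},
}

def broker(job):
    """Return (site, transfer_needed) for a job that reads job['dataset']."""
    holders = replica_catalog.get(job["dataset"], set())
    # 1) Prefer a site that already has the data and has capacity.
    with_data = [s for s in holders if sites[s]["free_slots"] > 0]
    if with_data:
        return max(with_data, key=lambda s: sites[s]["free_slots"]), False
    # 2) Otherwise send the job where there is capacity and move the data there.
    candidates = [s for s in sites if sites[s]["free_slots"] > 0]
    target = max(candidates, key=lambda s: sites[s]["free_slots"])
    return target, True

site, needs_transfer = broker({"dataset": "dataset-42"})
print(site, "- data transfer needed:", needs_transfer)   # -> SiteC - data transfer needed: False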

45

Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately here)
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC

[Plots: accumulated data volume on disk, rising to ~150 petabytes over four years, with the 730 TBytes/day movement rate indicated; and the two PanDA job types, roughly 50,000–100,000 simultaneous jobs of each, shown over one year]
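As a sanity check on the headline rate, converting the daily volume to an average bit rate is simple arithmetic; the short calculation below (not part of the original slides) makes it explicit.

# 730 terabytes/day expressed as an average bit rate.
terabytes_per_day = 730
bits_per_day = terabytes_per_day * 1e12 * 8       # 5.84e15 bits
seconds_per_day = 24 * 60 * 60                    # 86,400 s

gbps = bits_per_day / seconds_per_day / 1e9
print(f"{gbps:.1f} Gb/s")                         # -> 67.6 Gb/s, i.e. the ~68 Gb/s quoted above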

46

Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges"
– Successful testing was required for sites to participate in LHC production

47

Ramp-up of LHC traffic in ESnet
[Chart annotations: estimate of "small" scale traffic; LHC turn-on; LHC data system testing; LHC operation]
The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years

48

6 cont.) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN
– The LHCOPN is a collection of leased 10 Gb/s optical circuits
– The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community
• The issues related to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance. The security issues were the primary ones, and were addressed by
• using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
– that is, only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASGC

IT-INFN-CNAF

CH-CERN
LHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN – Optical Private Network
N.B.
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below)
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
– (there are about 170 Tier 2 sites)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose

53

The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineers
– To ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC

54

ESnet USA
Chicago
New York  BNL-T1
Internet2 USA
Harvard
CANARIE Canada
UVic
SimFraU
TRIUMF-T1
UAlb  UTor  McGilU
Seattle
TWAREN Taiwan
NCU  NTU
ASGC Taiwan
ASGC-T1
KREONET2 Korea
KNU
LHCONE VRF domain
End sites – LHC Tier 2 or Tier 3 unless indicated as Tier 1
Regional R&E communication nexus
Data communication links: 10, 20, and 30 Gb/s
See http://lhcone.net for details
NTU
Chicago
LHCONE: A global infrastructure for the LHC Tier 1 data center – Tier 2 analysis center connectivity
NORDUnet Nordic
NDGF-T1a  NDGF-T1a  NDGF-T1c
DFN Germany
DESY  GSI  DE-KIT-T1
GARR Italy
INFN-Nap  CNAF-T1
RedIRIS Spain
PIC-T1
SARA Netherlands
NIKHEF-T1
RENATER France
GRIF-IN2P3
Washington
CUDI Mexico
UNAM
CC-IN2P3-T1  Sub-IN2P3
CEA
CERN Geneva
CERN-T1
SLAC
GLakes
NE
MidW  SoW
Geneva
KISTI Korea
TIFR India
India
Korea
FNAL-T1
MIT
Caltech  UFlorida
UNeb  PurU
UCSD  UWisc
UltraLight  UMich
Amsterdam
GÉANT Europe
April 2012

55

The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance
• See lhcone.net

LHCONE is one part of the network infrastructure that supports the LHC

CERN → T1: miles / km
France: 350 / 565
Italy: 570 / 920
UK: 625 / 1000
Netherlands: 625 / 1000
Germany: 700 / 1185
Spain: 850 / 1400
Nordic: 1300 / 2100
USA – New York: 3900 / 6300
USA – Chicago: 4400 / 7100
Canada – BC: 5200 / 8400
Taiwan: 6100 / 9850

CERN Computer Center
The LHC Optical Private Network (LHCOPN)
LHC Tier 1 Data Centers
LHC Tier 2 Analysis Centers
Universities / physics groups (one label per Tier 2 analysis center in the figure)
The LHC Open Network Environment (LHCONE)
50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS)
detector
Level 1 and 2 triggers
Level 3 trigger
O(1-10) meters
O(10-100) meters
O(1) km
1 PB/s
500-10,000 km
This is intended to indicate that the physics groups now get their data wherever it is most readily available

A Network Centric View of the LHC

Taiwan  Canada  USA-Atlas  USA-CMS
Nordic
UK
Netherlands  Germany  Italy
Spain
France  CERN

57

7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed application systems in order to
– couple existing pockets of code, data, and expertise into "systems of systems"
– break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
– see https://www.es.net/about/science-requirements
A commonly identified need to support this is that networking must be provided as a "service" (a hypothetical request sketch follows below):
– Schedulable with guaranteed bandwidth – as is done with CPUs and disks
– Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
– Some network path characteristics may also be specified – e.g. diversity
– Available in the Web Services / Grid Services paradigm
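The hypothetical request sketch referenced above: a guaranteed-bandwidth circuit request expressed as structured data. This is purely illustrative; it is not the OSCARS or NSI API, and every field name here is invented for the example.

# Hypothetical guaranteed-bandwidth circuit request, expressed as plain data.
from datetime import datetime, timedelta, timezone
import json

start = datetime.now(timezone.utc) + timedelta(hours=2)

request = {
    "src_endpoint": "site-a-dtn.example.net",     # hypothetical endpoints
    "dst_endpoint": "site-b-dtn.example.net",
    "bandwidth_mbps": 5000,                       # schedulable, guaranteed 5 Gb/s
    "start_time": start.isoformat(),
    "end_time": (start + timedelta(hours=8)).isoformat(),
    "path_constraints": {"diverse_from": []},     # e.g. a path-diversity requirement
}

print(json.dumps(request, indent=2))              # would be submitted to the circuit service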

58

Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
– This is typically done by using a "static" routing mechanism
• e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
– MPLS and OpenFlow are examples of this, and both can transport IP packets
– Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
– The virtual circuits can be directed to specific physical network paths when they are set up

59

Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead, Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references
• OSCARS received a 2013 "R&D 100" award

60

End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part
• How are the circuits used?
– End system to end system IP
• Almost never – very hard unless private address space is used
– Using public address space can result in leaking routes
– Using private address space with multi-homed hosts risks allowing backdoors into secure networks
– End system to end system Ethernet (or other) over VLAN – a pseudowire
• Relatively common
• Interesting example: RDMA over VLAN is likely to be popular in the future
– SC11 demo of 40G RDMA over the WAN was very successful
– CPU load for RDMA is a small fraction of that of IP
– The guaranteed network characteristics (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fit nicely with circuit services (RDMA performs very poorly on best-effort networks)
– Point-to-point connection between routing instances – e.g. BGP at the end points
• Essentially this is how all current circuits are used, from one site router to another site router
– Typically site-to-site, or to advertise subnets that host clusters, e.g. LHC analysis or data management clusters

61

End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Service
• Large-scale science always involves institutions in multiple network domains (administrative units)
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
– e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains

63

Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingress/egress point

data plane connection helper at each domain ingress/egress point

1. The domains exchange topology information containing at least the potential VC ingress and egress points
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved (a toy sketch of this chained setup follows below)
3. The data plane connection (e.g. an Ethernet VLAN-to-VLAN connection) is facilitated by a helper process
The end-to-end virtual circuit

AutoBAHN
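The chained, domain-by-domain setup in the numbered steps above can be sketched as a simple loop over per-domain reservations. This is an illustrative toy, not the IDC or NSI protocol; the domain list and the reserve_segment() behavior are hypothetical.

# Toy sketch of inter-domain VC setup: each domain controller reserves its own
# segment and, if that succeeds, the request passes to the next domain in the path.
path = ["DomainA", "DomainB", "DomainC", "DomainD"]    # domains between the two end sites

def reserve_segment(domain, request):
    """Stand-in for a per-domain controller (e.g. an OSCARS-like scheduler)."""
    print(f"{domain}: reserving {request['bandwidth_mbps']} Mb/s "
          f"from {request['start']} to {request['end']}")
    return True                                        # pretend the reservation succeeds

def setup_circuit(path, request):
    reserved = []
    for domain in path:                                # request passed from domain to domain
        if not reserve_segment(domain, request):
            for d in reversed(reserved):               # tear down already-reserved segments
                print(f"{d}: releasing reservation")
            return False
        reserved.append(domain)
    return True                                        # all segments authorized and reserved

ok = setup_circuit(path, {"bandwidth_mbps": 5000,
                          "start": "2014-04-01T00:00Z", "end": "2014-04-01T08:00Z"})
print("end-to-end circuit established:", ok)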

64

Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
– Testing is being coordinated in GLIF (the Global Lambda Integrated Facility, an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system
• Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net

65

8) Provide R&D, consulting, and a knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
– With each generation of network transport technology:
• 155 Mb/s was the norm for high-speed networks in 1995
• 100 Gb/s – 650 times greater – is the norm today
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
• and then do the development necessary for applications to make use of the new capabilities
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC-12 (622 Mb/s) wide area network paths
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s

66

Provide R&D, consulting, and a knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations

67

The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations

68

The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis
Many of the technologies and the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

69

Infrastructure Critical to Science
• The combination of
– new network architectures in the wide area
– new network services (such as guaranteed-bandwidth virtual circuits)
– cross-domain network error detection and correction
– redesigning the site LAN to handle high data throughput
– automation of data movement systems
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKA: The similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKA: The lessons
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
– A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository:
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
• high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
argue against a single large data center

72

LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites

73

LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply
There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
• It might be that, in the case of the SKA, the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, e.g., are implementing LHCONE

74

LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
Re-engineering the site LAN/WAN architecture is critical: the Science DMZ
Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on

75

The Message
Again… A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis
Many of the technologies and the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston

Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz

[fasterdata] See httpfasterdataesnetfasterdataperfSONAR

[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more

[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory

[LHCONE] httplhconenet

[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo

[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

[OIF1] OIF-FD-100G-DWDM-010 - 100G Ultra Long Haul DWDM Framework Document (June 2009) httpwwwoiforumcompublicdocumentsOIF-FD-100G-DWDM-010pdf

77

References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

78

References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations

httpwwwperfsonarnet

httppspsperfsonarnet

[REQ] httpswwwesnetaboutscience-requirements

[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010

(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )

[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223

[Tracy1] httpwwwnanogorgmeetingsnanog55presentationsTuesdayTracypdf


34

System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL

RTT = 53 ms network capacity = 10GbpsTool Throughput

bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology

bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase

bull this helps rsync too

35

System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-

performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open

ports) ssh etc The newer Globus Online incorporates all of these and small file

support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community

outside of HEP

36

System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach

ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node

ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and

httpmonalisacernchFDT

37

44) System software tuning Other issuesFirewalls are anathema to high-peed data flows

ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for

TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo

Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf

bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning

bull Defaults are usually fine for 1GE but 10GE often requires additional tuning

ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo

([HPBulk])

5) Site infrastructure to support data-intensive scienceThe Science DMZ

With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the

bottleneckThe site network (LAN) typically provides connectivity for local

resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network

and the local area site network is critical for large-scale data movement

Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks

for business and small data-flow purposes usually donrsquot work for large-scale data flows

bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data

flows

39

The Science DMZTo provide high data-rate access to local resources the site

LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high

speed data path all the way back to the source

40

The Science DMZThe ScienceDMZ concept

The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and

rapid fault isolation typically perfSONAR (see [perfSONAR] and below)

A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)

This is so important it was a requirement for last round of NSF CC-NIE grants

41

The Science DMZ

(See httpfasterdataesnetscience-dmz

and [SDMZ] for a much more complete

discussion of the various approaches)

campus siteLAN

high performanceData Transfer Node

computing cluster

cleanhigh-bandwidthWAN data path

campussiteaccess to

Science DMZresources is via the site firewall

secured campussiteaccess to Internet

border routerWAN

Science DMZrouterswitch

campus site

Science DMZ

Site DMZ WebDNS

Mail

network monitoring and testing

A WAN-capable device

per-servicesecurity policycontrol points

site firewall

dedicated systems built and

tuned for wide-area data transfer

42

6) Data movement and management techniquesAutomated data movement is critical for moving 500

terabytesday between 170 international sites In order to effectively move large amounts of data over the

network automated systems must be used to manage workflow and error recovery

bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers

bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)

43

Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the

analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates

compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day

bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10

petabytes of datayear in order to accomplish its science

bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial

44

DDMAgent

DDMAgent

ATLAS production

jobs

Regional production

jobs

User Group analysis jobs

Data Service

Task Buffer(job queue)

Job Dispatcher

PanDA Server(task management)

Job Broker

Policy(job type priority)

ATLA

S Ti

er 1

Data

Cen

ters

11 s

ites

scat

tere

d ac

ross

Euro

pe N

orth

Am

erica

and

Asia

in

aggr

egat

e ho

ld 1

copy

of a

ll dat

a an

d pr

ovide

the

work

ing

data

set f

or d

istrib

ution

to T

ier 2

cen

ters

for a

nalys

isDistributed

Data Manager

Pilot Job(Panda job

receiver running under the site-

specific job manager)

Grid Scheduler

Site Capability Service

CERNATLAS detector

Tier 0 Data Center(1 copy of all data ndash

archival only)

Job resource managerbull Dispatch a ldquopilotrdquo job manager - a

Panda job receiver - when resources are available at a site

bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA

bull Similar to the Condor Glide-in approach

Site status

ATLAS analysis sites(eg 70 Tier 2 Centers in

Europe North America and SE Asia)

DDMAgent

DDMAgent

1) Schedules jobs initiates data movement

2) DDM locates data and moves it to sites

This is a complex system in its own right called DQ2

3) Prepares the local resources to receive Panda jobs

4) Jobs are dispatched when there are resources available and when the required data is

in place at the site

Thanks to Michael Ernst US ATLAS technical lead for his assistance with this

diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)

The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday

CERN

Try to move the job to where the data is else move data and job to where

resources are available

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe N America and SE Asia generate network

data movement of 730 TByday ~68Gbs

Accumulated data volume on disk

730 TBytesday

PanDA manages 120000ndash140000 simultaneous jobs    (PanDA manages two types of jobs that are shown separately here)

It is this scale of data movementgoing on 24 hrday 9+ monthsyr

that networks must support in order to enable the large-scale science of the LHC

0

50

100

150

Peta

byte

s

four years

0

50000

100000

type

2jo

bs

0

50000

type

1jo

bs

one year

one year

one year

46

Building an LHC-scale production analysis system In order to debug and optimize the distributed system that

accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in

ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC

production

47

Ramp-up of LHC traffic in ESnet

(est of ldquosmallrdquo scale traffic)

LHC

turn

-on

LHC data systemtesting

LHC operationThe transition from testing to operation

was a smooth continuum due toat-scale testing ndash a process that took

more than 5 years

48

6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument

to data centers ndash a dedicated purpose-built infrastructure is needed

bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to

the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the

Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward

exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community

bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by

bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])

ndash that is only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide R&D consulting and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
– With each generation of network transport technology:
• 155 Mb/s was the norm for high speed networks in 1995
• 100 Gb/s – 650 times greater – is the norm today
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
• and then to do the development necessary for applications to make use of the new capabilities
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s

66

Provide R&D consulting and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations

67

The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained

• fasterdata.es.net is a community project with contributions from several organizations
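As a taste of the host-tuning material: the TCP socket buffer a transfer needs is set by the bandwidth-delay product of the path. The sketch below is illustrative only; the concrete recommended settings live on fasterdata.es.net, not here.

```python
# Bandwidth-delay product (BDP): the TCP buffer needed to keep a path full.
# Illustrative only - the actual recommended Linux settings (net.core.rmem_max,
# net.ipv4.tcp_rmem, etc.) are documented on fasterdata.es.net.

def bdp_bytes(bandwidth_gbps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product in bytes for a path of the given speed and RTT."""
    return (bandwidth_gbps * 1e9 / 8) * (rtt_ms / 1000.0)

for gbps, rtt in [(1, 10), (10, 100), (100, 150)]:
    print(f"{gbps:>3} Gb/s at {rtt:>3} ms RTT needs "
          f"~{bdp_bytes(gbps, rtt) / 2**20:,.0f} MiB of socket buffer")

# A default 64 KiB window limits a 100 ms RTT path to roughly 5 Mb/s,
# which is why untuned hosts cannot fill 10-100 Gb/s WAN paths.
```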

68

The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

69

Infrastructure Critical to Science
• The combination of
– New network architectures in the wide area
– New network services (such as guaranteed bandwidth virtual circuits)
– Cross-domain network error detection and correction
– Redesigning the site LAN to handle high data throughput
– Automation of data movement systems
– Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

• Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKA
The lessons:
• The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
– A deep archive (tape only) copy is probably practical in one location (e.g., the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository:
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
• high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
militate against a single large data center

72

LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites

73

LHC lessons of possible use to the SKA
Regardless of distributed vs. centralized working data repository, all of the attendant network lessons will apply:
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
• It might be that in the case of the SKA the T1 links would come to a centralized, data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue (see the sizing sketch below).
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
– In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE
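A back-of-the-envelope sketch of what the 100 Gb/s figure implies for such a distribution node (the Tier 1 count used here is an arbitrary example, not an SKA design parameter):

```python
# Rough sizing of a centralized data-distribution node for a 100 Gb/s flow.
# The only input taken from the text is the 100 Gb/s figure; the Tier 1 count
# is a hypothetical example value.
link_gbps = 100          # sustained flow from the telescope site (from the text)
seconds_per_day = 86400
n_tier1 = 10             # hypothetical number of Tier 1 data centers

bytes_per_day = link_gbps * 1e9 / 8 * seconds_per_day
print(f"{bytes_per_day / 1e15:.2f} PB/day arriving at the distribution node")
print(f"~{link_gbps / n_tier1:.0f} Gb/s average per Tier 1 if split evenly,")
print("plus the full 100 Gb/s inbound - so the node needs >200 Gb/s of WAN capacity.")
```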

74

LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (the sketch below shows why). New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
• Re-engineering the site LAN-WAN architecture is critical: the Science DMZ
• Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
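The sensitivity to loss can be estimated with the well-known Mathis et al. approximation for loss-limited TCP throughput – a rough model of classic Reno-style TCP rather than of modern stacks, but it shows the scale of the effect on long-RTT paths:

```python
# Mathis et al. approximation for loss-limited (Reno-style) TCP throughput:
#   throughput ~ MSS / (RTT * sqrt(p))   (constant factor of order 1 omitted)
# It shows why long-RTT, high-bandwidth paths must be kept essentially error-free.
from math import sqrt

def tcp_rate_gbps(mss_bytes: float, rtt_ms: float, loss: float) -> float:
    return (mss_bytes * 8 / (rtt_ms / 1000.0)) / sqrt(loss) / 1e9

mss = 1460  # bytes, typical for a 1500-byte MTU
for rtt in (10, 100):                     # ms: regional vs. trans-ocean path
    for loss in (1e-3, 1e-5, 1e-7):       # packet loss probability
        print(f"RTT {rtt:>3} ms, loss {loss:.0e}: "
              f"~{tcp_rate_gbps(mss, rtt, loss):7.3f} Gb/s")
# Even 1 packet in 100,000 lost holds a 100 ms path to well under 1 Gb/s.
```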

75

The Message
Again … a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, Robertson, D., Thompson, M., Lee, J., Tierney, B., Johnston, W., Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, 2006 – IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements

[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C., Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010
(may be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ

Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using

synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on

75

The MessageAgain hellipA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data

management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

76

References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston

Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz

[fasterdata] See httpfasterdataesnetfasterdataperfSONAR

[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more

[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory

[LHCONE] httplhconenet

[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo

[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

[OIF1] OIF-FD-100G-DWDM-010 - 100G Ultra Long Haul DWDM Framework Document (June 2009) httpwwwoiforumcompublicdocumentsOIF-FD-100G-DWDM-010pdf

77

References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

78

References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations

httpwwwperfsonarnet

httppspsperfsonarnet

[REQ] httpswwwesnetaboutscience-requirements

[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010

(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )

[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223

[Tracy1] httpwwwnanogorgmeetingsnanog55presentationsTuesdayTracypdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 36: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

36

System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach

ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node

ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and

httpmonalisacernchFDT

37

44) System software tuning Other issuesFirewalls are anathema to high-peed data flows

ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for

TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo

Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf

bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning

bull Defaults are usually fine for 1GE but 10GE often requires additional tuning

ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo

([HPBulk])

5) Site infrastructure to support data-intensive scienceThe Science DMZ

With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the

bottleneckThe site network (LAN) typically provides connectivity for local

resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network

and the local area site network is critical for large-scale data movement

Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks

for business and small data-flow purposes usually donrsquot work for large-scale data flows

bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data

flows

39

The Science DMZTo provide high data-rate access to local resources the site

LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high

speed data path all the way back to the source

40

The Science DMZThe ScienceDMZ concept

The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and

rapid fault isolation typically perfSONAR (see [perfSONAR] and below)

A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)

This is so important it was a requirement for last round of NSF CC-NIE grants

41

The Science DMZ

(See httpfasterdataesnetscience-dmz

and [SDMZ] for a much more complete

discussion of the various approaches)

campus siteLAN

high performanceData Transfer Node

computing cluster

cleanhigh-bandwidthWAN data path

campussiteaccess to

Science DMZresources is via the site firewall

secured campussiteaccess to Internet

border routerWAN

Science DMZrouterswitch

campus site

Science DMZ

Site DMZ WebDNS

Mail

network monitoring and testing

A WAN-capable device

per-servicesecurity policycontrol points

site firewall

dedicated systems built and

tuned for wide-area data transfer

42

6) Data movement and management techniquesAutomated data movement is critical for moving 500

terabytesday between 170 international sites In order to effectively move large amounts of data over the

network automated systems must be used to manage workflow and error recovery

bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers

bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)

43

Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the

analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates

compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day

bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10

petabytes of datayear in order to accomplish its science

bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial

44

[Figure: The ATLAS PanDA "Production and Distributed Analysis" system, which uses distributed resources and layers of automation to manage several million jobs/day. ATLAS production jobs, regional production jobs, and user/group analysis jobs enter the PanDA Server (task management) at CERN, which comprises the Task Buffer (job queue), Job Broker, policy (job type priority), Job Dispatcher, and Data Service, working with the Distributed Data Manager (DDM) agents. The CERN ATLAS detector feeds the Tier 0 Data Center (1 copy of all data – archival only). The ATLAS Tier 1 data centers – 11 sites scattered across Europe, North America and Asia – in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis; the ATLAS analysis sites are, e.g., the 70 Tier 2 centers in Europe, North America and SE Asia.
1) PanDA schedules jobs and initiates data movement
2) The DDM locates data and moves it to sites – this is a complex system in its own right, called DQ2
3) The Grid Scheduler and Site Capability Service prepare the local resources to receive PanDA jobs. The job resource manager dispatches a "pilot" job manager – a PanDA job receiver – when resources are available at a site; pilots run under the local site job manager (e.g. Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA – similar to the Condor glide-in approach
4) Jobs are dispatched when there are resources available and when the required data is in place at the site
The guiding rule: try to move the job to where the data is, else move data and job to where resources are available (a toy sketch of this rule follows).
(Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. Both are at Brookhaven National Lab.)]
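To make the brokerage rule concrete, here is a toy sketch – an illustration of the stated policy, not PanDA's actual broker – that places a job at a site that already holds the dataset if one has free slots, and otherwise picks a free site and schedules a data transfer to it. Site names and capacities are invented.

# Toy job brokerage: move the job to where the data is, else move the data
# and the job to where resources are available.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    free_slots: int
    datasets: set

def broker(job_dataset: str, sites: list) -> tuple:
    """Return (site, needs_transfer) for one job, or (None, False) if all full."""
    # 1) Prefer a site that already caches the dataset and has free slots.
    for s in sites:
        if job_dataset in s.datasets and s.free_slots > 0:
            return s, False
    # 2) Otherwise pick any site with free slots and move the data there.
    for s in sites:
        if s.free_slots > 0:
            return s, True
    return None, False

if __name__ == "__main__":
    tier2 = [Site("MWT2", 0, {"data15_13TeV"}),
             Site("AGLT2", 12, {"mc15_13TeV"})]
    site, needs_transfer = broker("data15_13TeV", tier2)
    print(site.name, "schedule DDM transfer first" if needs_transfer
          else "data already local")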

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe, N. America and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s.

PanDA manages 120,000–140,000 simultaneous jobs. (PanDA manages two types of jobs; they are shown separately in the plots, together with the accumulated data volume on disk, which grows by about 730 TBytes/day.)

It is this scale of data movement – going on 24 hr/day, 9+ months/yr – that networks must support in order to enable the large-scale science of the LHC.

[Plots: accumulated data volume on disk in petabytes (0–150 over four years), and the number of type 1 and type 2 jobs managed by PanDA (axes to ~50,000 and ~100,000), shown per one-year interval.]
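A quick check of the arithmetic behind those numbers (my calculation, not a figure from the slide) confirms that 730 TBytes/day corresponds to roughly 68 Gb/s of sustained load:

# 730 TBytes/day expressed as a sustained bit rate
tbytes_per_day = 730
bits = tbytes_per_day * 1e12 * 8          # terabytes -> bits
gbps = bits / 86400 / 1e9                 # per second, in Gb/s
print(f"{gbps:.1f} Gb/s sustained")       # ~67.6 Gb/s, i.e. ~68 Gb/s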

46

Building an LHC-scale production analysis system

In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure.
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges"
– Successful testing was required for sites to participate in LHC production

47

Ramp-up of LHC traffic in ESnet

[Chart: ESnet traffic over time, with an estimate of "small" scale traffic, annotated with the LHC turn-on and the periods of LHC data system testing and LHC operation.]

The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.

48

6 cont.) Evolution of network architectures

For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN
  – The LHCOPN is a collection of leased 10 Gb/s optical circuits
  – The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN – Optical Private Network

• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community
• The issues related to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance
• The security issues were the primary ones, and were addressed by using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
  – that is, only LHC data and compute servers are connected to the OPN

50

The LHC OPN – Optical Private Network

[Figure: LHCOPN physical topology (abbreviated) and LHCOPN architecture – CH-CERN linked to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]

51

The LHC OPN – Optical Private Network

N.B.:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays
  – The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below)
  – However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
– (there are about 170 Tier 2 sites)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose

53

The LHC's Open Network Environment – LHCONE

LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).

The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GEANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineers
– To ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC

54

[Figure: LHCONE – a global infrastructure for the LHC Tier 1 data center – Tier 2 analysis center connectivity (April 2012). The map shows the LHCONE VRF domains – ESnet (USA), Internet2 (USA), CANARIE (Canada), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), GÉANT (Europe), CUDI (Mexico), TWAREN and ASGC (Taiwan), KERONET2 and KISTI (Korea), TIFR (India) – interconnected at regional R&E communication nexus points such as Seattle, Chicago, New York, Washington, Amsterdam and Geneva. End sites are LHC Tier 2 or Tier 3 centers unless indicated as Tier 1 (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1a/c, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1). Data communication links are 10, 20 and 30 Gb/s. See http://lhcone.net for details.]

55

The LHC's Open Network Environment – LHCONE

• LHCONE could be set up relatively "quickly" because
  – the VRF technology is a standard capability in most core routers, and
  – there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance
• See lhcone.net

LHCONE is one part of the network infrastructure that supports the LHC

[Figure: A Network Centric View of the LHC. The detector produces ~1 PB/s into the Level 1 and 2 triggers (O(1–10) meters away) and then the Level 3 trigger (O(10–100) meters), which feeds the CERN Computer Center at O(1) km with 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS). The LHC Optical Private Network (LHCOPN) carries the data 500–10,000 km to the LHC Tier 1 data centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN), and the LHC Open Network Environment (LHCONE) connects the Tier 1s to the LHC Tier 2 analysis centers – the universities and physics groups – which now get their data wherever it is most readily available.]

CERN → T1 distances:
CERN → T1           miles    kms
France                350     565
Italy                 570     920
UK                    625    1000
Netherlands           625    1000
Germany               700    1185
Spain                 850    1400
Nordic               1300    2100
USA – New York       3900    6300
USA – Chicago        4400    7100
Canada – BC          5200    8400
Taiwan               6100    9850

57

7) New network services: Point-to-Point Virtual Circuit Service

Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to
  – Couple existing pockets of code, data, and expertise into "systems of systems"
  – Break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
  – See https://www.es.net/about/science-requirements/
• A commonly identified need to support this is that networking must be provided as a "service":
  – Schedulable with guaranteed bandwidth – as is done with CPUs and disks
  – Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
  – Some network path characteristics may also be specified – e.g. diversity
  – Available in a Web Services / Grid Services paradigm

58

Point-to-Point Virtual Circuit Service

The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
– This is typically done by using a "static" routing mechanism
  • e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
– MPLS and OpenFlow are examples of this, and both can transport IP packets
– Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
  – The virtual circuits can be directed to specific physical network paths when they are set up

A toy illustration of label-based forwarding along a pre-configured path follows.
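The sketch below is a minimal illustration of the label-switching idea (not router code): each switch holds a static table, installed in advance, that maps an incoming label to an outgoing port and label, so packets follow the pre-defined circuit path regardless of normal IP routing. Switch names and label values are invented.

# Toy label switching: static tables, installed in advance, define the path.
# Per-switch table: incoming label -> (next switch, outgoing label)
TABLES = {
    "ingress": {100: ("core-1", 210)},
    "core-1":  {210: ("core-2", 320)},
    "core-2":  {320: ("egress", 430)},
    "egress":  {430: (None, None)},      # pop the label, deliver locally
}

def forward(switch: str, label: int, payload: str) -> None:
    """Follow the pre-configured label-switched path hop by hop."""
    while switch is not None:
        nxt, out_label = TABLES[switch][label]
        print(f"{switch}: in-label {label} -> {nxt or 'deliver'} "
              f"(out-label {out_label})")
        switch, label = nxt, out_label

forward("ingress", 100, "1 GB data block")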

59

Point-to-Point Virtual Circuit Service

• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead, Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references
• OSCARS received a 2013 "R&D 100" award

60

End User View of Circuits – How They Use Them

• Who are the "users"?
  – Sites, for the most part
• How are the circuits used?
  – End system to end system, IP
    • Almost never – very hard unless private address space is used
      – Using public address space can result in leaking routes
      – Using private address space with multi-homed hosts risks allowing backdoors into secure networks
  – End system to end system, Ethernet (or other) over VLAN – a pseudowire
    • Relatively common
    • Interesting example: RDMA over VLAN, likely to be popular in the future
      – SC11 demo of 40G RDMA over WAN was very successful
      – CPU load for RDMA is a small fraction of that of IP
      – The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks)
  – Point-to-point connection between routing instances – e.g. BGP at the end points
    • Essentially this is how all current circuits are used: from one site router to another site router
    • Typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters

61

End User View of Circuits – How They Use Them

• When are the circuits used?
  – Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Service

• Large-scale science always involves institutions in multiple network domains (administrative units)
  – For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
  – e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains

63

Inter-Domain Control Protocol

• There are two realms involved:
  1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
  2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains

[Figure: An end-to-end virtual circuit from a user source at FNAL (AS3152) [US] across ESnet (AS293) [US] and GEANT (AS20965) [Europe] to DFN (AS680) [Germany] and a user destination at DESY (AS1754) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GEANT – with a data-plane connection helper at each domain ingress/egress point.]

1. The domains exchange topology information containing at least potential VC ingress and egress points
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved
3. The data plane connection (e.g. an Ethernet VLAN-to-VLAN connection) is facilitated by a helper process

A toy simulation of step 2 – the setup request passed along the chain of domains – is sketched below.
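The following is a minimal sketch, with invented domain names and capacities, of how a setup request can be passed domain to domain, with each segment authorized and reserved only if every domain along the path can commit the requested capacity. It illustrates the idea, not the actual IDC/NSI protocol.

# Toy inter-domain VC setup: the circuit exists only if every domain in the
# chain can reserve its segment; partial reservations are rolled back.
class Domain:
    def __init__(self, name: str, available_gbps: float):
        self.name = name
        self.available = available_gbps
        self.reserved = 0.0

    def reserve(self, gbps: float) -> bool:
        """Authorize and reserve this domain's segment of the circuit."""
        if self.available - self.reserved >= gbps:
            self.reserved += gbps
            return True
        return False

    def release(self, gbps: float) -> None:
        self.reserved -= gbps

def setup_circuit(path: list, gbps: float) -> bool:
    """Pass the setup request domain to domain; roll back on any refusal."""
    committed = []
    for dom in path:
        if dom.reserve(gbps):
            committed.append(dom)
        else:
            for d in committed:          # tear down partial reservations
                d.release(gbps)
            print(f"setup failed at {dom.name}")
            return False
    print("end-to-end circuit reserved:", " -> ".join(d.name for d in path))
    return True

if __name__ == "__main__":
    chain = [Domain("site-A", 10), Domain("backbone-1", 100),
             Domain("backbone-2", 40), Domain("site-B", 10)]
    setup_circuit(chain, gbps=8)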

64

Point-to-Point Virtual Circuit Service

• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
  – Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net

65

8) Provide R&D, consulting, and knowledge base

• R&D drove most of the advances that make it possible for the network to support data-intensive science
  – With each generation of network transport technology:
    • 155 Mb/s was the norm for high speed networks in 1995
    • 100 Gb/s – 650 times greater – is the norm today
  – R&D groups involving hardware engineers, computer scientists, and application specialists worked to
    • first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
    • and then do the development necessary for applications to make use of the new capabilities
  – Examples of how this methodology drove toward today's capabilities include:
    • experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
    • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s

66

Provide R&D, consulting, and knowledge base

Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.

Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.

The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.

67

The knowledge base

http://fasterdata.es.net topics:
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
  • Linux TCP Tuning
  • Cisco 6509 Tuning
  • perfSONAR Howto
  • Active perfSONAR Services
  • Globus overview
  • Say No to SCP
  • Data Transfer Nodes (DTN)
  • TCP Issues Explained
• fasterdata.es.net is a community project, with contributions from several organizations

68

The Message

A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

69

Infrastructure Critical to Science

• The combination of
  – new network architectures in the wide area,
  – new network services (such as guaranteed bandwidth virtual circuits),
  – cross-domain network error detection and correction,
  – redesigning the site LAN to handle high data throughput,
  – automation of data movement systems, and
  – use of appropriate operating system tuning and data transfer tools
  now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKA

The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKA

The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
– A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository:
  • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
  • high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
  militate against a single large data center

72

LHC lessons of possible use to the SKA

The LHC model of distributed data (multiple regional centers) has worked well
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites

73

LHC lessons of possible use to the SKA

Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
• It might be that in the case of the SKA the T1 links would come to a centralized, data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE

74

LHC lessons of possible use to the SKA

All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration. (The sketch below illustrates why loss matters so much on long paths.)
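As a back-of-the-envelope illustration (using the classic Mathis et al. approximation for Reno-style TCP, throughput ≲ MSS / (RTT · √loss), not anything SKA-specific), even a tiny loss rate caps a single stream far below the capacity of a 10 or 100 Gb/s path once the RTT reaches intercontinental scales:

# Approximate single-stream TCP throughput via the Mathis et al. formula:
# rate <= (MSS / RTT) / sqrt(loss). Illustrative RTT and loss values only.
from math import sqrt

MSS = 1460 * 8                      # 1460-byte segments expressed in bits

def mathis_gbps(rtt_ms: float, loss: float) -> float:
    return (MSS / (rtt_ms / 1000.0)) / sqrt(loss) / 1e9

for rtt in (10, 100, 200):          # ms: regional, transatlantic, transpacific-ish
    for loss in (1e-6, 1e-4):       # packet loss probability
        print(f"RTT {rtt:3d} ms, loss {loss:.0e}: "
              f"{mathis_gbps(rtt, loss):7.2f} Gb/s max per stream")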

Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.

Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on

75

The Message

Again... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

76

References

[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet, Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References

[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References

[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net/

http://psps.perfsonar.net/

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C., Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF )

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

37

44) System software tuning Other issuesFirewalls are anathema to high-peed data flows

ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for

TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo

Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf

bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning

bull Defaults are usually fine for 1GE but 10GE often requires additional tuning

ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo

([HPBulk])

5) Site infrastructure to support data-intensive scienceThe Science DMZ

With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the

bottleneckThe site network (LAN) typically provides connectivity for local

resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network

and the local area site network is critical for large-scale data movement

Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks

for business and small data-flow purposes usually donrsquot work for large-scale data flows

bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data

flows

39

The Science DMZTo provide high data-rate access to local resources the site

LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high

speed data path all the way back to the source

40

The Science DMZThe ScienceDMZ concept

The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and

rapid fault isolation typically perfSONAR (see [perfSONAR] and below)

A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)

This is so important it was a requirement for last round of NSF CC-NIE grants

41

The Science DMZ

(See httpfasterdataesnetscience-dmz

and [SDMZ] for a much more complete

discussion of the various approaches)

campus siteLAN

high performanceData Transfer Node

computing cluster

cleanhigh-bandwidthWAN data path

campussiteaccess to

Science DMZresources is via the site firewall

secured campussiteaccess to Internet

border routerWAN

Science DMZrouterswitch

campus site

Science DMZ

Site DMZ WebDNS

Mail

network monitoring and testing

A WAN-capable device

per-servicesecurity policycontrol points

site firewall

dedicated systems built and

tuned for wide-area data transfer

42

6) Data movement and management techniquesAutomated data movement is critical for moving 500

terabytesday between 170 international sites In order to effectively move large amounts of data over the

network automated systems must be used to manage workflow and error recovery

bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers

bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)

43

Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the

analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates

compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day

bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10

petabytes of datayear in order to accomplish its science

bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial

44

DDMAgent

DDMAgent

ATLAS production

jobs

Regional production

jobs

User Group analysis jobs

Data Service

Task Buffer(job queue)

Job Dispatcher

PanDA Server(task management)

Job Broker

Policy(job type priority)

ATLA

S Ti

er 1

Data

Cen

ters

11 s

ites

scat

tere

d ac

ross

Euro

pe N

orth

Am

erica

and

Asia

in

aggr

egat

e ho

ld 1

copy

of a

ll dat

a an

d pr

ovide

the

work

ing

data

set f

or d

istrib

ution

to T

ier 2

cen

ters

for a

nalys

isDistributed

Data Manager

Pilot Job(Panda job

receiver running under the site-

specific job manager)

Grid Scheduler

Site Capability Service

CERNATLAS detector

Tier 0 Data Center(1 copy of all data ndash

archival only)

Job resource managerbull Dispatch a ldquopilotrdquo job manager - a

Panda job receiver - when resources are available at a site

bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA

bull Similar to the Condor Glide-in approach

Site status

ATLAS analysis sites(eg 70 Tier 2 Centers in

Europe North America and SE Asia)

DDMAgent

DDMAgent

1) Schedules jobs initiates data movement

2) DDM locates data and moves it to sites

This is a complex system in its own right called DQ2

3) Prepares the local resources to receive Panda jobs

4) Jobs are dispatched when there are resources available and when the required data is

in place at the site

Thanks to Michael Ernst US ATLAS technical lead for his assistance with this

diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)

The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday

CERN

Try to move the job to where the data is else move data and job to where

resources are available

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe N America and SE Asia generate network

data movement of 730 TByday ~68Gbs

Accumulated data volume on disk

730 TBytesday

PanDA manages 120000ndash140000 simultaneous jobs    (PanDA manages two types of jobs that are shown separately here)

It is this scale of data movementgoing on 24 hrday 9+ monthsyr

that networks must support in order to enable the large-scale science of the LHC

0

50

100

150

Peta

byte

s

four years

0

50000

100000

type

2jo

bs

0

50000

type

1jo

bs

one year

one year

one year

46

Building an LHC-scale production analysis system In order to debug and optimize the distributed system that

accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in

ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC

production

47

Ramp-up of LHC traffic in ESnet

(est of ldquosmallrdquo scale traffic)

LHC

turn

-on

LHC data systemtesting

LHC operationThe transition from testing to operation

was a smooth continuum due toat-scale testing ndash a process that took

more than 5 years

48

6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument

to data centers ndash a dedicated purpose-built infrastructure is needed

bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to

the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the

Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward

exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community

bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by

bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])

ndash that is only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKA: Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that, in the case of the SKA, the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue (a rough bandwidth budget is sketched below).
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, e.g., are implementing LHCONE.
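To put the centralized distribution-node option in perspective, here is a minimal back-of-envelope sketch in Python. Only the 100 Gb/s figure comes from the text; the count of ten Tier 1 centers is an assumed parameter for illustration.

    # Rough bandwidth budget for a single data-distribution node that accepts
    # the full telescope flow and fans it out to Tier 1 centers.
    TELESCOPE_OUTPUT_GBPS = 100   # sustained flow from the telescope site (from the text)
    NUM_TIER1_CENTERS = 10        # assumed number of regional centers (hypothetical)

    SECONDS_PER_DAY = 24 * 3600
    # Daily volume the node must both receive and re-send (gigabytes -> terabytes)
    daily_volume_tb = TELESCOPE_OUTPUT_GBPS / 8 * SECONDS_PER_DAY / 1000
    per_t1_gbps = TELESCOPE_OUTPUT_GBPS / NUM_TIER1_CENTERS

    print(f"daily volume through the node: {daily_volume_tb:,.0f} TB in, {daily_volume_tb:,.0f} TB out")
    print(f"sustained rate per Tier 1 (even split): {per_t1_gbps:.0f} Gb/s")

The point is simply that such a node needs on the order of 200 Gb/s of sustained WAN capacity (the full flow in plus the full flow out), which is part of what makes this a cost and engineering trade-off.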

74

LHC lessons of possible use to the SKA: All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded.
All high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration. (A sketch of why even tiny loss rates matter on long paths follows this list.)
Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.
Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
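The loss sensitivity referred to above can be made concrete with the widely used Mathis et al. approximation for loss-limited TCP (Reno-style) throughput: rate ≤ (MSS/RTT) · (C/√p), with C ≈ 1.22. A minimal sketch in Python follows; the 1460-byte MSS, 90 ms RTT, and loss rates are illustrative assumptions, not measurements from this talk.

    import math

    def mathis_tcp_gbps(mss_bytes, rtt_s, loss_rate):
        """Mathis et al. bound on steady-state TCP throughput:
        rate <= (MSS / RTT) * (C / sqrt(p)), with C = sqrt(3/2) ~= 1.22."""
        C = math.sqrt(3.0 / 2.0)
        bytes_per_s = (mss_bytes / rtt_s) * (C / math.sqrt(loss_rate))
        return bytes_per_s * 8 / 1e9

    # A hypothetical trans-Atlantic-like path: 1460-byte MSS, 90 ms RTT
    for loss in (1e-8, 1e-6, 1e-4):
        print(f"packet loss {loss:.0e}: ~{mathis_tcp_gbps(1460, 0.090, loss):.3f} Gb/s per flow")

Because throughput falls off as 1/√p, loss rates far below anything a LAN user would notice limit a single long-RTT flow to a small fraction of a 10–100 Gb/s path, which is why constant monitoring and clean end-to-end paths are treated as hard requirements.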

75

The Message
Again … a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet, Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-010.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1–5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

Page 38: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

5) Site infrastructure to support data-intensive scienceThe Science DMZ

With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the

bottleneckThe site network (LAN) typically provides connectivity for local

resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network

and the local area site network is critical for large-scale data movement

Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks

for business and small data-flow purposes usually donrsquot work for large-scale data flows

bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data

flows

39

The Science DMZTo provide high data-rate access to local resources the site

LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high

speed data path all the way back to the source

40

The Science DMZThe ScienceDMZ concept

The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and

rapid fault isolation typically perfSONAR (see [perfSONAR] and below)

A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)

This is so important it was a requirement for last round of NSF CC-NIE grants

41

The Science DMZ

(See httpfasterdataesnetscience-dmz

and [SDMZ] for a much more complete

discussion of the various approaches)

campus siteLAN

high performanceData Transfer Node

computing cluster

cleanhigh-bandwidthWAN data path

campussiteaccess to

Science DMZresources is via the site firewall

secured campussiteaccess to Internet

border routerWAN

Science DMZrouterswitch

campus site

Science DMZ

Site DMZ WebDNS

Mail

network monitoring and testing

A WAN-capable device

per-servicesecurity policycontrol points

site firewall

dedicated systems built and

tuned for wide-area data transfer

42

6) Data movement and management techniquesAutomated data movement is critical for moving 500

terabytesday between 170 international sites In order to effectively move large amounts of data over the

network automated systems must be used to manage workflow and error recovery

bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers

bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)

43

Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the

analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates

compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day

bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10

petabytes of datayear in order to accomplish its science

bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial

44

DDMAgent

DDMAgent

ATLAS production

jobs

Regional production

jobs

User Group analysis jobs

Data Service

Task Buffer(job queue)

Job Dispatcher

PanDA Server(task management)

Job Broker

Policy(job type priority)

ATLA

S Ti

er 1

Data

Cen

ters

11 s

ites

scat

tere

d ac

ross

Euro

pe N

orth

Am

erica

and

Asia

in

aggr

egat

e ho

ld 1

copy

of a

ll dat

a an

d pr

ovide

the

work

ing

data

set f

or d

istrib

ution

to T

ier 2

cen

ters

for a

nalys

isDistributed

Data Manager

Pilot Job(Panda job

receiver running under the site-

specific job manager)

Grid Scheduler

Site Capability Service

CERNATLAS detector

Tier 0 Data Center(1 copy of all data ndash

archival only)

Job resource managerbull Dispatch a ldquopilotrdquo job manager - a

Panda job receiver - when resources are available at a site

bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA

bull Similar to the Condor Glide-in approach

Site status

ATLAS analysis sites(eg 70 Tier 2 Centers in

Europe North America and SE Asia)

DDMAgent

DDMAgent

1) Schedules jobs initiates data movement

2) DDM locates data and moves it to sites

This is a complex system in its own right called DQ2

3) Prepares the local resources to receive Panda jobs

4) Jobs are dispatched when there are resources available and when the required data is

in place at the site

Thanks to Michael Ernst US ATLAS technical lead for his assistance with this

diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)

The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday

CERN

Try to move the job to where the data is else move data and job to where

resources are available

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe N America and SE Asia generate network

data movement of 730 TByday ~68Gbs

Accumulated data volume on disk

730 TBytesday

PanDA manages 120000ndash140000 simultaneous jobs    (PanDA manages two types of jobs that are shown separately here)

It is this scale of data movementgoing on 24 hrday 9+ monthsyr

that networks must support in order to enable the large-scale science of the LHC

0

50

100

150

Peta

byte

s

four years

0

50000

100000

type

2jo

bs

0

50000

type

1jo

bs

one year

one year

one year

46

Building an LHC-scale production analysis system In order to debug and optimize the distributed system that

accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in

ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC

production

47

Ramp-up of LHC traffic in ESnet

(est of ldquosmallrdquo scale traffic)

LHC

turn

-on

LHC data systemtesting

LHC operationThe transition from testing to operation

was a smooth continuum due toat-scale testing ndash a process that took

more than 5 years

48

6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument

to data centers ndash a dedicated purpose-built infrastructure is needed

bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to

the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the

Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward

exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community

bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by

bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])

ndash that is only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ

Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using

synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on

75

The MessageAgain hellipA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data

management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

76

References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston

Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz

[fasterdata] See httpfasterdataesnetfasterdataperfSONAR

[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more

[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory

[LHCONE] httplhconenet

[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo

[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

[OIF1] OIF-FD-100G-DWDM-010 - 100G Ultra Long Haul DWDM Framework Document (June 2009) httpwwwoiforumcompublicdocumentsOIF-FD-100G-DWDM-010pdf

77

References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

78

References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations

httpwwwperfsonarnet

httppspsperfsonarnet

[REQ] httpswwwesnetaboutscience-requirements

[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010

(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )

[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223

[Tracy1] httpwwwnanogorgmeetingsnanog55presentationsTuesdayTracypdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 39: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

39

The Science DMZTo provide high data-rate access to local resources the site

LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high

speed data path all the way back to the source

40

The Science DMZThe ScienceDMZ concept

The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and

rapid fault isolation typically perfSONAR (see [perfSONAR] and below)

A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)

This is so important it was a requirement for last round of NSF CC-NIE grants

41

The Science DMZ

(See httpfasterdataesnetscience-dmz

and [SDMZ] for a much more complete

discussion of the various approaches)

campus siteLAN

high performanceData Transfer Node

computing cluster

cleanhigh-bandwidthWAN data path

campussiteaccess to

Science DMZresources is via the site firewall

secured campussiteaccess to Internet

border routerWAN

Science DMZrouterswitch

campus site

Science DMZ

Site DMZ WebDNS

Mail

network monitoring and testing

A WAN-capable device

per-servicesecurity policycontrol points

site firewall

dedicated systems built and

tuned for wide-area data transfer

42

6) Data movement and management techniquesAutomated data movement is critical for moving 500

terabytesday between 170 international sites In order to effectively move large amounts of data over the

network automated systems must be used to manage workflow and error recovery

bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers

bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)

43

Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the

analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates

compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day

bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10

petabytes of datayear in order to accomplish its science

bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial

44

DDMAgent

DDMAgent

ATLAS production

jobs

Regional production

jobs

User Group analysis jobs

Data Service

Task Buffer(job queue)

Job Dispatcher

PanDA Server(task management)

Job Broker

Policy(job type priority)

ATLA

S Ti

er 1

Data

Cen

ters

11 s

ites

scat

tere

d ac

ross

Euro

pe N

orth

Am

erica

and

Asia

in

aggr

egat

e ho

ld 1

copy

of a

ll dat

a an

d pr

ovide

the

work

ing

data

set f

or d

istrib

ution

to T

ier 2

cen

ters

for a

nalys

isDistributed

Data Manager

Pilot Job(Panda job

receiver running under the site-

specific job manager)

Grid Scheduler

Site Capability Service

CERNATLAS detector

Tier 0 Data Center(1 copy of all data ndash

archival only)

Job resource managerbull Dispatch a ldquopilotrdquo job manager - a

Panda job receiver - when resources are available at a site

bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA

bull Similar to the Condor Glide-in approach

Site status

ATLAS analysis sites(eg 70 Tier 2 Centers in

Europe North America and SE Asia)

DDMAgent

DDMAgent

1) Schedules jobs initiates data movement

2) DDM locates data and moves it to sites

This is a complex system in its own right called DQ2

3) Prepares the local resources to receive Panda jobs

4) Jobs are dispatched when there are resources available and when the required data is

in place at the site

Thanks to Michael Ernst US ATLAS technical lead for his assistance with this

diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)

The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday

CERN

Try to move the job to where the data is else move data and job to where

resources are available

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe N America and SE Asia generate network

data movement of 730 TByday ~68Gbs

Accumulated data volume on disk

730 TBytesday

PanDA manages 120000ndash140000 simultaneous jobs    (PanDA manages two types of jobs that are shown separately here)

It is this scale of data movementgoing on 24 hrday 9+ monthsyr

that networks must support in order to enable the large-scale science of the LHC

0

50

100

150

Peta

byte

s

four years

0

50000

100000

type

2jo

bs

0

50000

type

1jo

bs

one year

one year

one year

46

Building an LHC-scale production analysis system In order to debug and optimize the distributed system that

accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in

ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC

production

47

Ramp-up of LHC traffic in ESnet

(est of ldquosmallrdquo scale traffic)

LHC

turn

-on

LHC data systemtesting

LHC operationThe transition from testing to operation

was a smooth continuum due toat-scale testing ndash a process that took

more than 5 years

48

6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument

to data centers ndash a dedicated purpose-built infrastructure is needed

bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to

the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the

Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward

exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community

bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by

bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])

ndash that is only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN → T1           miles     km
France                350      565
Italy                 570      920
UK                    625     1000
Netherlands           625     1000
Germany               700     1185
Spain                 850     1400
Nordic               1300     2100
USA – New York       3900     6300
USA – Chicago        4400     7100
Canada – BC          5200     8400
Taiwan               6100     9850

[Figure: A Network Centric View of the LHC. The detector (1 PB/s output) feeds the Level 1 and 2 triggers over O(1-10) meters, the Level 3 trigger over O(10-100) meters, and the CERN computer center over O(1) km. From there the LHC Optical Private Network (LHCOPN), carrying 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS), reaches the LHC Tier 1 data centers 500-10,000 km away (Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN). The LHC Open Network Environment (LHCONE) then connects the Tier 1 centers to the LHC Tier 2 analysis centers – the universities and physics groups. This is intended to indicate that the physics groups now get their data wherever it is most readily available.]
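Propagation delay alone sets a floor on round-trip time over the CERN → Tier 1 distances in the table above: light in fiber travels at roughly two-thirds of c, about 5 microseconds per km one way. The short sketch below is a rough estimate only (real RTTs are larger because of routing detours and equipment) applying that rule of thumb to a few of the listed path lengths.

    # Rough minimum RTT from fiber path length: ~5 microseconds per km each way
    # (light in fiber travels at about 2/3 c). Measured RTTs will be larger.
    PATH_KM = {"France": 565, "Germany": 1185, "USA - New York": 6300, "Taiwan": 9850}

    for dest, km in PATH_KM.items():
        rtt_ms = 2 * km * 5e-6 * 1000          # out and back, converted to milliseconds
        print(f"CERN -> {dest:15s} {km:5d} km   minimum RTT ~ {rtt_ms:5.1f} ms")
    # e.g. Taiwan: ~98.5 ms floor - which is why long-RTT behavior dominates the design issues.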

57

7) New network services: Point-to-Point Virtual Circuit Service

Why a Circuit Service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences in that they use distributed applications systems in order to
– Couple existing pockets of code, data, and expertise into "systems of systems"
– Break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
– See https://www.es.net/about/science-requirements

A commonly identified need to support this is that networking must be provided as a "service":
– Schedulable, with guaranteed bandwidth – as is done with CPUs and disks
– Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
– Some network path characteristics may also be specified – e.g. diversity
– Available in the Web Services / Grid Services paradigm

58

Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
– This is typically done by using a "static" routing mechanism
• e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
– MPLS and OpenFlow are examples of this, and both can transport IP packets
– Most modern Internet routers have this type of functionality

• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic.
– The virtual circuits can be directed to specific physical network paths when they are set up.
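To make the idea concrete, here is a minimal sketch of what a guaranteed-bandwidth circuit request might look like from a site tool's point of view. The field names and port identifiers are illustrative assumptions only – they are not the actual OSCARS or NSI schema.

    import json
    from datetime import datetime, timedelta, timezone

    def build_circuit_request(src_port, dst_port, vlan, mbps, hours):
        """Assemble a point-to-point, guaranteed-bandwidth reservation request.

        Illustrative structure only: real services (OSCARS, NSI) define their own
        schemas, authentication, and negotiation steps.
        """
        start = datetime.now(timezone.utc)
        return {
            "source":         {"port": src_port, "vlan": vlan},
            "destination":    {"port": dst_port, "vlan": vlan},
            "bandwidth_mbps": mbps,                       # guaranteed rate
            "start_time":     start.isoformat(),
            "end_time":       (start + timedelta(hours=hours)).isoformat(),
            "path_constraints": {"diverse_from": None},   # e.g. request path diversity
        }

    if __name__ == "__main__":
        req = build_circuit_request(
            src_port="domainA:router1:port10",            # hypothetical identifiers
            dst_port="domainB:router7:port3",
            vlan=3602, mbps=20000, hours=8)
        print(json.dumps(req, indent=2))

The essential point is that the request is schedulable (start/end times) and carries a bandwidth guarantee, so the network can be co-scheduled with CPU and storage.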

59

Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information, contact the project lead, Chin Guok, chin@es.net.)

• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference, 2011, in the references.

• OSCARS received a 2013 "R&D 100" award.

60

End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part

• How are the circuits used?
– End system to end system, IP
• Almost never – very hard unless private address space is used
– Using public address space can result in leaking routes
– Using private address space with multi-homed hosts risks allowing backdoors into secure networks
– End system to end system, Ethernet (or other) over VLAN – a pseudowire
• Relatively common
• Interesting example: RDMA over VLAN is likely to be popular in the future
– The SC11 demo of 40G RDMA over the WAN was very successful
– CPU load for RDMA is a small fraction of that of IP
– The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks)
– Point-to-point connection between routing instances – e.g. BGP at the end points
• Essentially this is how all current circuits are used: from one site router to another site router
– Typically site-to-site, or to advertise subnets that host clusters, e.g. LHC analysis or data management clusters

61

End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot

• Most circuits are used for a guarantee of bandwidth or for user traffic engineering.

62

Cross-Domain Virtual Circuit Service
• Large-scale science always involves institutions in multiple network domains (administrative units).
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
– e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains

63

Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains

[Figure: End-to-end virtual circuit setup across domains – a user source at FNAL (AS3152) [US] connected to a user destination at DESY (AS1754) [Germany] across ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany]. Each domain runs a local inter-domain controller (IDC) – OSCARS in ESnet, AutoBAHN in GEANT – exchanging topology information and passing the VC setup request along, with a data plane connection helper at each domain ingress/egress point.]

1. The domains exchange topology information containing at least potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process.
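The chain-style setup in steps 1–3 can be pictured with a small simulation: each domain controller reserves its own segment and forwards the request to the next domain, and the end-to-end circuit is committed only if every segment succeeds. This is a toy model of the idea, not the IDC/NSI protocol itself; the capacities below are invented.

    # Toy model of chained inter-domain virtual circuit setup (not the real IDC/NSI protocol).
    DOMAINS = [                        # ordered along the end-to-end path
        {"name": "FNAL",  "free_gbps": 40},
        {"name": "ESnet", "free_gbps": 100},
        {"name": "GEANT", "free_gbps": 30},
        {"name": "DFN",   "free_gbps": 40},
        {"name": "DESY",  "free_gbps": 20},
    ]

    def setup_circuit(domains, gbps):
        """Reserve a segment in each domain in path order; roll back on any failure."""
        reserved = []
        for dom in domains:
            if dom["free_gbps"] < gbps:          # this segment cannot be authorized
                for d in reserved:               # release the already-reserved segments
                    d["free_gbps"] += gbps
                return False, f"rejected in {dom['name']}"
            dom["free_gbps"] -= gbps             # reserve this segment
            reserved.append(dom)
        return True, "end-to-end circuit reserved"   # data-plane stitching would follow

    print(setup_circuit(DOMAINS, 25))   # -> (False, 'rejected in DESY'), nothing held
    print(setup_circuit(DOMAINS, 10))   # -> (True, 'end-to-end circuit reserved')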

64

Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
– Testing is being coordinated in GLIF (the Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).

• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system.

Multi-domain circuit setup is not yet a robust production service, but progress is being made.
• See lhcone.net

65

8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science.
– With each generation of network transport technology:
• 155 Mb/s was the norm for high-speed networks in 1995
• 100 Gb/s – 650 times greater – is the norm today
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
• and then do the development necessary for applications to make use of the new capabilities
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s

66

Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.

67

The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR How-to
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained

• fasterdata.es.net is a community project with contributions from several organizations.
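As a small illustration of the "Host Tuning" topic, the sketch below reads a few Linux kernel TCP settings and compares them against large-buffer targets of the kind the knowledge base discusses for long, fast paths. The target values are illustrative assumptions, not quoted fasterdata recommendations – consult the site for current guidance.

    from pathlib import Path

    # Illustrative targets for a high-RTT, multi-Gb/s path (assumed values, not
    # official fasterdata numbers): let TCP buffers autotune up to tens of MB.
    TARGETS = {
        "net/core/rmem_max": 67108864,                 # max receive socket buffer (bytes)
        "net/core/wmem_max": 67108864,                 # max send socket buffer (bytes)
        "net/ipv4/tcp_rmem": (4096, 87380, 67108864),  # min / default / max autotuning
        "net/ipv4/tcp_wmem": (4096, 65536, 67108864),
    }

    def current(key):
        """Read a kernel setting from /proc/sys (Linux only)."""
        vals = tuple(int(v) for v in Path("/proc/sys", key).read_text().split())
        return vals if len(vals) > 1 else vals[0]

    for key, want in TARGETS.items():
        have = current(key)
        have_max = have[-1] if isinstance(have, tuple) else have
        want_max = want[-1] if isinstance(want, tuple) else want
        print(f"{key}: current={have}  target max={want_max}  "
              f"-> {'ok' if have_max >= want_max else 'too small for long fat paths'}")
    # Raising the values requires sysctl and root privileges.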

68

The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

69

Infrastructure Critical to Science
• The combination of
– new network architectures in the wide area
– new network services (such as guaranteed bandwidth virtual circuits)
– cross-domain network error detection and correction
– redesigning the site LAN to handle high data throughput
– automation of data movement systems
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.

• Other disciplines that involve data-intensive science will face most of these same issues.

70

LHC lessons of possible use to the SKA: The similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.

• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.

• The data is generated/sent to a single location and then distributed to science groups.

• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.

71

LHC lessons of possible use to the SKA: The lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one location.
– A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
– The technical aspects of building and operating a centralized working data repository:
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
• high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
militate against a single large data center.

72

LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well.
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.

73

LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).

If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, e.g., are implementing LHCONE.

74

LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (see the throughput sketch after this list).
New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.

Re-engineering the site LAN-WAN architecture is critical: the Science DMZ.

Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
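The throughput sketch referenced above uses the well-known Mathis et al. approximation – single-stream TCP throughput is at most about (MSS/RTT)·(1/√loss) – to show why even tiny loss rates are crippling on long-RTT paths and why "error-free with constant monitoring" is the operative requirement. The RTT and loss rates below are example parameters only.

    from math import sqrt

    def mathis_throughput_gbps(mss_bytes, rtt_s, loss_rate):
        """Approximate upper bound on single-stream TCP throughput (Mathis et al.)."""
        return (mss_bytes * 8 / rtt_s) / sqrt(loss_rate) / 1e9

    rtt = 0.150                          # ~150 ms, e.g. an intercontinental path
    for loss in (1e-10, 1e-7, 1e-4):     # packet loss probability
        gbps = mathis_throughput_gbps(1460, rtt, loss)
        print(f"loss={loss:.0e}  ->  at most ~{gbps:.3f} Gb/s per TCP stream")
    # ~7.8 Gb/s at 1e-10, ~0.25 Gb/s at 1e-7, ~0.008 Gb/s at 1e-4: filling a
    # 10 Gb/s path at this RTT requires an essentially loss-free path
    # (or jumbo frames and many parallel streams).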

75

The Message
Again ... A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf, and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

  • Foundations of data-intensive science: Technology and practice
  • Data-Intensive Science in DOE's Office of Science
  • DOE Office of Science and ESnet – the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport: The limitations of TCP must be addressed for...
  • Transport
  • Transport: Impact of packet loss on TCP
  • Transport: Modern TCP stack
  • Transport: Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 4.1) System software tuning: Host tuning – TCP
  • System software tuning: Host tuning – TCP
  • System software tuning: Host tuning – TCP
  • 4.2) System software tuning: Data transfer tools
  • System software tuning: Data transfer tools
  • System software tuning: Data transfer tools (2)
  • System software tuning: Data transfer tools (3)
  • 4.4) System software tuning: Other issues
  • 5) Site infrastructure to support data-intensive science: The Sc...
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN – Optical Private Network
  • The LHC OPN – Optical Private Network (2)
  • The LHC OPN – Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHC's Open Network Environment – LHCONE
  • Slide 54
  • The LHC's Open Network Environment – LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports...
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits – How They Use Them
  • End User View of Circuits – How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide R&D consulting and knowledge base
  • Provide R&D consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)

40

The Science DMZ
The Science DMZ concept:
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy:
– Outside the site firewall – hence the term "Science DMZ"
– With dedicated systems built and tuned for wide-area data transfer
– With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below)
– A security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.)

This is so important it was a requirement for the last round of NSF CC-NIE grants.
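Proper test and measurement is perfSONAR's job. Purely as an illustration of the kind of memory-to-memory throughput check an operator runs between Data Transfer Nodes, here is a minimal socket-based sender/receiver sketch; the port number is a placeholder and the tool makes no claim to perfSONAR's rigor.

    import socket, sys, time

    PORT, CHUNK = 5201, 1 << 20            # placeholder port; 1 MiB buffers

    def serve():
        """Receive bytes until the sender closes, then report the achieved rate."""
        with socket.create_server(("", PORT)) as srv:
            conn, addr = srv.accept()
            total, start = 0, time.time()
            while True:
                data = conn.recv(CHUNK)
                if not data:
                    break
                total += len(data)
            secs = time.time() - start
            print(f"received {total/1e9:.2f} GB from {addr[0]} at {total*8/secs/1e9:.2f} Gb/s")

    def send(host, gbytes=4):
        """Send gbytes of zeros to the receiver (memory-to-memory, no disk involved)."""
        payload = bytes(CHUNK)
        with socket.create_connection((host, PORT)) as conn:
            for _ in range(int(gbytes * 1e9) // CHUNK):
                conn.sendall(payload)

    if __name__ == "__main__":
        # usage: "python net_check.py serve" on one DTN,
        #        "python net_check.py send <receiver-host>" on the other
        serve() if sys.argv[1] == "serve" else send(sys.argv[2])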

41

The Science DMZ

(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)

[Figure: Science DMZ architecture – the WAN connects through the border router to a Science DMZ router/switch (a WAN-capable device) that provides a clean, high-bandwidth WAN data path to a high-performance Data Transfer Node and a computing cluster, with network monitoring and testing and per-service security policy control points; the dedicated systems are built and tuned for wide-area data transfer. Campus/site access to Science DMZ resources is via the site firewall, behind which sit the campus/site LAN, the site DMZ (Web, DNS, Mail), and secured campus/site access to the Internet.]

42

6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.

• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers.

• The Tier 2 sites get a comparable amount of data from the Tier 1s.
– Host the physics groups that analyze the data and do the science
– Provide most of the compute resources for analysis
– Cache the data (though this is evolving to remote I/O)

43

Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management.
– The resources and data movement are centrally managed.
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations.
– The system manages tens of thousands of jobs a day:
• coordinates data movement of hundreds of terabytes/day, and
• manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science.

• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial. (A toy sketch of the "send the job to the data" scheduling policy appears after this slide.)
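The toy sketch: a few lines that capture the brokering policy "run the job where its data already is; otherwise pick a site with free slots and stage the data to it." Site names, slot counts, and dataset names are invented, and PanDA's real brokerage weighs many more factors (queue depth, site status, priorities, and so on).

    # Toy data-locality broker (illustrative only; not PanDA's actual algorithm).
    SITES = {
        "SiteA": {"free_slots": 2, "datasets": {"data12_8TeV.A"}},
        "SiteB": {"free_slots": 5, "datasets": {"data12_8TeV.B"}},
        "SiteC": {"free_slots": 3, "datasets": set()},
    }

    def broker(job_dataset):
        """Prefer a site that already holds the dataset; otherwise schedule a transfer."""
        local = [s for s, info in SITES.items()
                 if job_dataset in info["datasets"] and info["free_slots"] > 0]
        if local:
            site = local[0]
            SITES[site]["free_slots"] -= 1
            return site, "run on cached data"
        site = max(SITES, key=lambda s: SITES[s]["free_slots"])   # most free slots
        SITES[site]["free_slots"] -= 1
        SITES[site]["datasets"].add(job_dataset)                  # models the DDM transfer
        return site, "data transfer scheduled"

    for ds in ("data12_8TeV.A", "data12_8TeV.A", "data12_8TeV.C"):
        print(ds, "->", broker(ds))
    # Both 'A' jobs land on SiteA next to the cached data; the 'C' job goes to
    # SiteB (most free slots) and a dataset transfer to it is scheduled.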

44

[Figure: The ATLAS PanDA "Production and Distributed Analysis" system, which uses distributed resources and layers of automation to manage several million jobs/day. ATLAS production jobs, regional production jobs, and user/group analysis jobs enter the PanDA server (task management) through a task buffer (job queue), with a job broker and a policy module (job type, priority) feeding the job dispatcher and data service. 1) PanDA schedules jobs and initiates data movement; 2) the Distributed Data Manager (DDM agents – a complex system in its own right, called DQ2) locates data and moves it to sites; 3) a job resource manager prepares the local resources to receive PanDA jobs by dispatching a "pilot" job receiver that runs under the site-specific job manager (e.g. Condor, LSF, LCG, ...) and accepts jobs in a standard format from PanDA, similar to the Condor glide-in approach; 4) jobs are dispatched when resources are available and when the required data is in place at the site. The policy is to try to move the job to where the data is, else move data and job to where resources are available. The CERN ATLAS detector feeds the Tier 0 data center (one archival-only copy of all data); the ATLAS Tier 1 data centers – 11 sites scattered across Europe, North America, and Asia – in aggregate hold one copy of all data and provide the working data set for distribution to Tier 2 centers for analysis; the ATLAS analysis sites are, e.g., the 70 Tier 2 centers in Europe, North America, and SE Asia. Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point (both are at Brookhaven National Lab).]

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TB/day, ~68 Gb/s.

[Figure: Accumulated data volume on disk (0-150 petabytes over a four-year span, growing at 730 TBytes/day) and PanDA job counts over one year; PanDA manages 120,000–140,000 simultaneous jobs (the two types of jobs PanDA manages are shown separately).]

It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.

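A quick back-of-the-envelope check of the sustained rate quoted above (decimal terabytes assumed):

    tb_per_day = 730
    bits_per_day = tb_per_day * 1e12 * 8        # terabytes -> bits
    gbps = bits_per_day / 86400 / 1e9           # per second, then gigabits
    print(f"{tb_per_day} TB/day ~ {gbps:.1f} Gb/s sustained")   # ~67.6 Gb/s, i.e. ~68 Gb/s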

46

Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure.
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges".
– Successful testing was required for sites to participate in LHC production.

47

Ramp-up of LHC traffic in ESnet

[Figure: ESnet traffic ramp-up, showing the estimate of "small" scale traffic, the LHC data system testing period, LHC turn-on, and LHC operation.]

The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.

48

6 cont.) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.

• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN.
– The LHCOPN is a collection of leased 10 Gb/s optical circuits.
– The role of the LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.

49

The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.

• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance. The security issues were the primary ones, and were addressed by:
• using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
– that is, only LHC data and compute servers are connected to the OPN.

50

The LHC OPN – Optical Private Network

[Figure: LHCOPN physical topology (abbreviated) and LHCOPN architecture – CH-CERN connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]

51

The LHC OPN – Optical Private Network
N.B.
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.

Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.

Page 41: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

41

The Science DMZ

(See httpfasterdataesnetscience-dmz

and [SDMZ] for a much more complete

discussion of the various approaches)

campus siteLAN

high performanceData Transfer Node

computing cluster

cleanhigh-bandwidthWAN data path

campussiteaccess to

Science DMZresources is via the site firewall

secured campussiteaccess to Internet

border routerWAN

Science DMZrouterswitch

campus site

Science DMZ

Site DMZ WebDNS

Mail

network monitoring and testing

A WAN-capable device

per-servicesecurity policycontrol points

site firewall

dedicated systems built and

tuned for wide-area data transfer

42

6) Data movement and management techniques

Automated data movement is critical for moving 500 terabytes/day between 170 international sites: in order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery (a minimal retry sketch follows this list).

• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers.
• The Tier 2 sites get a comparable amount of data from the Tier 1s:
  – they host the physics groups that analyze the data and do the science;
  – they provide most of the compute resources for analysis;
  – they cache the data (though this is evolving to remote I/O).
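The emphasis on automated workflow and error recovery can be illustrated with a minimal sketch: a transfer queue that retries failed transfers with backoff rather than relying on a person to notice. This is not PanDA or DQ2 code; the `transfer()` function shells out to a hypothetical `mover` command standing in for whatever mover (e.g. a GridFTP client) a real system would invoke.

```python
import subprocess
import time

def transfer(source_url: str, dest_url: str) -> bool:
    """Stand-in for a real mover; 'mover' is a hypothetical command used for illustration."""
    result = subprocess.run(["mover", source_url, dest_url], capture_output=True)
    return result.returncode == 0

def move_dataset(files, max_attempts=5, base_delay=30):
    """Move every (source, destination) pair, retrying failures with exponential
    backoff, and return the list of files that still failed for operator follow-up."""
    failed = []
    for src, dst in files:
        for attempt in range(1, max_attempts + 1):
            if transfer(src, dst):
                break
            time.sleep(base_delay * 2 ** (attempt - 1))  # back off before retrying
        else:
            failed.append((src, dst))  # retries exhausted; escalate to an operator
    return failed
```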

43

Highly distributed and highly automated workflow systems

• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management:
  – The resources and data movement are centrally managed.
  – Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations.
  – The system manages tens of thousands of jobs a day:
    • it coordinates data movement of hundreds of terabytes/day, and
    • it manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science.

• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial.

44

[PanDA architecture diagram: "The ATLAS PanDA ('Production and Distributed Analysis') system uses distributed resources and layers of automation to manage several million jobs/day."

ATLAS production jobs, regional production jobs, and user/group analysis jobs enter the PanDA Server (task management), which comprises a Task Buffer (job queue), a Policy module (job type priority), a Job Broker, a Job Dispatcher, and a Data Service working with the Distributed Data Manager (DDM agents):
1) PanDA schedules jobs and initiates data movement.
2) The DDM locates data and moves it to sites (this is a complex system in its own right, called DQ2).
3) The Grid Scheduler and Site Capability Service (using site status information) prepare the local resources to receive PanDA jobs.
4) Jobs are dispatched when there are resources available and when the required data is in place at the site.

Job resource manager:
• A "pilot" job manager – a PanDA job receiver – is dispatched when resources are available at a site.
• Pilots run under the local site job manager (e.g. Condor, LSF, LCG, …) and accept jobs in a standard format from PanDA; this is similar to the Condor Glide-in approach (a sketch of the pilot pattern follows this diagram description).

The general strategy is to try to move the job to where the data is, else move data and job to where resources are available.

The CERN ATLAS detector feeds the Tier 0 Data Center (1 copy of all data – archival only). The ATLAS Tier 1 data centers – 11 sites scattered across Europe, North America, and Asia – in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis. The pilot jobs run at the ATLAS analysis sites (e.g. 70 Tier 2 centers in Europe, North America, and SE Asia).

Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]
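The pilot-job idea in the diagram above – a lightweight job submitted through the local batch system that then pulls real work from the central task queue only when the site actually has free resources and the needed data – can be sketched as follows. This is an illustration of the pattern, not the PanDA pilot itself; the queue class, its methods, and the dataset name are invented for the example.

```python
import time

class CentralTaskQueue:
    """Hypothetical stand-in for a PanDA-like central server."""
    def __init__(self, tasks):
        self.tasks = list(tasks)  # each task: {"name": ..., "dataset": ...}

    def fetch_job(self, site_datasets):
        """Hand out a job whose input dataset is already present at the site."""
        for i, task in enumerate(self.tasks):
            if task["dataset"] in site_datasets:
                return self.tasks.pop(i)
        return None

def pilot(queue, site_datasets, run_payload, idle_wait=60, max_idle=3):
    """Pilot loop: started by the local batch system, it repeatedly asks the
    central queue for work matched to locally available data and runs it."""
    idle = 0
    while idle < max_idle:
        job = queue.fetch_job(site_datasets)
        if job is None:
            idle += 1
            time.sleep(idle_wait)   # no matching work; wait, eventually exit
            continue
        idle = 0
        run_payload(job)            # execute the real analysis payload

if __name__ == "__main__":
    q = CentralTaskQueue([{"name": "analysis-1", "dataset": "example-dataset-A"}])
    pilot(q, site_datasets={"example-dataset-A"},
          run_payload=lambda j: print("ran", j["name"]), idle_wait=0)
```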

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TB/day, ~68 Gb/s.

PanDA manages 120,000–140,000 simultaneous jobs. (PanDA manages two types of jobs, which are shown separately in the accompanying plots, along with the accumulated data volume on disk.)

It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
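As a sanity check, the quoted daily volume and the quoted rate are consistent:

\[
\frac{730\ \mathrm{TB/day} \times 8\ \mathrm{bits/Byte}}{86400\ \mathrm{s/day}}
= \frac{5.84\times10^{15}\ \mathrm{bits}}{8.64\times10^{4}\ \mathrm{s}}
\approx 6.8\times10^{10}\ \mathrm{b/s} \approx 68\ \mathrm{Gb/s}.
\]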

[Plots: accumulated data volume on disk (scale 0–150 petabytes, over four years, annotated "730 TBytes/day") and the number of concurrently running PanDA jobs over one year for each of the two job types (scales 0–100,000 and 0–50,000).]

46

Building an LHC-scale production analysis system

In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure:
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges".
– Successful testing was required for sites to participate in LHC production.

47

Ramp-up of LHC traffic in ESnet

[Chart: ESnet traffic ramp-up over time, marking an estimate of the "small"-scale traffic during LHC data system testing, the LHC turn-on, and LHC operation.]

The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.

48

6 cont) Evolution of network architectures

For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.

• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN.
  – The LHCOPN is a collection of leased 10 Gb/s optical circuits.
  – The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.

49

The LHC OPN – Optical Private Network

• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.

• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance. The security issues were the primary ones, and they were addressed by using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec]) – that is, only LHC data and compute servers are connected to the OPN.

50

The LHC OPN – Optical Private Network

[LHCOPN physical topology (abbreviated) and LHCOPN architecture diagrams: CH-CERN is connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]

51

The LHC OPN – Optical Private Network

N.B.:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
  – The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
  – However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic:
– In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic.
– (There are about 170 Tier 2 sites.)

• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 × 170 site pairs) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.

53

The LHC's Open Network Environment – LHCONE

LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).

The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems:
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.).
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).

In this way the LHC traffic will use circuits designated by the network engineers:
– to ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.

54

[Map (April 2012): "LHCONE: A global infrastructure for the LHC Tier 1 data center – Tier 2 analysis center connectivity." LHCONE VRF domains – ESnet and Internet2 (USA), CANARIE (Canada), GÉANT (Europe), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), TWAREN and ASGC (Taiwan), KREONET2 and KISTI (Korea), CUDI (Mexico), and others – interconnect through regional R&E communication nexus points (Seattle, Chicago, New York, Washington, Amsterdam, Geneva, …) and connect the Tier 1 centers (BNL-T1, FNAL-T1, TRIUMF-T1, ASGC-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, CERN-T1) with Tier 2/Tier 3 end sites such as Harvard, MIT, Caltech, UFlorida, UNeb, PurU, UMich, UWisc, UCSD, SLAC, UVic, SimFraU, UAlb, UTor, McGill, DESY, GSI, INFN-Nap, GRIF-IN2P3, CEA, UNAM, TIFR, KNU, NCU, and NTU. End sites are LHC Tier 2 or Tier 3 unless indicated as Tier 1; data communication links are 10, 20, and 30 Gb/s. See http://lhcone.net for details.]

55

The LHC's Open Network Environment – LHCONE

• LHCONE could be set up relatively "quickly" because:
  – the VRF technology is a standard capability in most core routers, and
  – there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.

• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.

• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.

• See http://lhcone.net

LHCONE is one part of the network infrastructure that supports the LHC

CERN → T1        miles    km
France             350    565
Italy              570    920
UK                 625   1000
Netherlands        625   1000
Germany            700   1185
Spain              850   1400
Nordic            1300   2100
USA – New York    3900   6300
USA – Chicago     4400   7100
Canada – BC       5200   8400
Taiwan            6100   9850

[Diagram: "A Network Centric View of the LHC." The detector (1 PB/s) feeds the Level 1 and 2 triggers at O(1–10) meters, then the Level 3 trigger at O(10–100) meters, and then the CERN Computer Center at O(1) km, which sends 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) over the LHC Optical Private Network (LHCOPN) to the LHC Tier 1 data centers 500–10,000 km away (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN). The Tier 1 centers in turn serve the many LHC Tier 2 analysis centers (universities and physics groups) over the LHC Open Network Environment (LHCONE); this is intended to indicate that the physics groups now get their data wherever it is most readily available.]

57

7) New network services: Point-to-Point Virtual Circuit Service

Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to:
  – couple existing pockets of code, data, and expertise into "systems of systems";
  – break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites;
  – see https://www.es.net/about/science-requirements/

• A commonly identified need to support this is that networking must be provided as a "service":
  – schedulable, with guaranteed bandwidth – as is done with CPUs and disks;
  – traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure;
  – some network path characteristics may also be specified – e.g. diversity;
  – available in the Web Services / Grid Services paradigm.
  (A toy admission-control sketch follows this list.)
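To make "schedulable, with guaranteed bandwidth" concrete, here is a toy admission-control check of the kind a bandwidth-reservation system must perform: accept a new reservation only if, at every instant it is active, the already-booked reservations plus the new one fit within the link capacity. This is an illustration only – it is not OSCARS code or its API; the Reservation class and the numbers are invented.

```python
from dataclasses import dataclass

@dataclass
class Reservation:
    start: int            # seconds since epoch
    end: int
    bandwidth_gbps: float

def admissible(existing, request, link_capacity_gbps):
    """Toy admission control: the load is piecewise constant, so it is enough to
    check it at every reservation start/end time that falls inside the request."""
    events = [request.start, request.end]
    events += [t for r in existing for t in (r.start, r.end)]
    for t in sorted(set(events)):
        if not (request.start <= t < request.end):
            continue
        load = request.bandwidth_gbps + sum(
            r.bandwidth_gbps for r in existing if r.start <= t < r.end)
        if load > link_capacity_gbps:
            return False
    return True

if __name__ == "__main__":
    booked = [Reservation(0, 3600, 40.0)]
    print(admissible(booked, Reservation(1800, 7200, 50.0), 100.0))  # True  (40 + 50 <= 100)
    print(admissible(booked, Reservation(1800, 7200, 70.0), 100.0))  # False (40 + 70 >  100)
```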

58

Point-to-Point Virtual Circuit Service

The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
– This is typically done by using a "static" routing mechanism, e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path (a small static-table illustration follows below).
– MPLS and OpenFlow are examples of this, and both can transport IP packets.
– Most modern Internet routers have this type of functionality.

• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic.
– The virtual circuits can be directed to specific physical network paths when they are set up.
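A minimal illustration of the "static switch tables set up in advance" idea: each switch on the pre-computed path maps an (incoming port, label) pair to an (outgoing port, label) pair, so a circuit is just a chain of table entries installed before any traffic flows, with no per-packet routing decision. The labels, ports, and switch names are made up for the example.

```python
# Per-switch static label-forwarding tables, installed when the circuit is provisioned.
# Key: (ingress_port, in_label) -> value: (egress_port, out_label)
TABLES = {
    "switch-A": {(1, 100): (5, 200)},
    "switch-B": {(2, 200): (7, 300)},
    "switch-C": {(3, 300): (9, 300)},   # last hop; the end site strips the label
}

# Static wiring of the path: (switch, egress_port) -> (next switch, its ingress_port)
LINKS = {("switch-A", 5): ("switch-B", 2), ("switch-B", 7): ("switch-C", 3)}

def forward(switch, port, label):
    """Follow the pre-installed entries hop by hop and return the switches traversed."""
    hops = [switch]
    while True:
        out_port, label = TABLES[switch][(port, label)]
        nxt = LINKS.get((switch, out_port))
        if nxt is None:
            return hops          # reached the egress edge of the circuit
        switch, port = nxt
        hops.append(switch)

if __name__ == "__main__":
    print(forward("switch-A", 1, 100))   # ['switch-A', 'switch-B', 'switch-C']
```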

59

Point-to-Point Virtual Circuit Service

• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference, 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.

60

End User View of Circuits – How They Use Them

• Who are the "users"?
  – Sites, for the most part.

• How are the circuits used?
  – End system to end system, IP:
    • Almost never – very hard unless private address space is used.
    • Using public address space can result in leaking routes.
    • Using private address space with multi-homed hosts risks allowing backdoors into secure networks.
  – End system to end system, Ethernet (or other) over VLAN – a pseudowire:
    • Relatively common.
    • Interesting example: RDMA over VLAN is likely to be popular in the future:
      – the SC11 demo of 40G RDMA over the WAN was very successful;
      – CPU load for RDMA is a small fraction of that of IP;
      – the guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks).
  – Point-to-point connection between routing instances – e.g. BGP at the end points:
    • Essentially this is how all current circuits are used: from one site router to another site router.
    • Typically site-to-site, or to advertise subnets that host clusters, e.g. LHC analysis or data management clusters.

61

End User View of Circuits – How They Use Them

• When are the circuits used?
  – Mostly to solve a specific problem that the general infrastructure cannot.
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering.

62

Cross-Domain Virtual Circuit Service

• Large-scale science always involves institutions in multiple network domains (administrative units).
  – For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits.
  – E.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.

63

Inter-Domain Control Protocol

• There are two realms involved:
  1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
  2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.

[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] through ESnet (AS293) [US], GÉANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GÉANT – with a data plane connection helper at each domain ingress/egress point.]

1. The domains exchange topology information containing at least the potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. an Ethernet VLAN-to-VLAN connection) is facilitated by a helper process.
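The three numbered steps above can be sketched as a chain of per-domain reservations: the request enters at one end, each domain's controller reserves its own segment before handing the request to the next domain, and if any domain refuses, the previously reserved segments are released. This is only an illustration of the pattern – it is not the IDC/NSI protocol, and the class, methods, and capacities are invented.

```python
class DomainController:
    """Hypothetical per-domain controller (the role OSCARS/AutoBAHN play)."""
    def __init__(self, name, free_gbps):
        self.name, self.free_gbps, self.committed = name, free_gbps, []

    def reserve(self, circuit_id, gbps):
        if gbps > self.free_gbps:
            return False
        self.free_gbps -= gbps
        self.committed.append(circuit_id)
        return True

    def release(self, circuit_id, gbps):
        if circuit_id in self.committed:
            self.committed.remove(circuit_id)
            self.free_gbps += gbps

def setup_end_to_end(path, circuit_id, gbps):
    """Pass the setup request domain to domain; roll back on the first refusal."""
    reserved = []
    for dom in path:
        if not dom.reserve(circuit_id, gbps):
            for d in reserved:                 # undo the partial reservation
                d.release(circuit_id, gbps)
            return False
        reserved.append(dom)
    return True

if __name__ == "__main__":
    path = [DomainController("ESnet", 50), DomainController("GEANT", 40),
            DomainController("DFN", 10)]
    print(setup_end_to_end(path, "vc-1", 8))    # True  - every domain can commit 8 Gb/s
    print(setup_end_to_end(path, "vc-2", 20))   # False - DFN cannot; earlier segments released
```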

64

Point-to-Point Virtual Circuit Service

• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
  – Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).

• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system.

• Multi-domain circuit setup is not yet a robust production service, but progress is being made – see lhcone.net.

65

8) Provide R&D, consulting, and knowledge base

• R&D drove most of the advances that make it possible for the network to support data-intensive science.
  – With each generation of network transport technology:
    • 155 Mb/s was the norm for high-speed networks in 1995;
    • 100 Gb/s – 650 times greater – is the norm today.
  – R&D groups involving hardware engineers, computer scientists, and application specialists worked:
    • first to demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible;
    • and then to do the development necessary for applications to make use of the new capabilities.
  – Examples of how this methodology drove toward today's capabilities include:
    • experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths;
    • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s.

66

Provide R&D, consulting, and knowledge base

• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.

67

The knowledge base

http://fasterdata.es.net topics:
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
  • Linux TCP Tuning
  • Cisco 6509 Tuning
  • perfSONAR Howto
  • Active perfSONAR Services
  • Globus overview
  • Say No to SCP
  • Data Transfer Nodes (DTN)
  • TCP Issues Explained

• fasterdata.es.net is a community project with contributions from several organizations. (A host-tuning illustration follows below.)
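As a small illustration of the "Host Tuning" topic, the sketch below reads a few of the Linux kernel settings that fasterdata-style guides discuss (socket buffer limits and the congestion control algorithm) and compares them against illustrative targets for a long-RTT, 10 Gb/s-class transfer host. The target values here are examples only, not an official recommendation; consult fasterdata.es.net for current guidance.

```python
from pathlib import Path

# Illustrative targets for a wide-area data transfer host (example values only).
TARGETS = {
    "net.core.rmem_max": "67108864",              # allow large receive socket buffers
    "net.core.wmem_max": "67108864",              # allow large send socket buffers
    "net.ipv4.tcp_congestion_control": "htcp",    # a high-speed-friendly algorithm
}

def read_sysctl(name: str) -> str:
    """Read a sysctl value from /proc (Linux only)."""
    return Path("/proc/sys/" + name.replace(".", "/")).read_text().strip()

def check_host():
    for name, want in TARGETS.items():
        try:
            have = read_sysctl(name)
        except OSError:
            print(f"{name}: not readable on this system")
            continue
        status = "ok" if have == want else f"differs (have {have}, example target {want})"
        print(f"{name}: {status}")

if __name__ == "__main__":
    check_host()
```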

68

The Message

A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

69

Infrastructure Critical to Science

• The combination of:
  – new network architectures in the wide area;
  – new network services (such as guaranteed bandwidth virtual circuits);
  – cross-domain network error detection and correction;
  – redesigning the site LAN to handle high data throughput;
  – automation of data movement systems;
  – use of appropriate operating system tuning and data transfer tools
  now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.

• Other disciplines that involve data-intensive science will face most of these same issues.

70

LHC lessons of possible use to the SKA

The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated/sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.

71

LHC lessons of possible use to the SKA

The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
– A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
– The technical aspects of building and operating a centralized working data repository:
  • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time, and
  • high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
  militate against a single large data center.

72

LHC lessons of possible use to the SKA

The LHC model of distributed data (multiple regional centers) has worked well:
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.

73

LHC lessons of possible use to the SKA

Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply. There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that in the case of the SKA the T1 links would come to a centralized, data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).

If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.

74

LHC lessons of possible use to the SKA

All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded. All high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.

Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.

Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on (a small ramp-up sketch follows below).
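A "service challenge" is essentially a scripted, gradually increasing load of synthetic transfers run long before the instrument exists, so that infrastructure problems are found early. A minimal sketch of such a ramp driver, under stated assumptions: `submit_transfer()` is a stand-in for whatever mover the real challenge would drive, and the rates, step counts, and durations are placeholders.

```python
import time

def submit_transfer(gbps: float, duration_s: int) -> bool:
    """Stand-in for launching synthetic traffic at the requested rate
    (a real challenge would drive bulk transfers of generated data)."""
    print(f"driving ~{gbps:.0f} Gb/s of synthetic data for {duration_s} s")
    return True   # pretend the step succeeded

def service_challenge(target_gbps=100.0, steps=5, step_duration_s=3600):
    """Ramp synthetic load toward the target rate; stop at the first failing step
    so the problem can be diagnosed before moving on to a higher rate."""
    for step in range(1, steps + 1):
        rate = target_gbps * step / steps
        if not submit_transfer(rate, step_duration_s):
            print(f"challenge failed at {rate:.0f} Gb/s - investigate before continuing")
            return False
        time.sleep(0)   # placeholder pause; a real challenge runs each step for days or weeks
    print("reached target rate - ready for at-scale operation")
    return True

if __name__ == "__main__":
    service_challenge()
```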

75

The Message

Again … a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References

[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document".

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References

[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References

[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 42: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

42

6) Data movement and management techniquesAutomated data movement is critical for moving 500

terabytesday between 170 international sites In order to effectively move large amounts of data over the

network automated systems must be used to manage workflow and error recovery

bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers

bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)

43

Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the

analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates

compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day

bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10

petabytes of datayear in order to accomplish its science

bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial

44

DDMAgent

DDMAgent

ATLAS production

jobs

Regional production

jobs

User Group analysis jobs

Data Service

Task Buffer(job queue)

Job Dispatcher

PanDA Server(task management)

Job Broker

Policy(job type priority)

ATLA

S Ti

er 1

Data

Cen

ters

11 s

ites

scat

tere

d ac

ross

Euro

pe N

orth

Am

erica

and

Asia

in

aggr

egat

e ho

ld 1

copy

of a

ll dat

a an

d pr

ovide

the

work

ing

data

set f

or d

istrib

ution

to T

ier 2

cen

ters

for a

nalys

isDistributed

Data Manager

Pilot Job(Panda job

receiver running under the site-

specific job manager)

Grid Scheduler

Site Capability Service

CERNATLAS detector

Tier 0 Data Center(1 copy of all data ndash

archival only)

Job resource managerbull Dispatch a ldquopilotrdquo job manager - a

Panda job receiver - when resources are available at a site

bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA

bull Similar to the Condor Glide-in approach

Site status

ATLAS analysis sites(eg 70 Tier 2 Centers in

Europe North America and SE Asia)

DDMAgent

DDMAgent

1) Schedules jobs initiates data movement

2) DDM locates data and moves it to sites

This is a complex system in its own right called DQ2

3) Prepares the local resources to receive Panda jobs

4) Jobs are dispatched when there are resources available and when the required data is

in place at the site

Thanks to Michael Ernst US ATLAS technical lead for his assistance with this

diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)

The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday

CERN

Try to move the job to where the data is else move data and job to where

resources are available

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe N America and SE Asia generate network

data movement of 730 TByday ~68Gbs

Accumulated data volume on disk

730 TBytesday

PanDA manages 120000ndash140000 simultaneous jobs    (PanDA manages two types of jobs that are shown separately here)

It is this scale of data movementgoing on 24 hrday 9+ monthsyr

that networks must support in order to enable the large-scale science of the LHC

0

50

100

150

Peta

byte

s

four years

0

50000

100000

type

2jo

bs

0

50000

type

1jo

bs

one year

one year

one year

46

Building an LHC-scale production analysis system In order to debug and optimize the distributed system that

accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in

ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC

production

47

Ramp-up of LHC traffic in ESnet

(est of ldquosmallrdquo scale traffic)

LHC

turn

-on

LHC data systemtesting

LHC operationThe transition from testing to operation

was a smooth continuum due toat-scale testing ndash a process that took

more than 5 years

48

6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument

to data centers ndash a dedicated purpose-built infrastructure is needed

bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to

the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the

Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward

exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community

bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by

bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])

ndash that is only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded.
All high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (the sketch at the end of this slide quantifies why). New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.

Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.

Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
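The "fragile workhorse" point can be quantified with the Mathis et al. approximation for loss-limited TCP throughput, rate ≈ (MSS/RTT) × (1.22/√p). The sketch below applies it to an assumed 150 ms intercontinental path; the loss rates are illustrative, and modern stacks (CUBIC, H-TCP, BBR) recover somewhat differently, but the qualitative conclusion is the same.

# Upper bound on single-stream TCP throughput from the Mathis et al.
# approximation:  rate <= (MSS / RTT) * (C / sqrt(p)),  with C ~= 1.22.
# Assumed path: ~1460-byte MSS (standard 1500-byte MTU) and a 150 ms RTT,
# roughly an intercontinental R&E path.
from math import sqrt

MSS_BYTES = 1460
RTT_S = 0.150
C = 1.22

def mathis_limit_gbps(loss_rate: float) -> float:
    """Loss-limited single-stream TCP throughput bound, in Gb/s."""
    bytes_per_s = (MSS_BYTES / RTT_S) * (C / sqrt(loss_rate))
    return bytes_per_s * 8 / 1e9

for p in (1e-2, 1e-4, 1e-6, 1e-8):
    print(f"packet loss {p:.0e}: <= {mathis_limit_gbps(p):7.3f} Gb/s per stream")

At this RTT a loss rate of one packet in a million already caps a single stream below 0.1 Gb/s, and filling a 10 Gb/s path would need loss rates around 1e-10 – which is why the slide insists on error-free paths and constant monitoring rather than hoping a new transport protocol will rescue the situation.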

75

The Message
Again … A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1–5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

44

DDMAgent

DDMAgent

ATLAS production

jobs

Regional production

jobs

User Group analysis jobs

Data Service

Task Buffer(job queue)

Job Dispatcher

PanDA Server(task management)

Job Broker

Policy(job type priority)

ATLA

S Ti

er 1

Data

Cen

ters

11 s

ites

scat

tere

d ac

ross

Euro

pe N

orth

Am

erica

and

Asia

in

aggr

egat

e ho

ld 1

copy

of a

ll dat

a an

d pr

ovide

the

work

ing

data

set f

or d

istrib

ution

to T

ier 2

cen

ters

for a

nalys

isDistributed

Data Manager

Pilot Job(Panda job

receiver running under the site-

specific job manager)

Grid Scheduler

Site Capability Service

CERNATLAS detector

Tier 0 Data Center(1 copy of all data ndash

archival only)

Job resource managerbull Dispatch a ldquopilotrdquo job manager - a

Panda job receiver - when resources are available at a site

bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA

bull Similar to the Condor Glide-in approach

Site status

ATLAS analysis sites(eg 70 Tier 2 Centers in

Europe North America and SE Asia)

DDMAgent

DDMAgent

1) Schedules jobs initiates data movement

2) DDM locates data and moves it to sites

This is a complex system in its own right called DQ2

3) Prepares the local resources to receive Panda jobs

4) Jobs are dispatched when there are resources available and when the required data is

in place at the site

Thanks to Michael Ernst US ATLAS technical lead for his assistance with this

diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)

The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday

CERN

Try to move the job to where the data is else move data and job to where

resources are available

45

Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe N America and SE Asia generate network

data movement of 730 TByday ~68Gbs

Accumulated data volume on disk

730 TBytesday

PanDA manages 120000ndash140000 simultaneous jobs    (PanDA manages two types of jobs that are shown separately here)

It is this scale of data movementgoing on 24 hrday 9+ monthsyr

that networks must support in order to enable the large-scale science of the LHC

0

50

100

150

Peta

byte

s

four years

0

50000

100000

type

2jo

bs

0

50000

type

1jo

bs

one year

one year

one year

46

Building an LHC-scale production analysis system In order to debug and optimize the distributed system that

accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in

ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC

production

47

Ramp-up of LHC traffic in ESnet

(est of ldquosmallrdquo scale traffic)

LHC

turn

-on

LHC data systemtesting

LHC operationThe transition from testing to operation

was a smooth continuum due toat-scale testing ndash a process that took

more than 5 years

48

6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument

to data centers ndash a dedicated purpose-built infrastructure is needed

bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to

the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the

Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward

exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community

bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by

bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])

ndash that is only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide R&D, consulting, and a knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
  – With each generation of network transport technology:
    • 155 Mb/s was the norm for high-speed networks in 1995
    • 100 Gb/s – some 650 times greater – is the norm today
    • R&D groups involving hardware engineers, computer scientists, and application specialists worked to
      – first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
      – and then do the development necessary for applications to make use of the new capabilities
  – Examples of how this methodology drove toward today's capabilities include:
    • experiments in the 1990s using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
    • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s

66

Provide R&D, consulting, and a knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations

67

The knowledge base: http://fasterdata.es.net topics
  – Network Architecture, including the Science DMZ model
  – Host Tuning
  – Network Tuning
  – Data Transfer Tools
  – Network Performance Testing
  – With special sections on:
    • Linux TCP Tuning
    • Cisco 6509 Tuning
    • perfSONAR How-to
    • Active perfSONAR Services
    • Globus overview
    • Say No to SCP
    • Data Transfer Nodes (DTN)
    • TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
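As an example of the kind of host-tuning guidance collected there, the short script below computes the bandwidth-delay product of a path and compares it against the host's maximum TCP receive buffer (a Linux sysctl). The numbers and the specific sysctl read here are illustrative only; see fasterdata.es.net for the actual recommended settings.

```python
# Bandwidth-delay product check: a TCP connection can only fill the pipe
# if the socket buffers can hold at least bandwidth x round-trip time.
# Illustrative only -- consult fasterdata.es.net for recommended settings.

def bdp_bytes(bandwidth_gbps: float, rtt_ms: float) -> int:
    """Bytes in flight needed to keep a path of this bandwidth and RTT full."""
    return int(bandwidth_gbps * 1e9 * (rtt_ms / 1e3) / 8)

def check_rcv_buffer(bandwidth_gbps: float, rtt_ms: float) -> None:
    required = bdp_bytes(bandwidth_gbps, rtt_ms)
    try:
        # Linux-specific: the maximum receive buffer TCP autotuning may reach.
        with open("/proc/sys/net/ipv4/tcp_rmem") as f:
            max_rmem = int(f.read().split()[2])
    except OSError:
        max_rmem = None
    print(f"path: {bandwidth_gbps} Gb/s, {rtt_ms} ms RTT -> "
          f"BDP = {required / 2**20:.1f} MiB")
    if max_rmem is not None:
        verdict = "OK" if max_rmem >= required else "too small - raise net.ipv4.tcp_rmem"
        print(f"host max tcp_rmem = {max_rmem / 2**20:.1f} MiB: {verdict}")

# A 10 Gb/s path across the Atlantic (~90 ms RTT) needs roughly a 110 MiB buffer.
check_rcv_buffer(10, 90)
```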

68

The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

69

Infrastructure Critical to Science
• The combination of
  – new network architectures in the wide area
  – new network services (such as guaranteed-bandwidth virtual circuits)
  – cross-domain network error detection and correction
  – redesigning the site LAN to handle high data throughput
  – automation of data movement systems
  – use of appropriate operating system tuning and data transfer tools
  now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKA – the similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKA – the lessons
• The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
  – A deep archive (tape-only) copy is probably practical in one location (e.g., the SKA supercomputer center), and this is done at CERN for the LHC
  – The technical aspects of building and operating a centralized working data repository:
    • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
    • high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
  argue against a single large data center

72

LHC lessons of possible use to the SKA
• The LHC model of distributed data (multiple regional centers) has worked well
  – It decentralizes costs and involves many countries directly in the telescope infrastructure
  – It divides up the network load, especially on the expensive trans-ocean links
  – It divides up the cache I/O load across distributed sites

73

LHC lessons of possible use to the SKA
• Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
  – It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue
  – In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
  – In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE

74

LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
  – New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
• Re-engineering the site LAN-WAN architecture is critical: the Science DMZ
• Workflow management systems that automate the data movement will have to be designed and tested
  – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
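The "fragile workhorse" point can be made quantitative with the well-known Mathis et al. loss-limited TCP throughput model, throughput ≈ (MSS/RTT)·(1.22/√p). The snippet below is an illustration added here (not material from the slides) of why even a 0.001% loss rate cripples a long-RTT path while barely affecting a short one.

```python
# Mathis et al. model: loss-limited TCP throughput ~ (MSS / RTT) * (1.22 / sqrt(p)).
# Shows why long-RTT (e.g., transoceanic) paths must be kept error-free.
from math import sqrt

def mathis_throughput_gbps(mss_bytes: int, rtt_ms: float, loss_rate: float) -> float:
    """Approximate steady-state TCP throughput in Gb/s for a given loss rate."""
    bps = (mss_bytes * 8 / (rtt_ms / 1e3)) * (1.22 / sqrt(loss_rate))
    return bps / 1e9

for rtt in (1, 10, 100):                 # ms: metro, national, transoceanic
    gbps = mathis_throughput_gbps(mss_bytes=9000, rtt_ms=rtt, loss_rate=1e-5)
    print(f"RTT {rtt:4d} ms, loss 0.001%: ~{gbps:6.2f} Gb/s")
```

With 9000-byte (jumbo) frames and a 0.001% loss rate, the model gives roughly 28 Gb/s at 1 ms RTT but well under 1 Gb/s at 100 ms RTT, which is why error-free paths and constant monitoring matter so much for international collaborations.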

75

The Message
Again … a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more/

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 46: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

46

Building an LHC-scale production analysis system In order to debug and optimize the distributed system that

accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in

ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC

production

47

Ramp-up of LHC traffic in ESnet

(est of ldquosmallrdquo scale traffic)

LHC

turn

-on

LHC data systemtesting

LHC operationThe transition from testing to operation

was a smooth continuum due toat-scale testing ndash a process that took

more than 5 years

48

6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument

to data centers ndash a dedicated purpose-built infrastructure is needed

bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to

the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the

Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward

exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community

bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by

bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])

ndash that is only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group

– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)

• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system

• Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net

65

8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science

– With each generation of network transport technology:
• 155 Mb/s was the norm for high-speed networks in 1995
• 100 Gb/s – 650 times greater – is the norm today

• R&D groups involving hardware engineers, computer scientists, and application specialists worked to
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
• and then do the development necessary for applications to make use of the new capabilities

– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC-12 (622 Mb/s) wide area network paths (a minimal parallel-stream sketch follows this list)

• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
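The parallel-stream sketch referenced above, in Python: the file is split into byte ranges and each range is pushed over its own TCP connection, so that a single stream's window or loss events do not limit the aggregate rate. The receiver address, port, and stream count are hypothetical, and production tools (e.g. GridFTP/Globus) add restart, integrity checking, and security on top of this basic idea.

import os
import socket
from concurrent.futures import ThreadPoolExecutor

RECEIVER = ("data-receiver.example.org", 5000)  # hypothetical far-end data transfer node
STREAMS = 4                                     # number of parallel TCP connections

def send_chunk(path: str, offset: int, length: int) -> int:
    """Send one byte range of the file over its own TCP connection."""
    sent = 0
    with socket.create_connection(RECEIVER) as sock, open(path, "rb") as f:
        f.seek(offset)
        while sent < length:
            buf = f.read(min(1 << 20, length - sent))  # read in 1 MB pieces
            if not buf:
                break
            sock.sendall(buf)
            sent += len(buf)
    return sent

def parallel_send(path: str) -> None:
    size = os.path.getsize(path)
    chunk = (size + STREAMS - 1) // STREAMS
    ranges = [(i * chunk, max(0, min(chunk, size - i * chunk))) for i in range(STREAMS)]
    with ThreadPoolExecutor(max_workers=STREAMS) as pool:
        totals = list(pool.map(lambda r: send_chunk(path, *r), ranges))
    print(f"sent {sum(totals)} of {size} bytes over {STREAMS} streams")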

66

Provide R&D, consulting, and knowledge base
• Providing consulting on the problems that data-intensive projects are having in effectively using the network is critical

• Using the knowledge gained from that problem solving to build a community knowledge base benefits everyone

• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations

67

The knowledge base: http://fasterdata.es.net topics

– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning (an illustrative check of these settings follows this list)
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained

• fasterdata.es.net is a community project with contributions from several organizations
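As a small illustration of the host-tuning topic above, the Python sketch below simply reports what a Linux host currently has configured for a few of the kernel parameters that TCP tuning guides discuss (socket buffer limits and the congestion control algorithm). The parameter list is illustrative rather than complete, and appropriate values depend on the bandwidth-delay product of the paths in use – see the Linux TCP Tuning section of fasterdata.es.net for the actual guidance.

from pathlib import Path

# A few Linux TCP-related sysctls commonly covered in host-tuning guides;
# treat this list as illustrative, not authoritative.
PARAMS = [
    "net/core/rmem_max",
    "net/core/wmem_max",
    "net/ipv4/tcp_rmem",
    "net/ipv4/tcp_wmem",
    "net/ipv4/tcp_congestion_control",
]

def read_sysctl(name: str) -> str:
    """Read a sysctl value via /proc/sys; returns a placeholder if unavailable."""
    try:
        return (Path("/proc/sys") / name).read_text().strip()
    except OSError:
        return "<not available on this host>"

if __name__ == "__main__":
    for p in PARAMS:
        print(f"{p.replace('/', '.')} = {read_sysctl(p)}")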

68

The Message
• A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

• Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

69

Infrastructure Critical to Science
• The combination of
– new network architectures in the wide area
– new network services (such as guaranteed-bandwidth virtual circuits)
– cross-domain network error detection and correction
– redesigning the site LAN to handle high data throughput
– automation of data movement systems
– use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

• Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKA – The similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries

• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously

• The data is generated/sent to a single location and then distributed to science groups

• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKA – The lessons
• The science data product (the output of the supercomputer center, in the SKA case) is likely too large to keep the working data set in one location

– A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC

– The technical aspects of building and operating a centralized working data repository:

• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

• high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

militate against a single large data center

72

LHC lessons of possible use to the SKA
• The LHC model of distributed data (multiple regional centers) has worked well
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites

73

LHC lessons of possible use to the SKA
• Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply

• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)

• It might be that, in the case of the SKA, the T1 links would come to a centralized, data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue

• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)

• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC

– In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE

74

LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation among the R&E networks involved in providing parts of the path, etc. (a minimal sketch of such a path-loss check follows this list)

– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration

• Re-engineering the site LAN/WAN architecture is critical: the Science DMZ

• Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
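A minimal sketch of the kind of continuous path-quality check referred to above, assuming a reachable far-end test host; in practice this role is filled by the perfSONAR measurement infrastructure rather than ad hoc scripts. The target host name and alarm threshold here are hypothetical.

import re
import subprocess

TARGET = "test-host.example.org"  # hypothetical far-end measurement host
COUNT = 100                       # probes per measurement
LOSS_ALARM_PERCENT = 0.1          # even very small loss matters on long-RTT paths

def measure_loss(host: str, count: int) -> float:
    """Run ping and return the reported packet-loss percentage."""
    out = subprocess.run(
        ["ping", "-c", str(count), "-i", "0.2", host],
        capture_output=True, text=True, check=False,
    ).stdout
    match = re.search(r"([\d.]+)% packet loss", out)
    return float(match.group(1)) if match else 100.0

if __name__ == "__main__":
    loss = measure_loss(TARGET, COUNT)
    print(f"{TARGET}: {loss}% loss over {COUNT} probes",
          "ALARM" if loss > LOSS_ALARM_PERCENT else "ok")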

75

The Message
• Again… a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

• Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References

[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References

[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References

[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOE's Office of Science
  • DOE Office of Science and ESnet – the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 4.1) System software tuning: Host tuning – TCP
  • System software tuning: Host tuning – TCP
  • System software tuning: Host tuning – TCP
  • 4.2) System software tuning: Data transfer tools
  • System software tuning: Data transfer tools
  • System software tuning: Data transfer tools (2)
  • System software tuning: Data transfer tools (3)
  • 4.4) System software tuning: Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN – Optical Private Network
  • The LHC OPN – Optical Private Network (2)
  • The LHC OPN – Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHC's Open Network Environment – LHCONE
  • Slide 54
  • The LHC's Open Network Environment – LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits – How They Use Them
  • End User View of Circuits – How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide R&D, consulting, and knowledge base
  • Provide R&D, consulting, and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 47: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

47

Ramp-up of LHC traffic in ESnet

(est of ldquosmallrdquo scale traffic)

LHC

turn

-on

LHC data systemtesting

LHC operationThe transition from testing to operation

was a smooth continuum due toat-scale testing ndash a process that took

more than 5 years

48

6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument

to data centers ndash a dedicated purpose-built infrastructure is needed

bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to

the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the

Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward

exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community

bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by

bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])

ndash that is only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ

Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using

synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on

75

The MessageAgain hellipA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data

management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

76

References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston

Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz

[fasterdata] See httpfasterdataesnetfasterdataperfSONAR

[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more

[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory

[LHCONE] httplhconenet

[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo

[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

[OIF1] OIF-FD-100G-DWDM-010 - 100G Ultra Long Haul DWDM Framework Document (June 2009) httpwwwoiforumcompublicdocumentsOIF-FD-100G-DWDM-010pdf

77

References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

78

References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations

httpwwwperfsonarnet

httppspsperfsonarnet

[REQ] httpswwwesnetaboutscience-requirements

[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010

(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )

[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223

[Tracy1] httpwwwnanogorgmeetingsnanog55presentationsTuesdayTracypdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 48: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

48

6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument

to data centers ndash a dedicated purpose-built infrastructure is needed

bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to

the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the

Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN

49

The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward

exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community

bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by

bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])

ndash that is only LHC data and compute servers are connected to the OPN

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
– With each generation of network transport technology:
• 155 Mb/s was the norm for high-speed networks in 1995
• 100 Gb/s – 650 times greater – is the norm today
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to
• first demonstrate, in a research environment, that "filling the network pipe" end-to-end (application to application) was possible
• and then do the development necessary for applications to make use of the new capabilities
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together (sketched below) to achieve 600 Mb/s over OC-12 (622 Mb/s) wide area network paths
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
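As a rough illustration of the parallel network I/O idea (several TCP streams used side by side so that no single stream's window or loss behavior caps the aggregate rate), here is a toy Python sketch. It is not the tooling actually used in such demonstrations (e.g. GridFTP/Globus); the host name, port, and stream count are placeholder assumptions, and the matching receiver is omitted.

# Toy sketch of parallel network I/O: send one file over several TCP streams
# at once so that no single stream's window or loss behavior caps the transfer.
# Host, port, and stream count are placeholders; the receiver side is omitted.
import os
import socket
import threading

def send_slice(path, offset, length, host, port, stream_id):
    """Send bytes [offset, offset + length) of the file on its own TCP stream."""
    with open(path, "rb") as f, socket.create_connection((host, port)) as sock:
        f.seek(offset)
        # small header so the receiver knows where this slice belongs
        sock.sendall(f"{stream_id} {offset} {length}\n".encode())
        remaining = length
        while remaining > 0:
            chunk = f.read(min(4 * 1024 * 1024, remaining))   # 4 MB reads
            if not chunk:
                break
            sock.sendall(chunk)
            remaining -= len(chunk)

def parallel_send(path, host="data-receiver.example.org", port=5000, streams=8):
    """Split the file into contiguous slices and send them concurrently."""
    size = os.path.getsize(path)
    per_stream = (size + streams - 1) // streams
    threads = []
    for i in range(streams):
        offset = i * per_stream
        length = min(per_stream, size - offset)
        if length <= 0:
            break
        t = threading.Thread(target=send_slice,
                             args=(path, offset, length, host, port, i))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

# Example (placeholder path): parallel_send("/data/example/dataset.bin", streams=8)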

66

Provide R&D, consulting, and knowledge base
• Providing consulting on the problems that data-intensive projects are having in effectively using the network is critical.
• Using the knowledge gained from that problem solving to build a community knowledge base benefits everyone.
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.

67

The knowledge base (http://fasterdata.es.net) topics:
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning (see the example after this list)
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
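As a small companion to the Host Tuning / Linux TCP Tuning material, the sketch below simply reads a few of the Linux kernel settings that fasterdata.es.net discusses (TCP buffer limits and the congestion control algorithm). The "suggested" values here are placeholders made up for this example; the authoritative recommendations are on the site itself.

# Read a few Linux TCP settings relevant to long-RTT, high-bandwidth paths.
# The "suggested" values are placeholders for illustration only; see
# fasterdata.es.net for the current, authoritative recommendations.
from pathlib import Path

SETTINGS = {
    "net.ipv4.tcp_rmem":               "4096 87380 134217728",   # placeholder: ~128 MB max
    "net.ipv4.tcp_wmem":               "4096 65536 134217728",
    "net.core.rmem_max":               "134217728",
    "net.core.wmem_max":               "134217728",
    "net.ipv4.tcp_congestion_control": "htcp",   # or another high-speed variant
}

def read_sysctl(name: str) -> str:
    """Return the current value of a sysctl by reading it from /proc/sys."""
    path = Path("/proc/sys") / name.replace(".", "/")
    try:
        return path.read_text().strip()
    except OSError:
        return "<not available on this host>"

if __name__ == "__main__":
    for name, suggested in SETTINGS.items():
        print(f"{name}\n  current:   {read_sysctl(name)}\n  suggested: {suggested}")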

68

The Message
• A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
• Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

69

Infrastructure Critical to Science
• The combination of:
– new network architectures in the wide area
– new network services (such as guaranteed bandwidth virtual circuits)
– cross-domain network error detection and correction
– redesigning the site LAN to handle high data throughput
– automation of data movement systems
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.

70

LHC lessons of possible use to the SKA – the similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instruments take data continuously.
• The data is generated at, or sent to, a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC's: a large data set that many different science groups may access in different ways to accomplish science.

71

LHC lessons of possible use to the SKA – the lessons
• The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
– A deep archive (tape-only) copy is probably practical in one location (e.g., the SKA supercomputer center), and this is done at CERN for the LHC.
– The technical aspects of building and operating a centralized working data repository:
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
• high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
militate against a single large data center.

72

LHC lessons of possible use to the SKA
• The LHC model of distributed data (multiple regional centers) has worked well:
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.

73

LHC lessons of possible use to the SKA
• Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that in the case of the SKA the T1 links would come to a centralized, distribution-only node – say, in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.

74

LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation among the R&E networks involved in providing parts of the path, etc. (see the rough estimate after this list).
• New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
• Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.
• Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
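The fragility referred to above can be made concrete with the well-known Mathis et al. estimate of steady-state, Reno-style TCP throughput, rate ~ (MSS / RTT) * (C / sqrt(loss)) with C ~ 1.22. The short calculation below, with illustrative RTTs and loss rates, shows how even a tiny loss rate collapses per-flow throughput on long paths; it is a back-of-the-envelope estimate, not a measurement.

# Back-of-the-envelope illustration of TCP's sensitivity to packet loss,
# using the Mathis et al. estimate: rate ~ (MSS / RTT) * (C / sqrt(loss)),
# with C ~ sqrt(3/2) for Reno-style congestion control. Values are illustrative.
import math

MSS_BITS = 1500 * 8            # 1500-byte segments
C = math.sqrt(3.0 / 2.0)       # ~1.22

def mathis_gbps(rtt_ms: float, loss: float) -> float:
    """Approximate achievable single-stream TCP throughput in Gb/s."""
    rtt_s = rtt_ms / 1000.0
    return (MSS_BITS / rtt_s) * (C / math.sqrt(loss)) / 1e9

for rtt_ms in (10, 50, 150):            # metro, continental, trans-ocean paths
    for loss in (1e-3, 1e-5, 1e-7):     # packet loss probability
        print(f"RTT {rtt_ms:3d} ms, loss {loss:.0e}: "
              f"~{mathis_gbps(rtt_ms, loss):8.3f} Gb/s")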

75

The Message
• Again: a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
• Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
[OIF1] OIF-FD-100G-DWDM-01.0, "100G Ultra Long Haul DWDM Framework Document" (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

Page 50: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

50

The LHC OPN ndash Optical Private Network

UK-T1_RAL

NDGF

FR-CCIN2P3

ES-PIC

DE-KIT

NL-T1

US-FNAL-CMS

US-T1-BNL

CA-TRIUMF

TW-ASCG

IT-NFN-CNAF

CH-CERNLHCOPN physical (abbreviated)

LHCOPN architecture

51

The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1

centers data transfer was to use dedicated physical 10G circuits

Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than

5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)

ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKA: The similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated/sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
  – It decentralizes costs and involves many countries directly in the telescope infrastructure.
  – It divides up the network load, especially on the expensive trans-ocean links.
  – It divides up the cache I/O load across distributed sites.

73

LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
  • It might be that in the case of the SKA the T1 links would come to a centralized, distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
  • In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
  – In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
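As a rough illustration of the data-movement scale behind these choices, the back-of-envelope sketch below converts the 100 Gb/s flow mentioned above into a daily volume and an aggregate egress requirement. The Tier 1 count is an illustrative assumption, not an SKA design number.

```python
# Back-of-envelope sizing for an SKA-style central distribution node.
# The 100 Gb/s sustained flow is the figure used above; the Tier 1 count
# is an illustrative assumption, not an SKA design number.
ingress_gbps = 100
n_tier1 = 8
seconds_per_day = 86_400

bytes_per_day = ingress_gbps * 1e9 / 8 * seconds_per_day
print(f"ingest at the distribution node: ~{bytes_per_day / 1e15:.1f} PB/day")

# Egress depends on the distribution model:
print(f"egress if every Tier 1 gets a full copy: {ingress_gbps * n_tier1} Gb/s")
print(f"egress if the data is partitioned across Tier 1s: {ingress_gbps} Gb/s")
```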

74

LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration. (The short calculation below illustrates how sensitive standard TCP is to loss on long paths.)

Re-engineering the site LAN-WAN architecture is critical: the Science DMZ.

Workflow management systems that automate the data movement will have to be designed and tested.
  – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
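The "fragile workhorse" point can be made quantitative with the well-known Mathis et al. approximation for loss-limited TCP throughput. The sketch below uses an assumed 1460-byte MSS and 100 ms RTT, values chosen only for illustration, and shows how quickly even tiny loss rates cap a single standard TCP stream.

```python
from math import sqrt

def tcp_throughput_bps(mss_bytes: float, rtt_s: float, loss_rate: float) -> float:
    """Mathis et al. approximation for loss-limited TCP throughput:
    rate ~ (MSS / RTT) * (C / sqrt(p)), with C = sqrt(3/2) ~ 1.22."""
    C = sqrt(3.0 / 2.0)
    return (mss_bytes * 8 / rtt_s) * (C / sqrt(loss_rate))

# Illustrative numbers: 1460-byte MSS, 100 ms round-trip time (roughly trans-Atlantic).
for loss in (1e-8, 1e-6, 1e-4):
    gbps = tcp_throughput_bps(1460, 0.100, loss) / 1e9
    print(f"packet loss {loss:.0e}  ->  ~{gbps:7.3f} Gb/s per TCP stream")
```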

75

The Message
Again … A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more/

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 - 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, 2006 – IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C., Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

Page 51: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

51

The LHC OPN – Optical Private Network
N.B.:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
  – The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
  – However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.

Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
  – In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic.
  – (There are about 170 Tier 2 sites.)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170 – see the short calculation below) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.
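A quick calculation makes the scaling argument concrete (the 170-site figure is from the slide; the arithmetic is the only addition):

```python
# Why a per-pair virtual circuit mesh across the Tier 2 sites is impractical.
n_tier2 = 170
print(n_tier2 * n_tier2, "Tier 2 - Tier 2 combinations (the slide's 170 x 170)")    # 28,900
print(n_tier2 * (n_tier2 - 1) // 2, "distinct site pairs, even ignoring direction")  # 14,365
```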

53

The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems.
  – The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GEANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.).
  – The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).
In this way the LHC traffic will use circuits designated by the network engineers.
  – To ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.

54

[Figure (map, April 2012): LHCONE – a global infrastructure for the LHC Tier 1 data center to Tier 2 analysis center connectivity. The map shows the LHCONE VRF domains (among them ESnet and Internet2 in the USA, CANARIE in Canada, GÉANT in Europe fronting the NRENs – NORDUnet, DFN, GARR, RedIRIS, SARA, RENATER – plus ASGC and TWAREN in Taiwan, KERONET2 and KISTI in Korea, and CUDI in Mexico), the regional R&E communication nexus points (Seattle, Chicago, New York, Washington, Amsterdam, Geneva), and the end sites – LHC Tier 2 or Tier 3 unless indicated as Tier 1 (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1). Data communication links are 10, 20, and 30 Gb/s. See http://lhcone.net for details.]

55

The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because:
  – the VRF technology is a standard capability in most core routers, and
  – there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See lhcone.net

LHCONE is one part of the network infrastructure that supports the LHC

CERN → T1           miles     km
France                350     565
Italy                 570     920
UK                    625    1000
Netherlands           625    1000
Germany               700    1185
Spain                 850    1400
Nordic               1300    2100
USA – New York       3900    6300
USA – Chicago        4400    7100
Canada – BC          5200    8400
Taiwan               6100    9850
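These path lengths translate directly into the round-trip times that dominate TCP behavior on the LHCOPN. The sketch below converts them to rough RTTs, assuming propagation at about two-thirds of the speed of light in fiber and ignoring equipment and routing delays, so real RTTs are somewhat higher.

```python
# Convert CERN -> Tier 1 fiber path lengths to rough round-trip times.
# Assumes propagation at ~2/3 c in fiber (~200,000 km/s) and ignores
# router/DWDM equipment latency and path inflation.
C_FIBER_KM_PER_S = 2.0e5

paths_km = {
    "France": 565, "Italy": 920, "UK": 1000, "Netherlands": 1000,
    "Germany": 1185, "Spain": 1400, "Nordic": 2100,
    "USA - New York": 6300, "USA - Chicago": 7100,
    "Canada - BC": 8400, "Taiwan": 9850,
}

for name, km in paths_km.items():
    rtt_ms = 2 * km / C_FIBER_KM_PER_S * 1000
    print(f"{name:15s} ~{rtt_ms:5.1f} ms RTT")
```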

[Figure: A Network Centric View of the LHC. Data flows from the detector (~1 PB/s) through the Level 1 and 2 triggers (O(1-10) meters) and the Level 3 trigger (O(10-100) meters) to the CERN computer center (O(1) km), then at 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) over the LHC Optical Private Network (LHCOPN) to the LHC Tier 1 data centers 500-10,000 km away (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN), and from there over the LHC Open Network Environment (LHCONE) to the LHC Tier 2 analysis centers at the universities and physics groups. This is intended to indicate that the physics groups now get their data wherever it is most readily available.]

57

7) New network services: Point-to-Point Virtual Circuit Service
Why a Circuit Service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to:
  – Couple existing pockets of code, data, and expertise into "systems of systems"
  – Break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
  – See https://www.es.net/about/science-requirements/
• A commonly identified need to support this is that networking must be provided as a "service":
  – Schedulable with guaranteed bandwidth – as is done with CPUs and disks
  – Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
  – Some network path characteristics may also be specified – e.g. diversity
  – Available in the Web Services / Grid Services paradigm

58

Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
  – This is typically done by using a "static" routing mechanism
    • e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path (a toy model of this mechanism is sketched below)
  – MPLS and OpenFlow are examples of this, and both can transport IP packets
  – Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic.
  – The virtual circuits can be directed to specific physical network paths when they are set up.
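As a concrete illustration of the "static switch tables set up in advance" idea, here is a minimal toy model of label-switched forwarding along a pre-provisioned path; the switch names, ports, and labels are invented for the sketch, and this is not any particular router's implementation.

```python
# Toy model of label-switched forwarding along a pre-provisioned circuit.
# Each "switch" holds a static table, installed when the circuit is set up,
# mapping (in_port, in_label) -> (out_port, out_label).  This is only a
# conceptual sketch of the mechanism MPLS/OpenFlow-style circuits rely on.

label_tables = {
    "switch-A": {(1, 100): (3, 200)},   # ingress: traffic enters on port 1 with label 100
    "switch-B": {(2, 200): (4, 300)},   # transit
    "switch-C": {(1, 300): (2, None)},  # egress: label popped, delivered on port 2
}
path = ["switch-A", "switch-B", "switch-C"]   # the physical path chosen at setup time
ingress_port = {"switch-A": 1, "switch-B": 2, "switch-C": 1}

def forward(payload: str, label: int) -> None:
    """Walk a packet along the circuit using only the static label tables."""
    for switch in path:
        out_port, label = label_tables[switch][(ingress_port[switch], label)]
        print(f"{switch}: forwarded on port {out_port}, next label {label}")
    print(f"delivered: {payload!r}")

forward("science data", 100)
```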

59

Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.
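To give a feel for what "network as a service" means in practice, the sketch below shows the kind of information a guaranteed-bandwidth reservation request to an OSCARS-like circuit service carries (endpoints, bandwidth, schedule). The field names, endpoint strings, and the submit_reservation helper are hypothetical illustrations, not the actual OSCARS API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class CircuitRequest:
    """Illustrative (hypothetical) shape of a guaranteed-bandwidth circuit request."""
    src_endpoint: str          # ingress edge port / VLAN at the source site
    dst_endpoint: str          # egress edge port / VLAN at the destination site
    bandwidth_mbps: int        # guaranteed bandwidth to reserve
    start: datetime            # reservations are schedulable, like CPU or disk
    end: datetime
    description: str = ""

def submit_reservation(req: CircuitRequest) -> dict:
    """Placeholder for the call a site's workflow system would make to the
    circuit service's web-service interface; here it just echoes the request."""
    return {"status": "REQUESTED", "request": req}

req = CircuitRequest(
    src_endpoint="site-A:ge-1/0/0:vlan-3001",
    dst_endpoint="site-B:xe-2/1/0:vlan-3001",
    bandwidth_mbps=5000,
    start=datetime(2014, 4, 1, 2, 0),
    end=datetime(2014, 4, 1, 2, 0) + timedelta(hours=6),
    description="LHC analysis bulk transfer (example)",
)
print(submit_reservation(req))
```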

60

End User View of Circuits – How They Use Them
• Who are the "users"?
  – Sites, for the most part
• How are the circuits used?
  – End system to end system, IP:
    • Almost never – very hard unless private address space is used
      – Using public address space can result in leaking routes
      – Using private address space with multi-homed hosts risks allowing backdoors into secure networks
  – End system to end system, Ethernet (or other) over VLAN – a pseudowire:
    • Relatively common
    • Interesting example: RDMA over VLAN is likely to be popular in the future
      – The SC11 demo of 40G RDMA over the WAN was very successful
      – CPU load for RDMA is a small fraction of that of IP
      – The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks)
  – Point-to-point connection between routing instances – e.g. BGP at the end points:
    • Essentially this is how all current circuits are used: from one site router to another site router
    • Typically site-to-site, or advertise subnets that host clusters, e.g. LHC analysis or data management clusters

61

End User View of Circuits – How They Use Them
• When are the circuits used?
  – Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Service
• Large-scale science always involves institutions in multiple network domains (administrative units).
  – For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits.
  – e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.

63

Inter-Domain Control Protocol
• There are two realms involved:
  1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
  2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.

[Figure: an end-to-end virtual circuit built across five domains – FNAL (AS3152) [US], ESnet (AS293) [US], GEANT (AS20965) [Europe], DFN (AS680) [Germany], and DESY (AS1754) [Germany] – between a user source and a user destination. Each domain runs a local inter-domain controller (IDC), e.g. OSCARS in ESnet and AutoBAHN in GEANT, with topology exchange between domains, VC setup requests passed from domain to domain, and a data plane connection helper at each domain ingress/egress point.]

1. The domains exchange topology information containing at least the potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. an Ethernet VLAN-to-VLAN connection) is facilitated by a helper process.

The result is the end-to-end virtual circuit.
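A minimal sketch of the domain-to-domain setup walk described above, with made-up domain names, capacities, and an all-or-nothing reservation rule; real inter-domain controllers (OSCARS, AutoBAHN, NSI agents) are far more involved.

```python
# Toy model of the domain-to-domain VC setup walk: the request is passed along
# the chain of domains, each reserving its own segment; if any domain cannot
# commit the bandwidth, the segments already reserved are released.
# Domain names and capacities are illustrative only.

domains = [
    {"name": "site-A regional",   "free_gbps": 40},
    {"name": "national backbone", "free_gbps": 100},
    {"name": "trans-Atlantic",    "free_gbps": 20},
    {"name": "European NREN",     "free_gbps": 60},
    {"name": "site-B campus",     "free_gbps": 10},
]

def setup_circuit(requested_gbps: float) -> bool:
    reserved = []
    for dom in domains:                          # the VC setup request walks the path
        if dom["free_gbps"] < requested_gbps:
            print(f"  {dom['name']}: cannot commit {requested_gbps} Gb/s, tearing down")
            for d in reserved:                   # release the segments already reserved
                d["free_gbps"] += requested_gbps
            return False
        dom["free_gbps"] -= requested_gbps
        reserved.append(dom)
        print(f"  {dom['name']}: segment reserved")
    return True

print("8 Gb/s circuit set up end to end:", setup_circuit(8))
print("15 Gb/s circuit set up end to end:", setup_circuit(15))   # fails mid-path and rolls back
```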

64

Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
  – Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system.
Multi-domain circuit setup is not yet a robust production service, but progress is being made.
• See lhcone.net

Page 52: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)

bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism

bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ

Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using

synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on

75

The MessageAgain hellipA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data

management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

76

References

[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet, Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References

[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References

[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

• Foundations of data-intensive science: Technology and practice
• Data-Intensive Science in DOE's Office of Science
• DOE Office of Science and ESnet – the ESnet Mission
• HEP as a Prototype for Data-Intensive Science
• HEP as a Prototype for Data-Intensive Science (2)
• HEP as a Prototype for Data-Intensive Science (3)
• HEP as a Prototype for Data-Intensive Science (4)
• HEP as a Prototype for Data-Intensive Science (5)
• The LHC data management model involves a world-wide collection
• Scale of ATLAS analysis driven data movement
• HEP as a Prototype for Data-Intensive Science (6)
• HEP as a Prototype for Data-Intensive Science (7)
• SKA data flow model is similar to the LHC
• Foundations of data-intensive science
• 1) Underlying network issues
• We face a continuous growth of data transport
• 1a) Optical Network Technology
• Optical Network Technology
• 1b) Network routers and switches
• The Energy Sciences Network ESnet5 (Fall 2013)
• 2) Data transport: The limitations of TCP must be addressed for
• Transport
• Transport: Impact of packet loss on TCP
• Transport: Modern TCP stack
• Transport: Modern TCP stack (2)
• 3) Monitoring and testing
• perfSONAR
• perfSONAR (2)
• 4) System software evolution and optimization
• 4.1) System software tuning: Host tuning – TCP
• System software tuning: Host tuning – TCP
• System software tuning: Host tuning – TCP
• 4.2) System software tuning: Data transfer tools
• System software tuning: Data transfer tools
• System software tuning: Data transfer tools (2)
• System software tuning: Data transfer tools (3)
• 4.4) System software tuning: Other issues
• 5) Site infrastructure to support data-intensive science: The Science DMZ
• The Science DMZ
• The Science DMZ (2)
• The Science DMZ (3)
• 6) Data movement and management techniques
• Highly distributed and highly automated workflow systems
• Slide 44
• Scale of ATLAS analysis driven data movement (2)
• Building an LHC-scale production analysis system
• Ramp-up of LHC traffic in ESnet
• 6 cont.) Evolution of network architectures
• The LHC OPN – Optical Private Network
• The LHC OPN – Optical Private Network (2)
• The LHC OPN – Optical Private Network (3)
• Managing large-scale science traffic in a shared infrastructure
• The LHC's Open Network Environment – LHCONE
• Slide 54
• The LHC's Open Network Environment – LHCONE (2)
• LHCONE is one part of the network infrastructure that supports
• 7) New network services
• Point-to-Point Virtual Circuit Service
• Point-to-Point Virtual Circuit Service (2)
• End User View of Circuits – How They Use Them
• End User View of Circuits – How They Use Them (2)
• Cross-Domain Virtual Circuit Service
• Inter-Domain Control Protocol
• Point-to-Point Virtual Circuit Service (3)
• 8) Provide R&D consulting and knowledge base
• Provide R&D consulting and knowledge base
• The knowledge base
• The Message
• Infrastructure Critical to Science
• LHC lessons of possible use to the SKA
• LHC lessons of possible use to the SKA (2)
• LHC lessons of possible use to the SKA (3)
• LHC lessons of possible use to the SKA (4)
• LHC lessons of possible use to the SKA (5)
• The Message (2)
• References
• References (2)
• References (3)
Page 53: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

53

The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure

designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)

The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each

involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc

ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)

In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure

that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ

Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using

synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on

75

The MessageAgain hellipA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data

management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

76

References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston

Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz

[fasterdata] See httpfasterdataesnetfasterdataperfSONAR

[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more

[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory

[LHCONE] httplhconenet

[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo

[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

[OIF1] OIF-FD-100G-DWDM-010 - 100G Ultra Long Haul DWDM Framework Document (June 2009) httpwwwoiforumcompublicdocumentsOIF-FD-100G-DWDM-010pdf

77

References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

78

References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations

httpwwwperfsonarnet

httppspsperfsonarnet

[REQ] httpswwwesnetaboutscience-requirements

[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010

(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )

[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223

[Tracy1] httpwwwnanogorgmeetingsnanog55presentationsTuesdayTracypdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 54: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

54

ESnetUSA

Chicago

New YorkBNL-T1

Internet2USA

Harvard

CANARIECanada

UVic

SimFraU

TRIUMF-T1

UAlb UTorMcGilU

Seattle

TWARENTaiwan

NCU NTU

ASGCTaiwan

ASGC-T1

KERONET2Korea

KNU

LHCONE VRF domain

End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1

Regional RampE communication nexus

Data communication links 10 20 and 30 Gbs

See httplhconenet for details

NTU

Chicago

LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity

NORDUnetNordic

NDGF-T1aNDGF-T1a NDGF-T1c

DFNGermany

DESYGSI DE-KIT-T1

GARRItaly

INFN-Nap CNAF-T1

RedIRISSpain

PIC-T1

SARANetherlands

NIKHEF-T1

RENATERFrance

GRIF-IN2P3

Washington

CUDIMexico

UNAM

CC-IN2P3-T1Sub-IN2P3

CEA

CERNGeneva

CERN-T1

SLAC

GLakes

NE

MidWSoW

Geneva

KISTIKorea

TIFRIndia

India

Korea

FNAL-T1

MIT

CaltechUFlorida

UNebPurU

UCSDUWisc

UltraLightUMich

Amsterdam

GEacuteANT Europe

April 2012

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ

Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using

synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on

75

The MessageAgain hellipA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data

management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

76

References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston

Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz

[fasterdata] See httpfasterdataesnetfasterdataperfSONAR

[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more

[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory

[LHCONE] httplhconenet

[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo

[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

[OIF1] OIF-FD-100G-DWDM-010 - 100G Ultra Long Haul DWDM Framework Document (June 2009) httpwwwoiforumcompublicdocumentsOIF-FD-100G-DWDM-010pdf

77

References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

78

References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations

httpwwwperfsonarnet

httppspsperfsonarnet

[REQ] httpswwwesnetaboutscience-requirements

[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010

(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )

[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223

[Tracy1] httpwwwnanogorgmeetingsnanog55presentationsTuesdayTracypdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 55: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

55

The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because

ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for

use by the LHC collaboration that cannot be made available for general RampE traffic

bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic

bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance

bull See LHCONEnet

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network services: Point-to-Point Virtual Circuit Service

Why a Circuit Service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed application systems in order to
  – couple existing pockets of code, data, and expertise into "systems of systems"
  – break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
  – see https://www.es.net/about/science-requirements/
• A commonly identified need to support this is that networking must be provided as a "service":
  – schedulable, with guaranteed bandwidth – as is done with CPUs and disks
  – traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
  – some network path characteristics may also be specified – e.g. diversity
  – available in a Web Services / Grid Services paradigm
(A sketch of what such a reservation request might look like follows below.)
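To make "network as a service" concrete, the sketch below shows roughly what a guaranteed-bandwidth circuit request carries – endpoints, bandwidth, and a start/end time, much as a batch system reserves CPUs. It is an illustrative data structure only; the field names are hypothetical and this is not the OSCARS or NSI interface.

```python
# Hypothetical illustration of a point-to-point circuit reservation request.
from dataclasses import dataclass, asdict
from datetime import datetime, timedelta
import json

@dataclass
class CircuitRequest:
    src_endpoint: str          # ingress port/VLAN at the source site
    dst_endpoint: str          # egress port/VLAN at the destination site
    bandwidth_mbps: int        # guaranteed bandwidth for the reservation window
    start: str                 # ISO 8601 start time
    end: str                   # ISO 8601 end time
    path_constraint: str = ""  # optional, e.g. "diverse-from:circuit-42"

start = datetime(2014, 4, 1, 20, 0)
req = CircuitRequest(
    src_endpoint="site-A:ge-1/0/0:vlan-3001",
    dst_endpoint="site-B:xe-2/1/0:vlan-3001",
    bandwidth_mbps=10000,
    start=start.isoformat(),
    end=(start + timedelta(hours=8)).isoformat(),
)

# In a real service this request would be submitted to the provider's
# web-service API; here we just print the JSON payload.
print(json.dumps(asdict(req), indent=2))
```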

58

Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
  – This is typically done by using a "static" routing mechanism
    • e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path (see the sketch below)
  – MPLS and OpenFlow are examples of this, and both can transport IP packets
  – Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
  – The virtual circuits can be directed to specific physical network paths when they are set up
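The label-swapping idea behind such statically defined circuit paths can be shown in a few lines. The sketch below is a toy model, not MPLS itself or any router's configuration: each switch holds a pre-installed table mapping an incoming label to an outgoing switch and label, so the circuit's path is fixed in advance regardless of IP routing.

```python
# Toy label-switching illustration: static tables installed ahead of time
# define the circuit path; forwarding just swaps labels hop by hop.

# One table per switch: incoming label -> (next switch, outgoing label)
switch_tables = {
    "switch-A": {100: ("switch-B", 200)},
    "switch-B": {200: ("switch-C", 300)},
    "switch-C": {300: ("egress", None)},   # pop the label at the last hop
}

def forward(ingress_switch, ingress_label, payload):
    """Follow the pre-installed label-switched path and return the hops taken."""
    hops, switch, label = [], ingress_switch, ingress_label
    while switch != "egress":
        next_switch, next_label = switch_tables[switch][label]
        hops.append((switch, label))
        switch, label = next_switch, next_label
    return hops, payload

hops, _ = forward("switch-A", 100, b"big science data")
print(hops)   # [('switch-A', 100), ('switch-B', 200), ('switch-C', 300)]
```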

59

Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead, Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references
• OSCARS received a 2013 "R&D 100" award

60

End User View of Circuits – How They Use Them
• Who are the "users"?
  – Sites, for the most part
• How are the circuits used?
  – End system to end system, IP
    • Almost never – very hard unless private address space is used
      – Using public address space can result in leaking routes
      – Using private address space with multi-homed hosts risks allowing backdoors into secure networks
  – End system to end system, Ethernet (or other) over VLAN – a pseudowire
    • Relatively common
    • Interesting example: RDMA over VLAN is likely to be popular in the future
      – The SC11 demo of 40G RDMA over the WAN was very successful
      – CPU load for RDMA is a small fraction of that of IP
      – The guaranteed characteristics of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fit nicely with circuit services (RDMA performs very poorly on best-effort networks)
  – Point-to-point connection between routing instances – e.g. BGP at the end points
    • Essentially this is how all current circuits are used: from one site router to another site router
    • Typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters

61

End User View of Circuits – How They Use Them
• When are the circuits used?
  – Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Service
• Large-scale science always involves institutions in multiple network domains (administrative units)
  – For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
  – e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains

63

Inter-Domain Control Protocol
• There are two realms involved:
  1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
  2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains

[Figure: an example end-to-end virtual circuit from a user source at FNAL (AS3152) [US] across ESnet (AS293) [US], GÉANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local inter-domain controller (IDC) – OSCARS and AutoBAHN among them – and the domains exchange topology information and pass the VC setup request from one to the next, with a data plane connection helper at each domain ingress/egress point.]

1. The domains exchange topology information containing at least the potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. an Ethernet VLAN-to-VLAN connection) is facilitated by a helper process.
(A simplified sketch of this chained setup follows below.)
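The domain-to-domain chaining in steps 1-3 can be sketched briefly. The code below is a simplified illustration of the control flow only; the domain names are taken from the figure, but the classes and the reserve/setup behavior are invented for illustration and are not the IDC or NSI protocol.

```python
# Simplified illustration of chained inter-domain VC setup: each domain's
# controller reserves its own segment, then forwards the request downstream.

class DomainController:
    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream   # next domain toward the destination

    def reserve_segment(self, request):
        # Stand-in for local authorization, scheduling, and resource commitment.
        print(f"{self.name}: reserved {request['bandwidth_gbps']} Gb/s segment")
        return True

    def setup(self, request):
        if not self.reserve_segment(request):
            return False                       # a real system would tear down upstream segments
        if self.downstream is None:
            return True                        # reached the last domain
        return self.downstream.setup(request)  # pass the VC setup request onward

# Chain of domains from source to destination (structure illustrative only).
desy  = DomainController("DESY")
dfn   = DomainController("DFN", downstream=desy)
geant = DomainController("GEANT", downstream=dfn)
esnet = DomainController("ESnet", downstream=geant)
fnal  = DomainController("FNAL", downstream=esnet)

ok = fnal.setup({"bandwidth_gbps": 10, "vlan": 3001})
print("end-to-end circuit established:", ok)
```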

64

Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
  – Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system
• Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net

65

8) Provide R&D consulting and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
  – With each generation of network transport technology:
    • 155 Mb/s was the norm for high-speed networks in 1995
    • 100 Gb/s – some 650 times greater – is the norm today
    • R&D groups involving hardware engineers, computer scientists, and application specialists worked to
      – first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
      – and then do the development necessary for applications to make use of the new capabilities
  – Examples of how this methodology drove toward today's capabilities include:
    • experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC-12 (622 Mb/s) wide area network paths
    • recent demonstrations of this approach to achieve disk-to-disk WAN data transfers at 100 Gb/s

66

Provide R&D consulting and knowledge base
• Providing consulting on the problems that data-intensive projects are having in effectively using the network is critical.
• Using the knowledge gained from that problem solving to build a community knowledge base benefits everyone.
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.

67

The knowledge base: http://fasterdata.es.net topics
  – Network Architecture, including the Science DMZ model
  – Host Tuning
  – Network Tuning
  – Data Transfer Tools
  – Network Performance Testing
  – With special sections on:
    • Linux TCP Tuning
    • Cisco 6509 Tuning
    • perfSONAR How-to
    • Active perfSONAR Services
    • Globus overview
    • Say No to SCP
    • Data Transfer Nodes (DTN)
    • TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
(A small host-tuning example in the spirit of these recommendations follows below.)
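As a flavor of the host-tuning material, the sketch below shows one recurring theme: the socket buffers on the end hosts must cover the bandwidth-delay product of a long, fast path or TCP cannot fill the pipe. It is a generic illustration rather than text from fasterdata.es.net; the 100 ms RTT and 10 Gb/s figures are assumptions chosen for the arithmetic.

```python
# Illustration of why host tuning matters on long, fast paths: the socket
# buffer must cover the bandwidth-delay product (BDP) to sustain full rate.
import socket

rtt_s = 0.100                 # assumed round-trip time: 100 ms (trans-continental)
bandwidth_bps = 10e9          # assumed path capacity: 10 Gb/s
bdp_bytes = int(bandwidth_bps / 8 * rtt_s)
print(f"Bandwidth-delay product: {bdp_bytes / 2**20:.0f} MiB")   # ~119 MiB

# Request a large receive buffer on a socket; the kernel caps this at its
# configured maximums, which is what the Linux TCP tuning guidance adjusts.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)
granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(f"Receive buffer granted by the kernel: {granted / 2**20:.1f} MiB")
sock.close()
```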

68

The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

69

Infrastructure Critical to Science
• The combination of
  – new network architectures in the wide area,
  – new network services (such as guaranteed-bandwidth virtual circuits),
  – cross-domain network error detection and correction,
  – redesigning the site LAN to handle high data throughput,
  – automation of data movement systems, and
  – use of appropriate operating system tuning and data transfer tools
  now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.

70

LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated/sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.

71

LHC lessons of possible use to the SKA
The lessons:
• The science data product (the output of the supercomputer center, in the SKA case) is likely too large to keep the working data set in one location.
  – A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
  – The technical aspects of building and operating a centralized working data repository –
    • a large mass storage system with very large cache disks in order to satisfy current requests in an acceptable time, and
    • high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites –
    mitigate against a single large data center.

72

LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
  – It decentralizes costs and involves many countries directly in the telescope infrastructure.
  – It divides up the network load, especially on the expensive trans-ocean links.
  – It divides up the cache I/O load across distributed sites.

73

LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply. There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
  • It might be that in the case of the SKA the T1 links would come to a centralized, distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
  • In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
  – In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.

74

LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.

Re-engineering the site LAN-WAN architecture is critical: the Science DMZ.

Workflow management systems that automate the data movement will have to be designed and tested.
  – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.

75

The Message
Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C., Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 56: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

LHCONE is one part of the network infrastructure that supports the LHC

CERN rarrT1 miles kms

France 350 565

Italy 570 920

UK 625 1000

Netherlands 625 1000

Germany 700 1185

Spain 850 1400

Nordic 1300 2100

USA ndash New York 3900 6300

USA - Chicago 4400 7100

Canada ndash BC 5200 8400

Taiwan 6100 9850

CERN Computer Center

The LHC Optical Private Network

(LHCOPN)

LHC Tier 1Data Centers

LHC Tier 2 Analysis Centers

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups Universities

physicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

Universitiesphysicsgroups

The LHC Open Network

Environment(LHCONE)

50 Gbs (25Gbs ATLAS 25Gbs CMS)

detector

Level 1 and 2 triggers

Level 3 trigger

O(1-10) meter

O(10-100) meters

O(1) km

1 PBs

500-10000 km

This is intended to indicate that the physics

groups now get their datawherever it is most readily

available

A Network Centric View of the LHC

Taiwan Canada USA-Atlas USA-CMS

Nordic

UK

Netherlands Germany Italy

Spain

FranceCERN

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ

Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using

synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on

75

The MessageAgain hellipA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data

management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

76

References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston

Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz

[fasterdata] See httpfasterdataesnetfasterdataperfSONAR

[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more

[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory

[LHCONE] httplhconenet

[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo

[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

[OIF1] OIF-FD-100G-DWDM-010 - 100G Ultra Long Haul DWDM Framework Document (June 2009) httpwwwoiforumcompublicdocumentsOIF-FD-100G-DWDM-010pdf

77

References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

78

References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations

httpwwwperfsonarnet

httppspsperfsonarnet

[REQ] httpswwwesnetaboutscience-requirements

[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010

(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )

[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223

[Tracy1] httpwwwnanogorgmeetingsnanog55presentationsTuesdayTracypdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 57: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

57

7) New network servicesPoint-to-Point Virtual Circuit Service

Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly

consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of

systemsrdquondash Break up the task of massive data analysis and use data compute and

storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements

A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work

well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ

Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using

synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on

75

The MessageAgain hellipA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data

management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach", Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR

[HPBulk] "High Performance Bulk Data Transfer", Brian Tierney and Joe Metzger, ESnet, Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document".

[NetServ] "Network Services for High Performance Distributed Computing and Data Management", W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System", Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management", W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service", William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework", B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing", K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 58: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

58

Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual

circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism

bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path

ndash MPLS and OpenFlow are examples of this and both can transport IP packets

ndash Most modern Internet routers have this type of functionality

bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths

when they are set up

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ

Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using

synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on

75

The MessageAgain hellipA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data

management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

76

References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston

Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz

[fasterdata] See httpfasterdataesnetfasterdataperfSONAR

[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more

[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory

[LHCONE] httplhconenet

[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo

[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

[OIF1] OIF-FD-100G-DWDM-010 - 100G Ultra Long Haul DWDM Framework Document (June 2009) httpwwwoiforumcompublicdocumentsOIF-FD-100G-DWDM-010pdf

77

References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

78

References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations

httpwwwperfsonarnet

httppspsperfsonarnet

[REQ] httpswwwesnetaboutscience-requirements

[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010

(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )

[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223

[Tracy1] httpwwwnanogorgmeetingsnanog55presentationsTuesdayTracypdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 59: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

59

Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service

(For more information contact the project lead Chin Guok chinesnet)

bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references

bull OSCARS received a 2013 ldquoRampD 100rdquo award

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ

Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using

synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on

75

The MessageAgain hellipA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data

management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

76

References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston

Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz

[fasterdata] See httpfasterdataesnetfasterdataperfSONAR

[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more

[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory

[LHCONE] httplhconenet

[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo

[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

[OIF1] OIF-FD-100G-DWDM-010 - 100G Ultra Long Haul DWDM Framework Document (June 2009) httpwwwoiforumcompublicdocumentsOIF-FD-100G-DWDM-010pdf

77

References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

78

References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations

httpwwwperfsonarnet

httppspsperfsonarnet

[REQ] httpswwwesnetaboutscience-requirements

[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010

(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )

[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223

[Tracy1] httpwwwnanogorgmeetingsnanog55presentationsTuesdayTracypdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 60: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

60

End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo

ndash Sites for the most part

bull How are the circuits usedndash End system to end system IP

bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure

networks

ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future

ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP

protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)

ndash Point-to-point connection between routing instance ndash eg BGP at the end points

bull Essentially this is how all current circuits are used from one site router to another site router

ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters

61

End User View of Circuits ndash How They Use Thembull When are the circuits used

ndash Mostly to solve a specific problem that the general infrastructure cannot

bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62

Cross-Domain Virtual Circuit Servicebull Large-scale science always involves institutions in multiple

network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains

involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET

(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains

63

Inter-Domain Control Protocolbull There are two realms involved

1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains

2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains

FNAL (AS3152)[US]

ESnet (AS293)[US]

GEANT (AS20965)[Europe]

DFN (AS680)[Germany]

DESY (AS1754)[Germany]

Topology exchange

VC setup request

Local InterDomain

Controller

Local IDC

Local IDC

Local IDC

Local IDC

VC setup request

VC setup request

VC setup request

OSCARS

User source

User destination

VC setup request

data plane connection helper at each domain ingressegress point

data plane connection helper at each domain ingressegress point

1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to

domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process

The end-to-end virtual circuit

AutoBAHN

64

Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved

into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated

Facility - an international virtual organization that promotes the paradigm of lambda networking)

bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system

Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ

Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using

synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on

75

The MessageAgain hellipA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data

management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

76

References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet, Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

Page 64: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

64

Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
– Testing is being coordinated in GLIF (the Global Lambda Integrated Facility, an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services (e.g., CPU and storage scheduling) in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system (a hypothetical reservation request is sketched below)
• Multi-domain circuit setup is not yet a robust production service, but progress is being made – see lhcone.net
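To make the "network as a schedulable service" idea concrete, the sketch below builds the kind of reservation request a workflow system might submit to a virtual circuit service before a planned transfer window. The endpoint names, field names, and the `build_reservation` helper are purely illustrative assumptions; this is not the actual OSCARS or NSI interface.

```python
# A minimal sketch of treating the network as a schedulable service.
# All names here (build_reservation, field names, endpoints) are hypothetical
# illustrations, not the real OSCARS/NSI APIs.
import json
from datetime import datetime, timedelta, timezone

def build_reservation(src, dst, gbps, start, hours):
    """Describe a point-to-point circuit request for a planned transfer window."""
    return {
        "source_endpoint": src,            # e.g. a site's Science DMZ border port
        "destination_endpoint": dst,
        "bandwidth_gbps": gbps,            # guaranteed bandwidth for the window
        "start_time": start.isoformat(),
        "end_time": (start + timedelta(hours=hours)).isoformat(),
    }

if __name__ == "__main__":
    window_start = datetime.now(timezone.utc) + timedelta(hours=2)
    # Reserve the circuit for the same window in which CPU and storage are booked,
    # so computing, data access, and data movement behave as one predictable system.
    request = build_reservation("tier1-a.example.org", "tier2-b.example.org",
                                gbps=40, start=window_start, hours=6)
    print(json.dumps(request, indent=2))
```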

65

8) Provide R&D consulting and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
– With each generation of network transport technology:
• 155 Mb/s was the norm for high-speed networks in 1995
• 100 Gb/s – about 650 times greater – is the norm today
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to:
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
• and then do the development necessary for applications to make use of the new capabilities
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC-12 (622 Mb/s) wide area network paths (a simplified parallel-stream sketch follows this list)
• recent demonstrations of this technique to achieve disk-to-disk WAN data transfers at 100 Gb/s
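The parallel I/O idea above can be illustrated with a short, self-contained sketch: several TCP streams send data side by side, so the aggregate transfer is not bounded by a single stream's window and loss behavior. The loopback sink, stream count, and payload sizes are assumptions for illustration only; they are not the original experiments' configuration, and loopback throughput is obviously not WAN throughput.

```python
# Minimal sketch of parallel network I/O: N TCP streams send data concurrently
# to a sink, and the aggregate rate is reported. Stream count, chunk sizes, and
# the loopback sink are illustrative assumptions, not a measured WAN setup.
import socket
import threading
import time

N_STREAMS = 4            # assumed number of parallel streams
CHUNK = 1 << 20          # 1 MiB per send
CHUNKS_PER_STREAM = 64   # 64 MiB sent on each stream

def drain(conn):
    # Discard received data (stands in for the remote disk writer).
    while conn.recv(65536):
        pass
    conn.close()

def accept_streams(server):
    for _ in range(N_STREAMS):
        conn, _ = server.accept()
        threading.Thread(target=drain, args=(conn,), daemon=True).start()

def send_stream(port):
    payload = b"\0" * CHUNK
    with socket.create_connection(("127.0.0.1", port)) as s:
        for _ in range(CHUNKS_PER_STREAM):
            s.sendall(payload)

if __name__ == "__main__":
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(N_STREAMS)
    port = server.getsockname()[1]
    threading.Thread(target=accept_streams, args=(server,), daemon=True).start()

    start = time.time()
    senders = [threading.Thread(target=send_stream, args=(port,)) for _ in range(N_STREAMS)]
    for t in senders:
        t.start()
    for t in senders:
        t.join()
    elapsed = time.time() - start

    total_bits = N_STREAMS * CHUNKS_PER_STREAM * CHUNK * 8
    print(f"~{total_bits / elapsed / 1e6:.0f} Mb/s aggregate across {N_STREAMS} streams")
```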

66

Provide R&D consulting and knowledge base
• Providing consulting on the problems that data-intensive projects are having in effectively using the network is critical
• Using the knowledge gained from that problem solving to build a community knowledge base benefits everyone
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations

67

The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning (see the sketch after this list)
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
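As one concrete example of the "Host Tuning" material, the sketch below shows the kind of per-socket adjustment that class of guidance covers: requesting large TCP send/receive buffers so a single stream can keep a high bandwidth-delay-product path full. The 32 MiB value is an illustrative assumption; fasterdata.es.net gives the actual recommended system and application settings.

```python
# Minimal sketch of application-level host tuning: ask for large socket buffers
# so TCP can keep a long, fast path full. The 32 MiB value is an illustrative
# assumption; see fasterdata.es.net for recommended system-wide settings.
import socket

REQUESTED_BUF = 32 * 1024 * 1024   # 32 MiB, illustrative only

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, REQUESTED_BUF)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, REQUESTED_BUF)

# The kernel may clamp the request to its configured maximums (e.g. the Linux
# net.core.rmem_max / wmem_max limits), so report what was actually granted.
granted_snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
granted_rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(f"send buffer granted: {granted_snd} bytes, receive buffer granted: {granted_rcv} bytes")
sock.close()
```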

68

The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

69

Infrastructure Critical to Science
• The combination of
– new network architectures in the wide area
– new network services (such as guaranteed-bandwidth virtual circuits)
– cross-domain network error detection and correction
– redesigning the site LAN to handle high data throughput
– automation of data movement systems
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKA – the similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instruments take data continuously
• The data is generated / sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKA – the lessons
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to keep the working data set in one location
– A deep archive (tape-only) copy is probably practical in one location (e.g., the SKA supercomputer center), and this is done at CERN for the LHC
– The technical burden of building and operating a centralized working data repository:
• a large mass storage system with very large cache disks in order to satisfy current requests in an acceptable time
• high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
militates against a single large data center

72

LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites

73

LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply. There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
• It might be that in the case of the SKA the T1 links would come to a centralized, distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue (see the back-of-the-envelope rates worked below)
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the U.S., for example, are implementing LHCONE
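A quick, hedged arithmetic check shows why this choice matters: a sustained 100 Gb/s flow is roughly a petabyte per day, and a central distribution-only node must both absorb that and re-export it to the Tier 1 centers. The sustained rate comes from the slide above; the Tier 1 count is an illustrative assumption, not an SKA design figure.

```python
# Back-of-the-envelope rates for a central, distribution-only node. The sustained
# 100 Gb/s figure comes from the slide above; the Tier 1 count is an assumption.
INGEST_GBPS = 100          # sustained flow from the telescope site (from the slide)
N_TIER1 = 10               # assumed number of Tier 1 data centers
SECONDS_PER_DAY = 86_400

ingest_tb_per_day = INGEST_GBPS / 8 * SECONDS_PER_DAY / 1_000   # TB/day (decimal units)
egress_gbps = INGEST_GBPS                                       # all data is sent back out
per_t1_gbps = egress_gbps / N_TIER1                             # if the load divides evenly

print(f"ingest:  {ingest_tb_per_day:,.0f} TB/day (~{ingest_tb_per_day/1000:.1f} PB/day)")
print(f"egress:  {egress_gbps} Gb/s total, i.e. ~{per_t1_gbps:.0f} Gb/s per Tier 1 on average")
print(f"node must sustain ~{INGEST_GBPS + egress_gbps} Gb/s of combined WAN I/O")
```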

74

LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (a worked throughput estimate follows this list). New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical: the Science DMZ
Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
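The "fragile workhorse" point can be made quantitative with the well-known Mathis et al. estimate for loss-limited TCP throughput, rate ≈ (MSS/RTT) · (C/√p) with C ≈ 1.22. The MSS, RTT, and loss-rate values below are illustrative assumptions, not measurements of any particular path.

```python
# Loss-limited TCP throughput estimate (Mathis et al., 1997):
#   rate ≈ (MSS / RTT) * (C / sqrt(p)),  C ≈ 1.22 for periodic loss.
# MSS, RTT, and loss rates below are illustrative assumptions only.
from math import sqrt

MSS_BITS = 1460 * 8        # typical Ethernet-path MSS, in bits
RTT_S = 0.100              # 100 ms round-trip time (an intercontinental-scale path)
C = 1.22

for loss in (1e-7, 1e-5, 1e-3):
    rate_bps = (MSS_BITS / RTT_S) * (C / sqrt(loss))
    print(f"loss rate {loss:.0e}: single-stream ceiling ~{rate_bps / 1e6:,.0f} Mb/s")
```

Even a 10⁻⁷ loss rate caps a single stream at a few hundred Mb/s on a 100 ms path, which is why the slides insist on error-free paths, constant monitoring, and tools that use parallel streams or dedicated circuits for long-haul transfers.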

75

The Message
Again … a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

76

References

[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References

[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, Robertson, D., Thompson, M., Lee, J., Tierney, B., and Johnston, W., Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1–5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References

[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net/

http://psps.perfsonar.net/

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., and Laperle, C., Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 65: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

65

8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the

network to support data-intensive sciencendash With each generation of network transport technology

bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and

application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo

end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of

the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities

includebull experiments in the 1990s in using parallel disk IO and parallel network IO

together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths

bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ

Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using

synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on

75

The MessageAgain hellipA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data

management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

76

References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston

Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz

[fasterdata] See httpfasterdataesnetfasterdataperfSONAR

[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more

[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory

[LHCONE] httplhconenet

[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo

[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

[OIF1] OIF-FD-100G-DWDM-010 - 100G Ultra Long Haul DWDM Framework Document (June 2009) httpwwwoiforumcompublicdocumentsOIF-FD-100G-DWDM-010pdf

77

References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

78

References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations

httpwwwperfsonarnet

httppspsperfsonarnet

[REQ] httpswwwesnetaboutscience-requirements

[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010

(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )

[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223

[Tracy1] httpwwwnanogorgmeetingsnanog55presentationsTuesdayTracypdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 66: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

66

Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects

are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build

a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at

httpfasterdataesnet and contains contributions from several organizations

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ

Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using

synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on

75

The MessageAgain hellipA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data

management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

76

References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston

Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz

[fasterdata] See httpfasterdataesnetfasterdataperfSONAR

[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more

[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory

[LHCONE] httplhconenet

[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo

[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

[OIF1] OIF-FD-100G-DWDM-010 - 100G Ultra Long Haul DWDM Framework Document (June 2009) httpwwwoiforumcompublicdocumentsOIF-FD-100G-DWDM-010pdf

77

References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

78

References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations

httpwwwperfsonarnet

httppspsperfsonarnet

[REQ] httpswwwesnetaboutscience-requirements

[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010

(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )

[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223

[Tracy1] httpwwwnanogorgmeetingsnanog55presentationsTuesdayTracypdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 67: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

67

The knowledge base httpfasterdataesnet topics

ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on

bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained

bull fasterdataesnet is a community project with contributions from several organizations

68

The MessageA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can

be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ

Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using

synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on

75

The MessageAgain hellipA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data

management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

76

References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston

Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz

[fasterdata] See httpfasterdataesnetfasterdataperfSONAR

[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more

[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory

[LHCONE] httplhconenet

[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo

[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

[OIF1] OIF-FD-100G-DWDM-010 - 100G Ultra Long Haul DWDM Framework Document (June 2009) httpwwwoiforumcompublicdocumentsOIF-FD-100G-DWDM-010pdf

77

References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

78

References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations

Page 68: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

68

The Message

A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies, and much of the knowledge, from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

69

Infrastructure Critical to Science

• The combination of
  – new network architectures in the wide area
  – new network services (such as guaranteed-bandwidth virtual circuits)
  – cross-domain network error detection and correction
  – redesigning the site LAN to handle high data throughput
  – automation of data movement systems
  – use of appropriate operating system tuning and data transfer tools (see the buffer-sizing sketch below)

  now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component of scientific discovery at the LHC.

• Other disciplines that involve data-intensive science will face most of these same issues.
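The "operating system tuning" item above is, at its core, about making the host TCP buffers at least as large as the bandwidth-delay product of the path. A minimal sketch of that calculation, in Python; the 10 Gb/s rate and 90 ms round-trip time are illustrative values chosen for the example, not figures from the slides:

# Bandwidth-delay product (BDP): the minimum TCP buffer needed to keep a
# long, fast path full. Values are illustrative only.

def bdp_bytes(bandwidth_gbps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product in bytes for a given rate and round-trip time."""
    return (bandwidth_gbps * 1e9 / 8.0) * (rtt_ms / 1e3)

# Example: a 10 Gb/s path with a 90 ms RTT (roughly trans-Atlantic)
buf = bdp_bytes(10, 90)
print(f"required TCP buffer ~ {buf / 2**20:.0f} MiB")  # ~107 MiB

Untuned hosts commonly default to buffers of only a few MiB, which is why the host-tuning guidance referenced earlier (e.g. the fasterdata.es.net material) matters for wide-area transfers.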

70

LHC lessons of possible use to the SKA – The similarities

• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.

• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time because the instruments take data continuously.

• The data is generated at, or sent to, a single location and then distributed to science groups.

• The data usage model for the SKA may be similar to the LHC's: a large data set that many different science groups access in different ways to accomplish their science.

71

LHC lessons of possible use to the SKA – The lessons

The science data product (the output of the supercomputer center, in the SKA case) is likely too large to keep the working data set in one location.

– A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.

– The technical aspects of building and operating a centralized working data repository:
  • a large mass storage system with very large cache disks, in order to satisfy current requests in an acceptable time
  • high-speed WAN connections to accept all data from the telescope site and then send all of that data back out to science sites
  militate against a single large data center (see the rough sizing sketch below).
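To give a sense of the scale argument above, here is a back-of-the-envelope sketch of what a single working data center would have to absorb and re-export. The 100 Gb/s ingest rate is the figure used two slides ahead; the cache-residency period and the number of downstream science sites are assumptions chosen purely for illustration:

# Rough sizing of a hypothetical single working data repository.
# Assumptions (illustrative only): 100 Gb/s sustained ingest from the
# telescope site, 30 days of data kept on cache disk, and 10 science
# sites each pulling a full copy.

INGEST_GBPS = 100          # sustained ingest rate, Gb/s (from the slides)
CACHE_DAYS = 30            # assumed cache residency
N_SITES = 10               # assumed number of downstream science sites

bytes_per_day = INGEST_GBPS * 1e9 / 8 * 86400
cache_bytes = bytes_per_day * CACHE_DAYS
egress_gbps = INGEST_GBPS * N_SITES   # if every site takes a full copy

print(f"ingest volume : {bytes_per_day / 1e15:.1f} PB/day")
print(f"disk cache    : {cache_bytes / 1e15:.0f} PB for {CACHE_DAYS} days")
print(f"egress needed : {egress_gbps} Gb/s if all {N_SITES} sites take full copies")

Roughly 1 PB/day in, tens of PB of cache, and an egress capacity many times the ingest rate: these are the demands that argue for spreading the working data across multiple centers.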

72

LHC lessons of possible use to the SKA

The LHC model of distributed data (multiple regional centers) has worked well:

– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.

73

LHC lessons of possible use to the SKA

Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply. There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).

• It might be that, in the case of the SKA, the T1 links would come to a centralized, data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue (see the sketch after this slide).

• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).

If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.

– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks, in the US for example, are implementing LHCONE.
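The two options mentioned above (one aggregate link into a central distribution node vs. per-Tier 1 virtual circuits) differ mainly in where capacity must be provisioned. A minimal sketch, assuming the 100 Gb/s rate from the slide; the per-T1 shares and site names are invented for the example:

# Comparing the two provisioning options for getting the 100 Gb/s telescope
# flow to the Tier 1 centers. The per-T1 fractions are illustrative only.

TOTAL_GBPS = 100
t1_share = {"T1-A": 0.30, "T1-B": 0.25, "T1-C": 0.25, "T1-D": 0.20}

# Option 1: everything lands at a single distribution node, which then fans
# the data out -- the node needs 100 Gb/s in and ~100 Gb/s back out.
fanout = {site: TOTAL_GBPS * frac for site, frac in t1_share.items()}
print("distribution-node egress:", fanout, "-> total", sum(fanout.values()), "Gb/s")

# Option 2: a guaranteed-bandwidth virtual circuit from the SKA site to each
# T1, each sized to that T1's share -- no single node carries the aggregate twice.
for site, frac in t1_share.items():
    print(f"circuit SKA -> {site}: {TOTAL_GBPS * frac:.0f} Gb/s")

Either way the aggregate capacity is the same; the cost and engineering question is whether it is cheaper to concentrate it at one well-connected hub or to provision and operate several long-haul circuits.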

74

LHC lessons of possible use to the SKA

All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation among the R&E networks providing parts of the path, etc. (the loss-rate sketch after this slide shows why). New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.

Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.

Workflow management systems that automate the data movement will have to be designed and tested.

– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
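The "fragile workhorse" point can be made quantitative with the well-known Mathis et al. rule of thumb that sustained TCP throughput is roughly bounded by MSS / (RTT * sqrt(loss)). A small sketch with illustrative path parameters; the specific RTTs and loss rates below are chosen for the example, not taken from the slides:

from math import sqrt

# Mathis et al. rule of thumb: sustained TCP throughput <= MSS / (RTT * sqrt(p)),
# where p is the packet loss probability. Values below are illustrative.

MSS_BYTES = 1460  # typical Ethernet maximum segment size

def tcp_throughput_gbps(rtt_ms: float, loss_rate: float) -> float:
    bytes_per_sec = MSS_BYTES / ((rtt_ms / 1e3) * sqrt(loss_rate))
    return bytes_per_sec * 8 / 1e9

# A metro path vs. an intercontinental path, at two tiny loss rates:
for rtt in (2, 10, 100):          # ms
    for p in (1e-7, 1e-5):        # packet loss probability
        print(f"RTT {rtt:>3} ms, loss {p:.0e}: "
              f"<= {tcp_throughput_gbps(rtt, p):6.2f} Gb/s")

The same loss rate that is invisible on a 2 ms metro path reduces a 100 ms intercontinental path to a small fraction of a Gb/s, which is why long paths must be kept essentially error-free and continuously monitored.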

75

The Message (again)

A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies, and much of the knowledge, from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...

76

References

[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0 – 100G Ultra Long Haul DWDM Framework Document (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

77

References (2)

[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1–5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

78

References (3)

[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/

http://www.perfsonar.net

http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 69: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

69

Infrastructure Critical to Sciencebull The combination of

ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools

now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC

bull Other disciplines that involve data-intensive science will face most of these same issues

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ

Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using

synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on

75

The MessageAgain hellipA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data

management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

76

References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston

Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz

[fasterdata] See httpfasterdataesnetfasterdataperfSONAR

[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more

[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory

[LHCONE] httplhconenet

[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo

[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

[OIF1] OIF-FD-100G-DWDM-010 - 100G Ultra Long Haul DWDM Framework Document (June 2009) httpwwwoiforumcompublicdocumentsOIF-FD-100G-DWDM-010pdf

77

References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

78

References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations

httpwwwperfsonarnet

httppspsperfsonarnet

[REQ] httpswwwesnetaboutscience-requirements

[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010

(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )

[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223

[Tracy1] httpwwwnanogorgmeetingsnanog55presentationsTuesdayTracypdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 70: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

70

LHC lessons of possible use to the SKAThe similarities

bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries

bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously

bull The data is generatedsent to a single location and then distributed to science groups

bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ

Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using

synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on

75

The MessageAgain hellipA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data

management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

76

References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston

Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz

[fasterdata] See httpfasterdataesnetfasterdataperfSONAR

[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more

[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory

[LHCONE] httplhconenet

[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo

[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

[OIF1] OIF-FD-100G-DWDM-010 - 100G Ultra Long Haul DWDM Framework Document (June 2009) httpwwwoiforumcompublicdocumentsOIF-FD-100G-DWDM-010pdf

77

References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

78

References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations

httpwwwperfsonarnet

httppspsperfsonarnet

[REQ] httpswwwesnetaboutscience-requirements

[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010

(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )

[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223

[Tracy1] httpwwwnanogorgmeetingsnanog55presentationsTuesdayTracypdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 71: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

71

LHC lessons of possible use to the SKAThe lessons

The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one

location (eg the SKA supercomputer center) and this is done at CERN for the LHC

ndash The technical aspects of building and operating a centralized working data repository

bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time

bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites

mitigates against a single large data center

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ

Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using

synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on

75

The MessageAgain hellipA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data

management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

76

References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston

Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz

[fasterdata] See httpfasterdataesnetfasterdataperfSONAR

[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more

[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory

[LHCONE] httplhconenet

[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo

[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

[OIF1] OIF-FD-100G-DWDM-010 - 100G Ultra Long Haul DWDM Framework Document (June 2009) httpwwwoiforumcompublicdocumentsOIF-FD-100G-DWDM-010pdf

77

References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

78

References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations

httpwwwperfsonarnet

httppspsperfsonarnet

[REQ] httpswwwesnetaboutscience-requirements

[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010

(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )

[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223

[Tracy1] httpwwwnanogorgmeetingsnanog55presentationsTuesdayTracypdf

  • Foundations of data-intensive science Technology and practice
  • Data-Intensive Science in DOErsquos Office of Science
  • DOE Office of Science and ESnet ndash the ESnet Mission
  • HEP as a Prototype for Data-Intensive Science
  • HEP as a Prototype for Data-Intensive Science (2)
  • HEP as a Prototype for Data-Intensive Science (3)
  • HEP as a Prototype for Data-Intensive Science (4)
  • HEP as a Prototype for Data-Intensive Science (5)
  • The LHC data management model involves a world-wide collection
  • Scale of ATLAS analysis driven data movement
  • HEP as a Prototype for Data-Intensive Science (6)
  • HEP as a Prototype for Data-Intensive Science (7)
  • SKA data flow model is similar to the LHC
  • Foundations of data-intensive science
  • 1) Underlying network issues
  • We face a continuous growth of data transport
  • 1a) Optical Network Technology
  • Optical Network Technology
  • 1b) Network routers and switches
  • The Energy Sciences Network ESnet5 (Fall 2013)
  • 2) Data transport The limitations of TCP must be addressed for
  • Transport
  • Transport Impact of packet loss on TCP
  • Transport Modern TCP stack
  • Transport Modern TCP stack (2)
  • 3) Monitoring and testing
  • perfSONAR
  • perfSONAR (2)
  • 4) System software evolution and optimization
  • 41) System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • System software tuning Host tuning ndash TCP
  • 42) System software tuning Data transfer tools
  • System software tuning Data transfer tools
  • System software tuning Data transfer tools (2)
  • System software tuning Data transfer tools (3)
  • 44) System software tuning Other issues
  • 5) Site infrastructure to support data-intensive science The Sc
  • The Science DMZ
  • The Science DMZ (2)
  • The Science DMZ (3)
  • 6) Data movement and management techniques
  • Highly distributed and highly automated workflow systems
  • Slide 44
  • Scale of ATLAS analysis driven data movement (2)
  • Building an LHC-scale production analysis system
  • Ramp-up of LHC traffic in ESnet
  • 6 cont) Evolution of network architectures
  • The LHC OPN ndash Optical Private Network
  • The LHC OPN ndash Optical Private Network (2)
  • The LHC OPN ndash Optical Private Network (3)
  • Managing large-scale science traffic in a shared infrastructure
  • The LHCrsquos Open Network Environment ndash LHCONE
  • Slide 54
  • The LHCrsquos Open Network Environment ndash LHCONE (2)
  • LHCONE is one part of the network infrastructure that supports
  • 7) New network services
  • Point-to-Point Virtual Circuit Service
  • Point-to-Point Virtual Circuit Service (2)
  • End User View of Circuits ndash How They Use Them
  • End User View of Circuits ndash How They Use Them (2)
  • Cross-Domain Virtual Circuit Service
  • Inter-Domain Control Protocol
  • Point-to-Point Virtual Circuit Service (3)
  • 8) Provide RampD consulting and knowledge base
  • Provide RampD consulting and knowledge base
  • The knowledge base
  • The Message
  • Infrastructure Critical to Science
  • LHC lessons of possible use to the SKA
  • LHC lessons of possible use to the SKA (2)
  • LHC lessons of possible use to the SKA (3)
  • LHC lessons of possible use to the SKA (4)
  • LHC lessons of possible use to the SKA (5)
  • The Message (2)
  • References
  • References (2)
  • References (3)
Page 72: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

72

LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional

centers) has worked wellndash It decentralizes costs and involves many countries directly in the

telescope infrastructurendash It divides up the network load especially on the expensive trans-

ocean linksndash It divides up the cache IO load across distributed sites

73

LHC lessons of possible use to the SKARegardless of distributed vs centralized working data

repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source

to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a

centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue

bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)

If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC

ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE

74

LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo

observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could

address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration

Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ

Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using

synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on

75

The MessageAgain hellipA significant collection of issues must all be

addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data

management can be done on a routine basis

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip

76

References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston

Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz

[fasterdata] See httpfasterdataesnetfasterdataperfSONAR

[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more

[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory

[LHCONE] httplhconenet

[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo

[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

[OIF1] OIF-FD-100G-DWDM-010 - 100G Ultra Long Haul DWDM Framework Document (June 2009) httpwwwoiforumcompublicdocumentsOIF-FD-100G-DWDM-010pdf

77

References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations

ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations

78

References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations

httpwwwperfsonarnet

httppspsperfsonarnet

[REQ] httpswwwesnetaboutscience-requirements

[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010

(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )

[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223

[Tracy1] httpwwwnanogorgmeetingsnanog55presentationsTuesdayTracypdf

Page 73: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

73

LHC lessons of possible use to the SKA

Regardless of whether the working data repository is distributed or centralized, all of the attendant network lessons will apply.

There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that, in the case of the SKA, the T1 links would come to a centralized, data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue (see the sketch below for rough numbers).
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).

If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact, it might well be that the SKA could use the LHCONE infrastructure itself – that is the specific intent of how the R&E networks in the US, e.g., are implementing LHCONE.
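To put rough numbers on the cost and engineering trade-off above, here is a minimal back-of-the-envelope sketch. Only the 100 Gb/s aggregate rate comes from the slide; the number of Tier 1 centers and the duty cycle are illustrative assumptions, not SKA design parameters.

# Back-of-the-envelope sizing for the two distribution options discussed above.
# Only the 100 Gb/s aggregate rate is taken from the slide; the Tier 1 count
# and duty cycle are illustrative assumptions, not SKA design figures.

AGGREGATE_GBPS = 100     # flow leaving the telescope site (slide figure)
N_TIER1 = 8              # assumed number of Tier 1 data centers
DUTY_CYCLE = 0.8         # assumed fraction of time the flow is running

bits_per_day = AGGREGATE_GBPS * 1e9 * 86400 * DUTY_CYCLE
pb_per_day = bits_per_day / 8 / 1e15
print(f"Volume leaving the telescope site: ~{pb_per_day:.1f} PB/day")

# Option 1: one 100 Gb/s path to a central, distribution-only node that
#           fans the data out to the Tier 1 centers.
# Option 2: a dedicated virtual circuit from the SKA site to each Tier 1.
per_t1_gbps = AGGREGATE_GBPS / N_TIER1
print(f"Even split across {N_TIER1} Tier 1s: ~{per_t1_gbps:.1f} Gb/s per circuit "
      f"(plus headroom for retransmission and catch-up)")

Either option must carry the same aggregate volume; the choice mainly determines where the fan-out, and the associated routing, monitoring, and operational responsibility, sits.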

Page 74: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

74

LHC lessons of possible use to the SKA

All of the lessons from the observation that 'TCP is a "fragile workhorse"' must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring and close cooperation among the R&E networks involved in providing parts of the path, etc. New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
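To see why "error-free" is not optional on long-RTT paths, the following is a minimal sketch using the well-known Mathis et al. loss/RTT throughput bound for a single standard (Reno-style) TCP stream; the MSS, RTT, and loss-rate values are illustrative assumptions rather than figures from the slides.

# Illustration of the TCP "fragile workhorse" point: achievable single-stream
# throughput collapses as RTT and packet loss grow. Uses the Mathis et al.
# bound for Reno-style TCP:  rate <~ (MSS / RTT) * 1.22 / sqrt(loss).
# All parameter values below are illustrative assumptions.

import math

def mathis_gbps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate upper bound on single-stream Reno TCP throughput, in Gb/s."""
    return (mss_bytes * 8 / rtt_s) * (1.22 / math.sqrt(loss_rate)) / 1e9

MSS = 8948  # assumed jumbo-frame MTU (9000 B) minus IP/TCP headers

for rtt_ms in (10, 90):            # e.g. a metro path vs. a transatlantic path
    for loss in (1e-3, 1e-5, 1e-7):
        print(f"RTT {rtt_ms:3d} ms, loss {loss:.0e}: "
              f"<= {mathis_gbps(MSS, rtt_ms / 1e3, loss):8.2f} Gb/s")

At the same loss rate the 90 ms path gets roughly one ninth of the throughput of the 10 ms path, which is exactly why long international paths have to be kept clean and constantly monitored.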

Re-engineering the site LAN-WAN architecture is critical: the Science DMZ.

Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.

Page 75: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

75

The Message

Again … a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …

Page 76: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

76

References

[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach", Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer", Brian Tierney and Joe Metzger, ESnet, Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document".

[NetServ] "Network Services for High Performance Distributed Computing and Data Management", W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

[OIF1] OIF-FD-100G-DWDM-01.0, "100G Ultra Long Haul DWDM Framework Document" (June 2009). http://www.oiforum.com/public/documents/OIF-FD-100G-DWDM-01.0.pdf

Page 77: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

77

References

[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System", Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management", W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service", William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

Page 78: American Astronomical Society Topical Conference  Series: Exascale  Radio  Astronomy Monterey, CA March 30 – April 4, 2014

78

References

[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework", B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing", K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf, and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223

[Tracy1] http://www.nanog.org/meetings/nanog55/presentations/Tuesday/Tracy.pdf
