Foundations of data-intensive science: Technology and practice for high throughput, widely distributed, data management and analysis systems

American Astronomical Society Topical Conference Series: Exascale Radio Astronomy
Monterey, CA, March 30 – April 4, 2014

W. Johnston, E. Dart, M. Ernst*, and B. Tierney
ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA
*Brookhaven National Laboratory, Upton, New York, USA
2
Data-Intensive Science in DOE's Office of Science
The US Department of Energy's Office of Science ("SC") supports about half of all civilian R&D in the US, with about $5B/year in funding (the National Science Foundation (NSF) funds the other half)
– Funds some 25,000 PhDs and PostDocs in the university environment
– Operates ten National Laboratories and dozens of major scientific user facilities – synchrotron light sources, neutron sources, particle accelerators, electron and atomic force microscopes, supercomputer centers, etc. – that are all available to the US and global science research community, many of which generate massive amounts of data and involve large, distributed collaborations
– Supports global large-scale science collaborations such as the LHC at CERN and the ITER fusion experiment in France
– www.science.doe.gov
3
DOE Office of Science and ESnet – the ESnet Mission
ESnet – the Energy Sciences Network – is an SC program whose primary mission is to enable the large-scale science of the Office of Science, which depends on:
– Multi-institution, world-wide collaboration
– Data mobility: sharing of massive amounts of data
– Distributed data management and processing
– Distributed simulation, visualization, and computational steering
– Collaboration with the US and international Research and Education (R&E) community
"Enabling large-scale science" means ensuring that the network can be used effectively to provide all mission-required access to data and computing
• ESnet connects the Office of Science National Laboratories and user facilities to each other and to collaborators worldwide
– Ames, Argonne, Brookhaven, Fermilab, Lawrence Berkeley, Oak Ridge, Pacific Northwest, Princeton Plasma Physics, SLAC, and Thomas Jefferson National Accelerator Facility, plus embedded and detached user facilities
4
HEP as a Prototype for Data-Intensive Science
The history of high energy physics (HEP) data management and analysis anticipates many other science disciplines
• Each new generation of experimental science requires more complex instruments to ferret out more and more subtle aspects of the science
• As the sophistication, size, and cost of the instruments increase, the number of such instruments becomes smaller, and the collaborations become larger and more widely distributed – and mostly international
– These new instruments are based on increasingly sophisticated sensors, which are now largely solid-state devices akin to CCDs
• In many ways the solid-state sensors follow Moore's law just as computer CPUs do: the number of transistors per unit area of silicon doubles every ~18 months, and therefore the amount of data coming out per unit area doubles as well
– The data output of these increasingly sophisticated sensors has increased exponentially
• Large scientific instruments differ from CPUs in that the time between instrument refreshes is more like 10–20 years, so the increase in data volume from one instrument generation to the next is huge
5
HEP as a Prototype for Data-Intensive Science
[Figure: HEP data volumes for the leading experiments, with Belle-II estimates; annotation: "LHC down for upgrade"]
Data courtesy of Harvey Newman, Caltech, and Richard Mount, SLAC, and the Belle II CHEP 2012 presentation
6
HEP as a Prototype for Data-Intensive Science
• What is the significance to the network of this increase in data?
• Historically, the use of the network by science has tracked the size of the data sets used by science
["HEP data collected" 2012 estimate – the green line in the previous slide]
7
HEP as a Prototype for Data-Intensive Science
As the instrument size and data volume have gone up, the methodology for analyzing the data has had to evolve
– The data volumes from the early experiments were low enough that the data was analyzed locally
– As the collaborations grew to several institutions and the data analysis was shared among them, the data was distributed by shipping tapes around
– As the collaboration sizes grew and became intercontinental, the HEP community began to use networks to coordinate the collaborations and eventually to send the data around
• The LHC data model assumed network transport of all data from the beginning (as opposed to shipping media)
• Similar changes are occurring in most science disciplines
8
HEP as a Prototype for Data-Intensive Science
• Two major proton experiments (detectors) at the LHC: ATLAS and CMS
• ATLAS is designed to observe a billion (1x10^9) collisions/sec, with a data rate out of the detector of more than 1,000,000 Gigabytes/sec (1 PBy/s)
• A set of hardware and software filters at the detector reduces the output data rate to about 25 Gb/s, which must be transported, managed, and analyzed to extract the science (see the quick calculation below)
– The output data rate for CMS is about the same, for a combined 50 Gb/s that is distributed to physics groups around the world, 7x24, ~9 months/yr
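To put the trigger-level filtering in perspective, a rough back-of-the-envelope calculation (mine, not from the slides) of the reduction factor from raw detector output to the exported data stream:

```python
# Reduction factor from raw detector output to the filtered data stream
detector_rate_bps = 1e15 * 8   # ~1 PBy/s out of the detector, expressed in bits/s
filtered_rate_bps = 25e9       # ~25 Gb/s after the hardware/software trigger filters
print(f"reduction factor: ~{detector_rate_bps / filtered_rate_bps:,.0f}x")  # ~320,000x
```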
The LHC data management model involves a world-wide collection of centers that store, manage, and analyze the data

[Figure: A network-centric view of the LHC (one of two detectors). The detector output (1 PB/s, i.e. 8 Pb/s) passes through the Level 1 and 2 triggers at O(1-10) meters and the Level 3 trigger at O(10-100) meters, then to the CERN Computer Center (LHC Tier 0) at O(1) km. From Tier 0, 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) flows over the LHC Optical Private Network (LHCOPN) to the LHC Tier 1 data centers – Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN – at distances of 500–10,000 km. The Tier 1 centers hold the working data (WLCG 2012: 115 PBy tape, 60 PBy disk, 68,000 cores). The LHC Open Network Environment (LHCONE) connects the Tier 1 centers to the LHC Tier 2 analysis centers and university physics groups (WLCG 2012: ~120 PBy disk, ~175,000 cores); the Tier 2 centers are data caches and analysis sites, with roughly 3X data outflow vs. inflow. The diagram is intended to indicate that the physics groups now get their data wherever it is most readily available.]
10
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately here)
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC

[Figure: accumulated data volume on disk (0–150 petabytes over four years, growing at 730 TBytes/day) and the counts of the two PanDA job types (each ranging up to ~50,000–100,000 jobs over one year)]
11
HEP as a Prototype for Data-Intensive Science
The capabilities required to support this scale of data movement involve hardware and software developments at all levels:
1. The underlying network
  1a. Optical signal transport
  1b. Network routers and switches
2. Data transport (TCP is a "fragile workhorse" but still the norm)
3. Network monitoring and testing
4. Operating system evolution
5. New site and network architectures
6. Data movement and management techniques and software
7. New network services
• Technology advances in these areas have resulted in today's state-of-the-art that makes it possible for the LHC experiments to routinely and continuously move data at ~150 Gb/s across three continents
12
HEP as a Prototype for Data-Intensive Science
• ESnet has been collecting requirements for all DOE science disciplines and instruments that rely on the network for distributed data management and analysis for more than a decade, and formally since 2007 [REQ]
• In this process certain issues are seen across essentially all science disciplines that rely on the network for significant data transfer, even if the quantities are modest compared to projects like the LHC experiments
• Therefore, addressing the LHC issues is a useful exercise that can benefit a wide range of science disciplines
SKA data flow model is similar to the LHC

[Figure (hypothetical, based on the LHC experience): receptors/sensors, ~200 km avg from the correlator/data processor, generate 93–168 Pb/s; the correlator/data processor sends 400 Tb/s over ~1000 km to the supercomputer; the supercomputer sends 100 Gb/s (from the SKA RFI) over ~25,000 km (Perth to London via USA) or ~13,000 km (South Africa to London) to a European distribution point, which feeds regional data centers and university astronomy groups. These numbers are based on modeling done prior to splitting the SKA between S. Africa and Australia.]
14
Foundations of data-intensive science
• This talk looks briefly at the nature of the advances in technologies, software, and methodologies that have enabled LHC data management and analysis
– The points 1a and 1b on optical transport and router technology are included in the slides for completeness, but I will not talk about them. They were not really driven by the needs of the LHC, but they were opportunistically used by the LHC
• Much of the remainder of the talk is a tour through ESnet's network performance knowledge base (fasterdata.es.net)
– Also included are:
• the LHC ATLAS data management and analysis approach, which generates and relies on very large network data utilization
• and an overview of how R&E networks have evolved to accommodate the LHC traffic
1) Underlying network issues
At the core of our ability to transport the volume of data that we must deal with today, and to accommodate future growth, are advances in optical transport technology and router technology

We face a continuous growth of data to transport

[Figure: ESnet accepted traffic, petabytes/month, over 13 years]

ESnet has seen exponential growth in our traffic every year since 1990 (our traffic grows by a factor of 10 about once every 47 months)
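As a quick check of what that growth rate implies (my arithmetic, not from the slides), a factor of 10 every 47 months works out to roughly 1.8x per year:

```python
# Implied growth rate from "traffic grows 10x about once every 47 months"
monthly_factor = 10 ** (1 / 47)        # growth per month
annual_factor = monthly_factor ** 12   # growth per year
print(f"monthly: {monthly_factor:.4f}x, annual: {annual_factor:.2f}x")
# -> monthly: 1.0502x, annual: 1.80x
```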
16
We face a continuous growth of data transport
• The LHC data volume is predicted to grow 10-fold over the next 10 years
• New generations of instruments – for example the Square Kilometre Array radio telescope and ITER (the international fusion experiment) – will generate more data than the LHC
• In response, ESnet and most large R&E networks have built 100 Gb/s (per optical channel) networks
– ESnet's new network – ESnet5 – is complete and provides 44 x 100 Gb/s (4.4 terabits/sec – 4400 gigabits/sec) in optical channels across the entire ESnet national footprint
– Initially, one of these 100 Gb/s channels is configured to replace the current 4 x 10 Gb/s IP network
• What has made this possible?
17
1a) Optical Network Technology
Modern optical transport systems (DWDM = dense wave division multiplexing) use a collection of technologies called "coherent optical" processing to achieve more sophisticated optical modulation, and therefore higher data density per signal transport unit (symbol), which provides 100 Gb/s per wave (optical channel)
– Optical transport using dual polarization-quadrature phase shift keying (DP-QPSK) technology with coherent detection [OIF1]
• dual polarization
– two independent optical signals on the same frequency, with orthogonal polarizations → reduces the symbol rate by half
• quadrature phase shift keying
– encodes data by changing the phase of the signal relative to the optical carrier → further reduces the symbol rate by half (sends twice as much data per symbol)
• Together, DP and QPSK reduce the required symbol rate by a factor of 4 (see the worked example below)
– allows a 100G payload (plus overhead) to fit into 50 GHz of spectrum
• The actual transmission rate is about 10% higher to include FEC data
– This is a substantial simplification of the optical technology involved – see the TNC 2013 paper and Chris Tracy's NANOG talk for details [Tracy1] and [Rob1]
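A rough worked example of the symbol-rate arithmetic (my numbers, not from the slides), showing why DP-QPSK lets a 100G channel fit in a 50 GHz spectral slot:

```python
# Rough symbol-rate arithmetic for 100G DP-QPSK (illustrative, approximate numbers)
payload_gbps = 100
line_rate_gbps = payload_gbps * 1.10   # ~10% extra for FEC and framing overhead
bits_per_symbol = 2 * 2                # QPSK: 2 bits/symbol, times 2 polarizations
symbol_rate_gbaud = line_rate_gbps / bits_per_symbol
print(f"~{symbol_rate_gbaud:.1f} Gbaud")  # ~27.5 Gbaud, well within a 50 GHz channel
```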
Optical Network Technology
ESnet5's optical transport uses coherent WaveLogic™ transponders to provide the 100 Gb/s waves
– 88 waves (optical channels), 100 Gb/s each
• wave capacity shared equally with Internet2
– ~13,000 miles / 21,000 km of lit fiber
– 280 optical amplifier sites
– 70 optical add/drop sites (where routers can be inserted)
• 46 100G add/drop transponders
• 22 100G re-gens across the wide area

[Figure: map of the ESnet5 optical footprint – SEAT, SUNN, SACR, LOSA, LASV, SALT, BOIS, DENV, KANS, PHOE, ELPA, ALBU, ATLA, NASH, CHAT, CLEV, EQCH, STAR, CHIC, STLO, LOUI, CINC, WASH, NEWY, BOST, NEWG, WSAC, PAIX, JACK, and the labs LBNL, JGI, NERSC, SLAC, SNLL, ANL, FNAL, BNL, ORNL – plus the Long Island MAN, the ANI Testbed, SC11, and the waves shared with Internet2. Geography is only representational.]
19
1b) Network routers and switches
ESnet5 routing (IP layer 3) is provided by Alcatel-Lucent 7750 routers with 100 Gb/s client interfaces
– 17 routers with 100G interfaces
• several more in a test environment
– 59 layer-3 100GigE interfaces; 8 customer-owned 100G routers
– 7 100G interconnects with other R&E networks at Starlight (Chicago), MAN LAN (New York), and Sunnyvale (San Francisco)
20
The Energy Sciences Network: ESnet5 (Fall 2013)

[Figure: map of ESnet5 showing metro area circuits and connected sites – PNNL, MIT/PSFC, AMES, LLNL, SNLL, GA, JGI, LBNL, SLAC, NERSC, ORNL, ANL, FNAL, INL, PU Physics, PPPL, BNL, JLAB, LANL, SNLA, SDSC, NREL, LIGO, SREL, and others – with ESnet routers, site routers, 100G / 10-40G / 1G links, site-provided circuits, optical-only segments, commercial peerings, US and international R&E peerings, and the 100G testbed connecting the SF Bay Area, Chicago, New York, and Amsterdam (SUNN, STAR, AOFA, AMST). Geographical representation is approximate.]
2) Data transport: the limitations of TCP must be addressed for large, long-distance flows
• Although there are other transport protocols available, TCP remains the workhorse of the Internet, including for data-intensive science
• Using TCP to support the sustained, long distance, high data-rate flows of data-intensive science requires an error-free network
• Why error-free? TCP is a "fragile workhorse": it is very sensitive to packet loss (due to bit errors)
– Very small packet loss rates on these paths result in large decreases in performance
– A single bit error will cause the loss of a 1–9 KBy packet (depending on the MTU size), as there is no FEC at the IP level for error correction
• This puts TCP back into "slow start" mode, thus reducing throughput
22
Transport
• The reason for TCP's sensitivity to packet loss is the slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internet
– Packet loss is seen by TCP's congestion control algorithms as evidence of congestion, so they activate to slow the senders down and to prevent the synchronization of the senders (which perpetuates and amplifies the congestion, leading to network throughput collapse)
– Network link errors also cause packet loss, so these congestion avoidance algorithms come into play, with dramatic effect on throughput in the wide area network – hence the need for "error-free"
23
Transport: Impact of packet loss on TCP
• On a 10 Gb/s LAN path the impact of low packet loss rates is minimal
• On a 10 Gb/s WAN path the impact of low packet loss rates is enormous (~80X throughput reduction on a transatlantic path)
• Implication: error-free paths are essential for high-volume, long-distance data transfers

[Figure: throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss, for Reno (measured), Reno (theory), and H-TCP (measured), compared with the no-packet-loss case; throughput (Mb/s, 0–10,000) falls off sharply as the network round trip time (ms) grows toward values corresponding roughly to San Francisco to London. See http://fasterdata.es.net/performance-testing/perfsonar/troubleshooting/packet-loss]
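The scale of the effect can be estimated with the standard Mathis et al. model of TCP throughput under loss (rate ≈ MSS/RTT × 1/√p for Reno-style congestion control). A small illustrative calculation, not taken from the slides; actual measured results depend on the TCP variant (Reno vs. H-TCP/CUBIC):

```python
from math import sqrt

def mathis_throughput_mbps(mss_bytes, rtt_ms, loss_rate):
    """Approximate TCP (Reno-style) throughput bound: MSS/RTT * 1/sqrt(p)."""
    rtt_s = rtt_ms / 1000.0
    return (mss_bytes * 8 / rtt_s) * (1 / sqrt(loss_rate)) / 1e6

loss = 0.0046 / 100          # 0.0046% packet loss, as in the figure
for rtt_ms in (1, 10, 88):   # local, metro, and ~transatlantic round trip times
    bound = mathis_throughput_mbps(1460, rtt_ms, loss)
    print(f"RTT {rtt_ms:3d} ms: ~{bound:7.0f} Mb/s")
# On the ~88 ms path the bound is only a few tens of Mb/s on a 10,000 Mb/s link,
# which is why even tiny loss rates are devastating for long-distance transfers.
```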
24
Transport: Modern TCP stack
• A modern TCP stack (the kernel implementation of the TCP protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])
– This is done using mechanisms that more quickly increase back to full speed after an error forces a reset to low bandwidth

[Figure: impact of the "Binary Increase Congestion" (BIC) control algorithm. Note that BIC reaches max throughput much faster than older algorithms. (From Linux 2.6.19 the default is CUBIC, a refined version of BIC designed for high bandwidth, long paths.)]
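On a Linux host the congestion control algorithm in use (and the set of available algorithms) can be read from the proc filesystem; a minimal read-only sketch:

```python
# Check the host's TCP congestion control algorithm on Linux (read-only).
# Changing it (e.g., to htcp, if built into the kernel) is done via sysctl and
# requires root privileges.
def read_proc(path):
    with open(path) as f:
        return f.read().strip()

current = read_proc("/proc/sys/net/ipv4/tcp_congestion_control")
available = read_proc("/proc/sys/net/ipv4/tcp_available_congestion_control")
print(f"current:   {current}")      # typically 'cubic' on modern kernels
print(f"available: {available}")
```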
25
Transport: Modern TCP stack
Even modern TCP stacks are only of some help in the face of packet loss on a long-path, high-speed network
• For a detailed analysis of the impact of packet loss on various TCP implementations, see "An Investigation into Transport Protocols and Data Transport Applications Over High Performance Networks," chapter 8 ("Systematic Tests of New-TCP Behaviour"), by Yee-Ting Li, University College London (PhD thesis): http://www.slac.stanford.edu/~ytl/thesis.pdf

[Figure: throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss (tail zoom, 0–1000 Mb/s), for Reno (measured), Reno (theory), and H-TCP (measured); round trip time in ms, corresponding roughly to San Francisco to London]
26
3) Monitoring and testing
The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously end-to-end, to detect soft errors and facilitate their isolation and correction
• perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.)
• perfSONAR is a community effort to:
– define network management data exchange protocols, and
– standardize measurement data formats, gathering, and archiving
• perfSONAR is deployed extensively throughout LHC-related networks and international networks and at the end sites (see [fasterdata], [perfSONAR], and [NetServ])
– There are now more than 1000 perfSONAR boxes installed in N. America and Europe
27
perfSONAR
The test and monitor functions can detect soft errors that limit throughput and can be hard to find (hard errors / faults are easily found and corrected)
Soft failure example:
• Observed end-to-end performance degradation due to soft failure of a single optical line card

[Figure: one month of throughput measurements (Gb/s) showing normal performance, then degrading performance, then recovery after the repair]

• Why not just rely on "SNMP" interface stats for this sort of error detection?
• not all error conditions show up in SNMP interface statistics
• SNMP error statistics can be very noisy
• some devices lump different error counters into the same bucket, so it can be very challenging to figure out what errors to alarm on and what errors to ignore
• though ESnet's Spectrum monitoring system attempts to apply heuristics to do this
• many routers will silently drop packets – the only way to find that is to test through them and observe loss using devices other than the culprit device
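A sketch of the kind of analysis a monitoring system can apply to regular throughput tests in order to flag this sort of soft failure (purely illustrative; real perfSONAR deployments use their own measurement archives and alarming tools):

```python
def flag_soft_failure(samples_gbps, baseline_gbps, threshold=0.5, run_length=3):
    """Flag a soft failure when measured throughput stays below a fraction of the
    historical baseline for several consecutive test runs (filters one-off dips)."""
    below = 0
    for i, value in enumerate(samples_gbps):
        below = below + 1 if value < threshold * baseline_gbps else 0
        if below >= run_length:
            return i  # index of the test run where the alarm would fire
    return None

# Hypothetical daily throughput tests on a 10 Gb/s path over part of a month
history = [9.2, 9.4, 9.1, 9.3, 4.8, 5.1, 1.9, 2.0, 2.1, 1.8, 9.2, 9.3]
alarm_at = flag_soft_failure(history, baseline_gbps=9.3)
print(f"soft-failure alarm at sample {alarm_at}" if alarm_at is not None else "path OK")
```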
28
perfSONAR
The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains
• It provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe
– ESnet has perfSONAR testers installed at every PoP and all but the smallest user sites – Internet2 is close to the same
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages
29
4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network
• Host TCP tuning
• Modern TCP stack (see above)
• Other issues (MTU, etc.)
• Data transfer tools and parallelism
• Other data transfer issues (firewalls, etc.)
30
4.1) System software tuning: Host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using end-to-end
• Default TCP buffer sizes are typically much too small for today's high speed networks
– Until recently, default TCP send/receive buffers were typically 64 KB
– Tuned buffer to fill a CA to NY, 1 Gb/s path: 10 MB
• 150X bigger than the default buffer size
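The required buffer is just the bandwidth-delay product of the path. A quick check of the 10 MB figure (my arithmetic; the ~80 ms CA-to-NY round trip time is an assumption):

```python
def bdp_bytes(bandwidth_gbps, rtt_ms):
    """Bandwidth-delay product: bytes that must be in flight to keep the pipe full."""
    return bandwidth_gbps * 1e9 / 8 * (rtt_ms / 1000.0)

print(f"{bdp_bytes(1, 80) / 1e6:.0f} MB")    # 1 Gb/s x ~80 ms RTT -> ~10 MB (vs. 64 KB default)
print(f"{bdp_bytes(10, 80) / 1e6:.0f} MB")   # the same path at 10 Gb/s needs ~100 MB
```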
31
System software tuning: Host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications
– How to tune is a function of the application and the path to the destination, so potentially a lot of special cases
• Auto-tuning the TCP connection buffer size within pre-configured limits helps
• Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths
32
System software tuning: Host tuning – TCP

[Figure: throughput out to ~9000 km (path length) on a 10 Gb/s network, 32 MBy (auto-tuned) vs. 64 MBy (hand-tuned) TCP window size; throughput (Mb/s, 0–10,000) vs. round trip time (ms, corresponding roughly to San Francisco to London); the hand-tuned 64 MBy window sustains markedly higher throughput over the full path length than the auto-tuned 32 MBy window]
33
4.2) System software tuning: Data transfer tools
Parallelism is key in data transfer tools (illustrated in the sketch below)
– It is much easier to achieve a given performance level with multiple parallel connections than with one connection
• this is because the OS is very good at managing multiple threads and less good at sustained, maximum performance of a single thread (the same is true for disks)
– Several tools offer parallel transfers (see below)
Latency tolerance is critical
– Wide area data transfers have much higher latency than LAN transfers
– Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds)
• examples: SCP/SFTP and the HPSS mover protocols work very poorly in long path networks
• Disk performance
– In general, a RAID array or parallel disks (as in FDT) are needed to get more than about 500 Mb/s
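A minimal sketch of the parallel-stream idea using only the Python standard library: the file is split into byte ranges that are fetched over several concurrent connections and written into place. The URL and sizes are hypothetical; real tools such as GridFTP and FDT implement the same pattern with multiple TCP streams and striped disk I/O:

```python
import concurrent.futures
import urllib.request

URL = "http://example.org/big-dataset.tar"   # hypothetical source supporting HTTP range requests
NUM_STREAMS = 8

def fetch_range(start, end):
    req = urllib.request.Request(URL, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return start, resp.read()

def parallel_fetch(total_size, out_path):
    # total_size would normally come from a HEAD request; passed in here for brevity
    chunk = total_size // NUM_STREAMS
    ranges = [(i * chunk, total_size - 1 if i == NUM_STREAMS - 1 else (i + 1) * chunk - 1)
              for i in range(NUM_STREAMS)]
    with open(out_path, "wb") as out, \
         concurrent.futures.ThreadPoolExecutor(max_workers=NUM_STREAMS) as pool:
        for start, data in pool.map(lambda r: fetch_range(*r), ranges):
            out.seek(start)      # each worker's chunk is written at its own offset
            out.write(data)
```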
34
System software tuning: Data transfer tools
Using the right tool is very important
Sample results: Berkeley, CA to Argonne, IL; RTT = 53 ms, network capacity = 10 Gbps
• scp: 140 Mbps
• patched scp (HPN): 1.2 Gbps
• ftp: 1.4 Gbps
• GridFTP, 4 streams: 5.4 Gbps
• GridFTP, 8 streams: 6.6 Gbps
Note that to get more than about 1 Gbps (125 MB/s) disk to disk requires using RAID technology
• PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSH
– http://www.psc.edu/networking/projects/hpn-ssh
– Significant performance increase
• this helps rsync too
35
System software tuning: Data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems
• Parallel streams and buffer tuning, plus help in getting through firewalls (open ports), ssh, etc.
• The newer Globus Online incorporates all of these, plus small file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP
36
System software tuning: Data transfer tools
Also see Caltech's FDT (Fast Data Transfer) approach
– Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node
– Explicit parallel use of multiple disks
– Can fill 100 Gb/s paths
– See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT
37
4.4) System software tuning: Other issues
Firewalls are anathema to high-speed data flows
– many firewalls can't handle >1 Gb/s flows
• designed for large numbers of low bandwidth flows
• some firewalls even strip out TCP options that allow for TCP buffers > 64 KB
• See Jason Zurawski's "Say Hello to your Frienemy – The Firewall"
– Stateful firewalls have inherent problems that inhibit high throughput
• http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues
– Large MTUs (several issues)
– NIC tuning
• Defaults are usually fine for 1GE, but 10GE often requires additional tuning
– Other OS tuning knobs
– See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])
5) Site infrastructure to support data-intensive science: The Science DMZ
With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck
• The site network (LAN) typically provides connectivity for the local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science
– Therefore a high performance interface between the wide area network and the local area site network is critical for large-scale data movement
• Campus network infrastructure is typically not designed to handle the flows of large-scale science
– The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows
• firewalls, proxy servers, low-cost switches, and so forth
• none of which will allow high volume, high bandwidth, long distance data flows
39
The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large data volume, high round trip time (RTT) (international paths) character of the wide area network (WAN) flows (see [DIS])
– otherwise the site will impose poor performance on the entire high speed data path, all the way back to the source
40
The Science DMZ
The Science DMZ concept:
• The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy
• Outside the site firewall – hence the term "Science DMZ"
• With dedicated systems built and tuned for wide-area data transfer
• With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below)
• A security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. hardware that supports access control lists, private address space, etc.)
This is so important that it was a requirement for the last round of NSF CC-NIE grants
41
The Science DMZ

[Figure: Science DMZ architecture. The WAN connects through the border router to a Science DMZ router/switch (a WAN-capable device) serving the Science DMZ: a high performance Data Transfer Node, network monitoring and testing, and a clean, high-bandwidth WAN data path to the campus/site computing cluster, with per-service security policy control points and dedicated systems built and tuned for wide-area data transfer. Campus/site access to Science DMZ resources is via the site firewall; secured campus/site access to the Internet, the campus/site LAN, and the site DMZ (Web/DNS/Mail) remain behind the site firewall. See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.]
42
6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites
• In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery (see the sketch below)
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers
• The Tier 2 sites get a comparable amount of data from the Tier 1s
– Host the physics groups that analyze the data and do the science
– Provide most of the compute resources for analysis
– Cache the data (though this is evolving to remote IO)
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management
– The resources and data movement are centrally managed
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations
– The system manages tens of thousands of jobs a day
• coordinates data movement of hundreds of terabytes/day, and
• manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial
44
[Figure: The ATLAS PanDA "Production and Distributed Analysis" system uses distributed resources and layers of automation to manage several million jobs/day. ATLAS production jobs, regional production jobs, and user/group analysis jobs enter the PanDA server (task management), which comprises a task buffer (job queue), job broker, job dispatcher, data service, and policy (job type priority). The CERN ATLAS detector feeds the Tier 0 data center (1 copy of all data – archival only). The ATLAS Tier 1 data centers – 11 sites scattered across Europe, North America, and Asia – in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis. The ATLAS analysis sites (e.g. 70 Tier 2 centers in Europe, North America, and SE Asia) run pilot jobs – PanDA job receivers running under the site-specific job manager (e.g. Condor, LSF, LCG, ...), similar to the Condor Glide-in approach – dispatched when resources are available at a site; a grid scheduler and site capability service report site status.
1) PanDA schedules jobs and initiates data movement; the strategy is to try to move the job to where the data is, else move data and job to where resources are available
2) The Distributed Data Manager (DDM) agents locate the data and move it to sites – this is a complex system in its own right, called DQ2
3) The local resources are prepared to receive PanDA jobs
4) Jobs are dispatched when there are resources available and when the required data is in place at the site
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point (both are at Brookhaven National Lab).]
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately here)
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC

[Figure: accumulated data volume on disk (0–150 petabytes over four years, growing at 730 TBytes/day) and the counts of the two PanDA job types (each ranging up to ~50,000–100,000 jobs over one year)]
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges"
– Successful testing was required for sites to participate in LHC production
47
Ramp-up of LHC traffic in ESnet

[Figure: ESnet traffic over time, showing the LHC data system testing period, the LHC turn-on, and LHC operation (with an estimate of the "small" scale traffic). The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.]
48
6 cont) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN
– The LHCOPN is a collection of leased 10 Gb/s optical circuits
– The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community
• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance
• The security issues were the primary ones, and were addressed by:
• using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
– that is, only LHC data and compute servers are connected to the OPN
The LHC OPN ndash Optical Private Network
[Figure: LHCOPN physical topology (abbreviated) and LHCOPN architecture – CH-CERN connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF]
51
The LHC OPN – Optical Private Network
NB:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact this is what ESnet's OSCARS virtual circuit system was originally designed for (see below)
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic
– In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
– (there are about 170 Tier 2 sites)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to the subnets that are used by LHC systems
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineers
– To ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC
54
[Figure: LHCONE – a global infrastructure for LHC Tier 1 data center to Tier 2 analysis center connectivity (April 2012). LHCONE VRF domains include ESnet (USA), Internet2 (USA), CANARIE (Canada), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), CUDI (Mexico), GÉANT (Europe), CERN (Geneva), and networks in Asia (TWAREN/ASGC Taiwan, KERONET2/KISTI Korea, TIFR India). They interconnect end sites – LHC Tier 2 or Tier 3 unless indicated as Tier 1 – such as BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, ASGC-T1, CC-IN2P3-T1, CERN-T1, and many universities and labs (Harvard, MIT, Caltech, UFlorida, UNeb, PurU, UCSD, UWisc, UMich/UltraLight, SLAC, DESY, GSI, INFN-Nap, GRIF-IN2P3, UNAM, KNU, NCU, NTU, UVic, SimFraU, UAlb, UTor, McGilU, and others), via regional R&E communication nexus points and data communication links of 10, 20, and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because:
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance
• See LHCONE.net
LHCONE is one part of the network infrastructure that supports the LHC
CERN → T1           miles     km
France                350     565
Italy                 570     920
UK                    625    1000
Netherlands           625    1000
Germany               700    1185
Spain                 850    1400
Nordic               1300    2100
USA – New York       3900    6300
USA – Chicago        4400    7100
Canada – BC          5200    8400
Taiwan               6100    9850

[Figure: a network-centric view of the LHC, as in the earlier slide: the detector (1 PB/s) feeds the Level 1 and 2 triggers at O(1-10) meters, the Level 3 trigger at O(10-100) meters, and the CERN Computer Center at O(1) km; 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) flows over the LHC Optical Private Network (LHCOPN) to the LHC Tier 1 data centers (Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN) at 500–10,000 km; the LHC Open Network Environment (LHCONE) connects the Tier 1 centers to the LHC Tier 2 analysis centers and university physics groups, which now get their data wherever it is most readily available.]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to:
– Couple existing pockets of code, data, and expertise into "systems of systems"
– Break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
– See https://www.es.net/about/science-requirements/
• A commonly identified need to support this is that networking must be provided as a "service":
– Schedulable, with guaranteed bandwidth – as is done with CPUs and disks
– Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
– Some network path characteristics may also be specified – e.g. diversity
– Available in a Web Services / Grid Services paradigm
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
– This is typically done by using a "static" routing mechanism
• e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
– MPLS and OpenFlow are examples of this, and both can transport IP packets
– Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
– The virtual circuits can be directed to specific physical network paths when they are set up
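To make the "network as a service" idea concrete, here is a hypothetical sketch of what a bandwidth-reservation request looks like from the user side – two endpoints, a guaranteed bandwidth, and a time window. The field names, endpoint names, and the notion of a JSON request body are invented for illustration; this is not the actual OSCARS or NSI interface:

```python
import json
from datetime import datetime, timedelta, timezone

# Hypothetical circuit-reservation request (illustrative only).
start = datetime.now(timezone.utc) + timedelta(hours=1)
reservation = {
    "src_endpoint": "site-a-dtn.example.net",   # hypothetical data transfer node names
    "dst_endpoint": "site-b-dtn.example.net",
    "bandwidth_mbps": 10000,                    # guaranteed 10 Gb/s
    "vlan": "any",                              # let the service choose the VLAN tag
    "start_time": start.isoformat(),
    "end_time": (start + timedelta(hours=6)).isoformat(),
    "description": "LHC dataset replication",
}
# Body that would be submitted to a domain controller's reservation interface
print(json.dumps(reservation, indent=2))
```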
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead, Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references
• OSCARS received a 2013 "R&D 100" award
60
End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part
• How are the circuits used?
– End system to end system, IP
• Almost never – very hard unless private address space is used
– Using public address space can result in leaking routes
– Using private address space with multi-homed hosts risks allowing backdoors into secure networks
– End system to end system, Ethernet (or other) over VLAN – a pseudowire
• Relatively common
• Interesting example: RDMA over VLAN is likely to be popular in the future
– the SC11 demo of 40G RDMA over the WAN was very successful
– the CPU load for RDMA is a small fraction of that for IP
– the guaranteed characteristics of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fit nicely with circuit services (RDMA performs very poorly on best effort networks)
– Point-to-point connection between routing instances – e.g. BGP at the end points
• Essentially this is how all current circuits are used: from one site router to another site router
– Typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters
End User View of Circuits ndash How They Use Thembull When are the circuits used
ndash Mostly to solve a specific problem that the general infrastructure cannot
bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering
network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains
involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET
(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains
63
Inter-Domain Control Protocolbull There are two realms involved
1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains
2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains
FNAL (AS3152)[US]
ESnet (AS293)[US]
GEANT (AS20965)[Europe]
DFN (AS680)[Germany]
DESY (AS1754)[Germany]
Topology exchange
VC setup request
Local InterDomain
Controller
Local IDC
Local IDC
Local IDC
Local IDC
VC setup request
VC setup request
VC setup request
OSCARS
User source
User destination
VC setup request
data plane connection helper at each domain ingressegress point
data plane connection helper at each domain ingressegress point
1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to
domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process
The end-to-end virtual circuit
AutoBAHN
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
– Testing is being coordinated in GLIF (the Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system
• Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net
65
8) Provide R&D consulting and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
– With each generation of network transport technology:
• 155 Mb/s was the norm for high speed networks in 1995
• 100 Gb/s – 650 times greater – is the norm today
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to:
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
• and then do the development necessary for applications to make use of the new capabilities
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects
are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build
a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at
httpfasterdataesnet and contains contributions from several organizations
67
The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
68
The Message
• A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments
• But once this is done, international high-speed data management can be done on a routine basis
• Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
69
Infrastructure Critical to Science
• The combination of:
– new network architectures in the wide area,
– new network services (such as guaranteed bandwidth virtual circuits),
– cross-domain network error detection and correction,
– redesigning the site LAN to handle high data throughput,
– automation of data movement systems, and
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated at / sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKA
The lessons:
• The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
– A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository –
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time, and
• high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites –
militate against a single large data center
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
– It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue
– In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
– In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded
• All high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
• Re-engineering the site LAN/WAN architecture is critical: the Science DMZ
• Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
75
The Message
Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments
But once this is done, international high-speed data management can be done on a routine basis
Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf, and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
5
HEP as a Prototype for Data-Intensive Science
[Chart: HEP data volumes for the leading experiments, with Belle II estimates; "LHC down for upgrade" marks the upgrade shutdown. Data courtesy of Harvey Newman (Caltech), Richard Mount (SLAC), and the Belle II CHEP 2012 presentation.]
6
HEP as a Prototype for Data-Intensive Science
• What is the significance to the network of this increase in data?
• Historically, the use of the network by science has tracked the size of the data sets used by science ("HEP data collected," the 2012 estimate – green line – in the previous slide).
7
HEP as a Prototype for Data-Intensive Science
As the instrument size and data volume have gone up, the methodology for analyzing the data has had to evolve:
– The data volumes from the early experiments were low enough that the data was analyzed locally.
– As the collaborations grew to several institutions and the data analysis was shared among them, the data was distributed by shipping tapes around.
– As the collaboration sizes grew and became intercontinental, the HEP community began to use networks to coordinate the collaborations and eventually to send the data around.
The LHC data model assumed network transport of all data from the beginning (as opposed to shipping media).
Similar changes are occurring in most science disciplines.
8
HEP as a Prototype for Data-Intensive Science
• Two major proton experiments (detectors) at the LHC: ATLAS and CMS.
• ATLAS is designed to observe a billion (1x10^9) collisions/sec, with a data rate out of the detector of more than 1,000,000 gigabytes/sec (1 PB/s).
• A set of hardware and software filters at the detector reduces the output data rate to about 25 Gb/s that must be transported, managed, and analyzed to extract the science.
– The output data rate for CMS is about the same, for a combined 50 Gb/s that is distributed to physics groups around the world, 7x24, ~9 months/yr. (A rough estimate of the yearly volume this implies follows below.)
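To make the scale concrete, here is a back-of-the-envelope sketch (not from the slides) of what a sustained 50 Gb/s, running 24x7 for roughly 9 months a year, implies for yearly data volume:

```python
# Rough estimate of yearly volume implied by the combined ATLAS+CMS
# filtered output rate quoted above (50 Gb/s, ~9 months/yr of running).
SECONDS_PER_MONTH = 30 * 24 * 3600

def yearly_volume_petabytes(rate_gbps: float, months: float = 9.0) -> float:
    """Convert a sustained rate in gigabits/s to petabytes per running year."""
    total_bytes = rate_gbps * 1e9 / 8 * months * SECONDS_PER_MONTH
    return total_bytes / 1e15

if __name__ == "__main__":
    print(f"{yearly_volume_petabytes(50):.0f} PB per running year")  # ~146 PB
```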
The LHC data management model involves a world-wide collection of centers that store, manage, and analyze the data
[Diagram: a network-centric view of the LHC (one of two detectors). The detector output (1 PB/s – 8 Pb/s) passes through the Level 1 and 2 triggers (O(1-10) and O(10-100) meters from the detector) and the Level 3 trigger (O(1) km away) into the CERN Tier 0 computer center. About 50 Gb/s (25 Gb/s ATLAS + 25 Gb/s CMS) then flows 500-10,000 km over the LHC Optical Private Network (LHCOPN) to the LHC Tier 1 data centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN). The Tier 1 centers hold the working data (tape 115 PBy, disk 60 PBy, 68,000 cores; WLCG 2012 figures; data outflow is about 3x inflow). Tier 2 centers are data caches and analysis sites (roughly 120 PBy, 175,000 cores), and they and the university physics groups reach the data over the LHC Open Network Environment (LHCONE) – the physics groups now get their data wherever it is most readily available.]
10
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBytes/day, ~68 Gb/s. PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately here). It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC. (A quick consistency check of these numbers follows below.)
[Charts: accumulated data volume on disk, rising to ~150 petabytes over four years, and the counts of the two PanDA job types over one year.]
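As a quick sanity check (my arithmetic, not from the slides), the 730 TBytes/day and ~68 Gb/s figures quoted above are consistent:

```python
# 730 TB/day expressed as an average rate in Gb/s.
tb_per_day = 730
gbps = tb_per_day * 1e12 * 8 / 86400 / 1e9
print(f"{gbps:.0f} Gb/s")  # ~68 Gb/s, matching the slide
```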
11
HEP as a Prototype for Data-Intensive Science
The capabilities required to support this scale of data movement involve hardware and software developments at all levels:
1. The underlying network: 1a. optical signal transport; 1b. network routers and switches
2. Data transport (TCP is a "fragile workhorse" but still the norm)
3. Network monitoring and testing
4. Operating system evolution
5. New site and network architectures
6. Data movement and management techniques and software
7. New network services
• Technology advances in these areas have resulted in today's state-of-the-art that makes it possible for the LHC experiments to routinely and continuously move data at ~150 Gb/s across three continents.
12
HEP as a Prototype for Data-Intensive Science
• ESnet has been collecting requirements for all DOE science disciplines and instruments that rely on the network for distributed data management and analysis for more than a decade, and formally since 2007 [REQ]. In this process certain issues are seen across essentially all science disciplines that rely on the network for significant data transfer, even if the quantities are modest compared to projects like the LHC experiments.
Therefore addressing the LHC issues is a useful exercise that can benefit a wide range of science disciplines.
SKA data flow model is similar to the LHC
[Diagram: receptors/sensors (93 – 168 Pb/s) feed the correlator / data processor (~200 km avg away), which sends 400 Tb/s to a supercomputer ~1000 km away. From there, ~100 Gb/s (from the SKA RFI) travels ~25,000 km (Perth to London via USA) or ~13,000 km (South Africa to London) to a European distribution point (hypothetical – based on the LHC experience), then to regional data centers and on to university astronomy groups. These numbers are based on modeling prior to splitting the SKA between S. Africa and Australia.]
14
Foundations of data-intensive science
• This talk looks briefly at the nature of the advances in technologies, software, and methodologies that have enabled LHC data management and analysis.
– Points 1a and 1b, on optical transport and router technology, are included in the slides for completeness, but I will not talk about them. They were not really driven by the needs of the LHC, but they were opportunistically used by the LHC.
– Much of the remainder of the talk is a tour through ESnet's network performance knowledge base (fasterdata.es.net).
– Also included are:
• the LHC ATLAS data management and analysis approach, which generates and relies on very large network data utilization;
• and an overview of how R&E networks have evolved to accommodate the LHC traffic.
1) Underlying network issues
At the core of our ability to transport the volume of data that we must deal with today, and to accommodate future growth, are advances in optical transport technology and router technology.
[Chart: ESnet accepted traffic, petabytes/month, over 13 years.]
We face a continuous growth of data to transport: ESnet has seen exponential growth in our traffic every year since 1990 (our traffic grows by a factor of 10 about once every 47 months). (A quick conversion of that growth rate to an annual factor follows below.)
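For intuition, here is a small calculation (mine, not from the slides) converting the "10x every 47 months" figure into an annual growth factor:

```python
# A factor-of-10 increase every 47 months corresponds to roughly
# 1.8x (~80%) traffic growth per year.
def annual_growth_factor(tenfold_months: float = 47.0) -> float:
    return 10 ** (12.0 / tenfold_months)

print(f"{annual_growth_factor():.2f}x per year")  # ~1.80x
```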
16
We face a continuous growth of data transport
• The LHC data volume is predicted to grow 10-fold over the next 10 years.
• New generations of instruments – for example the Square Kilometer Array radio telescope and ITER (the international fusion experiment) – will generate more data than the LHC.
• In response, ESnet and most large R&E networks have built 100 Gb/s (per optical channel) networks.
– ESnet's new network – ESnet5 – is complete and provides 44 x 100 Gb/s (4.4 terabits/sec – 4400 gigabits/sec) in optical channels across the entire ESnet national footprint.
– Initially, one of these 100 Gb/s channels is configured to replace the current 4 x 10 Gb/s IP network.
• What has made this possible?
17
1a) Optical Network Technology
Modern optical transport systems (DWDM = dense wave division multiplexing) use a collection of technologies called "coherent optical" processing to achieve more sophisticated optical modulation, and therefore higher data density per signal transport unit (symbol), providing 100 Gb/s per wave (optical channel).
– Optical transport using dual polarization-quadrature phase shift keying (DP-QPSK) technology with coherent detection [OIF1]:
• dual polarization – two independent optical signals, same frequency, on two orthogonal polarizations → reduces the symbol rate by half;
• quadrature phase shift keying – encodes data by changing the phase of the signal relative to the optical carrier → further reduces the symbol rate by half (sends twice as much data per symbol).
Together, DP and QPSK reduce the required symbol rate by a factor of 4, allowing a 100G payload (plus overhead) to fit into 50 GHz of spectrum.
• The actual transmission rate is about 10% higher to include FEC data.
– This is a substantial simplification of the optical technology involved – see the TNC 2013 paper and Chris Tracy's NANOG talk for details [Tracy1] and [Rob1].
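A rough worked example (my arithmetic, under the assumptions stated on the slide: ~10% FEC overhead, 2 bits/symbol from QPSK, x2 from dual polarization) of why the signal fits into a 50 GHz channel:

```python
# Approximate symbol rate of a 100G DP-QPSK wave.
payload_gbps = 100.0
line_rate_gbps = payload_gbps * 1.10      # payload plus ~10% FEC overhead (assumption)
bits_per_symbol = 2 * 2                   # QPSK (2 bits) x dual polarization (2)
symbol_rate_gbaud = line_rate_gbps / bits_per_symbol
print(f"~{symbol_rate_gbaud:.0f} Gbaud")  # ~28 Gbaud, which fits within 50 GHz of spectrum
```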
Optical Network Technology
• WaveLogic™ coherent transport provides the 100 Gb/s waves for ESnet5:
– 88 waves (optical channels), 100 Gb/s each
• wave capacity shared equally with Internet2
– ~13,000 miles / 21,000 km of lit fiber
– 280 optical amplifier sites
– 70 optical add/drop sites (where routers can be inserted)
• 46 100G add/drop transponders
• 22 100G re-gens across the wide area
[Map: ESnet5 optical infrastructure sites across the national footprint, including the Long Island MAN and ANI Testbed; geography is only representational.]
19
1b) Network routers and switches
ESnet5 routing (IP layer 3) is provided by Alcatel-Lucent 7750 routers with 100 Gb/s client interfaces:
– 17 routers with 100G interfaces (several more in a test environment)
– 59 layer-3 100GigE interfaces; 8 customer-owned 100G routers
– 7 100G interconnects with other R&E networks at Starlight (Chicago), MAN LAN (New York), and Sunnyvale (San Francisco)
20
The Energy Sciences Network ESnet5 (Fall 2013)
[Map: the ESnet5 topology as of Fall 2013 – a 100G national backbone connecting the DOE labs and user facilities, with 100G, 10-40G, and 1G site-provided circuits, metro area circuits, the SUNN-STAR-AOFA 100G testbed (SF Bay Area, Chicago, New York, Amsterdam), US and international R&E peerings, and commercial peerings; ESnet routers, site routers, and optical-only nodes are distinguished. Geographical representation is approximate.]
2) Data transport: the limitations of TCP must be addressed for large, long-distance flows
Although there are other transport protocols available, TCP remains the workhorse of the Internet, including for data-intensive science.
Using TCP to support the sustained, long distance, high data-rate flows of data-intensive science requires an error-free network.
Why error-free? TCP is a "fragile workhorse": it is very sensitive to packet loss (due to bit errors).
– Very small packet loss rates on these paths result in large decreases in performance.
– A single bit error will cause the loss of a 1-9 KBy packet (depending on the MTU size), as there is no FEC at the IP level for error correction.
• This puts TCP back into "slow start" mode, thus reducing throughput.
22
Transport
• The reason for TCP's sensitivity to packet loss is the slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internet.
– Packet loss is seen by TCP's congestion control algorithms as evidence of congestion, so they activate to slow down and prevent the synchronization of the senders (which would perpetuate and amplify the congestion, leading to network throughput collapse).
– Network link errors also cause packet loss, so these congestion avoidance algorithms come into play, with dramatic effect on throughput in the wide area network – hence the need for "error-free."
23
Transport: Impact of packet loss on TCP
On a 10 Gb/s LAN path the impact of low packet loss rates is minimal.
On a 10 Gb/s WAN path the impact of low packet loss rates is enormous (~80X throughput reduction on a transatlantic path).
Implication: error-free paths are essential for high-volume, long-distance data transfers.
[Chart: throughput (Mb/s, 0-10,000) vs. increasing network round trip time (ms, corresponding roughly to San Francisco to London) on a 10 Gb/s link with 0.0046% packet loss, for Reno (measured), Reno (theory), and H-TCP (measured), compared with the no-packet-loss case. See http://fasterdata.es.net/performance-testing/perfsonar/troubleshooting/packet-loss]
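The shape of those curves can be approximated with the well-known Mathis et al. loss-limited throughput formula; the numbers below are illustrative (my choice of MSS and RTTs, not the measured data above), but they reproduce the roughly 80x LAN-vs-WAN gap:

```python
import math

# Loss-limited TCP (Reno-style) throughput, Mathis et al. approximation:
#   rate <= (MSS / RTT) * (C / sqrt(loss)),  with C ~ 1.22
def mathis_throughput_mbps(mss_bytes=1460, rtt_s=0.088, loss=4.6e-5, c=1.22):
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss)) / 1e6

print(f"LAN,  1 ms RTT: {mathis_throughput_mbps(rtt_s=0.001):8.0f} Mb/s")
print(f"WAN, 88 ms RTT: {mathis_throughput_mbps(rtt_s=0.088):8.0f} Mb/s")
# The ~88x ratio shows why a loss rate that is negligible on a LAN
# is devastating on a transatlantic path.
```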
24
Transport: Modern TCP stack
• A modern TCP stack (the kernel implementation of the TCP protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk]).
– This is done using mechanisms that more quickly increase back to full speed after an error forces a reset to low bandwidth.
[Chart: "Binary Increase Congestion" (BIC) control algorithm impact – BIC reaches max throughput much faster than older algorithms. (From Linux 2.6.19 the default is CUBIC, a refined version of BIC designed for high bandwidth, long paths.)]
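A minimal, Linux-specific sketch (an illustration, not part of the original slides) of how an operator can check which congestion control algorithm a data transfer host is actually using:

```python
# On a modern Linux kernel this should report "cubic" as the current algorithm.
from pathlib import Path

def read_proc(name: str) -> str:
    return Path(f"/proc/sys/net/ipv4/{name}").read_text().strip()

print("current:  ", read_proc("tcp_congestion_control"))
print("available:", read_proc("tcp_available_congestion_control"))
```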
25
Transport: Modern TCP stack
Even modern TCP stacks are only of some help in the face of packet loss on a long-path, high-speed network.
• For a detailed analysis of the impact of packet loss on various TCP implementations, see "An Investigation into Transport Protocols and Data Transport Applications Over High Performance Networks," chapter 8 ("Systematic Tests of New-TCP Behaviour"), by Yee-Ting Li, University College London (PhD thesis): http://www.slac.stanford.edu/~ytl/thesis.pdf
[Chart (tail zoom of the previous plot): throughput (Mb/s, 0-1000) vs. roundtrip time (ms, corresponding roughly to San Francisco to London) on a 10 Gb/s link with 0.0046% packet loss, for Reno (measured), Reno (theory), and H-TCP (CUBIC refinement, measured).]
26
3) Monitoring and testing
The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction.
perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.).
• perfSONAR is a community effort to:
– define network management data exchange protocols, and
– standardize measurement data formats, gathering, and archiving.
perfSONAR is deployed extensively throughout LHC related networks and international networks and at the end sites (see [fasterdata], [perfSONAR], and [NetServ]).
– There are now more than 1000 perfSONAR boxes installed in N. America and Europe.
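As a rough illustration of how such measurement data can be consumed programmatically, the sketch below pulls results from a perfSONAR measurement archive (the esmond REST interface) over plain HTTP. The host name and query values are placeholders, and the exact URL layout and parameters can differ by deployment:

```python
import json
import urllib.request

# Hypothetical measurement archive host and test endpoints.
MA = "http://ps-archive.example.net/esmond/perfsonar/archive/"
query = "?source=dtn1.example.net&destination=dtn2.example.net&event-type=throughput"

with urllib.request.urlopen(MA + query) as resp:
    measurements = json.loads(resp.read())

# Print a one-line summary per measurement metadata record returned.
for m in measurements:
    print(m.get("source"), "->", m.get("destination"))
```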
27
perfSONAR
The test and monitor functions can detect soft errors that limit throughput and can be hard to find (hard errors / faults are easily found and corrected).
Soft failure example:
• Observed end-to-end performance degradation due to soft failure of a single optical line card.
[Chart: throughput in Gb/s over one month – normal performance, then degrading performance, then repair.]
• Why not just rely on "SNMP" interface stats for this sort of error detection?
• not all error conditions show up in SNMP interface statistics
• SNMP error statistics can be very noisy
• some devices lump different error counters into the same bucket, so it can be very challenging to figure out what errors to alarm on and what errors to ignore
• though ESnet's Spectrum monitoring system attempts to apply heuristics to do this
• many routers will silently drop packets – the only way to find that is to test through them and observe loss using devices other than the culprit device
28
perfSONAR
The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains: it provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe.
– ESnet has perfSONAR testers installed at every PoP and all but the smallest user sites; Internet2 is close to the same.
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages.
29
4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network:
• Host TCP tuning
• Modern TCP stack (see above)
• Other issues (MTU, etc.)
• Data transfer tools and parallelism
• Other data transfer issues (firewalls, etc.)
30
41) System software tuning: Host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length.
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using, end-to-end.
• Default TCP buffer sizes are typically much too small for today's high speed networks:
– Until recently, default TCP send/receive buffers were typically 64 KB.
– Tuned buffer to fill a CA to NY 1 Gb/s path: 10 MB – 150X bigger than the default buffer size. (A bandwidth-delay-product sketch follows below.)
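The required buffer size is just the bandwidth-delay product of the path. A small sketch (mine; the ~80 ms CA-to-NY RTT is an assumption consistent with the 10 MB figure above):

```python
# TCP buffer needed to keep a path full = rate x round-trip time.
def bdp_mbytes(rate_gbps: float, rtt_ms: float) -> float:
    return rate_gbps * 1e9 / 8 * rtt_ms / 1e3 / 1e6

print(f"1 Gb/s  x  80 ms RTT: {bdp_mbytes(1, 80):6.1f} MB")   # ~10 MB (CA to NY example)
print(f"10 Gb/s x 150 ms RTT: {bdp_mbytes(10, 150):6.1f} MB") # ~188 MB (intercontinental)
```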
31
System software tuning Host tuning ndash TCPbull Historically TCP window size tuning parameters were host-
global with exceptions configured per-socket by applicationsndash How to tune is a function of the application and the path to the
destination so potentially a lot of special cases
Auto-tuning TCP connection buffer size within pre-configured limits helps
Auto-tuning however is not a panacea because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (eg international) paths
32
System software tuning: Host tuning – TCP
[Chart: throughput (Mb/s, 0-10,000) vs. roundtrip time (ms, corresponding roughly to San Francisco to London), out to a ~9000 km path length, on a 10 Gb/s network, comparing a 32 MBy (auto-tuned) and a 64 MBy (hand-tuned) TCP window size; the hand-tuned 64 MBy window sustains much higher throughput on the long paths.]
33
42) System software tuning: Data transfer tools
Parallelism is key in data transfer tools (a minimal sketch of the idea follows below).
– It is much easier to achieve a given performance level with multiple parallel connections than with one connection.
• This is because the OS is very good at managing multiple threads and less good at sustained, maximum performance of a single thread (the same is true for disks).
– Several tools offer parallel transfers (see below).
Latency tolerance is critical:
– Wide area data transfers have much higher latency than LAN transfers.
– Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds); for example, SCP/SFTP and HPSS mover protocols work very poorly in long path networks.
• Disk performance:
– In general, a RAID array or parallel disks (as in FDT) are needed to get more than about 500 Mb/s.
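To make the parallelism idea concrete, here is a minimal sketch (the URL and chunking are hypothetical; real tools such as GridFTP and FDT do this far more carefully, with checksums and restart) that fetches byte ranges of one file over several concurrent connections and reassembles them:

```python
import concurrent.futures
import urllib.request

URL = "https://example.org/big-dataset.tar"   # hypothetical source (must support Range requests)
N_STREAMS = 8

def fetch_range(start: int, end: int) -> bytes:
    # One HTTP connection per byte range, so the streams progress independently.
    req = urllib.request.Request(URL, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def parallel_fetch(total_size: int) -> bytes:
    chunk = total_size // N_STREAMS
    ranges = [(i * chunk, total_size - 1 if i == N_STREAMS - 1 else (i + 1) * chunk - 1)
              for i in range(N_STREAMS)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=N_STREAMS) as pool:
        parts = list(pool.map(lambda r: fetch_range(*r), ranges))
    return b"".join(parts)
```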
34
System software tuning: Data transfer tools
Using the right tool is very important. Sample results, Berkeley, CA to Argonne, IL (RTT = 53 ms, network capacity = 10 Gbps):
– scp: 140 Mbps
– patched scp (HPN): 1.2 Gbps
– ftp: 1.4 Gbps
– GridFTP, 4 streams: 5.4 Gbps
– GridFTP, 8 streams: 6.6 Gbps
Note that to get more than about 1 Gbps (125 MB/s) disk to disk requires using RAID technology.
• PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSH: http://www.psc.edu/networking/projects/hpn-ssh – a significant performance increase (this helps rsync too).
35
System software tuning: Data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems:
– Parallel streams and buffer tuning help in getting through firewalls (open ports), ssh, etc.
– The newer Globus Online incorporates all of these, plus small file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP. (A sketch of a scripted transfer follows below.)
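As an illustration of the managed-transfer model, here is a sketch of driving a third-party transfer with the Globus Python SDK (globus_sdk). The endpoint IDs, paths, and access token are placeholders; the service handles parallel streams, retries, and checksums on the user's behalf:

```python
import globus_sdk

TOKEN = "..."                      # placeholder OAuth2 transfer token
SRC_EP = "source-endpoint-uuid"    # placeholder endpoint IDs
DST_EP = "destination-endpoint-uuid"

tc = globus_sdk.TransferClient(authorizer=globus_sdk.AccessTokenAuthorizer(TOKEN))
tdata = globus_sdk.TransferData(tc, SRC_EP, DST_EP,
                                label="dataset copy", sync_level="checksum")
tdata.add_item("/data/run2012/", "/scratch/run2012/", recursive=True)

task = tc.submit_transfer(tdata)   # third-party transfer runs server-side
print("submitted task:", task["task_id"])
```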
36
System software tuning: Data transfer tools
Also see Caltech's FDT (Fast Data Transfer) approach:
– Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node.
– Explicit parallel use of multiple disks.
– Can fill 100 Gb/s paths.
– See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT/
37
44) System software tuning: Other issues
Firewalls are anathema to high-speed data flows:
– many firewalls can't handle >1 Gb/s flows
• they are designed for large numbers of low bandwidth flows
• some firewalls even strip out TCP options that allow for TCP buffers > 64 KB
– See Jason Zurawski's "Say Hello to your Frienemy – The Firewall."
– Stateful firewalls have inherent problems that inhibit high throughput: http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues:
– Large MTUs (several issues)
– NIC tuning (defaults are usually fine for 1GE, but 10GE often requires additional tuning)
– Other OS tuning knobs
– See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])
5) Site infrastructure to support data-intensive science: The Science DMZ
With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck.
The site network (LAN) typically provides connectivity for local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science. Therefore a high performance interface between the wide area network and the local area site network is critical for large-scale data movement.
Campus network infrastructure is typically not designed to handle the flows of large-scale science:
– The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows: firewalls, proxy servers, low-cost switches, and so forth, none of which will allow high volume, high bandwidth, long distance data flows.
39
The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large data volume, high round trip time (RTT) (international paths) of the wide area network (WAN) flows (see [DIS]); otherwise the site will impose poor performance on the entire high speed data path, all the way back to the source.
40
The Science DMZ
The Science DMZ concept: the compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy:
– Outside the site firewall – hence the term "Science DMZ."
– With dedicated systems built and tuned for wide-area data transfer.
– With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below).
– With a security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.).
This is so important it was a requirement for the last round of NSF CC-NIE grants.
41
The Science DMZ
(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)
[Diagram: the border router connects the WAN to a Science DMZ router/switch (a WAN-capable device) sitting outside the site firewall. The Science DMZ contains dedicated, high performance Data Transfer Nodes built and tuned for wide-area data transfer, network monitoring and testing, and per-service security policy control points, with a clean, high-bandwidth WAN data path through to the campus computing cluster. Campus/site access to Science DMZ resources is via the site firewall, while secured campus/site access to the Internet and the site DMZ (Web, DNS, Mail) remain on the normal campus/site LAN path.]
42
6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites. In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers.
• The Tier 2 sites get a comparable amount of data from the Tier 1s:
– they host the physics groups that analyze the data and do the science,
– provide most of the compute resources for analysis, and
– cache the data (though this is evolving to remote I/O).
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management:
– The resources and data movement are centrally managed.
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations.
– The system manages 10s of thousands of jobs a day:
• it coordinates data movement of hundreds of terabytes/day, and
• manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science.
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial.
44
[Diagram: the ATLAS PanDA "Production and Distributed Analysis" system, which uses distributed resources and layers of automation to manage several million jobs/day. ATLAS production jobs, regional production jobs, and user/group analysis jobs enter the PanDA server at CERN (task buffer / job queue, job dispatcher, job broker, and policy on job type and priority). 1) PanDA schedules jobs and initiates data movement. 2) The Distributed Data Manager (a complex system in its own right, called DQ2) locates data and moves it to sites. 3) A "pilot" job manager – a PanDA job receiver – is dispatched when resources are available at a site and prepares the local resources to receive PanDA jobs; pilots run under the local site job manager (e.g. Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA, similar to the Condor glide-in approach. 4) Jobs are dispatched when there are resources available and when the required data is in place at the site; the system tries to move the job to where the data is, else moves data and job to where resources are available. The CERN ATLAS detector / Tier 0 data center holds one archival-only copy of all data; the 11 ATLAS Tier 1 data centers, scattered across Europe, North America, and Asia, in aggregate hold one copy of all data and provide the working data set for distribution to Tier 2 centers for analysis; the ATLAS analysis sites (e.g. 70 Tier 2 centers in Europe, North America, and SE Asia) run the jobs. Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point (both are at Brookhaven National Lab).]
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBytes/day, ~68 Gb/s. PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately here). It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
[Charts: accumulated data volume on disk (~150 petabytes over four years) and the counts of the two PanDA job types over one year.]
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure:
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges."
– Successful testing was required for sites to participate in LHC production.
47
Ramp-up of LHC traffic in ESnet
[Chart: LHC traffic in ESnet through LHC data system testing (an estimate of "small" scale traffic), LHC turn-on, and LHC operation.] The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.
48
6 cont) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN.
– The LHCOPN is a collection of leased 10 Gb/s optical circuits.
– The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.
• The issues related to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance. The security issues were the primary ones, and were addressed by using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec]) – that is, only LHC data and compute servers are connected to the OPN.
50
The LHC OPN – Optical Private Network
[Diagram: the LHCOPN physical topology (abbreviated) and architecture – CH-CERN at the center, connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network
NB:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic:
– In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic (there are about 170 Tier 2 sites).
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems:
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.).
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).
In this way the LHC traffic will use circuits designated by the network engineers:
– to ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.
54
[Diagram: LHCONE – a global infrastructure for LHC Tier 1 data center to Tier 2 analysis center connectivity (April 2012). Regional LHCONE VRF domains (ESnet and Internet2 in the USA, CANARIE in Canada, NORDUnet in the Nordic countries, DFN in Germany, GARR in Italy, RedIRIS in Spain, SARA in the Netherlands, RENATER in France, GÉANT in Europe, TWAREN and ASGC in Taiwan, KREONET2 and KISTI in Korea, CUDI in Mexico, and others) interconnect the Tier 1 centers and the Tier 2/Tier 3 end sites over data communication links of 10, 20, and 30 Gb/s, meeting at regional R&E communication nexus points such as Seattle, Chicago, New York, Washington, Amsterdam, and Geneva. See http://lhcone.net for details.]
55
The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because
ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for
use by the LHC collaboration that cannot be made available for general RampE traffic
bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance
bull See LHCONEnet
LHCONE is one part of the network infrastructure that supports the LHC
CERN → T1 distances:
  France          350 miles    565 km
  Italy           570 miles    920 km
  UK              625 miles   1000 km
  Netherlands     625 miles   1000 km
  Germany         700 miles   1185 km
  Spain           850 miles   1400 km
  Nordic         1300 miles   2100 km
  USA – New York 3900 miles   6300 km
  USA – Chicago  4400 miles   7100 km
  Canada – BC    5200 miles   8400 km
  Taiwan         6100 miles   9850 km
[Diagram: the same network-centric view of the LHC as before – detector, Level 1, 2, and 3 triggers, 1 PB/s into the CERN Tier 0 computer center, then 50 Gb/s (25 Gb/s ATLAS + 25 Gb/s CMS) over the LHC Optical Private Network (LHCOPN) to the Tier 1 data centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN) 500-10,000 km away, with the LHC Open Network Environment (LHCONE) connecting the Tier 2 analysis centers and university physics groups, which now get their data wherever it is most readily available.]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to:
– couple existing pockets of code, data, and expertise into "systems of systems";
– break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites.
– See https://www.es.net/about/science-requirements/
A commonly identified need to support this is that networking must be provided as a "service":
– schedulable with guaranteed bandwidth – as is done with CPUs and disks;
– traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure;
– some network path characteristics may also be specified – e.g. diversity;
– available in the Web Services / Grid Services paradigm.
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
– This is typically done by using a "static" routing mechanism, e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path.
– MPLS and OpenFlow are examples of this, and both can transport IP packets.
– Most modern Internet routers have this type of functionality.
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage and optimize the use of available network resources and to keep big data flows separate from general traffic.
– The virtual circuits can be directed to specific physical network paths when they are set up. (A purely illustrative request sketch follows below.)
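To make the "network as a service" model concrete, here is a purely illustrative sketch of what a programmatic circuit reservation might look like. This is NOT the OSCARS or NSI API; the endpoint URL and request fields are hypothetical:

```python
import json
import urllib.request

# Hypothetical reservation request for a scheduled, guaranteed-bandwidth circuit.
request = {
    "src": "site-a-dtn.example.net",
    "dst": "site-b-dtn.example.net",
    "bandwidth_mbps": 5000,           # guaranteed bandwidth, like reserving CPUs or disk
    "start": "2014-04-01T00:00:00Z",  # schedulable start/end times
    "end":   "2014-04-02T00:00:00Z",
}
req = urllib.request.Request(
    "https://circuit-service.example.net/reservations",   # hypothetical domain controller
    data=json.dumps(request).encode(),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```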
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.
60
End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo
ndash Sites for the most part
bull How are the circuits usedndash End system to end system IP
bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure
networks
ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future
ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP
protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
ndash Point-to-point connection between routing instance ndash eg BGP at the end points
bull Essentially this is how all current circuits are used from one site router to another site router
ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot.
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering.
Cross-Domain Virtual Circuit Service: network domains (administrative units)
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits.
– e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.
[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] through ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local inter-domain controller (IDC, e.g. OSCARS or AutoBAHN), with a data plane connection helper at each domain ingress/egress point, and VC setup requests and topology exchanges passing between the IDCs.]
1. The domains exchange topology information containing at least potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process.
64
Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved
into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated
Facility - an international virtual organization that promotes the paradigm of lambda networking)
bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet
65
8) Provide R&D consulting and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science.
– With each generation of network transport technology:
• 155 Mb/s was the norm for high speed networks in 1995;
• 100 Gb/s – 650 times greater – is the norm today.
• R&D groups involving hardware engineers, computer scientists, and application specialists worked:
• first to demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
• and then to do the development necessary for applications to make use of the new capabilities.
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths;
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s.
66
Provide R&D consulting and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKA
The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
– A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
– The technical aspects of building and operating a centralized working data repository:
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time, and
• high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
militate against a single large data center.
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply:
– There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
– If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
• In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.
Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
78
References (continued)
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
3
DOE Office of Science and ESnet ndash the ESnet Mission ESnet - the Energy Sciences Network - is an
SC program whose primary mission is to enable the large-scale science of the Office of Science that depends onndash Multi-institution world-wide collaboration Data mobility sharing of massive amounts of datandash Distributed data management and processingndash Distributed simulation visualization and
computational steeringndash Collaboration with the US and International
Research and Education community ldquoEnabling large-scale sciencerdquo means ensuring
that the network can be used effectively to provide all mission required access to data and computing
bull ESnet connects the Office of Science National Laboratories and user facilities to each other and to collaborators worldwidendash Ames Argonne Brookhaven Fermilab
Lawrence Berkeley Oak Ridge Pacific Northwest Princeton Plasma Physics SLAC and Thomas Jefferson National Accelerator Facilityand embedded and detached user facilities
4
HEP as a Prototype for Data-Intensive ScienceThe history of high energy physics (HEP) data management
and analysis anticipates many other science disciplines Each new generation of experimental science requires more complex
instruments to ferret out more and more subtle aspects of the science As the sophistication size and cost of the instruments increase the
number of such instruments becomes smaller and the collaborations become larger and more widely distributed ndash and mostly international
ndash These new instruments are based on increasingly sophisticated sensors which now are largely solid-state devices akin to CCDs
bull In many ways the solid-state sensors follow Moorersquos law just as computer CPUs do The number of transistors doubles per unit area of silicon every 18 mo and therefore the amount of data coming out doubles per unit area
ndash the data output of these increasingly sophisticated sensors has increased exponentially
bull Large scientific instruments only differ from CPUs in that the time between science instrument refresh is more like 10-20 years and so the increase in data volume from instrument to instrument is huge
5
HEP as a Prototype for Data-Intensive Science
Data courtesy of Harvey Newman Caltech and Richard Mount SLAC and Belle II CHEP 2012 presentation
HEP data volumes for leading experimentswith Belle-II estimates
LHC down for upgrade
6
HEP as a Prototype for Data-Intensive Sciencebull What is the significance to the network of this increase in databull Historically the use of the network by science has tracked the
size of the data sets used by science
ldquoHEP data collectedrdquo 2012 estimate (green line) in previous
slide
7
HEP as a Prototype for Data-Intensive ScienceAs the instrument size and data volume have gone up the
methodology for analyzing the data has had to evolvendash The data volumes from the early experiments were low enough that
the data was analyzed locallyndash As the collaborations grew to several institutions and the data
analysis shared among them the data was distributed by shipping tapes around
ndash As the collaboration sizes grew and became intercontinental the HEP community began to use networks to coordinate the collaborations and eventually to send the data around
The LHC data model assumed network transport of all data from the beginning (as opposed to shipping media)
Similar changes are occurring in most science disciplines
8
HEP as a Prototype for Data-Intensive Sciencebull Two major proton experiments (detectors) at the LHC ATLAS
and CMSbull ATLAS is designed to observe a billion (1x109) collisionssec
with a data rate out of the detector of more than 1000000 Gigabytessec (1 PBys)
bull A set of hardware and software filters at the detector reduce the output data rate to about 25 Gbs that must be transported managed and analyzed to extract the sciencendash The output data rate for CMS is about the same for a combined
50 Gbs that is distributed to physics groups around the world 7x24x~9moyr
The LHC data management model involves a world-wide collection of centers that store manage and analyze the data
CERN Computer Center
The LHC Optical Private Network
(LHCOPN)
LHC Tier 1Data Centers
LHC Tier 2 Analysis Centers
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups Universities
physicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
The LHC Open Network
Environment(LHCONE)
50 Gbs (25Gbs ATLAS 25Gbs CMS)
detector
Level 1 and 2 triggers
Level 3 trigger
O(1-10) meter
O(10-100) meters
O(1) km
1 PBs ndash 8 Pbs
500-10000 km
This is intended to indicate that the physics
groups now get their datawherever it is most readily
available
A Network Centric View of the LHC(one of two detectors)
LHC Tier 0
Taiwan Canada USA-Atlas USA-CMS
Nordic
UKNetherlands Germany Italy
Spain
FranceCERN
Tier 1 centers hold working data
Tape115 PBy
Disk60 PBy
Cores68000
Tier 2 centers are data caches and analysis sites
0
(WLCG
120 PBy
2012)
175000
3 X data outflow vs inflow
10
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe N America and SE Asia generate network
data movement of 730 TByday ~68Gbs
Accumulated data volume on disk
730 TBytesday
PanDA manages 120000ndash140000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movementgoing on 24 hrday 9+ monthsyr
that networks must support in order to enable the large-scale science of the LHC
0
50
100
150
Peta
byte
s
four years
0
50000
100000
type
2jo
bs
0
50000
type
1jo
bs
one year
one year
one year
11
HEP as a Prototype for Data-Intensive ScienceThe capabilities required to support this scale of data
movement involve hardware and software developments at all levels1 The underlying network
1a Optical signal transport1b Network routers and switches
2 Data transport (TCP is a ldquofragile workhorserdquo but still the norm)3 Network monitoring and testing4 Operating system evolution5 New site and network architectures6 Data movement and management techniques and software7 New network services
bull Technology advances in these areas have resulted in todayrsquos state-of-the-art that makes it possible for the LHC experiments to routinely and continuously move data at ~150 Gbs across three continents
12
HEP as a Prototype for Data-Intensive Sciencebull ESnet has been collecting requirements for all DOE science
disciplines and instruments that rely on the network for distributed data management and analysis for more than a decade and formally since 2007 [REQ] In this process certain issues are seen across essentially all science
disciplines that rely on the network for significant data transfer even if the quantities are modest compared to project like the LHC experiments
Therefore addressing the LHC issues is a useful exercise that can benefit a wide range of science disciplines
SKA data flow model is similar to the LHCreceptorssensors
correlator data processor
supercomputer
European distribution point
~200km avg
~1000 km
~25000 km(Perth to London via USA)
or~13000 km
(South Africa to London)
Regionaldata center
Universitiesastronomy
groups
Universitiesastronomy
groups
Universitiesastronomy
groups
Universitiesastronomy
groups
Universitiesastronomy
groups
Universitiesastronomy
groups
Universitiesastronomy
groups
Universitiesastronomy
groups
Universitiesastronomy
groups
93 ndash 168 Pbs
400 Tbs
100 Gbs
from SKA RFI
Hypothetical(based on the
LHC experience)
These numbers are based on modeling prior to splitting the
SKA between S Africa and Australia)
Regionaldata center
Regionaldata center
14
Foundations of data-intensive sciencebull This talk looks briefly at the nature of the advances in
technologies software and methodologies that have enabled LHC data management and analysis The points 1a and 1b on optical transport and router technology are
included in the slides for completeness but I will not talk about them They were not really driven by the needs of the LHC but they were opportunistically used by the LHC
Much of the reminder of the talk is a tour through ESnetrsquos network performance knowledge base (fasterdataesnet)
ndash Also included arebull the LHC ATLAS data management and analysis approach that generates
and relies on very large network data utilizationbull and an overview of how RampE network have evolved to accommodate the
LHC traffic
1) Underlying network issuesAt the core of our ability to transport the volume of data
that we must deal with today and to accommodate future growth are advances in optical transport technology and
router technology
0
5
10
15
Peta
byte
sm
onth
13 years
We face a continuous growth of data to transport
ESnet has seen exponential growth in our traffic every year since 1990 (our traffic grows by factor of 10 about once every 47 months)
16
We face a continuous growth of data transportbull The LHC data volume is predicated to grow 10 fold over the
next 10 yearsNew generations of instruments ndash for example the Square
Kilometer Array radio telescope and ITER (the international fusion experiment) ndash will generate more data than the LHC
In response ESnet and most large RampE networks have built 100 Gbs (per optical channel) networksndash ESnets new network ndash ESnet5 ndash is complete and provides a 44 x
100Gbs (44 terabitssec - 4400 gigabitssec) in optical channels across the entire ESnet national footprint
ndash Initially one of these 100 Gbs channels is configured to replace the current 4 x 10 Gbs IP network
bull What has made this possible
17
1a) Optical Network TechnologyModern optical transport systems (DWDM = dense wave
division multiplexing) use a collection of technologies called ldquocoherent opticalrdquo processing to achieve more sophisticated optical modulation and therefore higher data density per signal transport unit (symbol) that provides 100Gbs per wave (optical channel)ndash Optical transport using dual polarization-quadrature phase shift keying
(DP-QPSK) technology with coherent detection [OIF1]bull dual polarization
ndash two independent optical signals same frequency orthogonal two polarizations rarr reduces the symbol rate by half
bull quadrature phase shift keying ndash encode data by changing the signal phase of the relative to the optical carrier further reduces the symbol rate by half (sends twice as much data symbol)
Together DP and QPSK reduce required rate by a factor of 4ndash allows 100G payload (plus overhead) to fit into 50GHz of spectrum
bull Actual transmission rate is about 10 higher to include FEC data
ndash This is a substantial simplification of the optical technology involved ndash see the TNC 2013 paper and Chris Tracyrsquos NANOG talk for details [Tracy1] and [Rob1]
WaveLogictrade to provide 100Gbs wavendash 88 waves (optical channels) 100Gbs each
bull wave capacity shared equally with Internet2ndash ~13000 miles 21000 km lit fiberndash 280 optical amplifier sitesndash 70 optical adddrop sites (where routers can be inserted)
bull 46 100G adddrop transpondersbull 22 100G re-gens across wide-area
NEWG
SUNN
KANSDENV
SALT
BOIS
SEAT
SACR
WSAC
LOSA
LASV
ELPA
ALBU
ATLA
WASH
NEWY
BOST
SNLL
PHOE
PAIX
NERSC
LBNLJGI
SLAC
NASHCHAT
CLEV
EQCH
STA
R
ANLCHIC
BNL
ORNL
CINC
SC11
STLO
Internet2
LOUI
FNA
L
Long IslandMAN and
ANI Testbed
O
JACKGeography is
only representational
19
1b) Network routers and switchesESnet5 routing (IP layer 3) is provided by Alcatel-Lucent
7750 routers with 100 Gbs client interfacesndash 17 routers with 100G interfaces
bull several more in a test environment ndash 59 layer-3 100GigE interfaces 8 customer-owned 100G routersndash 7 100G interconnects with other RampE networks at Starlight (Chicago)
MAN LAN (New York) and Sunnyvale (San Francisco)
20
Metro area circuits
SNLL
PNNL
MIT
PSFC
AMES
LLNL
GA
JGI
LBNL
SLACNER
SC
ORNL
ANLFNAL
SALT
INL
PU Physics
SUNN
SEAT
STAR
CHIC
WASH
ATLA
HO
US
BOST
KANS
DENV
ALBQ
LASV
BOIS
SAC
R
ELP
A
SDSC
10
Geographical representation is
approximate
PPPL
CH
AT
10
SUNN STAR AOFA100G testbed
SF Bay Area Chicago New York AmsterdamAMST
US RampE peerings
NREL
Commercial peerings
ESnet routers
Site routers
100G
10-40G
1G Site provided circuits
LIGO
Optical only
SREL
100thinsp
Intrsquol RampE peerings
100thinsp
JLAB
10
10100thinsp
10
100thinsp100thinsp
1
10100thinsp
100thinsp1
100thinsp100thinsp
100thinsp
100thinsp
BNL
NEWY
AOFA
NASH
1
LANL
SNLA
10
10
1
10
10
100thinsp
100thinsp
100thinsp10
1010
100thinsp
100thinsp
10
10
100thinsp
100thinsp
100thinsp
100thinsp
100thinsp
100thinsp100thinsp
100thinsp
10
100thinsp
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for large long-distance flows
Although there are other transport protocols available TCP remains the workhorse of the Internet including for data-
intensive scienceUsing TCP to support the sustained long distance high data-
rate flows of data-intensive science requires an error-free network
Why error-freeTCP is a ldquofragile workhorserdquo It is very sensitive to packet loss (due to bit errors)ndash Very small packet loss rates on these paths result in large decreases
in performance)ndash A single bit error will cause the loss of a 1-9 KBy packet (depending
on the MTU size) as there is no FEC at the IP level for error correctionbull This puts TCP back into ldquoslow startrdquo mode thus reducing throughput
22
Transportbull The reason for TCPrsquos sensitivity to packet loss is that the
slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internetndash Packet loss is seen by TCPrsquos congestion control algorithms as
evidence of congestion so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion leading to network throughput collapse)
ndash Network link errors also cause packet loss so these congestion avoidance algorithms come into play with dramatic effect on throughput in the wide area network ndash hence the need for ldquoerror-freerdquo
23
Transport Impact of packet loss on TCPOn a 10 Gbs LAN path the impact of low packet loss rates is
minimalOn a 10Gbs WAN path the impact of low packet loss rates is
enormous (~80X throughput reduction on transatlantic path)
Implications Error-free paths are essential for high-volume long-distance data transfers
Throughput vs increasing latency on a 10Gbs link with 00046 packet loss
Reno (measured)
Reno (theory)
H-TCP(measured)
No packet loss
(see httpfasterdataesnetperformance-testingperfso
nartroubleshootingpacket-loss)
Network round trip time ms (corresponds roughly to San Francisco to London)
10000
9000
8000
7000
6000
5000
4000
3000
2000
1000
0
Thro
ughp
ut M
bs
24
Transport Modern TCP stackbull A modern TCP stack (the kernel implementation of the TCP
protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])ndash This is done using mechanisms that more quickly increase back to full
speed after an error forces a reset to low bandwidth
ldquoBinary Increase Congestionrdquo control algorithm impact
Note that BIC reaches max throughput much faster than older algorithms (from Linux 2619 the
default is CUBIC a refined version of BIC designed for high bandwidth
long paths)
25
Transport Modern TCP stackEven modern TCP stacks are only of some help in the face of
packet loss on a long path high-speed network
bull For a detailed analysis of the impact of packet loss on various TCP implementations see ldquoAn Investigation into Transport Protocols and Data Transport Applications Over High Performance Networksrdquo chapter 8 (ldquoSystematic Tests of New-TCP Behaviourrdquo) by Yee-Ting Li University College London (PhD thesis) httpwwwslacstanfordedu~ytlthesispdf
Reno (measured)
Reno (theory)
H-TCP (CUBIC refinement)(measured)
Throughput vs increasing latency on a 10Gbs link with 00046 packet loss(tail zoom)
Roundtrip time ms (corresponds roughly to San Francisco to London)
1000
900800700600500400300200100
0
Thro
ughp
ut M
bs
26
3) Monitoring and testingThe only way to keep multi-domain international scale networks error-free is to test and monitor continuously
end-to-end to detect soft errors and facilitate their isolation and correction
perfSONAR provides a standardize way to test measure export catalogue and access performance data from many different network domains (service providers campuses etc)
bull perfSONAR is a community effort tondash define network management data exchange protocols andndash standardized measurement data formats gathering and archiving
perfSONAR is deployed extensively throughout LHC related networks and international networks and at the end sites(See [fasterdata] [perfSONAR] and [NetSrv])
ndash There are now more than 1000 perfSONAR boxes installed in N America and Europe
27
perfSONARThe test and monitor functions can detect soft errors that limit
throughput and can be hard to find (hard errors faults are easily found and corrected)
Soft failure examplebull Observed end-to-end performance degradation due to soft failure of single optical line card
Gb
s
normal performance
degrading performance
repair
bull Why not just rely on ldquoSNMPrdquo interface stats for this sort of error detectionbull not all error conditions show up in SNMP interface statisticsbull SNMP error statistics can be very noisybull some devices lump different error counters into the same bucket so it can be very
challenging to figure out what errors to alarm on and what errors to ignorebull though ESnetrsquos Spectrum monitoring system attempts to apply heuristics to do this
bull many routers will silently drop packets - the only way to find that is to test through them and observe loss using devices other than the culprit device
one month
28
perfSONARThe value of perfSONAR increases dramatically as it is
deployed at more sites so that more of the end-to-end (app-to-app) path can characterized across multiple network domains provides the only widely deployed tool that can monitor circuits end-
to-end across the different networks from the US to Europendash ESnet has perfSONAR testers installed at every PoP and all but the
smallest user sites ndash Internet2 is close to the same
bull perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) and the protocol is implemented using SOAP XML messages
29
4) System software evolution and optimizationOnce the network is error-free there is still the issue of
efficiently moving data from the application running on a user system onto the network
bull Host TCP tuningbull Modern TCP stack (see above)bull Other issues (MTU etc)
bull Data transfer tools and parallelism
bull Other data transfer issues (firewalls etc)
30
41) System software tuning Host tuning ndash TCPbull ldquoTCP tuningrdquo commonly refers to the proper configuration of
TCP windowing buffers for the path lengthbull It is critical to use the optimal TCP send and receive socket
buffer sizes for the path (RTT) you are using end-to-endDefault TCP buffer sizes are typically much too small for
todayrsquos high speed networksndash Until recently default TCP sendreceive buffers were typically 64 KBndash Tuned buffer to fill CA to NY 1 Gbs path 10 MB
bull 150X bigger than the default buffer size
31
System software tuning Host tuning ndash TCPbull Historically TCP window size tuning parameters were host-
global with exceptions configured per-socket by applicationsndash How to tune is a function of the application and the path to the
destination so potentially a lot of special cases
Auto-tuning TCP connection buffer size within pre-configured limits helps
Auto-tuning however is not a panacea because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (eg international) paths
32
System software tuning Host tuning ndash TCP
Throughput out to ~9000 km on a 10Gbs network32 MBy (auto-tuned) vs 64 MBy (hand tuned) TCP window size
hand tuned to 64 MBy window
Roundtrip time ms (corresponds roughlyto San Francisco to London)
path length
10000900080007000600050004000300020001000
0
Thro
ughp
ut M
bs
auto tuned to 32 MBy window
33
42) System software tuning Data transfer toolsParallelism is key in data transfer tools
ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection
bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)
ndash Several tools offer parallel transfers (see below)
Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN
transfersndash Many tools and protocols assume latencies typical of a LAN
environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long
path networks
bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more
than about 500 Mbs
34
System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL
RTT = 53 ms network capacity = 10GbpsTool Throughput
bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology
bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase
bull this helps rsync too
35
System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-
performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open
ports) ssh etc The newer Globus Online incorporates all of these and small file
support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community
outside of HEP
36
System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach
ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node
ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and
httpmonalisacernchFDT
37
44) System software tuning Other issuesFirewalls are anathema to high-peed data flows
ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for
TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo
Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf
bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning
bull Defaults are usually fine for 1GE but 10GE often requires additional tuning
ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo
([HPBulk])
5) Site infrastructure to support data-intensive scienceThe Science DMZ
With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the
bottleneckThe site network (LAN) typically provides connectivity for local
resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network
and the local area site network is critical for large-scale data movement
Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks
for business and small data-flow purposes usually donrsquot work for large-scale data flows
bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data
flows
39
The Science DMZTo provide high data-rate access to local resources the site
LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high
speed data path all the way back to the source
40
The Science DMZThe ScienceDMZ concept
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and
rapid fault isolation typically perfSONAR (see [perfSONAR] and below)
A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)
This is so important it was a requirement for last round of NSF CC-NIE grants
41
The Science DMZ
(See httpfasterdataesnetscience-dmz
and [SDMZ] for a much more complete
discussion of the various approaches)
campus siteLAN
high performanceData Transfer Node
computing cluster
cleanhigh-bandwidthWAN data path
campussiteaccess to
Science DMZresources is via the site firewall
secured campussiteaccess to Internet
border routerWAN
Science DMZrouterswitch
campus site
Science DMZ
Site DMZ WebDNS
Mail
network monitoring and testing
A WAN-capable device
per-servicesecurity policycontrol points
site firewall
dedicated systems built and
tuned for wide-area data transfer
42
6) Data movement and management techniquesAutomated data movement is critical for moving 500
terabytesday between 170 international sites In order to effectively move large amounts of data over the
network automated systems must be used to manage workflow and error recovery
bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers
bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)
43
Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the
analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates
compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day
bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10
petabytes of datayear in order to accomplish its science
bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial
44
DDMAgent
DDMAgent
ATLAS production
jobs
Regional production
jobs
User Group analysis jobs
Data Service
Task Buffer(job queue)
Job Dispatcher
PanDA Server(task management)
Job Broker
Policy(job type priority)
ATLA
S Ti
er 1
Data
Cen
ters
11 s
ites
scat
tere
d ac
ross
Euro
pe N
orth
Am
erica
and
Asia
in
aggr
egat
e ho
ld 1
copy
of a
ll dat
a an
d pr
ovide
the
work
ing
data
set f
or d
istrib
ution
to T
ier 2
cen
ters
for a
nalys
isDistributed
Data Manager
Pilot Job(Panda job
receiver running under the site-
specific job manager)
Grid Scheduler
Site Capability Service
CERNATLAS detector
Tier 0 Data Center(1 copy of all data ndash
archival only)
Job resource managerbull Dispatch a ldquopilotrdquo job manager - a
Panda job receiver - when resources are available at a site
bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA
bull Similar to the Condor Glide-in approach
Site status
ATLAS analysis sites(eg 70 Tier 2 Centers in
Europe North America and SE Asia)
DDMAgent
DDMAgent
1) Schedules jobs initiates data movement
2) DDM locates data and moves it to sites
This is a complex system in its own right called DQ2
3) Prepares the local resources to receive Panda jobs
4) Jobs are dispatched when there are resources available and when the required data is
in place at the site
Thanks to Michael Ernst US ATLAS technical lead for his assistance with this
diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)
The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday
CERN
Try to move the job to where the data is else move data and job to where
resources are available
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe N America and SE Asia generate network
data movement of 730 TByday ~68Gbs
Accumulated data volume on disk
730 TBytesday
PanDA manages 120000ndash140000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movementgoing on 24 hrday 9+ monthsyr
that networks must support in order to enable the large-scale science of the LHC
0
50
100
150
Peta
byte
s
four years
0
50000
100000
type
2jo
bs
0
50000
type
1jo
bs
one year
one year
one year
46
Building an LHC-scale production analysis system In order to debug and optimize the distributed system that
accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in
ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC
production
47
Ramp-up of LHC traffic in ESnet
(est of ldquosmallrdquo scale traffic)
LHC
turn
-on
LHC data systemtesting
LHC operationThe transition from testing to operation
was a smooth continuum due toat-scale testing ndash a process that took
more than 5 years
48
6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument
to data centers ndash a dedicated purpose-built infrastructure is needed
bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to
the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the
Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward
exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community
bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by
bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
ndash that is only LHC data and compute servers are connected to the OPN
50
The LHC OPN ndash Optical Private Network
UK-T1_RAL
NDGF
FR-CCIN2P3
ES-PIC
DE-KIT
NL-T1
US-FNAL-CMS
US-T1-BNL
CA-TRIUMF
TW-ASCG
IT-NFN-CNAF
CH-CERNLHCOPN physical (abbreviated)
LHCOPN architecture
51
The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1
centers data transfer was to use dedicated physical 10G circuits
Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than
5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)
ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)
bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism
bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose
53
The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure
designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each
involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc
ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure
that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC
54
ESnetUSA
Chicago
New YorkBNL-T1
Internet2USA
Harvard
CANARIECanada
UVic
SimFraU
TRIUMF-T1
UAlb UTorMcGilU
Seattle
TWARENTaiwan
NCU NTU
ASGCTaiwan
ASGC-T1
KERONET2Korea
KNU
LHCONE VRF domain
End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1
Regional RampE communication nexus
Data communication links 10 20 and 30 Gbs
See httplhconenet for details
NTU
Chicago
LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity
NORDUnetNordic
NDGF-T1aNDGF-T1a NDGF-T1c
DFNGermany
DESYGSI DE-KIT-T1
GARRItaly
INFN-Nap CNAF-T1
RedIRISSpain
PIC-T1
SARANetherlands
NIKHEF-T1
RENATERFrance
GRIF-IN2P3
Washington
CUDIMexico
UNAM
CC-IN2P3-T1Sub-IN2P3
CEA
CERNGeneva
CERN-T1
SLAC
GLakes
NE
MidWSoW
Geneva
KISTIKorea
TIFRIndia
India
Korea
FNAL-T1
MIT
CaltechUFlorida
UNebPurU
UCSDUWisc
UltraLightUMich
Amsterdam
GEacuteANT Europe
April 2012
55
The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because
ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for
use by the LHC collaboration that cannot be made available for general RampE traffic
bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance
bull See LHCONEnet
LHCONE is one part of the network infrastructure that supports the LHC
CERN rarrT1 miles kms
France 350 565
Italy 570 920
UK 625 1000
Netherlands 625 1000
Germany 700 1185
Spain 850 1400
Nordic 1300 2100
USA ndash New York 3900 6300
USA - Chicago 4400 7100
Canada ndash BC 5200 8400
Taiwan 6100 9850
CERN Computer Center
The LHC Optical Private Network
(LHCOPN)
LHC Tier 1Data Centers
LHC Tier 2 Analysis Centers
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups Universities
physicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
The LHC Open Network
Environment(LHCONE)
50 Gbs (25Gbs ATLAS 25Gbs CMS)
detector
Level 1 and 2 triggers
Level 3 trigger
O(1-10) meter
O(10-100) meters
O(1) km
1 PBs
500-10000 km
This is intended to indicate that the physics
groups now get their datawherever it is most readily
available
A Network Centric View of the LHC
Taiwan Canada USA-Atlas USA-CMS
Nordic
UK
Netherlands Germany Italy
Spain
FranceCERN
57
7) New network servicesPoint-to-Point Virtual Circuit Service
Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly
consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of
systemsrdquondash Break up the task of massive data analysis and use data compute and
storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements
A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work
well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm
58
Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual
circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism
bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path
ndash MPLS and OpenFlow are examples of this and both can transport IP packets
ndash Most modern Internet routers have this type of functionality
bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths
when they are set up
59
Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service
(For more information contact the project lead Chin Guok chinesnet)
bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references
bull OSCARS received a 2013 ldquoRampD 100rdquo award
60
End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo
ndash Sites for the most part
bull How are the circuits usedndash End system to end system IP
bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure
networks
ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future
ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP
protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
ndash Point-to-point connection between routing instance ndash eg BGP at the end points
bull Essentially this is how all current circuits are used from one site router to another site router
ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters
61
End User View of Circuits ndash How They Use Thembull When are the circuits used
ndash Mostly to solve a specific problem that the general infrastructure cannot
bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering
network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains
involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET
(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains
63
Inter-Domain Control Protocolbull There are two realms involved
1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains
2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains
FNAL (AS3152)[US]
ESnet (AS293)[US]
GEANT (AS20965)[Europe]
DFN (AS680)[Germany]
DESY (AS1754)[Germany]
Topology exchange
VC setup request
Local InterDomain
Controller
Local IDC
Local IDC
Local IDC
Local IDC
VC setup request
VC setup request
VC setup request
OSCARS
User source
User destination
VC setup request
data plane connection helper at each domain ingressegress point
data plane connection helper at each domain ingressegress point
1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to
domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process
The end-to-end virtual circuit
AutoBAHN
64
Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved
into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated
Facility - an international virtual organization that promotes the paradigm of lambda networking)
bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet
65
8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the
network to support data-intensive sciencendash With each generation of network transport technology
bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and
application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo
end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of
the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities
includebull experiments in the 1990s in using parallel disk IO and parallel network IO
together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths
bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs
66
Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects
are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build
a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at
httpfasterdataesnet and contains contributions from several organizations
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional
centers) has worked wellndash It decentralizes costs and involves many countries directly in the
telescope infrastructurendash It divides up the network load especially on the expensive trans-
ocean linksndash It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKARegardless of distributed vs centralized working data
repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source
to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a
centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue
bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC
ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
References
[OSCARS] “Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System,” Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
“Network Services for High Performance Distributed Computing and Data Management,” W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
“Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service,” William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See “perfSONAR: Instantiating a Global Network Measurement Framework,” B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] “100G and beyond with digital coherent signal processing,” K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See “Achieving a Science DMZ” at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Science DMZ
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more/
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
Also http://www.perfsonar.net and http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
5
HEP as a Prototype for Data-Intensive Science
[Figure: HEP data volumes for the leading experiments, with Belle-II estimates; the plot marks the period when the LHC was down for upgrade. Data courtesy of Harvey Newman, Caltech, and Richard Mount, SLAC, and the Belle II CHEP 2012 presentation.]
6
HEP as a Prototype for Data-Intensive Science
• What is the significance to the network of this increase in data?
• Historically, the use of the network by science has tracked the size of the data sets used by science ("HEP data collected," 2012 estimate – the green line in the previous slide).
7
HEP as a Prototype for Data-Intensive Science
As the instrument size and data volume have gone up, the methodology for analyzing the data has had to evolve:
– The data volumes from the early experiments were low enough that the data was analyzed locally.
– As the collaborations grew to several institutions and the data analysis was shared among them, the data was distributed by shipping tapes around.
– As the collaboration sizes grew and became intercontinental, the HEP community began to use networks to coordinate the collaborations and eventually to send the data around.
The LHC data model assumed network transport of all data from the beginning (as opposed to shipping media).
Similar changes are occurring in most science disciplines.
8
HEP as a Prototype for Data-Intensive Science
• Two major proton experiments (detectors) at the LHC: ATLAS and CMS.
• ATLAS is designed to observe a billion (1x10^9) collisions/sec, with a data rate out of the detector of more than 1,000,000 gigabytes/sec (1 PBy/s).
• A set of hardware and software filters at the detector reduces the output data rate to about 25 Gb/s, which must be transported, managed, and analyzed to extract the science.
– The output data rate for CMS is about the same, for a combined 50 Gb/s that is distributed to physics groups around the world, 7x24 x ~9 months/yr.
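To put a sustained 25-50 Gb/s stream in terms of stored volume, a quick back-of-the-envelope conversion (an illustration only; the rates are the ones quoted above):

def tbytes_per_day(gbits_per_sec):
    # Convert a sustained rate in gigabits/sec to terabytes/day.
    bytes_per_sec = gbits_per_sec * 1e9 / 8.0
    return bytes_per_sec * 86400 / 1e12

print(tbytes_per_day(25))   # one experiment: ~270 TB/day
print(tbytes_per_day(50))   # ATLAS + CMS combined: ~540 TB/day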
The LHC data management model involves a world-wide collection of centers that store, manage, and analyze the data.
[Figure: A network-centric view of the LHC (one of two detectors). The detector output (1 PBy/s, i.e. 8 Pb/s) passes through the Level 1 and 2 triggers (over 1-10 and 10-100 meters) and the Level 3 trigger (~1 km), emerging at 50 Gb/s (25 Gb/s ATLAS + 25 Gb/s CMS) from the CERN Tier 0 computer center. The LHC Optical Private Network (LHCOPN) carries this data 500-10,000 km to the LHC Tier 1 data centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN), which hold the working data (WLCG 2012: 115 PBy tape, 60 PBy disk, 68,000 cores, with 3x data outflow vs. inflow). The LHC Open Network Environment (LHCONE) connects the Tier 1s to the LHC Tier 2 analysis centers, which act as data caches and analysis sites (WLCG 2012: 120 PBy disk, 175,000 cores), and to the many university physics groups; this is intended to indicate that the physics groups now get their data wherever it is most readily available.]
10
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBy/day (~68 Gb/s).
PanDA manages 120,000-140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately here).
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
[Figure: accumulated data volume on disk, 0-150 petabytes over four years, alongside the two PanDA job types (tens of thousands of simultaneous jobs each), shown over one year.]
11
HEP as a Prototype for Data-Intensive Science
The capabilities required to support this scale of data movement involve hardware and software developments at all levels:
1. The underlying network
   1a. Optical signal transport
   1b. Network routers and switches
2. Data transport (TCP is a "fragile workhorse" but still the norm)
3. Network monitoring and testing
4. Operating system evolution
5. New site and network architectures
6. Data movement and management techniques and software
7. New network services
• Technology advances in these areas have resulted in today's state-of-the-art that makes it possible for the LHC experiments to routinely and continuously move data at ~150 Gb/s across three continents.
12
HEP as a Prototype for Data-Intensive Science
• ESnet has been collecting requirements for all DOE science disciplines and instruments that rely on the network for distributed data management and analysis for more than a decade, and formally since 2007 [REQ].
• In this process, certain issues are seen across essentially all science disciplines that rely on the network for significant data transfer, even if the quantities are modest compared to projects like the LHC experiments.
• Therefore, addressing the LHC issues is a useful exercise that can benefit a wide range of science disciplines.
SKA data flow model is similar to the LHC
[Figure: receptors/sensors (~200 km average spacing) feed a correlator/data processor at 93-168 Pb/s; the correlator output (400 Tb/s) travels ~1,000 km to a supercomputer; the supercomputer output (100 Gb/s, from the SKA RFI) travels ~25,000 km (Perth to London via the USA) or ~13,000 km (South Africa to London) to a European distribution point, which feeds regional data centers and university astronomy groups. The distribution tier is hypothetical, based on the LHC experience, and these numbers are based on modeling done prior to splitting the SKA between S. Africa and Australia.]
14
Foundations of data-intensive science
• This talk looks briefly at the nature of the advances in technologies, software, and methodologies that have enabled LHC data management and analysis.
– The points 1a and 1b on optical transport and router technology are included in the slides for completeness, but I will not talk about them. They were not really driven by the needs of the LHC, but they were opportunistically used by the LHC.
• Much of the remainder of the talk is a tour through ESnet's network performance knowledge base (fasterdata.es.net).
– Also included are:
• the LHC ATLAS data management and analysis approach that generates and relies on very large network data utilization,
• and an overview of how R&E networks have evolved to accommodate the LHC traffic.
1) Underlying network issues
At the core of our ability to transport the volume of data that we must deal with today, and to accommodate future growth, are advances in optical transport technology and router technology.
We face a continuous growth of data to transport: ESnet has seen exponential growth in our traffic every year since 1990 (our traffic grows by a factor of 10 about once every 47 months).
[Figure: ESnet accepted traffic in petabytes/month over 13 years.]
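As a quick check of what a factor of 10 every 47 months implies year over year (illustration only):

growth_per_month = 10 ** (1 / 47)        # factor of 10 spread over 47 months
growth_per_year = growth_per_month ** 12
print(round(growth_per_year, 2))         # ~1.8, i.e. traffic grows roughly 1.8x every year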
16
We face a continuous growth of data transport
• The LHC data volume is predicted to grow 10-fold over the next 10 years.
• New generations of instruments – for example the Square Kilometer Array radio telescope and ITER (the international fusion experiment) – will generate more data than the LHC.
• In response, ESnet and most large R&E networks have built 100 Gb/s (per optical channel) networks.
– ESnet's new network – ESnet5 – is complete and provides 44 x 100 Gb/s (4.4 terabits/sec, or 4,400 gigabits/sec) in optical channels across the entire ESnet national footprint.
– Initially, one of these 100 Gb/s channels is configured to replace the current 4 x 10 Gb/s IP network.
• What has made this possible?
17
1a) Optical Network Technology
Modern optical transport systems (DWDM = dense wave division multiplexing) use a collection of technologies called "coherent optical" processing to achieve more sophisticated optical modulation, and therefore higher data density per signal transport unit (symbol), providing 100 Gb/s per wave (optical channel).
– Optical transport uses dual polarization-quadrature phase shift keying (DP-QPSK) technology with coherent detection [OIF1].
• Dual polarization: two independent optical signals at the same frequency on orthogonal polarizations → reduces the symbol rate by half.
• Quadrature phase shift keying: encodes data by changing the phase of the signal relative to the optical carrier → further reduces the symbol rate by half (sends twice as much data per symbol).
• Together, DP and QPSK reduce the required symbol rate by a factor of 4, which allows a 100G payload (plus overhead) to fit into 50 GHz of spectrum.
• The actual transmission rate is about 10% higher, to include FEC data.
– This is a substantial simplification of the optical technology involved – see the TNC 2013 paper and Chris Tracy's NANOG talk for details [Tracy1] and [Rob1].
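The factor-of-4 reduction can be checked with a few lines of arithmetic (a sketch; the ~10% FEC overhead figure is the one given above, the rest is unit conversion):

payload_gbps = 100.0
line_rate_gbps = payload_gbps * 1.10      # ~10% added for FEC overhead
bits_per_symbol = 2 * 2                   # QPSK gives 2 bits/symbol; dual polarization doubles it
symbol_rate_gbaud = line_rate_gbps / bits_per_symbol
print(symbol_rate_gbaud)                  # ~27.5 Gbaud, which fits in a 50 GHz channel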
Optical Network Technology
The ESnet5 optical network uses WaveLogic™ coherent technology to provide 100 Gb/s per wave:
– 88 waves (optical channels) at 100 Gb/s each
• wave capacity shared equally with Internet2
– ~13,000 miles / 21,000 km of lit fiber
– 280 optical amplifier sites
– 70 optical add/drop sites (where routers can be inserted)
• 46 100G add/drop transponders
• 22 100G re-gens across the wide area
[Figure: map of ESnet5 optical network nodes (SUNN, SACR, DENV, KANS, SALT, BOIS, SEAT, LOSA, LASV, ELPA, ALBU, ATLA, WASH, NEWY, BOST, NASH, CHAT, CLEV, EQCH, CHIC, STLO, LOUI, JACK, and others), including the Long Island MAN and ANI Testbed; geography is only representational.]
19
1b) Network routers and switches
ESnet5 routing (IP layer 3) is provided by Alcatel-Lucent 7750 routers with 100 Gb/s client interfaces:
– 17 routers with 100G interfaces (several more in a test environment)
– 59 layer-3 100 GigE interfaces; 8 customer-owned 100G routers
– 7 100G interconnects with other R&E networks at Starlight (Chicago), MAN LAN (New York), and Sunnyvale (San Francisco).
20
[Figure: The Energy Sciences Network, ESnet5 (Fall 2013) – the 100G backbone and metro area circuits connecting the DOE laboratories and user facilities (LBNL, NERSC, JGI, SLAC, LLNL, SNLL, GA, SDSC, PNNL, INL, LANL, SNLA, AMES, FNAL, ANL, ORNL, BNL, PPPL, JLAB, MIT/PSFC, NREL, LIGO, and others) with 1, 10, 40, and 100 Gb/s links and site-provided circuits, a 100G testbed spanning the SF Bay Area, Chicago, New York, and Amsterdam, and commercial, US, and international R&E peerings. Geographical representation is approximate.]
2) Data transport: the limitations of TCP must be addressed for large, long-distance flows
Although there are other transport protocols available, TCP remains the workhorse of the Internet, including for data-intensive science.
Using TCP to support the sustained, long-distance, high data-rate flows of data-intensive science requires an error-free network.
Why error-free? TCP is a "fragile workhorse": it is very sensitive to packet loss (due to bit errors).
– Very small packet loss rates on these paths result in large decreases in performance.
– A single bit error will cause the loss of a 1-9 KBy packet (depending on the MTU size), as there is no FEC at the IP level for error correction.
• This puts TCP back into "slow start" mode, thus reducing throughput.
22
Transport
• The reason for TCP's sensitivity to packet loss is the slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internet.
– Packet loss is seen by TCP's congestion control algorithms as evidence of congestion, so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion, leading to network throughput collapse).
– Network link errors also cause packet loss, so these congestion avoidance algorithms come into play with dramatic effect on throughput in the wide area network – hence the need for "error-free."
23
Transport: Impact of packet loss on TCP
On a 10 Gb/s LAN path, the impact of low packet loss rates is minimal.
On a 10 Gb/s WAN path, the impact of low packet loss rates is enormous (~80X throughput reduction on a transatlantic path).
Implications: error-free paths are essential for high-volume, long-distance data transfers.
[Figure: throughput (0-10,000 Mb/s) vs. increasing network round trip time (out to roughly San Francisco-London) on a 10 Gb/s link with 0.0046% packet loss, for Reno (measured), Reno (theory), H-TCP (measured), and the no-packet-loss case. See http://fasterdata.es.net/performance-testing/perfsonar/troubleshooting/packet-loss]
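To first order, the behavior in the figure is captured by the well-known Mathis bound, throughput <= (MSS/RTT) x (C/sqrt(loss)). The sketch below (my own illustration, with C = 1 and MSS = 1460 bytes) uses the 0.0046% loss rate cited above to show why the same loss rate that is harmless on a short path is crippling on a transatlantic one:

from math import sqrt

def mathis_bound_mbps(mss_bytes, rtt_ms, loss_rate):
    # First-order upper bound on single-stream TCP throughput, in Mb/s.
    return (mss_bytes * 8.0 / (rtt_ms / 1000.0)) / sqrt(loss_rate) / 1e6

loss = 0.0046 / 100.0
for rtt_ms in (1, 10, 88):                  # metro, regional, ~transatlantic
    print(rtt_ms, round(mathis_bound_mbps(1460, rtt_ms, loss), 1))
# The bound falls as 1/RTT: ~88x lower at 88 ms than at 1 ms, consistent with
# the ~80x reduction quoted above for a transatlantic path.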
24
Transport: Modern TCP stack
• A modern TCP stack (the kernel implementation of the TCP protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk]).
– This is done using mechanisms that more quickly increase back to full speed after an error forces a reset to low bandwidth.
[Figure: impact of the "Binary Increase Congestion" (BIC) control algorithm. Note that BIC reaches maximum throughput much faster than older algorithms. (From Linux 2.6.19 the default is CUBIC, a refined version of BIC designed for high-bandwidth, long paths.)]
25
Transport: Modern TCP stack
Even modern TCP stacks are only of some help in the face of packet loss on a long-path, high-speed network.
• For a detailed analysis of the impact of packet loss on various TCP implementations, see "An Investigation into Transport Protocols and Data Transport Applications Over High Performance Networks," chapter 8 ("Systematic Tests of New-TCP Behaviour"), by Yee-Ting Li, University College London (PhD thesis): http://www.slac.stanford.edu/~ytl/thesis.pdf
[Figure: tail zoom of throughput (0-1,000 Mb/s) vs. increasing round trip time (out to roughly San Francisco-London) on a 10 Gb/s link with 0.0046% packet loss, for Reno (measured), Reno (theory), and H-TCP (a CUBIC refinement, measured).]
26
3) Monitoring and testing
The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction.
perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.).
• perfSONAR is a community effort to:
– define network management data exchange protocols, and
– standardize measurement data formats, gathering, and archiving.
perfSONAR is deployed extensively throughout LHC-related networks and international networks and at the end sites (see [fasterdata], [perfSONAR], and [NetServ]).
– There are now more than 1,000 perfSONAR boxes installed in N. America and Europe.
27
perfSONAR
The test and monitor functions can detect soft errors that limit throughput and can be hard to find (hard errors/faults are easily found and corrected).
Soft failure example:
• Observed end-to-end performance degradation due to soft failure of a single optical line card.
[Figure: one month of throughput in Gb/s, showing normal performance, a period of degrading performance, and recovery after repair.]
• Why not just rely on "SNMP" interface stats for this sort of error detection?
– not all error conditions show up in SNMP interface statistics;
– SNMP error statistics can be very noisy;
– some devices lump different error counters into the same bucket, so it can be very challenging to figure out what errors to alarm on and what errors to ignore (though ESnet's Spectrum monitoring system attempts to apply heuristics to do this);
– many routers will silently drop packets – the only way to find that is to test through them and observe loss using devices other than the culprit device.
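A minimal sketch of the underlying idea – test through the path from an independent host and alarm on observed loss, rather than trusting the device's own counters (illustration only; in practice this is what the perfSONAR test tools do, on a schedule, between many measurement points; the host name below is hypothetical):

import re, subprocess, time

def loss_percent(host, count=100):
    # Return the packet loss percentage reported by ping for an end-to-end probe.
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True).stdout
    m = re.search(r"([\d.]+)% packet loss", out)
    return float(m.group(1)) if m else 100.0

while True:
    loss = loss_percent("ps-test.example.org")   # hypothetical far-end test host
    if loss > 0.0:
        print("ALERT: %.2f%% loss on end-to-end path - possible soft failure" % loss)
    time.sleep(3600)                             # soft failures persist, so hourly re-tests suffice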
28
perfSONAR
The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains.
It provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe.
– ESnet has perfSONAR testers installed at every PoP and all but the smallest user sites; Internet2 is close to the same.
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages.
29
4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network:
• Host TCP tuning
• Modern TCP stack (see above)
• Other issues (MTU, etc.)
• Data transfer tools and parallelism
• Other data transfer issues (firewalls, etc.)
30
4.1) System software tuning: Host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length.
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using, end-to-end.
• Default TCP buffer sizes are typically much too small for today's high-speed networks.
– Until recently, default TCP send/receive buffers were typically 64 KB.
– Tuned buffer to fill a CA-to-NY 1 Gb/s path: 10 MB, i.e. 150X bigger than the default buffer size.
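The required buffer size is just the bandwidth-delay product of the path. A two-line check of the CA-to-NY example above (the ~80 ms RTT value is my assumption for a coast-to-coast path):

def bdp_bytes(bandwidth_gbps, rtt_ms):
    # Bandwidth-delay product: bytes in flight needed to keep the path full.
    return bandwidth_gbps * 1e9 / 8.0 * (rtt_ms / 1000.0)

print(bdp_bytes(1, 80) / 1e6)   # ~10 MB for 1 Gb/s at 80 ms -- about 150x the old 64 KB default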
31
System software tuning: Host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications.
– How to tune is a function of the application and the path to the destination, so potentially a lot of special cases.
Auto-tuning the TCP connection buffer size within pre-configured limits helps.
Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths.
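Where the auto-tuning ceiling is too low, an application can still request large per-socket buffers explicitly, as in this sketch (the OS must also permit the size, e.g. via the net.core.rmem_max/wmem_max and net.ipv4.tcp_rmem/tcp_wmem limits on Linux):

import socket

BUF = 64 * 1024 * 1024   # 64 MB, the hand-tuned window size shown on the next slide

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF)
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))   # the kernel may clamp (or double) this value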
32
System software tuning: Host tuning – TCP
[Figure: throughput out to ~9,000 km (roughly San Francisco to London) on a 10 Gb/s network, comparing a 32 MBy (auto-tuned) versus a 64 MBy (hand-tuned) TCP window size.]
33
4.2) System software tuning: Data transfer tools
Parallelism is key in data transfer tools.
– It is much easier to achieve a given performance level with multiple parallel connections than with one connection.
• This is because the OS is very good at managing multiple threads and less good at sustained, maximum performance of a single thread (the same is true for disks).
– Several tools offer parallel transfers (see below).
Latency tolerance is critical.
– Wide area data transfers have much higher latency than LAN transfers.
– Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds); for example, SCP/SFTP and HPSS mover protocols work very poorly in long-path networks.
• Disk performance
– In general, you need a RAID array or parallel disks (as in FDT) to get more than about 500 Mb/s.
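A toy sketch of the parallelism idea: split the data across several TCP connections, each with its own buffer and its own loss-recovery state, driven by separate threads (illustration only; the framing and the receiver are hypothetical, and real tools such as GridFTP and FDT also parallelize the disk I/O):

import socket
from concurrent.futures import ThreadPoolExecutor

def send_slice(host, port, data, start, end):
    # Send one slice of the buffer over its own TCP connection.
    with socket.create_connection((host, port)) as s:
        s.sendall(start.to_bytes(8, "big"))   # tell the (hypothetical) receiver where this slice goes
        s.sendall(data[start:end])

def parallel_send(host, port, data, streams=8):
    step = len(data) // streams + 1
    with ThreadPoolExecutor(max_workers=streams) as pool:
        for offset in range(0, len(data), step):
            pool.submit(send_slice, host, port, data, offset, offset + step)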
34
System software tuning: Data transfer tools
Using the right tool is very important.
Sample results: Berkeley, CA to Argonne, IL; RTT = 53 ms, network capacity = 10 Gbps.

  Tool                    Throughput
  scp                     140 Mbps
  patched scp (HPN)       1.2 Gbps
  ftp                     1.4 Gbps
  GridFTP, 4 streams      5.4 Gbps
  GridFTP, 8 streams      6.6 Gbps

Note that to get more than about 1 Gbps (125 MB/s) disk to disk requires using RAID technology.
• PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSH: http://www.psc.edu/networking/projects/hpn-ssh – a significant performance increase (this helps rsync too).
35
System software tuning: Data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems.
– Parallel streams, buffer tuning, and help in getting through firewalls (open ports), ssh, etc.
– The newer Globus Online incorporates all of these, plus small-file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP.
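For example, a GridFTP transfer with parallel streams and an explicit TCP buffer can be driven from a script roughly as follows (a sketch; the endpoints and paths are hypothetical, and the exact option set should be checked against your globus-url-copy version):

import subprocess

cmd = [
    "globus-url-copy",
    "-vb",              # report transfer performance
    "-p", "8",          # 8 parallel TCP streams
    "-tcp-bs", "16M",   # per-stream TCP buffer size
    "file:///data/run42.tar",
    "gsiftp://dtn.example.org/scratch/run42.tar",
]
subprocess.run(cmd, check=True)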
36
System software tuning: Data transfer tools
Also see Caltech's FDT (Faster Data Transfer) approach.
– Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node.
– Explicit parallel use of multiple disks.
– Can fill 100 Gb/s paths.
– See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT
37
4.4) System software tuning: Other issues
Firewalls are anathema to high-speed data flows.
– Many firewalls can't handle >1 Gb/s flows:
• they are designed for a large number of low-bandwidth flows;
• some firewalls even strip out TCP options that allow for TCP buffers >64 KB.
– Stateful firewalls have inherent problems that inhibit high throughput. See Jason Zurawski's "Say Hello to your Frienemy – The Firewall" and http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues:
– Large MTUs (several issues)
– NIC tuning: defaults are usually fine for 1GE, but 10GE often requires additional tuning
– Other OS tuning knobs
– See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])
5) Site infrastructure to support data-intensive science: The Science DMZ
With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck.
The site network (LAN) typically provides connectivity for the local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science.
– Therefore a high-performance interface between the wide area network and the local area site network is critical for large-scale data movement.
Campus network infrastructure is typically not designed to handle the flows of large-scale science.
– The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows:
• firewalls, proxy servers, low-cost switches, and so forth,
• none of which will allow high-volume, high-bandwidth, long-distance data flows.
39
The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large data volume, high round trip time (RTT) (international paths) characteristics of the wide area network (WAN) flows (see [DIS]).
– Otherwise the site will impose poor performance on the entire high-speed data path, all the way back to the source.
40
The Science DMZ
The Science DMZ concept: the compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy.
– Outside the site firewall – hence the term "Science DMZ."
– With dedicated systems built and tuned for wide-area data transfer.
– With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below).
– With a security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.).
This is so important it was a requirement for the last round of NSF CC-NIE grants.
41
The Science DMZ
(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)
[Figure: Science DMZ architecture. The WAN border router connects to a Science DMZ router/switch (a WAN-capable device) outside the site firewall, giving a clean, high-bandwidth WAN data path to dedicated, high-performance data transfer nodes built and tuned for wide-area data transfer, to the computing cluster, and to network monitoring and testing systems, with per-service security policy control points. The campus/site LAN, the site DMZ (web, DNS, mail), and secured campus/site access to the Internet remain behind the site firewall; campus/site access to Science DMZ resources is via the site firewall.]
42
6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data, at a rate of about 25 Gb/s, is sent to 10 national Tier 1 data centers.
• The Tier 2 sites get a comparable amount of data from the Tier 1s.
– They host the physics groups that analyze the data and do the science.
– They provide most of the compute resources for analysis.
– They cache the data (though this is evolving to remote I/O).
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management.
– The resources and data movement are centrally managed.
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations.
– The system manages 10s of thousands of jobs a day:
• it coordinates data movement of hundreds of terabytes/day, and
• it manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science.
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial.
44
[Figure: The ATLAS PanDA "Production and Distributed Analysis" system uses distributed resources and layers of automation to manage several million jobs/day. ATLAS production jobs, regional production jobs, and user/group analysis jobs enter the PanDA Server (task management) at CERN, which includes the task buffer (job queue), job dispatcher, job broker, data service, and policy (job type priority) components, plus the Distributed Data Manager (a complex system in its own right, called DQ2). The CERN ATLAS detector Tier 0 data center holds one archival copy of all data; the 11 ATLAS Tier 1 data centers, scattered across Europe, North America, and Asia, in aggregate hold one copy of all data and provide the working data set for distribution to Tier 2 centers for analysis; the ATLAS analysis sites (e.g. 70 Tier 2 centers in Europe, North America, and SE Asia) run the jobs. The flow is:
1) PanDA schedules jobs and initiates data movement.
2) The DDM system locates data and moves it to sites.
3) Pilot jobs – PanDA job receivers dispatched via the grid scheduler and site capability service, running under the local site job manager (e.g. Condor, LSF, LCG, ...) and accepting jobs in a standard format from PanDA, similar to the Condor glide-in approach – prepare the local resources to receive PanDA jobs.
4) Jobs are dispatched when there are resources available and when the required data is in place at the site.
The broker tries to move the job to where the data is, else moves data and job to where resources are available.
(Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. Both are at Brookhaven National Lab.)]
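A toy sketch of the brokering rule shown in the diagram ("try to move the job to where the data is, else move data and job to where resources are available"); all names and structures here are illustrative, not PanDA's actual interfaces:

def broker(job, sites):
    # Prefer a site that already holds the input dataset and has free job slots.
    for site in sites:
        if job["dataset"] in site["datasets"] and site["free_slots"] > 0:
            return site["name"], "run where the data is"
    # Otherwise pick the site with the most free slots and replicate the data there first.
    best = max(sites, key=lambda s: s["free_slots"])
    return best["name"], "replicate dataset, then run"

sites = [
    {"name": "BNL",  "datasets": {"data12_8TeV.A"}, "free_slots": 0},
    {"name": "CNAF", "datasets": set(),             "free_slots": 250},
]
print(broker({"dataset": "data12_8TeV.A"}, sites))   # ('CNAF', 'replicate dataset, then run')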
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBy/day (~68 Gb/s).
PanDA manages 120,000-140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately here).
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
[Figure: accumulated data volume on disk, 0-150 petabytes over four years, alongside the two PanDA job types (tens of thousands of simultaneous jobs each), shown over one year.]
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure.
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges."
– Successful testing was required for sites to participate in LHC production.
47
Ramp-up of LHC traffic in ESnet
[Figure: ESnet traffic over time, showing an estimate of "small"-scale traffic, the LHC data system testing period, LHC turn-on, and LHC operation. The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.]
48
6 cont) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN.
– The LHCOPN is a collection of leased 10 Gb/s optical circuits.
– The role of the LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.
• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance.
– The security issues were the primary ones, and were addressed by using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec]) – that is, only LHC data and compute servers are connected to the OPN.
50
The LHC OPN – Optical Private Network
[Figure: LHCOPN physical topology (abbreviated) and architecture – CH-CERN at the hub, connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network
N.B.:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic (and there are about 170 Tier 2 sites).
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems.
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.).
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).
In this way the LHC traffic will use circuits designated by the network engineers.
– This ensures continued good performance for the LHC and ensures that other traffic is not impacted – which is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.
54
[Figure: LHCONE – a global infrastructure for LHC Tier 1 data center to Tier 2 analysis center connectivity (April 2012). Regional VRF domains (ESnet and Internet2 in the USA, CANARIE in Canada, NORDUnet, DFN, GARR, RedIRIS, SARA, RENATER, and GÉANT in Europe, TWAREN and ASGC in Taiwan, KERONET2 and KISTI in Korea, CUDI in Mexico, and others) interconnect at exchange points such as Seattle, Chicago, New York, Washington, Amsterdam, and Geneva, linking the Tier 1 centers (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1) and many Tier 2/Tier 3 end sites (universities and labs such as Harvard, MIT, Caltech, UFlorida, UMich, UCSD, UWisc, SLAC, DESY, GSI, UNAM, TIFR, KNU, and others) over 10, 20, and 30 Gb/s data communication links. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because:
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See LHCONE.net.
LHCONE is one part of the network infrastructure that supports the LHC

  CERN → T1          miles     kms
  France               350      565
  Italy                570      920
  UK                   625    1,000
  Netherlands          625    1,000
  Germany              700    1,185
  Spain                850    1,400
  Nordic             1,300    2,100
  USA – New York     3,900    6,300
  USA – Chicago      4,400    7,100
  Canada – BC        5,200    8,400
  Taiwan             6,100    9,850

[Figure: A network-centric view of the LHC, as shown earlier: the detector output (1 PB/s) passes through the Level 1, 2, and 3 triggers over meters to ~1 km, emerging at 50 Gb/s (25 Gb/s ATLAS + 25 Gb/s CMS) from the CERN Tier 0 computer center; the LHC Optical Private Network (LHCOPN) carries the data 500-10,000 km to the LHC Tier 1 data centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN); and the LHC Open Network Environment (LHCONE) connects the Tier 1s to the LHC Tier 2 analysis centers and university physics groups, so that physics groups get their data wherever it is most readily available.]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to:
– couple existing pockets of code, data, and expertise into "systems of systems";
– break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites.
– See https://www.es.net/about/science-requirements/
A commonly identified need to support this is that networking must be provided as a "service":
– schedulable, with guaranteed bandwidth – as is done with CPUs and disks;
– traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure;
– some network path characteristics may also be specified – e.g. diversity;
– available in the Web Services / Grid Services paradigm.
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
– This is typically done by using a "static" routing mechanism, e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path.
– MPLS and OpenFlow are examples of this, and both can transport IP packets.
– Most modern Internet routers have this type of functionality.
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic.
– The virtual circuits can be directed to specific physical network paths when they are set up.
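A toy sketch of what "schedulable with guaranteed bandwidth" means in practice: a reservation carries endpoints, a bandwidth guarantee, and a time window, and is admitted only if every hour of the window still fits under the link capacity together with the already-booked reservations (illustrative only; this is not the OSCARS API):

from dataclasses import dataclass

@dataclass
class Reservation:
    src: str
    dst: str
    gbps: float
    start_hour: int
    end_hour: int

def admissible(new, existing, link_capacity_gbps):
    # Check every hour of the requested window against the already-committed load.
    for hour in range(new.start_hour, new.end_hour):
        load = sum(r.gbps for r in existing if r.start_hour <= hour < r.end_hour)
        if load + new.gbps > link_capacity_gbps:
            return False
    return True

booked = [Reservation("FNAL", "DESY", 40, 0, 12)]
print(admissible(Reservation("BNL", "CERN", 50, 6, 10), booked, 100))   # True
print(admissible(Reservation("BNL", "CERN", 70, 6, 10), booked, 100))   # False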
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.
60
End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part.
• How are the circuits used?
– End system to end system, IP:
• almost never – very hard unless private address space is used;
• using public address space can result in leaking routes;
• using private address space with multi-homed hosts risks allowing backdoors into secure networks.
– End system to end system, Ethernet (or other) over VLAN – a pseudowire:
• relatively common;
• interesting example: RDMA over VLAN, likely to be popular in the future –
– the SC11 demo of 40G RDMA over the WAN was very successful;
– the CPU load for RDMA is a small fraction of that for IP;
– the guaranteed network characteristics of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fit nicely with circuit services (RDMA performs very poorly on best-effort networks).
– Point-to-point connection between routing instances – e.g. BGP at the end points:
• essentially this is how all current circuits are used, from one site router to another site router;
• typically site-to-site, or to advertise subnets that host clusters, e.g. LHC analysis or data management clusters.
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot.
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering.
Cross-Domain Virtual Circuit Service
Science collaborations typically span multiple network domains (administrative units).
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration, to provide end-to-end circuits.
– E.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.
[Figure: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US], across ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany], to a user destination at DESY (AS1754) [Germany]. Each domain runs a local inter-domain controller (IDC), e.g. OSCARS or AutoBAHN:
1) The domains exchange topology information containing at least potential VC ingress and egress points.
2) A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3) The data plane connection (e.g. an Ethernet VLAN-to-VLAN connection) is facilitated by a helper process at each domain ingress/egress point.]
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system.
Multi-domain circuit setup is not yet a robust production service, but progress is being made.
• See lhcone.net.
65
8) Provide R&D consulting and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science.
– With each generation of network transport technology:
• 155 Mb/s was the norm for high-speed networks in 1995;
• 100 Gb/s – 650 times greater – is the norm today.
– R&D groups involving hardware engineers, computer scientists, and application specialists worked to:
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
• and then do the development necessary for applications to make use of the new capabilities.
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths;
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s.
66
Provide R&D consulting and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations.
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
69
Infrastructure Critical to Science
• The combination of:
– new network architectures in the wide area,
– new network services (such as guaranteed bandwidth virtual circuits),
– cross-domain network error detection and correction,
– redesigning the site LAN to handle high data throughput,
– automation of data movement systems,
– use of appropriate operating system tuning and data transfer tools,
now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated/sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.
71
LHC lessons of possible use to the SKA
The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
– A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
– The technical aspects of building and operating a centralized working data repository –
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time,
• high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites –
mitigate against a single large data center.
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well.
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
– There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
– If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high data volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.
Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" (simulated operation), building up to at-scale data movement well before instrument turn-on.
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
78
References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations
httpwwwperfsonarnet
httppspsperfsonarnet
[REQ] httpswwwesnetaboutscience-requirements
[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010
(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )
[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223
6
HEP as a Prototype for Data-Intensive Science
• What is the significance to the network of this increase in data?
• Historically, the use of the network by science has tracked the size of the data sets used by science
  – "HEP data collected" 2012 estimate (green line) in the previous slide
7
HEP as a Prototype for Data-Intensive Science
As the instrument size and data volume have gone up, the methodology for analyzing the data has had to evolve
  – The data volumes from the early experiments were low enough that the data was analyzed locally
  – As the collaborations grew to several institutions and the data analysis was shared among them, the data was distributed by shipping tapes around
  – As the collaboration sizes grew and became intercontinental, the HEP community began to use networks to coordinate the collaborations and eventually to send the data around
The LHC data model assumed network transport of all data from the beginning (as opposed to shipping media)
Similar changes are occurring in most science disciplines
8
HEP as a Prototype for Data-Intensive Science
• Two major proton experiments (detectors) at the LHC: ATLAS and CMS
• ATLAS is designed to observe a billion (1x10^9) collisions/sec, with a data rate out of the detector of more than 1,000,000 Gigabytes/sec (1 PBy/s)
• A set of hardware and software filters at the detector reduce the output data rate to about 25 Gb/s that must be transported, managed and analyzed to extract the science
  – The output data rate for CMS is about the same, for a combined 50 Gb/s that is distributed to physics groups around the world, 7x24 x ~9 mo/yr
The LHC data management model involves a world-wide collection of centers that store, manage and analyze the data
[Diagram: A Network Centric View of the LHC (one of two detectors). The detector output (1 PB/s – 8 Pb/s) passes through the Level 1 and 2 triggers (O(1-10) meters), the Level 3 trigger (O(10-100) meters) and the CERN Computer Center / LHC Tier 0 (O(1) km), producing 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS). The LHC Optical Private Network (LHCOPN) carries this 500-10,000 km to the LHC Tier 1 Data Centers: Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN. The LHC Open Network Environment (LHCONE) connects the Tier 1 centers to the LHC Tier 2 Analysis Centers and the many university physics groups, which now get their data wherever it is most readily available.
Tier 1 centers hold the working data: Tape 115 PBy, Disk 60 PBy, 68,000 cores. Tier 2 centers are data caches and analysis sites: 120 PBy disk, 175,000 cores (WLCG 2012), with 3X data outflow vs. inflow.]
10
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s
[Figure: accumulated data volume on disk (Petabytes, rising to ~150 PBy over four years) and the two PanDA job types (tens of thousands of simultaneous jobs each) over one year.]
730 TBytes/day
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC
11
HEP as a Prototype for Data-Intensive Science
The capabilities required to support this scale of data movement involve hardware and software developments at all levels:
  1. The underlying network
     1a. Optical signal transport
     1b. Network routers and switches
  2. Data transport (TCP is a "fragile workhorse" but still the norm)
  3. Network monitoring and testing
  4. Operating system evolution
  5. New site and network architectures
  6. Data movement and management techniques and software
  7. New network services
• Technology advances in these areas have resulted in today's state-of-the-art that makes it possible for the LHC experiments to routinely and continuously move data at ~150 Gb/s across three continents
12
HEP as a Prototype for Data-Intensive Science
• ESnet has been collecting requirements for all DOE science disciplines and instruments that rely on the network for distributed data management and analysis for more than a decade, and formally since 2007 [REQ]
  – In this process certain issues are seen across essentially all science disciplines that rely on the network for significant data transfer, even if the quantities are modest compared to projects like the LHC experiments
  – Therefore addressing the LHC issues is a useful exercise that can benefit a wide range of science disciplines
SKA data flow model is similar to the LHC
[Diagram, hypothetical and based on the LHC experience: receptors/sensors (93–168 Pb/s) feed a correlator / data processor over ~200 km average distances (400 Tb/s), which feeds a supercomputer ~1000 km away; a 100 Gb/s flow (from the SKA RFI) then travels ~25,000 km (Perth to London via USA) or ~13,000 km (South Africa to London) to a European distribution point, which serves regional data centers and university astronomy groups. These numbers are based on modeling prior to splitting the SKA between S. Africa and Australia.]
14
Foundations of data-intensive science
• This talk looks briefly at the nature of the advances in technologies, software and methodologies that have enabled LHC data management and analysis
  – The points 1a and 1b on optical transport and router technology are included in the slides for completeness, but I will not talk about them. They were not really driven by the needs of the LHC, but they were opportunistically used by the LHC
  – Much of the remainder of the talk is a tour through ESnet's network performance knowledge base (fasterdata.es.net)
  – Also included are
    • the LHC ATLAS data management and analysis approach, which generates and relies on very large network data utilization
    • and an overview of how R&E networks have evolved to accommodate the LHC traffic
1) Underlying network issues
At the core of our ability to transport the volume of data that we must deal with today, and to accommodate future growth, are advances in optical transport technology and router technology
[Figure: ESnet accepted traffic, Petabytes/month, over 13 years.]
We face a continuous growth of data to transport
ESnet has seen exponential growth in our traffic every year since 1990 (our traffic grows by a factor of 10 about once every 47 months)
16
We face a continuous growth of data transport
• The LHC data volume is predicted to grow 10-fold over the next 10 years
New generations of instruments – for example the Square Kilometre Array radio telescope and ITER (the international fusion experiment) – will generate more data than the LHC
In response, ESnet and most large R&E networks have built 100 Gb/s (per optical channel) networks
  – ESnet's new network – ESnet5 – is complete and provides 44 x 100 Gb/s (4.4 terabits/sec – 4400 gigabits/sec) in optical channels across the entire ESnet national footprint
  – Initially, one of these 100 Gb/s channels is configured to replace the current 4 x 10 Gb/s IP network
• What has made this possible?
17
1a) Optical Network Technology
Modern optical transport systems (DWDM = dense wave division multiplexing) use a collection of technologies called "coherent optical" processing to achieve more sophisticated optical modulation, and therefore higher data density per signal transport unit (symbol), which provides 100 Gb/s per wave (optical channel)
  – Optical transport using dual polarization-quadrature phase shift keying (DP-QPSK) technology with coherent detection [OIF1]
    • dual polarization
      – two independent optical signals, same frequency, orthogonal polarizations → reduces the symbol rate by half
    • quadrature phase shift keying
      – encode data by changing the phase of the signal relative to the optical carrier → further reduces the symbol rate by half (sends twice as much data per symbol)
  Together, DP and QPSK reduce the required symbol rate by a factor of 4
  – allows a 100G payload (plus overhead) to fit into 50 GHz of spectrum
    • Actual transmission rate is about 10% higher to include FEC data
  – This is a substantial simplification of the optical technology involved – see the TNC 2013 paper and Chris Tracy's NANOG talk for details [Tracy1] and [Rob1]
Optical Network Technology
ESnet5's optical transport uses WaveLogic™ coherent processing to provide 100 Gb/s per wave
  – 88 waves (optical channels), 100 Gb/s each
    • wave capacity shared equally with Internet2
  – ~13,000 miles / 21,000 km of lit fiber
  – 280 optical amplifier sites
  – 70 optical add/drop sites (where routers can be inserted)
    • 46 100G add/drop transponders
    • 22 100G re-gens across the wide area
[Map: ESnet5 optical footprint across the US – hub sites such as SEAT, SUNN, SACR, LOSA, LASV, SALT, BOIS, DENV, ALBU, ELPA, KANS, STLO, CHIC (ANL, FNAL, Starlight), NASH, CHAT, ATLA, JACK, CLEV, CINC, LOUI, EQCH, WASH, NEWY, NEWG, BOST; lab sites including LBNL, NERSC, JGI, SLAC, SNLL, BNL, ORNL; plus Internet2, the Long Island MAN and the ANI Testbed. Geography is only representational.]
19
1b) Network routers and switches
ESnet5 routing (IP layer 3) is provided by Alcatel-Lucent 7750 routers with 100 Gb/s client interfaces
  – 17 routers with 100G interfaces
    • several more in a test environment
  – 59 layer-3 100 GigE interfaces; 8 customer-owned 100G routers
  – 7 100G interconnects with other R&E networks at Starlight (Chicago), MAN LAN (New York) and Sunnyvale (San Francisco)
20
The Energy Sciences Network ESnet5 (Fall 2013)
[Map: the ESnet5 100G backbone (SEAT, SUNN, SACR, BOIS, SALT, DENV, ALBQ, LASV, ELPA, KANS, HOUS, STAR/CHIC, NASH, CHAT, ATLA, WASH, NEWY, AOFA, BOST), metro area circuits, and site connections at 1G–100G to DOE labs and user facilities (LBNL, NERSC, JGI, SLAC, SNLL, LLNL, SDSC, GA, LANL, SNLA, PNNL, INL, NREL, AMES, ANL, FNAL, ORNL, PPPL, PU Physics, BNL, MIT/PSFC, JLAB, SREL, LIGO), plus commercial, US R&E and international R&E peerings, site-provided circuits, and the SUNN–STAR–AOFA–AMST 100G testbed (SF Bay Area, Chicago, New York, Amsterdam). Geographical representation is approximate.]
2) Data transport: The limitations of TCP must be addressed for large, long-distance flows
Although there are other transport protocols available, TCP remains the workhorse of the Internet, including for data-intensive science
Using TCP to support the sustained, long-distance, high data-rate flows of data-intensive science requires an error-free network
Why error-free?
TCP is a "fragile workhorse": it is very sensitive to packet loss (due to bit errors)
  – Very small packet loss rates on these paths result in large decreases in performance
  – A single bit error will cause the loss of a 1-9 KBy packet (depending on the MTU size), as there is no FEC at the IP level for error correction
    • This puts TCP back into "slow start" mode, thus reducing throughput
22
Transport
• The reason for TCP's sensitivity to packet loss is the slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internet
  – Packet loss is seen by TCP's congestion control algorithms as evidence of congestion, so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion, leading to network throughput collapse)
  – Network link errors also cause packet loss, so these congestion avoidance algorithms come into play, with dramatic effect on throughput in the wide area network – hence the need for "error-free"
23
Transport: Impact of packet loss on TCP
On a 10 Gb/s LAN path the impact of low packet loss rates is minimal
On a 10 Gb/s WAN path the impact of low packet loss rates is enormous (~80X throughput reduction on a transatlantic path)
Implications: error-free paths are essential for high-volume, long-distance data transfers
[Figure: Throughput (Mb/s, 0–10,000) vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss – curves for no packet loss, H-TCP (measured), Reno (measured) and Reno (theory) – as the network round trip time (ms) grows toward values corresponding roughly to San Francisco to London. See http://fasterdata.es.net/performance-testing/perfsonar/troubleshooting/packet-loss]
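The scaling behind this behavior can be sketched with the well-known Mathis et al. bound for Reno-style TCP: throughput is limited to roughly (MSS/RTT) x (1/sqrt(loss)), so for a fixed loss rate the achievable rate falls off linearly as RTT grows. A minimal Python illustration follows; the MSS, RTT and loss values are example inputs chosen to match the figure's conditions, not measurements from it:

  # Rough TCP (Reno-like) throughput ceiling from Mathis et al.:
  #   throughput <= (MSS / RTT) * (1 / sqrt(loss))
  # Shows why a loss rate that is harmless on a LAN devastates a long-RTT WAN path.
  import math

  def mathis_throughput_bps(mss_bytes, rtt_s, loss):
      return (mss_bytes * 8 / rtt_s) * (1 / math.sqrt(loss))

  mss = 1460           # bytes, typical payload for a 1500-byte MTU
  loss = 0.000046      # 0.0046% packet loss, as in the figure
  for rtt_ms in (1, 10, 88):   # LAN, metro, ~San Francisco - London
      bps = mathis_throughput_bps(mss, rtt_ms / 1000.0, loss)
      print(f"RTT {rtt_ms:3d} ms -> ~{bps / 1e6:8.1f} Mb/s upper bound")

The bound is only an approximation (modern stacks do better, as the next slides show), but it captures the qualitative message of the plot.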
24
Transport: Modern TCP stack
• A modern TCP stack (the kernel implementation of the TCP protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])
  – This is done using mechanisms that more quickly increase back to full speed after an error forces a reset to low bandwidth
["Binary Increase Congestion" control algorithm impact: BIC reaches max throughput much faster than older algorithms. From Linux 2.6.19 the default is CUBIC, a refined version of BIC designed for high-bandwidth, long paths.]
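On a Linux data transfer host the congestion control algorithm is a runtime setting, so it is easy to confirm that CUBIC (or another modern algorithm, where available) is actually in use. These are standard kernel sysctl knobs; the list of available algorithms depends on the kernel build:

  # Show the congestion control algorithms this kernel has loaded
  sysctl net.ipv4.tcp_available_congestion_control

  # Show which one is currently the default for new connections
  sysctl net.ipv4.tcp_congestion_control

  # Select CUBIC (the Linux default since 2.6.19)
  sudo sysctl -w net.ipv4.tcp_congestion_control=cubic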
25
Transport: Modern TCP stack
Even modern TCP stacks are only of some help in the face of packet loss on a long-path, high-speed network
• For a detailed analysis of the impact of packet loss on various TCP implementations, see "An Investigation into Transport Protocols and Data Transport Applications Over High Performance Networks," chapter 8 ("Systematic Tests of New-TCP Behaviour"), by Yee-Ting Li, University College London (PhD thesis). http://www.slac.stanford.edu/~ytl/thesis.pdf
[Figure: Throughput (Mb/s, 0–1000, tail zoom) vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss – Reno (measured), Reno (theory) and H-TCP (CUBIC refinement, measured) – as round trip time (ms) grows toward values corresponding roughly to San Francisco to London.]
26
3) Monitoring and testing
The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction
perfSONAR provides a standardized way to test, measure, export, catalogue and access performance data from many different network domains (service providers, campuses, etc.)
• perfSONAR is a community effort to
  – define network management data exchange protocols, and
  – standardize measurement data formats, gathering and archiving
perfSONAR is deployed extensively throughout LHC-related networks and international networks and at the end sites (see [fasterdata], [perfSONAR] and [NetSrv])
  – There are now more than 1000 perfSONAR boxes installed in N. America and Europe
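For on-demand testing between two perfSONAR hosts, the toolkit's bwctl wrapper (which brokers iperf-style throughput tests between consenting measurement hosts) and owamp can be run from the command line. The host name below is a placeholder and the options are a minimal sketch, not a tuned regular-testing configuration:

  # 20-second throughput test toward a remote perfSONAR/bwctl host
  # (hostname is a placeholder; both ends must run bwctld)
  bwctl -c ps-test.example.net -t 20

  # One-way loss and latency toward the same host (owamp's ping-style tool)
  owping ps-test.example.net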
27
perfSONAR
The test and monitor functions can detect soft errors that limit throughput and can be hard to find (hard errors / faults are easily found and corrected)
Soft failure example:
• Observed end-to-end performance degradation due to soft failure of a single optical line card
[Figure: one month of throughput measurements (Gb/s) showing normal performance, a period of degrading performance, and recovery after repair.]
• Why not just rely on "SNMP" interface stats for this sort of error detection?
  • not all error conditions show up in SNMP interface statistics
  • SNMP error statistics can be very noisy
  • some devices lump different error counters into the same bucket, so it can be very challenging to figure out what errors to alarm on and what errors to ignore
    • though ESnet's Spectrum monitoring system attempts to apply heuristics to do this
  • many routers will silently drop packets – the only way to find that is to test through them and observe loss using devices other than the culprit device
28
perfSONAR
The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains
  – It provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe
  – ESnet has perfSONAR testers installed at every PoP and all but the smallest user sites – Internet2 is close to the same
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages
29
4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network
• Host TCP tuning
• Modern TCP stack (see above)
• Other issues (MTU, etc.)
• Data transfer tools and parallelism
• Other data transfer issues (firewalls, etc.)
30
4.1) System software tuning: Host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using end-to-end
Default TCP buffer sizes are typically much too small for today's high speed networks
  – Until recently, default TCP send/receive buffers were typically 64 KB
  – Tuned buffer to fill a CA to NY 1 Gb/s path: 10 MB (see the arithmetic sketched below)
    • 150X bigger than the default buffer size
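The required buffer is just the bandwidth-delay product of the path. A small sketch of the arithmetic behind the 10 MB figure; the ~80 ms coast-to-coast and ~150 ms trans-Atlantic round trip times are assumed, representative values:

  # TCP needs a window >= bandwidth * RTT (the bandwidth-delay product)
  # to keep a long path full.
  def bdp_bytes(bandwidth_bps, rtt_s):
      return bandwidth_bps * rtt_s / 8.0

  # 1 Gb/s coast-to-coast path, ~80 ms RTT (assumed representative value)
  print(bdp_bytes(1e9, 0.080) / 1e6, "MB")    # ~10 MB, vs. a 64 KB default

  # 10 Gb/s trans-Atlantic path, ~150 ms RTT
  print(bdp_bytes(10e9, 0.150) / 1e6, "MB")   # ~190 MB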
31
System software tuning: Host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications
  – How to tune is a function of the application and the path to the destination, so potentially a lot of special cases
Auto-tuning the TCP connection buffer size within pre-configured limits helps
Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths – those limits have to be raised (an example follows below)
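On Linux the auto-tuning ceilings are sysctl settings. The values below are illustrative of the kind of maxima suggested for 10 Gb/s data transfer hosts on long paths (fasterdata.es.net carries the current recommendations), not universal settings:

  # /etc/sysctl.conf fragment - raise the TCP autotuning ceilings
  # (example values for a 10G data transfer host; tune per fasterdata.es.net)
  net.core.rmem_max = 67108864
  net.core.wmem_max = 67108864
  # min, default, max buffer in bytes - the max is what autotuning can grow to
  net.ipv4.tcp_rmem = 4096 87380 67108864
  net.ipv4.tcp_wmem = 4096 65536 67108864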
32
System software tuning: Host tuning – TCP
[Figure: Throughput (Mb/s, 0–10,000) out to ~9000 km path length on a 10 Gb/s network, vs. round trip time (ms, corresponding roughly to San Francisco to London), comparing a 32 MBy (auto-tuned) and a 64 MBy (hand-tuned) TCP window size.]
33
4.2) System software tuning: Data transfer tools
Parallelism is key in data transfer tools
  – It is much easier to achieve a given performance level with multiple parallel connections than with one connection
    • this is because the OS is very good at managing multiple threads and less good at sustained, maximum performance of a single thread (the same is true for disks)
  – Several tools offer parallel transfers (see below)
Latency tolerance is critical
  – Wide area data transfers have much higher latency than LAN transfers
  – Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds)
    • examples: SCP/SFTP and HPSS mover protocols work very poorly in long-path networks
• Disk performance
  – In general, a RAID array or parallel disks (as in FDT) are needed to get more than about 500 Mb/s
34
System software tuning: Data transfer tools
Using the right tool is very important
Sample results: Berkeley, CA to Argonne, IL; RTT = 53 ms, network capacity = 10 Gbps
  Tool                      Throughput
  • scp                     140 Mbps
  • patched scp (HPN)       1.2 Gbps
  • ftp                     1.4 Gbps
  • GridFTP, 4 streams      5.4 Gbps
  • GridFTP, 8 streams      6.6 Gbps
Note that to get more than about 1 Gbps (125 MB/s) disk-to-disk requires using RAID technology
• PSC (Pittsburgh Supercomputing Center) has a patch set that fixes problems with SSH
  – http://www.psc.edu/networking/projects/hpn-ssh
  – Significant performance increase
    • this helps rsync too
35
System software tuning: Data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems
  – Parallel streams, buffer tuning, help in getting through firewalls (open ports), ssh, etc.
  – The newer Globus Online incorporates all of these and adds small file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP
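As a concrete, simplified sketch, the GridFTP command-line client exposes the parallelism and buffer tuning directly. The endpoints below are placeholders and the stream count and buffer size are example values, not recommendations:

  # 8 parallel TCP streams, 16 MB TCP buffers, verbose progress output
  # (hosts and paths are placeholders)
  globus-url-copy -vb -p 8 -tcp-bs 16777216 \
      gsiftp://dtn1.example.org/data/run042.tar \
      gsiftp://dtn2.example.org/ingest/run042.tar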
36
System software tuning: Data transfer tools
Also see Caltech's FDT (Fast Data Transfer) approach
  – Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node
  – Explicit parallel use of multiple disks
  – Can fill 100 Gb/s paths
  – See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT
37
4.4) System software tuning: Other issues
Firewalls are anathema to high-speed data flows
  – many firewalls can't handle >1 Gb/s flows
    • designed for large numbers of low-bandwidth flows
    • some firewalls even strip out TCP options that allow for TCP buffers > 64 KB
  See Jason Zurawski's "Say Hello to your Frienemy – The Firewall"
  Stateful firewalls have inherent problems that inhibit high throughput
    • http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues
  – Large MTUs (several issues; a quick path check is shown below)
  – NIC tuning
    • Defaults are usually fine for 1GE, but 10GE often requires additional tuning
  – Other OS tuning knobs
  – See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])
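One quick check for the large-MTU issue is to verify that jumbo frames actually survive the whole path. A standard Linux ping with fragmentation forbidden does this; the host name is a placeholder, and 8972 bytes is a 9000-byte MTU minus the IP and ICMP headers:

  # Does a 9000-byte MTU survive end-to-end?  -M do forbids fragmentation.
  ping -c 4 -M do -s 8972 dtn.example.org

  # If that fails but a 1500-byte MTU probe works, some hop is limited to 1500 bytes.
  ping -c 4 -M do -s 1472 dtn.example.org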
5) Site infrastructure to support data-intensive science: The Science DMZ
With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck
The site network (LAN) typically provides connectivity for local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science
  – Therefore a high performance interface between the wide area network and the local area / site network is critical for large-scale data movement
Campus network infrastructure is typically not designed to handle the flows of large-scale science
  – The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows
    • firewalls, proxy servers, low-cost switches, and so forth
    • none of which will allow high volume, high bandwidth, long distance data flows
39
The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large data volume, high round trip time (RTT) (international paths) of the wide area network (WAN) flows (see [DIS])
  – otherwise the site will impose poor performance on the entire high speed data path, all the way back to the source
40
The Science DMZ
The Science DMZ concept:
  The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy
  – Outside the site firewall – hence the term "Science DMZ"
  – With dedicated systems built and tuned for wide-area data transfer
  – With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below)
  – A security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.) – a sketch of such a policy follows below
This is so important that it was a requirement for the last round of NSF CC-NIE grants
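What "a security policy implemented with access control lists" can look like in practice is sketched below: a stateless router ACL that admits only the data-transfer and measurement traffic the Science DMZ actually needs, instead of pushing the flows through a stateful firewall. This is a hypothetical, Cisco-style fragment with placeholder addresses and ports, not a recommended policy:

  ! Hypothetical Science DMZ border ACL (placeholder addresses and ports)
  ip access-list extended SCIENCE-DMZ-IN
   remark GridFTP control channel to the Data Transfer Node
   permit tcp any host 192.0.2.10 eq 2811
   remark GridFTP parallel data channels (site-chosen port range)
   permit tcp any host 192.0.2.10 range 50000 51000
   remark perfSONAR owamp control and test streams to the test host
   permit tcp any host 192.0.2.20 eq 861
   permit udp any host 192.0.2.20 range 8760 9960
   deny ip any any log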
41
The Science DMZ
[Diagram: the border router connects the WAN to a Science DMZ router/switch – a WAN-capable device – sitting outside the site firewall. The Science DMZ hosts the high performance Data Transfer Node, network monitoring and testing systems, and per-service security policy control points – dedicated systems built and tuned for wide-area data transfer – giving a clean, high-bandwidth WAN data path to the campus computing cluster. Campus/site access to Science DMZ resources is via the site firewall; secured campus/site access to the Internet and the site DMZ (Web, DNS, Mail) stays on the normal campus LAN path.]
(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)
42
6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites
  In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers
• The Tier 2 sites get a comparable amount of data from the Tier 1s
  – Host the physics groups that analyze the data and do the science
  – Provide most of the compute resources for analysis
  – Cache the data (though this is evolving to remote IO)
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management
  – The resources and data movement are centrally managed
  – Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations
  – The system manages 10s of thousands of jobs a day
    • coordinates data movement of hundreds of terabytes/day, and
    • manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial
44
The ATLAS PanDA "Production and Distributed Analysis" system uses distributed resources and layers of automation to manage several million jobs/day
[Diagram: ATLAS production jobs, regional production jobs, and user / group analysis jobs enter the PanDA Server (task management) at CERN via the Task Buffer (job queue); a Policy module (job type priority), Job Broker, Job Dispatcher, Data Service and Distributed Data Manager (DDM agents) coordinate with the Grid Scheduler, Site Capability Service and site status information.
1) PanDA schedules jobs and initiates data movement.
2) The DDM locates data and moves it to sites – this is a complex system in its own right, called DQ2.
3) The local resources are prepared to receive PanDA jobs.
4) Jobs are dispatched when there are resources available and when the required data is in place at the site – PanDA tries to move the job to where the data is, else it moves data and job to where resources are available.
Job resource manager: a "pilot" job manager – a PanDA job receiver – is dispatched when resources are available at a site; pilots run under the local site job manager (e.g. Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA, similar to the Condor Glide-in approach.
The CERN ATLAS detector feeds the Tier 0 Data Center (1 copy of all data – archival only). The ATLAS Tier 1 Data Centers – 11 sites scattered across Europe, North America and Asia – in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis. The ATLAS analysis sites are, e.g., the 70 Tier 2 Centers in Europe, North America and SE Asia.]
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s
[Figure: accumulated data volume on disk (Petabytes, rising to ~150 PBy over four years) and the two PanDA job types (tens of thousands of simultaneous jobs each) over one year.]
730 TBytes/day
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure
  – Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges"
  – Successful testing was required for sites to participate in LHC production
47
Ramp-up of LHC traffic in ESnet
[Figure: ESnet traffic over time, showing an estimate of "small" scale traffic during LHC data system testing, then LHC turn-on and LHC operation.]
The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years
48
6, cont.) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN
  – The LHCOPN is a collection of leased 10 Gb/s optical circuits
  – The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously
    • In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community
• The issues related to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance
  – The security issues were the primary ones, and were addressed by using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec])
    – that is, only LHC data and compute servers are connected to the OPN
50
The LHC OPN – Optical Private Network
[Diagram: LHCOPN physical topology (abbreviated) and architecture – CH-CERN at the hub, connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC and IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network
NB:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits
Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays
  – The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact this is what ESnet's OSCARS virtual circuit system was originally designed for (see below)
  – However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic
  – In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
  – (there are about 170 Tier 2 sites)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems
  – The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GEANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
  – The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineers
  – To ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC
54
[Map: LHCONE – a global infrastructure for LHC Tier 1 data center – Tier 2 analysis center connectivity (April 2012). LHCONE VRF domains are shown for ESnet (USA), Internet2 (USA), CANARIE (Canada), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), CUDI (Mexico), TWAREN and ASGC (Taiwan), KERONET2 and KISTI (Korea), India, and GÉANT (Europe), interconnected at regional exchange points such as Seattle, Chicago, New York, Washington, Amsterdam and Geneva. End sites are LHC Tier 2 or Tier 3 centers unless indicated as Tier 1 (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1). Data communication links are 10, 20 and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because
  – the VRF technology is a standard capability in most core routers, and
  – there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance
• See lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
Approximate CERN → Tier 1 distances:
  CERN → T1          miles    kms
  France               350     565
  Italy                570     920
  UK                   625    1000
  Netherlands          625    1000
  Germany              700    1185
  Spain                850    1400
  Nordic              1300    2100
  USA – New York      3900    6300
  USA – Chicago       4400    7100
  Canada – BC         5200    8400
  Taiwan              6100    9850
[Diagram: A Network Centric View of the LHC – the detector (1 PB/s), Level 1 and 2 triggers (O(1-10) meters), Level 3 trigger (O(10-100) meters) and CERN Computer Center (O(1) km) produce 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS); the LHC Optical Private Network (LHCOPN) carries this 500-10,000 km to the LHC Tier 1 Data Centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN); the LHC Open Network Environment (LHCONE) connects the Tier 1s to the LHC Tier 2 Analysis Centers and the university physics groups, which now get their data wherever it is most readily available.]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to
  – Couple existing pockets of code, data and expertise into "systems of systems"
  – Break up the task of massive data analysis and use data, compute and storage resources that are located at the collaborators' sites
  – See https://www.es.net/about/science-requirements/
A commonly identified need to support this is that networking must be provided as a "service":
  – Schedulable, with guaranteed bandwidth – as is done with CPUs and disks
  – Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
  – Some network path characteristics may also be specified – e.g. diversity
  – Available in a Web Services / Grid Services paradigm
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
  – This is typically done by using a "static" routing mechanism
    • e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
  – MPLS and OpenFlow are examples of this, and both can transport IP packets
  – Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage / optimize the use of available network resources and to keep big data flows separate from general traffic
  – The virtual circuits can be directed to specific physical network paths when they are set up
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead: Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references
• OSCARS received a 2013 "R&D 100" award
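The essence of such a service is a reservation request carrying endpoints, guaranteed bandwidth and a time window. The fragment below is a purely hypothetical illustration of the information a client might submit to a domain controller; the field names and values are invented for illustration and are not the OSCARS API:

  # Hypothetical circuit reservation sketch - field names and values are
  # illustrative only, not the actual OSCARS interface.
  import json

  reservation = {
      "source":         "fnal-rtr:xe-0/1/0",     # circuit ingress (placeholder)
      "destination":    "bnl-rtr:xe-2/0/3",      # circuit egress (placeholder)
      "bandwidth_mbps": 5000,                    # guaranteed bandwidth
      "start_time":     "2014-04-01T00:00:00Z",  # schedulable: start of window
      "end_time":       "2014-04-08T00:00:00Z",  # end of window
      "vlan":           3012,                    # layer-2 hand-off tag at each end
  }
  print(json.dumps(reservation, indent=2))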
60
End User View of Circuits – How They Use Them
• Who are the "users"?
  – Sites, for the most part
• How are the circuits used?
  – End system to end system, IP
    • Almost never – very hard unless private address space is used
      – Using public address space can result in leaking routes
      – Using private address space with multi-homed hosts risks allowing backdoors into secure networks
  – End system to end system, Ethernet (or other) over VLAN – a pseudowire
    • Relatively common
    • Interesting example: RDMA over VLAN is likely to be popular in the future
      – the SC11 demo of 40G RDMA over WAN was very successful
      – CPU load for RDMA is a small fraction of that of IP
      – the guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks)
  – Point-to-point connection between routing instances – e.g. BGP at the end points
    • Essentially this is how all current circuits are used: from one site router to another site router
    • Typically site-to-site, or to advertise subnets that host clusters, e.g. LHC analysis or data management clusters
61
End User View of Circuits – How They Use Them
• When are the circuits used?
  – Mostly to solve a specific problem that the general infrastructure cannot
    • Most circuits are used for a guarantee of bandwidth or for user traffic engineering
Cross-Domain Virtual Circuit Service
Science collaborations span many network domains (administrative units)
  – For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration, to provide end-to-end circuits
  – e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US Regionals, etc. are all different domains
63
Inter-Domain Control Protocol
• There are two realms involved:
  1. Domain controllers, like OSCARS, for routing, scheduling and resource commitment within network domains
  2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains
[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] through ESnet (AS293) [US], GEANT (AS20965) [Europe] and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local InterDomain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GEANT.
1) The domains exchange topology information containing at least the potential VC ingress and egress points.
2) A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3) The data plane connection (e.g. Ethernet VLAN-to-VLAN connection) is facilitated by a helper process at each domain ingress/egress point.]
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
  – Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. cpu and storage scheduling, in a Web Services / Grids framework, so that computing, data access and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net
65
8) Provide R&D consulting and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
  – With each generation of network transport technology
    • 155 Mb/s was the norm for high speed networks in 1995
    • 100 Gb/s – 650 times greater – is the norm today
    • R&D groups involving hardware engineers, computer scientists and application specialists worked to
      – first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
      – and then do the development necessary for applications to make use of the new capabilities
  – Examples of how this methodology drove toward today's capabilities include
    • experiments in the 1990s using parallel disk IO and parallel network IO together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
    • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
66
Provide R&D consulting and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical
Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone
The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations
67
The knowledge base
http://fasterdata.es.net topics:
  – Network Architecture, including the Science DMZ model
  – Host Tuning
  – Network Tuning
  – Data Transfer Tools
  – Network Performance Testing
  – With special sections on
    • Linux TCP Tuning
    • Cisco 6509 Tuning
    • perfSONAR Howto
    • Active perfSONAR Services
    • Globus overview
    • Say No to SCP
    • Data Transfer Nodes (DTN)
    • TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments
  But once this is done, international high-speed data management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
69
Infrastructure Critical to Science
• The combination of
  – New network architectures in the wide area
  – New network services (such as guaranteed bandwidth virtual circuits)
  – Cross-domain network error detection and correction
  – Redesigning the site LAN to handle high data throughput
  – Automation of data movement systems
  – Use of appropriate operating system tuning and data transfer tools
  now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated / sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKA
The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
  – A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
  – The technical aspects of building and operating a centralized working data repository
    • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
    • high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
  militate against a single large data center
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well
  – It decentralizes costs and involves many countries directly in the telescope infrastructure
  – It divides up the network load, especially on the expensive trans-ocean links
  – It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply
  – There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
    • It might be that in the case of the SKA the T1 links would come to a centralized, data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue
    • In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
  – If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
    – In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded
  – All high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
  – New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical: the Science DMZ
Workflow management systems that automate the data movement will have to be designed and tested
  – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
75
The Message
Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments
  But once this is done, international high-speed data management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] at https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok; Robertson, D.; Thompson, M.; Lee, J.; Tierney, B.; Johnston, W., Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K.; Beckett, D.; Boertjes, D.; Berthold, J.; Laperle, C., Ciena Corp., Ottawa, ON, Canada. Communications Magazine, IEEE, July 2010.
(may be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] see "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf, and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
7
HEP as a Prototype for Data-Intensive Science
As the instrument size and data volume have gone up, the methodology for analyzing the data has had to evolve:
– The data volumes from the early experiments were low enough that the data was analyzed locally.
– As the collaborations grew to several institutions and the data analysis was shared among them, the data was distributed by shipping tapes around.
– As the collaboration sizes grew and became intercontinental, the HEP community began to use networks to coordinate the collaborations and eventually to send the data around.
The LHC data model assumed network transport of all data from the beginning (as opposed to shipping media).
Similar changes are occurring in most science disciplines.
8
HEP as a Prototype for Data-Intensive Science
• Two major proton experiments (detectors) at the LHC: ATLAS and CMS.
• ATLAS is designed to observe a billion (1x10^9) collisions/sec, with a data rate out of the detector of more than 1,000,000 Gigabytes/sec (1 PBy/s).
• A set of hardware and software filters at the detector reduces the output data rate to about 25 Gb/s that must be transported, managed, and analyzed to extract the science (a quick calculation of the implied reduction factor follows).
– The output data rate for CMS is about the same, for a combined 50 Gb/s that is distributed to physics groups around the world, 7x24x~9mo/yr.
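A back-of-the-envelope check of the data reduction implied by the numbers above (illustrative arithmetic only, using the rates quoted on this slide):

    # Rough reduction factor from the detector front end to the filtered output.
    detector_rate_bps = 1e15 * 8      # ~1 PBy/s out of the ATLAS detector
    filtered_rate_bps = 25e9          # ~25 Gb/s after the trigger/filter chain
    print(f"reduction factor: {detector_rate_bps / filtered_rate_bps:,.0f}x")
    # -> about 320,000x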
The LHC data management model involves a world-wide collection of centers that store, manage, and analyze the data.
[Figure: A network-centric view of the LHC (one of two detectors). The detector output (1 PB/s – 8 Pb/s) feeds the Level 1 and 2 triggers (O(1-10) m away), then the Level 3 trigger (O(10-100) m), then the CERN computer center / Tier 0 (O(1) km). A combined 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) flows over the LHC Optical Private Network (LHCOPN) to the LHC Tier 1 data centers – Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN – 500-10,000 km away. The Tier 1 centers hold the working data (WLCG 2012: tape 115 PBy, disk 60 PBy, 68,000 cores) and feed, over the LHC Open Network Environment (LHCONE), the Tier 2 analysis centers and university physics groups (roughly 120 PBy of disk and 175,000 cores, WLCG 2012), which are data caches and analysis sites; Tier 1 data outflow is about 3X the inflow. The physics groups now get their data wherever it is most readily available.]
10
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s (see the quick check below).
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here).
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
[Figure: accumulated data volume on disk (petabytes, rising to ~150 PB over four years) and the two PanDA job types (up to ~100,000 and ~50,000 simultaneous jobs, each plotted over one year).]
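A quick sanity check of the "730 TBytes/day ≈ 68 Gb/s" figure quoted above (illustrative arithmetic only):

    # Convert the quoted daily volume to an average rate.
    bytes_per_day = 730e12
    gbits_per_sec = bytes_per_day * 8 / 86400 / 1e9
    print(f"~{gbits_per_sec:.1f} Gb/s")   # ~67.6 Gb/s, matching the slide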
11
HEP as a Prototype for Data-Intensive Science
The capabilities required to support this scale of data movement involve hardware and software developments at all levels:
1. The underlying network
   1a. Optical signal transport
   1b. Network routers and switches
2. Data transport (TCP is a "fragile workhorse" but still the norm)
3. Network monitoring and testing
4. Operating system evolution
5. New site and network architectures
6. Data movement and management techniques and software
7. New network services
• Technology advances in these areas have resulted in today's state-of-the-art that makes it possible for the LHC experiments to routinely and continuously move data at ~150 Gb/s across three continents.
12
HEP as a Prototype for Data-Intensive Science
• ESnet has been collecting requirements for all DOE science disciplines and instruments that rely on the network for distributed data management and analysis for more than a decade, and formally since 2007 [REQ].
• In this process certain issues are seen across essentially all science disciplines that rely on the network for significant data transfer, even if the quantities are modest compared to projects like the LHC experiments.
• Therefore addressing the LHC issues is a useful exercise that can benefit a wide range of science disciplines.
SKA data flow model is similar to the LHC
[Figure (hypothetical, based on the LHC experience; the numbers are from the SKA RFI and are based on modeling prior to splitting the SKA between S. Africa and Australia): receptors/sensors send 93 – 168 Pb/s over ~200 km (average) to the correlator / data processor, which sends 400 Tb/s over ~1000 km to a supercomputer; from there ~100 Gb/s flows ~25,000 km (Perth to London via USA) or ~13,000 km (South Africa to London) to a European distribution point, which feeds regional data centers and university astronomy groups.]
14
Foundations of data-intensive science
• This talk looks briefly at the nature of the advances in technologies, software, and methodologies that have enabled LHC data management and analysis.
– The points 1a and 1b on optical transport and router technology are included in the slides for completeness, but I will not talk about them. They were not really driven by the needs of the LHC, but they were opportunistically used by the LHC.
– Much of the remainder of the talk is a tour through ESnet's network performance knowledge base (fasterdata.es.net).
– Also included are:
• the LHC ATLAS data management and analysis approach that generates and relies on very large network data utilization,
• and an overview of how R&E networks have evolved to accommodate the LHC traffic.
1) Underlying network issues
At the core of our ability to transport the volume of data that we must deal with today, and to accommodate future growth, are advances in optical transport technology and router technology.
We face a continuous growth of data to transport.
ESnet has seen exponential growth in our traffic every year since 1990 (our traffic grows by a factor of 10 about once every 47 months – see the quick calculation below).
[Figure: ESnet accepted traffic in petabytes/month over 13 years.]
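What a factor of 10 every 47 months implies per year, as a small worked example (illustrative arithmetic only):

    import math
    months_per_10x = 47
    annual_growth = 10 ** (12 / months_per_10x)        # ~1.8x per year
    doubling_months = months_per_10x * math.log10(2)   # ~14 months
    print(f"~{annual_growth:.1f}x per year, doubling every ~{doubling_months:.0f} months")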
16
We face a continuous growth of data transport
• The LHC data volume is predicted to grow 10-fold over the next 10 years.
• New generations of instruments – for example the Square Kilometer Array radio telescope and ITER (the international fusion experiment) – will generate more data than the LHC.
• In response, ESnet and most large R&E networks have built 100 Gb/s (per optical channel) networks.
– ESnet's new network – ESnet5 – is complete and provides 44 x 100 Gb/s (4.4 terabits/sec – 4400 gigabits/sec) in optical channels across the entire ESnet national footprint.
– Initially, one of these 100 Gb/s channels is configured to replace the current 4 x 10 Gb/s IP network.
• What has made this possible?
17
1a) Optical Network Technology
Modern optical transport systems (DWDM = dense wave division multiplexing) use a collection of technologies called "coherent optical" processing to achieve more sophisticated optical modulation, and therefore higher data density per signal transport unit (symbol), providing 100 Gb/s per wave (optical channel):
– Optical transport using dual polarization-quadrature phase shift keying (DP-QPSK) technology with coherent detection [OIF1]:
• dual polarization – two independent optical signals, same frequency, on two orthogonal polarizations → reduces the symbol rate by half;
• quadrature phase shift keying – encode data by changing the signal phase relative to the optical carrier → further reduces the symbol rate by half (sends twice as much data per symbol).
Together DP and QPSK reduce the required symbol rate by a factor of 4, which allows the 100G payload (plus overhead) to fit into 50 GHz of spectrum (see the arithmetic sketch below).
• The actual transmission rate is about 10% higher, to include FEC data.
– This is a substantial simplification of the optical technology involved – see the TNC 2013 paper and Chris Tracy's NANOG talk for details [Tracy1] and [Rob1].
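Rough arithmetic showing why the factor-of-4 symbol-rate reduction matters (illustrative only; the ~10% FEC overhead is taken from the slide above):

    # Approximate symbol rate for 100 Gb/s carried as DP-QPSK.
    payload_bps     = 100e9
    line_rate_bps   = payload_bps * 1.10     # ~10% FEC/framing overhead (assumed)
    bits_per_symbol = 2 * 2                  # 2 polarizations x 2 bits/symbol (QPSK)
    symbol_rate = line_rate_bps / bits_per_symbol
    print(f"~{symbol_rate/1e9:.0f} Gbaud")   # ~28 Gbaud, which fits in a 50 GHz slot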
Optical Network Technology
– WaveLogic™ coherent processing provides the 100 Gb/s waves
– 88 waves (optical channels), 100 Gb/s each
• wave capacity shared equally with Internet2
– ~13,000 miles / 21,000 km of lit fiber
– 280 optical amplifier sites
– 70 optical add/drop sites (where routers can be inserted)
• 46 100G add/drop transponders
• 22 100G re-gens across the wide area
[Figure: map of the ESnet5 optical footprint – SEAT, SUNN, SACR, SALT, BOIS, DENV, KANS, LASV, LOSA, PHOE, ALBU, ELPA, STLO, CHIC, CLEV, EQCH, STAR, CINC, NASH, CHAT, ATLA, JACK, WASH, NEWY, BOST, NEWG, WSAC, PAIX – with lab sites NERSC, LBNL, JGI, SLAC, SNLL, ANL, FNAL, BNL, ORNL, LOUI, plus Internet2, SC11, and the Long Island MAN and ANI Testbed. Geography is only representational.]
19
1b) Network routers and switches
ESnet5 routing (IP layer 3) is provided by Alcatel-Lucent 7750 routers with 100 Gb/s client interfaces:
– 17 routers with 100G interfaces
• several more in a test environment
– 59 layer-3 100 GigE interfaces; 8 customer-owned 100G routers
– 7 100G interconnects with other R&E networks at Starlight (Chicago), MAN LAN (New York), and Sunnyvale (San Francisco)
20
[Figure: The Energy Sciences Network, ESnet5 (Fall 2013). The map shows the 100G backbone (SEAT, SUNN, SACR, SALT, BOIS, DENV, KANS, ALBQ, ELPA, HOUS, NASH, CHAT, ATLA, WASH, NEWY, AOFA, BOST, STAR, CHIC), metro area circuits, the SUNN–STAR–AOFA 100G testbed (SF Bay Area, Chicago, New York, Amsterdam/AMST), and site connections at 1, 10, 10-40, and 100 Gb/s (some site-provided or optical-only) for LBNL, NERSC, JGI, SLAC, SNLL, LLNL, SNLA, LANL, GA, SDSC, PNNL, INL, AMES, ANL, FNAL, ORNL, BNL, PPPL, JLAB, MIT/PSFC, PU Physics, LIGO, NREL, SREL, and others, plus commercial, US R&E, and international R&E peerings; ESnet routers and site routers are distinguished. Geographical representation is approximate.]
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport: the limitations of TCP must be addressed for large, long-distance flows
Although there are other transport protocols available, TCP remains the workhorse of the Internet, including for data-intensive science.
Using TCP to support the sustained, long distance, high data-rate flows of data-intensive science requires an error-free network.
Why error-free? TCP is a "fragile workhorse": it is very sensitive to packet loss (due to bit errors).
– Very small packet loss rates on these paths result in large decreases in performance.
– A single bit error will cause the loss of a 1-9 KBy packet (depending on the MTU size), as there is no FEC at the IP level for error correction.
• This puts TCP back into "slow start" mode, thus reducing throughput.
22
Transport
• The reason for TCP's sensitivity to packet loss is the slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internet.
– Packet loss is seen by TCP's congestion control algorithms as evidence of congestion, so they activate to slow down senders and prevent the synchronization of the senders (which perpetuates and amplifies the congestion, leading to network throughput collapse).
– Network link errors also cause packet loss, so these congestion avoidance algorithms come into play, with dramatic effect on throughput in the wide area network – hence the need for "error-free".
23
Transport: Impact of packet loss on TCP
On a 10 Gb/s LAN path the impact of low packet loss rates is minimal.
On a 10 Gb/s WAN path the impact of low packet loss rates is enormous (~80X throughput reduction on a transatlantic path).
Implications: error-free paths are essential for high-volume, long-distance data transfers.
(See http://fasterdata.es.net/performance-testing/perfsonar/troubleshooting/packet-loss)
[Figure: Throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss, for Reno (measured), Reno (theory), and H-TCP (measured), compared with the no-loss case. Throughput (Mb/s, 0–10,000) falls off sharply as the network round-trip time grows toward values corresponding roughly to San Francisco to London. A rough model of this behavior is sketched below.]
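A minimal sketch of why the curve above collapses at long RTT, using the Mathis et al. approximation for Reno-style TCP (rate ≈ (MSS/RTT)·√(3/2)/√p). The MSS and the ~88 ms SF-to-London RTT are assumptions for illustration, not measurements from the plot:

    import math

    def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
        # Reno-style steady-state estimate; ignores the 10 Gb/s link cap.
        return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5 / loss_rate)

    loss = 0.0046 / 100            # the 0.0046% packet loss used in the plot
    for rtt_ms in (1, 10, 88):     # LAN, regional, ~San Francisco-London
        bps = mathis_throughput_bps(1460, rtt_ms / 1000, loss)
        print(f"RTT {rtt_ms:3d} ms -> ~{bps/1e6:,.0f} Mb/s")
    # LAN paths barely notice the loss; at ~88 ms the estimate drops to a few tens of Mb/s.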
24
Transport: Modern TCP stack
• A modern TCP stack (the kernel implementation of the TCP protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk]).
– This is done using mechanisms that more quickly increase back to full speed after an error forces a reset to low bandwidth. (A quick way to check which algorithm a host is using is sketched below.)
[Figure: impact of the "Binary Increase Congestion" (BIC) control algorithm. Note that BIC reaches maximum throughput much faster than older algorithms. (From Linux 2.6.19 the default is CUBIC, a refined version of BIC designed for high-bandwidth, long paths.)]
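A minimal sketch for checking (and, as root, changing) the congestion control algorithm on a Linux host; it assumes the standard /proc interface and that the target module (e.g. htcp) is available on the system:

    from pathlib import Path

    cc    = Path("/proc/sys/net/ipv4/tcp_congestion_control")
    avail = Path("/proc/sys/net/ipv4/tcp_available_congestion_control")
    print("current:  ", cc.read_text().strip())
    print("available:", avail.read_text().strip())
    # To switch algorithms (as root, if the module is loaded), e.g.:
    # cc.write_text("htcp\n")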
25
Transport: Modern TCP stack
Even modern TCP stacks are only of some help in the face of packet loss on a long-path, high-speed network.
• For a detailed analysis of the impact of packet loss on various TCP implementations, see "An Investigation into Transport Protocols and Data Transport Applications Over High Performance Networks", chapter 8 ("Systematic Tests of New-TCP Behaviour"), by Yee-Ting Li, University College London (PhD thesis): http://www.slac.stanford.edu/~ytl/thesis.pdf
[Figure: Throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss (tail zoom, 0–1000 Mb/s) for Reno (measured), Reno (theory), and H-TCP (measured); round-trip time corresponds roughly to San Francisco to London.]
26
3) Monitoring and testing
The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction.
perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.).
• perfSONAR is a community effort to:
– define network management data exchange protocols, and
– standardize measurement data formats, gathering, and archiving.
perfSONAR is deployed extensively throughout LHC-related networks and international networks, and at the end sites. (See [fasterdata], [perfSONAR], and [NetSrv].)
– There are now more than 1000 perfSONAR boxes installed in N. America and Europe.
27
perfSONAR
The test and monitor functions can detect soft errors that limit throughput and can be hard to find (hard errors/faults are easily found and corrected).
Soft failure example:
• Observed end-to-end performance degradation due to soft failure of a single optical line card.
[Figure: one month of throughput measurements (Gb/s) showing normal performance, then degrading performance, then recovery after the repair.]
• Why not just rely on "SNMP" interface stats for this sort of error detection?
• not all error conditions show up in SNMP interface statistics;
• SNMP error statistics can be very noisy;
• some devices lump different error counters into the same bucket, so it can be very challenging to figure out what errors to alarm on and what errors to ignore (though ESnet's Spectrum monitoring system attempts to apply heuristics to do this);
• many routers will silently drop packets – the only way to find that is to test through them and observe loss using devices other than the culprit device.
28
perfSONAR
The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains:
– it provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe;
– ESnet has perfSONAR testers installed at every PoP and all but the smallest user sites – Internet2 is close to the same. (A sketch of pulling results out of a measurement archive follows.)
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages.
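A hedged sketch of querying recent throughput results from a perfSONAR measurement archive (esmond). The host and test endpoints are placeholders, and the exact query parameters and response layout should be verified against your perfSONAR deployment's API documentation:

    import requests

    archive = "https://ps-archive.example.org"      # hypothetical archive host
    params = {
        "event-type": "throughput",
        "source": "dtn1.example.org",               # hypothetical test endpoints
        "destination": "dtn2.example.org",
        "time-range": 86400,                        # last 24 hours
        "format": "json",
    }
    r = requests.get(f"{archive}/esmond/perfsonar/archive/", params=params, timeout=30)
    r.raise_for_status()
    for md in r.json():                             # one metadata record per test pair
        print(md.get("source"), "->", md.get("destination"))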
29
4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network:
• Host TCP tuning
• Modern TCP stack (see above)
• Other issues (MTU, etc.)
• Data transfer tools and parallelism
• Other data transfer issues (firewalls, etc.)
30
4.1) System software tuning: Host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length.
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using, end-to-end.
• Default TCP buffer sizes are typically much too small for today's high speed networks.
– Until recently, default TCP send/receive buffers were typically 64 KB.
– Tuned buffer to fill a CA to NY 1 Gb/s path: 10 MB
• 150X bigger than the default buffer size (see the arithmetic below).
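The buffer size follows from the bandwidth-delay product. A small illustrative calculation (the RTT values are assumptions chosen to roughly match the paths named above):

    # Bandwidth-delay product: bytes in flight needed to keep a path full.
    def bdp_bytes(rate_bps, rtt_s):
        return rate_bps * rtt_s / 8

    for rate_bps, rtt_ms, label in [
        (1e9,  80, "1 Gb/s, CA to NY (~80 ms RTT)"),
        (10e9, 88, "10 Gb/s, SF to London (~88 ms RTT)"),
    ]:
        print(f"{label}: ~{bdp_bytes(rate_bps, rtt_ms/1000)/1e6:.0f} MB of TCP buffer")
    # -> ~10 MB and ~110 MB respectively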
31
System software tuning: Host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications.
– How to tune is a function of the application and the path to the destination, so potentially a lot of special cases.
• Auto-tuning the TCP connection buffer size within pre-configured limits helps.
• Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths. (An example of raising those limits is sketched below.)
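A minimal sketch of raising the kernel's auto-tuning ceilings so TCP can actually reach the window sizes computed above. The sysctl names are standard Linux knobs; the 64 MB values are example limits (in the spirit of the fasterdata.es.net guidance), not a recommendation for every host, and writing the file requires root:

    # Write a sysctl drop-in with larger socket-buffer ceilings, then apply with
    # "sysctl --system" (run as root).
    settings = {
        "net.core.rmem_max": 67108864,
        "net.core.wmem_max": 67108864,
        "net.ipv4.tcp_rmem": "4096 87380 67108864",
        "net.ipv4.tcp_wmem": "4096 65536 67108864",
    }
    with open("/etc/sysctl.d/90-wan-tuning.conf", "w") as f:
        for key, value in settings.items():
            f.write(f"{key} = {value}\n")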
32
System software tuning: Host tuning – TCP
[Figure: Throughput out to ~9000 km path length on a 10 Gb/s network, comparing a 32 MBy (auto-tuned) TCP window with a 64 MBy (hand-tuned) window; throughput in Mb/s (0–10,000) vs. round-trip time, corresponding roughly to San Francisco to London.]
33
4.2) System software tuning: Data transfer tools
Parallelism is key in data transfer tools.
– It is much easier to achieve a given performance level with multiple parallel connections than with one connection (see the sketch below).
• This is because the OS is very good at managing multiple threads and less good at sustained, maximum performance of a single thread (the same is true for disks).
– Several tools offer parallel transfers (see below).
Latency tolerance is critical.
– Wide area data transfers have much higher latency than LAN transfers.
– Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds): for example, SCP/SFTP and the HPSS mover protocols work very poorly in long-path networks.
• Disk performance
– In general, a RAID array or parallel disks (as FDT uses) are needed to get more than about 500 Mb/s.
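An illustrative sketch of the parallel-streams idea: fetch one large object over several concurrent TCP connections using HTTP byte ranges. The URL is a placeholder, and real tools (GridFTP, FDT, Globus) do this far more robustly:

    import concurrent.futures, requests

    url = "https://dtn.example.org/big/dataset.tar"   # hypothetical data source
    nstreams = 8

    size = int(requests.head(url, timeout=30).headers["Content-Length"])
    chunk = size // nstreams

    def fetch(i):
        start = i * chunk
        end = size - 1 if i == nstreams - 1 else start + chunk - 1
        r = requests.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=300)
        r.raise_for_status()
        return i, r.content

    with concurrent.futures.ThreadPoolExecutor(max_workers=nstreams) as pool:
        parts = dict(pool.map(fetch, range(nstreams)))

    with open("dataset.tar", "wb") as out:
        for i in range(nstreams):
            out.write(parts[i])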
34
System software tuning: Data transfer tools
Using the right tool is very important.
Sample results, Berkeley, CA to Argonne, IL (RTT = 53 ms, network capacity = 10 Gbps):
• scp: 140 Mbps
• patched scp (HPN): 1.2 Gbps
• ftp: 1.4 Gbps
• GridFTP, 4 streams: 5.4 Gbps
• GridFTP, 8 streams: 6.6 Gbps
Note that to get more than about 1 Gbps (125 MB/s) disk-to-disk requires using RAID technology.
• PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSH: http://www.psc.edu/networking/projects/hpn-ssh
– Significant performance increase
• this helps rsync too
35
System software tuning: Data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems.
– Parallel streams, buffer tuning, and help in getting through firewalls (open ports), ssh, etc.
– The newer Globus Online incorporates all of these and adds small-file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP.
36
System software tuning: Data transfer tools
Also see Caltech's FDT (Faster Data Transfer) approach.
– Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node.
– Explicit parallel use of multiple disks.
– Can fill 100 Gb/s paths.
– See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT
37
4.4) System software tuning: Other issues
Firewalls are anathema to high-speed data flows.
– many firewalls can't handle >1 Gb/s flows:
• designed for large numbers of low bandwidth flows;
• some firewalls even strip out TCP options that allow for TCP buffers > 64 KB.
– See Jason Zurawski's "Say Hello to your Frienemy – The Firewall".
– Stateful firewalls have inherent problems that inhibit high throughput.
• http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues:
– Large MTUs (several issues)
– NIC tuning
• Defaults are usually fine for 1GE, but 10GE often requires additional tuning.
– Other OS tuning knobs
– See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])
5) Site infrastructure to support data-intensive science: The Science DMZ
With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck.
The site network (LAN) typically provides connectivity for the local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science.
– Therefore a high performance interface between the wide area network and the local area site network is critical for large-scale data movement.
Campus network infrastructure is typically not designed to handle the flows of large-scale science.
– The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows:
• firewalls, proxy servers, low-cost switches, and so forth,
• none of which will allow high volume, high bandwidth, long distance data flows.
39
The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large data volume, high round trip time (RTT) (international paths) of the wide area network (WAN) flows (see [DIS]).
– Otherwise the site will impose poor performance on the entire high-speed data path, all the way back to the source.
40
The Science DMZ
The Science DMZ concept:
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy:
– outside the site firewall – hence the term "Science DMZ";
– with dedicated systems built and tuned for wide-area data transfer;
– with test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below);
– with a security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.).
This is so important it was a requirement for the last round of NSF CC-NIE grants.
41
The Science DMZ
[Figure: Science DMZ architecture. The WAN connects to the site border router, which feeds a Science DMZ router/switch (a WAN-capable device) outside the site firewall. The Science DMZ hosts the high performance Data Transfer Node, network monitoring and testing (perfSONAR), and per-service security policy control points, with a clean, high-bandwidth WAN data path to the campus computing cluster; the dedicated systems are built and tuned for wide-area data transfer. Campus/site access to Science DMZ resources is via the site firewall, while the campus/site LAN and the site DMZ (Web, DNS, Mail) use the normal secured path to the Internet.]
(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)
42
6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
• In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery (a minimal sketch of the idea follows).
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers.
• The Tier 2 sites get a comparable amount of data from the Tier 1s:
– they host the physics groups that analyze the data and do the science,
– provide most of the compute resources for analysis,
– and cache the data (though this is evolving to remote I/O).
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management.
– The resources and data movement are centrally managed.
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations.
– The system manages tens of thousands of jobs a day:
• coordinates data movement of hundreds of terabytes/day, and
• manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science.
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial.
44
[Figure: The ATLAS PanDA ("Production and Distributed Analysis") system uses distributed resources and layers of automation to manage several million jobs/day. ATLAS production jobs, regional production jobs, and user/group analysis jobs enter the PanDA server (task management) through a task buffer (job queue); a job broker applies policy (job type, priority), and the job dispatcher – together with a grid scheduler, site capability service, and site status information – dispatches "pilot" jobs (PanDA job receivers that run under the site-specific job manager, e.g. Condor, LSF, LCG, and accept jobs in a standard format from PanDA, similar to the Condor glide-in approach). The Distributed Data Manager (DDM agents; a complex system in its own right, called DQ2) locates data and moves it between sites. The workflow: 1) PanDA schedules jobs and initiates data movement; 2) DDM locates data and moves it to sites; 3) the local resources are prepared to receive PanDA jobs; 4) jobs are dispatched when resources are available and the required data is in place at the site. PanDA tries to move the job to where the data is, else moves data and job to where resources are available. The CERN ATLAS detector feeds the Tier 0 data center (one copy of all data – archival only); the 11 ATLAS Tier 1 data centers, scattered across Europe, North America, and Asia, in aggregate hold one copy of all data and provide the working data set for distribution to Tier 2 centers for analysis; the ATLAS analysis sites (e.g. 70 Tier 2 centers in Europe, North America, and SE Asia) run the jobs. Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point (both are at Brookhaven National Lab).]
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s.
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here).
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
[Figure: accumulated data volume on disk (petabytes over four years) and the two PanDA job types (simultaneous job counts, each plotted over one year).]
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure.
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges".
– Successful testing was required for sites to participate in LHC production.
47
Ramp-up of LHC traffic in ESnet
[Figure: ESnet traffic over time, showing an estimate of the "small"-scale traffic during LHC data system testing, the LHC turn-on, and LHC operation.]
The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.
48
6 cont.) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN.
– The LHCOPN is a collection of leased 10 Gb/s optical circuits.
– The role of the LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.
• The issues related to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance.
– The security issues were the primary ones, and were addressed by using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec]) – that is, only LHC data and compute servers are connected to the OPN.
50
The LHC OPN – Optical Private Network
[Figure: LHCOPN physical topology (abbreviated) and architecture – CH-CERN at the hub, connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network
N.B.:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic.
– (There are about 170 Tier 2 sites.)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems.
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.).
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).
In this way the LHC traffic will use circuits designated by the network engineers:
– to ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.
54
[Figure: LHCONE – a global infrastructure for LHC Tier 1 data center to Tier 2 analysis center connectivity (April 2012). LHCONE VRF domains (ESnet USA, Internet2 USA, CANARIE Canada, GÉANT Europe, NORDUnet Nordic, DFN Germany, GARR Italy, RedIRIS Spain, SARA Netherlands, RENATER France, CUDI Mexico, TWAREN and ASGC Taiwan, KREONET2 and KISTI Korea, India) interconnect end sites – LHC Tier 2 or Tier 3 unless indicated as Tier 1 – including BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1a/T1c, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, and CERN-T1, plus many universities and labs (Harvard, MIT, Caltech, UFlorida, UNeb, PurU, UCSD, UWisc, UltraLight/UMich, SLAC, UVic, SimFraU, UAlb, UTor, McGilU, UNAM, DESY, GSI, INFN-Nap, GRIF-IN2P3, Sub-IN2P3, CEA, TIFR, KNU, NCU, NTU, and others), via regional R&E communication nexus points (Seattle, Chicago, New York, Washington, Geneva, Amsterdam) and data communication links of 10, 20, and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because:
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See lhcone.net.
LHCONE is one part of the network infrastructure that supports the LHC
CERN → Tier 1 distances:
  France: 350 miles / 565 km
  Italy: 570 miles / 920 km
  UK: 625 miles / 1000 km
  Netherlands: 625 miles / 1000 km
  Germany: 700 miles / 1185 km
  Spain: 850 miles / 1400 km
  Nordic: 1300 miles / 2100 km
  USA – New York: 3900 miles / 6300 km
  USA – Chicago: 4400 miles / 7100 km
  Canada – BC: 5200 miles / 8400 km
  Taiwan: 6100 miles / 9850 km
[Figure: the same network-centric view of the LHC shown earlier – detector, Level 1 and 2 triggers, Level 3 trigger, CERN computer center, the LHC Optical Private Network (LHCOPN) to the Tier 1 data centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN), and the LHC Open Network Environment (LHCONE) to the Tier 2 analysis centers and university physics groups; 50 Gb/s combined (25 Gb/s ATLAS, 25 Gb/s CMS) over 500-10,000 km, with the physics groups now getting their data wherever it is most readily available.]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to:
– couple existing pockets of code, data, and expertise into "systems of systems";
– break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites;
– see https://www.es.net/about/science-requirements/
A commonly identified need to support this is that networking must be provided as a "service":
– schedulable, with guaranteed bandwidth – as is done with CPUs and disks;
– traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure;
– some network path characteristics may also be specified – e.g. diversity;
– available in the Web Services / Grid Services paradigm.
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
– This is typically done by using a "static" routing mechanism.
• E.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path.
– MPLS and OpenFlow are examples of this, and both can transport IP packets.
– Most modern Internet routers have this type of functionality.
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic.
– The virtual circuits can be directed to specific physical network paths when they are set up.
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.) An illustrative client-side sketch of what such a service looks like follows.
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service" in the TERENA Networking Conference 2011 paper in the references.
• OSCARS received a 2013 "R&D 100" award.
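A hypothetical sketch of what "network as a service" looks like to a client: reserve a bandwidth-guaranteed circuit between two endpoints for a time window. The URL and JSON fields below are invented for illustration only; the real OSCARS and NSI interfaces differ, so consult their documentation:

    import time, requests

    reservation = {
        "src": "site-a.example.net:port-1",          # hypothetical endpoints
        "dst": "site-b.example.net:port-7",
        "bandwidth_mbps": 10000,
        "start": int(time.time()) + 3600,            # one hour from now
        "end":   int(time.time()) + 5 * 3600,        # four-hour reservation
    }
    r = requests.post("https://circuit-service.example.net/reservations",
                      json=reservation, timeout=30)
    r.raise_for_status()
    print("reservation id:", r.json().get("id"))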
60
End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part.
• How are the circuits used?
– End system to end system, IP:
• Almost never – very hard unless private address space is used.
– Using public address space can result in leaking routes.
– Using private address space with multi-homed hosts risks allowing backdoors into secure networks.
– End system to end system, Ethernet (or other) over VLAN – a pseudowire:
• Relatively common.
• Interesting example: RDMA over VLAN is likely to be popular in the future.
– The SC11 demo of 40G RDMA over WAN was very successful.
– CPU load for RDMA is a small fraction of that for IP.
– The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks).
– Point-to-point connection between routing instances – e.g. BGP at the end points:
• Essentially this is how all current circuits are used, from one site router to another site router.
– Typically site-to-site, or to advertise subnets that host clusters, e.g. LHC analysis or data management clusters.
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot.
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering.
Cross-Domain Virtual Circuit Service
• Network domains (administrative units):
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits.
– E.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.
[Figure: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] across ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local inter-domain controller (IDC) – OSCARS in ESnet, AutoBAHN in GÉANT. 1) The domains exchange topology information containing at least the potential VC ingress and egress points; 2) a VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved; 3) the data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process at each domain ingress/egress point.]
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system.
• Multi-domain circuit setup is not yet a robust production service, but progress is being made.
• See lhcone.net.
65
8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science.
– With each generation of network transport technology:
• 155 Mb/s was the norm for high speed networks in 1995;
• 100 Gb/s – 650 times greater – is the norm today.
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to:
– first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
– and then do the development necessary for applications to make use of the new capabilities.
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths;
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s.
66
Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations.
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
69
Infrastructure Critical to Science
• The combination of:
– new network architectures in the wide area,
– new network services (such as guaranteed bandwidth virtual circuits),
– cross-domain network error detection and correction,
– redesigning the site LAN to handle high data throughput,
– automation of data movement systems,
– use of appropriate operating system tuning and data transfer tools,
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated/sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.
71
LHC lessons of possible use to the SKA
The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
– A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
– The technical aspects of building and operating a centralized working data repository:
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time,
• high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites,
militate against a single large data center.
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well.
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.
Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
75
The Message
Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach", Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer", Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document".
[NetServ] "Network Services for High Performance Distributed Computing and Data Management", W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System", Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management", W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service", William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework", B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing", K. Roberts, D. Beckett, D. Boertjes, J. Berthold, C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
8
HEP as a Prototype for Data-Intensive Sciencebull Two major proton experiments (detectors) at the LHC ATLAS
and CMSbull ATLAS is designed to observe a billion (1x109) collisionssec
with a data rate out of the detector of more than 1000000 Gigabytessec (1 PBys)
bull A set of hardware and software filters at the detector reduce the output data rate to about 25 Gbs that must be transported managed and analyzed to extract the sciencendash The output data rate for CMS is about the same for a combined
50 Gbs that is distributed to physics groups around the world 7x24x~9moyr
The LHC data management model involves a world-wide collection of centers that store, manage, and analyze the data.
[Figure: A Network Centric View of the LHC (one of two detectors). The detector output (1 PB/s – 8 Pb/s) passes through the Level 1 and 2 triggers at O(1-10) meters, the Level 3 trigger at O(10-100) meters, and the CERN Computer Center (LHC Tier 0) at O(1) km. From there, 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) flows over the LHC Optical Private Network (LHCOPN) to the LHC Tier 1 Data Centers 500-10,000 km away (Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN). The Tier 1 centers hold the working data (tape: 115 PBy, disk: 60 PBy, cores: 68,000). The LHC Open Network Environment (LHCONE) connects them to the LHC Tier 2 Analysis Centers, which are data caches and analysis sites (disk: 120 PBy, cores: 175,000; WLCG 2012) serving the universities/physics groups, with roughly 3X data outflow vs. inflow. The diagram indicates that the physics groups now get their data wherever it is most readily available.]
10
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s.
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here).
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
[Charts: accumulated data volume on disk rising to ~150 Petabytes over four years at 730 TBytes/day; type 1 and type 2 job counts of roughly 50,000–100,000 over one year.]
11
HEP as a Prototype for Data-Intensive Science
The capabilities required to support this scale of data movement involve hardware and software developments at all levels:
1. The underlying network
   1a. Optical signal transport
   1b. Network routers and switches
2. Data transport (TCP is a "fragile workhorse" but still the norm)
3. Network monitoring and testing
4. Operating system evolution
5. New site and network architectures
6. Data movement and management techniques and software
7. New network services
• Technology advances in these areas have resulted in today's state-of-the-art that makes it possible for the LHC experiments to routinely and continuously move data at ~150 Gb/s across three continents.
12
HEP as a Prototype for Data-Intensive Science
• ESnet has been collecting requirements for all DOE science disciplines and instruments that rely on the network for distributed data management and analysis for more than a decade, and formally since 2007 [REQ].
• In this process, certain issues are seen across essentially all science disciplines that rely on the network for significant data transfer, even if the quantities are modest compared to projects like the LHC experiments.
• Therefore, addressing the LHC issues is a useful exercise that can benefit a wide range of science disciplines.
SKA data flow model is similar to the LHC
[Figure (hypothetical, based on the LHC experience): receptors/sensors feed the correlator / data processor at 93–168 Pb/s over ~200 km average distances; the correlator feeds the supercomputer at 400 Tb/s over ~1000 km; the supercomputer sends 100 Gb/s (from the SKA RFI) to a European distribution point ~25,000 km away (Perth to London via USA) or ~13,000 km (South Africa to London); regional data centers then serve the universities/astronomy groups. These numbers are based on modeling prior to splitting the SKA between S. Africa and Australia.]
14
Foundations of data-intensive science
• This talk looks briefly at the nature of the advances in technologies, software, and methodologies that have enabled LHC data management and analysis.
• The points 1a and 1b on optical transport and router technology are included in the slides for completeness, but I will not talk about them. They were not really driven by the needs of the LHC, but they were opportunistically used by the LHC.
• Much of the remainder of the talk is a tour through ESnet's network performance knowledge base (fasterdata.es.net).
– Also included are:
• the LHC ATLAS data management and analysis approach that generates and relies on very large network data utilization,
• and an overview of how R&E networks have evolved to accommodate the LHC traffic.
1) Underlying network issues
At the core of our ability to transport the volume of data that we must deal with today, and to accommodate future growth, are advances in optical transport technology and router technology.
We face a continuous growth of data to transport: ESnet has seen exponential growth in our traffic every year since 1990 (our traffic grows by a factor of 10 about once every 47 months).
[Chart: ESnet accepted traffic, Petabytes/month (0–15), over 13 years.]
16
We face a continuous growth of data transport
• The LHC data volume is predicted to grow 10-fold over the next 10 years.
• New generations of instruments – for example, the Square Kilometer Array radio telescope and ITER (the international fusion experiment) – will generate more data than the LHC.
• In response, ESnet and most large R&E networks have built 100 Gb/s (per optical channel) networks.
– ESnet's new network – ESnet5 – is complete and provides 44 x 100 Gb/s (4.4 terabits/sec = 4400 gigabits/sec) in optical channels across the entire ESnet national footprint.
– Initially, one of these 100 Gb/s channels is configured to replace the current 4 x 10 Gb/s IP network.
• What has made this possible?
17
1a) Optical Network Technology
Modern optical transport systems (DWDM = dense wave division multiplexing) use a collection of technologies called "coherent optical" processing to achieve more sophisticated optical modulation, and therefore higher data density per signal transport unit (symbol), providing 100 Gb/s per wave (optical channel).
– Optical transport uses dual polarization-quadrature phase shift keying (DP-QPSK) technology with coherent detection [OIF1].
• dual polarization
– two independent optical signals, same frequency, on two orthogonal polarizations → reduces the symbol rate by half
• quadrature phase shift keying
– encodes data by changing the signal phase relative to the optical carrier → further reduces the symbol rate by half (sends twice as much data per symbol)
Together, DP and QPSK reduce the required symbol rate by a factor of 4.
– This allows a 100G payload (plus overhead) to fit into 50 GHz of spectrum.
• The actual transmission rate is about 10% higher to include FEC data.
– This is a substantial simplification of the optical technology involved – see the TNC 2013 paper and Chris Tracy's NANOG talk for details [Tracy1] and [Rob1].
Optical Network Technology
The ESnet5 optical network (built jointly with Internet2) uses Ciena WaveLogic™ coherent technology to provide 100 Gb/s per wave.
– 88 waves (optical channels), 100 Gb/s each
• wave capacity shared equally with Internet2
– ~13,000 miles / 21,000 km of lit fiber
– 280 optical amplifier sites
– 70 optical add/drop sites (where routers can be inserted)
• 46 100G add/drop transponders
• 22 100G re-gens across the wide-area
[Map: the ESnet5 optical footprint across the US (optical add/drop and amplifier sites from SEAT and SUNN to NEWY and ATLA), shared with Internet2, plus the Long Island MAN and ANI Testbed. Geography is only representational.]
19
1b) Network routers and switches
ESnet5 routing (IP layer 3) is provided by Alcatel-Lucent 7750 routers with 100 Gb/s client interfaces.
– 17 routers with 100G interfaces
• several more in a test environment
– 59 layer-3 100 GigE interfaces; 8 customer-owned 100G routers
– 7 100G interconnects with other R&E networks at Starlight (Chicago), MAN LAN (New York), and Sunnyvale (San Francisco Bay Area)
20
[Map: The Energy Sciences Network, ESnet5 (Fall 2013). Legend: ESnet routers, site routers, 100G / 10-40G / 1G and site-provided circuits, metro area circuits, optical-only segments, commercial peerings, US and international R&E peerings, and the SUNN-STAR-AOFA-AMST 100G testbed (SF Bay Area, Chicago, New York, Amsterdam). DOE labs and user facilities (e.g. LBNL, NERSC, SLAC, JGI, LLNL, SNLL, GA, SDSC, LANL, SNLA, ANL, FNAL, AMES, ORNL, BNL, PPPL, JLAB, MIT/PSFC, INL, PNNL, NREL, LIGO, SREL) connect at 1-100 Gb/s. Geographical representation is approximate.]
2) Data transport: The limitations of TCP must be addressed for large, long-distance flows
Although there are other transport protocols available, TCP remains the workhorse of the Internet, including for data-intensive science.
Using TCP to support the sustained, long distance, high data-rate flows of data-intensive science requires an error-free network.
Why error-free?
TCP is a "fragile workhorse": it is very sensitive to packet loss (due to bit errors).
– Very small packet loss rates on these paths result in large decreases in performance.
– A single bit error will cause the loss of a 1-9 KBy packet (depending on the MTU size), as there is no FEC at the IP level for error correction.
• This puts TCP back into "slow start" mode, thus reducing throughput.
22
Transport
• The reason for TCP's sensitivity to packet loss is the slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internet.
– Packet loss is seen by TCP's congestion control algorithms as evidence of congestion, so they activate to slow down the senders and prevent their synchronization (which would perpetuate and amplify the congestion, leading to network throughput collapse).
– Network link errors also cause packet loss, so these congestion avoidance algorithms come into play, with dramatic effect on throughput in the wide area network – hence the need for "error-free" paths.
23
Transport: Impact of packet loss on TCP
On a 10 Gb/s LAN path, the impact of low packet loss rates is minimal.
On a 10 Gb/s WAN path, the impact of low packet loss rates is enormous (~80X throughput reduction on a transatlantic path).
Implication: Error-free paths are essential for high-volume, long-distance data transfers.
[Chart: Throughput (Mb/s, 0–10,000) vs. increasing network round trip time (ms, corresponding roughly to San Francisco to London) on a 10 Gb/s link with 0.0046% packet loss, comparing no packet loss, H-TCP (measured), Reno (measured), and Reno (theory).]
(See http://fasterdata.es.net/performance-testing/perfsonar/troubleshooting/packet-loss)
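The scale of this effect can be illustrated with the widely used Mathis et al. model for Reno-style TCP, which bounds throughput at roughly MSS x 1.22 / (RTT x sqrt(loss)). The following is a minimal sketch of that arithmetic, not a result from the talk; the 9000-byte MSS is an assumption, and the 0.0046% loss rate mirrors the chart above.

# Illustrative sketch (not from the talk): the Mathis et al. model for
# Reno-style TCP bounds throughput at ~ MSS * 1.22 / (RTT * sqrt(loss)).
# The 9000-byte (jumbo-frame) MSS is an assumption.

from math import sqrt

def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_rate: float) -> float:
    """Approximate upper bound on Reno TCP throughput in bits/sec."""
    return (mss_bytes * 8) * 1.22 / (rtt_s * sqrt(loss_rate))

mss = 9000          # bytes (assumption)
loss = 0.000046     # 0.0046% packet loss, as in the chart above
for rtt_ms in (1, 10, 88):   # LAN, metro, ~San Francisco-London
    bps = mathis_throughput_bps(mss, rtt_ms / 1000.0, loss)
    print(f"RTT {rtt_ms:3d} ms -> ~{bps / 1e9:6.2f} Gb/s ceiling")

At 1 ms RTT the ceiling is above the 10 Gb/s link rate (minimal impact), while at ~88 ms it falls to a small fraction of a 10 Gb/s path, consistent with the ~80X reduction quoted above.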
24
Transport: Modern TCP stack
• A modern TCP stack (the kernel implementation of the TCP protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk]).
– This is done using mechanisms that more quickly increase back to full speed after an error forces a reset to low bandwidth.
[Chart: "Binary Increase Congestion" (BIC) control algorithm impact. Note that BIC reaches max throughput much faster than older algorithms. (From Linux 2.6.19 the default is CUBIC, a refined version of BIC designed for high bandwidth, long paths.)]
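On a Linux host, the congestion control algorithm in use can be inspected through the standard sysctl files. A minimal sketch, assuming a Linux host (these paths are generic Linux, not specific to the systems discussed here); changing the value requires root privileges, so this sketch only reads:

# Minimal sketch (assumes a Linux host): inspect which TCP congestion
# control algorithm the kernel is using, and which are available.

from pathlib import Path

SYSCTL = Path("/proc/sys/net/ipv4")

current = (SYSCTL / "tcp_congestion_control").read_text().strip()
available = (SYSCTL / "tcp_available_congestion_control").read_text().split()

print(f"current congestion control : {current}")
print(f"available algorithms       : {', '.join(available)}")
# To switch (as root):  echo cubic > /proc/sys/net/ipv4/tcp_congestion_control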
25
Transport: Modern TCP stack
Even modern TCP stacks are only of some help in the face of packet loss on a long path, high-speed network.
• For a detailed analysis of the impact of packet loss on various TCP implementations, see "An Investigation into Transport Protocols and Data Transport Applications Over High Performance Networks," chapter 8 ("Systematic Tests of New-TCP Behaviour"), by Yee-Ting Li, University College London (PhD thesis): http://www.slac.stanford.edu/~ytl/thesis.pdf
[Chart (tail zoom): Throughput (Mb/s, 0–1,000) vs. increasing roundtrip time (ms, corresponding roughly to San Francisco to London) on a 10 Gb/s link with 0.0046% packet loss, comparing H-TCP (CUBIC refinement, measured), Reno (measured), and Reno (theory).]
26
3) Monitoring and testing
The only way to keep multi-domain, international scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction.
perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.).
• perfSONAR is a community effort to:
– define network management data exchange protocols, and
– standardize measurement data formats, gathering, and archiving.
perfSONAR is deployed extensively throughout LHC related networks and international networks, and at the end sites. (See [fasterdata], [perfSONAR], and [NetSrv].)
– There are now more than 1000 perfSONAR boxes installed in N. America and Europe.
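perfSONAR itself schedules, archives, and publishes these measurements across domains. Purely as an illustration of the underlying idea of continuous, low-rate active testing (this is not perfSONAR or its API; the host name and threshold are hypothetical), a host could periodically probe a remote test point and flag loss trends:

# Illustrative sketch only -- NOT perfSONAR. Shows the idea of a periodic,
# low-rate active loss test toward a remote measurement host.
# Host name and alarm threshold are hypothetical.

import re, subprocess, time

TARGET = "perfsonar.example.org"   # hypothetical remote test point
ALARM_LOSS_PCT = 0.001             # soft-failure threshold (assumption)

def probe_loss(host: str, count: int = 100) -> float:
    """Return the packet loss percentage reported by ping."""
    out = subprocess.run(["ping", "-q", "-c", str(count), host],
                         capture_output=True, text=True).stdout
    m = re.search(r"([\d.]+)% packet loss", out)
    return float(m.group(1)) if m else 100.0

while True:
    loss = probe_loss(TARGET)
    print(f"{time.strftime('%H:%M:%S')}  loss={loss}%")
    if loss > ALARM_LOSS_PCT:
        print("  -> possible soft failure; investigate path segments")
    time.sleep(600)   # repeat every 10 minutes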
27
perfSONAR
The test and monitor functions can detect soft errors that limit throughput and can be hard to find (hard errors/faults are easily found and corrected).
Soft failure example:
• Observed end-to-end performance degradation due to soft failure of a single optical line card.
[Chart: Gb/s over one month, showing normal performance, degrading performance, and recovery after repair.]
• Why not just rely on "SNMP" interface stats for this sort of error detection?
• not all error conditions show up in SNMP interface statistics
• SNMP error statistics can be very noisy
• some devices lump different error counters into the same bucket, so it can be very challenging to figure out what errors to alarm on and what errors to ignore
• though ESnet's Spectrum monitoring system attempts to apply heuristics to do this
• many routers will silently drop packets – the only way to find that is to test through them and observe loss using devices other than the culprit device
28
perfSONAR
The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains.
It provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe.
– ESnet has perfSONAR testers installed at every PoP and all but the smallest user sites – Internet2 is close to the same.
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages.
29
4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network.
• Host TCP tuning
• Modern TCP stack (see above)
• Other issues (MTU, etc.)
• Data transfer tools and parallelism
• Other data transfer issues (firewalls, etc.)
30
4.1) System software tuning: Host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length.
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using end-to-end.
• Default TCP buffer sizes are typically much too small for today's high speed networks.
– Until recently, default TCP send/receive buffers were typically 64 KB.
– Tuned buffer to fill a CA to NY 1 Gb/s path: 10 MB
• 150X bigger than the default buffer size
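The required buffer is just the bandwidth-delay product (BDP) of the path. A small sketch of that arithmetic; the 80 ms RTT is an assumption chosen to match the ~10 MB CA-to-NY figure above, and the second example reuses the San Francisco-London path from earlier slides:

# Sketch: the TCP buffer needed to keep a path "full" is the
# bandwidth-delay product (BDP).  RTT values below are assumptions.

def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product in bytes."""
    return bandwidth_bps * rtt_s / 8

for gbps, rtt_ms, label in [(1, 80, "CA -> NY, 1 Gb/s"),
                            (10, 88, "SF -> London, 10 Gb/s")]:
    mb = bdp_bytes(gbps * 1e9, rtt_ms / 1000.0) / 1e6
    print(f"{label:22s}: buffer ~ {mb:6.1f} MB (default 64 KB is far too small)")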
31
System software tuning: Host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications.
– How to tune is a function of the application and the path to the destination, so potentially a lot of special cases.
• Auto-tuning TCP connection buffer size within pre-configured limits helps.
• Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths.
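Where the auto-tuning ceiling is too low, an application can still request larger per-socket buffers explicitly through the standard sockets API. A minimal sketch (the 32 MB size is an assumption for a long international path); note the kernel's configured maxima still cap what is actually granted, so host tuning remains necessary:

# Sketch: request large per-socket TCP buffers explicitly.  The kernel
# still clamps these to its configured maxima (e.g. net.core.rmem_max /
# net.core.wmem_max on Linux), so host tuning remains necessary.

import socket

BUF = 32 * 1024 * 1024   # 32 MB, an assumption for a long international path

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF)

granted_snd = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
granted_rcv = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(f"requested {BUF} bytes; kernel granted snd={granted_snd}, rcv={granted_rcv}")
s.close()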
32
System software tuning: Host tuning – TCP
[Chart: Throughput (Mb/s, 0–10,000) out to ~9000 km path length on a 10 Gb/s network (roundtrip time corresponds roughly to San Francisco to London): 32 MBy (auto-tuned) vs. 64 MBy (hand-tuned) TCP window size.]
33
4.2) System software tuning: Data transfer tools
Parallelism is key in data transfer tools.
– It is much easier to achieve a given performance level with multiple parallel connections than with one connection.
• This is because the OS is very good at managing multiple threads and less good at sustained, maximum performance of a single thread (the same is true for disks).
– Several tools offer parallel transfers (see below).
Latency tolerance is critical.
– Wide area data transfers have much higher latency than LAN transfers.
– Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds): for example, SCP/SFTP and the HPSS mover protocols work very poorly in long path networks.
• Disk performance
– In general, a RAID array or parallel disks (as in FDT) are needed to get more than about 500 Mb/s.
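As a toy illustration of why tool-level parallelism helps (this is not GridFTP or FDT): split an object into byte ranges and move the ranges over several concurrent connections, so no single TCP stream has to sustain the full rate. The URL, part size, and stream count are hypothetical, and the source server is assumed to support HTTP range requests.

# Toy sketch of tool-level parallelism (not GridFTP/FDT): fetch a large
# object as N byte-ranges over concurrent connections and reassemble.
# URL, part size, and stream count are hypothetical.

import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "https://data.example.org/dataset.bin"   # hypothetical source
STREAMS = 8
PART = 64 * 1024 * 1024                        # 64 MB per range (assumption)

def fetch_range(start: int, end: int) -> bytes:
    req = urllib.request.Request(URL, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as r:
        return r.read()

def total_size() -> int:
    with urllib.request.urlopen(urllib.request.Request(URL, method="HEAD")) as r:
        return int(r.headers["Content-Length"])

size = total_size()
ranges = [(i, min(i + PART, size) - 1) for i in range(0, size, PART)]
with ThreadPoolExecutor(max_workers=STREAMS) as pool, open("dataset.bin", "wb") as out:
    for chunk in pool.map(lambda r: fetch_range(*r), ranges):
        out.write(chunk)   # map() preserves range order, so writes are sequential
print(f"fetched {size} bytes over {STREAMS} parallel connections")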
34
System software tuning: Data transfer tools
Using the right tool is very important.
Sample results: Berkeley, CA to Argonne, IL (RTT = 53 ms, network capacity = 10 Gbps)
  Tool                    Throughput
  scp                     140 Mbps
  patched scp (HPN)       1.2 Gbps
  ftp                     1.4 Gbps
  GridFTP, 4 streams      5.4 Gbps
  GridFTP, 8 streams      6.6 Gbps
Note that to get more than about 1 Gbps (125 MB/s) disk to disk requires using RAID technology.
• PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSH: http://www.psc.edu/networking/projects/hpn-ssh
– Significant performance increase
• this helps rsync too
35
System software tuning: Data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems.
• Parallel streams, buffer tuning, and help in getting through firewalls (open ports), ssh, etc.
• The newer Globus Online incorporates all of these and adds small file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP.
36
System software tuning: Data transfer tools
Also see Caltech's FDT (Fast Data Transfer) approach.
– Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node.
– Explicit parallel use of multiple disks.
– Can fill 100 Gb/s paths.
– See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT
37
4.4) System software tuning: Other issues
Firewalls are anathema to high-speed data flows.
– Many firewalls can't handle >1 Gb/s flows.
• designed for large numbers of low bandwidth flows
• some firewalls even strip out TCP options that allow for TCP buffers > 64 KB
• See Jason Zurawski's "Say Hello to your Frienemy – The Firewall."
– Stateful firewalls have inherent problems that inhibit high throughput.
• http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues:
– Large MTUs (several issues)
– NIC tuning
• Defaults are usually fine for 1GE, but 10GE often requires additional tuning
– Other OS tuning knobs
– See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])
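One of the "large MTU" checks is simply confirming which interfaces are configured for jumbo frames, since every device along the path must agree. A minimal sketch, assuming a Linux host; interface names will of course vary:

# Sketch (Linux): list interface MTUs -- jumbo frames (MTU 9000) matter
# for high-speed paths, but only help if the whole path supports them.

from pathlib import Path

NET = Path("/sys/class/net")
for iface in sorted(p.name for p in NET.iterdir()):
    mtu = int((NET / iface / "mtu").read_text())
    note = "jumbo" if mtu >= 9000 else "standard"
    print(f"{iface:12s} MTU {mtu:5d}  ({note})")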
5) Site infrastructure to support data-intensive science: The Science DMZ
With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck.
The site network (LAN) typically provides connectivity for local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science.
– Therefore a high performance interface between the wide area network and the local area site network is critical for large-scale data movement.
Campus network infrastructure is typically not designed to handle the flows of large-scale science.
– The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows:
• firewalls, proxy servers, low-cost switches, and so forth,
• none of which will allow high volume, high bandwidth, long distance data flows.
39
The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large data volume, high round trip time (RTT) (international paths) of the wide area network (WAN) flows (see [DIS]).
– Otherwise the site will impose poor performance on the entire high speed data path, all the way back to the source.
40
The Science DMZ
The Science DMZ concept:
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy:
• Outside the site firewall – hence the term "Science DMZ"
• With dedicated systems built and tuned for wide-area data transfer
• With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below)
• With a security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.)
This is so important it was a requirement for the last round of NSF CC-NIE grants.
41
The Science DMZ
[Diagram: the border router connects the WAN to a Science DMZ router/switch (a WAN-capable device) with a clean, high-bandwidth WAN data path to a high performance Data Transfer Node and a computing cluster, plus network monitoring and testing systems and per-service security policy control points. The Data Transfer Nodes are dedicated systems built and tuned for wide-area data transfer. The campus/site LAN, the Site DMZ (Web, DNS, Mail), and secured campus/site access to the Internet sit behind the site firewall; campus/site access to Science DMZ resources is via the site firewall.]
(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)
42
6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers.
• The Tier 2 sites get a comparable amount of data from the Tier 1s.
– Host the physics groups that analyze the data and do the science
– Provide most of the compute resources for analysis
– Cache the data (though this is evolving to remote I/O)
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management.
– The resources and data movement are centrally managed.
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations.
– The system manages 10s of thousands of jobs a day:
• coordinates data movement of hundreds of terabytes/day, and
• manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science.
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial.
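Purely as a conceptual illustration of the brokering pattern just described (this is not PanDA code; the sites, dataset names, and policy are invented): jobs are matched to sites that already hold the needed dataset, and data movement is requested only when no such site has free capacity.

# Conceptual sketch of the "move the job to the data, else move the data"
# pattern described above.  NOT PanDA; names and policy are invented.

from dataclasses import dataclass

@dataclass
class Site:
    name: str
    free_slots: int
    datasets: set

def broker(job_dataset: str, sites: list) -> tuple:
    """Return (site, needs_transfer) for one analysis job."""
    # 1) Prefer a site that already holds the dataset and has free slots.
    for s in sites:
        if job_dataset in s.datasets and s.free_slots > 0:
            return s, False
    # 2) Otherwise pick the site with the most free slots and schedule a
    #    dataset replication to it before the job starts.
    best = max(sites, key=lambda s: s.free_slots)
    return best, True

sites = [Site("BNL", 0, {"data15.A"}),
         Site("CERN", 120, {"data15.B"}),
         Site("FZK", 40, {"data15.A"})]

site, move = broker("data15.A", sites)
print(f"run at {site.name}; transfer needed: {move}")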
44
[Diagram: The ATLAS PanDA "Production and Distributed Analysis" system uses distributed resources and layers of automation to manage several million jobs/day. ATLAS production jobs, regional production jobs, and user/group analysis jobs enter a task buffer (job queue) in the PanDA Server (task management) at CERN; a job broker applies policy (job type, priority), and a job dispatcher, data service, and Distributed Data Manager agents (DDM, a complex system in its own right called DQ2) coordinate with a grid scheduler and site capability/status services. The CERN ATLAS detector / Tier 0 Data Center holds 1 copy of all data (archival only). The ATLAS Tier 1 Data Centers (11 sites scattered across Europe, North America, and Asia) in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis; the ATLAS analysis sites are, e.g., 70 Tier 2 Centers in Europe, North America, and SE Asia.
Steps: 1) PanDA schedules jobs and initiates data movement, trying to move the job to where the data is, else moving data and job to where resources are available. 2) DDM locates data and moves it to sites. 3) The local resources are prepared to receive PanDA jobs. 4) Jobs are dispatched when there are resources available and when the required data is in place at the site.
Job resource manager: PanDA dispatches a "pilot" job manager – a PanDA job receiver – when resources are available at a site. Pilots run under the local site job manager (e.g. Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA, similar to the Condor Glide-in approach.
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s.
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here).
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
[Charts: accumulated data volume on disk rising to ~150 Petabytes over four years at 730 TBytes/day; type 1 and type 2 job counts of roughly 50,000–100,000 over one year.]
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure.
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges."
– Successful testing was required for sites to participate in LHC production.
47
Ramp-up of LHC traffic in ESnet
[Chart: ESnet traffic over time, showing an estimate of "small" scale traffic during LHC data system testing, then LHC turn-on, then LHC operation. The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.]
48
6 cont.) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN.
– The LHCOPN is a collection of leased 10 Gb/s optical circuits.
– The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.
• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance.
• The security issues were the primary ones, and were addressed by:
• using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
– that is, only LHC data and compute servers are connected to the OPN.
50
The LHC OPN – Optical Private Network
[Diagram: LHCOPN physical topology (abbreviated) and LHCOPN architecture. CH-CERN connects to the Tier 1 centers: UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network
NB:
• In 2005, the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic.
– (There are about 170 Tier 2 sites.)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems.
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GEANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.).
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).
In this way the LHC traffic will use circuits designated by the network engineers.
– This ensures continued good performance for the LHC and ensures that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.
54
[Map: LHCONE: a global infrastructure for LHC Tier 1 data center – Tier 2 analysis center connectivity (April 2012). LHCONE VRF domains – ESnet (USA), Internet2 (USA), CANARIE (Canada), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), GÉANT (Europe), TWAREN and ASGC (Taiwan), KERONET2 and KISTI (Korea), CUDI (Mexico), and India – interconnect end sites (LHC Tier 2 or Tier 3 unless indicated as Tier 1, e.g. BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1) via regional R&E communication nexus points and data communication links of 10, 20, and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because:
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See LHCONE.net
LHCONE is one part of the network infrastructure that supports the LHC
[Diagram: A Network Centric View of the LHC. The detector output (1 PB/s) passes through the Level 1 and 2 triggers (O(1-10) m), the Level 3 trigger (O(10-100) m), and the CERN Computer Center (O(1) km); 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) then flows over the LHC Optical Private Network (LHCOPN) to the LHC Tier 1 Data Centers 500-10,000 km away (Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN), and from there over the LHC Open Network Environment (LHCONE) to the LHC Tier 2 Analysis Centers and the universities/physics groups, which now get their data wherever it is most readily available.]
Approximate CERN → Tier 1 distances:
  France            350 miles /   565 km
  Italy             570 miles /   920 km
  UK                625 miles /  1000 km
  Netherlands       625 miles /  1000 km
  Germany           700 miles /  1185 km
  Spain             850 miles /  1400 km
  Nordic           1300 miles /  2100 km
  USA – New York   3900 miles /  6300 km
  USA – Chicago    4400 miles /  7100 km
  Canada – BC      5200 miles /  8400 km
  Taiwan           6100 miles /  9850 km
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to:
– Couple existing pockets of code, data, and expertise into "systems of systems"
– Break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
– See https://www.es.net/about/science-requirements/
A commonly identified need to support this is that networking must be provided as a "service":
– Schedulable, with guaranteed bandwidth – as is done with CPUs and disks
– Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
– Some network path characteristics may also be specified – e.g. diversity
– Available in the Web Services / Grid Services paradigm
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
– This is typically done by using a "static" routing mechanism.
• E.g. some variation of label based switching, with the static switch tables set up in advance to define the circuit path.
– MPLS and OpenFlow are examples of this, and both can transport IP packets.
– Most modern Internet routers have this type of functionality.
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic.
– The virtual circuits can be directed to specific physical network paths when they are set up.
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead: Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference, 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.
60
End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part.
• How are the circuits used?
– End system to end system, IP:
• Almost never – very hard unless private address space is used.
– Using public address space can result in leaking routes.
– Using private address space with multi-homed hosts risks allowing backdoors into secure networks.
– End system to end system, Ethernet (or other) over VLAN – a pseudowire:
• Relatively common.
• Interesting example: RDMA over VLAN is likely to be popular in the future.
– The SC11 demo of 40G RDMA over WAN was very successful.
– CPU load for RDMA is a small fraction of that for IP.
– The guaranteed network characteristics of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fit nicely with circuit services (RDMA performs very poorly on best effort networks).
– Point-to-point connection between routing instances – e.g. BGP at the end points:
• Essentially this is how all current circuits are used, from one site router to another site router.
– Typically site-to-site, or to advertise subnets that host clusters, e.g. LHC analysis or data management clusters.
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot.
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering.
Cross-Domain Virtual Circuit Service
Network domains (administrative units):
– For a circuit service to be useful, it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits.
– E.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.
[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] through ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local Inter-Domain Controller (OSCARS in ESnet, AutoBAHN in GEANT); topology exchange and VC setup requests pass from domain to domain, with a data plane connection helper at each domain ingress/egress point.
1. The domains exchange topology information containing at least potential VC ingress and egress points. 2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved. 3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process.]
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grid framework, so that computing, data access, and data movement can all work together as a predictable system.
• Multi-domain circuit setup is not yet a robust production service, but progress is being made.
• See lhcone.net
65
8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science.
– With each generation of network transport technology:
• 155 Mb/s was the norm for high speed networks in 1995
• 100 Gb/s – 650 times greater – is the norm today
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to:
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
• and then do the development necessary for applications to make use of the new capabilities.
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
66
Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base
http://fasterdata.es.net topics:
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations.
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
69
Infrastructure Critical to Science
• The combination of:
– new network architectures in the wide area,
– new network services (such as guaranteed bandwidth virtual circuits),
– cross-domain network error detection and correction,
– redesigning the site LAN to handle high data throughput,
– automation of data movement systems,
– use of appropriate operating system tuning and data transfer tools,
now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated/sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.
71
LHC lessons of possible use to the SKA
The lessons:
The science data product (output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
– A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
– The technical aspects of building and operating a centralized working data repository –
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time,
• high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites –
militate against a single large data center.
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well.
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA
Regardless of distributed vs. centralized working data repository, all of the attendant network lessons will apply.
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say, in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded.
• All high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
• New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
Re-engineering the site LAN-WAN architecture is critical: the Science DMZ.
Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
75
The Message
Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://www.es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," C. Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://www.es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://www.es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://www.es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://www.es.net/news-and-publications/publications-and-presentations/
Also http://www.perfsonar.net and http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
The LHC data management model involves a world-wide collection of centers that store manage and analyze the data
CERN Computer Center
The LHC Optical Private Network
(LHCOPN)
LHC Tier 1Data Centers
LHC Tier 2 Analysis Centers
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups Universities
physicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
The LHC Open Network
Environment(LHCONE)
50 Gbs (25Gbs ATLAS 25Gbs CMS)
detector
Level 1 and 2 triggers
Level 3 trigger
O(1-10) meter
O(10-100) meters
O(1) km
1 PBs ndash 8 Pbs
500-10000 km
This is intended to indicate that the physics
groups now get their datawherever it is most readily
available
A Network Centric View of the LHC(one of two detectors)
LHC Tier 0
Taiwan Canada USA-Atlas USA-CMS
Nordic
UKNetherlands Germany Italy
Spain
FranceCERN
Tier 1 centers hold working data
Tape115 PBy
Disk60 PBy
Cores68000
Tier 2 centers are data caches and analysis sites
0
(WLCG
120 PBy
2012)
175000
3 X data outflow vs inflow
10
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe N America and SE Asia generate network
data movement of 730 TByday ~68Gbs
Accumulated data volume on disk
730 TBytesday
PanDA manages 120000ndash140000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movementgoing on 24 hrday 9+ monthsyr
that networks must support in order to enable the large-scale science of the LHC
0
50
100
150
Peta
byte
s
four years
0
50000
100000
type
2jo
bs
0
50000
type
1jo
bs
one year
one year
one year
11
HEP as a Prototype for Data-Intensive ScienceThe capabilities required to support this scale of data
movement involve hardware and software developments at all levels1 The underlying network
1a Optical signal transport1b Network routers and switches
2 Data transport (TCP is a ldquofragile workhorserdquo but still the norm)3 Network monitoring and testing4 Operating system evolution5 New site and network architectures6 Data movement and management techniques and software7 New network services
bull Technology advances in these areas have resulted in todayrsquos state-of-the-art that makes it possible for the LHC experiments to routinely and continuously move data at ~150 Gbs across three continents
12
HEP as a Prototype for Data-Intensive Sciencebull ESnet has been collecting requirements for all DOE science
disciplines and instruments that rely on the network for distributed data management and analysis for more than a decade and formally since 2007 [REQ] In this process certain issues are seen across essentially all science
disciplines that rely on the network for significant data transfer even if the quantities are modest compared to project like the LHC experiments
Therefore addressing the LHC issues is a useful exercise that can benefit a wide range of science disciplines
SKA data flow model is similar to the LHCreceptorssensors
correlator data processor
supercomputer
European distribution point
~200km avg
~1000 km
~25000 km(Perth to London via USA)
or~13000 km
(South Africa to London)
Regionaldata center
Universitiesastronomy
groups
Universitiesastronomy
groups
Universitiesastronomy
groups
Universitiesastronomy
groups
Universitiesastronomy
groups
Universitiesastronomy
groups
Universitiesastronomy
groups
Universitiesastronomy
groups
Universitiesastronomy
groups
93 ndash 168 Pbs
400 Tbs
100 Gbs
from SKA RFI
Hypothetical(based on the
LHC experience)
These numbers are based on modeling prior to splitting the
SKA between S Africa and Australia)
Regionaldata center
Regionaldata center
14
Foundations of data-intensive science
• This talk looks briefly at the nature of the advances in technologies, software, and methodologies that have enabled LHC data management and analysis.
– The points 1a and 1b on optical transport and router technology are included in the slides for completeness, but I will not talk about them. They were not really driven by the needs of the LHC, but they were opportunistically used by the LHC.
• Much of the remainder of the talk is a tour through ESnet's network performance knowledge base (fasterdata.es.net)
– Also included are:
  • the LHC ATLAS data management and analysis approach that generates and relies on very large network data utilization,
  • and an overview of how R&E networks have evolved to accommodate the LHC traffic.
1) Underlying network issues
At the core of our ability to transport the volume of data that we must deal with today, and to accommodate future growth, are advances in optical transport technology and router technology.
We face a continuous growth of data to transport
[Figure: ESnet accepted traffic, 0–15 petabytes/month over 13 years.]
ESnet has seen exponential growth in our traffic every year since 1990 (our traffic grows by a factor of 10 about once every 47 months).
16
We face a continuous growth of data transport
• The LHC data volume is predicted to grow 10-fold over the next 10 years.
• New generations of instruments – for example the Square Kilometer Array radio telescope and ITER (the international fusion experiment) – will generate more data than the LHC.
• In response, ESnet and most large R&E networks have built 100 Gb/s (per optical channel) networks
– ESnet's new network – ESnet5 – is complete and provides 44 x 100 Gb/s (4.4 terabits/sec – 4400 gigabits/sec) in optical channels across the entire ESnet national footprint
– Initially, one of these 100 Gb/s channels is configured to replace the current 4 x 10 Gb/s IP network
• What has made this possible?
17
1a) Optical Network Technology
Modern optical transport systems (DWDM = dense wave division multiplexing) use a collection of technologies called "coherent optical" processing to achieve more sophisticated optical modulation, and therefore higher data density per signal transport unit (symbol), providing 100 Gb/s per wave (optical channel).
– Optical transport using dual polarization-quadrature phase shift keying (DP-QPSK) technology with coherent detection [OIF1]
  • dual polarization – two independent optical signals on the same frequency in two orthogonal polarizations → reduces the symbol rate by half
  • quadrature phase shift keying – encodes data by changing the signal phase relative to the optical carrier, further reducing the symbol rate by half (sends twice as much data per symbol)
– Together, DP and QPSK reduce the required symbol rate by a factor of 4, which allows a 100G payload (plus overhead) to fit into 50 GHz of spectrum
  • The actual transmission rate is about 10% higher to include FEC data
– This is a substantial simplification of the optical technology involved – see the TNC 2013 paper and Chris Tracy's NANOG talk for details [Tracy1] and [Rob1]
Optical Network Technology
… WaveLogic™ to provide 100 Gb/s per wave
– 88 waves (optical channels), 100 Gb/s each
  • wave capacity shared equally with Internet2
– ~13,000 miles / 21,000 km of lit fiber
– 280 optical amplifier sites
– 70 optical add/drop sites (where routers can be inserted)
  • 46 100G add/drop transponders
  • 22 100G re-gens across the wide area
[Map: the ESnet5 optical network footprint, including the Long Island MAN and ANI Testbed; geography is only representational.]
19
1b) Network routers and switches
ESnet5 routing (IP layer 3) is provided by Alcatel-Lucent 7750 routers with 100 Gb/s client interfaces
– 17 routers with 100G interfaces
  • several more in a test environment
– 59 layer-3 100GigE interfaces, 8 customer-owned 100G routers
– 7 100G interconnects with other R&E networks at Starlight (Chicago), MAN LAN (New York), and Sunnyvale (San Francisco)
20
[Map: ESnet5 metro area circuits, DOE laboratory and university sites, ESnet routers and site routers, commercial, US R&E, and international R&E peerings, and the 100G testbed (SF Bay Area, Chicago, New York, Amsterdam). Link capacities range from 1G site-provided circuits through 10–40G to 100G; geographical representation is approximate.]
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport: The limitations of TCP must be addressed for large, long-distance flows
Although there are other transport protocols available, TCP remains the workhorse of the Internet, including for data-intensive science.
Using TCP to support the sustained, long distance, high data-rate flows of data-intensive science requires an error-free network.
Why error-free?
TCP is a "fragile workhorse": it is very sensitive to packet loss (due to bit errors)
– Very small packet loss rates on these paths result in large decreases in performance
– A single bit error will cause the loss of a 1–9 KByte packet (depending on the MTU size), as there is no FEC at the IP level for error correction
  • This puts TCP back into "slow start" mode, thus reducing throughput
22
Transport
• The reason for TCP's sensitivity to packet loss is the slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internet
– Packet loss is seen by TCP's congestion control algorithms as evidence of congestion, so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion, leading to network throughput collapse)
– Network link errors also cause packet loss, so these congestion avoidance algorithms come into play, with dramatic effect on throughput in the wide area network – hence the need for "error-free"
23
Transport: Impact of packet loss on TCP
On a 10 Gb/s LAN path the impact of low packet loss rates is minimal.
On a 10 Gb/s WAN path the impact of low packet loss rates is enormous (~80X throughput reduction on a transatlantic path).
Implications: error-free paths are essential for high-volume, long-distance data transfers.
[Figure: Throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss, for Reno (measured), Reno (theory), and H-TCP (measured), compared with the no-packet-loss case. Throughput falls from ~10,000 Mb/s toward a small fraction of that as the network round trip time (which corresponds roughly to San Francisco to London) increases. See http://fasterdata.es.net/performance-testing/perfsonar/troubleshooting/packet-loss]
24
Transport: Modern TCP stack
• A modern TCP stack (the kernel implementation of the TCP protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])
– This is done using mechanisms that more quickly increase back to full speed after an error forces a reset to low bandwidth
["Binary Increase Congestion" control algorithm impact: BIC reaches max throughput much faster than older algorithms. (From Linux 2.6.19 the default is CUBIC, a refined version of BIC designed for high bandwidth, long paths.)]
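(As an illustration of where this choice lives on an end host, here is a minimal sketch, assuming a Linux kernel and Python 3.6 or later; it is not part of the talk.)

    # Minimal sketch (Linux assumed): inspect the congestion control algorithms the
    # kernel offers and request CUBIC for one socket. The system-wide default is the
    # sysctl net.ipv4.tcp_congestion_control.
    import socket

    with open("/proc/sys/net/ipv4/tcp_available_congestion_control") as f:
        print("available:", f.read().strip())            # e.g. "reno cubic htcp"

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"cubic")
    name = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
    print("this socket uses:", name.split(b"\x00")[0].decode())
    s.close()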
25
Transport: Modern TCP stack
Even modern TCP stacks are only of some help in the face of packet loss on a long path, high-speed network.
• For a detailed analysis of the impact of packet loss on various TCP implementations, see "An Investigation into Transport Protocols and Data Transport Applications Over High Performance Networks," chapter 8 ("Systematic Tests of New-TCP Behaviour") by Yee-Ting Li, University College London (PhD thesis): http://www.slac.stanford.edu/~ytl/thesis.pdf
[Figure: Throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss (tail zoom, 0–1000 Mb/s), for Reno (measured), Reno (theory), and H-TCP (CUBIC refinement, measured); round trip time corresponds roughly to San Francisco to London.]
26
3) Monitoring and testing
The only way to keep multi-domain, international scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction.
perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.)
• perfSONAR is a community effort to:
– define network management data exchange protocols, and
– standardize measurement data formats, gathering, and archiving
perfSONAR is deployed extensively throughout LHC related networks and international networks, and at the end sites. (See [fasterdata], [perfSONAR], and [NetSrv].)
– There are now more than 1000 perfSONAR boxes installed in N. America and Europe
27
perfSONAR
The test and monitor functions can detect soft errors that limit throughput and can be hard to find (hard errors / faults are easily found and corrected).
Soft failure example:
• Observed end-to-end performance degradation due to soft failure of a single optical line card
[Figure: one month of throughput measurements (Gb/s) showing normal performance, then degrading performance, then recovery after repair.]
• Why not just rely on SNMP interface stats for this sort of error detection?
  • not all error conditions show up in SNMP interface statistics
  • SNMP error statistics can be very noisy
  • some devices lump different error counters into the same bucket, so it can be very challenging to figure out what errors to alarm on and what errors to ignore
    • though ESnet's Spectrum monitoring system attempts to apply heuristics to do this
  • many routers will silently drop packets – the only way to find that is to test through them and observe loss using devices other than the culprit device
28
perfSONAR
The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains.
– It provides the only widely deployed tooling that can monitor circuits end-to-end across the different networks from the US to Europe
– ESnet has perfSONAR testers installed at every PoP and all but the smallest user sites; Internet2 is close to the same
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages
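(Recent perfSONAR toolkits store results in the esmond measurement archive, which exposes a REST interface; the sketch below, with made-up host names, shows roughly what a query for archived throughput tests looks like. The exact URL layout and field names should be checked against the perfSONAR documentation.)

    # Sketch only (host names are hypothetical; requires the "requests" package).
    # Query a perfSONAR esmond measurement archive for recent throughput tests
    # between two test points.
    import requests

    archive = "http://ps.example.net/esmond/perfsonar/archive/"   # hypothetical host
    params = {
        "event-type": "throughput",
        "source": "ps.site-a.example.org",                        # hypothetical
        "destination": "ps.site-b.example.org",                   # hypothetical
        "time-range": 7 * 24 * 3600,                              # last 7 days
        "format": "json",
    }
    for md in requests.get(archive, params=params, timeout=30).json():
        print(md.get("source"), "->", md.get("destination"), md.get("metadata-key"))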
29
4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network.
• Host TCP tuning
• Modern TCP stack (see above)
• Other issues (MTU, etc.)
• Data transfer tools and parallelism
• Other data transfer issues (firewalls, etc.)
30
4.1) System software tuning: Host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length.
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using, end-to-end.
• Default TCP buffer sizes are typically much too small for today's high speed networks
– Until recently, default TCP send/receive buffers were typically 64 KB
– Tuned buffer to fill a CA to NY, 1 Gb/s path: 10 MB
  • 150X bigger than the default buffer size
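(To make the 10 MB figure concrete: the needed buffer is roughly the bandwidth-delay product. The small sketch below uses assumed round trip times; it is illustrative, not from the slides.)

    # Sketch: required TCP buffer ~= bandwidth-delay product (BDP) = bandwidth x RTT.
    # The RTT values below are assumptions chosen to be roughly realistic.

    def bdp_bytes(bandwidth_bps, rtt_ms):
        """Bandwidth-delay product in bytes."""
        return bandwidth_bps * (rtt_ms / 1000.0) / 8

    print(f"1 Gb/s,  80 ms (CA to NY):       {bdp_bytes(1e9, 80) / 2**20:6.1f} MB")   # ~10 MB
    print(f"10 Gb/s, 150 ms (transatlantic): {bdp_bytes(10e9, 150) / 2**20:6.1f} MB") # ~180 MB

On Linux the corresponding upper limits live in the net.ipv4.tcp_rmem / tcp_wmem and net.core.rmem_max / wmem_max sysctls; fasterdata.es.net gives recommended values.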
31
System software tuning: Host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications
– How to tune is a function of the application and the path to the destination, so potentially a lot of special cases
• Auto-tuning the TCP connection buffer size within pre-configured limits helps
• Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths
32
System software tuning: Host tuning – TCP
[Figure: Throughput out to ~9000 km on a 10 Gb/s network, 32 MBy (auto-tuned) vs. 64 MBy (hand-tuned) TCP window size. The hand-tuned 64 MBy window sustains substantially higher throughput than the auto-tuned 32 MBy window as the path length (round trip time, corresponding roughly to San Francisco to London) increases; the vertical scale runs 0–10,000 Mb/s.]
33
4.2) System software tuning: Data transfer tools
Parallelism is key in data transfer tools
– It is much easier to achieve a given performance level with multiple parallel connections than with one connection
  • this is because the OS is very good at managing multiple threads and less good at sustained, maximum performance of a single thread (the same is true for disks)
– Several tools offer parallel transfers (see below)
Latency tolerance is critical
– Wide area data transfers have much higher latency than LAN transfers
– Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds)
  • examples: SCP/SFTP and HPSS mover protocols work very poorly in long path networks
• Disk performance
– In general, a RAID array or parallel disks (as in FDT) are needed to get more than about 500 Mb/s
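(A toy illustration of the parallelism point, not any particular tool: fetch one large file over several TCP connections at once using HTTP range requests. The URL is hypothetical.)

    # Sketch (illustrative only; the URL is hypothetical): fetch one file over N
    # parallel TCP connections using HTTP Range requests - the same idea tools such
    # as GridFTP and FDT use with parallel streams to fill a long, fat path.
    import concurrent.futures
    import urllib.request

    URL = "http://data.example.org/dataset/file.bin"   # hypothetical data server
    STREAMS = 8

    def fetch_range(start, end):
        req = urllib.request.Request(URL, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    # Find the file size, split it into STREAMS byte ranges, and fetch them in parallel.
    size = int(urllib.request.urlopen(urllib.request.Request(URL, method="HEAD"))
               .headers["Content-Length"])
    chunk = size // STREAMS
    ranges = [(i * chunk, size - 1 if i == STREAMS - 1 else (i + 1) * chunk - 1)
              for i in range(STREAMS)]

    with concurrent.futures.ThreadPoolExecutor(max_workers=STREAMS) as pool:
        parts = list(pool.map(lambda r: fetch_range(*r), ranges))

    with open("file.bin", "wb") as out:
        for part in parts:
            out.write(part)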
34
System software tuning: Data transfer tools
Using the right tool is very important.
Sample results: Berkeley, CA to Argonne, IL; RTT = 53 ms, network capacity = 10 Gbps.

  Tool                     Throughput
  scp                      140 Mbps
  patched scp (HPN)        1.2 Gbps
  ftp                      1.4 Gbps
  GridFTP, 4 streams       5.4 Gbps
  GridFTP, 8 streams       6.6 Gbps

Note that to get more than about 1 Gbps (125 MB/s) disk to disk requires using RAID technology.
• PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSH
– http://www.psc.edu/networking/projects/hpn-ssh
– Significant performance increase
  • this helps rsync too
35
System software tuning: Data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems
– Parallel streams, buffer tuning, help in getting through firewalls (open ports), ssh, etc.
– The newer Globus Online incorporates all of these and adds small file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP.
36
System software tuning: Data transfer tools
Also see Caltech's FDT (Fast Data Transfer) approach
– Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node
– Explicit parallel use of multiple disks
– Can fill 100 Gb/s paths
– See SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT/
37
4.4) System software tuning: Other issues
Firewalls are anathema to high-speed data flows
– many firewalls can't handle >1 Gb/s flows
  • designed for large numbers of low bandwidth flows
  • some firewalls even strip out TCP options that allow for TCP buffers > 64 KB
– See Jason Zurawski's "Say Hello to your Frienemy – The Firewall"
– Stateful firewalls have inherent problems that inhibit high throughput
  • http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues
– Large MTUs (several issues)
– NIC tuning
  • Defaults are usually fine for 1GE, but 10GE often requires additional tuning
– Other OS tuning knobs
– See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])
5) Site infrastructure to support data-intensive science: The Science DMZ
With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck.
The site network (LAN) typically provides connectivity for local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science
– Therefore a high performance interface between the wide area network and the local area site network is critical for large-scale data movement.
Campus network infrastructure is typically not designed to handle the flows of large-scale science
– The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows
  • firewalls, proxy servers, low-cost switches, and so forth,
  • none of which will allow high volume, high bandwidth, long distance data flows
39
The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large data volume, high round trip time (RTT) (international paths) wide area network (WAN) flows (see [DIS])
– otherwise the site will impose poor performance on the entire high speed data path, all the way back to the source
40
The Science DMZ
The Science DMZ concept: the compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy
– Outside the site firewall – hence the term "Science DMZ"
– With dedicated systems built and tuned for wide-area data transfer
– With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below)
– A security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.)
This is so important it was a requirement for the last round of NSF CC-NIE grants.
41
The Science DMZ
(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)
[Diagram: a WAN-capable border router connects the WAN to a Science DMZ router/switch that sits outside the site firewall. The Science DMZ contains a high performance Data Transfer Node (dedicated systems built and tuned for wide-area data transfer), network monitoring and testing, and per-service security policy control points, with a clean, high-bandwidth WAN data path to a computing cluster. The campus/site LAN, the site DMZ (Web/DNS/Mail), and secured campus/site access to the Internet sit behind the site firewall; campus/site access to Science DMZ resources is via the site firewall.]
42
6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers
• The Tier 2 sites get a comparable amount of data from the Tier 1s
– Host the physics groups that analyze the data and do the science
– Provide most of the compute resources for analysis
– Cache the data (though this is evolving to remote I/O)
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management
– The resources and data movement are centrally managed
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations
– The system manages tens of thousands of jobs a day
  • coordinates data movement of hundreds of terabytes/day, and
  • manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial
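(A toy sketch of the "move the job to the data, else move the data and job to free resources" brokering idea described above and in the diagram that follows; it is not PanDA code, and the site and dataset names are invented.)

    # Toy sketch (not PanDA code): pick an execution site for an analysis job by
    # preferring sites that already hold the job's input dataset and have free CPU
    # slots, falling back to the least-loaded site, to which the data would be moved.
    from dataclasses import dataclass

    @dataclass
    class Site:
        name: str
        free_slots: int
        datasets: set

    def broker(job_dataset, sites):
        # 1) Prefer a site that already has the data and has capacity.
        with_data = [s for s in sites if job_dataset in s.datasets and s.free_slots > 0]
        if with_data:
            return max(with_data, key=lambda s: s.free_slots).name, "run where the data is"
        # 2) Otherwise pick the site with the most free slots and move the data there.
        return max(sites, key=lambda s: s.free_slots).name, "move data and job to free resources"

    sites = [Site("BNL", 120, {"dataset.A"}),        # illustrative names only
             Site("DESY", 300, {"dataset.B"}),
             Site("TRIUMF", 40, {"dataset.A"})]
    print(broker("dataset.A", sites))   # ('BNL', 'run where the data is')
    print(broker("dataset.C", sites))   # ('DESY', 'move data and job to free resources')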
44
[Diagram: The ATLAS PanDA "Production and Distributed Analysis" system uses distributed resources and layers of automation to manage several million jobs/day.
• ATLAS production jobs, regional production jobs, and user/group analysis jobs enter the PanDA Server (task management) through the Task Buffer (job queue), which works with the Job Broker, a Policy module (job type priority), the Data Service, and the Job Dispatcher.
• The CERN ATLAS detector feeds the Tier 0 Data Center (1 copy of all data – archival only). The ATLAS Tier 1 Data Centers – 11 sites scattered across Europe, North America, and Asia – in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis. The ATLAS analysis sites are, e.g., 70 Tier 2 Centers in Europe, North America, and SE Asia.
• Job resource manager: dispatch a "pilot" job manager – a PanDA job receiver – when resources are available at a site; pilots run under the local site job manager (e.g. Condor, LSF, LCG, …) and accept jobs in a standard format from PanDA – similar to the Condor Glide-in approach.
• Workflow: 1) the PanDA server schedules jobs and initiates data movement; 2) the Distributed Data Manager (DDM, a complex system in its own right, called DQ2) locates data and moves it to sites; 3) DDM agents prepare the local resources to receive PanDA jobs; 4) jobs are dispatched when there are resources available and when the required data is in place at the site. The general strategy is to try to move the job to where the data is, else move data and job to where resources are available.
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBytes/day (~68 Gb/s).
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately here).
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
[Figures: accumulated data volume on disk, 0–150 petabytes over four years; and the two PanDA job types, roughly 0–100,000 and 0–50,000 simultaneous jobs, each over one year.]
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges"
– Successful testing was required for sites to participate in LHC production
47
Ramp-up of LHC traffic in ESnet
[Figure: ESnet traffic over time (with an estimate of "small" scale traffic), showing the LHC data system testing period, LHC turn-on, and LHC operation.]
The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.
48
6 cont.) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN
– The LHCOPN is a collection of leased 10 Gb/s optical circuits
– The role of the LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously
  • In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community
• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance
– The security issues were the primary ones, and were addressed by using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec])
  • that is, only LHC data and compute servers are connected to the OPN
50
The LHC OPN – Optical Private Network
[Diagram: LHCOPN physical topology (abbreviated) and LHCOPN architecture – CH-CERN connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network
NB:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below)
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic
– In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
– (there are about 170 Tier 2 sites)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GEANT ("fronts" for the NRENs), Internet2 ("fronts" for the US universities), etc.)
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineers
– To ensure continued good performance for the LHC and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC
54
[Map: LHCONE – a global infrastructure for LHC Tier 1 data center to Tier 2 analysis center connectivity (April 2012). LHCONE VRF domains (ESnet USA, Internet2 USA, CANARIE Canada, TWAREN Taiwan, ASGC Taiwan, KREONET2 Korea, NORDUnet Nordic, DFN Germany, GARR Italy, RedIRIS Spain, SARA Netherlands, RENATER France, CUDI Mexico, GÉANT Europe) interconnect the end sites (LHC Tier 2 or Tier 3 unless indicated as Tier 1, including CERN-T1, BNL-T1, FNAL-T1, TRIUMF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, NDGF-T1, ASGC-T1, CC-IN2P3-T1, and many university groups) through regional R&E communication nexus points (Seattle, Chicago, New York, Washington, Geneva, Amsterdam) over data communication links of 10, 20, and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance
• See LHCONE.net
LHCONE is one part of the network infrastructure that supports the LHC
[Diagram: A Network Centric View of the LHC. The detector (1 PB/s) feeds the Level 1 and 2 triggers over O(1-10) meters and the Level 3 trigger over O(10-100) meters, which feeds the CERN Computer Center at O(1) km. The LHC Optical Private Network (LHCOPN) carries 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) over 500-10,000 km to the LHC Tier 1 Data Centers (Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN). The LHC Open Network Environment (LHCONE) connects the Tier 1 centers to the LHC Tier 2 Analysis Centers and the universities/physics groups; this is intended to indicate that the physics groups now get their data wherever it is most readily available.

CERN → T1 distances:
  France            350 miles /   565 km
  Italy             570 miles /   920 km
  UK                625 miles /  1000 km
  Netherlands       625 miles /  1000 km
  Germany           700 miles /  1185 km
  Spain             850 miles /  1400 km
  Nordic           1300 miles /  2100 km
  USA - New York   3900 miles /  6300 km
  USA - Chicago    4400 miles /  7100 km
  Canada - BC      5200 miles /  8400 km
  Taiwan           6100 miles /  9850 km]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to:
– Couple existing pockets of code, data, and expertise into "systems of systems"
– Break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
– See https://www.es.net/about/science-requirements/
A commonly identified need to support this is that networking must be provided as a "service":
– Schedulable with guaranteed bandwidth – as is done with CPUs and disks
– Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
– Some network path characteristics may also be specified – e.g. diversity
– Available in a Web Services / Grid Services paradigm
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
– This is typically done by using a "static" routing mechanism
  • e.g. some variation of label based switching, with the static switch tables set up in advance to define the circuit path
– MPLS and OpenFlow are examples of this, and both can transport IP packets
– Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
– The virtual circuits can be directed to specific physical network paths when they are set up
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead, Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," in TERENA Networking Conference 2011, in the references
• OSCARS received a 2013 "R&D 100" award
60
End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part
• How are the circuits used?
– End system to end system, IP
  • Almost never – very hard unless private address space is used
    – Using public address space can result in leaking routes
    – Using private address space with multi-homed hosts risks allowing backdoors into secure networks
– End system to end system, Ethernet (or other) over VLAN – a pseudowire
  • Relatively common
  • Interesting example: RDMA over VLAN is likely to be popular in the future
    – The SC11 demo of 40G RDMA over WAN was very successful
    – CPU load for RDMA is a small fraction of that of IP
    – The guaranteed network characteristics of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fit nicely with circuit services (RDMA performs very poorly on best effort networks)
– Point-to-point connection between routing instances – e.g. BGP at the end points
  • Essentially this is how all current circuits are used, from one site router to another site router
  • Typically site-to-site, or advertise subnets that host clusters, e.g. LHC analysis or data management clusters
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot
  • Most circuits are used for a guarantee of bandwidth or for user traffic engineering

Cross-Domain Virtual Circuit Service
… network domains (administrative units)
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
– e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains
[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] through ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local inter-domain controller (IDC) – OSCARS in ESnet, AutoBAHN in GEANT – and a data plane connection helper at each domain ingress/egress point.
1. The domains exchange topology information containing at least the potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process.]
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system
• Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net
65
8) Provide R&D consulting and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
– With each generation of network transport technology:
  • 155 Mb/s was the norm for high speed networks in 1995
  • 100 Gb/s – 650 times greater – is the norm today
  • R&D groups involving hardware engineers, computer scientists, and application specialists worked to
    – first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
    – and then do the development necessary for applications to make use of the new capabilities
– Examples of how this methodology drove toward today's capabilities include:
  • experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
  • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
66
Provide R&D consulting and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
  • Linux TCP Tuning
  • Cisco 6509 Tuning
  • perfSONAR Howto
  • Active perfSONAR Services
  • Globus overview
  • Say No to SCP
  • Data Transfer Nodes (DTN)
  • TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
69
Infrastructure Critical to Science
• The combination of:
– new network architectures in the wide area,
– new network services (such as guaranteed bandwidth virtual circuits),
– cross-domain network error detection and correction,
– redesigning the site LAN to handle high data throughput,
– automation of data movement systems, and
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instruments take data continuously.
• The data is generated/sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.
71
LHC lessons of possible use to the SKA
The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
– A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository, namely
  • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time, and
  • high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites,
mitigate against a single large data center
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply:
– There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
  • It might be that, in the case of the SKA, the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
  • In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
– If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
  • In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, e.g., are implementing LHCONE
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
Re-engineering the site LAN/WAN architecture is critical: the Science DMZ
Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
75
The Message
Again … a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
10
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe N America and SE Asia generate network
data movement of 730 TByday ~68Gbs
Accumulated data volume on disk
730 TBytesday
PanDA manages 120000ndash140000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movementgoing on 24 hrday 9+ monthsyr
that networks must support in order to enable the large-scale science of the LHC
0
50
100
150
Peta
byte
s
four years
0
50000
100000
type
2jo
bs
0
50000
type
1jo
bs
one year
one year
one year
11
HEP as a Prototype for Data-Intensive ScienceThe capabilities required to support this scale of data
movement involve hardware and software developments at all levels1 The underlying network
1a Optical signal transport1b Network routers and switches
2 Data transport (TCP is a ldquofragile workhorserdquo but still the norm)3 Network monitoring and testing4 Operating system evolution5 New site and network architectures6 Data movement and management techniques and software7 New network services
bull Technology advances in these areas have resulted in todayrsquos state-of-the-art that makes it possible for the LHC experiments to routinely and continuously move data at ~150 Gbs across three continents
12
HEP as a Prototype for Data-Intensive Sciencebull ESnet has been collecting requirements for all DOE science
disciplines and instruments that rely on the network for distributed data management and analysis for more than a decade and formally since 2007 [REQ] In this process certain issues are seen across essentially all science
disciplines that rely on the network for significant data transfer even if the quantities are modest compared to project like the LHC experiments
Therefore addressing the LHC issues is a useful exercise that can benefit a wide range of science disciplines
SKA data flow model is similar to the LHCreceptorssensors
correlator data processor
supercomputer
European distribution point
~200km avg
~1000 km
~25000 km(Perth to London via USA)
or~13000 km
(South Africa to London)
Regionaldata center
Universitiesastronomy
groups
Universitiesastronomy
groups
Universitiesastronomy
groups
Universitiesastronomy
groups
Universitiesastronomy
groups
Universitiesastronomy
groups
Universitiesastronomy
groups
Universitiesastronomy
groups
Universitiesastronomy
groups
93 ndash 168 Pbs
400 Tbs
100 Gbs
from SKA RFI
Hypothetical(based on the
LHC experience)
These numbers are based on modeling prior to splitting the
SKA between S Africa and Australia)
Regionaldata center
Regionaldata center
14
Foundations of data-intensive sciencebull This talk looks briefly at the nature of the advances in
technologies software and methodologies that have enabled LHC data management and analysis The points 1a and 1b on optical transport and router technology are
included in the slides for completeness but I will not talk about them They were not really driven by the needs of the LHC but they were opportunistically used by the LHC
Much of the reminder of the talk is a tour through ESnetrsquos network performance knowledge base (fasterdataesnet)
ndash Also included arebull the LHC ATLAS data management and analysis approach that generates
and relies on very large network data utilizationbull and an overview of how RampE network have evolved to accommodate the
LHC traffic
1) Underlying network issuesAt the core of our ability to transport the volume of data
that we must deal with today and to accommodate future growth are advances in optical transport technology and
router technology
0
5
10
15
Peta
byte
sm
onth
13 years
We face a continuous growth of data to transport
ESnet has seen exponential growth in our traffic every year since 1990 (our traffic grows by factor of 10 about once every 47 months)
16
We face a continuous growth of data transportbull The LHC data volume is predicated to grow 10 fold over the
next 10 yearsNew generations of instruments ndash for example the Square
Kilometer Array radio telescope and ITER (the international fusion experiment) ndash will generate more data than the LHC
In response ESnet and most large RampE networks have built 100 Gbs (per optical channel) networksndash ESnets new network ndash ESnet5 ndash is complete and provides a 44 x
100Gbs (44 terabitssec - 4400 gigabitssec) in optical channels across the entire ESnet national footprint
ndash Initially one of these 100 Gbs channels is configured to replace the current 4 x 10 Gbs IP network
bull What has made this possible
17
1a) Optical Network TechnologyModern optical transport systems (DWDM = dense wave
division multiplexing) use a collection of technologies called ldquocoherent opticalrdquo processing to achieve more sophisticated optical modulation and therefore higher data density per signal transport unit (symbol) that provides 100Gbs per wave (optical channel)ndash Optical transport using dual polarization-quadrature phase shift keying
(DP-QPSK) technology with coherent detection [OIF1]bull dual polarization
ndash two independent optical signals same frequency orthogonal two polarizations rarr reduces the symbol rate by half
bull quadrature phase shift keying ndash encode data by changing the signal phase of the relative to the optical carrier further reduces the symbol rate by half (sends twice as much data symbol)
Together DP and QPSK reduce required rate by a factor of 4ndash allows 100G payload (plus overhead) to fit into 50GHz of spectrum
bull Actual transmission rate is about 10 higher to include FEC data
ndash This is a substantial simplification of the optical technology involved ndash see the TNC 2013 paper and Chris Tracyrsquos NANOG talk for details [Tracy1] and [Rob1]
WaveLogictrade to provide 100Gbs wavendash 88 waves (optical channels) 100Gbs each
bull wave capacity shared equally with Internet2ndash ~13000 miles 21000 km lit fiberndash 280 optical amplifier sitesndash 70 optical adddrop sites (where routers can be inserted)
bull 46 100G adddrop transpondersbull 22 100G re-gens across wide-area
NEWG
SUNN
KANSDENV
SALT
BOIS
SEAT
SACR
WSAC
LOSA
LASV
ELPA
ALBU
ATLA
WASH
NEWY
BOST
SNLL
PHOE
PAIX
NERSC
LBNLJGI
SLAC
NASHCHAT
CLEV
EQCH
STA
R
ANLCHIC
BNL
ORNL
CINC
SC11
STLO
Internet2
LOUI
FNA
L
Long IslandMAN and
ANI Testbed
O
JACKGeography is
only representational
19
1b) Network routers and switchesESnet5 routing (IP layer 3) is provided by Alcatel-Lucent
7750 routers with 100 Gbs client interfacesndash 17 routers with 100G interfaces
bull several more in a test environment ndash 59 layer-3 100GigE interfaces 8 customer-owned 100G routersndash 7 100G interconnects with other RampE networks at Starlight (Chicago)
MAN LAN (New York) and Sunnyvale (San Francisco)
20
Metro area circuits
SNLL
PNNL
MIT
PSFC
AMES
LLNL
GA
JGI
LBNL
SLACNER
SC
ORNL
ANLFNAL
SALT
INL
PU Physics
SUNN
SEAT
STAR
CHIC
WASH
ATLA
HO
US
BOST
KANS
DENV
ALBQ
LASV
BOIS
SAC
R
ELP
A
SDSC
10
Geographical representation is
approximate
PPPL
CH
AT
10
SUNN STAR AOFA100G testbed
SF Bay Area Chicago New York AmsterdamAMST
US RampE peerings
NREL
Commercial peerings
ESnet routers
Site routers
100G
10-40G
1G Site provided circuits
LIGO
Optical only
SREL
100thinsp
Intrsquol RampE peerings
100thinsp
JLAB
10
10100thinsp
10
100thinsp100thinsp
1
10100thinsp
100thinsp1
100thinsp100thinsp
100thinsp
100thinsp
BNL
NEWY
AOFA
NASH
1
LANL
SNLA
10
10
1
10
10
100thinsp
100thinsp
100thinsp10
1010
100thinsp
100thinsp
10
10
100thinsp
100thinsp
100thinsp
100thinsp
100thinsp
100thinsp100thinsp
100thinsp
10
100thinsp
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport: the limitations of TCP must be addressed for large, long-distance flows
Although there are other transport protocols available, TCP remains the workhorse of the Internet, including for data-intensive science.
Using TCP to support the sustained, long-distance, high data-rate flows of data-intensive science requires an error-free network.
Why error-free? TCP is a "fragile workhorse": it is very sensitive to packet loss (due to bit errors).
• Very small packet loss rates on these paths result in large decreases in performance.
• A single bit error will cause the loss of a 1-9 KB packet (depending on the MTU size), as there is no FEC at the IP level for error correction.
• This puts TCP back into "slow start" mode, thus reducing throughput.
22
Transport
• The reason for TCP's sensitivity to packet loss is the slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internet.
  - Packet loss is seen by TCP's congestion control algorithms as evidence of congestion, so they activate to slow the senders down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion, leading to network throughput collapse).
  - Network link errors also cause packet loss, so these congestion avoidance algorithms come into play, with dramatic effect on throughput in the wide area network; hence the need for "error-free" paths.
23
Transport: impact of packet loss on TCP
On a 10 Gb/s LAN path the impact of low packet loss rates is minimal.
On a 10 Gb/s WAN path the impact of low packet loss rates is enormous (~80X throughput reduction on a transatlantic path).
Implication: error-free paths are essential for high-volume, long-distance data transfers.
[Chart: Throughput (0-10,000 Mb/s) vs. increasing network round trip time (ms, corresponding roughly to San Francisco to London) on a 10 Gb/s link with 0.0046% packet loss. Series: no packet loss, H-TCP (measured), Reno (measured), Reno (theory). See http://fasterdata.es.net/performance-testing/perfsonar/troubleshooting/packet-loss]
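The strong RTT dependence in the chart can be approximated with the well-known Mathis et al. steady-state model for Reno-style TCP, throughput ≈ (MSS/RTT)·√(3/2)/√loss. The sketch below is illustrative only (assumed MSS, RTTs, and loss rate); it shows the 1/(RTT·√loss) scaling rather than reproducing the measured curves:

```python
# Rough steady-state TCP (Reno) throughput estimate, after Mathis et al.:
#   rate ~ (MSS / RTT) * sqrt(3/2) / sqrt(loss)
from math import sqrt

LINK_BPS = 10e9           # 10 Gb/s link
LOSS = 4.6e-5             # 0.0046% packet loss, as in the chart above
MSS_BITS = 1460 * 8       # payload of a 1500-byte MTU packet

def tcp_estimate_bps(rtt_s, loss=LOSS):
    """Mathis-model throughput, capped at the link rate."""
    return min(LINK_BPS, (MSS_BITS / rtt_s) * sqrt(1.5) / sqrt(loss))

for label, rtt_s in [("metro path (~1 ms RTT)", 0.001),
                     ("continental path (~40 ms RTT)", 0.040),
                     ("SF-London path (~88 ms RTT)", 0.088)]:
    print(f"{label}: ~{tcp_estimate_bps(rtt_s) / 1e6:,.0f} Mb/s")
```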
24
Transport: modern TCP stack
• A modern TCP stack (the kernel implementation of the TCP protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk]).
  - This is done using mechanisms that more quickly increase back to full speed after an error forces a reset to low bandwidth.
• "Binary Increase Congestion" (BIC) control algorithm impact: BIC reaches maximum throughput much faster than older algorithms. (From Linux 2.6.19 the default is CUBIC, a refined version of BIC designed for high-bandwidth, long paths.)
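On a Linux data transfer host, the congestion control algorithm in use can be inspected (and, with root privileges, changed) through the kernel's standard /proc sysctl interface. A minimal sketch, assuming a Linux host:

```python
# Inspect (and optionally select) the TCP congestion control algorithm on Linux.
# Reading these files is harmless; writing requires root.
from pathlib import Path

CURRENT = Path("/proc/sys/net/ipv4/tcp_congestion_control")
AVAILABLE = Path("/proc/sys/net/ipv4/tcp_available_congestion_control")

print("available:", AVAILABLE.read_text().split())
print("current:  ", CURRENT.read_text().strip())

# To switch to CUBIC (or htcp, if that module is loaded), uncomment as root:
# CURRENT.write_text("cubic\n")
```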
25
Transport: modern TCP stack (continued)
Even modern TCP stacks are only of some help in the face of packet loss on a long-path, high-speed network.
• For a detailed analysis of the impact of packet loss on various TCP implementations, see "An Investigation into Transport Protocols and Data Transport Applications Over High Performance Networks," chapter 8 ("Systematic Tests of New-TCP Behaviour"), by Yee-Ting Li, University College London (PhD thesis): http://www.slac.stanford.edu/~ytl/thesis.pdf
[Chart: Throughput vs. increasing round trip time (ms, corresponding roughly to San Francisco to London) on a 10 Gb/s link with 0.0046% packet loss, tail zoom (0-1000 Mb/s). Series: Reno (measured), Reno (theory), H-TCP (CUBIC refinement, measured).]
26
3) Monitoring and testing
The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction.
perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.).
• perfSONAR is a community effort to:
  - define network management data exchange protocols, and
  - standardize measurement data formats, gathering, and archiving.
perfSONAR is deployed extensively throughout LHC-related networks and international networks, and at the end sites. (See [fasterdata], [perfSONAR], and [NetServ].)
• There are now more than 1000 perfSONAR boxes installed in North America and Europe.
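perfSONAR itself provides the scheduled throughput and loss tests; the toy script below is not perfSONAR. It is only a minimal illustration, assuming a Linux host and a hypothetical test target, of the underlying idea: probe the path on a regular schedule and log loss, so that soft errors show up as trends rather than as user complaints.

```python
# Toy illustration of continuous loss monitoring (NOT perfSONAR): ping a test
# target periodically and log the reported packet loss percentage.
import re, subprocess, time

TARGET = "perfsonar.example.org"   # hypothetical test host
INTERVAL_S = 300                   # run a probe every 5 minutes

def probe_loss(target, count=100):
    out = subprocess.run(["ping", "-c", str(count), "-i", "0.2", target],
                         capture_output=True, text=True).stdout
    m = re.search(r"([\d.]+)% packet loss", out)
    return float(m.group(1)) if m else None

while True:
    loss = probe_loss(TARGET)
    print(time.strftime("%Y-%m-%d %H:%M:%S"), f"loss={loss}%")
    time.sleep(INTERVAL_S)
```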
27
perfSONAR
The test and monitor functions can detect soft errors that limit throughput and can be hard to find. (Hard errors / faults are easily found and corrected.)
Soft failure example:
• Observed end-to-end performance degradation due to soft failure of a single optical line card.
[Chart: throughput in Gb/s over one month, showing normal performance, then degrading performance, then recovery after repair.]
• Why not just rely on SNMP interface stats for this sort of error detection?
  - not all error conditions show up in SNMP interface statistics
  - SNMP error statistics can be very noisy
  - some devices lump different error counters into the same bucket, so it can be very challenging to figure out which errors to alarm on and which to ignore (though ESnet's Spectrum monitoring system attempts to apply heuristics to do this)
  - many routers will silently drop packets; the only way to find that is to test through them and observe loss using devices other than the culprit device
28
perfSONAR
The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (application-to-application) path can be characterized across multiple network domains.
• It provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe.
  - ESnet has perfSONAR testers installed at every PoP and at all but the smallest user sites; Internet2 is close to the same.
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages.
29
4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network:
• Host TCP tuning
• Modern TCP stack (see above)
• Other issues (MTU, etc.)
• Data transfer tools and parallelism
• Other data transfer issues (firewalls, etc.)
30
4.1) System software tuning: host tuning - TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length.
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using end-to-end (a sizing sketch follows below).
• Default TCP buffer sizes are typically much too small for today's high-speed networks:
  - Until recently, default TCP send/receive buffers were typically 64 KB.
  - The tuned buffer needed to fill a CA-to-NY 1 Gb/s path is about 10 MB, 150X bigger than the default buffer size.
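The 10 MB figure is simply the bandwidth-delay product (BDP) of the path. A minimal sketch of the calculation and of setting socket buffers explicitly; the 1 Gb/s rate and 80 ms RTT are illustrative assumptions for a CA-to-NY path, and the kernel may cap the requested sizes (see net.core.rmem_max / wmem_max):

```python
# Bandwidth-delay product sizing and explicit socket buffer configuration.
import socket

def bdp_bytes(bandwidth_bps, rtt_s):
    """Buffer needed to keep a path full: bandwidth x round-trip time, in bytes."""
    return int(bandwidth_bps * rtt_s / 8)

buf = bdp_bytes(1e9, 0.080)           # 1 Gb/s path, ~80 ms CA-to-NY RTT -> ~10 MB
print(f"BDP = {buf / 1e6:.1f} MB")    # compare with a 64 KB default buffer

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buf)   # kernel may cap this
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buf)
```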
31
System software tuning: host tuning - TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications.
  - How to tune is a function of the application and the path to the destination, so there are potentially a lot of special cases.
• Auto-tuning the TCP connection buffer size within pre-configured limits helps.
• Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths.
32
System software tuning: host tuning - TCP
[Chart: Throughput (0-10,000 Mb/s) vs. round trip time (ms, corresponding roughly to San Francisco to London) out to ~9000 km on a 10 Gb/s network, comparing a 32 MB (auto-tuned) vs. a 64 MB (hand-tuned) TCP window size; path length increases to the right.]
33
4.2) System software tuning: data transfer tools
Parallelism is key in data transfer tools:
• It is much easier to achieve a given performance level with multiple parallel connections than with one connection (a sketch of the idea follows below).
  - This is because the OS is very good at managing multiple threads and less good at sustained, maximum performance of a single thread (the same is true for disks).
• Several tools offer parallel transfers (see below).
Latency tolerance is critical:
• Wide area data transfers have much higher latency than LAN transfers.
• Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds); for example, SCP/SFTP and the HPSS mover protocols work very poorly on long-path networks.
Disk performance:
• In general you need a RAID array or parallel disks (as in FDT) to get more than about 500 Mb/s.
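A minimal sketch of the parallel-streams idea: split a transfer into N byte ranges and move each range over its own TCP connection, so no single stream's window or thread limits the aggregate. The host, port, file size, and request framing below are hypothetical; real tools such as GridFTP and FDT implement this far more completely.

```python
# Toy parallel transfer: fetch N byte ranges of a remote file over separate TCP
# connections. A hypothetical server at HOST:PORT is assumed to accept an
# "offset length\n" request on each connection and stream back that range.
import socket
from concurrent.futures import ThreadPoolExecutor

HOST, PORT = "dtn.example.org", 5000     # hypothetical data transfer node
FILE_SIZE = 8 * 1024**3                  # 8 GB, assumed known in advance
STREAMS = 8

def fetch_range(offset, length):
    with socket.create_connection((HOST, PORT)) as s:
        s.sendall(f"{offset} {length}\n".encode())
        chunks, remaining = [], length
        while remaining:
            data = s.recv(min(2**20, remaining))
            if not data:
                break
            chunks.append(data)
            remaining -= len(data)
    return offset, b"".join(chunks)

chunk = FILE_SIZE // STREAMS
ranges = [(i * chunk, chunk if i != STREAMS - 1 else FILE_SIZE - i * chunk)
          for i in range(STREAMS)]
with ThreadPoolExecutor(max_workers=STREAMS) as pool:
    parts = dict(pool.map(lambda r: fetch_range(*r), ranges))
# parts[offset] holds each range; write the ranges at their offsets to reassemble.
```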
34
System software tuning: data transfer tools
Using the right tool is very important.
Sample results, Berkeley, CA to Argonne, IL (RTT = 53 ms, network capacity = 10 Gbps):
  scp:                 140 Mbps
  patched scp (HPN):   1.2 Gbps
  ftp:                 1.4 Gbps
  GridFTP, 4 streams:  5.4 Gbps
  GridFTP, 8 streams:  6.6 Gbps
Note that to get more than about 1 Gbps (125 MB/s) disk to disk requires using RAID technology.
• PSC (Pittsburgh Supercomputing Center) has a patch set that fixes problems with SSH: http://www.psc.edu/networking/projects/hpn-ssh
  - Significant performance increase; this helps rsync too.
35
System software tuning: data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems.
• Parallel streams and buffer tuning help in getting through firewalls (open ports), ssh, etc.
• The newer Globus Online incorporates all of these, plus small-file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP.
36
System software tuning: data transfer tools
Also see Caltech's FDT (Fast Data Transfer) approach:
• Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node.
• Explicit parallel use of multiple disks.
• Can fill 100 Gb/s paths.
• See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT
37
4.4) System software tuning: other issues
Firewalls are anathema to high-speed data flows:
• many firewalls can't handle >1 Gb/s flows
  - they are designed for large numbers of low-bandwidth flows
  - some firewalls even strip out TCP options that allow for TCP buffers >64 KB
• See Jason Zurawski's "Say Hello to your Frienemy - The Firewall".
• Stateful firewalls have inherent problems that inhibit high throughput: http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
Many other issues:
• Large MTUs (several issues)
• NIC tuning (defaults are usually fine for 1GE, but 10GE often requires additional tuning)
• Other OS tuning knobs
• See fasterdata.es.net and "High Performance Bulk Data Transfer" [HPBulk]
5) Site infrastructure to support data-intensive science: the Science DMZ
With the wide area part of the network infrastructure addressed, the typical site / campus LAN becomes the bottleneck.
The site network (LAN) typically provides connectivity for the local resources needed by data-intensive science: compute, data, instruments, collaboration systems, etc.
• Therefore a high-performance interface between the wide area network and the local area site network is critical for large-scale data movement.
Campus network infrastructure is typically not designed to handle the flows of large-scale science:
• The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows: firewalls, proxy servers, low-cost switches, and so forth, none of which will allow high-volume, high-bandwidth, long-distance data flows.
39
The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large data volume, high round trip time (RTT) (international paths) character of the wide area network (WAN) flows (see [DIS]).
• Otherwise the site will impose poor performance on the entire high-speed data path, all the way back to the source.
40
The Science DMZ
The Science DMZ concept: the compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy.
• Outside the site firewall, hence the term "Science DMZ".
• With dedicated systems built and tuned for wide-area data transfer.
• With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below).
• With a security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.).
This is so important it was a requirement for the last round of NSF CC-NIE grants.
41
The Science DMZ
(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)
[Diagram: Science DMZ architecture. The WAN connects through the border router to a Science DMZ router/switch (a WAN-capable device) serving a high-performance Data Transfer Node, a computing cluster, and network monitoring and testing systems over a clean, high-bandwidth WAN data path, with per-service security policy control points. The campus/site LAN, the site DMZ (Web, DNS, Mail), and secured campus/site access to the Internet sit behind the site firewall; campus/site access to Science DMZ resources is via the site firewall. The DTNs are dedicated systems built and tuned for wide-area data transfer.]
42
6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
• In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers.
• The Tier 2 sites get a comparable amount of data from the Tier 1s:
  - they host the physics groups that analyze the data and do the science;
  - they provide most of the compute resources for analysis;
  - they cache the data (though this is evolving to remote I/O).
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management.
  - The resources and data movement are centrally managed.
  - Analysis jobs are submitted to the central manager, which locates compute resources and matches them with dataset locations.
  - The system manages tens of thousands of jobs a day:
    • coordinates data movement of hundreds of terabytes/day, and
    • manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science.
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial.
44
[Diagram: The ATLAS PanDA ("Production and Distributed Analysis") system uses distributed resources and layers of automation to manage several million jobs/day. ATLAS production, regional production, and user/group analysis jobs feed a task buffer (job queue) in the PanDA server (task management) at CERN, which includes a job broker, a job dispatcher, a policy module (job type priority), and a data service. Distributed Data Manager (DDM) agents, a grid scheduler, and a site capability service connect the PanDA server to the ATLAS analysis sites (e.g. ~70 Tier 2 centers in Europe, North America, and SE Asia) and to the 11 ATLAS Tier 1 data centers scattered across Europe, North America, and Asia, which in aggregate hold one copy of all data and provide the working data set for distribution to Tier 2 centers for analysis. The CERN Tier 0 data center (ATLAS detector) holds one archival-only copy of all data.
Job resource manager: a "pilot" job manager (a PanDA job receiver) is dispatched when resources are available at a site; pilots run under the local site job manager (e.g. Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA (similar to the Condor glide-in approach).
Operation: 1) PanDA schedules jobs and initiates data movement; 2) DDM locates data and moves it to sites (this is a complex system in its own right, called DQ2); 3) the local resources are prepared to receive PanDA jobs; 4) jobs are dispatched when there are resources available and when the required data is in place at the site. The system tries to move the job to where the data is, else moves data and job to where resources are available.
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, North America, and SE Asia generate network data movement of 730 TB/day, ~68 Gb/s.
PanDA manages 120,000-140,000 simultaneous jobs (PanDA manages two types of jobs, shown separately in the accompanying plots).
It is this scale of data movement, going on 24 hours/day, 9+ months/year, that networks must support in order to enable the large-scale science of the LHC.
[Charts: accumulated data volume on disk rising to ~150 petabytes over four years (at 730 TB/day), and the number of simultaneous PanDA jobs of each of the two types (up to ~100,000 and ~50,000 respectively) over one year.]
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure.
• Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges".
• Successful testing was required for sites to participate in LHC production.
47
Ramp-up of LHC traffic in ESnet
[Chart: ESnet traffic over time, showing an estimate of "small" scale traffic, the LHC data system testing period, LHC turn-on, and LHC operation.]
The transition from testing to operation was a smooth continuum due to at-scale testing, a process that took more than 5 years.
48
6, cont.) Evolution of network architectures
For sustained high data-rate transfers, e.g. from instrument to data centers, a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN.
  - The LHCOPN is a collection of leased 10 Gb/s optical circuits.
  - The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.
49
The LHC OPN - Optical Private Network
• While the LHCOPN was a technically straightforward exercise (establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data), there were several aspects that were new to the R&E community.
• The issues relate to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance.
  - The security issues were the primary ones, and were addressed by using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec]); that is, only LHC data and compute servers are connected to the OPN.
The LHC OPN - Optical Private Network
[Diagram: abbreviated LHCOPN physical topology and architecture. CH-CERN connects to the Tier 1 centers: CA-TRIUMF, US-T1-BNL, US-FNAL-CMS, UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, IT-INFN-CNAF, and TW-ASGC.]
51
The LHC OPN - Optical Private Network
N.B.:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
  - The ESnet part of the LHCOPN has used this approach for more than 5 years; in fact this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
  - However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
• In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic (there are about 170 Tier 2 sites).
• Managing this with all possible combinations of Tier 2 - Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service; it is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment (LHCONE) was designed for this purpose.
53
The LHC's Open Network Environment - LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems.
• The clouds are mostly local to a network domain, e.g. one for each involved domain: ESnet, GEANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.
• The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).
In this way the LHC traffic will use circuits designated by the network engineers.
• This ensures continued good performance for the LHC and ensures that other traffic is not impacted; this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.
54
[Map: LHCONE, a global infrastructure for LHC Tier 1 data center to Tier 2 analysis center connectivity (April 2012). Shows LHCONE VRF domains (ESnet and Internet2 in the USA, CANARIE in Canada, GEANT in Europe, NORDUnet, DFN, GARR, RedIRIS, SARA, RENATER, ASGC and TWAREN in Taiwan, KERONET2 in Korea, KISTI, TIFR in India, CUDI in Mexico, etc.), regional R&E communication nexus points (Seattle, Chicago, New York, Washington, Geneva, Amsterdam), end sites (LHC Tier 2 or Tier 3 unless indicated as Tier 1, e.g. BNL-T1, FNAL-T1, TRIUMF-T1, CERN-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, NDGF-T1, ASGC-T1), and data communication links of 10, 20, and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment - LHCONE
• LHCONE could be set up relatively "quickly" because:
  - the VRF technology is a standard capability in most core routers, and
  - there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
CERN → Tier 1 distances:
  France            350 miles /  565 km
  Italy             570 miles /  920 km
  UK                625 miles / 1000 km
  Netherlands       625 miles / 1000 km
  Germany           700 miles / 1185 km
  Spain             850 miles / 1400 km
  Nordic           1300 miles / 2100 km
  USA - New York   3900 miles / 6300 km
  USA - Chicago    4400 miles / 7100 km
  Canada - BC      5200 miles / 8400 km
  Taiwan           6100 miles / 9850 km
[Diagram: A Network Centric View of the LHC. The detector (1 PB/s) feeds the Level 1 and 2 triggers over O(1-10) meters, the Level 3 trigger over O(10-100) meters, and the CERN Tier 0 computer center over O(1) km. From CERN, 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) flows over the LHC Optical Private Network (LHCOPN) to the LHC Tier 1 data centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France) at distances of 500-10,000 km, and from there over the LHC Open Network Environment (LHCONE) to the LHC Tier 2 analysis centers (university physics groups), which now get their data wherever it is most readily available.]
57
7) New network services: point-to-point virtual circuit service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed application systems in order to:
  - couple existing pockets of code, data, and expertise into "systems of systems";
  - break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites;
  - see https://www.es.net/about/science-requirements/
A commonly identified need to support this is that networking must be provided as a "service":
• schedulable with guaranteed bandwidth, as is done with CPUs and disks;
• traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure;
• some network path characteristics may also be specified, e.g. diversity;
• available in the Web Services / Grid Services paradigm.
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
• This is typically done by using a "static" routing mechanism, e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path.
  - MPLS and OpenFlow are examples of this, and both can transport IP packets.
  - Most modern Internet routers have this type of functionality.
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering", that is, to manage and optimize the use of available network resources and to keep big data flows separate from general traffic.
  - The virtual circuits can be directed to specific physical network paths when they are set up (a toy illustration of static label-based forwarding follows below).
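To make the "static switch tables set up in advance" idea concrete, here is a purely illustrative toy model (not OSCARS, MPLS, or OpenFlow code; the switch names and label values are hypothetical) in which each switch forwards on an incoming label according to a pre-installed table, so the circuit's path is fixed regardless of IP routing:

```python
# Toy label-switched "virtual circuit": each switch has a static table mapping an
# incoming label to (next switch, outgoing label), installed when the circuit is set up.
CIRCUIT_TABLES = {
    "siteA-rtr":  {100: ("esnet-chic", 210)},
    "esnet-chic": {210: ("esnet-newy", 220)},
    "esnet-newy": {220: ("siteB-rtr", 300)},
    "siteB-rtr":  {300: (None, None)},        # circuit terminates here
}

def forward(start_switch, start_label, payload):
    """Follow the pre-provisioned path; no dynamic routing decisions are made."""
    switch, label = start_switch, start_label
    hops = [switch]
    while True:
        next_switch, next_label = CIRCUIT_TABLES[switch][label]
        if next_switch is None:
            return hops, payload               # delivered at the far end
        switch, label = next_switch, next_label
        hops.append(switch)

path, _ = forward("siteA-rtr", 100, b"detector data")
print(" -> ".join(path))   # siteA-rtr -> esnet-chic -> esnet-newy -> siteB-rtr
```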
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.
60
End User View of Circuits - How They Use Them
• Who are the "users"?
  - Sites, for the most part.
• How are the circuits used?
  - End system to end system, IP:
    • Almost never; very hard unless private address space is used.
    • Using public address space can result in leaking routes.
    • Using private address space with multi-homed hosts risks allowing backdoors into secure networks.
  - End system to end system, Ethernet (or other) over VLAN, a pseudowire:
    • Relatively common.
    • Interesting example: RDMA over VLAN is likely to be popular in the future. The SC11 demo of 40G RDMA over the WAN was very successful; the CPU load for RDMA is a small fraction of that for IP; and the guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks).
  - Point-to-point connection between routing instances, e.g. BGP at the end points:
    • Essentially this is how all current circuits are used, from one site router to another site router.
    • Typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters.
61
End User View of Circuits - How They Use Them
• When are the circuits used?
  - Mostly to solve a specific problem that the general infrastructure cannot.
  - Most circuits are used for a guarantee of bandwidth or for user traffic engineering.
Cross-Domain Virtual Circuit Service
Network domains (administrative units):
• For a circuit service to be useful, it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits.
• E.g. ESnet, Internet2 (USA), CANARIE (Canada), GEANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.
63
Inter-Domain Control Protocol
There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.
[Diagram: setting up an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] across ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain has a local inter-domain controller (IDC), e.g. OSCARS in ESnet and AutoBAHN in GEANT, and a data-plane connection helper at each domain ingress/egress point.
1. The domains exchange topology information containing at least potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. an Ethernet VLAN-to-VLAN connection) is facilitated by a helper process at each domain boundary, forming the end-to-end virtual circuit.]
64
Point-to-Point Virtual Circuit Service
• The inter-domain control protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
  - Testing is being coordinated in GLIF (Global Lambda Integrated Facility, an international virtual organization that promotes the paradigm of lambda networking).
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grid framework, so that computing, data access, and data movement can all work together as a predictable system.
• Multi-domain circuit setup is not yet a robust production service, but progress is being made; see lhcone.net
65
8) Provide R&D, consulting, and a knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science.
  - With each generation of network transport technology:
    • 155 Mb/s was the norm for high-speed networks in 1995;
    • 100 Gb/s, 650 times greater, is the norm today.
  - R&D groups involving hardware engineers, computer scientists, and application specialists worked to first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible, and then to do the development necessary for applications to make use of the new capabilities.
  - Examples of how this methodology drove toward today's capabilities include:
    • experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths;
    • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s.
66
Provide R&D, consulting, and a knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base http://fasterdata.es.net covers:
• Network architecture, including the Science DMZ model
• Host tuning
• Network tuning
• Data transfer tools
• Network performance testing
• With special sections on:
  - Linux TCP tuning
  - Cisco 6509 tuning
  - perfSONAR how-to
  - Active perfSONAR services
  - Globus overview
  - Say No to SCP
  - Data Transfer Nodes (DTN)
  - TCP issues explained
• fasterdata.es.net is a community project with contributions from several organizations.
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment: SKA, ITER, ...
69
Infrastructure Critical to Science
• The combination of:
  - new network architectures in the wide area;
  - new network services (such as guaranteed bandwidth virtual circuits);
  - cross-domain network error detection and correction;
  - redesigning the site LAN to handle high data throughput;
  - automation of data movement systems;
  - use of appropriate operating system tuning and data transfer tools;
  now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated at, or sent to, a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.
71
LHC lessons of possible use to the SKA
The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to hold the working data set in one location.
• A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
• The technical aspects of building and operating a centralized working data repository:
  - a large mass storage system with very large cache disks in order to satisfy current requests in an acceptable time;
  - high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites;
  argue against a single large data center.
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
• It decentralizes costs and involves many countries directly in the telescope infrastructure.
• It divides up the network load, especially on the expensive trans-ocean links.
• It divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
  - It might be that in the case of the SKA the T1 links would come to a centralized, distribution-only data node, say in the UK, that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
  - In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
  - In fact it might well be that the SKA could use the LHCONE infrastructure; that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded.
• All high-bandwidth, high data volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
• New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.
Workflow management systems that automate the data movement will have to be designed and tested.
• Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" (simulated operation), building up to at-scale data movement well before instrument turn-on.
75
The Message
Again: a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment: SKA, ITER, ...
76
References
[DIS] "Infrastructure for Data Intensive Science - a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet, Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document".

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://www.es.net/news-and-publications/publications-and-presentations/

References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://www.es.net/news-and-publications/publications-and-presentations/
  "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://www.es.net/news-and-publications/publications-and-presentations/
  "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference 2011. Available at http://www.es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://www.es.net/news-and-publications/publications-and-presentations/. Also see http://www.perfsonar.net and http://psps.perfsonar.net

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
1a) Optical Network TechnologyModern optical transport systems (DWDM = dense wave
division multiplexing) use a collection of technologies called ldquocoherent opticalrdquo processing to achieve more sophisticated optical modulation and therefore higher data density per signal transport unit (symbol) that provides 100Gbs per wave (optical channel)ndash Optical transport using dual polarization-quadrature phase shift keying
(DP-QPSK) technology with coherent detection [OIF1]bull dual polarization
ndash two independent optical signals same frequency orthogonal two polarizations rarr reduces the symbol rate by half
bull quadrature phase shift keying ndash encode data by changing the signal phase of the relative to the optical carrier further reduces the symbol rate by half (sends twice as much data symbol)
Together DP and QPSK reduce required rate by a factor of 4ndash allows 100G payload (plus overhead) to fit into 50GHz of spectrum
bull Actual transmission rate is about 10 higher to include FEC data
ndash This is a substantial simplification of the optical technology involved ndash see the TNC 2013 paper and Chris Tracyrsquos NANOG talk for details [Tracy1] and [Rob1]
WaveLogictrade to provide 100Gbs wavendash 88 waves (optical channels) 100Gbs each
bull wave capacity shared equally with Internet2ndash ~13000 miles 21000 km lit fiberndash 280 optical amplifier sitesndash 70 optical adddrop sites (where routers can be inserted)
bull 46 100G adddrop transpondersbull 22 100G re-gens across wide-area
NEWG
SUNN
KANSDENV
SALT
BOIS
SEAT
SACR
WSAC
LOSA
LASV
ELPA
ALBU
ATLA
WASH
NEWY
BOST
SNLL
PHOE
PAIX
NERSC
LBNLJGI
SLAC
NASHCHAT
CLEV
EQCH
STA
R
ANLCHIC
BNL
ORNL
CINC
SC11
STLO
Internet2
LOUI
FNA
L
Long IslandMAN and
ANI Testbed
O
JACKGeography is
only representational
19
1b) Network routers and switchesESnet5 routing (IP layer 3) is provided by Alcatel-Lucent
7750 routers with 100 Gbs client interfacesndash 17 routers with 100G interfaces
bull several more in a test environment ndash 59 layer-3 100GigE interfaces 8 customer-owned 100G routersndash 7 100G interconnects with other RampE networks at Starlight (Chicago)
MAN LAN (New York) and Sunnyvale (San Francisco)
20
Metro area circuits
SNLL
PNNL
MIT
PSFC
AMES
LLNL
GA
JGI
LBNL
SLACNER
SC
ORNL
ANLFNAL
SALT
INL
PU Physics
SUNN
SEAT
STAR
CHIC
WASH
ATLA
HO
US
BOST
KANS
DENV
ALBQ
LASV
BOIS
SAC
R
ELP
A
SDSC
10
Geographical representation is
approximate
PPPL
CH
AT
10
SUNN STAR AOFA100G testbed
SF Bay Area Chicago New York AmsterdamAMST
US RampE peerings
NREL
Commercial peerings
ESnet routers
Site routers
100G
10-40G
1G Site provided circuits
LIGO
Optical only
SREL
100thinsp
Intrsquol RampE peerings
100thinsp
JLAB
10
10100thinsp
10
100thinsp100thinsp
1
10100thinsp
100thinsp1
100thinsp100thinsp
100thinsp
100thinsp
BNL
NEWY
AOFA
NASH
1
LANL
SNLA
10
10
1
10
10
100thinsp
100thinsp
100thinsp10
1010
100thinsp
100thinsp
10
10
100thinsp
100thinsp
100thinsp
100thinsp
100thinsp
100thinsp100thinsp
100thinsp
10
100thinsp
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for large long-distance flows
Although there are other transport protocols available TCP remains the workhorse of the Internet including for data-
intensive scienceUsing TCP to support the sustained long distance high data-
rate flows of data-intensive science requires an error-free network
Why error-freeTCP is a ldquofragile workhorserdquo It is very sensitive to packet loss (due to bit errors)ndash Very small packet loss rates on these paths result in large decreases
in performance)ndash A single bit error will cause the loss of a 1-9 KBy packet (depending
on the MTU size) as there is no FEC at the IP level for error correctionbull This puts TCP back into ldquoslow startrdquo mode thus reducing throughput
22
Transportbull The reason for TCPrsquos sensitivity to packet loss is that the
slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internetndash Packet loss is seen by TCPrsquos congestion control algorithms as
evidence of congestion so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion leading to network throughput collapse)
ndash Network link errors also cause packet loss so these congestion avoidance algorithms come into play with dramatic effect on throughput in the wide area network ndash hence the need for ldquoerror-freerdquo
23
Transport Impact of packet loss on TCPOn a 10 Gbs LAN path the impact of low packet loss rates is
minimalOn a 10Gbs WAN path the impact of low packet loss rates is
enormous (~80X throughput reduction on transatlantic path)
Implications Error-free paths are essential for high-volume long-distance data transfers
Throughput vs increasing latency on a 10Gbs link with 00046 packet loss
Reno (measured)
Reno (theory)
H-TCP(measured)
No packet loss
(see httpfasterdataesnetperformance-testingperfso
nartroubleshootingpacket-loss)
Network round trip time ms (corresponds roughly to San Francisco to London)
10000
9000
8000
7000
6000
5000
4000
3000
2000
1000
0
Thro
ughp
ut M
bs
24
Transport Modern TCP stackbull A modern TCP stack (the kernel implementation of the TCP
protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])ndash This is done using mechanisms that more quickly increase back to full
speed after an error forces a reset to low bandwidth
ldquoBinary Increase Congestionrdquo control algorithm impact
Note that BIC reaches max throughput much faster than older algorithms (from Linux 2619 the
default is CUBIC a refined version of BIC designed for high bandwidth
long paths)
25
Transport Modern TCP stackEven modern TCP stacks are only of some help in the face of
packet loss on a long path high-speed network
bull For a detailed analysis of the impact of packet loss on various TCP implementations see ldquoAn Investigation into Transport Protocols and Data Transport Applications Over High Performance Networksrdquo chapter 8 (ldquoSystematic Tests of New-TCP Behaviourrdquo) by Yee-Ting Li University College London (PhD thesis) httpwwwslacstanfordedu~ytlthesispdf
Reno (measured)
Reno (theory)
H-TCP (CUBIC refinement)(measured)
Throughput vs increasing latency on a 10Gbs link with 00046 packet loss(tail zoom)
Roundtrip time ms (corresponds roughly to San Francisco to London)
1000
900800700600500400300200100
0
Thro
ughp
ut M
bs
26
3) Monitoring and testingThe only way to keep multi-domain international scale networks error-free is to test and monitor continuously
end-to-end to detect soft errors and facilitate their isolation and correction
perfSONAR provides a standardize way to test measure export catalogue and access performance data from many different network domains (service providers campuses etc)
bull perfSONAR is a community effort tondash define network management data exchange protocols andndash standardized measurement data formats gathering and archiving
perfSONAR is deployed extensively throughout LHC related networks and international networks and at the end sites(See [fasterdata] [perfSONAR] and [NetSrv])
ndash There are now more than 1000 perfSONAR boxes installed in N America and Europe
27
perfSONARThe test and monitor functions can detect soft errors that limit
throughput and can be hard to find (hard errors faults are easily found and corrected)
Soft failure examplebull Observed end-to-end performance degradation due to soft failure of single optical line card
Gb
s
normal performance
degrading performance
repair
bull Why not just rely on ldquoSNMPrdquo interface stats for this sort of error detectionbull not all error conditions show up in SNMP interface statisticsbull SNMP error statistics can be very noisybull some devices lump different error counters into the same bucket so it can be very
challenging to figure out what errors to alarm on and what errors to ignorebull though ESnetrsquos Spectrum monitoring system attempts to apply heuristics to do this
bull many routers will silently drop packets - the only way to find that is to test through them and observe loss using devices other than the culprit device
one month
28
perfSONARThe value of perfSONAR increases dramatically as it is
deployed at more sites so that more of the end-to-end (app-to-app) path can characterized across multiple network domains provides the only widely deployed tool that can monitor circuits end-
to-end across the different networks from the US to Europendash ESnet has perfSONAR testers installed at every PoP and all but the
smallest user sites ndash Internet2 is close to the same
bull perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) and the protocol is implemented using SOAP XML messages
29
4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network
• Host TCP tuning
• Modern TCP stack (see above)
• Other issues (MTU, etc.)
• Data transfer tools and parallelism
• Other data transfer issues (firewalls, etc.)
30
4.1) System software tuning: Host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using, end-to-end
Default TCP buffer sizes are typically much too small for today's high speed networks
– Until recently, default TCP send/receive buffers were typically 64 KB
– Tuned buffer to fill a CA to NY 1 Gb/s path: 10 MB
• 150X bigger than the default buffer size (a back-of-the-envelope calculation of this number is sketched below)
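The 10 MB figure is just the bandwidth-delay product of the path. The short Python sketch below is added for illustration; the ~80 ms CA-to-NY RTT is an assumption, not a number from the original slides.

```python
# Bandwidth-delay product: the TCP window (socket buffer) needed to keep a
# path "full" is  bandwidth (bits/s) * RTT (s) / 8  bytes.
def bdp_bytes(bandwidth_bps: float, rtt_ms: float) -> float:
    return bandwidth_bps * (rtt_ms / 1000.0) / 8.0

# 1 Gb/s California-to-New York path, ~80 ms RTT (assumed for illustration):
print(f"{bdp_bytes(1e9, 80) / 1e6:.0f} MB needed vs. a 64 KB default")  # ~10 MB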
31
System software tuning: Host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications
– How to tune is a function of the application and the path to the destination, so potentially a lot of special cases
Auto-tuning TCP connection buffer size within pre-configured limits helps
Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths (a per-socket example is sketched below)
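For the per-socket exceptions mentioned above, an application can explicitly request large buffers. This is a minimal illustrative sketch, assuming the OS maxima (e.g. net.core.rmem_max / net.core.wmem_max on Linux) have already been raised by the administrator.

```python
import socket

# Request large send/receive buffers on a socket before connecting.
# The kernel silently caps these at its configured maxima, so host-level
# tuning is still required for the request to take full effect.
BUF = 64 * 1024 * 1024  # 64 MB, sized for a high-RTT 10 Gb/s path (illustrative)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF)
print("requested", BUF, "got", s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
```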
32
System software tuning: Host tuning – TCP
[Figure: Throughput (Mb/s) out to ~9000 km path length on a 10 Gb/s network, 32 MB (auto-tuned) vs. 64 MB (hand-tuned) TCP window size, plotted against round-trip time in ms (corresponds roughly to San Francisco to London).]
33
4.2) System software tuning: Data transfer tools
Parallelism is key in data transfer tools (a toy illustration of parallel streams follows this list)
– It is much easier to achieve a given performance level with multiple parallel connections than with one connection
• this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (the same is true for disks)
– Several tools offer parallel transfers (see below)
Latency tolerance is critical
– Wide area data transfers have much higher latency than LAN transfers
– Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds); for example, SCP/SFTP and the HPSS mover protocols work very poorly in long-path networks
• Disk performance
– In general, a RAID array or parallel disks (as in FDT) are needed to get more than about 500 Mb/s
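As a toy illustration of why parallelism helps (this sketch is not any particular tool, and the host, port, and stream count are made-up values), the snippet below pushes data over several TCP connections at once so that one slow or loss-affected stream does not gate the aggregate.

```python
# Toy parallel-stream sender: split a buffer across N TCP connections.
# HOST/PORT and the 4-stream count are illustrative assumptions; real tools
# (GridFTP, FDT, bbcp, ...) add framing, checksums, and error recovery.
import socket
from concurrent.futures import ThreadPoolExecutor

HOST, PORT, STREAMS = "dtn.example.org", 5000, 4
data = b"x" * (64 * 1024 * 1024)                 # 64 MB of dummy payload
chunks = [data[i::STREAMS] for i in range(STREAMS)]  # simple interleave

def send_chunk(chunk: bytes) -> int:
    with socket.create_connection((HOST, PORT)) as s:
        s.sendall(chunk)
        return len(chunk)

with ThreadPoolExecutor(max_workers=STREAMS) as pool:
    sent = sum(pool.map(send_chunk, chunks))
print(f"sent {sent} bytes over {STREAMS} parallel streams")
```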
34
System software tuning: Data transfer tools
Using the right tool is very important.
Sample results, Berkeley, CA to Argonne, IL (RTT = 53 ms, network capacity = 10 Gbps):
• scp: 140 Mbps
• patched scp (HPN): 1.2 Gbps
• ftp: 1.4 Gbps
• GridFTP, 4 streams: 5.4 Gbps
• GridFTP, 8 streams: 6.6 Gbps
Note that to get more than about 1 Gbps (125 MB/s) disk to disk requires using RAID technology.
• PSC (Pittsburgh Supercomputing Center) has a patch set that fixes problems with SSH: http://www.psc.edu/networking/projects/hpn-ssh
– Significant performance increase
• this helps rsync too
35
System software tuning: Data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems
– Parallel streams, buffer tuning, help in getting through firewalls (open ports), ssh, etc.
– The newer Globus Online incorporates all of these plus small-file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP
36
System software tuning: Data transfer tools
Also see Caltech's FDT (Fast Data Transfer) approach
– Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node
– Explicit parallel use of multiple disks
– Can fill 100 Gb/s paths
– See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT
37
4.4) System software tuning: Other issues
Firewalls are anathema to high-speed data flows
– many firewalls can't handle >1 Gb/s flows
• designed for large numbers of low-bandwidth flows
• some firewalls even strip out TCP options that allow for TCP buffers > 64 KB
See Jason Zurawski's "Say Hello to your Frienemy – The Firewall"
Stateful firewalls have inherent problems that inhibit high throughput
• http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues
– Large MTUs (several issues)
– NIC tuning
• Defaults are usually fine for 1GE, but 10GE often requires additional tuning
– Other OS tuning knobs
– See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])
5) Site infrastructure to support data-intensive science: The Science DMZ
With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck
The site network (LAN) typically provides connectivity for local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science
– Therefore a high performance interface between the wide area network and the local area site network is critical for large-scale data movement
Campus network infrastructure is typically not designed to handle the flows of large-scale science
– The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows
• firewalls, proxy servers, low-cost switches, and so forth
• none of which will allow high-volume, high-bandwidth, long-distance data flows
39
The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large-data-volume, high round trip time (RTT) (international paths) character of the wide area network (WAN) flows (see [DIS])
– otherwise the site will impose poor performance on the entire high-speed data path, all the way back to the source
40
The Science DMZ
The ScienceDMZ concept:
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy
– Outside the site firewall – hence the term "ScienceDMZ"
– With dedicated systems built and tuned for wide-area data transfer
– With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below)
– A security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.)
This is so important that it was a requirement for the last round of NSF CC-NIE grants
41
The Science DMZ
(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches)
[Diagram: The Science DMZ. The border router (WAN) feeds a Science DMZ router/switch – a WAN-capable device – that provides a clean, high-bandwidth WAN data path to a high-performance Data Transfer Node and a computing cluster, with dedicated systems built and tuned for wide-area data transfer, network monitoring and testing, and per-service security policy control points. Campus/site access to Science DMZ resources is via the site firewall, as is secured campus/site access to the Internet; the conventional site DMZ (Web, DNS, mail) and the campus/site LAN sit behind the firewall.]
42
6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites
In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers
• The Tier 2 sites get a comparable amount of data from the Tier 1s
– Host the physics groups that analyze the data and do the science
– Provide most of the compute resources for analysis
– Cache the data (though this is evolving to remote IO)
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management
– The resources and data movement are centrally managed
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations (a toy sketch of this data-aware brokering follows below)
– The system manages 10s of thousands of jobs a day
• coordinates data movement of hundreds of terabytes/day, and
• manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial
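As a purely illustrative sketch (not PanDA's actual interfaces; the site names and numbers are made up), the brokering idea – run the job where the data already is, otherwise schedule data movement first – can be expressed as follows.

```python
# Toy data-aware job broker: prefer sites that already hold the dataset;
# otherwise pick a site with free slots and queue a transfer first.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    free_slots: int
    datasets: set

def broker(job_dataset: str, sites: list[Site]):
    with_data = [s for s in sites if job_dataset in s.datasets and s.free_slots > 0]
    if with_data:
        return max(with_data, key=lambda s: s.free_slots), []            # run where the data is
    dest = max((s for s in sites if s.free_slots > 0), key=lambda s: s.free_slots)
    return dest, [("transfer", job_dataset, dest.name)]                   # move data, then run

sites = [Site("BNL", 0, {"dsetA"}), Site("DESY", 120, {"dsetB"}), Site("TRIUMF", 40, {"dsetA"})]
site, actions = broker("dsetA", sites)
print(site.name, actions)
```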
44
[Diagram: The ATLAS PanDA ("Production and Distributed Analysis") system uses distributed resources and layers of automation to manage several million jobs/day. ATLAS production jobs, regional production jobs, and user/group analysis jobs flow through a task buffer (job queue) to the PanDA server (task management) at CERN, with a job broker applying policy (job type, priority) and a job dispatcher sending work out. The job resource manager dispatches a "pilot" job manager – a PanDA job receiver – when resources are available at a site; pilots run under the local site job manager (e.g. Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA, similar to the Condor glide-in approach. Workflow: (1) PanDA schedules jobs and initiates data movement; (2) the Distributed Data Manager (a complex system in its own right, called DQ2) locates data and moves it to sites via DDM agents and the data service; (3) the local resources (grid scheduler, site capability service, site status) are prepared to receive PanDA jobs; (4) jobs are dispatched when there are resources available and when the required data is in place at the site. The general rule: try to move the job to where the data is, else move data and job to where resources are available. Data sources are the CERN ATLAS detector and the Tier 0 data center (1 copy of all data – archival only); the 11 ATLAS Tier 1 data centers, scattered across Europe, North America, and Asia, in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis; the ATLAS analysis sites are, e.g., 70 Tier 2 centers in Europe, North America, and SE Asia. (Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point; both are at Brookhaven National Lab.)]
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBytes/day, ~68 Gb/s
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately in the accompanying plots)
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC
[Plots: accumulated data volume on disk (petabytes) over four years, and the number of simultaneous PanDA jobs of each of the two types over one year.]
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges"
– Successful testing was required for sites to participate in LHC production
47
Ramp-up of LHC traffic in ESnet
[Plot: ESnet traffic over time showing the estimate of "small"-scale traffic, LHC turn-on, the LHC data system testing period, and LHC operation.]
The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years
48
6 cont.) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN
– The LHCOPN is a collection of leased 10 Gb/s optical circuits
– The role of the LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community
• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance. The security issues were the primary ones, and were addressed by:
• Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
– that is, only LHC data and compute servers are connected to the OPN
50
The LHC OPN – Optical Private Network
[Diagram: LHCOPN physical topology (abbreviated) and LHCOPN architecture – CH-CERN connected by dedicated circuits to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network
N.B.:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits
Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact this is what ESnet's OSCARS virtual circuit system was originally designed for (see below)
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
– (there are about 170 Tier 2 sites)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineers
– To ensure continued good performance for the LHC and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC
54
[Map (April 2012): LHCONE – a global infrastructure for the LHC Tier 1 data center to Tier 2 analysis center connectivity. LHCONE VRF domains include ESnet (USA), Internet2 (USA), CANARIE (Canada), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), CUDI (Mexico), CERN (Geneva), TWAREN and ASGC (Taiwan), KREONET2 and KISTI (Korea), TIFR (India), and GÉANT (Europe), interconnected at regional R&E communication nexus points such as Seattle, Chicago, New York, Washington, Geneva, and Amsterdam. End sites are LHC Tier 2 or Tier 3 unless indicated as Tier 1 (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, ASGC-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, CERN-T1). Data communication links are 10, 20, and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because
– The VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance
• See lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
CERN → T1 distances (miles / km):
  France: 350 / 565
  Italy: 570 / 920
  UK: 625 / 1000
  Netherlands: 625 / 1000
  Germany: 700 / 1185
  Spain: 850 / 1400
  Nordic: 1300 / 2100
  USA – New York: 3900 / 6300
  USA – Chicago: 4400 / 7100
  Canada – BC: 5200 / 8400
  Taiwan: 6100 / 9850
[Diagram: A Network Centric View of the LHC. The detector feeds the Level 1 and 2 triggers over O(1-10) meters, the Level 3 trigger over O(10-100) meters, and the CERN Computer Center over O(1) km at about 1 PB/s. The filtered output, 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS), crosses the LHC Optical Private Network (LHCOPN) over 500-10,000 km to the LHC Tier 1 data centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN). The LHC Open Network Environment (LHCONE) then connects the Tier 1 centers to the LHC Tier 2 analysis centers and the many university physics groups – the intent being that the physics groups now get their data wherever it is most readily available.]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to
– Couple existing pockets of code, data, and expertise into "systems of systems"
– Break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
– See https://www.es.net/about/science-requirements
A commonly identified need to support this is that networking must be provided as a "service":
– Schedulable with guaranteed bandwidth – as is done with CPUs and disks
– Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
– Some network path characteristics may also be specified – e.g. diversity
– Available in the Web Services / Grid Services paradigm
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
– This is typically done by using a "static" routing mechanism
• e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
– MPLS and OpenFlow are examples of this, and both can transport IP packets
– Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
– The virtual circuits can be directed to specific physical network paths when they are set up (an illustrative reservation-request sketch follows)
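To make the "network as a schedulable service" idea concrete, the sketch below shows what a circuit reservation request might contain. The field names, values, and structure are hypothetical illustrations; they are not the actual OSCARS or NSI APIs.

```python
# Hypothetical virtual-circuit reservation request (illustrative only --
# these fields do NOT correspond to the real OSCARS/NSI interfaces).
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class CircuitRequest:
    src_endpoint: str            # ingress port/VLAN, e.g. a site border router
    dst_endpoint: str
    bandwidth_mbps: int          # guaranteed bandwidth for the reservation window
    start: datetime
    end: datetime
    path_constraints: list = field(default_factory=list)  # e.g. ["diverse-from:circuit-42"]

req = CircuitRequest(
    src_endpoint="site-a-router:vlan-3001",
    dst_endpoint="site-b-router:vlan-3001",
    bandwidth_mbps=10_000,
    start=datetime(2014, 4, 1, 0, 0),
    end=datetime(2014, 4, 1, 0, 0) + timedelta(hours=12),
)
print(req)
```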
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead, Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service", TERENA Networking Conference 2011, in the references
• OSCARS received a 2013 "R&D 100" award
60
End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part
• How are the circuits used?
– End system to end system, IP
• Almost never – very hard unless private address space is used
– Using public address space can result in leaking routes
– Using private address space with multi-homed hosts risks allowing backdoors into secure networks
– End system to end system, Ethernet (or other) over VLAN – a pseudowire
• Relatively common
• Interesting example: RDMA over VLAN is likely to be popular in the future
– SC11 demo of 40G RDMA over the WAN was very successful
– CPU load for RDMA is a small fraction of that of IP
– The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks)
– Point-to-point connection between routing instances – e.g. BGP at the end points
• Essentially this is how all current circuits are used, from one site router to another site router
– Typically site-to-site, or advertise subnets that host clusters, e.g. LHC analysis or data management clusters
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering
Cross-Domain Virtual Circuit Service
End-to-end circuits typically cross several network domains (administrative units)
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
– e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains
[Diagram: inter-domain virtual circuit setup across FNAL (AS3152) [US], ESnet (AS293) [US], GÉANT (AS20965) [Europe], DFN (AS680) [Germany], and DESY (AS1754) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GÉANT – and a VC setup request is passed from domain to domain, with topology exchange between domains and a data plane connection helper at each domain ingress/egress point, to build the end-to-end virtual circuit from the user source to the user destination.]
1. The domains exchange topology information containing at least potential VC ingress and egress points
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved
3. The data plane connection (e.g. an Ethernet VLAN-to-VLAN connection) is facilitated by a helper process
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grid framework, so that computing, data access, and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net
65
8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
– With each generation of network transport technology:
• 155 Mb/s was the norm for high speed networks in 1995
• 100 Gb/s – 650 times greater – is the norm today
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
• and then to do the development necessary for applications to make use of the new capabilities
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk IO and parallel network IO together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
66
Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical
Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone
The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations
67
The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
69
Infrastructure Critical to Science
• The combination of
– New network architectures in the wide area
– New network services (such as guaranteed bandwidth virtual circuits)
– Cross-domain network error detection and correction
– Redesigning the site LAN to handle high data throughput
– Automation of data movement systems
– Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKA: The similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKA: The lessons
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
– A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository:
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
• high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
militate against a single large data center
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply
There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
• It might be that in the case of the SKA the T1 links would come to a centralized, data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
– In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, e.g., are implementing LHCONE
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical: the ScienceDMZ
Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
75
The Message
Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach", Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer", Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"
[NetServ] "Network Services for High Performance Distributed Computing and Data Management", W. E. Johnston, C. Guok, J. Metzger and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System", Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management", W. E. Johnston, C. Guok, J. Metzger and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service", William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework", B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements
[Rob1] "100G and beyond with digital coherent signal processing", K. Roberts, D. Beckett, D. Boertjes, J. Berthold, C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
Transport Modern TCP stackbull A modern TCP stack (the kernel implementation of the TCP
protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])ndash This is done using mechanisms that more quickly increase back to full
speed after an error forces a reset to low bandwidth
ldquoBinary Increase Congestionrdquo control algorithm impact
Note that BIC reaches max throughput much faster than older algorithms (from Linux 2619 the
default is CUBIC a refined version of BIC designed for high bandwidth
long paths)
25
Transport Modern TCP stackEven modern TCP stacks are only of some help in the face of
packet loss on a long path high-speed network
bull For a detailed analysis of the impact of packet loss on various TCP implementations see ldquoAn Investigation into Transport Protocols and Data Transport Applications Over High Performance Networksrdquo chapter 8 (ldquoSystematic Tests of New-TCP Behaviourrdquo) by Yee-Ting Li University College London (PhD thesis) httpwwwslacstanfordedu~ytlthesispdf
Reno (measured)
Reno (theory)
H-TCP (CUBIC refinement)(measured)
Throughput vs increasing latency on a 10Gbs link with 00046 packet loss(tail zoom)
Roundtrip time ms (corresponds roughly to San Francisco to London)
1000
900800700600500400300200100
0
Thro
ughp
ut M
bs
26
3) Monitoring and testingThe only way to keep multi-domain international scale networks error-free is to test and monitor continuously
end-to-end to detect soft errors and facilitate their isolation and correction
perfSONAR provides a standardize way to test measure export catalogue and access performance data from many different network domains (service providers campuses etc)
bull perfSONAR is a community effort tondash define network management data exchange protocols andndash standardized measurement data formats gathering and archiving
perfSONAR is deployed extensively throughout LHC related networks and international networks and at the end sites(See [fasterdata] [perfSONAR] and [NetSrv])
ndash There are now more than 1000 perfSONAR boxes installed in N America and Europe
27
perfSONARThe test and monitor functions can detect soft errors that limit
throughput and can be hard to find (hard errors faults are easily found and corrected)
Soft failure examplebull Observed end-to-end performance degradation due to soft failure of single optical line card
Gb
s
normal performance
degrading performance
repair
bull Why not just rely on ldquoSNMPrdquo interface stats for this sort of error detectionbull not all error conditions show up in SNMP interface statisticsbull SNMP error statistics can be very noisybull some devices lump different error counters into the same bucket so it can be very
challenging to figure out what errors to alarm on and what errors to ignorebull though ESnetrsquos Spectrum monitoring system attempts to apply heuristics to do this
bull many routers will silently drop packets - the only way to find that is to test through them and observe loss using devices other than the culprit device
one month
28
perfSONARThe value of perfSONAR increases dramatically as it is
deployed at more sites so that more of the end-to-end (app-to-app) path can characterized across multiple network domains provides the only widely deployed tool that can monitor circuits end-
to-end across the different networks from the US to Europendash ESnet has perfSONAR testers installed at every PoP and all but the
smallest user sites ndash Internet2 is close to the same
bull perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) and the protocol is implemented using SOAP XML messages
29
4) System software evolution and optimizationOnce the network is error-free there is still the issue of
efficiently moving data from the application running on a user system onto the network
bull Host TCP tuningbull Modern TCP stack (see above)bull Other issues (MTU etc)
bull Data transfer tools and parallelism
bull Other data transfer issues (firewalls etc)
30
41) System software tuning Host tuning ndash TCPbull ldquoTCP tuningrdquo commonly refers to the proper configuration of
TCP windowing buffers for the path lengthbull It is critical to use the optimal TCP send and receive socket
buffer sizes for the path (RTT) you are using end-to-endDefault TCP buffer sizes are typically much too small for
todayrsquos high speed networksndash Until recently default TCP sendreceive buffers were typically 64 KBndash Tuned buffer to fill CA to NY 1 Gbs path 10 MB
bull 150X bigger than the default buffer size
31
System software tuning Host tuning ndash TCPbull Historically TCP window size tuning parameters were host-
global with exceptions configured per-socket by applicationsndash How to tune is a function of the application and the path to the
destination so potentially a lot of special cases
Auto-tuning TCP connection buffer size within pre-configured limits helps
Auto-tuning however is not a panacea because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (eg international) paths
32
System software tuning Host tuning ndash TCP
Throughput out to ~9000 km on a 10Gbs network32 MBy (auto-tuned) vs 64 MBy (hand tuned) TCP window size
hand tuned to 64 MBy window
Roundtrip time ms (corresponds roughlyto San Francisco to London)
path length
10000900080007000600050004000300020001000
0
Thro
ughp
ut M
bs
auto tuned to 32 MBy window
33
42) System software tuning Data transfer toolsParallelism is key in data transfer tools
ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection
bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)
ndash Several tools offer parallel transfers (see below)
Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN
transfersndash Many tools and protocols assume latencies typical of a LAN
environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long
path networks
bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more
than about 500 Mbs
34
System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL
RTT = 53 ms network capacity = 10GbpsTool Throughput
bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology
bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase
bull this helps rsync too
35
System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-
performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open
ports) ssh etc The newer Globus Online incorporates all of these and small file
support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community
outside of HEP
36
System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach
ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node
ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and
httpmonalisacernchFDT
37
44) System software tuning Other issuesFirewalls are anathema to high-peed data flows
ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for
TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo
Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf
bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning
bull Defaults are usually fine for 1GE but 10GE often requires additional tuning
ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo
([HPBulk])
5) Site infrastructure to support data-intensive scienceThe Science DMZ
With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the
bottleneckThe site network (LAN) typically provides connectivity for local
resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network
and the local area site network is critical for large-scale data movement
Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks
for business and small data-flow purposes usually donrsquot work for large-scale data flows
bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data
flows
39
The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large data volume, high round trip time (RTT) (international paths) of the wide area network (WAN) flows (see [DIS]).
– Otherwise the site will impose poor performance on the entire high-speed data path, all the way back to the source.
40
The Science DMZ
The ScienceDMZ concept:
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy:
 Outside the site firewall – hence the term "ScienceDMZ"
 With dedicated systems built and tuned for wide-area data transfer
 With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below)
 A security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.) – a minimal sketch of such a per-service policy follows below.
This is so important it was a requirement for the last round of NSF CC-NIE grants.
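In practice the security policy tailored for science traffic reduces to a short, auditable access list applied at the Science DMZ router rather than a general-purpose stateful firewall. A minimal sketch of such a per-service policy for a Data Transfer Node follows; the port numbers and subnets are illustrative assumptions, not a recommendation (GridFTP's control channel is conventionally TCP 2811, and the data-channel port range is whatever the site configures):

    from ipaddress import ip_address, ip_network

    # Each rule: (allowed source network, inclusive destination TCP port range, description)
    DTN_POLICY = [
        (ip_network("0.0.0.0/0"),       (2811, 2811),   "GridFTP control channel"),
        (ip_network("0.0.0.0/0"),       (50000, 51000), "GridFTP data channels (site-configured range)"),
        (ip_network("198.51.100.0/24"), (22, 22),       "SSH from the campus management subnet only"),
    ]

    def permitted(src, dport):
        # Return True if a flow to the DTN matches an allow rule; default is deny.
        return any(ip_address(src) in net and lo <= dport <= hi
                   for net, (lo, hi), _ in DTN_POLICY)

    print(permitted("203.0.113.7", 2811))   # True  - GridFTP control from anywhere
    print(permitted("203.0.113.7", 22))     # False - SSH only from the management subnet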
41
The Science DMZ
(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)
[Diagram: Science DMZ architecture. The border router (a WAN-capable device) connects the WAN to a Science DMZ router/switch. The Science DMZ hosts a high performance Data Transfer Node, a computing cluster, and network monitoring and testing systems – dedicated systems built and tuned for wide-area data transfer – behind per-service security policy control points, on a clean, high-bandwidth WAN data path. The campus/site LAN, the site DMZ (Web, DNS, Mail), and secured campus/site access to the Internet sit behind the site firewall; campus/site access to Science DMZ resources is via the site firewall.]
42
6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers.
• The Tier 2 sites get a comparable amount of data from the Tier 1s
– Host the physics groups that analyze the data and do the science
– Provide most of the compute resources for analysis
– Cache the data (though this is evolving to remote I/O)
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management
– The resources and data movement are centrally managed
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations
– The system manages 10s of thousands of jobs a day
• coordinates data movement of hundreds of terabytes/day, and
• manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial.
44
[Diagram: the ATLAS PanDA system. Job sources – ATLAS production jobs, regional production jobs, and user/group analysis jobs – feed the PanDA Server (task management), which contains a Task Buffer (job queue), Job Dispatcher, Job Broker, and Policy module (job type, priority), plus a Data Service backed by Distributed Data Manager (DDM) agents. The CERN ATLAS detector feeds the Tier 0 Data Center, which holds 1 copy of all data (archival only). The ATLAS Tier 1 Data Centers – 11 sites scattered across Europe, North America and Asia – in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis. The ATLAS analysis sites (e.g. 70 Tier 2 centers in Europe, North America and SE Asia) run Pilot Jobs (PanDA job receivers running under the site-specific job manager), with a Grid Scheduler, a Site Capability Service, and site status feedback to PanDA.]
The workflow:
1) PanDA schedules jobs and initiates data movement.
2) The DDM locates data and moves it to sites; this is a complex system in its own right, called DQ2.
3) The Grid Scheduler prepares the local resources to receive PanDA jobs. The job resource manager dispatches a "pilot" job manager – a PanDA job receiver – when resources are available at a site; pilots run under the local site job manager (e.g. Condor, LSF, LCG, …) and accept jobs in a standard format from PanDA, similar to the Condor Glide-in approach.
4) Jobs are dispatched when there are resources available and when the required data is in place at the site.
The guiding rule: try to move the job to where the data is, else move data and job to where resources are available (sketched below).
The ATLAS PanDA "Production and Distributed Analysis" system uses distributed resources and layers of automation to manage several million jobs/day.
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)
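The dispatch rule above is the heart of this kind of workflow system. A toy sketch of that decision follows; the site names and capacities are invented for illustration, and the real PanDA/DQ2 logic is far more elaborate:

    # sites: name -> (free job slots, datasets already resident at the site)
    sites = {
        "BNL":   (500, {"dataset_A", "dataset_B"}),
        "IN2P3": (0,   {"dataset_A"}),
        "KIT":   (300, {"dataset_C"}),
    }

    def dispatch(job_dataset):
        # Prefer a site that already holds the data and has free slots;
        # otherwise pick the site with the most free slots and schedule a transfer.
        with_data = [s for s, (slots, datasets) in sites.items()
                     if job_dataset in datasets and slots > 0]
        if with_data:
            return with_data[0], None                    # run where the data already is
        target = max(sites, key=lambda s: sites[s][0])   # most free capacity
        return target, f"replicate {job_dataset} -> {target}"

    print(dispatch("dataset_A"))   # ('BNL', None)
    print(dispatch("dataset_D"))   # ('BNL', 'replicate dataset_D -> BNL')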
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s (the unit conversion is shown below).
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately in the plots).
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
[Plots: accumulated data volume on disk, growing toward ~150 petabytes over four years at 730 TBytes/day; and the number of concurrent type-1 and type-2 PanDA jobs, each of order 50,000–100,000 over one year.]
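The 730 TBytes/day and ~68 Gb/s figures are the same quantity in different units; the conversion is worth making explicit:

    tb_per_day = 730                      # TBytes moved per day (from the slide)
    bits_per_day = tb_per_day * 1e12 * 8  # bits per day
    gbps = bits_per_day / 86400 / 1e9     # 86400 seconds per day -> Gb/s
    print(f"{gbps:.1f} Gb/s sustained")   # ~67.6 Gb/s, i.e. the ~68 Gb/s quoted above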
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure.
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges".
– Successful testing was required for sites to participate in LHC production.
47
Ramp-up of LHC traffic in ESnet
[Chart: ESnet traffic over time, showing the estimate of "small" scale traffic, the LHC data system testing period, LHC turn-on, and LHC operation.]
The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.
48
6 cont.) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN.
– The LHCOPN is a collection of leased 10 Gb/s optical circuits.
 The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.
• The issues related to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance.
 The security issues were the primary ones, and were addressed by:
• Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
– that is, only LHC data and compute servers are connected to the OPN.
50
The LHC OPN – Optical Private Network
[Diagram: LHCOPN physical topology (abbreviated) and architecture – CH-CERN connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network
NB:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
 Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic.
– (There are about 170 Tier 2 sites.)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems.
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).
In this way the LHC traffic will use circuits designated by the network engineers
– to ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.
54
LHCONE: a global infrastructure for LHC Tier 1 data center – Tier 2 analysis center connectivity (April 2012)
[Map: LHCONE VRF domains – ESnet (USA), Internet2 (USA), CANARIE (Canada), GÉANT (Europe), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), TWAREN and ASGC (Taiwan), KERONET2 (Korea), CUDI (Mexico) – interconnected at regional R&E communication nexus points such as Seattle, Chicago, New York, Washington, Amsterdam and Geneva. End sites include the Tier 1 centers (BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1) and many university and laboratory Tier 2/3 sites; end sites are LHC Tier 2 or Tier 3 unless indicated as Tier 1. Data communication links are 10, 20 and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
CERN → T1 distances:
  France            350 miles    565 km
  Italy             570 miles    920 km
  UK                625 miles   1000 km
  Netherlands       625 miles   1000 km
  Germany           700 miles   1185 km
  Spain             850 miles   1400 km
  Nordic           1300 miles   2100 km
  USA – New York   3900 miles   6300 km
  USA – Chicago    4400 miles   7100 km
  Canada – BC      5200 miles   8400 km
  Taiwan           6100 miles   9850 km
A Network Centric View of the LHC
[Diagram: the detector feeds the Level 1 and 2 triggers (O(1-10) meters away), then the Level 3 trigger (O(10-100) meters), then the CERN computer center (O(1) km); data rates fall from ~1 PB/s at the detector to 50 Gb/s (25 Gb/s ATLAS + 25 Gb/s CMS) leaving CERN. That 50 Gb/s flows over the LHC Optical Private Network (LHCOPN) to the LHC Tier 1 data centers, 500–10,000 km away (Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN). The LHC Open Network Environment (LHCONE) connects the Tier 1 centers to the LHC Tier 2 analysis centers – the universities and physics groups – which now get their data wherever it is most readily available.]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed application systems in order to
– couple existing pockets of code, data and expertise into "systems of systems",
– break up the task of massive data analysis and use data, compute and storage resources that are located at the collaborators' sites,
– see https://www.es.net/about/science-requirements/
A commonly identified need to support this is that networking must be provided as a "service":
– Schedulable with guaranteed bandwidth – as is done with CPUs and disks
– Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
– Some network path characteristics may also be specified – e.g. diversity
– Available in a Web Services / Grid Services paradigm
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
– This is typically done by using a "static" routing mechanism
• e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
– MPLS and OpenFlow are examples of this, and both can transport IP packets
– Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
– The virtual circuits can be directed to specific physical network paths when they are set up (a sketch of such a reservation request follows below)
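Concretely, providing the network as a service means a reservation request that looks much like a batch-scheduler request: endpoints, bandwidth, and a time window, admitted against the capacity already committed on the path. The sketch below is illustrative only – it is not the OSCARS API, and the names and numbers are invented:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class CircuitRequest:
        src: str                    # ingress edge port (placeholder name)
        dst: str                    # egress edge port (placeholder name)
        bandwidth_gbps: float
        start: int                  # reservation start, epoch seconds
        end: int                    # reservation end, epoch seconds
        vlan: Optional[int] = None  # optional traffic-isolation tag

    def admit(request, reservations, link_capacity_gbps=100.0):
        # Admit the request if the bandwidth already committed to reservations that
        # overlap it in time, plus the new request, fits in the link (a conservative check).
        overlapping = [r for r in reservations
                       if r.start < request.end and request.start < r.end]
        committed = sum(r.bandwidth_gbps for r in overlapping)
        return committed + request.bandwidth_gbps <= link_capacity_gbps

    existing = [CircuitRequest("siteA:port1", "siteB:port1", 40.0, 0, 3600)]
    print(admit(CircuitRequest("siteA:port1", "siteB:port1", 50.0, 1800, 5400), existing))  # True
    print(admit(CircuitRequest("siteA:port1", "siteB:port1", 70.0, 1800, 5400), existing))  # False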
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference, 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.
60
End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part.
• How are the circuits used?
– End system to end system, IP:
• Almost never – very hard unless private address space is used
– Using public address space can result in leaking routes
– Using private address space with multi-homed hosts risks allowing backdoors into secure networks
– End system to end system, Ethernet (or other) over VLAN – a pseudowire:
• Relatively common
• Interesting example: RDMA over VLAN, likely to be popular in the future
– the SC11 demo of 40G RDMA over the WAN was very successful
– CPU load for RDMA is a small fraction of that of IP
– the guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
– Point-to-point connection between routing instances – e.g. BGP at the end points:
• Essentially this is how all current circuits are used, from one site router to another site router
– Typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot.
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering.
Cross-Domain Virtual Circuit Service
Circuits must cross many network domains (administrative units).
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits.
– e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling and resource commitment within network domains.
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.
[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] to a user destination at DESY (AS1754) [Germany], crossing ESnet (AS293) [US], GÉANT (AS20965) [Europe], and DFN (AS680) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GÉANT – plus a data plane connection helper at each domain ingress/egress point; the domains exchange topology information and pass the VC setup request from one to the next.]
1. The domains exchange topology information containing at least potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. an Ethernet VLAN-to-VLAN connection) is facilitated by a helper process at each domain ingress/egress point.
The result is the end-to-end virtual circuit (a sketch of this chained reservation follows below).
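The setup described in steps 1–3 is essentially a chain of per-domain reservations that must either all succeed or all be released. A toy sketch of that pattern follows; the domain names mirror the figure, the capacities are invented, and the real IDC/NSI protocol adds topology exchange, authentication and two-phase commit:

    # Domains along the end-to-end path, in order, as in the figure.
    PATH = ["FNAL", "ESnet", "GEANT", "DFN", "DESY"]

    class DomainController:
        # Stand-in for each domain's IDC (OSCARS, AutoBAHN, ...).
        def __init__(self, name, free_gbps):
            self.name, self.free_gbps = name, free_gbps
        def reserve(self, gbps):
            if gbps <= self.free_gbps:
                self.free_gbps -= gbps
                return True
            return False
        def release(self, gbps):
            self.free_gbps += gbps

    def setup_circuit(controllers, gbps):
        # Reserve a segment in each domain in path order; roll back on any failure.
        reserved = []
        for dc in controllers:
            if not dc.reserve(gbps):
                for ok in reserved:      # release the segments already reserved
                    ok.release(gbps)
                return False
            reserved.append(dc)
        return True

    idcs = [DomainController(name, free) for name, free in zip(PATH, [10, 10, 10, 10, 5])]
    print(setup_circuit(idcs, 8))   # False - the DESY segment cannot carry 8 Gb/s
    print(setup_circuit(idcs, 4))   # True  - all five segments reserved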
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access and data movement can all work together as a predictable system.
 Multi-domain circuit setup is not yet a robust production service, but progress is being made.
• See lhcone.net
65
8) Provide R&D, consulting, and a knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science.
– With each generation of network transport technology:
• 155 Mb/s was the norm for high speed networks in 1995,
• 100 Gb/s – 650 times greater – is the norm today.
• R&D groups involving hardware engineers, computer scientists and application specialists worked
• first to demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
• and then to do the development necessary for applications to make use of the new capabilities.
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths,
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s.
66
Provide R&D, consulting, and a knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
 Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
 The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base – http://fasterdata.es.net – topics:
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations.
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ……
69
Infrastructure Critical to Science
• The combination of
– new network architectures in the wide area,
– new network services (such as guaranteed bandwidth virtual circuits),
– cross-domain network error detection and correction,
– redesigning the site LAN to handle high data throughput,
– automation of data movement systems, and
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.
70
LHC lessons of possible use to the SKA: the similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated at, or sent to, a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.
71
LHC lessons of possible use to the SKA: the lessons
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
– A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
– The technical aspects of building and operating a centralized working data repository –
• a large mass storage system with very large cache disks in order to satisfy current requests in an acceptable time, and
• high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites –
militate against a single large data center.
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
 There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
 If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded.
 All high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
 New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
Re-engineering the site LAN-WAN architecture is critical: the ScienceDMZ.
Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
75
The Message
Again … a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ……
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
42) System software tuning Data transfer toolsParallelism is key in data transfer tools
ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection
bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)
ndash Several tools offer parallel transfers (see below)
Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN
transfersndash Many tools and protocols assume latencies typical of a LAN
environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long
path networks
bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more
than about 500 Mbs
34
System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL
RTT = 53 ms network capacity = 10GbpsTool Throughput
bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology
bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase
bull this helps rsync too
35
System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-
performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open
ports) ssh etc The newer Globus Online incorporates all of these and small file
support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community
outside of HEP
36
System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach
ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node
ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and
httpmonalisacernchFDT
37
44) System software tuning Other issuesFirewalls are anathema to high-peed data flows
ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for
TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo
Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf
bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning
bull Defaults are usually fine for 1GE but 10GE often requires additional tuning
ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo
([HPBulk])
5) Site infrastructure to support data-intensive scienceThe Science DMZ
With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the
bottleneckThe site network (LAN) typically provides connectivity for local
resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network
and the local area site network is critical for large-scale data movement
Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks
for business and small data-flow purposes usually donrsquot work for large-scale data flows
bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data
flows
39
The Science DMZTo provide high data-rate access to local resources the site
LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high
speed data path all the way back to the source
40
The Science DMZThe ScienceDMZ concept
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and
rapid fault isolation typically perfSONAR (see [perfSONAR] and below)
A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)
This is so important it was a requirement for last round of NSF CC-NIE grants
41
The Science DMZ
(See httpfasterdataesnetscience-dmz
and [SDMZ] for a much more complete
discussion of the various approaches)
campus siteLAN
high performanceData Transfer Node
computing cluster
cleanhigh-bandwidthWAN data path
campussiteaccess to
Science DMZresources is via the site firewall
secured campussiteaccess to Internet
border routerWAN
Science DMZrouterswitch
campus site
Science DMZ
Site DMZ WebDNS
Mail
network monitoring and testing
A WAN-capable device
per-servicesecurity policycontrol points
site firewall
dedicated systems built and
tuned for wide-area data transfer
42
6) Data movement and management techniquesAutomated data movement is critical for moving 500
terabytesday between 170 international sites In order to effectively move large amounts of data over the
network automated systems must be used to manage workflow and error recovery
bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers
bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)
43
Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the
analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates
compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day
bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10
petabytes of datayear in order to accomplish its science
bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial
44
DDMAgent
DDMAgent
ATLAS production
jobs
Regional production
jobs
User Group analysis jobs
Data Service
Task Buffer(job queue)
Job Dispatcher
PanDA Server(task management)
Job Broker
Policy(job type priority)
ATLA
S Ti
er 1
Data
Cen
ters
11 s
ites
scat
tere
d ac
ross
Euro
pe N
orth
Am
erica
and
Asia
in
aggr
egat
e ho
ld 1
copy
of a
ll dat
a an
d pr
ovide
the
work
ing
data
set f
or d
istrib
ution
to T
ier 2
cen
ters
for a
nalys
isDistributed
Data Manager
Pilot Job(Panda job
receiver running under the site-
specific job manager)
Grid Scheduler
Site Capability Service
CERNATLAS detector
Tier 0 Data Center(1 copy of all data ndash
archival only)
Job resource managerbull Dispatch a ldquopilotrdquo job manager - a
Panda job receiver - when resources are available at a site
bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA
bull Similar to the Condor Glide-in approach
Site status
ATLAS analysis sites(eg 70 Tier 2 Centers in
Europe North America and SE Asia)
DDMAgent
DDMAgent
1) Schedules jobs initiates data movement
2) DDM locates data and moves it to sites
This is a complex system in its own right called DQ2
3) Prepares the local resources to receive Panda jobs
4) Jobs are dispatched when there are resources available and when the required data is
in place at the site
Thanks to Michael Ernst US ATLAS technical lead for his assistance with this
diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)
The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday
CERN
Try to move the job to where the data is else move data and job to where
resources are available
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe N America and SE Asia generate network
data movement of 730 TByday ~68Gbs
Accumulated data volume on disk
730 TBytesday
PanDA manages 120000ndash140000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movementgoing on 24 hrday 9+ monthsyr
that networks must support in order to enable the large-scale science of the LHC
0
50
100
150
Peta
byte
s
four years
0
50000
100000
type
2jo
bs
0
50000
type
1jo
bs
one year
one year
one year
46
Building an LHC-scale production analysis system In order to debug and optimize the distributed system that
accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in
ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC
production
47
Ramp-up of LHC traffic in ESnet
(est of ldquosmallrdquo scale traffic)
LHC
turn
-on
LHC data systemtesting
LHC operationThe transition from testing to operation
was a smooth continuum due toat-scale testing ndash a process that took
more than 5 years
48
6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument
to data centers ndash a dedicated purpose-built infrastructure is needed
bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to
the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the
Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community
• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance
– The security issues were the primary ones, and they were addressed by using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec]) – that is, only LHC data and compute servers are connected to the OPN
50
The LHC OPN – Optical Private Network
[Diagrams: LHCOPN physical topology (abbreviated) and LHCOPN architecture – the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF, connected to CH-CERN]
51
The LHC OPN – Optical Private Network
NB:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below)
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
– (there are about 170 Tier 2 sites)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems.
– The clouds are mostly local to a network domain (e.g., one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineers
– to ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.
54
LHCONE: A global infrastructure for the LHC Tier 1 data center – Tier 2 analysis center connectivity (April 2012)
[Map of LHCONE VRF domains and their interconnects at Seattle, Chicago, New York, Washington, Geneva, and Amsterdam: ESnet and Internet2 (USA), CANARIE (Canada), GÉANT (Europe), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), ASGC and TWAREN (Taiwan), KREONET2 and KISTI (Korea), TIFR (India), and CUDI (Mexico), connecting Tier 1 centers (e.g., BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1a/c, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1) and many Tier 2 and Tier 3 end sites]
Legend: LHCONE VRF domain; end sites are LHC Tier 2 or Tier 3 unless indicated as Tier 1; regional R&E communication nexus; data communication links: 10, 20, and 30 Gb/s. See http://lhcone.net for details.
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance
• See lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
CERN → T1 distance (miles / km)
France: 350 / 565
Italy: 570 / 920
UK: 625 / 1000
Netherlands: 625 / 1000
Germany: 700 / 1185
Spain: 850 / 1400
Nordic: 1300 / 2100
USA – New York: 3900 / 6300
USA – Chicago: 4400 / 7100
Canada – BC: 5200 / 8400
Taiwan: 6100 / 9850
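These route lengths are what make round-trip time, and hence TCP behavior, such an issue. A rough lower bound on RTT follows from the fiber distance alone (a sketch, assuming light in fiber travels at about two-thirds of c; real paths add routing and equipment delay):

```python
# Rough lower bound on CERN -> Tier 1 round-trip time from route length.
ROUTE_KM = {"USA - New York": 6300, "USA - Chicago": 7100, "Taiwan": 9850}
for dest, km in ROUTE_KM.items():
    rtt_ms = 2 * km / 200_000 * 1000     # light in fiber ~200,000 km/s
    print(f"CERN -> {dest}: >= {rtt_ms:.0f} ms RTT")   # e.g. Chicago: >= 71 ms
```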
A Network Centric View of the LHC
[Diagram: detector → Level 1 and 2 triggers (O(1-10) meters, ~1 PB/s) → Level 3 trigger (O(10-100) meters) → CERN Computer Center (O(1) km); from CERN, 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) over the LHC Optical Private Network (LHCOPN), 500-10,000 km, to the LHC Tier 1 Data Centers (Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN); and from the Tier 1s over the LHC Open Network Environment (LHCONE) to the LHC Tier 2 Analysis Centers – universities / physics groups. This is intended to indicate that the physics groups now get their data wherever it is most readily available.]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed application systems in order to
– couple existing pockets of code, data, and expertise into "systems of systems"
– break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
– see https://www.es.net/about/science-requirements/
A commonly identified need to support this is that networking must be provided as a "service":
– Schedulable with guaranteed bandwidth – as is done with CPUs and disks
– Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
– Some network path characteristics may also be specified – e.g., diversity
– Available in a Web Services / Grid Services paradigm
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
– This is typically done by using a "static" routing mechanism
• e.g., some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
– MPLS and OpenFlow are examples of this, and both can transport IP packets
– Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
– The virtual circuits can be directed to specific physical network paths when they are set up
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," in TERENA Networking Conference 2011, in the references
• OSCARS received a 2013 "R&D 100" award
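To make the idea of a schedulable, guaranteed-bandwidth circuit concrete, the sketch below shows the kind of information such a reservation has to carry: endpoints, VLANs, bandwidth, and a time window. It is illustrative only; the CircuitRequest fields and the reserve_circuit() call are hypothetical and are not the OSCARS API.

```python
# Illustrative only: the shape of a point-to-point virtual circuit reservation.
# The field names and reserve_circuit() call are hypothetical, not the OSCARS API.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class CircuitRequest:
    src_endpoint: str      # ingress port/VLAN at the source site's border
    dst_endpoint: str      # egress port/VLAN at the destination site's border
    bandwidth_mbps: int    # guaranteed bandwidth to be scheduled and policed
    start: datetime        # circuits are schedulable, like CPU and disk allocations
    end: datetime

def reserve_circuit(req: CircuitRequest) -> str:
    """Stand-in for a domain controller call; returns a reservation id."""
    print(f"requesting {req.bandwidth_mbps} Mb/s "
          f"{req.src_endpoint} -> {req.dst_endpoint} "
          f"from {req.start:%Y-%m-%d %H:%M} to {req.end:%Y-%m-%d %H:%M}")
    return "resv-0001"

start = datetime(2014, 4, 1, 6, 0)
resv_id = reserve_circuit(CircuitRequest(
    src_endpoint="site-a:xe-0/1/0:vlan-3001",
    dst_endpoint="site-b:et-7/0/0:vlan-3001",
    bandwidth_mbps=5000,
    start=start,
    end=start + timedelta(hours=12)))
```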
60
End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part
• How are the circuits used?
– End system to end system, IP:
• Almost never – very hard unless private address space is used
– Using public address space can result in leaking routes
– Using private address space with multi-homed hosts risks allowing backdoors into secure networks
– End system to end system, Ethernet (or other) over VLAN – a pseudowire:
• Relatively common
• Interesting example: RDMA over VLAN, likely to be popular in the future
– SC11 demo of 40G RDMA over WAN was very successful
– CPU load for RDMA is a small fraction of that of IP
– The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks)
– Point-to-point connection between routing instances – e.g., BGP at the end points:
• Essentially this is how all current circuits are used, from one site router to another site router
– Typically site-to-site, or advertise subnets that host clusters, e.g., LHC analysis or data management clusters
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering
Cross-Domain Virtual Circuit Service: network domains (administrative units)
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
– e.g., ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains
[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US], across ESnet (AS293) [US], GÉANT (AS20965) [Europe], and DFN (AS680) [Germany], to a user destination at DESY (AS1754) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GÉANT – plus a data plane connection helper at each domain ingress/egress point. Topology exchange and VC setup requests pass between the IDCs.]
1) The domains exchange topology information containing at least the potential VC ingress and egress points
2) A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved
3) The data plane connection (e.g., Ethernet VLAN to VLAN connection) is facilitated by a helper process at each domain ingress/egress point
The result is the end-to-end virtual circuit.
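The chained, per-domain nature of the setup in steps 1–3 can be illustrated with a toy example (domain names taken from the diagram above; the per-domain controller behavior is a stand-in, not the real IDC/NSI protocol):

```python
# Toy illustration of multi-domain circuit setup: a request is passed along the
# chain of domain controllers, and the end-to-end circuit exists only if every
# domain can commit its segment.
DOMAIN_CHAIN = ["FNAL", "ESnet", "GEANT", "DFN", "DESY"]   # from the diagram above

def reserve_segment(domain: str, bandwidth_gbps: float) -> bool:
    """Stand-in for each domain's local controller (OSCARS, AutoBAHN, ...)."""
    available = {"FNAL": 10, "ESnet": 100, "GEANT": 100, "DFN": 40, "DESY": 10}
    return bandwidth_gbps <= available[domain]

def setup_end_to_end(bandwidth_gbps: float) -> bool:
    reserved = []
    for domain in DOMAIN_CHAIN:
        if not reserve_segment(domain, bandwidth_gbps):
            print(f"{domain}: cannot reserve {bandwidth_gbps} Gb/s, releasing {reserved}")
            return False      # a real IDC would roll back the partial reservation
        reserved.append(domain)
    print(f"end-to-end circuit reserved across {reserved}")
    return True

setup_end_to_end(8)     # succeeds: every domain can commit at least 8 Gb/s
setup_end_to_end(20)    # fails at the first domain that cannot commit 20 Gb/s
```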
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g., CPU and storage scheduling, in a Web Services / Grid framework, so that computing, data access, and data movement can all work together as a predictable system
– Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net
65
8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
– With each generation of network transport technology:
• 155 Mb/s was the norm for high speed networks in 1995
• 100 Gb/s – 650 times greater – is the norm today
• R&D groups involving hardware engineers, computer scientists, and application specialists worked
• first to demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
• and then to do the development necessary for applications to make use of the new capabilities
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths (a sketch of the parallel-streams idea is given below)
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
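The parallel network I/O referred to above is conceptually simple: split the file into byte ranges and move each range over its own TCP connection. A minimal sketch follows (hypothetical receiver host, illustrative only; a real tool such as GridFTP or FDT also handles reassembly, checksums, and error recovery):

```python
# Minimal sketch of parallel network I/O: send byte ranges of one file over
# several concurrent TCP streams (hypothetical receiver; illustrative only).
import os
import socket
from concurrent.futures import ThreadPoolExecutor

RECEIVER = ("data-receiver.example.org", 5000)   # hypothetical destination host
STREAMS = 8
CHUNK = 4 * 1024 * 1024

def send_range(path, offset, length, stream_id):
    """Send bytes [offset, offset+length) over a dedicated TCP connection."""
    with socket.create_connection(RECEIVER) as s, open(path, "rb") as f:
        s.sendall(f"{stream_id} {offset} {length}\n".encode())   # tiny header
        f.seek(offset)
        remaining = length
        while remaining > 0:
            data = f.read(min(CHUNK, remaining))
            s.sendall(data)
            remaining -= len(data)

def parallel_send(path):
    size = os.path.getsize(path)
    per_stream = (size + STREAMS - 1) // STREAMS          # ceiling division
    with ThreadPoolExecutor(max_workers=STREAMS) as pool:
        for i in range(STREAMS):
            offset = i * per_stream
            length = max(0, min(per_stream, size - offset))
            pool.submit(send_range, path, offset, length, i)

# parallel_send("/data/example/file.root")   # the receiver reassembles the ranges
```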
66
Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning (see the sketch below)
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
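As one concrete example of the Host Tuning material, the snippet below shows the kind of per-socket buffer sizing the knowledge base discusses. The numbers are illustrative only; the right value depends on the bandwidth-delay product of the path, and the kernel caps such requests at net.core.rmem_max / wmem_max.

```python
# Illustrative host-tuning example: request large TCP socket buffers for a
# high bandwidth-delay-product path. Values are examples only; see the
# fasterdata.es.net host tuning pages for guidance.
import socket

BUF_BYTES = 64 * 1024 * 1024          # ~64 MB, sized for ~5 Gb/s at ~100 ms RTT

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_BYTES)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_BYTES)

# Bandwidth-delay product: the bytes that must be in flight to keep the path full.
rtt_s = 0.100                          # e.g. a trans-Atlantic round trip time
target_bps = 5e9                       # 5 Gb/s
print(f"BDP = {target_bps * rtt_s / 8 / 1e6:.0f} MB")   # ~62 MB
```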
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
69
Infrastructure Critical to Science
• The combination of
– new network architectures in the wide area,
– new network services (such as guaranteed bandwidth virtual circuits),
– cross-domain network error detection and correction,
– redesigning the site LAN to handle high data throughput,
– automation of data movement systems, and
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKA
The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
– A deep archive (tape only) copy is probably practical in one location (e.g., the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository –
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time, and
• high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites –
militate against a single large data center
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites
73
LHC lessons of possible use to the SKA
Regardless of distributed vs. centralized working data repository, all of the attendant network lessons will apply:
There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
• It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
Re-engineering the site LAN/WAN architecture is critical: the Science DMZ
Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
75
The Message
Again … a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet, Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., and Laperle, C., Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
Policy(job type priority)
ATLA
S Ti
er 1
Data
Cen
ters
11 s
ites
scat
tere
d ac
ross
Euro
pe N
orth
Am
erica
and
Asia
in
aggr
egat
e ho
ld 1
copy
of a
ll dat
a an
d pr
ovide
the
work
ing
data
set f
or d
istrib
ution
to T
ier 2
cen
ters
for a
nalys
isDistributed
Data Manager
Pilot Job(Panda job
receiver running under the site-
specific job manager)
Grid Scheduler
Site Capability Service
CERNATLAS detector
Tier 0 Data Center(1 copy of all data ndash
archival only)
Job resource managerbull Dispatch a ldquopilotrdquo job manager - a
Panda job receiver - when resources are available at a site
bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA
bull Similar to the Condor Glide-in approach
Site status
ATLAS analysis sites(eg 70 Tier 2 Centers in
Europe North America and SE Asia)
DDMAgent
DDMAgent
1) Schedules jobs initiates data movement
2) DDM locates data and moves it to sites
This is a complex system in its own right called DQ2
3) Prepares the local resources to receive Panda jobs
4) Jobs are dispatched when there are resources available and when the required data is
in place at the site
Thanks to Michael Ernst US ATLAS technical lead for his assistance with this
diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)
The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday
CERN
Try to move the job to where the data is else move data and job to where
resources are available
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe N America and SE Asia generate network
data movement of 730 TByday ~68Gbs
Accumulated data volume on disk
730 TBytesday
PanDA manages 120000ndash140000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movementgoing on 24 hrday 9+ monthsyr
that networks must support in order to enable the large-scale science of the LHC
0
50
100
150
Peta
byte
s
four years
0
50000
100000
type
2jo
bs
0
50000
type
1jo
bs
one year
one year
one year
46
Building an LHC-scale production analysis system In order to debug and optimize the distributed system that
accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in
ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC
production
47
Ramp-up of LHC traffic in ESnet
(est of ldquosmallrdquo scale traffic)
LHC
turn
-on
LHC data systemtesting
LHC operationThe transition from testing to operation
was a smooth continuum due toat-scale testing ndash a process that took
more than 5 years
48
6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument
to data centers ndash a dedicated purpose-built infrastructure is needed
bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to
the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the
Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward
exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community
bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by
bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
ndash that is only LHC data and compute servers are connected to the OPN
50
The LHC OPN ndash Optical Private Network
UK-T1_RAL
NDGF
FR-CCIN2P3
ES-PIC
DE-KIT
NL-T1
US-FNAL-CMS
US-T1-BNL
CA-TRIUMF
TW-ASCG
IT-NFN-CNAF
CH-CERNLHCOPN physical (abbreviated)
LHCOPN architecture
51
The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1
centers data transfer was to use dedicated physical 10G circuits
Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than
5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)
ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)
bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism
bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose
53
The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure
designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each
involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc
ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure
that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC
54
ESnetUSA
Chicago
New YorkBNL-T1
Internet2USA
Harvard
CANARIECanada
UVic
SimFraU
TRIUMF-T1
UAlb UTorMcGilU
Seattle
TWARENTaiwan
NCU NTU
ASGCTaiwan
ASGC-T1
KERONET2Korea
KNU
LHCONE VRF domain
End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1
Regional RampE communication nexus
Data communication links 10 20 and 30 Gbs
See httplhconenet for details
NTU
Chicago
LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity
NORDUnetNordic
NDGF-T1aNDGF-T1a NDGF-T1c
DFNGermany
DESYGSI DE-KIT-T1
GARRItaly
INFN-Nap CNAF-T1
RedIRISSpain
PIC-T1
SARANetherlands
NIKHEF-T1
RENATERFrance
GRIF-IN2P3
Washington
CUDIMexico
UNAM
CC-IN2P3-T1Sub-IN2P3
CEA
CERNGeneva
CERN-T1
SLAC
GLakes
NE
MidWSoW
Geneva
KISTIKorea
TIFRIndia
India
Korea
FNAL-T1
MIT
CaltechUFlorida
UNebPurU
UCSDUWisc
UltraLightUMich
Amsterdam
GEacuteANT Europe
April 2012
55
The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because
ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for
use by the LHC collaboration that cannot be made available for general RampE traffic
bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance
bull See LHCONEnet
LHCONE is one part of the network infrastructure that supports the LHC
CERN rarrT1 miles kms
France 350 565
Italy 570 920
UK 625 1000
Netherlands 625 1000
Germany 700 1185
Spain 850 1400
Nordic 1300 2100
USA ndash New York 3900 6300
USA - Chicago 4400 7100
Canada ndash BC 5200 8400
Taiwan 6100 9850
CERN Computer Center
The LHC Optical Private Network
(LHCOPN)
LHC Tier 1Data Centers
LHC Tier 2 Analysis Centers
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups Universities
physicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
The LHC Open Network
Environment(LHCONE)
50 Gbs (25Gbs ATLAS 25Gbs CMS)
detector
Level 1 and 2 triggers
Level 3 trigger
O(1-10) meter
O(10-100) meters
O(1) km
1 PBs
500-10000 km
This is intended to indicate that the physics
groups now get their datawherever it is most readily
available
A Network Centric View of the LHC
Taiwan Canada USA-Atlas USA-CMS
Nordic
UK
Netherlands Germany Italy
Spain
FranceCERN
57
7) New network servicesPoint-to-Point Virtual Circuit Service
Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly
consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of
systemsrdquondash Break up the task of massive data analysis and use data compute and
storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements
A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work
well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
  – This is typically done by using a "static" routing mechanism
    • e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
  – MPLS and OpenFlow are examples of this, and both can transport IP packets
  – Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage / optimize the use of available network resources and to keep big data flows separate from general traffic
  – The virtual circuits can be directed to specific physical network paths when they are set up (a toy illustration of such pre-installed label switching follows below)
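A toy illustration of the idea, under the assumption of a much-simplified model (not real MPLS): each switch holds a static, pre-installed table mapping an incoming port and label to an outgoing port and label, so that the sequence of table entries defines the circuit path. The switch names and topology are hypothetical.

# Each switch maps (in_port, in_label) -> (out_port, out_label).
tables = {
    "switch-A": {(1, 100): (3, 200)},
    "switch-B": {(2, 200): (4, 300)},
    "switch-C": {(1, 300): (2, "pop")},   # last hop pops the label (circuit egress)
}
# Physical wiring between switch ports (hypothetical topology)
links = {("switch-A", 3): ("switch-B", 2), ("switch-B", 4): ("switch-C", 1)}

def forward(switch, port, label):
    """Follow the pre-provisioned circuit from an ingress switch/port/label."""
    path = [switch]
    while True:
        out_port, out_label = tables[switch][(port, label)]
        if out_label == "pop":            # reached the circuit egress
            return path
        switch, port = links[(switch, out_port)]
        label = out_label
        path.append(switch)

print(forward("switch-A", 1, 100))        # ['switch-A', 'switch-B', 'switch-C']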
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead, Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references
• OSCARS received a 2013 "R&D 100" award
60
End User View of Circuits – How They Use Them
• Who are the "users"?
  – Sites, for the most part
• How are the circuits used?
  – End system to end system, IP
    • Almost never – very hard unless private address space is used
      – Using public address space can result in leaking routes
      – Using private address space with multi-homed hosts risks allowing backdoors into secure networks
  – End system to end system, Ethernet (or other) over VLAN – a pseudowire
    • Relatively common
    • Interesting example: RDMA over VLAN is likely to be popular in the future
      – The SC11 demo of 40G RDMA over the WAN was very successful
      – CPU load for RDMA is a small fraction of that of IP
      – The guaranteed nature of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks)
  – Point-to-point connection between routing instances – e.g. BGP at the end points
    • Essentially this is how all current circuits are used: from one site router to another site router
    • Typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters
61
End User View of Circuits – How They Use Them
• When are the circuits used?
  – Mostly to solve a specific problem that the general infrastructure cannot
    • Most circuits are used for a guarantee of bandwidth or for user traffic engineering

Cross-Domain Virtual Circuit Service
Large-scale science collaborations span multiple network domains (administrative units)
  – For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
  – e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains
63
Inter-Domain Control Protocol
• There are two realms involved:
  1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
  2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains
[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] to a user destination at DESY (AS1754) [Germany], crossing ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GEANT – and a data-plane connection helper at each domain ingress/egress point; topology exchange and VC setup requests pass from IDC to IDC along the path.]
1. The domains exchange topology information containing at least potential VC ingress and egress points
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved
3. The data plane connection (e.g. an Ethernet VLAN-to-VLAN connection) is facilitated by a helper process at each domain boundary
(a small sketch of this domain-by-domain setup chain follows below)
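The following is a minimal sketch, under simplified assumptions, of the domain-by-domain setup described above: each domain's controller reserves its own segment and then passes the request to the next domain, and previously reserved segments are released if any segment cannot be committed. The domain list and reservation logic are stand-ins, not the actual IDC protocol.

DOMAINS = ["ESnet", "GEANT", "DFN"]          # path derived from topology exchange

def reserve_segment(domain, bandwidth_mbps):
    """Stand-in for the local IDC committing resources within its own domain."""
    print(f"{domain}: reserving {bandwidth_mbps} Mb/s segment")
    return True                              # assume success in this sketch

def setup_circuit(bandwidth_mbps):
    reserved = []
    for domain in DOMAINS:
        if not reserve_segment(domain, bandwidth_mbps):
            for d in reversed(reserved):     # roll back on failure
                print(f"{d}: releasing segment")
            return False
        reserved.append(domain)
    print("End-to-end virtual circuit established")
    return True

setup_circuit(5000)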
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
  – Testing is being coordinated in GLIF (the Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grid framework, so that computing, data access, and data movement can all work together as a predictable system (a toy sketch of such co-scheduling follows below)
• Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net
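A toy sketch of the "predictable system" idea, with all services hypothetical: a workflow reserves compute, storage, and a guaranteed-bandwidth circuit for the same time window, and only proceeds when all three reservations succeed.

from datetime import datetime, timedelta, timezone

window_start = datetime.now(timezone.utc) + timedelta(hours=2)
window = (window_start, window_start + timedelta(hours=6))

def reserve(service, **params):
    # Stand-in for calls to the batch scheduler, storage manager, and network
    # IDC; each would return a reservation handle in a real system.
    print(f"reserving {service}: {params}")
    return {"service": service, **params}

reservations = [
    reserve("compute", cores=2000, window=window),
    reserve("storage", terabytes=50, window=window),
    reserve("circuit", bandwidth_mbps=5000, window=window),
]
if all(reservations):
    print("workflow can run as a predictable system in the reserved window")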
65
8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
  – With each generation of network transport technology:
    • 155 Mb/s was the norm for high-speed networks in 1995
    • 100 Gb/s – 650 times greater – is the norm today
  – R&D groups involving hardware engineers, computer scientists, and application specialists worked
    • first to demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
    • and then to do the development necessary for applications to make use of the new capabilities
  – Examples of how this methodology drove toward today's capabilities include:
    • experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
    • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
66
Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations
67
The knowledge base: http://fasterdata.es.net topics
  – Network Architecture, including the Science DMZ model
  – Host Tuning
  – Network Tuning
  – Data Transfer Tools
  – Network Performance Testing
  – With special sections on:
    • Linux TCP Tuning
    • Cisco 6509 Tuning
    • perfSONAR Howto
    • Active perfSONAR Services
    • Globus overview
    • Say No to SCP
    • Data Transfer Nodes (DTN)
    • TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
69
Infrastructure Critical to Science
• The combination of
  – new network architectures in the wide area
  – new network services (such as guaranteed-bandwidth virtual circuits)
  – cross-domain network error detection and correction
  – redesigning the site LAN to handle high data throughput
  – automation of data movement systems
  – use of appropriate operating system tuning and data transfer tools
  now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated / sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKA
The lessons:
• The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
  – A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
  – The technical aspects of building and operating a centralized working data repository –
    • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time, and
    • high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites –
    mitigate against a single large data center
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
  – It decentralizes costs and involves many countries directly in the telescope infrastructure
  – It divides up the network load, especially on the expensive trans-ocean links
  – It divides up the cache I/O load across distributed sites
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply:
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
  – It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1; this choice is a cost and engineering issue
  – In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
  – In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE
74
LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
  – New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
• Re-engineering the site LAN-WAN architecture is critical: the Science DMZ
• Workflow management systems that automate the data movement will have to be designed and tested
  – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
75
The Message
Again… A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet, Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
1) Underlying network issues
At the core of our ability to transport the volume of data that we must deal with today, and to accommodate future growth, are advances in optical transport technology and router technology.
We face a continuous growth of data to transport
[Chart: ESnet accepted traffic, in petabytes/month, over 13 years, showing exponential growth.]
ESnet has seen exponential growth in our traffic every year since 1990 (our traffic grows by a factor of 10 about once every 47 months)
16
We face a continuous growth of data transport
• The LHC data volume is predicted to grow 10-fold over the next 10 years
• New generations of instruments – for example the Square Kilometre Array radio telescope and ITER (the international fusion experiment) – will generate more data than the LHC
• In response, ESnet and most large R&E networks have built 100 Gb/s (per optical channel) networks
  – ESnet's new network – ESnet5 – is complete and provides 44 x 100 Gb/s (4.4 terabits/sec = 4400 gigabits/sec) in optical channels across the entire ESnet national footprint
  – Initially, one of these 100 Gb/s channels is configured to replace the current 4 x 10 Gb/s IP network
• What has made this possible?
17
1a) Optical Network Technology
Modern optical transport systems (DWDM = dense wave division multiplexing) use a collection of technologies called "coherent optical" processing to achieve more sophisticated optical modulation, and therefore higher data density per signal transport unit (symbol), providing 100 Gb/s per wave (optical channel)
  – Optical transport uses dual polarization-quadrature phase shift keying (DP-QPSK) technology with coherent detection [OIF1]
    • dual polarization – two independent optical signals on the same frequency with orthogonal polarizations → reduces the symbol rate by half
    • quadrature phase shift keying – encodes data by changing the phase of the signal relative to the optical carrier → further reduces the symbol rate by half (sends twice as much data per symbol)
  – Together, DP and QPSK reduce the required symbol rate by a factor of 4, which allows a 100G payload (plus overhead) to fit into 50 GHz of spectrum
    • The actual transmission rate is about 10% higher to include FEC data
  – This is a substantial simplification of the optical technology involved – see the TNC 2013 paper and Chris Tracy's NANOG talk for details [Tracy1] and [Rob1]
  (a back-of-the-envelope version of this arithmetic is sketched below)
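A rough back-of-the-envelope check of the factor-of-4 claim above, under the assumption of a ~112 Gb/s line rate (100G payload plus roughly 10-12% FEC and framing overhead); the exact overhead varies by vendor.

payload_gbps = 100.0
line_rate_gbps = 112.0                     # payload + FEC/framing overhead (assumed)
bits_per_symbol = 2 * 2                    # QPSK: 2 bits/symbol, x2 for dual polarization
symbol_rate_gbaud = line_rate_gbps / bits_per_symbol
print(f"symbol rate ~= {symbol_rate_gbaud:.0f} Gbaud")     # ~28 Gbaud
# A ~28 Gbaud signal fits within a 50 GHz DWDM channel, whereas the same
# line rate at 1 bit/symbol (~112 Gbaud) would not.
print("fits in 50 GHz channel:", symbol_rate_gbaud < 50)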
Optical Network Technology
The ESnet5 optical network uses Ciena WaveLogic™ coherent optics to provide 100 Gb/s waves
  – 88 waves (optical channels), 100 Gb/s each
    • wave capacity shared equally with Internet2
  – ~13,000 miles / 21,000 km of lit fiber
  – 280 optical amplifier sites
  – 70 optical add/drop sites (where routers can be inserted)
    • 46 100G add/drop transponders
    • 22 100G re-gens across the wide area
[Map: the ESnet5 / Internet2 national optical footprint – nodes include SEAT, SACR, SUNN, PAIX, SNLL, SLAC, LBNL, JGI, NERSC, LOSA, LASV, SALT, BOIS, DENV, ALBU, ELPA, PHOE, KANS, STLO, CHIC, ANL, FNAL, EQCH, STAR, CLEV, CINC, LOUI, NASH, CHAT, ATLA, JACK, WASH, NEWY, BOST, BNL, ORNL, SC11, plus the Long Island MAN and ANI Testbed; geography is only representational.]
19
1b) Network routers and switches
ESnet5 routing (IP, layer 3) is provided by Alcatel-Lucent 7750 routers with 100 Gb/s client interfaces
  – 17 routers with 100G interfaces (several more in a test environment)
  – 59 layer-3 100 GigE interfaces; 8 customer-owned 100G routers
  – 7 100G interconnects with other R&E networks at Starlight (Chicago), MAN LAN (New York), and Sunnyvale (San Francisco)
20
[Map: "The Energy Sciences Network ESnet5 (Fall 2013)" – ESnet routers, site routers, metro area circuits, and the SUNN-STAR-AOFA-AMST 100G testbed (SF Bay Area, Chicago, New York, Amsterdam). Sites shown include PNNL, SNLL, LLNL, SLAC, NERSC, LBNL, JGI, GA, SDSC, LANL, SNLA, INL, AMES, FNAL, ANL, ORNL, PPPL, PU Physics, MIT/PSFC, BNL, JLAB, NREL, LIGO, SREL. Link capacities of 100G, 10-40G, and 1G, plus site-provided circuits, optical-only segments, commercial peerings, and US and international R&E peerings; geographical representation is approximate.]
2) Data transport: the limitations of TCP must be addressed for large, long-distance flows
Although there are other transport protocols available, TCP remains the workhorse of the Internet, including for data-intensive science.
Using TCP to support the sustained, long-distance, high data-rate flows of data-intensive science requires an error-free network.
Why error-free? TCP is a "fragile workhorse": it is very sensitive to packet loss (due to bit errors)
  – Very small packet loss rates on these paths result in large decreases in performance
  – A single bit error will cause the loss of a 1-9 KByte packet (depending on the MTU size), as there is no FEC at the IP level for error correction
    • This puts TCP back into "slow start" mode, thus reducing throughput
22
Transport
• The reason for TCP's sensitivity to packet loss lies in the slow-start and congestion-avoidance algorithms that were added to TCP to prevent congestion collapse of the Internet
  – Packet loss is seen by TCP's congestion control algorithms as evidence of congestion, so they activate to slow down and de-synchronize the senders (which would otherwise perpetuate and amplify the congestion, leading to network throughput collapse)
  – Network link errors also cause packet loss, so these congestion-avoidance algorithms come into play, with dramatic effect on throughput in the wide area network – hence the need for "error-free"
23
Transport: impact of packet loss on TCP
On a 10 Gb/s LAN path, the impact of low packet loss rates is minimal.
On a 10 Gb/s WAN path, the impact of low packet loss rates is enormous (~80x throughput reduction on a transatlantic path).
Implication: error-free paths are essential for high-volume, long-distance data transfers.
[Chart: throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss. Y-axis: throughput, 0-10,000 Mb/s; X-axis: network round trip time in ms (corresponds roughly to San Francisco to London). Curves: no packet loss (flat at line rate), Reno (measured), Reno (theory), and H-TCP (measured). See http://fasterdata.es.net/performance-testing/perfsonar/troubleshooting/packet-loss]
(the analytic rule of thumb behind this behaviour is sketched below)
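The slides quantify this with measurements; a commonly used analytic rule of thumb, not from the slides, is the Mathis et al. bound for loss-limited Reno-like TCP: throughput <= (MSS / RTT) * (C / sqrt(loss)), with C of roughly 1.22.

from math import sqrt

def mathis_throughput_mbps(mss_bytes, rtt_ms, loss_rate, c=1.22):
    """Upper bound on a single Reno-like TCP flow's throughput, in Mb/s."""
    return (mss_bytes * 8 / (rtt_ms / 1000.0)) * (c / sqrt(loss_rate)) / 1e6

loss = 0.0046 / 100            # the 0.0046% loss rate used in the chart
for rtt in (0.2, 10, 88):      # LAN, continental, and ~transatlantic RTTs in ms
    print(f"RTT {rtt:5.1f} ms -> <= {mathis_throughput_mbps(1460, rtt, loss):8.0f} Mb/s")
# The same tiny loss rate is harmless at sub-millisecond LAN RTTs (the bound is
# above the 10 Gb/s line rate) but devastating at transatlantic RTTs, where the
# bound drops to a few tens of Mb/s.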
24
Transport: modern TCP stack
• A modern TCP stack (the kernel implementation of the TCP protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])
  – This is done using mechanisms that more quickly return to full speed after an error forces a reset to low bandwidth
[Chart: "Binary Increase Congestion" (BIC) control algorithm impact – BIC reaches maximum throughput much faster than older algorithms. (From Linux 2.6.19 the default is CUBIC, a refined version of BIC designed for high-bandwidth, long paths.)]
25
Transport: modern TCP stack
Even modern TCP stacks are only of some help in the face of packet loss on a long-path, high-speed network.
• For a detailed analysis of the impact of packet loss on various TCP implementations, see "An Investigation into Transport Protocols and Data Transport Applications Over High Performance Networks," chapter 8 ("Systematic Tests of New-TCP Behaviour"), by Yee-Ting Li, University College London (PhD thesis): http://www.slac.stanford.edu/~ytl/thesis.pdf
[Chart: throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss (tail zoom, 0-1000 Mb/s). X-axis: round trip time in ms (corresponds roughly to San Francisco to London). Curves: Reno (measured), Reno (theory), H-TCP (CUBIC refinement, measured).]
26
3) Monitoring and testing
The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction.
perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.)
• perfSONAR is a community effort to
  – define network management data exchange protocols, and
  – standardize measurement data formats, gathering, and archiving
perfSONAR is deployed extensively throughout LHC-related networks and international networks, and at the end sites (see [fasterdata], [perfSONAR], and [NetServ])
  – There are now more than 1000 perfSONAR boxes installed in North America and Europe
27
perfSONAR
The test and monitor functions can detect soft errors that limit throughput and can be hard to find (hard errors / faults are easily found and corrected).
Soft failure example:
• Observed end-to-end performance degradation due to soft failure of a single optical line card
[Chart: throughput in Gb/s over one month – normal performance, then degrading performance, then repair.]
• Why not just rely on "SNMP" interface stats for this sort of error detection?
  – not all error conditions show up in SNMP interface statistics
  – SNMP error statistics can be very noisy
  – some devices lump different error counters into the same bucket, so it can be very challenging to figure out what errors to alarm on and what errors to ignore
    • though ESnet's Spectrum monitoring system attempts to apply heuristics to do this
  – many routers will silently drop packets – the only way to find that is to test through them and observe loss using devices other than the culprit device
28
perfSONAR
The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains
• It provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe
  – ESnet has perfSONAR testers installed at every PoP and all but the smallest user sites – Internet2 is close to the same
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages
(a small sketch of the kind of active throughput test such deployments run regularly follows below)
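Illustration only: the perfSONAR toolkit schedules its own measurements with its own tools, but the kind of regular active throughput test it runs can be sketched with iperf3, which is commonly used on test hosts. This assumes iperf3 is installed and a cooperating iperf3 server is running at the hypothetical far-end measurement host.

import json, subprocess

def throughput_test(remote_host, seconds=10):
    out = subprocess.run(
        ["iperf3", "-c", remote_host, "-t", str(seconds), "-J"],  # -J = JSON output
        capture_output=True, text=True, check=True).stdout
    end = json.loads(out)["end"]["sum_sent"]
    gbps = end["bits_per_second"] / 1e9
    # TCP retransmits are a useful proxy for loss ("soft errors") on the path
    print(f"{remote_host}: {gbps:.2f} Gb/s, {end.get('retransmits', 0)} retransmits")

throughput_test("ps-test.example.net")     # hypothetical far-end test host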
29
4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network:
• Host TCP tuning
• Modern TCP stack (see above)
• Other issues (MTU, etc.)
• Data transfer tools and parallelism
• Other data transfer issues (firewalls, etc.)
30
4.1) System software tuning: host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using, end-to-end
• Default TCP buffer sizes are typically much too small for today's high-speed networks
  – Until recently, default TCP send/receive buffers were typically 64 KB
  – Tuned buffer to fill a CA-to-NY 1 Gb/s path: 10 MB – 150x bigger than the default buffer size
  (a sketch of sizing these buffers per-socket follows below)
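A minimal sketch of per-socket TCP buffer tuning: size the send/receive buffers to the bandwidth-delay product (BDP) of the path. Note that system-wide kernel limits (e.g. net.core.rmem_max / wmem_max on Linux) must also be raised, or the values requested here will be capped; the host name below is hypothetical.

import socket

def bdp_bytes(bandwidth_gbps, rtt_ms):
    """Bandwidth-delay product: the buffer needed to keep the path full."""
    return int(bandwidth_gbps * 1e9 / 8 * rtt_ms / 1000)

buf = bdp_bytes(1.0, 80)                       # 1 Gb/s at 80 ms RTT -> 10 MB
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buf)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buf)
print(f"requested {buf / 1e6:.0f} MB socket buffers")
# s.connect(("dtn.example.org", 2811))         # then transfer as usual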
31
System software tuning: host tuning – TCP
• Historically, TCP window-size tuning parameters were host-global, with exceptions configured per-socket by applications
  – How to tune is a function of the application and the path to the destination, so there are potentially a lot of special cases
• Auto-tuning the TCP connection buffer size within pre-configured limits helps
• Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths
32
System software tuning: host tuning – TCP
[Chart: throughput out to ~9000 km on a 10 Gb/s network, 32 MB (auto-tuned) vs. 64 MB (hand-tuned) TCP window size. Y-axis: throughput, 0-10,000 Mb/s; X-axis: round trip time in ms (corresponds roughly to San Francisco to London) / path length. The hand-tuned 64 MB window sustains higher throughput at long RTTs than the auto-tuned 32 MB window.]
33
4.2) System software tuning: data transfer tools
Parallelism is key in data transfer tools
  – It is much easier to achieve a given performance level with multiple parallel connections than with one connection
    • this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (the same is true for disks)
  – Several tools offer parallel transfers (see below; a toy sketch of striping one transfer across parallel TCP connections also follows)
Latency tolerance is critical
  – Wide area data transfers have much higher latency than LAN transfers
  – Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds); for example, SCP/SFTP and HPSS mover protocols work very poorly in long-path networks
• Disk performance
  – In general you need a RAID array or parallel disks (as in FDT) to get more than about 500 Mb/s
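A toy sketch of the parallel-streams idea used by tools like GridFTP and FDT: split a file into N byte ranges and send each range over its own TCP connection concurrently. Real tools add framing, checksums, restart, and disk parallelism; the destination host and port here are hypothetical.

import os, socket
from concurrent.futures import ThreadPoolExecutor

def send_range(path, offset, length, dest=("dtn.example.org", 5000)):
    with socket.create_connection(dest) as s, open(path, "rb") as f:
        f.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = f.read(min(1 << 20, remaining))   # 1 MB reads
            s.sendall(chunk)
            remaining -= len(chunk)
    return length

def parallel_send(path, streams=8):
    size = os.path.getsize(path)
    step = (size + streams - 1) // streams
    ranges = [(i * step, min(step, size - i * step)) for i in range(streams)
              if i * step < size]
    with ThreadPoolExecutor(max_workers=streams) as pool:
        sent = sum(pool.map(lambda r: send_range(path, *r), ranges))
    print(f"sent {sent} bytes over {len(ranges)} parallel streams")

# parallel_send("/data/dataset.root", streams=8)      # hypothetical source file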
34
System software tuning: data transfer tools
Using the right tool is very important.
Sample results, Berkeley, CA to Argonne, IL (RTT = 53 ms, network capacity = 10 Gbps):
  Tool                      Throughput
  • scp                     140 Mbps
  • patched scp (HPN)       1.2 Gbps
  • ftp                     1.4 Gbps
  • GridFTP, 4 streams      5.4 Gbps
  • GridFTP, 8 streams      6.6 Gbps
Note that to get more than about 1 Gbps (125 MB/s) disk-to-disk requires using RAID technology.
• PSC (Pittsburgh Supercomputing Center) has a patch set that fixes problems with SSH: http://www.psc.edu/networking/projects/hpn-ssh
  – Significant performance increase (this helps rsync too)
35
System software tuning: data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems
  – Parallel streams, buffer tuning, help in getting through firewalls (open ports), ssh, etc.
  – The newer Globus Online incorporates all of these plus small-file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP
36
System software tuning: data transfer tools
Also see Caltech's FDT (Fast Data Transfer) approach
  – Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node
  – Explicit parallel use of multiple disks
  – Can fill 100 Gb/s paths
  – See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT
37
4.4) System software tuning: other issues
Firewalls are anathema to high-speed data flows
  – many firewalls can't handle >1 Gb/s flows
    • they are designed for large numbers of low-bandwidth flows
    • some firewalls even strip out TCP options that allow for TCP buffers >64 KB
  – See Jason Zurawski's "Say Hello to your Frienemy – The Firewall"
  – Stateful firewalls have inherent problems that inhibit high throughput
    • http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues
  – Large MTUs (several issues)
  – NIC tuning
    • Defaults are usually fine for 1GE, but 10GE often requires additional tuning
  – Other OS tuning knobs
  – See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])
5) Site infrastructure to support data-intensive science: the Science DMZ
With the wide area part of the network infrastructure addressed, the typical site / campus LAN becomes the bottleneck.
The site network (LAN) typically provides connectivity for local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science
  – Therefore a high-performance interface between the wide area network and the local area site network is critical for large-scale data movement
Campus network infrastructure is typically not designed to handle the flows of large-scale science
  – The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows
    • firewalls, proxy servers, low-cost switches, and so forth
    • none of which will allow high-volume, high-bandwidth, long-distance data flows
39
The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high bandwidth, large data volume, and high round trip time (RTT) (international paths) of the wide area network (WAN) flows (see [DIS])
  – otherwise the site will impose poor performance on the entire high-speed data path, all the way back to the source
40
The Science DMZ
The Science DMZ concept: the compute and data resources involved in data-intensive science should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy
  • Outside the site firewall – hence the term "Science DMZ"
  • With dedicated systems built and tuned for wide-area data transfer
  • With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below)
  • With a security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.)
This is so important that it was a requirement for the last round of NSF CC-NIE grants.
41
The Science DMZ
(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)
[Diagram: the border router connects the WAN to a Science DMZ router/switch (a WAN-capable device) with a clean, high-bandwidth WAN data path to a high-performance Data Transfer Node and a computing cluster, along with network monitoring and testing systems and per-service security policy control points; the dedicated systems are built and tuned for wide-area data transfer. The campus / site LAN and the site DMZ (Web, DNS, mail) sit behind the site firewall; campus/site access to Science DMZ resources is via the site firewall, and secured campus/site access to the Internet follows the normal path.]
42
6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers
• The Tier 2 sites get a comparable amount of data from the Tier 1s
  – They host the physics groups that analyze the data and do the science
  – They provide most of the compute resources for analysis
  – They cache the data (though this is evolving to remote I/O)
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management
  – The resources and data movement are centrally managed
  – Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations
  – The system manages tens of thousands of jobs a day
    • coordinates data movement of hundreds of terabytes/day, and
    • manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial
44
[Diagram: the ATLAS PanDA "Production and Distributed Analysis" system, which uses distributed resources and layers of automation to manage several million jobs/day. ATLAS production jobs, regional production jobs, and user / group analysis jobs enter the PanDA Server (task management) through a task buffer (job queue) and are handled by a job broker and job dispatcher subject to policy (job type priority), with a data service and Distributed Data Manager (DDM) agents. The CERN Tier 0 data center holds one archival-only copy of all data from the ATLAS detector; the 11 ATLAS Tier 1 data centers, scattered across Europe, North America, and Asia, in aggregate hold one copy of all data and provide the working data set for distribution to Tier 2 centers for analysis; the ATLAS analysis sites (e.g. 70 Tier 2 centers in Europe, North America, and SE Asia) run the jobs. Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point (both are at Brookhaven National Lab).]
• Job resource manager: a "pilot" job manager – a PanDA job receiver – is dispatched when resources are available at a site; pilots run under the local site job manager (e.g. Condor, LSF, LCG, …) and accept jobs in a standard format from PanDA (similar to the Condor glide-in approach)
1. PanDA schedules jobs and initiates data movement – it tries to move the job to where the data is, else moves data and job to where resources are available
2. The DDM locates data and moves it to sites (this is a complex system in its own right, called DQ2)
3. The pilot prepares the local resources to receive PanDA jobs
4. Jobs are dispatched when there are resources available and when the required data is in place at the site
(a highly simplified sketch of this "move the job to the data, else move the data" brokerage decision follows below)
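A highly simplified sketch, not PanDA's actual algorithm, of the brokerage rule described above: run the job where its input dataset already has a replica and free capacity, and only otherwise schedule a data transfer to the least-loaded site. Site names, slot counts, and the dataset name are hypothetical.

SITE_FREE_SLOTS = {"BNL": 120, "DE-KIT": 0, "CC-IN2P3": 45}
DATASET_REPLICAS = {"data12_8TeV.periodA": {"DE-KIT", "CC-IN2P3"}}

def broker(job_dataset):
    replicas = DATASET_REPLICAS.get(job_dataset, set())
    # Prefer sites that already hold the data and have free slots
    local = [s for s in replicas if SITE_FREE_SLOTS.get(s, 0) > 0]
    if local:
        return max(local, key=SITE_FREE_SLOTS.get), None     # no transfer needed
    # Otherwise pick the least-loaded site and schedule a data transfer to it
    site = max(SITE_FREE_SLOTS, key=SITE_FREE_SLOTS.get)
    return site, f"transfer {job_dataset} -> {site}"

site, transfer = broker("data12_8TeV.periodA")
print("run at:", site, "| data movement:", transfer or "none")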
45
Scale of ATLAS analysis-driven data movement
The PanDA jobs executing at centers all over Europe, North America, and SE Asia generate network data movement of 730 TBytes/day, ~68 Gb/s.
[Charts: accumulated data volume on disk, rising to ~150 petabytes over four years; and simultaneous PanDA jobs over one year – PanDA manages 120,000-140,000 simultaneous jobs (the two types of jobs PanDA manages are shown separately).]
It is this scale of data movement, going on 24 hours/day, 9+ months/year, that networks must support in order to enable the large-scale science of the LHC.
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure
  – Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges"
  – Successful testing was required for sites to participate in LHC production
47
Ramp-up of LHC traffic in ESnet
[Chart: ESnet traffic over time, showing an estimate of the "small"-scale traffic during LHC data system testing, then LHC turn-on and LHC operation. The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.]
48
6, cont.) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN
  – The LHCOPN is a collection of leased 10 Gb/s optical circuits
  – The role of the LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously
    • In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community
• The issues related to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance
  – The security issues were the primary ones and were addressed by using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec])
    • that is, only LHC data and compute servers are connected to the OPN
50
The LHC OPN – Optical Private Network
[Diagram: LHCOPN physical topology (abbreviated) and LHCOPN architecture – CH-CERN at the hub, connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network
NB:
• In 2005, the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays
  – The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below)
  – However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic
  – In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
  – (there are about 170 Tier 2 sites)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavyweight mechanism
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems
  – The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 ("fronts" for the US universities), etc.)
  – The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineers
  – to ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC
54
ESnetUSA
Chicago
New YorkBNL-T1
Internet2USA
Harvard
CANARIECanada
UVic
SimFraU
TRIUMF-T1
UAlb UTorMcGilU
Seattle
TWARENTaiwan
NCU NTU
ASGCTaiwan
ASGC-T1
KERONET2Korea
KNU
LHCONE VRF domain
End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1
Regional RampE communication nexus
Data communication links 10 20 and 30 Gbs
See httplhconenet for details
NTU
Chicago
LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity
NORDUnetNordic
NDGF-T1aNDGF-T1a NDGF-T1c
DFNGermany
DESYGSI DE-KIT-T1
GARRItaly
INFN-Nap CNAF-T1
RedIRISSpain
PIC-T1
SARANetherlands
NIKHEF-T1
RENATERFrance
GRIF-IN2P3
Washington
CUDIMexico
UNAM
CC-IN2P3-T1Sub-IN2P3
CEA
CERNGeneva
CERN-T1
SLAC
GLakes
NE
MidWSoW
Geneva
KISTIKorea
TIFRIndia
India
Korea
FNAL-T1
MIT
CaltechUFlorida
UNebPurU
UCSDUWisc
UltraLightUMich
Amsterdam
GEacuteANT Europe
April 2012
55
The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because
ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for
use by the LHC collaboration that cannot be made available for general RampE traffic
bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance
bull See LHCONEnet
LHCONE is one part of the network infrastructure that supports the LHC
CERN rarrT1 miles kms
France 350 565
Italy 570 920
UK 625 1000
Netherlands 625 1000
Germany 700 1185
Spain 850 1400
Nordic 1300 2100
USA ndash New York 3900 6300
USA - Chicago 4400 7100
Canada ndash BC 5200 8400
Taiwan 6100 9850
CERN Computer Center
The LHC Optical Private Network
(LHCOPN)
LHC Tier 1Data Centers
LHC Tier 2 Analysis Centers
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups Universities
physicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
The LHC Open Network
Environment(LHCONE)
50 Gbs (25Gbs ATLAS 25Gbs CMS)
detector
Level 1 and 2 triggers
Level 3 trigger
O(1-10) meter
O(10-100) meters
O(1) km
1 PBs
500-10000 km
This is intended to indicate that the physics
groups now get their datawherever it is most readily
available
A Network Centric View of the LHC
Taiwan Canada USA-Atlas USA-CMS
Nordic
UK
Netherlands Germany Italy
Spain
FranceCERN
57
7) New network servicesPoint-to-Point Virtual Circuit Service
Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly
consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of
systemsrdquondash Break up the task of massive data analysis and use data compute and
storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements
A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work
well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm
58
Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual
circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism
bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path
ndash MPLS and OpenFlow are examples of this and both can transport IP packets
ndash Most modern Internet routers have this type of functionality
bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths
when they are set up
59
Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service
(For more information contact the project lead Chin Guok chinesnet)
bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references
bull OSCARS received a 2013 ldquoRampD 100rdquo award
60
End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo
ndash Sites for the most part
bull How are the circuits usedndash End system to end system IP
bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure
networks
ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future
ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP
protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
ndash Point-to-point connection between routing instance ndash eg BGP at the end points
bull Essentially this is how all current circuits are used from one site router to another site router
ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters
61
End User View of Circuits ndash How They Use Thembull When are the circuits used
ndash Mostly to solve a specific problem that the general infrastructure cannot
bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering
network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains
involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET
(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains
63
Inter-Domain Control Protocolbull There are two realms involved
1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains
2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains
FNAL (AS3152)[US]
ESnet (AS293)[US]
GEANT (AS20965)[Europe]
DFN (AS680)[Germany]
DESY (AS1754)[Germany]
Topology exchange
VC setup request
Local InterDomain
Controller
Local IDC
Local IDC
Local IDC
Local IDC
VC setup request
VC setup request
VC setup request
OSCARS
User source
User destination
VC setup request
data plane connection helper at each domain ingressegress point
data plane connection helper at each domain ingressegress point
1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to
domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process
The end-to-end virtual circuit
AutoBAHN
64
Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved
into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated
Facility - an international virtual organization that promotes the paradigm of lambda networking)
bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet
65
8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the
network to support data-intensive sciencendash With each generation of network transport technology
bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and
application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo
end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of
the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities
includebull experiments in the 1990s in using parallel disk IO and parallel network IO
together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths
bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs
66
Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects
are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build
a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at
httpfasterdataesnet and contains contributions from several organizations
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Science
• The combination of:
  – new network architectures in the wide area,
  – new network services (such as guaranteed bandwidth virtual circuits),
  – cross-domain network error detection and correction,
  – redesigning the site LAN to handle high data throughput,
  – automation of data movement systems, and
  – use of appropriate operating system tuning and data transfer tools
  now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.
70
LHC lessons of possible use to the SKA: the similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated/sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.
71
LHC lessons of possible use to the SKA: the lessons
• The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
  – A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
  – The technical aspects of building and operating a centralized working data repository:
    • a large mass storage system with very large cache disks in order to satisfy current requests in an acceptable time, and
    • high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
    argue against a single large data center.
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
  • It might be that, in the case of the SKA, the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
  • In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
  – In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
74
LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
  • New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
• Re-engineering the site LAN-WAN architecture is critical: the Science DMZ.
• Workflow management systems that automate the data movement will have to be designed and tested.
  – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
75
The Message
Again: a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document".
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010.
(May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOE's Office of Science
DOE Office of Science and ESnet – the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
4.1) System software tuning: Host tuning – TCP
System software tuning: Host tuning – TCP
System software tuning: Host tuning – TCP
4.2) System software tuning: Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
4.4) System software tuning: Other issues
5) Site infrastructure to support data-intensive science: The Science DMZ
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN – Optical Private Network
The LHC OPN – Optical Private Network (2)
The LHC OPN – Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHC's Open Network Environment – LHCONE
Slide 54
The LHC's Open Network Environment – LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits – How They Use Them
End User View of Circuits – How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide R&D consulting and knowledge base
Provide R&D consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
16
We face a continuous growth of data transport
• The LHC data volume is predicted to grow 10-fold over the next 10 years.
• New generations of instruments – for example the Square Kilometer Array radio telescope and ITER (the international fusion experiment) – will generate more data than the LHC.
• In response, ESnet and most large R&E networks have built 100 Gb/s (per optical channel) networks.
  – ESnet's new network – ESnet5 – is complete and provides 44 x 100 Gb/s (4.4 terabits/sec = 4400 gigabits/sec) in optical channels across the entire ESnet national footprint.
  – Initially, one of these 100 Gb/s channels is configured to replace the current 4 x 10 Gb/s IP network.
• What has made this possible?
17
1a) Optical Network Technology
Modern optical transport systems (DWDM = dense wave division multiplexing) use a collection of technologies called "coherent optical" processing to achieve more sophisticated optical modulation, and therefore higher data density per signal transport unit (symbol), providing 100 Gb/s per wave (optical channel):
– Optical transport using dual polarization-quadrature phase shift keying (DP-QPSK) technology with coherent detection [OIF1].
  • dual polarization
    – two independent optical signals on the same frequency, in two orthogonal polarizations → reduces the symbol rate by half
  • quadrature phase shift keying
    – encodes data by changing the signal phase relative to the optical carrier → further reduces the symbol rate by half (sends twice as much data per symbol)
  Together, DP and QPSK reduce the required symbol rate by a factor of 4.
– This allows a 100G payload (plus overhead) to fit into 50 GHz of spectrum.
  • The actual transmission rate is about 10% higher, to include FEC data.
– This is a substantial simplification of the optical technology involved – see the TNC 2013 paper and Chris Tracy's NANOG talk for details [Tracy1] and [Rob1].
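As a rough worked example (my arithmetic, assuming a line rate of about 112 Gb/s once FEC overhead is added to the 100 Gb/s payload), the factor-of-4 reduction keeps the symbol rate within what a 50 GHz channel can carry:

$$ R_{\mathrm{symbol}} \;\approx\; \frac{R_{\mathrm{line}}}{2\ \text{polarizations} \times 2\ \text{bits/symbol (QPSK)}} \;\approx\; \frac{112\ \mathrm{Gb/s}}{4} \;=\; 28\ \mathrm{GBd} $$

That is, roughly 28 Gbaud per carrier, which fits a standard 50 GHz grid slot, whereas a single-polarization binary format would need on the order of 112 GBd.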
Optical Network Technology
ESnet5's optical network uses WaveLogic™ coherent transport to provide 100 Gb/s per wave:
– 88 waves (optical channels), 100 Gb/s each
  • wave capacity shared equally with Internet2
– ~13,000 miles / 21,000 km of lit fiber
– 280 optical amplifier sites
– 70 optical add/drop sites (where routers can be inserted)
  • 46 100G add/drop transponders
  • 22 100G re-gens across the wide area
[Map: ESnet5 optical network node locations across the national footprint (e.g. SEAT, SUNN, SACR, SNLL, SLAC, LBNL/JGI, NERSC, LOSA, LASV, PHOE, ALBU, ELPA, SALT, BOIS, DENV, KANS, STLO, CHIC/ANL, FNAL, CLEV, CINC, NASH, CHAT, ATLA, JACK, WASH, NEWY, BOST, BNL, ORNL), plus the Long Island MAN and ANI Testbed, with wave capacity shared with Internet2; geography is only representational.]
19
1b) Network routers and switches
ESnet5 routing (IP layer 3) is provided by Alcatel-Lucent 7750 routers with 100 Gb/s client interfaces:
– 17 routers with 100G interfaces
  • several more in a test environment
– 59 layer-3 100 GigE interfaces; 8 customer-owned 100G routers
– 7 100G interconnects with other R&E networks at Starlight (Chicago), MAN LAN (New York), and Sunnyvale (San Francisco)
20
[Map: the ESnet5 routed network – ESnet routers and site routers at the national labs and user facilities (PNNL, LBNL, SLAC, LLNL, SNLL, LANL, SNLA, FNAL, ANL, ORNL, BNL, PPPL, JLAB, GA, INL, NREL, AMES, MIT/PSFC, LIGO, SDSC, JGI, NERSC, etc.), 100G backbone links, 10-40G and 1G site-provided circuits, metro area circuits, commercial, US, and international R&E peerings, and the SUNN-STAR-AOFA 100G testbed (SF Bay Area, Chicago, New York, Amsterdam); geographical representation is approximate.]
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport: the limitations of TCP must be addressed for large, long-distance flows
Although there are other transport protocols available, TCP remains the workhorse of the Internet, including for data-intensive science.
Using TCP to support the sustained, long-distance, high data-rate flows of data-intensive science requires an error-free network.
Why error-free? TCP is a "fragile workhorse": it is very sensitive to packet loss (due to bit errors).
– Very small packet loss rates on these paths result in large decreases in performance.
– A single bit error will cause the loss of a 1-9 KByte packet (depending on the MTU size), as there is no FEC at the IP level for error correction.
  • This puts TCP back into "slow start" mode, thus reducing throughput.
22
Transport
• The reason for TCP's sensitivity to packet loss is the slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internet.
  – Packet loss is seen by TCP's congestion control algorithms as evidence of congestion, so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion, leading to network throughput collapse).
  – Network link errors also cause packet loss, so these congestion avoidance algorithms come into play, with dramatic effect on throughput in the wide area network – hence the need for "error-free".
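A rough worked example (my numbers, not from the slides) shows why even a single loss is so costly at these speeds: on a 10 Gb/s path with an 88 ms RTT the bandwidth-delay product is about 110 MB, and after a loss classic TCP halves its window and then grows it back by only about one MSS (~1500 bytes) per RTT, so

$$ t_{\mathrm{recover}} \;\approx\; \frac{W/2}{\mathrm{MSS}} \times \mathrm{RTT} \;\approx\; \frac{55\ \mathrm{MB}}{1500\ \mathrm{B}} \times 0.088\ \mathrm{s} \;\approx\; 3.2\times10^{3}\ \mathrm{s} \;\approx\; 54\ \mathrm{minutes} $$

of below-capacity operation per loss event, which is why loss rates that are negligible on a LAN dominate WAN behavior.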
23
Transport: Impact of packet loss on TCP
On a 10 Gb/s LAN path the impact of low packet loss rates is minimal.
On a 10 Gb/s WAN path the impact of low packet loss rates is enormous (~80x throughput reduction on a transatlantic path).
Implication: error-free paths are essential for high-volume, long-distance data transfers.
[Figure: Throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss – measured and theoretical TCP Reno, measured H-TCP, and the no-packet-loss case; throughput (Mb/s, 0-10,000) vs. network round trip time in ms (the largest RTT corresponds roughly to San Francisco to London). See http://fasterdata.es.net/performance-testing/perfsonar/troubleshooting/packet-loss]
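The scale of this effect can be estimated with the well-known Mathis et al. bound on standard TCP throughput, roughly (MSS/RTT)·(1/√p) for loss rate p. A minimal sketch, using illustrative numbers that match the plot's 0.0046% loss rate (the exact measured values on the slide will differ):

```python
from math import sqrt

def tcp_reno_limit_bps(mss_bytes, rtt_s, loss_rate):
    """Mathis et al. upper bound on standard TCP throughput for a given loss rate."""
    return (mss_bytes * 8 / rtt_s) * (1.0 / sqrt(loss_rate))

MSS = 1460          # bytes per segment
LOSS = 0.000046     # 0.0046% packet loss

for label, rtt in [("metro/LAN, 1 ms RTT", 0.001),
                   ("San Francisco-London, ~88 ms RTT", 0.088)]:
    print(f"{label}: <= {tcp_reno_limit_bps(MSS, rtt, LOSS) / 1e9:.2f} Gb/s")
# ~1.7 Gb/s at 1 ms vs ~0.02 Gb/s at 88 ms -- the order-of-80x collapse the plot shows
```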
24
Transport: Modern TCP stack
• A modern TCP stack (the kernel implementation of the TCP protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk]).
  – This is done using mechanisms that more quickly increase back to full speed after an error forces a reset to low bandwidth.
[Figure: "Binary Increase Congestion" (BIC) control algorithm impact – note that BIC reaches maximum throughput much faster than older algorithms. (From Linux 2.6.19 the default is CUBIC, a refined version of BIC designed for high-bandwidth, long paths.)]
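On Linux the congestion control algorithm can also be selected per socket rather than system-wide; a small sketch (Linux-specific; it assumes the chosen module, e.g. htcp, is available in the kernel):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# TCP_CONGESTION is Linux-only (exposed by Python 3.6+); pick a high-BDP-friendly
# algorithm such as "cubic" (the kernel default since 2.6.19) or "htcp".
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"htcp")
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16))  # verify selection
```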
25
Transport: Modern TCP stack
Even modern TCP stacks are only of some help in the face of packet loss on a long-path, high-speed network.
• For a detailed analysis of the impact of packet loss on various TCP implementations, see "An Investigation into Transport Protocols and Data Transport Applications Over High Performance Networks," chapter 8 ("Systematic Tests of New-TCP Behaviour"), by Yee-Ting Li, University College London (PhD thesis): http://www.slac.stanford.edu/~ytl/thesis.pdf
[Figure: Throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss (tail zoom, 0-1000 Mb/s) – measured and theoretical Reno vs. measured H-TCP (a CUBIC refinement); round trip time in ms (corresponds roughly to San Francisco to London).]
26
3) Monitoring and testing
The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction.
• perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.).
• perfSONAR is a community effort to:
  – define network management data exchange protocols, and
  – standardize measurement data formats, gathering, and archiving.
• perfSONAR is deployed extensively throughout LHC-related networks and international networks, and at the end sites (see [fasterdata], [perfSONAR], and [NetServ]).
  – There are now more than 1000 perfSONAR boxes installed in North America and Europe.
27
perfSONAR
The test and monitor functions can detect soft errors that limit throughput and can be hard to find (hard errors/faults are easily found and corrected).
Soft failure example:
• Observed end-to-end performance degradation due to soft failure of a single optical line card.
[Figure: one month of throughput measurements (Gb/s) showing normal performance, a period of degrading performance, and recovery after repair.]
• Why not just rely on SNMP interface stats for this sort of error detection?
  • Not all error conditions show up in SNMP interface statistics.
  • SNMP error statistics can be very noisy.
  • Some devices lump different error counters into the same bucket, so it can be very challenging to figure out what errors to alarm on and what errors to ignore.
    • Though ESnet's Spectrum monitoring system attempts to apply heuristics to do this.
  • Many routers will silently drop packets – the only way to find that is to test through them and observe loss using devices other than the culprit device.
28
perfSONAR
The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains.
• It provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe.
  – ESnet has perfSONAR testers installed at every PoP and all but the smallest user sites – Internet2 is close to the same.
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages.
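The essential idea is regular, scheduled end-to-end tests whose history makes soft failures visible. A conceptual sketch of that idea (this is not perfSONAR itself; it assumes iperf3 is installed and that a hypothetical far-end host `testhost.example.org` is running `iperf3 -s`):

```python
import json
import subprocess
import time

REMOTE = "testhost.example.org"   # hypothetical far-end test host
EXPECTED_GBPS = 9.0               # what a clean 10G path should deliver

def throughput_gbps():
    """Run one scheduled throughput test and return the measured rate."""
    out = subprocess.run(["iperf3", "-c", REMOTE, "-t", "20", "--json"],
                         capture_output=True, check=True, text=True)
    bps = json.loads(out.stdout)["end"]["sum_received"]["bits_per_second"]
    return bps / 1e9

while True:
    gbps = throughput_gbps()
    if gbps < 0.5 * EXPECTED_GBPS:           # crude "soft failure" alarm threshold
        print(f"ALARM: path to {REMOTE} degraded: {gbps:.2f} Gb/s")
    time.sleep(6 * 3600)                     # a few scheduled tests per day
```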
29
4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network:
• host TCP tuning
• a modern TCP stack (see above)
• other issues (MTU, etc.)
• data transfer tools and parallelism
• other data transfer issues (firewalls, etc.)
30
4.1) System software tuning: Host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length.
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using end-to-end.
• Default TCP buffer sizes are typically much too small for today's high-speed networks.
  – Until recently, default TCP send/receive buffers were typically 64 KB.
  – Tuned buffer to fill a CA-to-NY 1 Gb/s path: 10 MB
    • 150x bigger than the default buffer size.
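The required buffer is just the bandwidth-delay product of the path; a quick check of the CA-to-NY example above (assuming roughly 80 ms of round-trip time for that path):

```python
def bdp_bytes(bits_per_second, rtt_seconds):
    """Bandwidth-delay product: bytes in flight needed to keep the pipe full."""
    return bits_per_second * rtt_seconds / 8

print(bdp_bytes(1e9, 0.080) / 1e6)    # 1 Gb/s x 80 ms   -> 10.0 MB (the slide's example)
print(bdp_bytes(10e9, 0.088) / 1e6)   # 10 Gb/s x 88 ms  -> 110 MB for a SF-London path
```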
31
System software tuning: Host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications.
  – How to tune is a function of the application and the path to the destination, so there are potentially a lot of special cases.
• Auto-tuning the TCP connection buffer size within pre-configured limits helps.
• Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths.
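Raising those upper limits is exactly what the host-tuning guidance on fasterdata.es.net is about. A minimal sketch of the kind of Linux settings involved (the parameter names are real sysctls; the values are representative assumptions, not the knowledge base's exact current recommendations, and applying them requires root):

```python
# Representative TCP host-tuning settings for a high-BDP path.
TUNING = {
    "net.core.rmem_max":               "67108864",             # max receive buffer: 64 MB
    "net.core.wmem_max":               "67108864",             # max send buffer: 64 MB
    "net.ipv4.tcp_rmem":               "4096 87380 33554432",  # min / default / auto-tune max
    "net.ipv4.tcp_wmem":               "4096 65536 33554432",
    "net.ipv4.tcp_congestion_control": "htcp",                 # modern stack (H-TCP or CUBIC)
}

def apply(settings):
    for key, value in settings.items():
        path = "/proc/sys/" + key.replace(".", "/")
        with open(path, "w") as f:     # /etc/sysctl.conf is the usual persistent route
            f.write(value)

if __name__ == "__main__":
    apply(TUNING)
```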
32
System software tuning: Host tuning – TCP
[Figure: Throughput out to ~9000 km path length on a 10 Gb/s network, 32 MByte (auto-tuned) vs. 64 MByte (hand-tuned) TCP window size – throughput (Mb/s, 0-10,000) vs. round trip time in ms (the longest path corresponds roughly to San Francisco to London), with one curve hand-tuned to a 64 MByte window and one auto-tuned to a 32 MByte window.]
33
4.2) System software tuning: Data transfer tools
Parallelism is key in data transfer tools.
– It is much easier to achieve a given performance level with multiple parallel connections than with one connection.
  • This is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (the same is true for disks).
– Several tools offer parallel transfers (see below, and the sketch that follows).
Latency tolerance is critical.
– Wide area data transfers have much higher latency than LAN transfers.
– Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds); for example, SCP/SFTP and the HPSS mover protocols work very poorly in long-path networks.
• Disk performance:
  – In general, a RAID array or parallel disks (as in FDT) are needed to get more than about 500 Mb/s.
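A minimal sketch of the parallel-streams idea, using several concurrent byte-range fetches of one file (the URL is hypothetical and the server is assumed to honor HTTP Range requests; real tools such as GridFTP or FDT do this natively over their own protocols):

```python
import concurrent.futures
import urllib.request

URL = "https://data.example.org/dataset/file.bin"   # hypothetical data source
NUM_STREAMS = 8                                      # parallel TCP connections
CHUNK = 64 * 1024 * 1024                             # 64 MB per request

def content_length(url):
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return int(resp.headers["Content-Length"])

def fetch_range(start, end):
    # Each worker uses its own TCP connection, so the aggregate keeps the path full.
    req = urllib.request.Request(URL, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return start, resp.read()

def parallel_download(out_path):
    size = content_length(URL)
    ranges = [(off, min(off + CHUNK, size) - 1) for off in range(0, size, CHUNK)]
    with open(out_path, "wb") as out, \
         concurrent.futures.ThreadPoolExecutor(max_workers=NUM_STREAMS) as pool:
        for start, data in pool.map(lambda r: fetch_range(*r), ranges):
            out.seek(start)
            out.write(data)

if __name__ == "__main__":
    parallel_download("file.bin")
```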
34
System software tuning: Data transfer tools
Using the right tool is very important.
Sample results, Berkeley, CA to Argonne, IL (RTT = 53 ms, network capacity = 10 Gbps):

    Tool                      Throughput
    scp                       140 Mbps
    patched scp (HPN)         1.2 Gbps
    ftp                       1.4 Gbps
    GridFTP, 4 streams        5.4 Gbps
    GridFTP, 8 streams        6.6 Gbps

Note that to get more than about 1 Gbps (125 MB/s) disk-to-disk requires using RAID technology.
• PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSH:
  – http://www.psc.edu/networking/projects/hpn-ssh
  – Significant performance increase
    • this helps rsync too
35
System software tuning: Data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems.
• Parallel streams and buffer tuning help in getting through firewalls (open ports), ssh, etc.
• The newer Globus Online incorporates all of these plus small-file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP.
36
System software tuning: Data transfer tools
Also see Caltech's FDT (Fast Data Transfer) approach:
– Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node.
– Explicit parallel use of multiple disks.
– Can fill 100 Gb/s paths.
– See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT
37
4.4) System software tuning: Other issues
• Firewalls are anathema to high-speed data flows.
  – Many firewalls can't handle >1 Gb/s flows:
    • they are designed for a large number of low-bandwidth flows;
    • some firewalls even strip out TCP options that allow for TCP buffers > 64 KB.
  – See Jason Zurawski's "Say Hello to your Frienemy – The Firewall".
  – Stateful firewalls have inherent problems that inhibit high throughput.
    • http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues:
  – Large MTUs (several issues)
  – NIC tuning
    • Defaults are usually fine for 1GE, but 10GE often requires additional tuning.
  – Other OS tuning knobs
  – See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])
5) Site infrastructure to support data-intensive science: The Science DMZ
With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck.
The site network (LAN) typically provides connectivity for the local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science.
– Therefore a high-performance interface between the wide area network and the local area site network is critical for large-scale data movement.
Campus network infrastructure is typically not designed to handle the flows of large-scale science.
– The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows:
  • firewalls, proxy servers, low-cost switches, and so forth,
  • none of which will allow high-volume, high-bandwidth, long-distance data flows.
39
The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large-data-volume, high round trip time (RTT) (international paths) wide area network (WAN) flows (see [DIS]).
– Otherwise the site will impose poor performance on the entire high-speed data path, all the way back to the source.
40
The Science DMZ
The Science DMZ concept: the compute and data resources involved in data-intensive science should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy:
• outside the site firewall – hence the term "Science DMZ";
• with dedicated systems built and tuned for wide-area data transfer;
• with test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below);
• with a security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. hardware that supports access control lists, private address space, etc.).
This is so important that it was a requirement for the last round of NSF CC-NIE grants.
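One way to picture that "tailored security policy" is as a short, explicit list of permitted science flows enforced with stateless router ACLs rather than a general-purpose stateful firewall. A purely conceptual sketch (every prefix, host, and port here is made up for illustration):

```python
from ipaddress import ip_address, ip_network

# Hypothetical per-service policy for a Science DMZ data transfer node (DTN).
ALLOWED_FLOWS = [
    # (remote collaborator prefix, local DTN address, destination port)
    (ip_network("192.0.2.0/24"),   ip_address("198.51.100.10"), 2811),  # GridFTP control
    (ip_network("192.0.2.0/24"),   ip_address("198.51.100.10"), 50000), # data channel range start
    (ip_network("203.0.113.0/24"), ip_address("198.51.100.11"), 861),   # perfSONAR owamp
]

def permitted(src, dst, dport):
    """Stateless check of the kind an ACL performs: one explicit rule per science service."""
    return any(src in net and dst == host and dport == port
               for net, host, port in ALLOWED_FLOWS)

print(permitted(ip_address("192.0.2.17"), ip_address("198.51.100.10"), 2811))  # True
```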
41
The Science DMZ
(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)
[Diagram: Science DMZ architecture – the border router connects the WAN to a Science DMZ router/switch (a WAN-capable device) that provides a clean, high-bandwidth WAN data path to dedicated high-performance Data Transfer Nodes, built and tuned for wide-area data transfer, and to network monitoring and testing systems, with per-service security policy control points. The campus/site LAN, the computing cluster, and the site DMZ (web/DNS/mail) sit behind the site firewall; campus/site access to Science DMZ resources is via the site firewall, as is secured campus/site access to the Internet.]
6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
• In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data, at a rate of about 25 Gb/s, is sent to 10 national Tier 1 data centers.
• The Tier 2 sites get a comparable amount of data from the Tier 1s.
  – They host the physics groups that analyze the data and do the science.
  – They provide most of the compute resources for analysis.
  – They cache the data (though this is evolving to remote I/O).
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management.
  – The resources and data movement are centrally managed.
  – Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations.
  – The system manages tens of thousands of jobs a day:
    • it coordinates data movement of hundreds of terabytes/day, and
    • manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science.
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial (a conceptual sketch of the pilot-job pattern follows below).
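The pilot-job, pull-model idea at the heart of PanDA can be sketched conceptually as follows (this is not PanDA code; the site names, queue, and job records are invented stand-ins):

```python
import queue
import threading
import time

# Hypothetical central task queue standing in for the PanDA server's job broker.
panda_queue = queue.Queue()

def pilot(site_name):
    """A 'pilot': started by the local batch system when a slot is free,
    it then pulls real work from the central broker (pull model)."""
    while True:
        try:
            job = panda_queue.get(timeout=5)     # ask the broker for work
        except queue.Empty:
            return                               # no work: pilot exits, slot is freed
        print(f"{site_name}: staging input dataset {job['dataset']}")
        time.sleep(job["runtime"])               # stand-in for the analysis payload
        print(f"{site_name}: job {job['id']} done, output registered")
        panda_queue.task_done()

# Broker side: queue a few jobs, each tagged with the dataset it needs.
for i in range(4):
    panda_queue.put({"id": i, "dataset": f"data12_8TeV.{i:05d}", "runtime": 0.1})

workers = [threading.Thread(target=pilot, args=(s,)) for s in ("BNL", "SLAC")]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

The pull model is what lets the central manager "try to move the job to where the data is" while the sites retain control of when slots become available.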
44
[Diagram: The ATLAS PanDA "Production and Distributed Analysis" system uses distributed resources and layers of automation to manage several million jobs/day.
• ATLAS production jobs, regional production jobs, and user/group analysis jobs enter the PanDA Server (task management) at CERN, which comprises a Data Service, a Task Buffer (job queue), a Job Broker, a Job Dispatcher, and a Policy module (job type priority).
• 1) The server schedules jobs and initiates data movement. 2) The Distributed Data Manager (DDM agents) locates data and moves it to sites – this is a complex system in its own right, called DQ2. 3) The Grid Scheduler, Site Capability Service, and site status information prepare the local resources to receive PanDA jobs. 4) Jobs are dispatched when resources are available and when the required data is in place at the site.
• Job resource manager: a "pilot" job manager – a PanDA job receiver – is dispatched when resources are available at a site. Pilots run under the local site job manager (e.g. Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA; this is similar to the Condor glide-in approach.
• The strategy is to try to move the job to where the data is, else move data and job to where resources are available.
• Data sources and sinks: the CERN ATLAS detector and Tier 0 Data Center (one copy of all data – archival only); the 11 ATLAS Tier 1 Data Centers scattered across Europe, North America, and Asia, which in aggregate hold one copy of all data and provide the working data set for distribution to Tier 2 centers for analysis; and the ATLAS analysis sites (e.g. 70 Tier 2 Centers in Europe, North America, and SE Asia).
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, North America, and SE Asia generate network data movement of 730 TBytes/day, ~68 Gb/s.
[Charts: accumulated data volume on disk grows to ~150 Petabytes over four years; PanDA manages 120,000-140,000 simultaneous jobs (PanDA manages two types of jobs, shown separately, each plotted over one year).]
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
46
Building an LHC-scale production analysis system
• In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure.
  – Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges".
  – Successful testing was required for sites to participate in LHC production.
47
Ramp-up of LHC traffic in ESnet
[Chart: ESnet traffic growth, showing an estimate of the "small"-scale traffic, the LHC data system testing period, LHC turn-on, and LHC operation.]
The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.
48
6 cont) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN.
  – The LHCOPN is a collection of leased 10 Gb/s optical circuits.
  – The role of the LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
    • In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.
• The issues related to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance.
  – The security issues were the primary ones, and were addressed by using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec]):
    – that is, only LHC data and compute servers are connected to the OPN.
50
The LHC OPN – Optical Private Network
[Diagram: LHCOPN physical topology (abbreviated) and LHCOPN architecture – CH-CERN connected to the Tier 1 centers UK-T1-RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network
N.B.:
• In 2005, the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
  – The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
  – However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic.
– (There are about 170 Tier 2 sites.)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems.
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.).
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).
In this way the LHC traffic will use circuits designated by the network engineers:
– to ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.
54
[Map: LHCONE – a global infrastructure for LHC Tier 1 data center to Tier 2 analysis center connectivity (April 2012). It shows the LHCONE VRF domains (ESnet and Internet2 in the USA, CANARIE in Canada, GÉANT in Europe with the NRENs DFN, GARR, RedIRIS, SARA, RENATER, and NORDUnet, plus ASGC and TWAREN in Taiwan, KREONET2 in Korea, CUDI in Mexico, and others), the end sites (LHC Tier 2 or Tier 3 unless indicated as Tier 1, e.g. BNL-T1, FNAL-T1, TRIUMF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, NDGF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1), the regional R&E communication nexus points (Seattle, Chicago, New York, Washington, Amsterdam, Geneva, ...), and data communication links of 10, 20, and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because:
  – the VRF technology is a standard capability in most core routers, and
  – there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See LHCONE.net
LHCONE is one part of the network infrastructure that supports the LHC
Approximate distances from CERN to the Tier 1 centers:

    CERN -> T1          miles     km
    France                350     565
    Italy                 570     920
    UK                    625    1000
    Netherlands           625    1000
    Germany               700    1185
    Spain                 850    1400
    Nordic               1300    2100
    USA - New York       3900    6300
    USA - Chicago        4400    7100
    Canada - BC          5200    8400
    Taiwan               6100    9850

[Diagram: A Network Centric View of the LHC. The detector (1 PB/s) feeds the Level 1 and 2 triggers over O(1-10) meters, then the Level 3 trigger over O(10-100) meters, then the CERN Computer Center over O(1) km. The LHC Optical Private Network (LHCOPN) carries 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) over 500-10,000 km to the LHC Tier 1 Data Centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN). The LHC Open Network Environment (LHCONE) connects the Tier 1 centers to the LHC Tier 2 analysis centers at universities and physics groups – this is intended to indicate that the physics groups now get their data wherever it is most readily available.]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to:
  – couple existing pockets of code, data, and expertise into "systems of systems";
  – break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites.
  – See https://www.es.net/about/science-requirements/
A commonly identified need to support this is that networking must be provided as a "service":
– schedulable, with guaranteed bandwidth – as is done with CPUs and disks;
– traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure;
– some network path characteristics may also be specified – e.g. diversity;
– available in the Web Services / Grid Services paradigm.
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
– This is typically done by using a "static" routing mechanism.
  • E.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path.
  – MPLS and OpenFlow are examples of this, and both can transport IP packets.
  – Most modern Internet routers have this type of functionality.
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage and optimize the use of available network resources and to keep big data flows separate from general traffic.
  – The virtual circuits can be directed to specific physical network paths when they are set up.
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.
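From a user's (or workflow system's) point of view, a guaranteed-bandwidth reservation is just a small, schedulable request handed to the domain controller. A purely illustrative sketch of the kind of request such services accept – the field names and endpoint identifiers below are invented for illustration, not the real OSCARS API:

```python
import json
from datetime import datetime, timedelta, timezone

# Hypothetical reservation: 10 Gb/s between two site routers for a 6-hour transfer window.
start = datetime.now(timezone.utc) + timedelta(hours=1)
reservation = {
    "src_endpoint": "site-a-router:xe-0/1/0",   # made-up endpoint identifiers
    "dst_endpoint": "site-b-router:xe-2/0/3",
    "bandwidth_mbps": 10000,
    "start_time": start.isoformat(),
    "end_time": (start + timedelta(hours=6)).isoformat(),
    "description": "LHC Tier1 -> Tier2 bulk replication",
}
print(json.dumps(reservation, indent=2))
# A real deployment would submit this to the domain's circuit controller, which
# checks policy, schedules it, and sets up the MPLS/OpenFlow path at start_time.
```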
60
End User View of Circuits – How They Use Them
• Who are the "users"?
  – Sites, for the most part.
• How are the circuits used?
  – End system to end system, IP:
    • Almost never – very hard unless private address space is used.
      – Using public address space can result in leaking routes.
      – Using private address space with multi-homed hosts risks allowing backdoors into secure networks.
  – End system to end system, Ethernet (or other) over VLAN – a pseudowire:
    • Relatively common.
    • Interesting example: RDMA over VLAN is likely to be popular in the future.
      – The SC11 demo of 40G RDMA over the WAN was very successful.
      – The CPU load for RDMA is a small fraction of that for IP.
      – The guaranteed network characteristics of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fit nicely with circuit services (RDMA performs very poorly on best-effort networks).
  – Point-to-point connection between routing instances – e.g. BGP at the end points:
    • Essentially this is how all current circuits are used: from one site router to another site router.
    • Typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters.
61
End User View of Circuits – How They Use Them
• When are the circuits used?
  – Mostly to solve a specific problem that the general infrastructure cannot.
    • Most circuits are used for a guarantee of bandwidth or for user traffic engineering.
Cross-Domain Virtual Circuit Service
• Large-scale science collaborations span many network domains (administrative units).
  – For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits.
  – E.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.
63
Inter-Domain Control Protocol
• There are two realms involved:
  1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
  2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.
[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] across ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GEANT – plus a data-plane connection helper at each domain ingress/egress point.
1. The domains exchange topology information containing at least the potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process.]
64
Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved
into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated
Facility - an international virtual organization that promotes the paradigm of lambda networking)
bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet
65
8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the
network to support data-intensive sciencendash With each generation of network transport technology
bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and
application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo
end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of
the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities
includebull experiments in the 1990s in using parallel disk IO and parallel network IO
together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths
bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs
66
Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects
are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build
a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at
httpfasterdataesnet and contains contributions from several organizations
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional
centers) has worked wellndash It decentralizes costs and involves many countries directly in the
telescope infrastructurendash It divides up the network load especially on the expensive trans-
ocean linksndash It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKARegardless of distributed vs centralized working data
repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source
to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a
centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue
bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC
ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
78
References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations
httpwwwperfsonarnet
httppspsperfsonarnet
[REQ] httpswwwesnetaboutscience-requirements
[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010
(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )
[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223
17
1a) Optical Network Technology
Modern optical transport systems (DWDM = dense wave division multiplexing) use a collection of technologies called "coherent optical" processing to achieve more sophisticated optical modulation, and therefore higher data density per signal transport unit (symbol), providing 100 Gb/s per wave (optical channel)
– Optical transport using dual polarization-quadrature phase shift keying (DP-QPSK) technology with coherent detection [OIF1]
• dual polarization – two independent optical signals, same frequency, on two orthogonal polarizations → reduces the symbol rate by half
• quadrature phase shift keying – encodes data by changing the phase of the signal relative to the optical carrier → further reduces the symbol rate by half (sends twice as much data per symbol)
Together, DP and QPSK reduce the required symbol rate by a factor of 4
– allows a 100G payload (plus overhead) to fit into 50 GHz of spectrum
• The actual transmission rate is about 10% higher, to include FEC data
– This is a substantial simplification of the optical technology involved – see the TNC 2013 paper and Chris Tracy's NANOG talk for details ([Tracy1] and [Rob1])
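The factor-of-four arithmetic can be sketched in a few lines. A minimal back-of-the-envelope calculation, assuming an illustrative FEC overhead of about 12% (consistent with the "about 10% higher" figure above):

    # Rough symbol-rate estimate for a 100 Gb/s DP-QPSK wave.
    # payload_rate and fec_overhead are illustrative assumed values.
    payload_rate_gbps = 100.0          # client payload per wave
    fec_overhead = 0.12                # assumed ~10-12% forward error correction overhead
    line_rate_gbps = payload_rate_gbps * (1 + fec_overhead)

    bits_per_symbol = 2 * 2            # 2 polarizations x 2 bits per QPSK symbol = 4

    symbol_rate_gbaud = line_rate_gbps / bits_per_symbol
    print(f"line rate:   {line_rate_gbps:.0f} Gb/s")
    print(f"symbol rate: {symbol_rate_gbaud:.0f} Gbaud")   # ~28 Gbaud, fits in 50 GHz of spectrum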
WaveLogic™ to provide 100 Gb/s waves
– 88 waves (optical channels), 100 Gb/s each
• wave capacity shared equally with Internet2
– ~13,000 miles / 21,000 km of lit fiber
– 280 optical amplifier sites
– 70 optical add/drop sites (where routers can be inserted)
• 46 100G add/drop transponders
• 22 100G re-gens across the wide area
[Map: ESnet5 optical network node locations across the US (wave capacity shared with Internet2), plus the Long Island MAN and the ANI Testbed; geography is only representational.]
19
1b) Network routers and switches
ESnet5 routing (IP layer 3) is provided by Alcatel-Lucent 7750 routers with 100 Gb/s client interfaces
– 17 routers with 100G interfaces
• several more in a test environment
– 59 layer-3 100GigE interfaces; 8 customer-owned 100G routers
– 7 100G interconnects with other R&E networks at Starlight (Chicago), MAN LAN (New York), and Sunnyvale (San Francisco)
20
[Map of the ESnet5 routed network: ESnet routers and site routers at the DOE labs and user facilities, metro area circuits, 100G / 10-40G / 1G and site-provided circuits, commercial, US R&E, and international R&E peerings, and the 100G testbed spanning the SF Bay Area, Chicago, New York, and Amsterdam; geographical representation is approximate.]
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport: the limitations of TCP must be addressed for large, long-distance flows
Although there are other transport protocols available, TCP remains the workhorse of the Internet, including for data-intensive science.
Using TCP to support the sustained, long-distance, high data-rate flows of data-intensive science requires an error-free network.
Why error-free? TCP is a "fragile workhorse": it is very sensitive to packet loss (due to bit errors)
– Very small packet loss rates on these paths result in large decreases in performance
– A single bit error will cause the loss of a 1-9 KB packet (depending on the MTU size), as there is no FEC at the IP level for error correction
• This puts TCP back into "slow start" mode, thus reducing throughput
22
Transport
• The reason for TCP's sensitivity to packet loss is the slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internet
– Packet loss is seen by TCP's congestion control algorithms as evidence of congestion, so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion, leading to network throughput collapse)
– Network link errors also cause packet loss, so these congestion avoidance algorithms come into play, with dramatic effect on throughput in the wide area network – hence the need for "error-free"
23
Transport: Impact of packet loss on TCP
On a 10 Gb/s LAN path the impact of low packet loss rates is minimal.
On a 10 Gb/s WAN path the impact of low packet loss rates is enormous (~80X throughput reduction on a transatlantic path).
Implication: error-free paths are essential for high-volume, long-distance data transfers.
[Plot: Throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss. Curves: no packet loss, H-TCP (measured), Reno (measured), Reno (theory). X-axis: network round trip time in ms (corresponds roughly to San Francisco to London); Y-axis: throughput, 0-10,000 Mb/s. See http://fasterdata.es.net/performance-testing/perfsonar/troubleshooting/packet-loss.]
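A rough way to see why small loss rates are so damaging on long paths is the well-known Mathis et al. model for TCP Reno, throughput ≈ MSS / (RTT × √loss). A minimal sketch, using the loss rate from the plot above (the RTT values chosen are illustrative; the model is order-of-magnitude only):

    from math import sqrt

    def reno_throughput_mbps(mss_bytes, rtt_ms, loss_rate):
        # Mathis et al. approximation: rate ~ MSS / (RTT * sqrt(loss)).
        # Ignores constants, timeouts, and the 10 Gb/s link cap.
        return (mss_bytes * 8) / ((rtt_ms / 1000.0) * sqrt(loss_rate)) / 1e6

    loss = 0.0046 / 100                    # 0.0046% packet loss, as in the plot above
    for rtt_ms in (0.2, 10, 88):           # LAN, regional, and ~San Francisco-to-London paths
        est = reno_throughput_mbps(1500, rtt_ms, loss)
        print(f"RTT {rtt_ms:5.1f} ms: ~{est:8.0f} Mb/s")

The same loss rate that is invisible at LAN round trip times collapses throughput at transcontinental and transatlantic round trip times.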
24
Transport: Modern TCP stack
• A modern TCP stack (the kernel implementation of the TCP protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])
– This is done using mechanisms that more quickly increase back to full speed after an error forces a reset to low bandwidth
[Plot: impact of the "Binary Increase Congestion" (BIC) control algorithm. Note that BIC reaches maximum throughput much faster than older algorithms. (From Linux 2.6.19 the default is CUBIC, a refined version of BIC designed for high bandwidth, long paths.)]
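On a Linux host the congestion control algorithm in use can be checked through the kernel's sysctl interface; a small sketch that simply reads the current settings (Linux-only paths):

    # Read the Linux TCP congestion control settings from /proc.
    def read_sysctl(name):
        with open(f"/proc/sys/net/ipv4/{name}") as f:
            return f.read().strip()

    print("current algorithm:   ", read_sysctl("tcp_congestion_control"))
    print("available algorithms:", read_sysctl("tcp_available_congestion_control"))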
25
Transport: Modern TCP stack
Even modern TCP stacks are only of some help in the face of packet loss on a long-path, high-speed network.
• For a detailed analysis of the impact of packet loss on various TCP implementations, see "An Investigation into Transport Protocols and Data Transport Applications Over High Performance Networks," chapter 8 ("Systematic Tests of New-TCP Behaviour"), by Yee-Ting Li, University College London (PhD thesis): http://www.slac.stanford.edu/~ytl/thesis.pdf
[Plot: Throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss (tail zoom, 0-1,000 Mb/s). Curves: H-TCP (CUBIC refinement, measured), Reno (measured), Reno (theory). X-axis: round trip time in ms (corresponds roughly to San Francisco to London).]
26
3) Monitoring and testing
The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction.
perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.)
• perfSONAR is a community effort to
– define network management data exchange protocols, and
– standardize measurement data formats, gathering, and archiving
perfSONAR is deployed extensively throughout LHC-related networks and international networks, and at the end sites (see [fasterdata], [perfSONAR], and [NetServ])
– There are now more than 1000 perfSONAR boxes installed in N. America and Europe
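perfSONAR itself is the production-quality implementation of this idea. Purely as an illustration of the principle – regularly test every path and flag even tiny loss – a toy sketch using the system ping command might look like the following; the host names and the alarm rule are hypothetical, and real deployments use perfSONAR's far more capable tools.

    import re
    import subprocess

    PATHS = ["perfsonar.site-a.example.org", "perfsonar.site-b.example.org"]  # hypothetical test hosts

    def measure_loss(host, count=100):
        # Run the system ping and parse the reported packet loss percentage.
        out = subprocess.run(["ping", "-c", str(count), "-i", "0.2", host],
                             capture_output=True, text=True).stdout
        match = re.search(r"([\d.]+)% packet loss", out)
        return float(match.group(1)) if match else None

    for host in PATHS:
        loss = measure_loss(host)
        if loss is None:
            print(f"{host}: no result")
        elif loss > 0.0:
            print(f"{host}: ALARM - {loss}% loss")   # any loss matters on long, fast paths
        else:
            print(f"{host}: ok")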
27
perfSONAR
The test and monitor functions can detect soft errors that limit throughput and can be hard to find (hard errors / faults are easily found and corrected).
Soft failure example:
• Observed end-to-end performance degradation due to soft failure of a single optical line card
[Graph: throughput in Gb/s over one month, showing normal performance, a period of degrading performance, and recovery after repair.]
• Why not just rely on "SNMP" interface stats for this sort of error detection?
• not all error conditions show up in SNMP interface statistics
• SNMP error statistics can be very noisy
• some devices lump different error counters into the same bucket, so it can be very challenging to figure out what errors to alarm on and what errors to ignore
• though ESnet's Spectrum monitoring system attempts to apply heuristics to do this
• many routers will silently drop packets – the only way to find that is to test through them and observe loss using devices other than the culprit device
28
perfSONAR
The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains:
perfSONAR provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe
– ESnet has perfSONAR testers installed at every PoP and all but the smallest user sites – Internet2 is close to the same
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages
29
4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network:
• Host TCP tuning
• Modern TCP stack (see above)
• Other issues (MTU, etc.)
• Data transfer tools and parallelism
• Other data transfer issues (firewalls, etc.)
30
4.1) System software tuning: Host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length.
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using end-to-end.
Default TCP buffer sizes are typically much too small for today's high speed networks
– Until recently, default TCP send/receive buffers were typically 64 KB
– Tuned buffer to fill a CA-to-NY 1 Gb/s path: 10 MB
• 150X bigger than the default buffer size
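The required buffer is simply the bandwidth-delay product of the path. A quick sketch reproducing the 10 MB figure above, assuming an ~80 ms coast-to-coast RTT:

    # Bandwidth-delay product: the TCP window needed to keep a path full.
    def bdp_bytes(bandwidth_bps, rtt_ms):
        return bandwidth_bps * (rtt_ms / 1000.0) / 8

    # ~1 Gb/s California-to-New York path, RTT assumed ~80 ms
    print(f"{bdp_bytes(1e9, 80) / 1e6:.0f} MB")   # ~10 MB, vs. a 64 KB default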
31
System software tuning: Host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications
– How to tune is a function of the application and the path to the destination, so potentially a lot of special cases
Auto-tuning TCP connection buffer size within pre-configured limits helps.
Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths.
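Where the auto-tuning limits cannot be raised, an application can request large buffers explicitly, per socket. A minimal sketch; the 64 MB value is an illustrative hand-tuned window for a very long path, and the kernel will cap the request at its configured maxima:

    import socket

    WINDOW_BYTES = 64 * 1024 * 1024   # illustrative hand-tuned window for a very long path

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Request large send/receive buffers before connecting; the kernel may cap these
    # at net.core.wmem_max / net.core.rmem_max, so those sysctls must also be raised.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, WINDOW_BYTES)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, WINDOW_BYTES)

    print("send buffer:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
    print("recv buffer:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))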
32
System software tuning: Host tuning – TCP
[Plot: Throughput out to ~9000 km path length on a 10 Gb/s network, comparing a 32 MB (auto-tuned) vs. a 64 MB (hand-tuned) TCP window size. X-axis: round trip time in ms (corresponds roughly to San Francisco to London); Y-axis: throughput, 0-10,000 Mb/s.]
33
4.2) System software tuning: Data transfer tools
Parallelism is key in data transfer tools
– It is much easier to achieve a given performance level with multiple parallel connections than with one connection (see the sketch after this list)
• this is because the OS is very good at managing multiple threads and less good at sustained, maximum performance of a single thread (the same is true for disks)
– Several tools offer parallel transfers (see below)
Latency tolerance is critical
– Wide area data transfers have much higher latency than LAN transfers
– Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds); for example, the SCP/SFTP and HPSS mover protocols work very poorly in long-path networks
• Disk performance
– In general, a RAID array or parallel disks (as in FDT) are needed to get more than about 500 Mb/s
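As a toy illustration of the parallelism idea (not a substitute for GridFTP or FDT), the sketch below fetches separate byte ranges of one file over independent HTTP connections, one per thread. The URL is hypothetical and the server is assumed to honor Range requests:

    import concurrent.futures
    import urllib.request

    URL = "http://data.example.org/big-dataset.bin"   # hypothetical server supporting Range requests
    STREAMS = 8

    def fetch_range(start, end):
        # Fetch one byte range over its own TCP connection.
        req = urllib.request.Request(URL, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            return start, resp.read()

    def parallel_fetch(total_size):
        chunk = total_size // STREAMS
        ranges = [(i * chunk, (i + 1) * chunk - 1 if i < STREAMS - 1 else total_size - 1)
                  for i in range(STREAMS)]
        with concurrent.futures.ThreadPoolExecutor(max_workers=STREAMS) as pool:
            parts = list(pool.map(lambda r: fetch_range(*r), ranges))
        return b"".join(data for _, data in sorted(parts))

Each stream recovers from a loss event independently, so the aggregate is far less sensitive to a single congestion-window collapse than one connection would be.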
34
System software tuning: Data transfer tools
Using the right tool is very important.
Sample results, Berkeley, CA to Argonne, IL (RTT = 53 ms, network capacity = 10 Gbps):
  Tool                     Throughput
  • scp                    140 Mbps
  • patched scp (HPN)      1.2 Gbps
  • ftp                    1.4 Gbps
  • GridFTP, 4 streams     5.4 Gbps
  • GridFTP, 8 streams     6.6 Gbps
Note that to get more than about 1 Gbps (125 MB/s) disk-to-disk requires using RAID technology.
• PSC (Pittsburgh Supercomputing Center) has a patch set that fixes problems with SSH
– http://www.psc.edu/networking/projects/hpn-ssh
– Significant performance increase
• this helps rsync too
35
System software tuning: Data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems
 Parallel streams and buffer tuning help in getting through firewalls (open ports), ssh, etc.
 The newer Globus Online incorporates all of these and adds small-file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP
36
System software tuning: Data transfer tools
Also see Caltech's FDT (Fast Data Transfer) approach
– Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node
– Explicit parallel use of multiple disks
– Can fill 100 Gb/s paths
– See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT
37
4.4) System software tuning: Other issues
Firewalls are anathema to high-speed data flows
– many firewalls can't handle >1 Gb/s flows
• designed for large numbers of low-bandwidth flows
• some firewalls even strip out TCP options that allow for TCP buffers > 64 KB
 See Jason Zurawski's "Say Hello to your Frienemy – The Firewall"
 Stateful firewalls have inherent problems that inhibit high throughput
• http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues
– Large MTUs (several issues)
– NIC tuning
• Defaults are usually fine for 1GE, but 10GE often requires additional tuning
– Other OS tuning knobs
– See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])
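One of the simplest of these checks – whether jumbo frames (a 9000-byte MTU) are actually configured on the data transfer interface – can be done by reading the interface MTU from sysfs on Linux; the interface name below is an assumption:

    # Check whether an interface is configured for jumbo frames (Linux sysfs path).
    IFACE = "eth0"   # assumed interface name

    with open(f"/sys/class/net/{IFACE}/mtu") as f:
        mtu = int(f.read())

    print(f"{IFACE} MTU = {mtu}",
          "(jumbo frames)" if mtu >= 9000 else "(standard 1500-byte frames)")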
5) Site infrastructure to support data-intensive science: The Science DMZ
With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck.
The site network (LAN) typically provides connectivity for the local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science
– Therefore a high performance interface between the wide area network and the local area site network is critical for large-scale data movement
Campus network infrastructure is typically not designed to handle the flows of large-scale science
– The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows
• firewalls, proxy servers, low-cost switches, and so forth,
• none of which will allow high volume, high bandwidth, long distance data flows
39
The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high bandwidth, large data volume, and high round trip time (RTT) (international paths) of the wide area network (WAN) flows (see [DIS])
– otherwise the site will impose poor performance on the entire high speed data path, all the way back to the source
40
The Science DMZ
The Science DMZ concept:
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy:
 Outside the site firewall – hence the term "Science DMZ"
 With dedicated systems built and tuned for wide-area data transfer
 With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below)
 A security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. hardware that supports access control lists, private address space, etc.)
This is so important that it was a requirement for the last round of NSF CC-NIE grants.
41
The Science DMZ
(See http://fasterdata.es.net/science-dmz and [SDMZ] for a much more complete discussion of the various approaches.)
[Diagram: a typical Science DMZ. The border router connects the WAN to a Science DMZ router/switch (a WAN-capable device) outside the site firewall. The Science DMZ hosts a high-performance Data Transfer Node (dedicated systems built and tuned for wide-area data transfer), network monitoring and testing, and per-service security policy control points, providing a clean, high-bandwidth WAN data path to the computing cluster. The campus/site LAN and the conventional site DMZ (web, DNS, mail) reach the Internet, and campus/site access to Science DMZ resources is via the site firewall.]
42
6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
 In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers
• The Tier 2 sites get a comparable amount of data from the Tier 1s
– Host the physics groups that analyze the data and do the science
– Provide most of the compute resources for analysis
– Cache the data (though this is evolving to remote I/O)
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management
– The resources and data movement are centrally managed
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations
– The system manages tens of thousands of jobs a day
• coordinates data movement of hundreds of terabytes/day, and
• manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial.
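The core placement rule – run the job where the data already is, otherwise move the data and the job to where resources are free – can be caricatured in a few lines. This is a toy sketch only; PanDA itself is of course far more elaborate, and the site names and fields below are illustrative:

    # Toy broker: prefer sites that already hold the dataset and have free slots;
    # otherwise pick the least-loaded site and schedule a data transfer to it first.
    def place_job(dataset, sites):
        with_data = [s for s in sites if dataset in s["datasets"] and s["free_slots"] > 0]
        if with_data:
            return max(with_data, key=lambda s: s["free_slots"]), False   # no transfer needed
        target = max(sites, key=lambda s: s["free_slots"])
        return target, True                                               # transfer dataset first

    sites = [
        {"name": "Tier2-A", "free_slots": 120, "datasets": {"run2012B"}},
        {"name": "Tier2-B", "free_slots": 300, "datasets": set()},
    ]
    site, needs_transfer = place_job("run2012B", sites)
    print(site["name"], "- transfer needed:", needs_transfer)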
44
[Diagram: The ATLAS PanDA "Production and Distributed Analysis" system, which uses distributed resources and layers of automation to manage several million jobs/day.
– ATLAS production jobs, regional production jobs, and user/group analysis jobs enter the PanDA Server (task management) through a task buffer (job queue); a job broker applies policy (job type and priority) and a job dispatcher sends the work out.
– The CERN ATLAS detector feeds the Tier 0 Data Center (1 copy of all data – archival only). The ATLAS Tier 1 Data Centers – 11 sites scattered across Europe, North America, and Asia – in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis. ATLAS analysis sites (e.g. 70 Tier 2 centers in Europe, North America, and SE Asia) run the jobs.
– Job resource manager: a "pilot" job manager (a PanDA job receiver) is dispatched when resources are available at a site. Pilots run under the local site job manager (e.g. Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA – similar to the Condor Glide-in approach. A site capability service and site status information feed back into scheduling.
– Workflow: 1) PanDA schedules jobs and initiates data movement; 2) the Distributed Data Manager (DDM – a complex system in its own right, called DQ2) locates data and moves it to sites via DDM agents; 3) pilots prepare the local resources to receive PanDA jobs; 4) jobs are dispatched when there are resources available and when the required data is in place at the site.
– Guiding principle: try to move the job to where the data is, else move data and job to where resources are available.
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBytes/day, ~68 Gb/s.
PanDA manages 120,000-140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately in the accompanying plots).
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
[Plots: accumulated data volume on disk, growing to roughly 150 petabytes over four years; simultaneous PanDA job counts for the two job types, roughly 50,000-140,000 jobs over one year.]
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges"
– Successful testing was required for sites to participate in LHC production
47
Ramp-up of LHC traffic in ESnet
[Graph: LHC-related traffic in ESnet over time, showing an estimate of the "small"-scale traffic before LHC turn-on, the LHC data system testing period, and LHC operation.]
The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.
48
6 cont) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN
– The LHCOPN is a collection of leased 10 Gb/s optical circuits
 The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community
• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance
 The security issues were the primary ones, and were addressed by
• using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
– that is, only LHC data and compute servers are connected to the OPN
50
The LHC OPN – Optical Private Network
[Diagram: abbreviated LHCOPN physical topology and architecture – CH-CERN at the center, connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network
NB:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits
 Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below)
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
– (there are about 170 Tier 2 sites)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineers
– to ensure continued good performance for the LHC and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC
54
[Map (April 2012): LHCONE – a global infrastructure for LHC Tier 1 data center to Tier 2 analysis center connectivity. LHCONE VRF domains (ESnet and Internet2 in the USA, CANARIE in Canada, NORDUnet, DFN, GARR, RedIRIS, SARA, RENATER, and GÉANT in Europe, ASGC and TWAREN in Taiwan, KERONET2 and KISTI in Korea, CUDI in Mexico, plus sites in India) interconnect the Tier 1 centers (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1) and end sites (LHC Tier 2 or Tier 3 unless indicated as Tier 1) over data communication links of 10, 20, and 30 Gb/s, meeting at regional R&E communication nexus points such as Seattle, Chicago, New York, Washington, Geneva, and Amsterdam. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance
• See lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
Distances from CERN to the Tier 1 centers:
  CERN → T1         miles    km
  France              350    565
  Italy               570    920
  UK                  625   1000
  Netherlands         625   1000
  Germany             700   1185
  Spain               850   1400
  Nordic             1300   2100
  USA – New York     3900   6300
  USA – Chicago      4400   7100
  Canada – BC        5200   8400
  Taiwan             6100   9850
[Diagram: A Network Centric View of the LHC. The detector produces ~1 PB/s, reduced by the Level 1 and 2 triggers (O(1-10) meters away) and the Level 3 trigger (O(10-100) meters), then carried O(1) km to the CERN Computer Center. From there, 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) flows over the LHC Optical Private Network (LHCOPN) to the LHC Tier 1 data centers 500-10,000 km away (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN). The LHC Open Network Environment (LHCONE) connects the Tier 1 centers to the LHC Tier 2 analysis centers and university physics groups – this is intended to indicate that the physics groups now get their data wherever it is most readily available.]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to
– couple existing pockets of code, data, and expertise into "systems of systems"
– break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
– see https://www.es.net/about/science-requirements/
A commonly identified need to support this is that networking must be provided as a "service":
 schedulable, with guaranteed bandwidth – as is done with CPUs and disks
– traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
– some network path characteristics may also be specified – e.g. diversity
– available in the Web Services / Grid Services paradigm
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
– This is typically done by using a "static" routing mechanism
• e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
– MPLS and OpenFlow are examples of this, and both can transport IP packets
– Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage and optimize the use of available network resources and to keep big data flows separate from general traffic
– The virtual circuits can be directed to specific physical network paths when they are set up
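Conceptually, a reservation request to such a service carries the circuit endpoints, the bandwidth guarantee, and the schedule. A hypothetical sketch of what a client might submit – illustrative only, not the actual OSCARS or NSI message format, and the endpoint names are invented:

    import json
    from datetime import datetime, timedelta, timezone

    start = datetime.now(timezone.utc) + timedelta(hours=1)

    # Hypothetical reservation payload for a guaranteed-bandwidth virtual circuit.
    reservation = {
        "src_endpoint": "site-a-router:port-5/2",     # illustrative endpoint identifiers
        "dst_endpoint": "site-b-router:port-3/1",
        "bandwidth_mbps": 10000,                      # guaranteed 10 Gb/s
        "start_time": start.isoformat(),
        "end_time": (start + timedelta(hours=8)).isoformat(),
        "path_constraints": {"diverse_from": None},   # optional diversity request
    }
    print(json.dumps(reservation, indent=2))

The essential point is that bandwidth becomes a schedulable resource, like CPU time or disk space, rather than a best-effort side effect.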
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references
• OSCARS received a 2013 "R&D 100" award
60
End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part
• How are the circuits used?
– End system to end system, IP:
• almost never – very hard unless private address space is used
– using public address space can result in leaking routes
– using private address space with multi-homed hosts risks allowing backdoors into secure networks
– End system to end system, Ethernet (or other) over VLAN – a pseudowire:
• relatively common
• interesting example: RDMA over VLAN, likely to be popular in the future
– the SC11 demo of 40G RDMA over WAN was very successful
– CPU load for RDMA is a small fraction of that of IP
– the guaranteed network behavior of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks)
– Point-to-point connection between routing instances – e.g. BGP at the end points:
• essentially this is how all current circuits are used, from one site router to another site router
– typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering
Network domains (administrative units):
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration, to provide end-to-end circuits
– e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains
[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] across ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local Inter-Domain Controller (IDC); OSCARS and AutoBAHN are examples of such controllers.
1) The domains exchange topology information containing at least the potential VC ingress and egress points.
2) A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3) The data plane connection (e.g. an Ethernet VLAN-to-VLAN connection) is facilitated by a helper process at each domain ingress/egress point.]
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system
 Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net
65
8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
– With each generation of network transport technology:
• 155 Mb/s was the norm for high speed networks in 1995
• 100 Gb/s – 650 times greater – is the norm today
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
• and then do the development necessary for applications to make use of the new capabilities
– Examples of how this methodology drove toward today's capabilities include
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
66
Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
69
Infrastructure Critical to Science
• The combination of
– new network architectures in the wide area,
– new network services (such as guaranteed bandwidth virtual circuits),
– cross-domain network error detection and correction,
– redesigning the site LAN to handle high data throughput,
– automation of data movement systems, and
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated at / sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.
71
LHC lessons of possible use to the SKA
The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
– A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository:
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
• high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
weigh against a single large data center.
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply:
 There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
• It might be that in the case of the SKA the T1 links would come to a centralized, data-distribution-only node – say, in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
 If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
 New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
 Re-engineering the site LAN/WAN architecture is critical: the Science DMZ
 Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
75
The Message
Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see the "LHCOPN security policy document."
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
WaveLogictrade to provide 100Gbs wavendash 88 waves (optical channels) 100Gbs each
bull wave capacity shared equally with Internet2ndash ~13000 miles 21000 km lit fiberndash 280 optical amplifier sitesndash 70 optical adddrop sites (where routers can be inserted)
bull 46 100G adddrop transpondersbull 22 100G re-gens across wide-area
NEWG
SUNN
KANSDENV
SALT
BOIS
SEAT
SACR
WSAC
LOSA
LASV
ELPA
ALBU
ATLA
WASH
NEWY
BOST
SNLL
PHOE
PAIX
NERSC
LBNLJGI
SLAC
NASHCHAT
CLEV
EQCH
STA
R
ANLCHIC
BNL
ORNL
CINC
SC11
STLO
Internet2
LOUI
FNA
L
Long IslandMAN and
ANI Testbed
O
JACKGeography is
only representational
19
1b) Network routers and switchesESnet5 routing (IP layer 3) is provided by Alcatel-Lucent
7750 routers with 100 Gbs client interfacesndash 17 routers with 100G interfaces
bull several more in a test environment ndash 59 layer-3 100GigE interfaces 8 customer-owned 100G routersndash 7 100G interconnects with other RampE networks at Starlight (Chicago)
MAN LAN (New York) and Sunnyvale (San Francisco)
20
Metro area circuits
SNLL
PNNL
MIT
PSFC
AMES
LLNL
GA
JGI
LBNL
SLACNER
SC
ORNL
ANLFNAL
SALT
INL
PU Physics
SUNN
SEAT
STAR
CHIC
WASH
ATLA
HO
US
BOST
KANS
DENV
ALBQ
LASV
BOIS
SAC
R
ELP
A
SDSC
10
Geographical representation is
approximate
PPPL
CH
AT
10
SUNN STAR AOFA100G testbed
SF Bay Area Chicago New York AmsterdamAMST
US RampE peerings
NREL
Commercial peerings
ESnet routers
Site routers
100G
10-40G
1G Site provided circuits
LIGO
Optical only
SREL
100thinsp
Intrsquol RampE peerings
100thinsp
JLAB
10
10100thinsp
10
100thinsp100thinsp
1
10100thinsp
100thinsp1
100thinsp100thinsp
100thinsp
100thinsp
BNL
NEWY
AOFA
NASH
1
LANL
SNLA
10
10
1
10
10
100thinsp
100thinsp
100thinsp10
1010
100thinsp
100thinsp
10
10
100thinsp
100thinsp
100thinsp
100thinsp
100thinsp
100thinsp100thinsp
100thinsp
10
100thinsp
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for large long-distance flows
Although there are other transport protocols available TCP remains the workhorse of the Internet including for data-
intensive scienceUsing TCP to support the sustained long distance high data-
rate flows of data-intensive science requires an error-free network
Why error-freeTCP is a ldquofragile workhorserdquo It is very sensitive to packet loss (due to bit errors)ndash Very small packet loss rates on these paths result in large decreases
in performance)ndash A single bit error will cause the loss of a 1-9 KBy packet (depending
on the MTU size) as there is no FEC at the IP level for error correctionbull This puts TCP back into ldquoslow startrdquo mode thus reducing throughput
22
Transportbull The reason for TCPrsquos sensitivity to packet loss is that the
slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internetndash Packet loss is seen by TCPrsquos congestion control algorithms as
evidence of congestion so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion leading to network throughput collapse)
ndash Network link errors also cause packet loss so these congestion avoidance algorithms come into play with dramatic effect on throughput in the wide area network ndash hence the need for ldquoerror-freerdquo
23
Transport Impact of packet loss on TCPOn a 10 Gbs LAN path the impact of low packet loss rates is
minimalOn a 10Gbs WAN path the impact of low packet loss rates is
enormous (~80X throughput reduction on transatlantic path)
Implications Error-free paths are essential for high-volume long-distance data transfers
Throughput vs increasing latency on a 10Gbs link with 00046 packet loss
Reno (measured)
Reno (theory)
H-TCP(measured)
No packet loss
(see httpfasterdataesnetperformance-testingperfso
nartroubleshootingpacket-loss)
Network round trip time ms (corresponds roughly to San Francisco to London)
10000
9000
8000
7000
6000
5000
4000
3000
2000
1000
0
Thro
ughp
ut M
bs
24
Transport Modern TCP stackbull A modern TCP stack (the kernel implementation of the TCP
protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])ndash This is done using mechanisms that more quickly increase back to full
speed after an error forces a reset to low bandwidth
ldquoBinary Increase Congestionrdquo control algorithm impact
Note that BIC reaches max throughput much faster than older algorithms (from Linux 2619 the
default is CUBIC a refined version of BIC designed for high bandwidth
long paths)
25
Transport Modern TCP stackEven modern TCP stacks are only of some help in the face of
packet loss on a long path high-speed network
bull For a detailed analysis of the impact of packet loss on various TCP implementations see ldquoAn Investigation into Transport Protocols and Data Transport Applications Over High Performance Networksrdquo chapter 8 (ldquoSystematic Tests of New-TCP Behaviourrdquo) by Yee-Ting Li University College London (PhD thesis) httpwwwslacstanfordedu~ytlthesispdf
Reno (measured)
Reno (theory)
H-TCP (CUBIC refinement)(measured)
Throughput vs increasing latency on a 10Gbs link with 00046 packet loss(tail zoom)
Roundtrip time ms (corresponds roughly to San Francisco to London)
1000
900800700600500400300200100
0
Thro
ughp
ut M
bs
26
3) Monitoring and testingThe only way to keep multi-domain international scale networks error-free is to test and monitor continuously
end-to-end to detect soft errors and facilitate their isolation and correction
perfSONAR provides a standardize way to test measure export catalogue and access performance data from many different network domains (service providers campuses etc)
bull perfSONAR is a community effort tondash define network management data exchange protocols andndash standardized measurement data formats gathering and archiving
perfSONAR is deployed extensively throughout LHC related networks and international networks and at the end sites(See [fasterdata] [perfSONAR] and [NetSrv])
ndash There are now more than 1000 perfSONAR boxes installed in N America and Europe
27
perfSONARThe test and monitor functions can detect soft errors that limit
throughput and can be hard to find (hard errors faults are easily found and corrected)
Soft failure examplebull Observed end-to-end performance degradation due to soft failure of single optical line card
Gb
s
normal performance
degrading performance
repair
bull Why not just rely on ldquoSNMPrdquo interface stats for this sort of error detectionbull not all error conditions show up in SNMP interface statisticsbull SNMP error statistics can be very noisybull some devices lump different error counters into the same bucket so it can be very
challenging to figure out what errors to alarm on and what errors to ignorebull though ESnetrsquos Spectrum monitoring system attempts to apply heuristics to do this
bull many routers will silently drop packets - the only way to find that is to test through them and observe loss using devices other than the culprit device
one month
28
perfSONARThe value of perfSONAR increases dramatically as it is
deployed at more sites so that more of the end-to-end (app-to-app) path can characterized across multiple network domains provides the only widely deployed tool that can monitor circuits end-
to-end across the different networks from the US to Europendash ESnet has perfSONAR testers installed at every PoP and all but the
smallest user sites ndash Internet2 is close to the same
bull perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) and the protocol is implemented using SOAP XML messages
29
4) System software evolution and optimizationOnce the network is error-free there is still the issue of
efficiently moving data from the application running on a user system onto the network
bull Host TCP tuningbull Modern TCP stack (see above)bull Other issues (MTU etc)
bull Data transfer tools and parallelism
bull Other data transfer issues (firewalls etc)
30
41) System software tuning Host tuning ndash TCPbull ldquoTCP tuningrdquo commonly refers to the proper configuration of
TCP windowing buffers for the path lengthbull It is critical to use the optimal TCP send and receive socket
buffer sizes for the path (RTT) you are using end-to-endDefault TCP buffer sizes are typically much too small for
todayrsquos high speed networksndash Until recently default TCP sendreceive buffers were typically 64 KBndash Tuned buffer to fill CA to NY 1 Gbs path 10 MB
bull 150X bigger than the default buffer size
31
System software tuning Host tuning ndash TCPbull Historically TCP window size tuning parameters were host-
global with exceptions configured per-socket by applicationsndash How to tune is a function of the application and the path to the
destination so potentially a lot of special cases
Auto-tuning TCP connection buffer size within pre-configured limits helps
Auto-tuning however is not a panacea because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (eg international) paths
32
System software tuning Host tuning ndash TCP
Throughput out to ~9000 km on a 10Gbs network32 MBy (auto-tuned) vs 64 MBy (hand tuned) TCP window size
hand tuned to 64 MBy window
Roundtrip time ms (corresponds roughlyto San Francisco to London)
path length
10000900080007000600050004000300020001000
0
Thro
ughp
ut M
bs
auto tuned to 32 MBy window
33
4.2) System software tuning: Data transfer tools
Parallelism is key in data transfer tools:
 – It is much easier to achieve a given performance level with multiple parallel connections than with one connection
   • this is because the OS is very good at managing multiple threads and less good at sustained, maximum performance of a single thread (the same is true for disks)
 – Several tools offer parallel transfers (see below); a rough model of why parallel streams help is sketched after this list
Latency tolerance is critical:
 – Wide area data transfers have much higher latency than LAN transfers
 – Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds)
   • examples: SCP/SFTP and the HPSS mover protocols work very poorly in long-path networks
• Disk performance:
 – In general, a RAID array or parallel disks (as in FDT) are needed to get more than about 500 Mb/s
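As a rough model of why parallel connections help on lossy, long-RTT paths, the well-known Mathis et al. approximation for loss-limited TCP throughput can be applied per stream. This is a back-of-the-envelope sketch, not from the talk; the loss rate matches the 0.0046% figure used elsewhere in the deck, while the other parameters are illustrative.

```python
import math

def tcp_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Mathis et al. approximation: throughput <= MSS / (RTT * sqrt(p))."""
    return (mss_bytes * 8) / (rtt_s * math.sqrt(loss_rate))

# Illustrative long path: 1460-byte segments, 88 ms RTT, 0.0046% packet loss.
single = tcp_throughput_bps(1460, 0.088, 0.000046)
print(f" 1 stream : {single / 1e6:7.0f} Mb/s")
for n in (4, 8, 16):
    # n parallel streams each see (roughly) the same loss rate, so the
    # aggregate scales about linearly until the path or the disks saturate.
    print(f"{n:2d} streams: {n * single / 1e6:7.0f} Mb/s")
```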
34
System software tuning: Data transfer tools
Using the right tool is very important.
Sample results: Berkeley, CA to Argonne, IL; RTT = 53 ms, network capacity = 10 Gbps.
 Tool – Throughput:
 • scp: 140 Mbps
 • patched scp (HPN): 1.2 Gbps
 • ftp: 1.4 Gbps
 • GridFTP, 4 streams: 5.4 Gbps
 • GridFTP, 8 streams: 6.6 Gbps
Note that to get more than about 1 Gbps (125 MB/s) disk-to-disk requires using RAID technology.
• PSC (Pittsburgh Supercomputing Center) has a patch set that fixes problems with SSH:
 – http://www.psc.edu/networking/projects/hpn-ssh
 – Significant performance increase
   • this helps rsync too
35
System software tuning: Data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems.
 Parallel streams and buffer tuning help in getting through firewalls (open ports), ssh, etc.
 The newer Globus Online incorporates all of these plus small-file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP.
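Since Globus is named above as the basis of most modern data movement systems, a minimal sketch of submitting a third-party transfer through the Globus Python SDK may be useful. The endpoint UUIDs, paths, and the pre-obtained access token are placeholders, and this is an illustration under those assumptions rather than a recipe from the talk.

```python
# Minimal sketch of a third-party transfer via the Globus Python SDK
# (globus_sdk). Endpoint UUIDs, paths, and TRANSFER_TOKEN are placeholders;
# real use requires a Globus account and an OAuth2 flow to obtain the token.
import globus_sdk

TRANSFER_TOKEN = "..."           # obtained out-of-band (placeholder)
SRC_ENDPOINT = "src-endpoint-uuid"
DST_ENDPOINT = "dst-endpoint-uuid"

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

# Describe the transfer: the service handles parallel streams, retries after
# faults, and checksums, which is what makes it useful outside HEP.
tdata = globus_sdk.TransferData(tc, SRC_ENDPOINT, DST_ENDPOINT,
                                label="example bulk transfer",
                                sync_level="checksum")
tdata.add_item("/data/run42/", "/archive/run42/", recursive=True)

task = tc.submit_transfer(tdata)
print("submitted task:", task["task_id"])
```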
36
System software tuning: Data transfer tools
Also see Caltech's FDT (Fast Data Transfer) approach:
 – Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node
 – Explicit parallel use of multiple disks
 – Can fill 100 Gb/s paths
 – See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT
37
4.4) System software tuning: Other issues
Firewalls are anathema to high-speed data flows:
 – many firewalls can't handle >1 Gb/s flows
   • designed for large numbers of low-bandwidth flows
   • some firewalls even strip out TCP options that allow for TCP buffers > 64 KB
 See Jason Zurawski's "Say Hello to your Frienemy – The Firewall"
 Stateful firewalls have inherent problems that inhibit high throughput
   • http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues:
 – Large MTUs (several issues)
 – NIC tuning
   • Defaults are usually fine for 1GE, but 10GE often requires additional tuning
 – Other OS tuning knobs
 – See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])
5) Site infrastructure to support data-intensive science: The Science DMZ
With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck.
The site network (LAN) typically provides connectivity for local resources – compute, data, instruments, collaboration systems, etc. – needed by data-intensive science.
 – Therefore a high-performance interface between the wide area network and the local area site network is critical for large-scale data movement.
Campus network infrastructure is typically not designed to handle the flows of large-scale science:
 – The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows:
   • firewalls, proxy servers, low-cost switches, and so forth,
   • none of which will allow high-volume, high-bandwidth, long-distance data flows.
39
The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high bandwidth, large data volume, and high round-trip time (RTT) (international paths) of the wide area network (WAN) flows (see [DIS]).
 – Otherwise the site will impose poor performance on the entire high-speed data path, all the way back to the source.
40
The Science DMZ
The Science DMZ concept:
 The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet-forwarding path, uses WAN-like technology, and has a tailored security policy:
  – Outside the site firewall – hence the term "Science DMZ"
  – With dedicated systems built and tuned for wide-area data transfer
  – With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below)
  – A security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.)
 This is so important that it was a requirement for the last round of NSF CC-NIE grants.
41
The Science DMZ
(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)
[Diagram: the WAN enters at the border router, which feeds a Science DMZ router/switch (a WAN-capable device) over a clean, high-bandwidth WAN data path. The Science DMZ contains a high-performance Data Transfer Node (dedicated systems built and tuned for wide-area data transfer), network monitoring and testing, and per-service security policy control points. The campus/site LAN with its computing cluster, and the site DMZ (Web, DNS, Mail), sit behind the site firewall; campus/site access to Science DMZ resources is via the site firewall, as is secured campus/site access to the Internet.]
42
6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
 In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers.
• The Tier 2 sites get a comparable amount of data from the Tier 1s:
 – Host the physics groups that analyze the data and do the science
 – Provide most of the compute resources for analysis
 – Cache the data (though this is evolving to remote I/O)
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management:
 – The resources and data movement are centrally managed
 – Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations
 – The system manages tens of thousands of jobs a day
   • coordinates data movement of hundreds of terabytes/day, and
   • manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial.
44
[Diagram: The ATLAS PanDA ("Production and Distributed Analysis") system uses distributed resources and layers of automation to manage several million jobs/day.
 – ATLAS production jobs, regional production jobs, and user/group analysis jobs enter a task buffer (job queue) on the PanDA server (task management), which includes a job dispatcher, a job broker, and policy (job type priority).
 – The ATLAS Tier 1 data centers – 11 sites scattered across Europe, North America, and Asia – in aggregate hold one copy of all data and provide the working data set for distribution to Tier 2 centers for analysis. The CERN ATLAS detector feeds the Tier 0 data center (one copy of all data – archival only).
 – Job resource manager: dispatch a "pilot" job manager (a PanDA job receiver) when resources are available at a site. Pilots run under the local site job manager (e.g. Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA – similar to the Condor Glide-in approach. A site capability service reports site status.
 – Distributed Data Manager (DDM) agents locate data and move it to sites; this is a complex system in its own right, called DQ2.
 – Workflow: 1) schedule jobs and initiate data movement; 2) DDM locates data and moves it to sites; 3) the local resources are prepared to receive PanDA jobs; 4) jobs are dispatched when there are resources available and when the required data is in place at the site.
 – The policy is: try to move the job to where the data is, else move data and job to where resources are available.
 – ATLAS analysis sites are e.g. the ~70 Tier 2 centers in Europe, North America, and SE Asia.
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]
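To make the "move the job to the data, else move the data and job to available resources" policy in the diagram concrete, here is a toy placement function. It is a conceptual illustration only, not PanDA or DQ2 code, and the site names, dataset names, and slot counts are invented.

```python
# Toy "data-aware" job placement in the spirit of the policy above:
# prefer a site that already holds the dataset and has free slots; otherwise
# fall back to any site with free slots and schedule a data movement first.
# Purely illustrative -- not PanDA or DQ2 code.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    free_slots: int
    datasets: set

def place_job(dataset: str, sites: list) -> tuple:
    with_data = [s for s in sites if dataset in s.datasets and s.free_slots > 0]
    if with_data:
        # Move the job to where the data is.
        return max(with_data, key=lambda s: s.free_slots), []
    candidates = [s for s in sites if s.free_slots > 0]
    if not candidates:
        return None, []          # queue the job until resources free up
    target = max(candidates, key=lambda s: s.free_slots)
    # Move data and job to where resources are available.
    return target, [f"transfer {dataset} -> {target.name}"]

sites = [Site("BNL", 0, {"data2012_A"}), Site("DESY", 120, {"data2012_B"})]
site, transfers = place_job("data2012_A", sites)
print(site.name if site else "queued", transfers)
```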
45
Scale of ATLAS analysis-driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s.
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately here).
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
[Figures: accumulated data volume on disk, growing at 730 TBytes/day (axis range 0–150 petabytes over four years); and the counts of the two PanDA job types over one-year windows (axis ranges 0–100,000 and 0–50,000 simultaneous jobs, respectively).]
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure.
 – Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges".
 – Successful testing was required for sites to participate in LHC production.
47
Ramp-up of LHC traffic in ESnet
[Figure: ESnet traffic over time, annotated with the LHC data system testing period (an estimate of "small"-scale traffic), LHC turn-on, and LHC operation. The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.]
48
6 cont) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN.
 – The LHCOPN is a collection of leased 10 Gb/s optical circuits.
 The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.
• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance.
 The security issues were the primary ones, and were addressed by:
  • Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
   – that is, only LHC data and compute servers are connected to the OPN.
50
The LHC OPN – Optical Private Network
[Diagram: LHCOPN physical topology (abbreviated) and LHCOPN architecture. CH-CERN connects to the Tier 1 centers: UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network
NB:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
 – The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
 – However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
 – In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
 – (there are about 170 Tier 2 sites)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems.
 – The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
 – The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).
In this way the LHC traffic will use circuits designated by the network engineers.
 – To ensure continued good performance for the LHC and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.
54
[Map (April 2012): LHCONE – a global infrastructure for LHC Tier 1 data center to Tier 2 analysis center connectivity. LHCONE VRF domains (ESnet and Internet2 in the USA, CANARIE in Canada, GÉANT in Europe, NORDUnet, DFN, GARR, RedIRIS, SARA, RENATER, ASGC and TWAREN in Taiwan, KERONET2 and KISTI in Korea, CUDI in Mexico, TIFR in India, etc.) interconnect at nexuses such as Seattle, Chicago, New York, Washington, Amsterdam, and Geneva. End sites are LHC Tier 2 or Tier 3 unless indicated as Tier 1; they include BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1, and many university Tier 2/3 sites. Data communication links are 10, 20, and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because:
 – The VRF technology is a standard capability in most core routers, and
 – there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
[Diagram: a network-centric view of the LHC. The detector feeds the Level 1 and 2 triggers (O(1–10) meters away), then the Level 3 trigger (O(10–100) meters), then the CERN computer center (O(1) km, ~1 PB/s from the detector). The LHC Optical Private Network (LHCOPN) carries 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) from CERN over 500–10,000 km to the LHC Tier 1 data centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, and CERN itself). The LHC Open Network Environment (LHCONE) connects the Tier 1 centers to the LHC Tier 2 analysis centers (universities and physics groups); this is intended to indicate that the physics groups now get their data wherever it is most readily available.
CERN → T1 distances:
 France: 350 miles / 565 km
 Italy: 570 miles / 920 km
 UK: 625 miles / 1000 km
 Netherlands: 625 miles / 1000 km
 Germany: 700 miles / 1185 km
 Spain: 850 miles / 1400 km
 Nordic: 1300 miles / 2100 km
 USA – New York: 3900 miles / 6300 km
 USA – Chicago: 4400 miles / 7100 km
 Canada – BC: 5200 miles / 8400 km
 Taiwan: 6100 miles / 9850 km]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed application systems in order to:
 – Couple existing pockets of code, data, and expertise into "systems of systems"
 – Break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
 – See https://www.es.net/about/science-requirements/
A commonly identified need to support this is that networking must be provided as a "service":
 – Schedulable with guaranteed bandwidth – as is done with CPUs and disks
 – Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
 – Some network path characteristics may also be specified – e.g. diversity
 – Available in a Web Services / Grid Services paradigm
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
 – This is typically done by using a "static" routing mechanism
   • e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
 – MPLS and OpenFlow are examples of this, and both can transport IP packets
 – Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic.
 – The virtual circuits can be directed to specific physical network paths when they are set up.
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead: Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.
60
End User View of Circuits – How They Use Them
• Who are the "users"?
 – Sites, for the most part
• How are the circuits used?
 – End system to end system, IP:
   • Almost never – very hard unless private address space is used
    – Using public address space can result in leaking routes
    – Using private address space with multi-homed hosts risks allowing backdoors into secure networks
 – End system to end system, Ethernet (or other) over VLAN – a pseudowire:
   • Relatively common
   • Interesting example: RDMA over VLAN is likely to be popular in the future
    – the SC11 demo of 40G RDMA over WAN was very successful
    – CPU load for RDMA is a small fraction of that for IP
    – the guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks)
 – Point-to-point connection between routing instances – e.g. BGP at the end points:
   • Essentially this is how all current circuits are used, from one site router to another site router
    – Typically site-to-site, or to advertise subnets that host clusters, e.g. LHC analysis or data management clusters
61
End User View of Circuits – How They Use Them
• When are the circuits used?
 – Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering

Cross-Domain Virtual Circuit Service
Large-scale science collaborations span many network domains (administrative units).
 – For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
 – e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains
63
Inter-Domain Control Protocol
• There are two realms involved:
 1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
 2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains
[Diagram: the end-to-end virtual circuit runs from a user source at FNAL (AS3152) [US] through ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GÉANT – with a data-plane connection helper at each domain ingress/egress point.
 1. The domains exchange topology information containing at least potential VC ingress and egress points.
 2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
 3. The data-plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process.]
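The domain-by-domain hand-off described above can be pictured as each domain's controller reserving its own segment and then forwarding the request to the next domain along the path. The sketch below is a conceptual illustration only (the domain names come from the figure, all logic is invented), not the actual OSCARS/NSI protocol.

```python
# Conceptual walk of a VC setup request across domains (not OSCARS/NSI code):
# each domain controller reserves its own segment, then hands the request to
# the next domain along the path until the destination domain is reached.
from dataclasses import dataclass, field

@dataclass
class VCRequest:
    src: str
    dst: str
    bandwidth_gbps: float
    reserved_segments: list = field(default_factory=list)

def setup_circuit(request: VCRequest, path: list) -> VCRequest:
    for domain in path:   # each domain's local IDC (e.g. OSCARS, AutoBAHN)
        # In reality each controller checks capacity and pre-agreements here.
        request.reserved_segments.append(
            f"{domain}: {request.bandwidth_gbps} Gb/s segment reserved")
    return request

req = setup_circuit(VCRequest("FNAL", "DESY", 5.0), ["ESnet", "GEANT", "DFN"])
print(req.reserved_segments)
```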
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
 – Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system.
 Multi-domain circuit setup is not yet a robust production service, but progress is being made.
• See lhcone.net
65
8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science.
 – With each generation of network transport technology:
   • 155 Mb/s was the norm for high-speed networks in 1995
   • 100 Gb/s – 650 times greater – is the norm today
   • R&D groups involving hardware engineers, computer scientists, and application specialists worked to
    – first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
    – and then do the development necessary for applications to make use of the new capabilities.
 – Examples of how this methodology drove toward today's capabilities include:
   • experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
   • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
66
Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
 Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
 The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base: http://fasterdata.es.net topics
 – Network architecture, including the Science DMZ model
 – Host tuning
 – Network tuning
 – Data transfer tools
 – Network performance testing
 – With special sections on:
   • Linux TCP tuning
   • Cisco 6509 tuning
   • perfSONAR howto
   • Active perfSONAR services
   • Globus overview
   • Say No to SCP
   • Data Transfer Nodes (DTN)
   • TCP issues explained
• fasterdata.es.net is a community project with contributions from several organizations.
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
69
Infrastructure Critical to Science
• The combination of:
 – New network architectures in the wide area
 – New network services (such as guaranteed bandwidth virtual circuits)
 – Cross-domain network error detection and correction
 – Redesigning the site LAN to handle high data throughput
 – Automation of data movement systems
 – Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated/sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.
71
LHC lessons of possible use to the SKA
The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
 – A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
 – The technical aspects of building and operating a centralized working data repository –
   • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time,
   • high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites –
  militate against a single large data center.
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
 – It decentralizes costs and involves many countries directly in the telescope infrastructure
 – It divides up the network load, especially on the expensive trans-ocean links
 – It divides up the cache I/O load across distributed sites
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply:
 There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
 • It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
 • In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
 If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
 – In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high data volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
 New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.
Workflow management systems that automate the data movement will have to be designed and tested.
 – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
75
The Message
Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document".
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1–5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
 http://www.perfsonar.net
 http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C., Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
bull Why not just rely on ldquoSNMPrdquo interface stats for this sort of error detectionbull not all error conditions show up in SNMP interface statisticsbull SNMP error statistics can be very noisybull some devices lump different error counters into the same bucket so it can be very
challenging to figure out what errors to alarm on and what errors to ignorebull though ESnetrsquos Spectrum monitoring system attempts to apply heuristics to do this
bull many routers will silently drop packets - the only way to find that is to test through them and observe loss using devices other than the culprit device
one month
28
perfSONARThe value of perfSONAR increases dramatically as it is
deployed at more sites so that more of the end-to-end (app-to-app) path can characterized across multiple network domains provides the only widely deployed tool that can monitor circuits end-
to-end across the different networks from the US to Europendash ESnet has perfSONAR testers installed at every PoP and all but the
smallest user sites ndash Internet2 is close to the same
bull perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) and the protocol is implemented using SOAP XML messages
29
4) System software evolution and optimizationOnce the network is error-free there is still the issue of
efficiently moving data from the application running on a user system onto the network
bull Host TCP tuningbull Modern TCP stack (see above)bull Other issues (MTU etc)
bull Data transfer tools and parallelism
bull Other data transfer issues (firewalls etc)
30
41) System software tuning Host tuning ndash TCPbull ldquoTCP tuningrdquo commonly refers to the proper configuration of
TCP windowing buffers for the path lengthbull It is critical to use the optimal TCP send and receive socket
buffer sizes for the path (RTT) you are using end-to-endDefault TCP buffer sizes are typically much too small for
todayrsquos high speed networksndash Until recently default TCP sendreceive buffers were typically 64 KBndash Tuned buffer to fill CA to NY 1 Gbs path 10 MB
bull 150X bigger than the default buffer size
31
System software tuning Host tuning ndash TCPbull Historically TCP window size tuning parameters were host-
global with exceptions configured per-socket by applicationsndash How to tune is a function of the application and the path to the
destination so potentially a lot of special cases
Auto-tuning TCP connection buffer size within pre-configured limits helps
Auto-tuning however is not a panacea because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (eg international) paths
32
System software tuning Host tuning ndash TCP
Throughput out to ~9000 km on a 10Gbs network32 MBy (auto-tuned) vs 64 MBy (hand tuned) TCP window size
hand tuned to 64 MBy window
Roundtrip time ms (corresponds roughlyto San Francisco to London)
path length
10000900080007000600050004000300020001000
0
Thro
ughp
ut M
bs
auto tuned to 32 MBy window
33
42) System software tuning Data transfer toolsParallelism is key in data transfer tools
ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection
bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)
ndash Several tools offer parallel transfers (see below)
Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN
transfersndash Many tools and protocols assume latencies typical of a LAN
environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long
path networks
bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more
than about 500 Mbs
34
System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL
RTT = 53 ms network capacity = 10GbpsTool Throughput
bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology
bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase
bull this helps rsync too
35
System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-
performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open
ports) ssh etc The newer Globus Online incorporates all of these and small file
support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community
outside of HEP
36
System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach
ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node
ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and
httpmonalisacernchFDT
37
44) System software tuning Other issuesFirewalls are anathema to high-peed data flows
ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for
TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo
Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf
bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning
bull Defaults are usually fine for 1GE but 10GE often requires additional tuning
ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo
([HPBulk])
5) Site infrastructure to support data-intensive scienceThe Science DMZ
With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the
bottleneckThe site network (LAN) typically provides connectivity for local
resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network
and the local area site network is critical for large-scale data movement
Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks
for business and small data-flow purposes usually donrsquot work for large-scale data flows
bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data
flows
39
The Science DMZTo provide high data-rate access to local resources the site
LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high
speed data path all the way back to the source
40
The Science DMZThe ScienceDMZ concept
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and
rapid fault isolation typically perfSONAR (see [perfSONAR] and below)
A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)
This is so important it was a requirement for last round of NSF CC-NIE grants
41
The Science DMZ
(See httpfasterdataesnetscience-dmz
and [SDMZ] for a much more complete
discussion of the various approaches)
campus siteLAN
high performanceData Transfer Node
computing cluster
cleanhigh-bandwidthWAN data path
campussiteaccess to
Science DMZresources is via the site firewall
secured campussiteaccess to Internet
border routerWAN
Science DMZrouterswitch
campus site
Science DMZ
Site DMZ WebDNS
Mail
network monitoring and testing
A WAN-capable device
per-servicesecurity policycontrol points
site firewall
dedicated systems built and
tuned for wide-area data transfer
42
6) Data movement and management techniquesAutomated data movement is critical for moving 500
terabytesday between 170 international sites In order to effectively move large amounts of data over the
network automated systems must be used to manage workflow and error recovery
bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers
bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)
43
Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the
analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates
compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day
bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10
petabytes of datayear in order to accomplish its science
bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial
44
DDMAgent
DDMAgent
ATLAS production
jobs
Regional production
jobs
User Group analysis jobs
Data Service
Task Buffer(job queue)
Job Dispatcher
PanDA Server(task management)
Job Broker
Policy(job type priority)
ATLA
S Ti
er 1
Data
Cen
ters
11 s
ites
scat
tere
d ac
ross
Euro
pe N
orth
Am
erica
and
Asia
in
aggr
egat
e ho
ld 1
copy
of a
ll dat
a an
d pr
ovide
the
work
ing
data
set f
or d
istrib
ution
to T
ier 2
cen
ters
for a
nalys
isDistributed
Data Manager
Pilot Job(Panda job
receiver running under the site-
specific job manager)
Grid Scheduler
Site Capability Service
CERNATLAS detector
Tier 0 Data Center(1 copy of all data ndash
archival only)
Job resource managerbull Dispatch a ldquopilotrdquo job manager - a
Panda job receiver - when resources are available at a site
bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA
bull Similar to the Condor Glide-in approach
Site status
ATLAS analysis sites(eg 70 Tier 2 Centers in
Europe North America and SE Asia)
DDMAgent
DDMAgent
1) Schedules jobs initiates data movement
2) DDM locates data and moves it to sites
This is a complex system in its own right called DQ2
3) Prepares the local resources to receive Panda jobs
4) Jobs are dispatched when there are resources available and when the required data is
in place at the site
Thanks to Michael Ernst US ATLAS technical lead for his assistance with this
diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)
The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday
CERN
Try to move the job to where the data is else move data and job to where
resources are available
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe N America and SE Asia generate network
data movement of 730 TByday ~68Gbs
Accumulated data volume on disk
730 TBytesday
PanDA manages 120000ndash140000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movementgoing on 24 hrday 9+ monthsyr
that networks must support in order to enable the large-scale science of the LHC
0
50
100
150
Peta
byte
s
four years
0
50000
100000
type
2jo
bs
0
50000
type
1jo
bs
one year
one year
one year
46
Building an LHC-scale production analysis system In order to debug and optimize the distributed system that
accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in
ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC
production
47
Ramp-up of LHC traffic in ESnet
(est of ldquosmallrdquo scale traffic)
LHC
turn
-on
LHC data systemtesting
LHC operationThe transition from testing to operation
was a smooth continuum due toat-scale testing ndash a process that took
more than 5 years
48
6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument
to data centers ndash a dedicated purpose-built infrastructure is needed
bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to
the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the
Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward
exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community
bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by
bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
ndash that is only LHC data and compute servers are connected to the OPN
50
The LHC OPN - Optical Private Network
[Figure: LHCOPN physical topology (abbreviated) and LHCOPN architecture - CH-CERN at the hub, connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]
51
The LHC OPN - Optical Private Network
N.B.
- In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
- Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
  - The ESnet part of the LHCOPN has used this approach for more than 5 years - in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
  - However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
- In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic.
- (There are about 170 Tier 2 sites.)
- Managing this with all possible combinations of Tier 2 - Tier 2 flows (potentially 170 x 170, roughly 29,000 site pairs) cannot be done just using a virtual circuit service - it is a relatively heavy-weight mechanism.
- Special infrastructure is required for this: the LHC's Open Network Environment - LHCONE - was designed for this purpose.
53
The LHC's Open Network Environment - LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems.
- The clouds are mostly local to a network domain, e.g. one for each involved domain - ESnet, GEANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.
- The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).
In this way the LHC traffic will use circuits designated by the network engineers (a small illustration of the idea follows).
- This ensures continued good performance for the LHC and ensures that other traffic is not impacted - critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.
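A minimal sketch of the restricted-address-space idea behind the LHCONE overlay, not LHCONE's actual implementation: each VRF "cloud" is represented by the set of LHC subnets it announces, and a simple lookup decides whether a flow belongs on the overlay or stays on the general-purpose infrastructure. The subnet values and VRF membership below are hypothetical.

# Hypothetical VRF clouds and the LHC subnets they announce
import ipaddress

lhcone_vrfs = {
    "ESnet":     ["192.0.2.0/25"],      # hypothetical LHC subnets at US labs
    "Internet2": ["198.51.100.0/24"],   # hypothetical US university Tier 2/3 subnets
    "GEANT":     ["203.0.113.0/24"],    # hypothetical European NREN-fronted subnets
}

def on_lhcone(addr):
    """Return the VRF cloud that announces this address, or None if the
    traffic should stay on the general-purpose R&E infrastructure."""
    ip = ipaddress.ip_address(addr)
    for vrf, subnets in lhcone_vrfs.items():
        if any(ip in ipaddress.ip_network(s) for s in subnets):
            return vrf
    return None

print(on_lhcone("198.51.100.17"))  # -> "Internet2" (an LHC analysis cluster)
print(on_lhcone("8.8.8.8"))        # -> None (general traffic, not on the overlay)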
54
[Figure: LHCONE - a global infrastructure for the LHC Tier 1 data center to Tier 2 analysis center connectivity (April 2012). The map shows the LHCONE VRF domains - ESnet and Internet2 (USA), CANARIE (Canada), GEANT and the European NRENs (NORDUnet, DFN, GARR, RedIRIS, SARA, RENATER), ASGC and TWAREN (Taiwan), KERONET2 and KISTI (Korea), CUDI (Mexico), and CERN - together with the attached Tier 1 centers and Tier 2/3 end sites and the regional R&E communication nexus points. Data communication links are 10, 20, and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment - LHCONE
- LHCONE could be set up relatively "quickly" because:
  - the VRF technology is a standard capability in most core routers, and
  - there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
- LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
- From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
- See lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
CERN -> T1          miles     km
France                350     565
Italy                 570     920
UK                    625    1000
Netherlands           625    1000
Germany               700    1185
Spain                 850    1400
Nordic               1300    2100
USA - New York       3900    6300
USA - Chicago        4400    7100
Canada - BC          5200    8400
Taiwan               6100    9850
[Figure: A network-centric view of the LHC. The detector feeds the Level 1 and 2 triggers (O(1-10) meters) at about 1 PB/s, then the Level 3 trigger (O(10-100) meters), then the CERN computer center (O(1) km). From CERN, 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) flows over the LHC Optical Private Network (LHCOPN), 500-10,000 km, to the LHC Tier 1 data centers (Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN). The LHC Open Network Environment (LHCONE) connects the Tier 1 centers to the LHC Tier 2 analysis centers - the universities and physics groups. This is intended to indicate that the physics groups now get their data wherever it is most readily available.]
57
7) New network services
Point-to-Point Virtual Circuit Service
Why a Circuit Service?
- Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed application systems in order to:
  - couple existing pockets of code, data, and expertise into "systems of systems"
  - break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
  - see https://www.es.net/about/science-requirements
- A commonly identified need to support this is that networking must be provided as a "service":
  - schedulable, with guaranteed bandwidth - as is done with CPUs and disks
  - traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
  - some network path characteristics may also be specified - e.g. diversity
  - available in a Web Services / Grid Services paradigm
(A minimal illustration of requesting such a service follows.)
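A minimal sketch of "network as a service": requesting a schedulable, guaranteed-bandwidth circuit between two end sites, the way one would reserve CPU or disk. The endpoint URL, field names, and values are hypothetical - this is not the actual OSCARS or NSI API - but it illustrates the kind of parameters such a service exposes.

# Hypothetical bandwidth-reservation request
import json
import urllib.request
from datetime import datetime, timedelta, timezone

start = datetime.now(timezone.utc) + timedelta(hours=1)

reservation = {
    "src": "site-a.example.net",        # hypothetical source endpoint
    "dst": "site-b.example.net",        # hypothetical destination endpoint
    "bandwidth_mbps": 5000,             # guaranteed bandwidth, like a CPU/disk reservation
    "start": start.isoformat(),
    "end": (start + timedelta(hours=8)).isoformat(),
    "path_constraints": {"diverse_from": None},   # optional path characteristics
}

req = urllib.request.Request(
    "https://vc-service.example.net/reservations",   # hypothetical service URL
    data=json.dumps(reservation).encode(),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # would return a reservation ID to poll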
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
- This is typically done by using a "static" routing mechanism:
  - e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
  - MPLS and OpenFlow are examples of this, and both can transport IP packets
  - most modern Internet routers have this type of functionality
- Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" - that is, to manage and optimize the use of available network resources and to keep big data flows separate from general traffic.
  - The virtual circuits can be directed to specific physical network paths when they are set up.
(A small sketch of label-based forwarding follows.)
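A toy sketch of the label-based switching idea behind MPLS-style circuits, not any router's actual implementation: each switch holds a static table, installed in advance, that maps an incoming label to an outgoing port and a new label, so the circuit path is fixed before any packet flows. Switch names, ports, and labels are hypothetical.

# Static, pre-provisioned label-switched path
switch_tables = {
    "switch-A": {17: ("port2", 22)},     # in-label 17 -> out port2, label 22
    "switch-B": {22: ("port5", 31)},
    "switch-C": {31: ("egress", None)},  # last hop pops the label
}
path_order = ["switch-A", "switch-B", "switch-C"]

def forward(packet, label):
    """Follow the pre-provisioned label-switched path for one packet."""
    for switch in path_order:
        port, label = switch_tables[switch][label]
        print(f"{switch}: forward out {port}, next label {label}")
    return packet

forward({"payload": "IP packet"}, 17)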
59
Point-to-Point Virtual Circuit Service
- OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
- See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference, 2011, in the references.
- OSCARS received a 2013 "R&D 100" award.
60
End User View of Circuits - How They Use Them
- Who are the "users"?
  - Sites, for the most part.
- How are the circuits used?
  - End system to end system IP:
    - almost never - very hard unless private address space is used
    - using public address space can result in leaking routes
    - using private address space with multi-homed hosts risks allowing backdoors into secure networks
  - End system to end system Ethernet (or other) over VLAN - a pseudowire:
    - relatively common
    - interesting example: RDMA over VLAN is likely to be popular in the future
      - the SC11 demo of 40G RDMA over the WAN was very successful
      - CPU load for RDMA is a small fraction of that for IP
      - the guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks)
  - Point-to-point connection between routing instances - e.g. BGP at the end points:
    - essentially this is how all current circuits are used, from one site router to another site router
    - typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters
61
End User View of Circuits - How They Use Them
- When are the circuits used?
  - Mostly to solve a specific problem that the general infrastructure cannot.
  - Most circuits are used for a guarantee of bandwidth or for user traffic engineering.

Cross-Domain Virtual Circuit Service
Science collaborations typically span many network domains (administrative units).
- For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits.
- E.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.
63
Inter-Domain Control Protocol
There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.
[Figure: an end-to-end virtual circuit built across five domains - FNAL (AS3152) [US], ESnet (AS293) [US], GEANT (AS20965) [Europe], DFN (AS680) [Germany], and DESY (AS1754) [Germany]. Each domain has a local Inter-Domain Controller (IDC) - OSCARS in ESnet, AutoBAHN in GEANT - that exchanges topology information and passes the VC setup request from the user source domain to the user destination domain, with a data plane connection helper at each domain ingress/egress point.]
1. The domains exchange topology information containing at least potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process.
(A small sketch of this domain-by-domain setup follows.)
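A toy sketch of the inter-domain setup idea, not the actual IDC/NSI protocol: the VC setup request is passed domain to domain, and each domain's local controller must authorize and reserve its segment before the next is asked; if any domain cannot, earlier reservations are released. Domain names and capacities here are hypothetical.

# Chained, per-domain reservation with rollback
domains = [
    {"name": "ESnet",  "controller": "OSCARS",    "free_gbps": 40},
    {"name": "GEANT",  "controller": "AutoBAHN",  "free_gbps": 20},
    {"name": "DFN",    "controller": "local IDC", "free_gbps": 10},
]

def setup_virtual_circuit(request_gbps):
    """Reserve a segment in each domain along the path, or fail and roll back."""
    reserved = []
    for d in domains:
        if d["free_gbps"] < request_gbps:
            print(f"{d['name']}: insufficient capacity, releasing earlier segments")
            for r in reserved:
                r["free_gbps"] += request_gbps   # roll back prior reservations
            return False
        d["free_gbps"] -= request_gbps
        reserved.append(d)
        print(f"{d['name']} ({d['controller']}): segment reserved")
    print("End-to-end virtual circuit established")
    return True

setup_virtual_circuit(10)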
64
Point-to-Point Virtual Circuit Service
- The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
  - Testing is being coordinated in GLIF (Global Lambda Integrated Facility - an international virtual organization that promotes the paradigm of lambda networking).
- To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system.
- Multi-domain circuit setup is not yet a robust production service, but progress is being made.
- See lhcone.net
65
8) Provide R&D, consulting, and a knowledge base
- R&D drove most of the advances that make it possible for the network to support data-intensive science.
  - With each generation of network transport technology:
    - 155 Mb/s was the norm for high speed networks in 1995
    - 100 Gb/s - 650 times greater - is the norm today
  - R&D groups involving hardware engineers, computer scientists, and application specialists worked to:
    - first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
    - and then do the development necessary for applications to make use of the new capabilities
  - Examples of how this methodology drove toward today's capabilities include:
    - experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
    - recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
(A small illustration of the parallel-streams idea follows.)
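A minimal sketch of the parallelism idea behind tools like GridFTP and FDT, not their actual implementation: several TCP streams, each carrying a slice of a file, usually sustain more aggregate throughput over a long path than a single stream. The host name and file layout are hypothetical.

# Parallel TCP streams, each sending one slice of the file
import socket
from concurrent.futures import ThreadPoolExecutor

DEST = ("dtn.example.org", 5000)     # hypothetical data transfer node
SLICE = 64 * 1024 * 1024             # 64 MB per slice

def send_slice(path, offset):
    """Send one slice of the file over its own TCP connection."""
    with socket.create_connection(DEST) as s, open(path, "rb") as f:
        f.seek(offset)
        s.sendall(f.read(SLICE))

def parallel_send(path, nbytes, streams=8):
    offsets = range(0, nbytes, SLICE)
    with ThreadPoolExecutor(max_workers=streams) as pool:
        # each slice is pushed concurrently; the receiver reassembles by offset
        list(pool.map(lambda off: send_slice(path, off), offsets))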
66
Provide R&D, consulting, and a knowledge base
- Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
- Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
- The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base: http://fasterdata.es.net topics
- Network Architecture, including the Science DMZ model
- Host Tuning
- Network Tuning
- Data Transfer Tools
- Network Performance Testing
- With special sections on:
  - Linux TCP Tuning
  - Cisco 6509 Tuning
  - perfSONAR Howto
  - Active perfSONAR Services
  - Globus overview
  - Say No to SCP
  - Data Transfer Nodes (DTN)
  - TCP Issues Explained
- fasterdata.es.net is a community project with contributions from several organizations.
(A small host-tuning check in the spirit of these pages follows.)
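A minimal sketch in the spirit of the fasterdata.es.net host-tuning pages: read the Linux kernel's current TCP buffer limits and congestion control setting, which govern throughput on long round-trip-time paths. The listed /proc paths are standard Linux sysctl locations; any judgment about "good" values is left to the fasterdata pages themselves.

# Read the host's TCP tuning parameters relevant to long-RTT transfers
from pathlib import Path

PARAMS = {
    "net.ipv4.tcp_rmem": "/proc/sys/net/ipv4/tcp_rmem",   # min/default/max receive buffer
    "net.ipv4.tcp_wmem": "/proc/sys/net/ipv4/tcp_wmem",   # min/default/max send buffer
    "net.core.rmem_max": "/proc/sys/net/core/rmem_max",
    "net.core.wmem_max": "/proc/sys/net/core/wmem_max",
    "congestion control": "/proc/sys/net/ipv4/tcp_congestion_control",
}

for name, path in PARAMS.items():
    p = Path(path)
    value = p.read_text().strip() if p.exists() else "n/a (not Linux?)"
    print(f"{name}: {value}")

# For a ~100 ms RTT, 10 Gb/s path the bandwidth-delay product is about
# 10e9 b/s * 0.1 s / 8 = ~125 MB, so buffer limits far above the old
# 64 KB defaults are needed to keep the pipe full.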
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment - SKA, ITER, ...
69
Infrastructure Critical to Science
- The combination of:
  - new network architectures in the wide area
  - new network services (such as guaranteed-bandwidth virtual circuits)
  - cross-domain network error detection and correction
  - redesigning the site LAN to handle high data throughput
  - automation of data movement systems
  - use of appropriate operating system tuning and data transfer tools
  now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
- Other disciplines that involve data-intensive science will face most of these same issues.
70
LHC lessons of possible use to the SKA
The similarities:
- The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
- The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
- The data is generated/sent to a single location and then distributed to science groups.
- The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.
71
LHC lessons of possible use to the SKA
The lessons:
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one location.
- A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
- The technical aspects of building and operating a centralized working data repository:
  - a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
  - high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
  argue against a single large data center.
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well.
- It decentralizes costs and involves many countries directly in the telescope infrastructure.
- It divides up the network load, especially on the expensive trans-ocean links.
- It divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA
Regardless of distributed vs. centralized working data repository, all of the attendant network lessons will apply.
- There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
  - It might be that in the case of the SKA the T1 links would come to a centralized, distribution-only node - say in the UK - that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
  - In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
- If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
  - In fact, it might well be that the SKA could use the LHCONE infrastructure - that is the specific intent of how the R&E networks in the US, e.g., are implementing LHCONE.
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded.
- All high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
- New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
Re-engineering the site LAN-WAN architecture is critical: the Science DMZ.
Workflow management systems that automate the data movement will have to be designed and tested.
- Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" - simulated operation - building up to at-scale data movement well before instrument turn-on. (A small sketch of the automated, error-recovering transfer idea follows.)
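A minimal sketch, not any project's actual workflow system, of the kind of automation such systems need: queue dataset transfers, verify each one, and retry on failure so that sustained data movement does not require a human in the loop. transfer() and checksum_ok() are hypothetical stand-ins for real tool invocations (e.g. a GridFTP/Globus client call).

# Automated transfer queue with verification and retry
import time

def transfer(dataset, destination):
    # Stand-in for invoking a real data transfer tool.
    print(f"transferring {dataset} -> {destination}")

def checksum_ok(dataset, destination):
    # Stand-in for end-to-end integrity verification.
    return True

def move_with_recovery(queue, destination, max_retries=3, backoff_s=60):
    failed = []
    for dataset in queue:
        for attempt in range(1, max_retries + 1):
            try:
                transfer(dataset, destination)
                if checksum_ok(dataset, destination):
                    break                      # this dataset is done
            except OSError as err:
                print(f"{dataset}: attempt {attempt} failed ({err})")
            time.sleep(backoff_s * attempt)    # back off before retrying
        else:
            failed.append(dataset)             # give up; report for operator action
    return failed

print(move_with_recovery(["dataset-001", "dataset-002"], "tier1.example.org"))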
75
The Message
Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment - SKA, ITER, ...
76
References
[DIS] "Infrastructure for Data Intensive Science - a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
2) DDM locates data and moves it to sites
This is a complex system in its own right called DQ2
3) Prepares the local resources to receive Panda jobs
4) Jobs are dispatched when there are resources available and when the required data is
in place at the site
Thanks to Michael Ernst US ATLAS technical lead for his assistance with this
diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)
The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday
CERN
Try to move the job to where the data is else move data and job to where
resources are available
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe N America and SE Asia generate network
data movement of 730 TByday ~68Gbs
Accumulated data volume on disk
730 TBytesday
PanDA manages 120000ndash140000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movementgoing on 24 hrday 9+ monthsyr
that networks must support in order to enable the large-scale science of the LHC
0
50
100
150
Peta
byte
s
four years
0
50000
100000
type
2jo
bs
0
50000
type
1jo
bs
one year
one year
one year
46
Building an LHC-scale production analysis system In order to debug and optimize the distributed system that
accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in
ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC
production
47
Ramp-up of LHC traffic in ESnet
(est of ldquosmallrdquo scale traffic)
LHC
turn
-on
LHC data systemtesting
LHC operationThe transition from testing to operation
was a smooth continuum due toat-scale testing ndash a process that took
more than 5 years
48
6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument
to data centers ndash a dedicated purpose-built infrastructure is needed
bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to
the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the
Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward
exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community
bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by
bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
ndash that is only LHC data and compute servers are connected to the OPN
50
The LHC OPN ndash Optical Private Network
UK-T1_RAL
NDGF
FR-CCIN2P3
ES-PIC
DE-KIT
NL-T1
US-FNAL-CMS
US-T1-BNL
CA-TRIUMF
TW-ASCG
IT-NFN-CNAF
CH-CERNLHCOPN physical (abbreviated)
LHCOPN architecture
51
The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1
centers data transfer was to use dedicated physical 10G circuits
Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than
5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)
ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)
bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism
bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose
53
The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure
designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each
involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc
ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure
that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC
54
ESnetUSA
Chicago
New YorkBNL-T1
Internet2USA
Harvard
CANARIECanada
UVic
SimFraU
TRIUMF-T1
UAlb UTorMcGilU
Seattle
TWARENTaiwan
NCU NTU
ASGCTaiwan
ASGC-T1
KERONET2Korea
KNU
LHCONE VRF domain
End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1
Regional RampE communication nexus
Data communication links 10 20 and 30 Gbs
See httplhconenet for details
NTU
Chicago
LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity
NORDUnetNordic
NDGF-T1aNDGF-T1a NDGF-T1c
DFNGermany
DESYGSI DE-KIT-T1
GARRItaly
INFN-Nap CNAF-T1
RedIRISSpain
PIC-T1
SARANetherlands
NIKHEF-T1
RENATERFrance
GRIF-IN2P3
Washington
CUDIMexico
UNAM
CC-IN2P3-T1Sub-IN2P3
CEA
CERNGeneva
CERN-T1
SLAC
GLakes
NE
MidWSoW
Geneva
KISTIKorea
TIFRIndia
India
Korea
FNAL-T1
MIT
CaltechUFlorida
UNebPurU
UCSDUWisc
UltraLightUMich
Amsterdam
GEacuteANT Europe
April 2012
55
The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because
ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for
use by the LHC collaboration that cannot be made available for general RampE traffic
bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance
bull See LHCONEnet
LHCONE is one part of the network infrastructure that supports the LHC
CERN rarrT1 miles kms
France 350 565
Italy 570 920
UK 625 1000
Netherlands 625 1000
Germany 700 1185
Spain 850 1400
Nordic 1300 2100
USA ndash New York 3900 6300
USA - Chicago 4400 7100
Canada ndash BC 5200 8400
Taiwan 6100 9850
CERN Computer Center
The LHC Optical Private Network
(LHCOPN)
LHC Tier 1Data Centers
LHC Tier 2 Analysis Centers
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups Universities
physicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
The LHC Open Network
Environment(LHCONE)
50 Gbs (25Gbs ATLAS 25Gbs CMS)
detector
Level 1 and 2 triggers
Level 3 trigger
O(1-10) meter
O(10-100) meters
O(1) km
1 PBs
500-10000 km
This is intended to indicate that the physics
groups now get their datawherever it is most readily
available
A Network Centric View of the LHC
Taiwan Canada USA-Atlas USA-CMS
Nordic
UK
Netherlands Germany Italy
Spain
FranceCERN
57
7) New network servicesPoint-to-Point Virtual Circuit Service
Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly
consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of
systemsrdquondash Break up the task of massive data analysis and use data compute and
storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements
A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work
well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm
58
Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual
circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism
bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path
ndash MPLS and OpenFlow are examples of this and both can transport IP packets
ndash Most modern Internet routers have this type of functionality
bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths
when they are set up
59
Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service
(For more information contact the project lead Chin Guok chinesnet)
bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references
bull OSCARS received a 2013 ldquoRampD 100rdquo award
60
End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo
ndash Sites for the most part
bull How are the circuits usedndash End system to end system IP
bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure
networks
ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future
ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP
protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
ndash Point-to-point connection between routing instance ndash eg BGP at the end points
bull Essentially this is how all current circuits are used from one site router to another site router
ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters
61
End User View of Circuits ndash How They Use Thembull When are the circuits used
ndash Mostly to solve a specific problem that the general infrastructure cannot
bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering
network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains
involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET
(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains
63
Inter-Domain Control Protocolbull There are two realms involved
1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains
2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains
FNAL (AS3152)[US]
ESnet (AS293)[US]
GEANT (AS20965)[Europe]
DFN (AS680)[Germany]
DESY (AS1754)[Germany]
Topology exchange
VC setup request
Local InterDomain
Controller
Local IDC
Local IDC
Local IDC
Local IDC
VC setup request
VC setup request
VC setup request
OSCARS
User source
User destination
VC setup request
data plane connection helper at each domain ingressegress point
data plane connection helper at each domain ingressegress point
1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to
domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process
The end-to-end virtual circuit
AutoBAHN
64
Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved
into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated
Facility - an international virtual organization that promotes the paradigm of lambda networking)
bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet
65
8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
  – With each generation of network transport technology:
    • 155 Mb/s was the norm for high-speed networks in 1995
    • 100 Gb/s – 650 times greater – is the norm today
  – R&D groups involving hardware engineers, computer scientists, and application specialists worked to:
    • first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
    • and then do the development necessary for applications to make use of the new capabilities
  – Examples of how this methodology drove toward today's capabilities include:
    • experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC-12 (622 Mb/s) wide area network paths
    • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
66
Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations
67
The knowledge base: http://fasterdata.es.net topics
  – Network Architecture, including the Science DMZ model
  – Host Tuning
  – Network Tuning
  – Data Transfer Tools
  – Network Performance Testing
  – With special sections on:
    • Linux TCP Tuning
    • Cisco 6509 Tuning
    • perfSONAR Howto
    • Active perfSONAR Services
    • Globus overview
    • Say No to SCP
    • Data Transfer Nodes (DTN)
    • TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
69
Infrastructure Critical to Science
• The combination of:
  – new network architectures in the wide area
  – new network services (such as guaranteed-bandwidth virtual circuits)
  – cross-domain network error detection and correction
  – redesigning the site LAN to handle high data throughput
  – automation of data movement systems
  – use of appropriate operating system tuning and data transfer tools
  now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKA: the similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKA: the lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one location
  – A deep archive (tape-only) copy is probably practical in one location (e.g., the SKA supercomputer center), and this is done at CERN for the LHC
  – The technical aspects of building and operating a centralized working data repository:
    • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
    • high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
    militate against a single large data center
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well
  – It decentralizes costs and involves many countries directly in the telescope infrastructure
  – It divides up the network load, especially on the expensive trans-ocean links
  – It divides up the cache I/O load across distributed sites
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply
  There will have to be an LHCOPN-like network from the data source to Tier 1 site(s)
  • It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue
  • In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
  If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
  – In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded
  All high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
  New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical: the Science DMZ
Workflow management systems that automate the data movement will have to be designed and tested
  – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
75
The Message
Again … a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C., Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
2) Data transport: the limitations of TCP must be addressed for large, long-distance flows
Although there are other transport protocols available, TCP remains the workhorse of the Internet, including for data-intensive science.
Using TCP to support the sustained, long-distance, high data-rate flows of data-intensive science requires an error-free network.
Why error-free? TCP is a "fragile workhorse": it is very sensitive to packet loss (due to bit errors)
  – Very small packet loss rates on these paths result in large decreases in performance
  – A single bit error will cause the loss of a 1-9 KB packet (depending on the MTU size), as there is no FEC at the IP level for error correction
    • This puts TCP back into "slow start" mode, thus reducing throughput
22
Transport
• The reason for TCP's sensitivity to packet loss is the slow-start and congestion-avoidance algorithms that were added to TCP to prevent congestion collapse of the Internet
  – Packet loss is seen by TCP's congestion control algorithms as evidence of congestion, so they activate to slow down and prevent the synchronization of the senders (which perpetuates and amplifies the congestion, leading to network throughput collapse)
  – Network link errors also cause packet loss, so these congestion avoidance algorithms come into play with dramatic effect on throughput in the wide area network – hence the need for "error-free"
23
Transport: impact of packet loss on TCP
On a 10 Gb/s LAN path the impact of low packet loss rates is minimal.
On a 10 Gb/s WAN path the impact of low packet loss rates is enormous (~80x throughput reduction on a transatlantic path).
Implication: error-free paths are essential for high-volume, long-distance data transfers.
[Figure: throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss, for Reno (measured), Reno (theory), H-TCP (measured), and the no-packet-loss case. The x-axis is network round trip time in ms (the maximum corresponds roughly to San Francisco to London); the y-axis is throughput in Mb/s from 0 to 10,000. See http://fasterdata.es.net/performance-testing/perfsonar/troubleshooting/packet-loss]
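The shape of the Reno curves in the figure can be approximated with the well-known rule-of-thumb loss model of Mathis et al. (throughput ≲ MSS/RTT · C/√p). The sketch below is only an illustration of that model, not the measurement behind the figure; the MSS, the constant C, and the RTT values are assumptions.

```python
# Illustrative sketch (not the measured data): the Mathis et al. rule of thumb
# for loss-limited TCP Reno throughput, rate <= (MSS / RTT) * C / sqrt(p).
from math import sqrt

MSS_BYTES = 1460        # assumed Ethernet-sized segment
C = 1.22                # constant for periodic loss (sqrt(3/2)); an approximation
LOSS = 0.0046 / 100     # 0.0046% packet loss, as in the figure

def reno_throughput_mbps(rtt_ms: float) -> float:
    """Loss-limited Reno throughput estimate in Mb/s."""
    rtt = rtt_ms / 1000.0
    return (MSS_BYTES * 8 / rtt) * (C / sqrt(LOSS)) / 1e6

for rtt_ms in (1, 10, 50, 100, 150):   # ~150 ms is roughly San Francisco-London
    print(f"RTT {rtt_ms:4d} ms -> ~{reno_throughput_mbps(rtt_ms):8.1f} Mb/s")
# At WAN round trip times the estimate falls far below 10 Gb/s, which is why
# error-free paths matter so much for long-distance transfers.
```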
24
Transport: modern TCP stack
• A modern TCP stack (the kernel implementation of the TCP protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])
  – This is done using mechanisms that more quickly increase back to full speed after an error forces a reset to low bandwidth
[Figure: impact of the "Binary Increase Congestion" (BIC) control algorithm. Note that BIC reaches maximum throughput much faster than older algorithms. (From Linux 2.6.19 the default is CUBIC, a refined version of BIC designed for high-bandwidth, long paths.)]
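On a Linux end host, the congestion control algorithm in use can be inspected (and, with privileges, changed) through the kernel's sysctl interface. A minimal sketch, assuming a Linux host with the standard /proc paths:

```python
# Minimal sketch for a Linux data transfer host: report which TCP congestion
# control algorithm the kernel is using and which ones are available.
from pathlib import Path

PROC = Path("/proc/sys/net/ipv4")

def read(name: str) -> str:
    return (PROC / name).read_text().strip()

if __name__ == "__main__":
    print("current  :", read("tcp_congestion_control"))            # e.g. cubic
    print("available:", read("tcp_available_congestion_control"))  # e.g. reno cubic htcp
    # Changing it requires root, e.g.:  sysctl -w net.ipv4.tcp_congestion_control=htcp
```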
25
Transport: modern TCP stack
Even modern TCP stacks are only of some help in the face of packet loss on a long-path, high-speed network.
• For a detailed analysis of the impact of packet loss on various TCP implementations, see "An Investigation into Transport Protocols and Data Transport Applications Over High Performance Networks," chapter 8 ("Systematic Tests of New-TCP Behaviour"), by Yee-Ting Li, University College London (PhD thesis). http://www.slac.stanford.edu/~ytl/thesis.pdf
[Figure: throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss (tail zoom, 0-1000 Mb/s), for Reno (measured), Reno (theory), and H-TCP (CUBIC refinement, measured). The x-axis is round trip time in ms (the maximum corresponds roughly to San Francisco to London).]
26
3) Monitoring and testing
The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction.
perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.)
• perfSONAR is a community effort to:
  – define network management data exchange protocols, and
  – standardize measurement data formats, gathering, and archiving
perfSONAR is deployed extensively throughout LHC-related networks and international networks and at the end sites (see [fasterdata], [perfSONAR], and [NetServ])
  – There are now more than 1000 perfSONAR boxes installed in N. America and Europe
27
perfSONAR
The test and monitor functions can detect soft errors that limit throughput and can be hard to find (hard errors / faults are easily found and corrected).
Soft failure example:
• Observed end-to-end performance degradation due to soft failure of a single optical line card
[Figure: one month of throughput measurements in Gb/s showing normal performance, degrading performance, and recovery after repair.]
• Why not just rely on SNMP interface stats for this sort of error detection?
  • not all error conditions show up in SNMP interface statistics
  • SNMP error statistics can be very noisy
  • some devices lump different error counters into the same bucket, so it can be very challenging to figure out what errors to alarm on and what errors to ignore
    • though ESnet's Spectrum monitoring system attempts to apply heuristics to do this
  • many routers will silently drop packets – the only way to find that is to test through them and observe loss using devices other than the culprit device
28
perfSONAR
The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains
  It provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe
  – ESnet has perfSONAR testers installed at every PoP and all but the smallest user sites – Internet2 is close to the same
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages
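perfSONAR itself provides the scheduled test meshes and measurement archives; as a toy stand-in for the underlying idea of continuous active testing (not the perfSONAR implementation or its APIs), one could periodically run a throughput test and alarm on degradation. The test host name and thresholds below are assumptions for illustration.

```python
# Toy illustration of continuous soft-failure detection (NOT perfSONAR itself):
# periodically run an active throughput test to a remote test host and alarm
# when the measured rate falls well below the expected baseline.
import json
import subprocess
import time

TEST_HOST = "ps-test.example.org"   # hypothetical remote test endpoint
BASELINE_MBPS = 9000                # expected throughput on a clean 10 Gb/s path
ALARM_FRACTION = 0.5                # alarm if below half of baseline

def measure_throughput_mbps() -> float:
    """Run a short iperf3 test and return sender throughput in Mb/s."""
    out = subprocess.run(["iperf3", "-c", TEST_HOST, "-t", "10", "-J"],
                         capture_output=True, text=True, check=True)
    result = json.loads(out.stdout)
    return result["end"]["sum_sent"]["bits_per_second"] / 1e6

while True:
    mbps = measure_throughput_mbps()
    if mbps < BASELINE_MBPS * ALARM_FRACTION:
        print(f"ALARM: throughput {mbps:.0f} Mb/s - possible soft failure on the path")
    time.sleep(3600)  # hourly; real deployments use regular scheduled test meshes
```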
29
4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network.
• Host TCP tuning
• Modern TCP stack (see above)
• Other issues (MTU, etc.)
• Data transfer tools and parallelism
• Other data transfer issues (firewalls, etc.)
30
4.1) System software tuning: host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using end-to-end
• Default TCP buffer sizes are typically much too small for today's high-speed networks
  – Until recently, default TCP send/receive buffers were typically 64 KB
  – Tuned buffer to fill a CA to NY 1 Gb/s path: 10 MB
    • 150x bigger than the default buffer size
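The 10 MB figure is simply the bandwidth-delay product of that path. A small sketch of the calculation, with the CA-to-NY round trip time (~80 ms) taken as an assumption:

```python
# Sketch: the TCP buffer needed to keep a path full is the bandwidth-delay
# product (BDP) = path bandwidth x round trip time.
def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    return bandwidth_bps * rtt_seconds / 8  # bits -> bytes

# CA to NY at 1 Gb/s; ~80 ms RTT is an assumed typical value for that path.
print(f"{bdp_bytes(1e9, 0.080) / 1e6:.0f} MB")    # ~10 MB, vs. a 64 KB default
# A 10 Gb/s trans-Atlantic path (~150 ms RTT) needs nearly 200 MB of buffer.
print(f"{bdp_bytes(10e9, 0.150) / 1e6:.0f} MB")
```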
31
System software tuning: host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications
  – How to tune is a function of the application and the path to the destination, so potentially a lot of special cases
• Auto-tuning the TCP connection buffer size within pre-configured limits helps
• Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g., international) paths
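A minimal sketch of raising those upper limits on a Linux host; the 128 MB ceiling and the specific values are illustrative (of the kind of guidance published on fasterdata.es.net), not a universal recommendation:

```python
# Sketch: print sysctl settings that raise the kernel's socket buffer ceilings
# so TCP auto-tuning can grow windows large enough for long, fast paths.
# The 128 MB ceiling is illustrative only.
MAX_BUF = 128 * 1024 * 1024   # 128 MB ceiling for auto-tuning

settings = {
    "net.core.rmem_max": MAX_BUF,
    "net.core.wmem_max": MAX_BUF,
    # min, default, max (bytes) for TCP receive and send buffer auto-tuning
    "net.ipv4.tcp_rmem": f"4096 87380 {MAX_BUF}",
    "net.ipv4.tcp_wmem": f"4096 65536 {MAX_BUF}",
}

for key, value in settings.items():
    print(f"{key} = {value}")   # paste into /etc/sysctl.conf, then run: sysctl -p
```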
32
System software tuning: host tuning – TCP
[Figure: throughput out to ~9000 km path length on a 10 Gb/s network, comparing a 32 MB (auto-tuned) vs. a 64 MB (hand-tuned) TCP window size. The x-axis is round trip time in ms (corresponds roughly to San Francisco to London); the y-axis is throughput in Mb/s from 0 to 10,000. The hand-tuned 64 MB window sustains higher throughput at long RTTs.]
33
4.2) System software tuning: data transfer tools
Parallelism is key in data transfer tools
  – It is much easier to achieve a given performance level with multiple parallel connections than with one connection
    • this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (the same is true for disks)
  – Several tools offer parallel transfers (see below)
Latency tolerance is critical
  – Wide area data transfers have much higher latency than LAN transfers
  – Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds)
    • examples: SCP/SFTP and HPSS mover protocols work very poorly in long-path networks
• Disk performance
  – In general, need a RAID array or parallel disks (like FDT) to get more than about 500 Mb/s
34
System software tuning: data transfer tools
Using the right tool is very important.
Sample results: Berkeley, CA to Argonne, IL (RTT = 53 ms, network capacity = 10 Gbps)
  Tool                      Throughput
  • scp                     140 Mbps
  • patched scp (HPN)       1.2 Gbps
  • ftp                     1.4 Gbps
  • GridFTP, 4 streams      5.4 Gbps
  • GridFTP, 8 streams      6.6 Gbps
Note that to get more than about 1 Gbps (125 MB/s) disk-to-disk requires using RAID technology.
• PSC (Pittsburgh Supercomputing Center) has a patch set that fixes problems with SSH
  – http://www.psc.edu/networking/projects/hpn-ssh
  – Significant performance increase
    • this helps rsync too
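The GridFTP rows above reflect multiple parallel TCP streams, which the classic globus-url-copy client selects on the command line. A sketch of such an invocation; the host names, paths, and 32 MB buffer are placeholders, not values from the measurements above:

```python
# Sketch: driving a parallel GridFTP transfer with globus-url-copy.
# Host names, paths, and the 32 MB buffer are illustrative placeholders.
import subprocess

cmd = [
    "globus-url-copy",
    "-vb",             # show transfer performance while running
    "-p", "8",         # 8 parallel TCP streams (cf. the 6.6 Gbps row above)
    "-tcp-bs", "32M",  # per-stream TCP buffer size
    "gsiftp://dtn01.site-a.example/data/run42.tar",
    "gsiftp://dtn01.site-b.example/data/run42.tar",
]
subprocess.run(cmd, check=True)
```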
35
System software tuning: data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems
  Parallel streams, buffer tuning, help in getting through firewalls (open ports), ssh, etc.
  The newer Globus Online incorporates all of these plus small-file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP
36
System software tuning: data transfer tools
Also see Caltech's FDT (Fast Data Transfer) approach
  – Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node
  – Explicit parallel use of multiple disks
  – Can fill 100 Gb/s paths
  – See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT
37
4.4) System software tuning: other issues
Firewalls are anathema to high-speed data flows
  – many firewalls can't handle >1 Gb/s flows
    • designed for a large number of low-bandwidth flows
    • some firewalls even strip out TCP options that allow for TCP buffers > 64 KB
  – See Jason Zurawski's "Say Hello to your Frienemy – The Firewall"
  – Stateful firewalls have inherent problems that inhibit high throughput
    • http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues
  – Large MTUs (several issues)
  – NIC tuning
    • Defaults are usually fine for 1GE, but 10GE often requires additional tuning
  – Other OS tuning knobs
  – See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])
5) Site infrastructure to support data-intensive science: the Science DMZ
With the wide area part of the network infrastructure addressed, the typical site / campus LAN becomes the bottleneck.
The site network (LAN) typically provides connectivity for local resources – compute, data, instrument, collaboration system, etc. – needed by data-intensive science
  – Therefore a high-performance interface between the wide area network and the local area site network is critical for large-scale data movement
Campus network infrastructure is typically not designed to handle the flows of large-scale science
  – The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows
    • firewalls, proxy servers, low-cost switches, and so forth
    • none of which will allow high-volume, high-bandwidth, long-distance data flows
39
The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high bandwidth, large data volume, and high round trip time (RTT) (international paths) of the wide area network (WAN) flows (see [DIS])
  – otherwise the site will impose poor performance on the entire high-speed data path, all the way back to the source
40
The Science DMZ
The Science DMZ concept:
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy
  Outside the site firewall – hence the term "Science DMZ"
  With dedicated systems built and tuned for wide-area data transfer
  With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below)
  A security policy tailored for science traffic and implemented using appropriately capable hardware (e.g., that supports access control lists, private address space, etc.)
This is so important it was a requirement for the last round of NSF CC-NIE grants.
41
The Science DMZ
(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)
[Figure: Science DMZ architecture. The WAN connects through the border router (a WAN-capable device) to a Science DMZ router/switch, providing a clean, high-bandwidth WAN data path to a high-performance Data Transfer Node, a computing cluster, network monitoring and testing, and per-service security policy control points – dedicated systems built and tuned for wide-area data transfer. Campus/site access to Science DMZ resources is via the site firewall, which also provides secured campus/site access to the Internet; the campus/site LAN and the Site DMZ (Web, DNS, Mail) sit behind that firewall.]
42
6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites
  In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery
• The filtered ATLAS data, at a rate of about 2.5 Gb/s, is sent to 10 national Tier 1 data centers
• The Tier 2 sites get a comparable amount of data from the Tier 1s
  – Host the physics groups that analyze the data and do the science
  – Provide most of the compute resources for analysis
  – Cache the data (though this is evolving to remote I/O)
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management
  – The resources and data movement are centrally managed
  – Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations
  – The system manages tens of thousands of jobs a day
    • coordinates data movement of hundreds of terabytes/day, and
    • manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial
44
[Figure: the ATLAS PanDA "Production and Distributed Analysis" system, which uses distributed resources and layers of automation to manage several million jobs/day. ATLAS production, regional production, and user/group analysis jobs enter the PanDA Server (task management) through the Task Buffer (job queue), supported by a Job Broker, a Policy module (job type, priority), a Job Dispatcher, and a Data Service. A Distributed Data Manager (DDM agents), a Grid Scheduler, a Site Capability Service, and site status information feed the dispatch decisions. The CERN ATLAS detector feeds the Tier 0 Data Center (1 copy of all data – archival only); the ATLAS Tier 1 Data Centers – 11 sites scattered across Europe, North America, and Asia – in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis; the ATLAS analysis sites are, e.g., 70 Tier 2 centers in Europe, North America, and SE Asia.]
Job resource manager:
• Dispatch a "pilot" job manager – a PanDA job receiver – when resources are available at a site
• Pilots run under the local site job manager (e.g., Condor, LSF, LCG, …) and accept jobs in a standard format from PanDA
• Similar to the Condor Glide-in approach
Steps shown in the figure:
1) PanDA schedules jobs and initiates data movement
2) The DDM locates data and moves it to sites (this is a complex system in its own right, called DQ2)
3) The pilot prepares the local resources to receive PanDA jobs
4) Jobs are dispatched when there are resources available and when the required data is in place at the site
The general strategy: try to move the job to where the data is, else move data and job to where resources are available.
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)
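The pilot-job pattern described above is generic enough to sketch. The loop below is a schematic of that pattern only, with invented names and a toy in-memory queue; it is not PanDA's actual interfaces or DQ2.

```python
# Schematic of the generic pilot-job pattern (NOT PanDA's real interfaces):
# a pilot runs under the local batch system and pulls work from a central
# queue only when the required input data is already staged at the site.
from collections import deque
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    dataset: str

# Toy stand-ins for the central task queue and the data management catalog.
central_queue = deque([Job("analysis-001", "dataset-A"), Job("analysis-002", "dataset-B")])
datasets_at_site = {"dataset-A"}   # what data management has already staged here

def run_pilot(site: str) -> None:
    while central_queue:
        job = central_queue.popleft()          # pull model: the pilot asks for work
        if job.dataset not in datasets_at_site:
            central_queue.append(job)          # not ready here; leave for another site/retry
            break
        print(f"[{site}] running {job.name} on {job.dataset}")   # execute in the batch slot
        print(f"[{site}] {job.name} done, reporting status to the central manager")

run_pilot("BNL_T1")
```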
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TB/day, ~68 Gb/s.
[Figure: accumulated data volume on disk (0-150 petabytes over four years, growing at 730 TBytes/day) and the number of simultaneous PanDA jobs over one year. PanDA manages 120,000-140,000 simultaneous jobs; PanDA manages two types of jobs, which are shown separately.]
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
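The quoted average rate follows directly from the daily volume; a one-line check (decimal terabytes assumed):

```python
# Check: 730 TB/day expressed as an average rate in Gb/s (decimal units assumed).
tb_per_day = 730
gbits_per_s = tb_per_day * 1e12 * 8 / 86_400 / 1e9
print(f"{gbits_per_s:.1f} Gb/s")   # ~67.6 Gb/s, i.e. the ~68 Gb/s quoted above
```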
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure
  – Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges"
  – Successful testing was required for sites to participate in LHC production
47
Ramp-up of LHC traffic in ESnet
[Figure: ESnet traffic over time (with an estimate of the "small"-scale traffic), showing the LHC data system testing period, LHC turn-on, and LHC operation.]
The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.
48
6, cont.) Evolution of network architectures
For sustained high data-rate transfers – e.g., from instrument to data centers – a dedicated, purpose-built infrastructure is needed
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN
  – The LHCOPN is a collection of leased 10 Gb/s optical circuits
  – The role of the LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously
    • In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community
• The issues related to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance
  The security issues were the primary ones, and were addressed by:
  • using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
    – that is, only LHC data and compute servers are connected to the OPN
50
The LHC OPN – Optical Private Network
[Figure: LHCOPN physical topology (abbreviated) and LHCOPN architecture. The Tier 1 centers – UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF – connect to CH-CERN.]
51
The LHC OPN – Optical Private Network
N.B.:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays
  – The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below)
  – However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic
  – In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
  – (there are about 170 Tier 2 sites)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems
  – The clouds are mostly local to a network domain (e.g., one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
  – The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineers
  – to ensure continued good performance for the LHC and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC
54
[Figure: LHCONE – a global infrastructure for LHC Tier 1 data center to Tier 2 analysis center connectivity (April 2012). Regional LHCONE VRF domains – ESnet and Internet2 (USA), CANARIE (Canada), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), GÉANT (Europe), TWAREN and ASGC (Taiwan), KERONET2 and KISTI (Korea), CUDI (Mexico), and India – interconnect end sites (LHC Tier 2 or Tier 3 unless indicated as Tier 1, e.g., BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1) through regional R&E communication nexuses at, e.g., Seattle, Chicago, New York, Washington, Amsterdam, and Geneva, over data communication links of 10, 20, and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because:
  – the VRF technology is a standard capability in most core routers, and
  – there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance
• See lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
CERN → T1 network distances:
  France: 350 miles (565 km); Italy: 570 miles (920 km); UK: 625 miles (1000 km); Netherlands: 625 miles (1000 km); Germany: 700 miles (1185 km); Spain: 850 miles (1400 km); Nordic: 1300 miles (2100 km); USA – New York: 3900 miles (6300 km); USA – Chicago: 4400 miles (7100 km); Canada – BC: 5200 miles (8400 km); Taiwan: 6100 miles (9850 km)
[Figure: "A Network Centric View of the LHC." The detector (1 PB/s output) feeds the Level 1 and 2 triggers over O(1-10) meters, the Level 3 trigger over O(10-100) meters, and the CERN Computer Center over O(1) km. From there, 5.0 Gb/s (2.5 Gb/s ATLAS, 2.5 Gb/s CMS) flows over the LHC Optical Private Network (LHCOPN) to the LHC Tier 1 Data Centers (Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN) 500-10,000 km away, and on via the LHC Open Network Environment (LHCONE) to the LHC Tier 2 Analysis Centers – the many university physics groups. This is intended to indicate that the physics groups now get their data wherever it is most readily available.]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed application systems in order to:
  – couple existing pockets of code, data, and expertise into "systems of systems"
  – break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
  – see https://www.es.net/about/science-requirements/
• A commonly identified need to support this is that networking must be provided as a "service":
  – schedulable with guaranteed bandwidth – as is done with CPUs and disks
  – traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
  – some network path characteristics may also be specified – e.g., diversity
  – available in a Web Services / Grid Services paradigm
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
  – This is typically done by using a "static" routing mechanism
    • e.g., some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
  – MPLS and OpenFlow are examples of this, and both can transport IP packets
  – Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
  – The virtual circuits can be directed to specific physical network paths when they are set up
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead, Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references
• OSCARS received a 2013 "R&D 100" award
60
22
Transport
• The reason for TCP's sensitivity to packet loss is the slow-start and congestion avoidance algorithms that were added to TCP to prevent congestion collapse of the Internet
  – Packet loss is seen by TCP's congestion control algorithms as evidence of congestion, so they activate to slow the senders down and prevent their synchronization (which would perpetuate and amplify the congestion, leading to network throughput collapse)
  – Network link errors also cause packet loss, so these congestion avoidance algorithms come into play, with dramatic effect on throughput in the wide area network – hence the need for "error-free" paths
23
Transport: Impact of packet loss on TCP
• On a 10 Gb/s LAN path the impact of low packet loss rates is minimal
• On a 10 Gb/s WAN path the impact of low packet loss rates is enormous (~80X throughput reduction on a transatlantic path)
• Implication: error-free paths are essential for high-volume, long-distance data transfers
[Figure: Throughput vs. increasing network round-trip time (corresponding roughly to San Francisco to London) on a 10 Gb/s link with 0.0046% packet loss. Curves: Reno (measured), Reno (theory), H-TCP (measured), and no packet loss; vertical axis: throughput, 0–10,000 Mb/s. See http://fasterdata.es.net/performance-testing/perfsonar/troubleshooting/packet-loss]
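A rough way to see why loss is so damaging on long paths is the standard response-function bound for loss-based TCP (the "Mathis" model). This is an added illustration, not part of the original slide; the numbers are only an order-of-magnitude check against the plot's parameters:

    \[ \mathrm{throughput} \;\lesssim\; \frac{MSS}{RTT}\cdot\frac{1.22}{\sqrt{p}} \]

    % Example: MSS = 1460 bytes, RTT = 150 ms (roughly San Francisco-London),
    % p = 0.0046% = 4.6e-5:
    %   (1460*8 bits / 0.15 s) * 1.22 / sqrt(4.6e-5)  ≈  14 Mb/s
    % i.e. roughly three orders of magnitude below the 10 Gb/s line rate,
    % consistent with the Reno curves in the figure above.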
24
Transport: Modern TCP stack
• A modern TCP stack (the kernel implementation of the TCP protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])
  – This is done using mechanisms that more quickly increase back to full speed after an error forces a reset to low bandwidth
• "Binary Increase Congestion" (BIC) control algorithm impact: BIC reaches maximum throughput much faster than older algorithms (from Linux 2.6.19 on, the default is CUBIC, a refined version of BIC designed for high-bandwidth, long paths)
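Which congestion control algorithm a bulk-transfer host uses can be checked and selected per socket on Linux. A minimal sketch, assuming a Linux kernel with CUBIC available and Python 3.6+ (an added illustration, not from the slides):

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # Ask the kernel which congestion control algorithm this socket will use
    current = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
    print("default congestion control:", current.split(b"\0", 1)[0].decode())

    # Request a specific algorithm for this connection (it must be loaded in the
    # kernel; see /proc/sys/net/ipv4/tcp_available_congestion_control)
    try:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"cubic")
        print("now using cubic")
    except OSError:
        print("cubic not available on this host")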
25
Transport: Modern TCP stack
• Even modern TCP stacks are only of some help in the face of packet loss on a long-path, high-speed network
• For a detailed analysis of the impact of packet loss on various TCP implementations, see "An Investigation into Transport Protocols and Data Transport Applications Over High Performance Networks," chapter 8 ("Systematic Tests of New-TCP Behaviour"), by Yee-Ting Li, University College London (PhD thesis): http://www.slac.stanford.edu/~ytl/thesis.pdf
[Figure: Throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss (tail zoom, 0–1,000 Mb/s). Curves: Reno (measured), Reno (theory), H-TCP (CUBIC refinement, measured); horizontal axis: round-trip time in ms (corresponds roughly to San Francisco to London).]
26
3) Monitoring and testing
• The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction
• perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.)
• perfSONAR is a community effort to
  – define network management data exchange protocols, and
  – standardize measurement data formats, gathering, and archiving
• perfSONAR is deployed extensively throughout LHC-related networks and international networks, and at the end sites (see [fasterdata], [perfSONAR], and [NetServ])
  – There are now more than 1000 perfSONAR boxes installed in N. America and Europe
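The continuous, end-to-end testing described above is what perfSONAR automates. As a rough illustration only (not perfSONAR itself), the sketch below runs a periodic throughput test with the common iperf3 tool against an assumed test host and appends the results to a log; the hostname and log path are hypothetical:

    import json, subprocess, time

    TEST_HOST = "ps-test.example.net"   # hypothetical test endpoint
    LOG = "throughput-log.jsonl"        # hypothetical log file

    def run_test():
        # iperf3 -J emits a JSON report; -t 10 runs a 10-second test
        out = subprocess.run(["iperf3", "-c", TEST_HOST, "-t", "10", "-J"],
                             capture_output=True, text=True, check=True)
        report = json.loads(out.stdout)
        bps = report["end"]["sum_received"]["bits_per_second"]
        retrans = report["end"]["sum_sent"].get("retransmits")
        with open(LOG, "a") as f:
            f.write(json.dumps({"ts": time.time(), "bps": bps,
                                "retransmits": retrans}) + "\n")

    while True:           # one test per hour; soft errors show up as a slow decline
        run_test()
        time.sleep(3600)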
27
perfSONAR
• The test and monitor functions can detect soft errors that limit throughput and can be hard to find (hard errors / faults are easily found and corrected)
• Soft failure example: observed end-to-end performance degradation due to soft failure of a single optical line card
  [Figure: throughput in Gb/s over one month, showing normal performance, degrading performance, and recovery after repair]
• Why not just rely on SNMP interface statistics for this sort of error detection?
  – not all error conditions show up in SNMP interface statistics
  – SNMP error statistics can be very noisy
  – some devices lump different error counters into the same bucket, so it can be very challenging to figure out which errors to alarm on and which to ignore
    • though ESnet's Spectrum monitoring system attempts to apply heuristics to do this
  – many routers will silently drop packets – the only way to find that is to test through them and observe loss using devices other than the culprit device
28
perfSONAR
• The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains
• It provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe
  – ESnet has perfSONAR testers installed at every PoP and at all but the smallest user sites – Internet2 is close to the same
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages
29
4) System software evolution and optimization
• Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network
  – Host TCP tuning
  – Modern TCP stack (see above)
  – Other issues (MTU, etc.)
  – Data transfer tools and parallelism
  – Other data transfer issues (firewalls, etc.)
30
4.1) System software tuning: Host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using end-to-end
• Default TCP buffer sizes are typically much too small for today's high-speed networks
  – Until recently, default TCP send/receive buffers were typically 64 KB
  – Tuned buffer to fill a CA-to-NY 1 Gb/s path: 10 MB – 150X bigger than the default buffer size
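The buffer that "fills the path" is set by the bandwidth–delay product (BDP). A minimal sketch, assuming a Linux host; the target rate and RTT are illustrative values, not measurements from the slides:

    import socket

    RATE_BPS = 1_000_000_000      # target rate: 1 Gb/s (illustrative)
    RTT_S = 0.047                  # ~47 ms CA-to-NY round trip (illustrative)

    bdp_bytes = int(RATE_BPS / 8 * RTT_S)   # ≈ 5.9 MB; ~10 MB gives headroom
    print("bandwidth-delay product:", bdp_bytes, "bytes")

    # Request larger socket buffers for a bulk-transfer connection.
    # The kernel silently caps these at net.core.rmem_max / wmem_max, so those
    # sysctls must also be raised for the request to take full effect.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 2 * bdp_bytes)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 2 * bdp_bytes)
    print("granted receive buffer:", s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))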
31
System software tuning: Host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications
  – How to tune is a function of the application and the path to the destination, so potentially a lot of special cases
• Auto-tuning the TCP connection buffer size within pre-configured limits helps
• Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths
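A quick way to see whether a Linux host's auto-tuning ceiling is adequate is to compare the maximum values in tcp_rmem/tcp_wmem with the path's bandwidth–delay product. A sketch, assuming Linux /proc paths; the 10 Gb/s / 150 ms example path is illustrative:

    def max_autotune_bytes(path):
        # /proc/sys/net/ipv4/tcp_rmem and tcp_wmem hold "min default max" in bytes
        with open(path) as f:
            return int(f.read().split()[2])

    rmem_max = max_autotune_bytes("/proc/sys/net/ipv4/tcp_rmem")
    wmem_max = max_autotune_bytes("/proc/sys/net/ipv4/tcp_wmem")

    # Example requirement: 10 Gb/s over a 150 ms (trans-Atlantic) path
    bdp = int(10e9 / 8 * 0.150)    # ≈ 187 MB
    for name, limit in (("tcp_rmem", rmem_max), ("tcp_wmem", wmem_max)):
        status = "OK" if limit >= bdp else "too small for this path"
        print(f"{name} max = {limit} bytes ({status}; need ~{bdp})")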
32
System software tuning: Host tuning – TCP
[Figure: Throughput out to ~9000 km path length (round-trip times corresponding roughly to San Francisco to London) on a 10 Gb/s network, comparing a 32 MByte auto-tuned TCP window with a 64 MByte hand-tuned window; vertical axis: throughput, 0–10,000 Mb/s.]
33
4.2) System software tuning: Data transfer tools
• Parallelism is key in data transfer tools (a minimal sketch follows this list)
  – It is much easier to achieve a given performance level with multiple parallel connections than with one connection, because the OS is very good at managing multiple threads and less good at sustained, maximum performance of a single thread (the same is true for disks)
  – Several tools offer parallel transfers (see below)
• Latency tolerance is critical
  – Wide area data transfers have much higher latency than LAN transfers
  – Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds); for example, SCP/SFTP and the HPSS mover protocols work very poorly in long-path networks
• Disk performance
  – In general, a RAID array or parallel disks (as in FDT) are needed to get more than about 500 Mb/s
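A minimal sketch of the parallel-streams idea: split a file into byte ranges and fetch each range on its own connection. The server URL, chunk size, and the assumption that the server honors HTTP Range requests are all illustrative, not from the slides:

    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    URL = "http://dtn.example.org/dataset.tar"   # hypothetical data transfer node
    STREAMS = 8
    CHUNK = 256 * 1024 * 1024                     # 256 MB per range request

    def fetch_range(i):
        start, end = i * CHUNK, (i + 1) * CHUNK - 1
        req = urllib.request.Request(URL, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp, open(f"part-{i:04d}", "wb") as out:
            out.write(resp.read())
        return i

    # Each worker drives its own TCP connection, so a loss event on one stream
    # only throttles that stream rather than the whole transfer.
    with ThreadPoolExecutor(max_workers=STREAMS) as pool:
        for part in pool.map(fetch_range, range(STREAMS)):
            print("finished part", part)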
34
System software tuning: Data transfer tools
• Using the right tool is very important
• Sample results, Berkeley, CA to Argonne, IL (RTT = 53 ms, network capacity = 10 Gb/s):
  – scp: 140 Mb/s
  – patched scp (HPN): 1.2 Gb/s
  – ftp: 1.4 Gb/s
  – GridFTP, 4 streams: 5.4 Gb/s
  – GridFTP, 8 streams: 6.6 Gb/s
• Note that to get more than about 1 Gb/s (125 MB/s) disk to disk requires using RAID technology
• PSC (Pittsburgh Supercomputing Center) has a patch set that fixes problems with SSH: http://www.psc.edu/networking/projects/hpn-ssh
  – Significant performance increase; this helps rsync too
35
System software tuning: Data transfer tools
• Globus GridFTP is the basis of most modern high-performance data movement systems
  – Parallel streams and buffer tuning help in getting through firewalls (open ports), ssh, etc.
  – The newer Globus Online incorporates all of these and adds small-file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP
36
System software tuning: Data transfer tools
• Also see Caltech's FDT (Fast Data Transfer) approach
  – Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node
  – Explicit parallel use of multiple disks
  – Can fill 100 Gb/s paths
  – See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT
37
4.4) System software tuning: Other issues
• Firewalls are anathema to high-speed data flows
  – many firewalls can't handle >1 Gb/s flows
    • they are designed for large numbers of low-bandwidth flows
    • some firewalls even strip out TCP options that allow for TCP buffers > 64 KB
  – See Jason Zurawski's "Say Hello to your Frienemy – The Firewall"
  – Stateful firewalls have inherent problems that inhibit high throughput: http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues
  – Large MTUs (several issues)
  – NIC tuning: defaults are usually fine for 1GE, but 10GE often requires additional tuning
  – Other OS tuning knobs
  – See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])
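As a trivial example of the kind of host-side check that matters here, the interface MTU on a Linux host can be read directly; data transfer nodes on jumbo-frame paths typically want 9000 rather than the default 1500. The interface name is illustrative:

    IFACE = "eth0"   # illustrative interface name

    with open(f"/sys/class/net/{IFACE}/mtu") as f:
        mtu = int(f.read())

    print(f"{IFACE} MTU = {mtu}"
          + ("" if mtu >= 9000 else " (not configured for jumbo frames)"))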
5) Site infrastructure to support data-intensive science: The Science DMZ
• With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck
• The site network (LAN) typically provides connectivity for the local resources – compute, data, instruments, collaboration systems, etc. – needed by data-intensive science
  – Therefore a high-performance interface between the wide area network and the local area site network is critical for large-scale data movement
• Campus network infrastructure is typically not designed to handle the flows of large-scale science
  – The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows: firewalls, proxy servers, low-cost switches, and so forth, none of which will allow high-volume, high-bandwidth, long-distance data flows
39
The Science DMZ
• To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high bandwidth, large data volume, and high round-trip time (RTT) (international paths) of the wide area network (WAN) flows (see [DIS])
  – otherwise the site will impose poor performance on the entire high-speed data path, all the way back to the source
40
The Science DMZ
• The Science DMZ concept: the compute and data resources involved in data-intensive science should be deployed in a separate portion of the site network that has a different packet-forwarding path, uses WAN-like technology, and has a tailored security policy
  – Outside the site firewall – hence the term "Science DMZ"
  – With dedicated systems built and tuned for wide-area data transfer
  – With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below)
  – With a security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. hardware that supports access control lists, private address space, etc.)
• This is so important that it was a requirement for the last round of NSF CC-NIE grants
41
The Science DMZ
(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)
[Diagram: a Science DMZ at the campus/site perimeter. A WAN-capable Science DMZ router/switch hangs off the border router and provides a clean, high-bandwidth WAN data path to dedicated systems built and tuned for wide-area data transfer (high-performance Data Transfer Nodes), a computing cluster, and network monitoring and test systems, with per-service security policy control points. The rest of the campus/site LAN (and the site DMZ hosting Web, DNS, and Mail) has secured access to the Internet through the site firewall, and campus/site access to Science DMZ resources is also via the site firewall.]
42
6) Data movement and management techniques
• Automated data movement is critical for moving 500 terabytes/day between 170 international sites
  – In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery
• The filtered ATLAS data, at a rate of about 2.5 Gb/s, is sent to 10 national Tier 1 data centers
• The Tier 2 sites get a comparable amount of data from the Tier 1s
  – They host the physics groups that analyze the data and do the science
  – They provide most of the compute resources for analysis
  – They cache the data (though this is evolving toward remote I/O)
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management
  – The resources and data movement are centrally managed
  – Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations
  – The system manages tens of thousands of jobs a day
    • it coordinates data movement of hundreds of terabytes/day, and
    • it manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial
44
[Diagram: The ATLAS PanDA ("Production and Distributed Analysis") system uses distributed resources and layers of automation to manage several million jobs/day. The PanDA Server (task management) takes ATLAS production, regional production, and user/group analysis jobs into a Task Buffer (job queue); a Job Broker applies policy (job type, priority) and a Job Dispatcher sends the work out. The Distributed Data Manager (DQ2, a complex system in its own right) locates data and moves it among sites via DDM agents. The CERN Tier 0 data center holds one archival-only copy of all data; the 11 ATLAS Tier 1 data centers, scattered across Europe, North America, and Asia, in aggregate hold one copy of all data and provide the working data set for distribution to the ATLAS analysis sites (e.g. 70 Tier 2 centers in Europe, North America, and SE Asia). The operating sequence: 1) PanDA schedules jobs and initiates data movement; 2) DDM/DQ2 locates the data and moves it to sites; 3) a "pilot" job manager (a PanDA job receiver, dispatched by the job resource manager when resources are available at a site, running under the local site job manager, e.g. Condor, LSF, LCG – similar to the Condor Glide-in approach) prepares the local resources to receive PanDA jobs; 4) jobs are dispatched when resources are available and the required data is in place at the site. The overall strategy: try to move the job to where the data is, else move data and job to where resources are available. (Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point; both are at Brookhaven National Lab.)]
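The "move the job to the data" heuristic at the heart of this kind of broker can be shown with a toy sketch; the site names, data structures, and scoring rule are invented for illustration, and PanDA's real brokerage is far more elaborate:

    # Toy job broker: prefer sites that already hold the input dataset,
    # otherwise pick the least-loaded site and schedule a data transfer first.
    sites = {
        "SITE-A": {"free_slots": 1200, "datasets": {"dataset-X"}},
        "SITE-B": {"free_slots": 300,  "datasets": set()},
        "SITE-C": {"free_slots": 800,  "datasets": set()},
    }

    def broker(dataset, sites):
        holders = [s for s, info in sites.items() if dataset in info["datasets"]]
        if holders:
            # data is already resident somewhere: run where the most slots are free
            return max(holders, key=lambda s: sites[s]["free_slots"]), []
        # otherwise move data and job to where resources are available
        site = max(sites, key=lambda s: sites[s]["free_slots"])
        return site, [f"transfer {dataset} -> {site}"]

    site, transfers = broker("dataset-X", sites)
    print("run at", site, "after", transfers or "no transfers")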
45
Scale of ATLAS analysis driven data movement
• The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBytes/day, ~68 Gb/s (730 TB/day × 8 bits / 86,400 s ≈ 68 Gb/s)
• PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, shown separately in the figure)
• It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC
[Figure: accumulated data volume on disk, rising to ~150 petabytes over four years, and the counts of the two job types (of order 50,000–100,000 simultaneous jobs each) over one year, with data movement of 730 TBytes/day.]
46
Building an LHC-scale production analysis system
• In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure
  – Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges"
  – Successful testing was required for sites to participate in LHC production
47
Ramp-up of LHC traffic in ESnet
[Figure: ESnet traffic over time, annotated with an estimate of "small"-scale traffic, the LHC data system testing period, LHC turn-on, and LHC operation. The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.]
48
6 cont) Evolution of network architectures
• For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN
  – The LHCOPN is a collection of leased 10 Gb/s optical circuits
  – The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously
    • In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community
• The issues related to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance
  – The security issues were the primary ones, and they were addressed by using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec]) – that is, only LHC data and compute servers are connected to the OPN
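The restricted-address-space policy amounts to a simple membership test: traffic is admitted to the OPN only if both endpoints belong to the designated LHC prefixes. A toy sketch of that idea, using invented (documentation-range) prefixes rather than the real LHCOPN address plan:

    import ipaddress

    # Hypothetical OPN prefixes (invented for the example)
    OPN_PREFIXES = [ipaddress.ip_network(p) for p in ("192.0.2.0/24", "198.51.100.0/24")]

    def allowed_on_opn(src, dst):
        def on_opn(addr):
            return any(ipaddress.ip_address(addr) in net for net in OPN_PREFIXES)
        return on_opn(src) and on_opn(dst)

    print(allowed_on_opn("192.0.2.17", "198.51.100.8"))   # True: both are OPN hosts
    print(allowed_on_opn("192.0.2.17", "203.0.113.9"))    # False: general traffic stays off the OPN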
50
The LHC OPN – Optical Private Network
[Diagram: LHCOPN physical topology (abbreviated) and LHCOPN architecture – CH-CERN at the center, connected to the Tier 1 centers: UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network
• N.B.: in 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays
  – The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact this is what ESnet's OSCARS virtual circuit system was originally designed for (see below)
  – However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
• The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic
  – In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
  – (there are about 170 Tier 2 sites)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose
53
The LHC's Open Network Environment – LHCONE
• LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
• The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to the subnets that are used by LHC systems
  – The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
  – The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
• In this way the LHC traffic will use circuits designated by the network engineers
  – To ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC
54
[Map: LHCONE – a global infrastructure for LHC Tier 1 data center to Tier 2 analysis center connectivity (April 2012). LHCONE VRF domains (ESnet and Internet2 in the USA, CANARIE in Canada, GÉANT in Europe, NORDUnet, DFN, GARR, RedIRIS, SARA, RENATER, TWAREN and ASGC in Taiwan, KERONET2 and KISTI in Korea, CUDI in Mexico, etc.) interconnect end sites – LHC Tier 2 or Tier 3 centers unless indicated as Tier 1 (e.g. BNL-T1, TRIUMF-T1, ASGC-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, FNAL-T1, CERN-T1) – via regional R&E communication nexuses and data communication links of 10, 20, and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because
  – the VRF technology is a standard capability in most core routers, and
  – there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance
• See lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
A Network Centric View of the LHC:
[Diagram: the detectors (~1 PB/s raw output at O(1-10) m) feed the Level 1 and 2 triggers (O(10-100) m) and the Level 3 trigger (O(1) km), which deliver roughly 5.0 Gb/s of filtered data (2.5 Gb/s ATLAS, 2.5 Gb/s CMS) to the CERN computer center. The LHC Optical Private Network (LHCOPN) carries this data 500–10,000 km to the LHC Tier 1 data centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN), and the LHC Open Network Environment (LHCONE) connects the Tier 1s to the LHC Tier 2 analysis centers and university physics groups – the intent being that the physics groups now get their data wherever it is most readily available.]

Approximate CERN → Tier 1 distances:
  France: 350 miles / 565 km
  Italy: 570 miles / 920 km
  UK: 625 miles / 1000 km
  Netherlands: 625 miles / 1000 km
  Germany: 700 miles / 1185 km
  Spain: 850 miles / 1400 km
  Nordic: 1300 miles / 2100 km
  USA – New York: 3900 miles / 6300 km
  USA – Chicago: 4400 miles / 7100 km
  Canada – BC: 5200 miles / 8400 km
  Taiwan: 6100 miles / 9850 km
57
7) New network services: Point-to-Point Virtual Circuit Service
• Why a circuit service? Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed application systems in order to
  – couple existing pockets of code, data, and expertise into "systems of systems"
  – break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
  – see https://www.es.net/about/science-requirements/
• A commonly identified need to support this is that networking must be provided as a "service":
  – schedulable, with guaranteed bandwidth – as is done with CPUs and disks
  – traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
  – some network path characteristics may also be specified – e.g. diversity
  – available in the Web Services / Grid Services paradigm
58
Point-to-Point Virtual Circuit Service
• The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
  – This is typically done by using a "static" routing mechanism, e.g. some variation of label-based switching with the static switch tables set up in advance to define the circuit path
  – MPLS and OpenFlow are examples of this, and both can transport IP packets
  – Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
  – The virtual circuits can be directed to specific physical network paths when they are set up
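To make the "network as a schedulable service" idea concrete, a reservation request carries roughly the following information. This is only an illustrative sketch with invented field names and identifiers – it is not the actual OSCARS or NSI schema:

    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class CircuitRequest:                 # hypothetical request object
        src_endpoint: str                 # ingress port/VLAN at one site
        dst_endpoint: str                 # egress port/VLAN at the far-end site
        bandwidth_mbps: int               # guaranteed bandwidth to reserve
        start: datetime                   # reservation window, like a batch-job allocation
        end: datetime
        path_constraint: str = "any"      # e.g. "diverse-from:<circuit-id>"

    req = CircuitRequest(
        src_endpoint="domainA:site-1:port-7:vlan-3017",   # illustrative identifiers
        dst_endpoint="domainB:site-2:port-3:vlan-3017",
        bandwidth_mbps=5000,
        start=datetime.utcnow() + timedelta(hours=1),
        end=datetime.utcnow() + timedelta(hours=13),
    )
    print("would submit reservation:", req)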
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead, Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference, 2011, in the references
• OSCARS received a 2013 "R&D 100" award
60
End User View of Circuits – How They Use Them
• Who are the "users"? Sites, for the most part
• How are the circuits used?
  – End system to end system, IP:
    • almost never – very hard unless private address space is used
    • using public address space can result in leaking routes
    • using private address space with multi-homed hosts risks allowing backdoors into secure networks
  – End system to end system, Ethernet (or other) over VLAN – a pseudowire:
    • relatively common
    • interesting example: RDMA over VLAN is likely to be popular in the future
      – the SC11 demo of 40G RDMA over the WAN was very successful
      – CPU load for RDMA is a small fraction of that for IP
      – the guaranteed characteristics of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fit nicely with circuit services (RDMA performs very poorly on best-effort networks)
  – Point-to-point connection between routing instances – e.g. BGP at the end points:
    • essentially this is how all current circuits are used, from one site router to another site router
    • typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters
61
End User View of Circuits – How They Use Them
• When are the circuits used?
  – Mostly to solve a specific problem that the general infrastructure cannot
  – Most circuits are used for a guarantee of bandwidth or for user traffic engineering

Cross-Domain Virtual Circuit Service
• Large-scale science collaborations span many network domains (administrative units)
  – For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration, to provide end-to-end circuits
  – e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains
63
Inter-Domain Control Protocol
• There are two realms involved:
  1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
  2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains
[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] across ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local inter-domain controller (IDC) – OSCARS in ESnet, AutoBAHN in GÉANT – with a data plane connection helper at each domain ingress/egress point; topology exchange and VC setup requests pass from domain to domain.]
• How it works:
  1. The domains exchange topology information containing at least the potential VC ingress and egress points
  2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved
  3. The data plane connection (e.g. an Ethernet VLAN-to-VLAN connection) is facilitated by a helper process
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
  – Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grid framework, so that computing, data access, and data movement can all work together as a predictable system
• Multi-domain circuit setup is not yet a robust production service, but progress is being made
  – See lhcone.net
65
8) Provide R&D, consulting, and a knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
  – With each generation of network transport technology:
    • 155 Mb/s was the norm for high speed networks in 1995
    • 100 Gb/s – 650 times greater – is the norm today
  – R&D groups involving hardware engineers, computer scientists, and application specialists worked to
    • first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
    • and then do the development necessary for applications to make use of the new capabilities
  – Examples of how this methodology drove toward today's capabilities include:
    • experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
    • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
66
Provide R&D, consulting, and a knowledge base
• Providing consulting on the problems that data-intensive projects are having in effectively using the network is critical
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The Message
• A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments
• But once this is done, international high-speed data management can be done on a routine basis
• Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
69
Infrastructure Critical to Science
• The combination of
  – new network architectures in the wide area
  – new network services (such as guaranteed-bandwidth virtual circuits)
  – cross-domain network error detection and correction
  – redesigning the site LAN to handle high data throughput
  – automation of data movement systems
  – use of appropriate operating system tuning and data transfer tools
  now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKA
• The LHC model of distributed data (multiple regional centers) has worked well
  – It decentralizes costs and involves many countries directly in the telescope infrastructure
  – It divides up the network load, especially on the expensive trans-ocean links
  – It divides up the cache I/O load across distributed sites
73
LHC lessons of possible use to the SKA
• Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply
  – There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
    • It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1; this choice is a cost and engineering issue
    • In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
  – If there are many science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
    • In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The Message
• Again... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments
• But once this is done, international high-speed data management can be done on a routine basis
• Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://www.es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://www.es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://www.es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://www.es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://www.es.net/news-and-publications/publications-and-presentations/
Also see http://www.perfsonar.net and http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
23
Transport Impact of packet loss on TCPOn a 10 Gbs LAN path the impact of low packet loss rates is
minimalOn a 10Gbs WAN path the impact of low packet loss rates is
enormous (~80X throughput reduction on transatlantic path)
Implications Error-free paths are essential for high-volume long-distance data transfers
Throughput vs increasing latency on a 10Gbs link with 00046 packet loss
Reno (measured)
Reno (theory)
H-TCP(measured)
No packet loss
(see httpfasterdataesnetperformance-testingperfso
nartroubleshootingpacket-loss)
Network round trip time ms (corresponds roughly to San Francisco to London)
10000
9000
8000
7000
6000
5000
4000
3000
2000
1000
0
Thro
ughp
ut M
bs
24
Transport Modern TCP stackbull A modern TCP stack (the kernel implementation of the TCP
protocol) is important to reduce the sensitivity to packet loss while still providing congestion avoidance (see [HPBulk])ndash This is done using mechanisms that more quickly increase back to full
speed after an error forces a reset to low bandwidth
ldquoBinary Increase Congestionrdquo control algorithm impact
Note that BIC reaches max throughput much faster than older algorithms (from Linux 2619 the
default is CUBIC a refined version of BIC designed for high bandwidth
long paths)
25
Transport Modern TCP stackEven modern TCP stacks are only of some help in the face of
packet loss on a long path high-speed network
bull For a detailed analysis of the impact of packet loss on various TCP implementations see ldquoAn Investigation into Transport Protocols and Data Transport Applications Over High Performance Networksrdquo chapter 8 (ldquoSystematic Tests of New-TCP Behaviourrdquo) by Yee-Ting Li University College London (PhD thesis) httpwwwslacstanfordedu~ytlthesispdf
Reno (measured)
Reno (theory)
H-TCP (CUBIC refinement)(measured)
Throughput vs increasing latency on a 10Gbs link with 00046 packet loss(tail zoom)
Roundtrip time ms (corresponds roughly to San Francisco to London)
1000
900800700600500400300200100
0
Thro
ughp
ut M
bs
26
3) Monitoring and testingThe only way to keep multi-domain international scale networks error-free is to test and monitor continuously
end-to-end to detect soft errors and facilitate their isolation and correction
perfSONAR provides a standardize way to test measure export catalogue and access performance data from many different network domains (service providers campuses etc)
bull perfSONAR is a community effort tondash define network management data exchange protocols andndash standardized measurement data formats gathering and archiving
perfSONAR is deployed extensively throughout LHC related networks and international networks and at the end sites(See [fasterdata] [perfSONAR] and [NetSrv])
ndash There are now more than 1000 perfSONAR boxes installed in N America and Europe
27
perfSONARThe test and monitor functions can detect soft errors that limit
throughput and can be hard to find (hard errors faults are easily found and corrected)
Soft failure examplebull Observed end-to-end performance degradation due to soft failure of single optical line card
Gb
s
normal performance
degrading performance
repair
bull Why not just rely on ldquoSNMPrdquo interface stats for this sort of error detectionbull not all error conditions show up in SNMP interface statisticsbull SNMP error statistics can be very noisybull some devices lump different error counters into the same bucket so it can be very
challenging to figure out what errors to alarm on and what errors to ignorebull though ESnetrsquos Spectrum monitoring system attempts to apply heuristics to do this
bull many routers will silently drop packets - the only way to find that is to test through them and observe loss using devices other than the culprit device
one month
28
perfSONARThe value of perfSONAR increases dramatically as it is
deployed at more sites so that more of the end-to-end (app-to-app) path can characterized across multiple network domains provides the only widely deployed tool that can monitor circuits end-
to-end across the different networks from the US to Europendash ESnet has perfSONAR testers installed at every PoP and all but the
smallest user sites ndash Internet2 is close to the same
bull perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) and the protocol is implemented using SOAP XML messages
29
4) System software evolution and optimizationOnce the network is error-free there is still the issue of
efficiently moving data from the application running on a user system onto the network
bull Host TCP tuningbull Modern TCP stack (see above)bull Other issues (MTU etc)
bull Data transfer tools and parallelism
bull Other data transfer issues (firewalls etc)
30
41) System software tuning Host tuning ndash TCPbull ldquoTCP tuningrdquo commonly refers to the proper configuration of
TCP windowing buffers for the path lengthbull It is critical to use the optimal TCP send and receive socket
buffer sizes for the path (RTT) you are using end-to-endDefault TCP buffer sizes are typically much too small for
todayrsquos high speed networksndash Until recently default TCP sendreceive buffers were typically 64 KBndash Tuned buffer to fill CA to NY 1 Gbs path 10 MB
bull 150X bigger than the default buffer size
31
System software tuning Host tuning ndash TCPbull Historically TCP window size tuning parameters were host-
global with exceptions configured per-socket by applicationsndash How to tune is a function of the application and the path to the
destination so potentially a lot of special cases
Auto-tuning TCP connection buffer size within pre-configured limits helps
Auto-tuning however is not a panacea because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (eg international) paths
32
System software tuning Host tuning ndash TCP
Throughput out to ~9000 km on a 10Gbs network32 MBy (auto-tuned) vs 64 MBy (hand tuned) TCP window size
hand tuned to 64 MBy window
Roundtrip time ms (corresponds roughlyto San Francisco to London)
path length
10000900080007000600050004000300020001000
0
Thro
ughp
ut M
bs
auto tuned to 32 MBy window
33
42) System software tuning Data transfer toolsParallelism is key in data transfer tools
ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection
bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)
ndash Several tools offer parallel transfers (see below)
Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN
transfersndash Many tools and protocols assume latencies typical of a LAN
environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long
path networks
bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more
than about 500 Mbs
34
System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL
RTT = 53 ms network capacity = 10GbpsTool Throughput
bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology
bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase
bull this helps rsync too
35
System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-
performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open
ports) ssh etc The newer Globus Online incorporates all of these and small file
support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community
outside of HEP
36
System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach
ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node
ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and
httpmonalisacernchFDT
37
44) System software tuning Other issuesFirewalls are anathema to high-peed data flows
ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for
TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo
Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf
bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning
bull Defaults are usually fine for 1GE but 10GE often requires additional tuning
ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo
([HPBulk])
5) Site infrastructure to support data-intensive scienceThe Science DMZ
With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the
bottleneckThe site network (LAN) typically provides connectivity for local
resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network
and the local area site network is critical for large-scale data movement
Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks
for business and small data-flow purposes usually donrsquot work for large-scale data flows
bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data
flows
39
The Science DMZTo provide high data-rate access to local resources the site
LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high
speed data path all the way back to the source
40
The Science DMZThe ScienceDMZ concept
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and
rapid fault isolation typically perfSONAR (see [perfSONAR] and below)
A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)
This is so important it was a requirement for last round of NSF CC-NIE grants
41
The Science DMZ
(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)
[Diagram: a Science DMZ architecture. The border router connects the WAN to a Science DMZ router/switch – a WAN-capable device – which provides a clean, high-bandwidth WAN data path to a high performance Data Transfer Node, a computing cluster, and network monitoring and testing systems, with per-service security policy control points; these are dedicated systems built and tuned for wide-area data transfer. Campus/site access to the Science DMZ resources, and secured campus/site access to the Internet (including the site DMZ with Web, DNS, and Mail), is via the site firewall and the campus/site LAN.]
42
6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites
In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery (a minimal sketch of the idea follows below)
• The filtered ATLAS data rate of about 2.5 Gb/s is sent to 10 national Tier 1 data centers
• The Tier 2 sites get a comparable amount of data from the Tier 1s
– Host the physics groups that analyze the data and do the science
– Provide most of the compute resources for analysis
– Cache the data (though this is evolving to remote I/O)
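A small, hypothetical sketch of the automation idea referred to above: a queue of transfers is worked through with retries, backoff, and logging, so that transient errors do not require human intervention. The transfer_file() stand-in and the site paths are invented; a production system (such as the ATLAS machinery described on the next slide) is of course far more elaborate.

    import logging, random, time

    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")

    def transfer_file(src, dst):
        # Placeholder for a real mover (e.g. a GridFTP or FDT invocation); True on success.
        return random.random() > 0.2          # simulate occasional transient failures

    def move_with_recovery(queue, max_retries=5):
        failed = []
        for src, dst in queue:
            for attempt in range(1, max_retries + 1):
                if transfer_file(src, dst):
                    logging.info("ok      %s -> %s", src, dst)
                    break
                wait = min(2 ** attempt, 8)    # exponential backoff, capped
                logging.warning("retry %d %s -> %s (waiting %ds)", attempt, src, dst, wait)
                time.sleep(wait)
            else:
                failed.append((src, dst))      # exhausted retries; flag for operators
        return failed

    if __name__ == "__main__":
        work = [(f"tier1:/data/file{i:03d}", f"tier2:/cache/file{i:03d}") for i in range(5)]
        print("unrecovered:", move_with_recovery(work))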
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management
– The resources and data movement are centrally managed
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations (a toy sketch of this brokering follows below)
– The system manages 10s of thousands of jobs a day
  • coordinates data movement of hundreds of terabytes/day, and
  • manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial
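A toy sketch of the brokering policy described above (and in the diagram that follows): prefer a site that already holds the dataset and has free capacity, otherwise send the job to the site with the most free capacity and schedule a data replication. This is invented for illustration only; the site names and capacities are made up, and PanDA's real brokerage weighs many more factors.

    from dataclasses import dataclass, field

    @dataclass
    class Site:
        name: str
        free_slots: int
        datasets: set = field(default_factory=set)

    def broker(job_dataset, sites):
        # 1) Try to move the job to where the data already is.
        with_data = [s for s in sites if job_dataset in s.datasets and s.free_slots > 0]
        if with_data:
            site = max(with_data, key=lambda s: s.free_slots)
            return site.name, None
        # 2) Otherwise move both data and job to where resources are available.
        site = max(sites, key=lambda s: s.free_slots)
        return site.name, f"replicate {job_dataset} -> {site.name}"

    if __name__ == "__main__":
        sites = [Site("BNL", 120, {"dsA"}), Site("DESY", 300, {"dsB"}), Site("TRIUMF", 10, {"dsA"})]
        print(broker("dsA", sites))   # ('BNL', None): data already at BNL, which has free slots
        print(broker("dsC", sites))   # ('DESY', 'replicate dsC -> DESY'): no site holds dsC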
44
[Diagram: the ATLAS PanDA "Production and Distributed Analysis" system, which uses distributed resources and layers of automation to manage several million jobs/day. The PanDA Server (task management) takes ATLAS production, regional production, and user/group analysis jobs into a Task Buffer (job queue); a Job Broker applies policy (job type, priority) and site status information, and a Job Dispatcher sends work to sites. Guiding principle: try to move the job to where the data is, else move data and job to where resources are available.
• Job resource manager: dispatch a "pilot" job manager – a PanDA job receiver – when resources are available at a site; pilots run under the local site job manager (e.g. Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA; this is similar to the Condor Glide-in approach.
• Data side: the CERN ATLAS detector feeds the Tier 0 Data Center (1 copy of all data – archival only); the 11 ATLAS Tier 1 data centers, scattered across Europe, North America, and Asia, in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis; the ATLAS analysis sites are, e.g., the 70 Tier 2 centers in Europe, North America, and SE Asia, each running DDM agents.
Operation: 1) PanDA schedules jobs and initiates data movement; 2) the Distributed Data Manager (DDM) locates data and moves it to sites – this is a complex system in its own right, called DQ2; 3) pilot jobs prepare the local resources to receive PanDA jobs; 4) jobs are dispatched when there are resources available and when the required data is in place at the site.
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBytes/day, ~68 Gb/s
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately here)
It is this scale of data movement – going on 24 hr/day, 9+ months/yr – that networks must support in order to enable the large-scale science of the LHC
[Charts: accumulated data volume on disk, rising to ~150 petabytes over four years at 730 TBytes/day, and the two types of PanDA jobs, each plotted over one year at roughly 50,000–100,000 simultaneous jobs.]
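The 730 TBytes/day and ~68 Gb/s figures quoted above are two statements of the same average rate, as a quick check shows (decimal terabytes are assumed here):

    # 730 TBytes/day expressed as an average bit rate.
    tbytes_per_day = 730
    bits_per_day = tbytes_per_day * 1e12 * 8         # decimal TB -> bits
    print(f"{bits_per_day / 86400 / 1e9:.1f} Gb/s")  # -> 67.6 Gb/s, i.e. ~68 Gb/s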
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges"
– Successful testing was required for sites to participate in LHC production
47
Ramp-up of LHC traffic in ESnet
[Chart: ESnet traffic growth, annotated with the estimate of "small" scale traffic, the LHC turn-on, the LHC data system testing period, and LHC operation.]
The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years
48
6 cont.) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN
– The LHCOPN is a collection of leased 10 Gb/s optical circuits
– The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community
• The issues related to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance
– The security issues were the primary ones, and were addressed by using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec])
– that is, only LHC data and compute servers are connected to the OPN
50
The LHC OPN – Optical Private Network
[Diagram: the LHCOPN physical topology (abbreviated) and the LHCOPN architecture – CH-CERN at the hub, connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network
N.B.:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact this is what ESnet's OSCARS virtual circuit system was originally designed for (see below)
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
– (there are about 170 Tier 2 sites)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineers
– To ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC
54
LHCONE: A global infrastructure for the LHC Tier 1 data center – Tier 2 analysis center connectivity (April 2012)
[Map: the LHCONE VRF domains – ESnet (USA), Internet2 (USA), CANARIE (Canada), GÉANT (Europe), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), CUDI (Mexico), TWAREN and ASGC (Taiwan), KREONET2 and KISTI (Korea), TIFR (India), and CERN (Geneva) – interconnected at regional R&E communication nexus points such as Seattle, Chicago, New York, Washington, Amsterdam, and Geneva. End sites are LHC Tier 2 or Tier 3 centers unless indicated as Tier 1 (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, ASGC-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, CERN-T1). Data communication links are 10, 20, and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance
• See lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC

Distance from CERN to the Tier 1 centers:
  CERN → T1           miles     km
  France                350     565
  Italy                 570     920
  UK                    625    1000
  Netherlands           625    1000
  Germany               700    1185
  Spain                 850    1400
  Nordic               1300    2100
  USA – New York       3900    6300
  USA – Chicago        4400    7100
  Canada – BC          5200    8400
  Taiwan               6100    9850

[Diagram: "A Network Centric View of the LHC." The detector (1 PB/s) feeds the Level 1 and 2 triggers over O(1-10) meters, then the Level 3 trigger over O(10-100) meters, then the CERN Computer Center at O(1) km, which exports ~5.0 Gb/s (2.5 Gb/s ATLAS, 2.5 Gb/s CMS) over the LHC Optical Private Network (LHCOPN), spanning 500-10,000 km, to the LHC Tier 1 Data Centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN). The LHC Open Network Environment (LHCONE) connects the Tier 1 centers to the LHC Tier 2 Analysis Centers and the many university physics groups; this is intended to indicate that the physics groups now get their data wherever it is most readily available.]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed application systems in order to
– couple existing pockets of code, data, and expertise into "systems of systems"
– break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
– see https://www.es.net/about/science-requirements/
• A commonly identified need to support this is that networking must be provided as a "service"
– Schedulable with guaranteed bandwidth – as is done with CPUs and disks
– Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
– Some network path characteristics may also be specified – e.g. diversity
– Available in the Web Services / Grid Services paradigm
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
– This is typically done by using a "static" routing mechanism
  • e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path (a toy illustration follows below)
– MPLS and OpenFlow are examples of this, and both can transport IP packets
– Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage and optimize the use of available network resources and to keep big data flows separate from general traffic
– The virtual circuits can be directed to specific physical network paths when they are set up
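The following toy sketch (invented for illustration; it is not MPLS or OpenFlow code, and the router names and labels are made up) shows the essential idea of a label-switched circuit: each switch holds a static table, installed in advance, that maps an incoming label to an outgoing port and label, so every packet carrying that label follows the same pre-defined path.

    # Toy label-switched path: static tables define the circuit before any traffic flows.
    # Per-switch table: {in_label: (out_port, out_label)}
    switch_tables = {
        "rtr-A": {17: ("port3", 42)},
        "rtr-B": {42: ("port1", 99)},
        "rtr-C": {99: ("port7", None)},   # None: pop the label, deliver locally
    }
    path_order = ["rtr-A", "rtr-B", "rtr-C"]   # assumed circuit path

    def forward(label):
        # Follow a packet's label through the pre-installed circuit.
        hops = []
        for switch in path_order:
            out_port, out_label = switch_tables[switch][label]
            hops.append((switch, out_port))
            if out_label is None:
                break
            label = out_label
        return hops

    print(forward(17))
    # [('rtr-A', 'port3'), ('rtr-B', 'port1'), ('rtr-C', 'port7')]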
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead, Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references
• OSCARS received a 2013 "R&D 100" award
60
End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part
• How are the circuits used?
– End system to end system IP
  • Almost never – very hard unless private address space is used
  – Using public address space can result in leaking routes
  – Using private address space with multi-homed hosts risks allowing backdoors into secure networks
– End system to end system Ethernet (or other) over VLAN – a pseudowire
  • Relatively common
  • Interesting example: RDMA over VLAN, likely to be popular in the future
  – The SC11 demo of 40G RDMA over the WAN was very successful
  – CPU load for RDMA is a small fraction of that for IP
  – The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks)
– Point-to-point connection between routing instances – e.g. BGP at the end points
  • Essentially this is how all current circuits are used: from one site router to another site router
  – Typically site-to-site, or to advertise subnets that host clusters, e.g. LHC analysis or data management clusters
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering

Cross-Domain Virtual Circuit Service
Large-scale science collaborations span many network domains (administrative units)
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration, to provide end-to-end circuits
– e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains
[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US], across ESnet (AS293) [US], GÉANT (AS20965) [Europe], and DFN (AS680) [Germany], to a user destination at DESY (AS1754) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GÉANT – and a data plane connection helper at each domain ingress/egress point.
1. The domains exchange topology information containing at least the potential VC ingress and egress points
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved
3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process]
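To illustrate step 2 – a setup request passed from domain to domain, with each domain reserving its own segment – here is a minimal sketch; the domain names and capacities are assumptions, and real controllers (OSCARS, AutoBAHN) negotiate much more (authorization, scheduling windows, topology).

    # Sketch: chained inter-domain VC setup with rollback if any segment fails.
    # Each "controller" manages only its own domain's capacity.
    class DomainController:
        def __init__(self, name, free_gbps):
            self.name, self.free_gbps = name, free_gbps

        def reserve(self, gbps):
            if gbps <= self.free_gbps:
                self.free_gbps -= gbps
                return True
            return False

        def release(self, gbps):
            self.free_gbps += gbps

    def setup_circuit(domains, gbps):
        committed = []
        for dc in domains:                       # pass the request domain to domain
            if dc.reserve(gbps):
                committed.append(dc)
            else:                                # a segment failed: roll back the others
                for done in committed:
                    done.release(gbps)
                return f"FAILED at {dc.name}"
        return "circuit up: " + " -> ".join(dc.name for dc in domains)

    path = [DomainController("ESnet", 40), DomainController("GEANT", 20), DomainController("DFN", 10)]
    print(setup_circuit(path, 10))   # circuit up: ESnet -> GEANT -> DFN
    print(setup_circuit(path, 15))   # FAILED at GEANT (insufficient remaining capacity)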
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
– Testing is being coordinated in GLIF (the Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system
• Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net
65
8) Provide R&D, consulting, and a knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
– With each generation of network transport technology:
  • 155 Mb/s was the norm for high speed networks in 1995
  • 100 Gb/s – 650 times greater – is the norm today
  • R&D groups involving hardware engineers, computer scientists, and application specialists worked
    – first to demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
    – and then to do the development necessary for applications to make use of the new capabilities
– Examples of how this methodology drove toward today's capabilities include:
  • experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
  • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
66
Provide R&D, consulting, and a knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations
67
The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
  • Linux TCP Tuning
  • Cisco 6509 Tuning
  • perfSONAR Howto
  • Active perfSONAR Services
  • Globus overview
  • Say No to SCP
  • Data Transfer Nodes (DTN)
  • TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
69
Infrastructure Critical to Science
• The combination of
– new network architectures in the wide area
– new network services (such as guaranteed bandwidth virtual circuits)
– cross-domain network error detection and correction
– redesigning the site LAN to handle high data throughput
– automation of data movement systems
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKA
The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
– A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository
  • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
  • high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
militate against a single large data center
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply
There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
• It might be that in the case of the SKA the T1 links would come to a centralized, data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
– In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, e.g., are implementing LHCONE
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded
– All high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (the worked example below shows why even a tiny loss rate matters on long paths)
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
Re-engineering the site LAN/WAN architecture is critical: the Science DMZ
Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
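As a rough, back-of-the-envelope illustration of the "fragile workhorse" point, the widely used Mathis et al. approximation bounds standard TCP throughput by MSS/(RTT·√loss); the numbers below (1500-byte packets, a 0.001% loss rate, and a comparison of a short path with a 150 ms trans-ocean round trip) are assumptions chosen only to show the order of magnitude.

    # Mathis et al. approximation: TCP throughput <= MSS / (RTT * sqrt(loss)).
    # Shows why long-RTT paths must be kept essentially error-free.
    from math import sqrt

    def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
        return (mss_bytes * 8) / (rtt_s * sqrt(loss_rate))

    mss = 1460                      # payload bytes in a 1500-byte packet
    for rtt_ms in (10, 150):        # LAN-ish vs. trans-ocean round trip times
        bps = mathis_throughput_bps(mss, rtt_ms / 1000, 1e-5)   # 0.001% packet loss
        print(f"RTT {rtt_ms:3d} ms: ~{bps / 1e6:.0f} Mb/s")
    # RTT  10 ms: ~369 Mb/s
    # RTT 150 ms: ~25 Mb/s  -- far below what a 10 Gb/s path can carry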
75
The Message
Again: a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more/
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C., Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach
ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node
ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and
httpmonalisacernchFDT
37
44) System software tuning Other issuesFirewalls are anathema to high-peed data flows
ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for
TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo
Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf
bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning
bull Defaults are usually fine for 1GE but 10GE often requires additional tuning
ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo
([HPBulk])
5) Site infrastructure to support data-intensive scienceThe Science DMZ
With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the
bottleneckThe site network (LAN) typically provides connectivity for local
resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network
and the local area site network is critical for large-scale data movement
Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks
for business and small data-flow purposes usually donrsquot work for large-scale data flows
bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data
flows
39
The Science DMZTo provide high data-rate access to local resources the site
LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high
speed data path all the way back to the source
40
The Science DMZThe ScienceDMZ concept
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and
rapid fault isolation typically perfSONAR (see [perfSONAR] and below)
A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)
This is so important it was a requirement for last round of NSF CC-NIE grants
41
The Science DMZ
(See httpfasterdataesnetscience-dmz
and [SDMZ] for a much more complete
discussion of the various approaches)
campus siteLAN
high performanceData Transfer Node
computing cluster
cleanhigh-bandwidthWAN data path
campussiteaccess to
Science DMZresources is via the site firewall
secured campussiteaccess to Internet
border routerWAN
Science DMZrouterswitch
campus site
Science DMZ
Site DMZ WebDNS
Mail
network monitoring and testing
A WAN-capable device
per-servicesecurity policycontrol points
site firewall
dedicated systems built and
tuned for wide-area data transfer
42
6) Data movement and management techniquesAutomated data movement is critical for moving 500
terabytesday between 170 international sites In order to effectively move large amounts of data over the
network automated systems must be used to manage workflow and error recovery
bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers
bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)
43
Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the
analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates
compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day
bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10
petabytes of datayear in order to accomplish its science
bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial
44
DDMAgent
DDMAgent
ATLAS production
jobs
Regional production
jobs
User Group analysis jobs
Data Service
Task Buffer(job queue)
Job Dispatcher
PanDA Server(task management)
Job Broker
Policy(job type priority)
ATLA
S Ti
er 1
Data
Cen
ters
11 s
ites
scat
tere
d ac
ross
Euro
pe N
orth
Am
erica
and
Asia
in
aggr
egat
e ho
ld 1
copy
of a
ll dat
a an
d pr
ovide
the
work
ing
data
set f
or d
istrib
ution
to T
ier 2
cen
ters
for a
nalys
isDistributed
Data Manager
Pilot Job(Panda job
receiver running under the site-
specific job manager)
Grid Scheduler
Site Capability Service
CERNATLAS detector
Tier 0 Data Center(1 copy of all data ndash
archival only)
Job resource managerbull Dispatch a ldquopilotrdquo job manager - a
Panda job receiver - when resources are available at a site
bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA
bull Similar to the Condor Glide-in approach
Site status
ATLAS analysis sites(eg 70 Tier 2 Centers in
Europe North America and SE Asia)
DDMAgent
DDMAgent
1) Schedules jobs initiates data movement
2) DDM locates data and moves it to sites
This is a complex system in its own right called DQ2
3) Prepares the local resources to receive Panda jobs
4) Jobs are dispatched when there are resources available and when the required data is
in place at the site
Thanks to Michael Ernst US ATLAS technical lead for his assistance with this
diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)
The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday
CERN
Try to move the job to where the data is else move data and job to where
resources are available
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe N America and SE Asia generate network
data movement of 730 TByday ~68Gbs
Accumulated data volume on disk
730 TBytesday
PanDA manages 120000ndash140000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movementgoing on 24 hrday 9+ monthsyr
that networks must support in order to enable the large-scale science of the LHC
0
50
100
150
Peta
byte
s
four years
0
50000
100000
type
2jo
bs
0
50000
type
1jo
bs
one year
one year
one year
46
Building an LHC-scale production analysis system In order to debug and optimize the distributed system that
accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in
ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC
production
47
Ramp-up of LHC traffic in ESnet
(est of ldquosmallrdquo scale traffic)
LHC
turn
-on
LHC data systemtesting
LHC operationThe transition from testing to operation
was a smooth continuum due toat-scale testing ndash a process that took
more than 5 years
48
6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument
to data centers ndash a dedicated purpose-built infrastructure is needed
bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to
the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the
Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward
exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community
bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by
bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
ndash that is only LHC data and compute servers are connected to the OPN
50
The LHC OPN ndash Optical Private Network
UK-T1_RAL
NDGF
FR-CCIN2P3
ES-PIC
DE-KIT
NL-T1
US-FNAL-CMS
US-T1-BNL
CA-TRIUMF
TW-ASCG
IT-NFN-CNAF
CH-CERNLHCOPN physical (abbreviated)
LHCOPN architecture
51
The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1
centers data transfer was to use dedicated physical 10G circuits
Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than
5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)
ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)
bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism
bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose
53
The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure
designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each
involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc
ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure
that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC
54
ESnetUSA
Chicago
New YorkBNL-T1
Internet2USA
Harvard
CANARIECanada
UVic
SimFraU
TRIUMF-T1
UAlb UTorMcGilU
Seattle
TWARENTaiwan
NCU NTU
ASGCTaiwan
ASGC-T1
KERONET2Korea
KNU
LHCONE VRF domain
End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1
Regional RampE communication nexus
Data communication links 10 20 and 30 Gbs
See httplhconenet for details
NTU
Chicago
LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity
NORDUnetNordic
NDGF-T1aNDGF-T1a NDGF-T1c
DFNGermany
DESYGSI DE-KIT-T1
GARRItaly
INFN-Nap CNAF-T1
RedIRISSpain
PIC-T1
SARANetherlands
NIKHEF-T1
RENATERFrance
GRIF-IN2P3
Washington
CUDIMexico
UNAM
CC-IN2P3-T1Sub-IN2P3
CEA
CERNGeneva
CERN-T1
SLAC
GLakes
NE
MidWSoW
Geneva
KISTIKorea
TIFRIndia
India
Korea
FNAL-T1
MIT
CaltechUFlorida
UNebPurU
UCSDUWisc
UltraLightUMich
Amsterdam
GEacuteANT Europe
April 2012
55
The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because
ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for
use by the LHC collaboration that cannot be made available for general RampE traffic
bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance
bull See LHCONEnet
LHCONE is one part of the network infrastructure that supports the LHC
CERN rarrT1 miles kms
France 350 565
Italy 570 920
UK 625 1000
Netherlands 625 1000
Germany 700 1185
Spain 850 1400
Nordic 1300 2100
USA ndash New York 3900 6300
USA - Chicago 4400 7100
Canada ndash BC 5200 8400
Taiwan 6100 9850
CERN Computer Center
The LHC Optical Private Network
(LHCOPN)
LHC Tier 1Data Centers
LHC Tier 2 Analysis Centers
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups Universities
physicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
The LHC Open Network
Environment(LHCONE)
50 Gbs (25Gbs ATLAS 25Gbs CMS)
detector
Level 1 and 2 triggers
Level 3 trigger
O(1-10) meter
O(10-100) meters
O(1) km
1 PBs
500-10000 km
This is intended to indicate that the physics
groups now get their datawherever it is most readily
available
A Network Centric View of the LHC
Taiwan Canada USA-Atlas USA-CMS
Nordic
UK
Netherlands Germany Italy
Spain
FranceCERN
57
7) New network servicesPoint-to-Point Virtual Circuit Service
Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly
consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of
systemsrdquondash Break up the task of massive data analysis and use data compute and
storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements
A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work
well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm
58
Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual
circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism
bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path
ndash MPLS and OpenFlow are examples of this and both can transport IP packets
ndash Most modern Internet routers have this type of functionality
bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths
when they are set up
59
Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service
(For more information contact the project lead Chin Guok chinesnet)
bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references
bull OSCARS received a 2013 ldquoRampD 100rdquo award
60
End User View of Circuits – How They Use Them
• Who are the "users"?
  – Sites, for the most part
• How are the circuits used?
  – End system to end system, IP:
    • Almost never – very hard unless private address space is used
      – Using public address space can result in leaking routes
      – Using private address space with multi-homed hosts risks allowing backdoors into secure networks
  – End system to end system, Ethernet (or other) over VLAN – a pseudowire:
    • Relatively common
    • Interesting example: RDMA over VLAN, likely to be popular in the future
      – The SC11 demo of 40G RDMA over the WAN was very successful
      – CPU load for RDMA is a small fraction of that for IP
      – The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks)
  – Point-to-point connection between routing instances – e.g. BGP at the end points:
    • Essentially this is how all current circuits are used: from one site router to another site router
    • Typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters
61
End User View of Circuits – How They Use Them
• When are the circuits used?
  – Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering

62
Cross-Domain Virtual Circuit Service
Large-scale collaborations span many network domains (administrative units).
  – For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
  – e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains
63
Inter-Domain Control Protocol
• There are two realms involved:
  1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
  2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains
[Figure: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] to a user destination at DESY (AS1754) [Germany], crossing ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GEANT – and a data plane connection helper at each domain ingress/egress point.
1. The domains exchange topology information containing at least the potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. an Ethernet VLAN-to-VLAN connection) is facilitated by a helper process.]
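A minimal sketch of the domain-by-domain setup described in the figure: the end-to-end request is decomposed into per-domain segments, each authorized and reserved by that domain's controller, with everything released if any segment fails. This is illustrative of the pattern only, not the NSI/IDC protocol itself; domain and exchange-point names are taken from the figure, the function names are invented.

    # Sketch of chained, per-domain reservation for an end-to-end circuit.
    def reserve_segment(domain, ingress, egress, mbps):
        """Stand-in for the local domain controller (e.g. OSCARS, AutoBAHN)
        authorizing and reserving one segment of the circuit."""
        print(f"{domain}: reserving {mbps} Mb/s segment {ingress} -> {egress}")
        return True

    def setup_end_to_end(segments, mbps):
        """Pass the setup request from domain to domain; if any segment fails,
        release what was already reserved."""
        reserved = []
        for domain, ingress, egress in segments:
            if not reserve_segment(domain, ingress, egress, mbps):
                for d, i, e in reversed(reserved):
                    print(f"{d}: releasing segment {i} -> {e}")
                return False
            reserved.append((domain, ingress, egress))
        return True

    segments = [("ESnet", "FNAL edge", "ESnet/GEANT exchange"),
                ("GEANT", "ESnet/GEANT exchange", "GEANT/DFN exchange"),
                ("DFN",   "GEANT/DFN exchange", "DESY edge")]
    setup_end_to_end(segments, mbps=2000)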
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
  – Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grid framework, so that computing, data access, and data movement can all work together as a predictable system.
• Multi-domain circuit setup is not yet a robust production service, but progress is being made.
  • See lhcone.net
65
8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science.
  – With each generation of network transport technology:
    • 155 Mb/s was the norm for high-speed networks in 1995
    • 100 Gb/s – 650 times greater – is the norm today
    • R&D groups involving hardware engineers, computer scientists, and application specialists worked to
      • first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
      • and then do the development necessary for applications to make use of the new capabilities
  – Examples of how this methodology drove toward today's capabilities include:
    • experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC-12 (622 Mb/s) wide area network paths
    • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
66
Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base: http://fasterdata.es.net topics
  – Network Architecture, including the Science DMZ model
  – Host Tuning
  – Network Tuning
  – Data Transfer Tools
  – Network Performance Testing
  – With special sections on:
    • Linux TCP Tuning
    • Cisco 6509 Tuning
    • perfSONAR Howto
    • Active perfSONAR Services
    • Globus overview
    • Say No to SCP
    • Data Transfer Nodes (DTN)
    • TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
69
Infrastructure Critical to Science
• The combination of:
  – new network architectures in the wide area
  – new network services (such as guaranteed bandwidth virtual circuits)
  – cross-domain network error detection and correction
  – redesigning the site LAN to handle high data throughput
  – automation of data movement systems
  – use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated/sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.
71
LHC lessons of possible use to the SKA
The lessons:
The science data product (output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
  – A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
  – The technical aspects of building and operating a centralized working data repository:
    • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
    • high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
  militate against a single large data center.
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
  – It decentralizes costs and involves many countries directly in the telescope infrastructure.
  – It divides up the network load, especially on the expensive trans-ocean links.
  – It divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
 There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
  • It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
  • In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
 If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
  – In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
74
LHC lessons of possible use to the SKA
 All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
  New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
 Re-engineering the site LAN-WAN architecture is critical: the Science DMZ.
 Workflow management systems that automate the data movement will have to be designed and tested.
  – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
75
The Message
Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet, Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
25
Transport: Modern TCP stack
Even modern TCP stacks are only of some help in the face of packet loss on a long-path, high-speed network.
• For a detailed analysis of the impact of packet loss on various TCP implementations, see "An Investigation into Transport Protocols and Data Transport Applications Over High Performance Networks," chapter 8 ("Systematic Tests of New-TCP Behaviour"), by Yee-Ting Li, University College London (PhD thesis), http://www.slac.stanford.edu/~ytl/thesis.pdf
[Figure: Throughput vs. increasing latency on a 10 Gb/s link with 0.0046% packet loss (tail zoom). Curves: Reno (measured), Reno (theory), and H-TCP (CUBIC refinement, measured). X-axis: round-trip time in ms (corresponding roughly to San Francisco to London); Y-axis: throughput, 0-1000 Mb/s.]
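Why a tiny loss rate hurts so much at long RTT follows directly from the standard Reno-style throughput bound (the Mathis et al. approximation): rate ≲ (MSS/RTT)·(C/√p) with C ≈ 1.22. The short sketch below just evaluates that bound for the figure's regime (1500-byte segments, 0.0046% loss); the specific RTT values are illustrative.

    # Reno-style TCP throughput bound (Mathis et al. approximation):
    #   rate <= (MSS / RTT) * (C / sqrt(p)),  C ~ 1.22
    from math import sqrt

    MSS_BITS = 1500 * 8       # bits per segment
    LOSS = 0.0046 / 100       # packet loss probability
    C = 1.22                  # constant in the Mathis approximation

    for rtt_ms in (1, 10, 50, 100, 150):
        rtt = rtt_ms / 1000.0
        rate_mbps = (MSS_BITS / rtt) * (C / sqrt(LOSS)) / 1e6
        print(f"RTT {rtt_ms:4d} ms -> ~{rate_mbps:7.1f} Mb/s upper bound")

    # Even at 1 ms RTT this loss rate caps a single Reno flow at roughly 2 Gb/s,
    # and at transcontinental or transatlantic RTTs it collapses to tens of Mb/s,
    # far below the 10 Gb/s link rate. This is why loss-tolerant stacks (H-TCP,
    # CUBIC) and, above all, error-free paths matter for long-distance transfers.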
26
3) Monitoring and testing
The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction.
 perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.).
• perfSONAR is a community effort to:
  – define network management data exchange protocols, and
  – standardize measurement data formats, gathering, and archiving.
 perfSONAR is deployed extensively throughout LHC-related networks and international networks and at the end sites. (See [fasterdata], [perfSONAR], and [NetServ].)
  – There are now more than 1000 perfSONAR boxes installed in N. America and Europe.
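To illustrate the kind of continuous, low-rate probing that such monitoring automates between measurement hosts, here is a deliberately simple sketch: send numbered probes, count how many come back, and alarm on a non-zero loss rate. This is only an illustration of the idea; perfSONAR's own tools (owamp, bwctl, etc.) are far more capable, and the responder host here is hypothetical.

    # Minimal sketch of continuous soft-error (loss) probing between two hosts.
    import socket, time

    def probe_loss(host, port, count=100, timeout=0.5):
        """Send numbered UDP probes to an echo responder and count missing replies."""
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.settimeout(timeout)
        lost = 0
        for seq in range(count):
            s.sendto(seq.to_bytes(4, "big"), (host, port))
            try:
                s.recvfrom(64)
            except socket.timeout:
                lost += 1
            time.sleep(0.01)   # low-rate: do not perturb production traffic
        return lost / count

    # loss = probe_loss("ps-host.example.org", 8760)   # hypothetical responder
    # print(f"loss rate: {loss:.2%}")                  # alarm if persistently > 0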
27
perfSONAR
The test and monitor functions can detect soft errors that limit throughput and can be hard to find. (Hard errors/faults are easily found and corrected.)
Soft failure example:
• Observed end-to-end performance degradation due to soft failure of a single optical line card.
[Figure: one month of throughput measurements in Gb/s, showing normal performance, a period of degrading performance, and the return to normal after the repair.]
• Why not just rely on "SNMP" interface stats for this sort of error detection?
  • not all error conditions show up in SNMP interface statistics
  • SNMP error statistics can be very noisy
  • some devices lump different error counters into the same bucket, so it can be very challenging to figure out which errors to alarm on and which errors to ignore
    • though ESnet's Spectrum monitoring system attempts to apply heuristics to do this
  • many routers will silently drop packets – the only way to find that is to test through them and observe loss using devices other than the culprit device
28
perfSONAR
The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains.
 It provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe.
  – ESnet has perfSONAR testers installed at every PoP and at all but the smallest user sites; Internet2 is close to the same.
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages.
29
4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network:
• host TCP tuning
• modern TCP stack (see above)
• other issues (MTU, etc.)
• data transfer tools and parallelism
• other data transfer issues (firewalls, etc.)
30
4.1) System software tuning: Host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length.
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using end-to-end.
 Default TCP buffer sizes are typically much too small for today's high-speed networks (the arithmetic is sketched below):
  – Until recently, default TCP send/receive buffers were typically 64 KB.
  – Tuned buffer to fill a CA-to-NY 1 Gb/s path: 10 MB
    • 150x bigger than the default buffer size
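The 10 MB figure is just the bandwidth-delay product (BDP) of that path: the window needed to keep the pipe full is bandwidth × RTT. The sketch below shows the arithmetic and how a transfer tool can set the socket buffers explicitly; the ~80 ms coast-to-coast RTT used here is an illustrative assumption.

    # Bandwidth-delay product: the TCP window needed to keep a path full.
    #   BDP [bytes] = bandwidth [bits/s] * RTT [s] / 8
    bandwidth_bps = 1_000_000_000     # 1 Gb/s path
    rtt_s = 0.080                     # ~80 ms CA-to-NY RTT (assumption)
    bdp_bytes = int(bandwidth_bps * rtt_s / 8)
    print(f"BDP ~ {bdp_bytes / 1e6:.0f} MB")   # ~10 MB, vs. the old 64 KB default

    # Per-socket buffer sizing (what a tuned transfer tool does before connecting).
    # Sketch only; the OS-level maximums must also be large enough to honor this.
    import socket
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)
    print("granted send buffer:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))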
31
System software tuning: Host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications.
  – How to tune is a function of the application and the path to the destination, so potentially a lot of special cases.
 Auto-tuning the TCP connection buffer size within pre-configured limits helps.
 Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths.
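A quick way to see whether the auto-tuning ceiling is the problem is to compare it against the BDP of the path in question. The sketch below is Linux-specific (the /proc/sys/net/ipv4/tcp_rmem path is real); the 10 Gb/s at 150 ms example numbers are illustrative.

    # Compare the Linux TCP receive auto-tuning ceiling to the window a long,
    # fast path actually needs. Linux-specific sketch.
    with open("/proc/sys/net/ipv4/tcp_rmem") as f:
        min_b, default_b, max_b = map(int, f.read().split())

    needed = int(10e9 * 0.150 / 8)   # 10 Gb/s at 150 ms RTT -> ~187 MB window
    print(f"auto-tuning max: {max_b / 1e6:.0f} MB, BDP needed: {needed / 1e6:.0f} MB")
    if max_b < needed:
        print("auto-tuning ceiling too low for this path; raise net.ipv4.tcp_rmem")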
32
System software tuning: Host tuning – TCP
[Figure: Throughput out to ~9000 km on a 10 Gb/s network, 32 MB (auto-tuned) vs. 64 MB (hand-tuned) TCP window size. X-axis: round-trip time in ms (corresponding roughly to San Francisco to London) vs. path length; Y-axis: throughput, 0-10,000 Mb/s. The connection hand-tuned to a 64 MB window sustains substantially higher throughput than the 32 MB auto-tuned connection as the path gets longer.]
33
4.2) System software tuning: Data transfer tools
 Parallelism is key in data transfer tools (a sketch of the idea follows this list).
  – It is much easier to achieve a given performance level with multiple parallel connections than with one connection,
    • because the OS is very good at managing multiple threads and less good at sustained, maximum performance of a single thread (the same is true for disks).
  – Several tools offer parallel transfers (see below).
 Latency tolerance is critical.
  – Wide area data transfers have much higher latency than LAN transfers.
  – Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds):
    • examples: SCP/SFTP and HPSS mover protocols work very poorly on long-path networks.
• Disk performance
  – In general, a RAID array or parallel disks (as in FDT) are needed to get more than about 500 Mb/s.
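A minimal sketch of the parallel-connection idea: split a transfer into N byte ranges and move them over N streams at once, so a single slow stream (or one congestion-window collapse after a loss) does not gate the whole transfer. This is illustrative of what tools like GridFTP do internally, not their implementation; the file path is a placeholder and the actual network send is stubbed out.

    # Sketch: move one large file as several parallel byte-range streams.
    import os
    from concurrent.futures import ThreadPoolExecutor

    def send_range(path, offset, length, stream_id):
        """Send one byte range of the file over its own connection (stubbed)."""
        with open(path, "rb") as f:
            f.seek(offset)
            remaining = length
            while remaining > 0:
                chunk = f.read(min(4 * 1024 * 1024, remaining))   # 4 MB reads
                if not chunk:
                    break
                remaining -= len(chunk)
                # a real tool would write `chunk` to a dedicated TCP socket here
        return stream_id, length - remaining

    def parallel_send(path, streams=8):
        size = os.path.getsize(path)
        per_stream = (size + streams - 1) // streams
        with ThreadPoolExecutor(max_workers=streams) as pool:
            futures = [pool.submit(send_range, path, i * per_stream,
                                   min(per_stream, size - i * per_stream), i)
                       for i in range(streams) if i * per_stream < size]
            for fut in futures:
                stream_id, sent = fut.result()
                print(f"stream {stream_id}: {sent} bytes")

    # parallel_send("/data/big_dataset.tar", streams=8)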
34
System software tuning: Data transfer tools
Using the right tool is very important.
Sample results: Berkeley, CA to Argonne, IL; RTT = 53 ms, network capacity = 10 Gbps.
  Tool                    Throughput
  scp                     140 Mbps
  patched scp (HPN)       1.2 Gbps
  ftp                     1.4 Gbps
  GridFTP, 4 streams      5.4 Gbps
  GridFTP, 8 streams      6.6 Gbps
Note that to get more than about 1 Gbps (125 MB/s) disk to disk requires using RAID technology.
• PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSH:
  – http://www.psc.edu/networking/projects/hpn-ssh
  – Significant performance increase
    • this helps rsync too
35
System software tuning: Data transfer tools
 Globus GridFTP is the basis of most modern high-performance data movement systems.
  Parallel streams, buffer tuning, and help in getting through firewalls (open ports), ssh, etc.
  The newer Globus Online incorporates all of these and adds small-file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP.
36
System software tuning: Data transfer tools
Also see Caltech's FDT (Fast Data Transfer) approach:
  – Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node
  – Explicit parallel use of multiple disks
  – Can fill 100 Gb/s paths
  – See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT
37
4.4) System software tuning: Other issues
 Firewalls are anathema to high-speed data flows.
  – many firewalls can't handle >1 Gb/s flows
    • designed for large numbers of low-bandwidth flows
    • some firewalls even strip out TCP options that allow for TCP buffers > 64 KB
  See Jason Zurawski's "Say Hello to your Frienemy – The Firewall."
 Stateful firewalls have inherent problems that inhibit high throughput.
  • http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues:
  – Large MTUs (several issues)
  – NIC tuning
    • Defaults are usually fine for 1GE, but 10GE often requires additional tuning
  – Other OS tuning knobs
  – See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])
5) Site infrastructure to support data-intensive science: The Science DMZ
With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck.
 The site network (LAN) typically provides connectivity for the local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science.
  – Therefore a high performance interface between the wide area network and the local area site network is critical for large-scale data movement.
 Campus network infrastructure is typically not designed to handle the flows of large-scale science.
  – The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows:
    • firewalls, proxy servers, low-cost switches, and so forth,
    • none of which will allow high volume, high bandwidth, long distance data flows.
39
The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large data volume, high round trip time (RTT) (international paths) of the wide area network (WAN) flows (see [DIS]).
  – Otherwise the site will impose poor performance on the entire high-speed data path, all the way back to the source.
40
The Science DMZ
The Science DMZ concept: the compute and data resources involved in data-intensive science should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy:
  Outside the site firewall – hence the term "Science DMZ"
  With dedicated systems built and tuned for wide-area data transfer
  With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below)
  With a security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.)
 This is so important that it was a requirement for the last round of NSF CC-NIE grants.
41
The Science DMZ
[Figure: Science DMZ architecture. The WAN connects through the border router to a Science DMZ router/switch (a WAN-capable device) that provides a clean, high-bandwidth WAN data path to dedicated systems built and tuned for wide-area data transfer: a high-performance Data Transfer Node, a computing cluster, and network monitoring and testing hosts, with per-service security policy control points. The campus/site LAN and the site DMZ (Web/DNS/Mail) sit behind the site firewall; campus/site access to Science DMZ resources is via the site firewall, and secured campus/site access to the Internet is separate from the science data path. See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.]
42
6) Data movement and management techniques
 Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
 In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers.
• The Tier 2 sites get a comparable amount of data from the Tier 1s.
  – They host the physics groups that analyze the data and do the science.
  – They provide most of the compute resources for analysis.
  – They cache the data (though this is evolving to remote I/O).
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management.
  – The resources and data movement are centrally managed.
  – Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations.
  – The system manages tens of thousands of jobs a day:
    • it coordinates data movement of hundreds of terabytes/day, and
    • it manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science.
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial.
44
[Figure: The ATLAS PanDA "Production and Distributed Analysis" system uses distributed resources and layers of automation to manage several million jobs/day.
Components: the CERN ATLAS detector and Tier 0 Data Center (1 copy of all data – archival only); the ATLAS Tier 1 Data Centers (11 sites scattered across Europe, North America, and Asia, which in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis); the ATLAS analysis sites (e.g. 70 Tier 2 centers in Europe, North America, and SE Asia); and the PanDA Server (task management) with its Task Buffer (job queue), Job Dispatcher, Job Broker, Policy (job type, priority), Data Service, Distributed Data Manager (DDM) agents, Grid Scheduler, Site Capability Service, and site status feeds. ATLAS production jobs, regional production jobs, and user/group analysis jobs all enter through the task buffer.
Job resource manager: a "pilot" job manager – a PanDA job receiver – is dispatched when resources are available at a site; pilots run under the local site job manager (e.g. Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA, similar to the Condor glide-in approach.
Workflow: 1) PanDA schedules jobs and initiates data movement; 2) the DDM locates data and moves it to sites (this is a complex system in its own right, called DQ2); 3) the pilot prepares the local resources to receive PanDA jobs; 4) jobs are dispatched when there are resources available and when the required data is in place at the site. The strategy is to try to move the job to where the data is, else move data and job to where resources are available.
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]
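To make the pilot-job pattern concrete, here is a deliberately simplified sketch of the control flow just described: a pilot started by the local batch system asks the central server for work, and the broker only hands out a job whose input data is already at that site, otherwise it triggers data movement. This is illustrative pseudocode of the pattern, not PanDA's actual interfaces; all names and datasets are invented.

    # Simplified sketch of the pilot-job / late-binding pattern used by
    # systems like PanDA. Invented names; not the real PanDA code.
    from collections import deque

    job_queue = deque([
        {"id": 1, "dataset": "data12.periodA"},
        {"id": 2, "dataset": "mc12.ttbar"},
    ])
    site_datasets = {"BNL": {"data12.periodA"}, "MWT2": {"mc12.ttbar"}}

    def pilot_ready(site):
        """Stand-in for a pilot (launched by the local batch system)
        contacting the central server to ask for work."""
        return True

    def dispatch(site):
        """Give the pilot at `site` a job whose input data is already there;
        otherwise ask the data management layer to move data and retry later."""
        for job in list(job_queue):
            if job["dataset"] in site_datasets[site]:
                job_queue.remove(job)
                print(f"site {site}: running job {job['id']} on {job['dataset']}")
                return job
        print(f"site {site}: no co-located work; would request data movement")
        return None

    for site in ("BNL", "MWT2"):
        if pilot_ready(site):
            dispatch(site)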
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s.
[Figure: accumulated data volume on disk over four years, rising to roughly 150 petabytes; one-year plots of the two types of jobs PanDA manages (shown separately, up to roughly 100,000 and 50,000 jobs, 120,000-140,000 simultaneous jobs in total); and one year of data movement at 730 TBytes/day.]
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
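As a sanity check on the slide's own numbers, 730 TBytes/day does indeed work out to about 68 Gb/s sustained:

    # Sustained rate implied by 730 TBytes/day.
    tbytes_per_day = 730
    bits_per_day = tbytes_per_day * 1e12 * 8
    print(f"{bits_per_day / 86400 / 1e9:.1f} Gb/s")   # ~67.6 Gb/s, i.e. the quoted ~68 Gb/s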
46
Building an LHC-scale production analysis system
 In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure.
  – Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges."
  – Successful testing was required for sites to participate in LHC production.
47
Ramp-up of LHC traffic in ESnet
[Figure: ESnet traffic over time, showing an estimate of the "small"-scale traffic during LHC data system testing, the LHC turn-on, and LHC operation. The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.]
48
6 cont) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN.
  – The LHCOPN is a collection of leased 10 Gb/s optical circuits.
  – The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
    • In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.
• The issues related to the fact that most sites connect to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance.
  The security issues were the primary ones and were addressed by
    • using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec])
      – that is, only LHC data and compute servers are connected to the OPN.
50
The LHC OPN ndash Optical Private Network
UK-T1_RAL
NDGF
FR-CCIN2P3
ES-PIC
DE-KIT
NL-T1
US-FNAL-CMS
US-T1-BNL
CA-TRIUMF
TW-ASCG
IT-NFN-CNAF
CH-CERNLHCOPN physical (abbreviated)
LHCOPN architecture
51
The LHC OPN – Optical Private Network
N.B.:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
 Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
  – The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
  – However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
  – In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic.
  – (There are about 170 Tier 2 sites.)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.
53
The LHC's Open Network Environment – LHCONE
 LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
 The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems.
  – The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GEANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.).
  – The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).
 In this way the LHC traffic will use circuits designated by the network engineers
  – to ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.
54
[Figure: LHCONE – a global infrastructure for LHC Tier 1 data center to Tier 2 analysis center connectivity (April 2012). LHCONE VRF domains include ESnet (USA), Internet2 (USA), CANARIE (Canada), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), GÉANT (Europe), CUDI (Mexico), TWAREN and ASGC (Taiwan), KERONET2 and KISTI (Korea), and India, interconnected at regional R&E communication nexus points such as Seattle, Chicago, New York, Washington, Amsterdam, and Geneva. End sites are LHC Tier 2 or Tier 3 unless indicated as Tier 1 (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1). Data communication links are 10, 20, and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because
  – the VRF technology is a standard capability in most core routers, and
  – there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
  CERN → T1          miles     km
  France               350      565
  Italy                 570      920
  UK                    625     1000
  Netherlands           625     1000
  Germany               700     1185
  Spain                 850     1400
  Nordic               1300     2100
  USA – New York       3900     6300
  USA – Chicago        4400     7100
  Canada – BC          5200     8400
  Taiwan               6100     9850
[Figure: the CERN computer center feeds the LHC Optical Private Network (LHCOPN) to the LHC Tier 1 data centers, which in turn serve the LHC Tier 2 analysis centers and the many university physics groups via the LHC Open Network Environment (LHCONE) – the same network-centric view of the LHC shown earlier.]
7) New network servicesPoint-to-Point Virtual Circuit Service
Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly
consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of
systemsrdquondash Break up the task of massive data analysis and use data compute and
storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements
A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work
well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm
58
Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual
circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism
bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path
ndash MPLS and OpenFlow are examples of this and both can transport IP packets
ndash Most modern Internet routers have this type of functionality
bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths
when they are set up
59
Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service
(For more information contact the project lead Chin Guok chinesnet)
bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references
bull OSCARS received a 2013 ldquoRampD 100rdquo award
60
End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo
ndash Sites for the most part
bull How are the circuits usedndash End system to end system IP
bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure
networks
ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future
ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP
protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
ndash Point-to-point connection between routing instance ndash eg BGP at the end points
bull Essentially this is how all current circuits are used from one site router to another site router
ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters
61
End User View of Circuits ndash How They Use Thembull When are the circuits used
ndash Mostly to solve a specific problem that the general infrastructure cannot
bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering
network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains
involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET
(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains
63
Inter-Domain Control Protocolbull There are two realms involved
1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains
2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains
FNAL (AS3152)[US]
ESnet (AS293)[US]
GEANT (AS20965)[Europe]
DFN (AS680)[Germany]
DESY (AS1754)[Germany]
Topology exchange
VC setup request
Local InterDomain
Controller
Local IDC
Local IDC
Local IDC
Local IDC
VC setup request
VC setup request
VC setup request
OSCARS
User source
User destination
VC setup request
data plane connection helper at each domain ingressegress point
data plane connection helper at each domain ingressegress point
1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to
domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process
The end-to-end virtual circuit
AutoBAHN
64
Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved
into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated
Facility - an international virtual organization that promotes the paradigm of lambda networking)
bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet
65
8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the
network to support data-intensive sciencendash With each generation of network transport technology
bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and
application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo
end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of
the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities
includebull experiments in the 1990s in using parallel disk IO and parallel network IO
together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths
bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs
66
Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects
are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build
a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at
httpfasterdataesnet and contains contributions from several organizations
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional
centers) has worked wellndash It decentralizes costs and involves many countries directly in the
telescope infrastructurendash It divides up the network load especially on the expensive trans-
ocean linksndash It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKARegardless of distributed vs centralized working data
repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source
to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a
centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue
bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC
ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
26
3) Monitoring and testing
The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and to facilitate their isolation and correction.
perfSONAR provides a standardized way to test, measure, export, catalogue, and access performance data from many different network domains (service providers, campuses, etc.).
• perfSONAR is a community effort to:
– define network management data exchange protocols, and
– standardize measurement data formats, gathering, and archiving.
perfSONAR is deployed extensively throughout the LHC-related networks and international networks, and at the end sites (see [fasterdata], [perfSONAR], and [NetServ]).
– There are now more than 1000 perfSONAR boxes installed in North America and Europe.
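The kind of active, end-to-end throughput test that perfSONAR schedules and archives can also be run by hand between two test hosts. The sketch below is an illustration only, not part of the perfSONAR toolkit itself, and the host name is a placeholder: it drives iperf3 from Python and reports the measured throughput, the sort of measurement that, run regularly in both directions, exposes soft failures long before users complain.

```python
import json
import subprocess

def run_throughput_test(server: str, seconds: int = 20, streams: int = 4) -> float:
    """Run an iperf3 test to `server` and return throughput in Gb/s.

    Assumes iperf3 is installed locally and that `server` is running `iperf3 -s`.
    """
    result = subprocess.run(
        ["iperf3", "-c", server, "-t", str(seconds), "-P", str(streams), "-J"],
        capture_output=True, text=True, check=True,
    )
    report = json.loads(result.stdout)
    # "sum_received" reflects what actually arrived at the far end.
    bits_per_second = report["end"]["sum_received"]["bits_per_second"]
    return bits_per_second / 1e9

if __name__ == "__main__":
    # Hypothetical perfSONAR-style test host at the remote site.
    print(f"{run_throughput_test('ps-test.example.org'):.2f} Gb/s")
```

Regular low-rate loss tests (e.g., with owamp or ping) complement throughput tests, since even tiny loss rates are what actually limit TCP on long paths.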
27
perfSONAR
The test and monitor functions can detect soft errors that limit throughput and can be hard to find (hard errors / faults are easily found and corrected).
Soft failure example:
• Observed end-to-end performance degradation due to soft failure of a single optical line card.
[Graph: one month of end-to-end throughput in Gb/s, showing normal performance, a period of degrading performance caused by the failing line card, and the return to normal performance after the repair.]
• Why not just rely on SNMP interface statistics for this sort of error detection?
– not all error conditions show up in SNMP interface statistics;
– SNMP error statistics can be very noisy;
– some devices lump different error counters into the same bucket, so it can be very challenging to figure out what errors to alarm on and what errors to ignore
• though ESnet's Spectrum monitoring system attempts to apply heuristics to do this;
– many routers will silently drop packets – the only way to find that is to test through them and observe loss using devices other than the culprit device.
28
perfSONAR
The value of perfSONAR increases dramatically as it is deployed at more sites, so that more of the end-to-end (app-to-app) path can be characterized across multiple network domains.
perfSONAR provides the only widely deployed tool that can monitor circuits end-to-end across the different networks from the US to Europe.
– ESnet has perfSONAR testers installed at every PoP and at all but the smallest user sites – Internet2 is close to the same.
• perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG), and the protocol is implemented using SOAP XML messages.
29
4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network:
• Host TCP tuning
• Modern TCP stack (see above)
• Other issues (MTU, etc.)
• Data transfer tools and parallelism
• Other data transfer issues (firewalls, etc.)
30
4.1) System software tuning: Host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length.
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using end-to-end.
Default TCP buffer sizes are typically much too small for today's high speed networks.
– Until recently, default TCP send/receive buffers were typically 64 KB.
– Tuned buffer to fill a CA-to-NY 1 Gb/s path: 10 MB
• 150x bigger than the default buffer size.
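The required buffer is just the bandwidth-delay product of the path. The sketch below computes it for the CA-to-NY example above (assuming roughly an 80 ms round-trip time, which is what makes the 10 MB figure come out) and shows how an application would request matching socket buffers; the kernel silently caps what it grants at its configured maxima, which is exactly why untuned hosts cripple long paths.

```python
import socket

def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> int:
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return int(bandwidth_bps * rtt_seconds / 8)

# Example: 1 Gb/s path, California to New York, ~80 ms round trip.
buffer_size = bdp_bytes(1e9, 0.080)   # = 10,000,000 bytes, i.e. 10 MB
print(f"needed TCP buffer: {buffer_size / 1e6:.0f} MB")

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Request send/receive buffers sized to the BDP; the kernel clamps these to its
# configured maxima (net.core.wmem_max / rmem_max on Linux).
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_size)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_size)
# Note: Linux reports twice the requested value here, for bookkeeping overhead.
print("granted:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
```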
31
System software tuning: Host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications.
– How to tune is a function of the application and the path to the destination, so potentially a lot of special cases.
Auto-tuning TCP connection buffer size within pre-configured limits helps.
Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g., international) paths.
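A quick way to see whether a host's auto-tuning ceiling is adequate is to compare it with the bandwidth-delay product of the longest path in use. The sketch below assumes a Linux host (it reads the /proc sysctl interface); the 10 Gb/s and 100 ms figures are example values for a trans-Atlantic path.

```python
def linux_tcp_max_buffer(path: str = "/proc/sys/net/ipv4/tcp_rmem") -> int:
    """Return the maximum auto-tuned receive buffer (bytes) on a Linux host."""
    with open(path) as f:
        minimum, default, maximum = (int(v) for v in f.read().split())
    return maximum

needed = int(10e9 * 0.100 / 8)    # 10 Gb/s at 100 ms RTT -> 125 MB of buffer
ceiling = linux_tcp_max_buffer()
print(f"auto-tuning ceiling {ceiling / 2**20:.0f} MiB, path needs {needed / 2**20:.0f} MiB")
if ceiling < needed:
    # The fix is administrative: raise net.ipv4.tcp_rmem / tcp_wmem maxima
    # (recommended values are maintained at fasterdata.es.net).
    print("auto-tuning ceiling is too low for this path")
```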
32
System software tuning: Host tuning – TCP
Throughput out to ~9000 km on a 10 Gb/s network: 32 MB (auto-tuned) vs. 64 MB (hand-tuned) TCP window size.
[Graph: throughput in Mb/s vs. round-trip time in ms (path lengths out to roughly San Francisco to London), comparing a connection hand-tuned to a 64 MB window with one auto-tuned to a 32 MB window.]
33
4.2) System software tuning: Data transfer tools
Parallelism is key in data transfer tools.
– It is much easier to achieve a given performance level with multiple parallel connections than with one connection.
• This is because the OS is very good at managing multiple threads and less good at sustained, maximum performance of a single thread (the same is true for disks).
– Several tools offer parallel transfers (see below).
Latency tolerance is critical.
– Wide area data transfers have much higher latency than LAN transfers.
– Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds): for example, SCP/SFTP and the HPSS mover protocols work very poorly in long-path networks.
• Disk performance
– In general, a RAID array or parallel disks (as in FDT) are needed to get more than about 500 Mb/s.
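The benefit of parallel connections is easy to see in a toy transfer loop. The sketch below is illustrative only (real tools such as GridFTP and FDT do this with far more care over pipelining, checksums, and restart, and a receiver that reassembles the ranges is assumed on the far end): it splits a source file into N byte ranges and pushes each range over its own TCP connection in its own thread, so a stall or a small congestion window on one stream does not idle the others.

```python
import os
import socket
import threading

def send_range(host: str, port: int, path: str, offset: int, length: int) -> None:
    """Send one byte range of `path` over its own TCP connection."""
    with socket.create_connection((host, port)) as sock, open(path, "rb") as f:
        f.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = f.read(min(4 * 1024 * 1024, remaining))  # 4 MB reads
            if not chunk:
                break
            sock.sendall(chunk)
            remaining -= len(chunk)

def parallel_send(host: str, port: int, path: str, streams: int = 8) -> None:
    """Split the file into `streams` ranges and send them concurrently."""
    size = os.path.getsize(path)
    step = (size + streams - 1) // streams
    threads = [
        threading.Thread(target=send_range,
                         args=(host, port, path, i * step, min(step, size - i * step)))
        for i in range(streams)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Hypothetical data transfer node and file:
# parallel_send("dtn.example.org", 5000, "/data/run1234.tar", streams=8)
```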
34
System software tuning: Data transfer tools
Using the right tool is very important.
Sample results, Berkeley, CA to Argonne, IL (RTT = 53 ms, network capacity = 10 Gbps):
Tool                        Throughput
• scp                       140 Mbps
• patched scp (HPN)         1.2 Gbps
• ftp                       1.4 Gbps
• GridFTP, 4 streams        5.4 Gbps
• GridFTP, 8 streams        6.6 Gbps
Note that to get more than about 1 Gbps (125 MB/s) disk to disk requires using RAID technology.
• PSC (Pittsburgh Supercomputing Center) has a patch set that fixes problems with SSH (http://www.psc.edu/networking/projects/hpn-ssh), giving a significant performance increase.
• This helps rsync too.
35
System software tuning: Data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems.
– Parallel streams and buffer tuning help in getting through firewalls (open ports), ssh, etc.
– The newer Globus Online incorporates all of these and adds small-file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP.
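For a concrete sense of how the parallelism and buffer tuning are exposed to users, the sketch below wraps the classic globus-url-copy client (the -p parallel-streams, -tcp-bs buffer-size, and -vb options are long-standing GridFTP client flags, though exact behavior depends on the installed Globus version; the endpoint URLs are placeholders).

```python
import subprocess

def gridftp_copy(src_url: str, dst_url: str, streams: int = 8,
                 tcp_buffer_bytes: int = 33554432) -> None:
    """Client-driven GridFTP copy using parallel streams and a tuned TCP buffer.

    Assumes globus-url-copy is installed and valid credentials exist for both endpoints.
    """
    subprocess.run(
        ["globus-url-copy",
         "-p", str(streams),                # parallel TCP data streams
         "-tcp-bs", str(tcp_buffer_bytes),  # TCP buffer size per stream
         "-vb",                             # print transfer rate while running
         src_url, dst_url],
        check=True,
    )

# Hypothetical endpoints:
# gridftp_copy("gsiftp://dtn1.site-a.example/data/run1234.tar",
#              "gsiftp://dtn2.site-b.example/ingest/run1234.tar")
```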
36
System software tuning: Data transfer tools
Also see Caltech's FDT (Fast Data Transfer) approach.
– Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node.
– Explicit parallel use of multiple disks.
– Can fill 100 Gb/s paths.
– See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT
37
4.4) System software tuning: Other issues
Firewalls are anathema to high-speed data flows.
– Many firewalls can't handle >1 Gb/s flows:
• they are designed for large numbers of low-bandwidth flows;
• some firewalls even strip out the TCP options that allow for TCP buffers >64 KB.
See Jason Zurawski's "Say Hello to your Frienemy – The Firewall".
Stateful firewalls have inherent problems that inhibit high throughput.
• http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues:
– Large MTUs (several issues)
– NIC tuning
• defaults are usually fine for 1GE, but 10GE often requires additional tuning
– Other OS tuning knobs
– See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])
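The cost of a middlebox that strips the TCP window-scale option is easy to quantify: without window scaling the receive window can never exceed 65,535 bytes, so throughput is capped at one window per round trip regardless of link speed. A quick check (the 88 ms figure is just an example trans-Atlantic RTT):

```python
def max_tcp_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on TCP throughput: one full window delivered per round trip."""
    return window_bytes * 8 / rtt_seconds

# Window scaling stripped by a firewall: 64 KB window on an 88 ms path.
print(f"{max_tcp_throughput_bps(65535, 0.088) / 1e6:.1f} Mb/s")      # ~6.0 Mb/s
# Same path with a properly negotiated 64 MB window.
print(f"{max_tcp_throughput_bps(64 * 2**20, 0.088) / 1e9:.1f} Gb/s")  # ~6.1 Gb/s
```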
5) Site infrastructure to support data-intensive science: The Science DMZ
With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck.
The site network (LAN) typically provides connectivity for the local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science.
– Therefore a high-performance interface between the wide area network and the local area site network is critical for large-scale data movement.
Campus network infrastructure is typically not designed to handle the flows of large-scale science.
– The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows:
• firewalls, proxy servers, low-cost switches, and so forth,
• none of which will allow high-volume, high-bandwidth, long-distance data flows.
39
The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large-data-volume, high round-trip-time (RTT) (international paths) character of the wide area network (WAN) flows (see [DIS]).
– Otherwise the site will impose poor performance on the entire high-speed data path, all the way back to the source.
40
The Science DMZ
The Science DMZ concept:
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy:
• Outside the site firewall – hence the term "Science DMZ".
• With dedicated systems built and tuned for wide-area data transfer.
• With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below).
• With a security policy tailored for science traffic and implemented using appropriately capable hardware (e.g., that supports access control lists, private address space, etc.).
This is so important it was a requirement for the last round of NSF CC-NIE grants.
41
The Science DMZ
(See http://fasterdata.es.net/science-dmz and [SDMZ] for a much more complete discussion of the various approaches.)
[Diagram: Science DMZ architecture. The border router connects the WAN both to a clean, high-bandwidth WAN data path into the Science DMZ router/switch (a WAN-capable device) and, via the site firewall, to the secured campus/site access to the Internet and the site DMZ (web, DNS, mail). The Science DMZ hosts a high-performance Data Transfer Node, a computing cluster, and network monitoring and testing systems – dedicated systems built and tuned for wide-area data transfer – with per-service security policy control points. The campus/site LAN sits behind the site firewall, and campus/site access to Science DMZ resources is via the site firewall.]
42
6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data rate of about 2.5 Gb/s is sent to 10 national Tier 1 data centers.
• The Tier 2 sites get a comparable amount of data from the Tier 1s.
– They host the physics groups that analyze the data and do the science.
– They provide most of the compute resources for analysis.
– They cache the data (though this is evolving to remote I/O).
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management.
– The resources and data movement are centrally managed.
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations.
– The system manages tens of thousands of jobs a day:
• it coordinates data movement of hundreds of terabytes/day, and
• it manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science.
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial.
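The pilot-job pattern at the heart of PanDA (described in the diagram that follows) is conceptually simple, even though the production system adds many layers of policy, brokerage, and error handling. The toy sketch below is purely illustrative; the queue, job format, and dataset names are invented for the example. A pilot launched by the local batch system asks the central queue for work and runs whatever matches the data already staged at its site.

```python
import subprocess
import time

# Invented stand-in for the central task queue; PanDA's real interfaces differ.
CENTRAL_QUEUE = [
    {"job_id": 101, "dataset": "data12_8TeV.periodA", "cmd": ["echo", "run analysis on periodA"]},
    {"job_id": 102, "dataset": "mc12_8TeV.ttbar", "cmd": ["echo", "run skim on ttbar MC"]},
]

def fetch_matching_job(local_datasets):
    """Hand out a queued job whose input dataset is already staged at this site."""
    for job in CENTRAL_QUEUE:
        if job["dataset"] in local_datasets:
            CENTRAL_QUEUE.remove(job)
            return job
    return None

def pilot(local_datasets, idle_limit=3):
    """A 'pilot' started by the local batch system: pull work until there is none."""
    idle = 0
    while idle < idle_limit:
        job = fetch_matching_job(local_datasets)
        if job is None:
            idle += 1
            time.sleep(1)                        # back off when no suitable work is queued
            continue
        idle = 0
        print("running job", job["job_id"], "on local dataset", job["dataset"])
        subprocess.run(job["cmd"], check=False)  # run the payload job

pilot({"data12_8TeV.periodA"})                   # example: a site holding only one dataset
```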
44
[Diagram: the ATLAS PanDA ("Production and Distributed Analysis") system, which uses distributed resources and layers of automation to manage several million jobs/day.
– ATLAS production jobs, regional production jobs, and user/group analysis jobs enter the PanDA Server (task management) through a task buffer (job queue). A job broker, guided by policy (job type, priority), and a job dispatcher decide where jobs run, using site status and site capability services and a grid scheduler: try to move the job to where the data is, else move the data and the job to where resources are available.
– The Distributed Data Manager (with DDM agents at the sites and a data service) locates data and moves it between centers; it is a complex system in its own right, called DQ2.
– The CERN ATLAS detector feeds the Tier 0 data center (one copy of all data, archival only). The 11 ATLAS Tier 1 data centers, scattered across Europe, North America, and Asia, in aggregate hold one copy of all data and provide the working data set for distribution to Tier 2 centers for analysis. The ATLAS analysis sites are, e.g., 70 Tier 2 centers in Europe, North America, and SE Asia.
– Job resource manager: a "pilot" job manager (a PanDA job receiver) is dispatched when resources are available at a site; pilots run under the local site job manager (e.g., Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA, similar to the Condor Glide-in approach.
– Workflow: 1) PanDA schedules jobs and initiates data movement; 2) DDM locates data and moves it to sites; 3) pilots prepare the local resources to receive PanDA jobs; 4) jobs are dispatched when resources are available and the required data is in place at the site.
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBytes/day (~68 Gb/s).
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, shown separately here).
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
[Plots: accumulated data volume on disk, rising to ~150 petabytes over four years; and the counts of the two PanDA job types over one year each, roughly 50,000–100,000 simultaneous jobs of one type and up to ~50,000 of the other.]
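As a sanity check, the quoted daily volume and the sustained rate are consistent:

```python
tbytes_per_day = 730
bits_per_day = tbytes_per_day * 1e12 * 8    # terabytes/day -> bits/day
gbps = bits_per_day / 86400 / 1e9           # bits/day -> gigabits/second
print(f"{gbps:.1f} Gb/s sustained")          # ~67.6 Gb/s, i.e. the quoted ~68 Gb/s
```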
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure.
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges".
– Successful testing was required for sites to participate in LHC production.
47
Ramp-up of LHC traffic in ESnet
[Graph: ESnet traffic over time, showing the estimate of "small"-scale traffic, the LHC data system testing period, LHC turn-on, and LHC operation.]
The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.
48
6, cont.) Evolution of network architectures
For sustained high data-rate transfers – e.g., from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN.
– The LHCOPN is a collection of leased 10 Gb/s optical circuits.
– The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.
• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance.
– The security issues were the primary ones, and they were addressed by using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec]) – that is, only LHC data and compute servers are connected to the OPN.
50
The LHC OPN – Optical Private Network
[Diagrams: LHCOPN physical topology (abbreviated) and LHCOPN architecture – CH-CERN connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network
N.B.:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic.
– (There are about 170 Tier 2 sites.)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems.
– The clouds are mostly local to a network domain (e.g., one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.).
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).
In this way the LHC traffic will use circuits designated by the network engineers.
– This is to ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.
54
[Map (April 2012): LHCONE – a global infrastructure for LHC Tier 1 data center to Tier 2 analysis center connectivity. LHCONE VRF domains (ESnet and Internet2 in the USA, CANARIE in Canada, GÉANT in Europe, NORDUnet for the Nordic countries, DFN in Germany, GARR in Italy, RedIRIS in Spain, SARA in the Netherlands, RENATER in France, CUDI in Mexico, TWAREN and ASGC in Taiwan, KREONet2 in Korea) interconnect at exchange points (Seattle, Chicago, New York, Washington, Amsterdam, Geneva, ...) and serve end sites that are LHC Tier 2 or Tier 3 centers unless indicated as Tier 1 (e.g., BNL-T1, FNAL-T1, TRIUMF-T1, ASGC-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, CERN-T1). Data communication links are 10, 20, and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because:
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
CERN → T1 distances:
– France: 350 miles (565 km)
– Italy: 570 miles (920 km)
– UK: 625 miles (1000 km)
– Netherlands: 625 miles (1000 km)
– Germany: 700 miles (1185 km)
– Spain: 850 miles (1400 km)
– Nordic: 1300 miles (2100 km)
– USA – New York: 3900 miles (6300 km)
– USA – Chicago: 4400 miles (7100 km)
– Canada – BC: 5200 miles (8400 km)
– Taiwan: 6100 miles (9850 km)
[Diagram: A Network Centric View of the LHC. The detector (1 PB/s) feeds the Level 1 and 2 triggers over O(1-10) meters, then the Level 3 trigger over O(10-100) meters, and then the CERN Computer Center over O(1) km; the combined experiment output is 5.0 Gb/s (2.5 Gb/s ATLAS, 2.5 Gb/s CMS). The LHC Optical Private Network (LHCOPN) carries the data 500-10,000 km to the LHC Tier 1 data centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN). The LHC Open Network Environment (LHCONE) connects the Tier 1 centers to the LHC Tier 2 analysis centers and the many university physics groups; this is intended to indicate that the physics groups now get their data wherever it is most readily available.]
57
7) New network services
Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to:
– couple existing pockets of code, data, and expertise into "systems of systems";
– break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites;
– see https://www.es.net/about/science-requirements/
A commonly identified need to support this is that networking must be provided as a "service":
– schedulable with guaranteed bandwidth – as is done with CPUs and disks;
– traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure;
– some network path characteristics may also be specified – e.g., diversity;
– available in the Web Services / Grid Services paradigm.
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
– This is typically done by using a "static" routing mechanism.
• E.g., some variation of label-based switching, with the static switch tables set up in advance to define the circuit path.
– MPLS and OpenFlow are examples of this, and both can transport IP packets.
– Most modern Internet routers have this type of functionality.
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage and optimize the use of available network resources and to keep big data flows separate from general traffic.
– The virtual circuits can be directed to specific physical network paths when they are set up.
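From the user side, a circuit request boils down to a handful of parameters: the two endpoints, the bandwidth to guarantee, and the time window. The sketch below shows what such a request might look like; it is a generic illustration only, not the actual OSCARS or NSI API, and the endpoint names and service URL are invented.

```python
import json
from urllib import request

def reserve_circuit(service_url: str, src: str, dst: str,
                    bandwidth_mbps: int, start_iso: str, end_iso: str) -> dict:
    """Submit a point-to-point circuit reservation to a (hypothetical) service.

    A real provisioning system (e.g., OSCARS or an NSI-based service) defines its
    own schema and authentication; this only illustrates the shape of a request.
    """
    payload = {
        "source-endpoint": src,             # e.g. site A's border VLAN
        "destination-endpoint": dst,        # e.g. site B's border VLAN
        "bandwidth-mbps": bandwidth_mbps,   # guaranteed rate to reserve
        "start-time": start_iso,
        "end-time": end_iso,
    }
    req = request.Request(service_url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)

# Hypothetical invocation:
# reserve_circuit("https://circuit-service.example.net/reservations",
#                 "siteA:dtn-vlan-3001", "siteB:dtn-vlan-3001",
#                 bandwidth_mbps=5000,
#                 start_iso="2014-04-01T00:00:00Z", end_iso="2014-04-02T00:00:00Z")
```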
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information, contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.
60
End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part.
• How are the circuits used?
– End system to end system, IP:
• almost never – very hard unless private address space is used;
• using public address space can result in leaking routes;
• using private address space with multi-homed hosts risks allowing backdoors into secure networks.
– End system to end system, Ethernet (or other) over VLAN – a pseudowire:
• relatively common;
• interesting example: RDMA over VLAN is likely to be popular in the future
– the SC11 demo of 40G RDMA over the WAN was very successful;
– the CPU load for RDMA is a small fraction of that for IP;
– the guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks).
– Point-to-point connection between routing instances – e.g., BGP at the end points:
• essentially this is how all current circuits are used, from one site router to another site router;
• typically site-to-site, or to advertise subnets that host clusters, e.g., LHC analysis or data management clusters.
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot.
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering.
Cross-Domain Virtual Circuit Service
Science collaborations inevitably span multiple network domains (administrative units).
– For a circuit service to be useful, it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits.
– E.g., ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.
[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US], across ESnet (AS293) [US], GÉANT (AS20965) [Europe], and DFN (AS680) [Germany], to a user destination at DESY (AS1754) [Germany]. Each network runs a local inter-domain controller (IDC) – OSCARS in ESnet, AutoBAHN in GÉANT – and there is a data plane connection helper at each domain ingress/egress point.
1) The domains exchange topology information containing at least the potential VC ingress and egress points. 2) A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved. 3) The data plane connection (e.g., Ethernet VLAN to VLAN connection) is facilitated by a helper process at each domain boundary.]
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g., CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system.
• Multi-domain circuit setup is not yet a robust production service, but progress is being made.
• See lhcone.net
65
8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science.
– With each generation of network transport technology:
• 155 Mb/s was the norm for high speed networks in 1995;
• 100 Gb/s – 650 times greater – is the norm today.
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to:
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
• and then do the development necessary for applications to make use of the new capabilities.
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths;
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s.
66
Provide R&D, consulting, and knowledge base
• Providing consulting on the problems that data-intensive projects are having in effectively using the network is critical.
Using the knowledge gained from that problem solving to build a community knowledge base benefits everyone.
The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base: http://fasterdata.es.net topics
– Network architecture, including the Science DMZ model
– Host tuning
– Network tuning
– Data transfer tools
– Network performance testing
– With special sections on:
• Linux TCP tuning
• Cisco 6509 tuning
• perfSONAR howto
• Active perfSONAR services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP issues explained
• fasterdata.es.net is a community project with contributions from several organizations.
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
69
Infrastructure Critical to Science
• The combination of:
– new network architectures in the wide area,
– new network services (such as guaranteed bandwidth virtual circuits),
– cross-domain network error detection and correction,
– redesigning the site LAN to handle high data throughput,
– automation of data movement systems, and
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instruments take data continuously.
• The data is generated at, or sent to, a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.
71
LHC lessons of possible use to the SKA
The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to keep the working data set in one location.
– A deep archive (tape-only) copy in one location (e.g., the SKA supercomputer center) is probably practical, and this is done at CERN for the LHC.
– The technical aspects of building and operating a centralized working data repository –
• a large mass storage system with very large cache disks in order to satisfy current requests in an acceptable time, and
• high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites –
argue against a single large data center.
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
– it decentralizes costs and involves many countries directly in the telescope infrastructure;
– it divides up the network load, especially on the expensive trans-ocean links;
– it divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that, in the case of the SKA, the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring and close cooperation of the R&E networks involved in providing parts of the path.
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.
Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
75
The Message
Again ... A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet, Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006, 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
Also see http://www.perfsonar.net and http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
27
perfSONARThe test and monitor functions can detect soft errors that limit
throughput and can be hard to find (hard errors faults are easily found and corrected)
Soft failure examplebull Observed end-to-end performance degradation due to soft failure of single optical line card
Gb
s
normal performance
degrading performance
repair
bull Why not just rely on ldquoSNMPrdquo interface stats for this sort of error detectionbull not all error conditions show up in SNMP interface statisticsbull SNMP error statistics can be very noisybull some devices lump different error counters into the same bucket so it can be very
challenging to figure out what errors to alarm on and what errors to ignorebull though ESnetrsquos Spectrum monitoring system attempts to apply heuristics to do this
bull many routers will silently drop packets - the only way to find that is to test through them and observe loss using devices other than the culprit device
one month
28
perfSONARThe value of perfSONAR increases dramatically as it is
deployed at more sites so that more of the end-to-end (app-to-app) path can characterized across multiple network domains provides the only widely deployed tool that can monitor circuits end-
to-end across the different networks from the US to Europendash ESnet has perfSONAR testers installed at every PoP and all but the
smallest user sites ndash Internet2 is close to the same
bull perfSONAR comes out of the work of the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) and the protocol is implemented using SOAP XML messages
29
4) System software evolution and optimizationOnce the network is error-free there is still the issue of
efficiently moving data from the application running on a user system onto the network
bull Host TCP tuningbull Modern TCP stack (see above)bull Other issues (MTU etc)
bull Data transfer tools and parallelism
bull Other data transfer issues (firewalls etc)
30
41) System software tuning Host tuning ndash TCPbull ldquoTCP tuningrdquo commonly refers to the proper configuration of
TCP windowing buffers for the path lengthbull It is critical to use the optimal TCP send and receive socket
buffer sizes for the path (RTT) you are using end-to-endDefault TCP buffer sizes are typically much too small for
todayrsquos high speed networksndash Until recently default TCP sendreceive buffers were typically 64 KBndash Tuned buffer to fill CA to NY 1 Gbs path 10 MB
bull 150X bigger than the default buffer size
31
System software tuning Host tuning ndash TCPbull Historically TCP window size tuning parameters were host-
global with exceptions configured per-socket by applicationsndash How to tune is a function of the application and the path to the
destination so potentially a lot of special cases
Auto-tuning TCP connection buffer size within pre-configured limits helps
Auto-tuning however is not a panacea because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (eg international) paths
32
System software tuning Host tuning ndash TCP
Throughput out to ~9000 km on a 10Gbs network32 MBy (auto-tuned) vs 64 MBy (hand tuned) TCP window size
hand tuned to 64 MBy window
Roundtrip time ms (corresponds roughlyto San Francisco to London)
path length
10000900080007000600050004000300020001000
0
Thro
ughp
ut M
bs
auto tuned to 32 MBy window
33
42) System software tuning Data transfer toolsParallelism is key in data transfer tools
ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection
bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)
ndash Several tools offer parallel transfers (see below)
Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN
transfersndash Many tools and protocols assume latencies typical of a LAN
environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long
path networks
bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more
than about 500 Mbs
34
System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL
RTT = 53 ms network capacity = 10GbpsTool Throughput
bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology
bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase
bull this helps rsync too
35
System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-
performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open
ports) ssh etc The newer Globus Online incorporates all of these and small file
support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community
outside of HEP
36
System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach
ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node
ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and
httpmonalisacernchFDT
37
44) System software tuning Other issuesFirewalls are anathema to high-peed data flows
ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for
TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo
Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf
bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning
bull Defaults are usually fine for 1GE but 10GE often requires additional tuning
ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo
([HPBulk])
5) Site infrastructure to support data-intensive scienceThe Science DMZ
With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the
bottleneckThe site network (LAN) typically provides connectivity for local
resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network
and the local area site network is critical for large-scale data movement
Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks
for business and small data-flow purposes usually donrsquot work for large-scale data flows
bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data
flows
39
The Science DMZTo provide high data-rate access to local resources the site
LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high
speed data path all the way back to the source
40
The Science DMZThe ScienceDMZ concept
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and
rapid fault isolation typically perfSONAR (see [perfSONAR] and below)
A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)
This is so important it was a requirement for last round of NSF CC-NIE grants
41
The Science DMZ
(See httpfasterdataesnetscience-dmz
and [SDMZ] for a much more complete
discussion of the various approaches)
campus siteLAN
high performanceData Transfer Node
computing cluster
cleanhigh-bandwidthWAN data path
campussiteaccess to
Science DMZresources is via the site firewall
secured campussiteaccess to Internet
border routerWAN
Science DMZrouterswitch
campus site
Science DMZ
Site DMZ WebDNS
Mail
network monitoring and testing
A WAN-capable device
per-servicesecurity policycontrol points
site firewall
dedicated systems built and
tuned for wide-area data transfer
42
6) Data movement and management techniquesAutomated data movement is critical for moving 500
terabytesday between 170 international sites In order to effectively move large amounts of data over the
network automated systems must be used to manage workflow and error recovery
bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers
bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)
43
Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the
analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates
compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day
bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10
petabytes of datayear in order to accomplish its science
bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial
44
DDMAgent
DDMAgent
ATLAS production
jobs
Regional production
jobs
User Group analysis jobs
Data Service
Task Buffer(job queue)
Job Dispatcher
PanDA Server(task management)
Job Broker
Policy(job type priority)
ATLA
S Ti
er 1
Data
Cen
ters
11 s
ites
scat
tere
d ac
ross
Euro
pe N
orth
Am
erica
and
Asia
in
aggr
egat
e ho
ld 1
copy
of a
ll dat
a an
d pr
ovide
the
work
ing
data
set f
or d
istrib
ution
to T
ier 2
cen
ters
for a
nalys
isDistributed
Data Manager
Pilot Job(Panda job
receiver running under the site-
specific job manager)
Grid Scheduler
Site Capability Service
CERNATLAS detector
Tier 0 Data Center(1 copy of all data ndash
archival only)
Job resource managerbull Dispatch a ldquopilotrdquo job manager - a
Panda job receiver - when resources are available at a site
bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA
bull Similar to the Condor Glide-in approach
Site status
ATLAS analysis sites(eg 70 Tier 2 Centers in
Europe North America and SE Asia)
DDMAgent
DDMAgent
1) Schedules jobs initiates data movement
2) DDM locates data and moves it to sites
This is a complex system in its own right called DQ2
3) Prepares the local resources to receive Panda jobs
4) Jobs are dispatched when there are resources available and when the required data is
in place at the site
Thanks to Michael Ernst US ATLAS technical lead for his assistance with this
diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)
The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday
CERN
Try to move the job to where the data is else move data and job to where
resources are available
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe N America and SE Asia generate network
data movement of 730 TByday ~68Gbs
Accumulated data volume on disk
730 TBytesday
PanDA manages 120000ndash140000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movementgoing on 24 hrday 9+ monthsyr
that networks must support in order to enable the large-scale science of the LHC
0
50
100
150
Peta
byte
s
four years
0
50000
100000
type
2jo
bs
0
50000
type
1jo
bs
one year
one year
one year
46
Building an LHC-scale production analysis system In order to debug and optimize the distributed system that
accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in
ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC
production
47
Ramp-up of LHC traffic in ESnet
(est of ldquosmallrdquo scale traffic)
LHC
turn
-on
LHC data systemtesting
LHC operationThe transition from testing to operation
was a smooth continuum due toat-scale testing ndash a process that took
more than 5 years
48
6 cont) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN
– The LHCOPN is a collection of leased 10 Gb/s optical circuits
– The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community
• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance
– The security issues were the primary ones, and they were addressed by:
• using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
– that is, only LHC data and compute servers are connected to the OPN
50
The LHC OPN ndash Optical Private Network
UK-T1_RAL
NDGF
FR-CCIN2P3
ES-PIC
DE-KIT
NL-T1
US-FNAL-CMS
US-T1-BNL
CA-TRIUMF
TW-ASGC
IT-INFN-CNAF
CH-CERN
LHCOPN physical (abbreviated)
LHCOPN architecture
51
The LHC OPN – Optical Private Network
NB:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below)
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
– (there are about 170 Tier 2 sites)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GEANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineers
– to ensure continued good performance for the LHC and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC
54
LHCONE: A global infrastructure for the LHC Tier 1 data center – Tier 2 analysis center connectivity (April 2012)
[Map: LHCONE VRF domains – ESnet (USA), Internet2 (USA), CANARIE (Canada), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), GÉANT (Europe), ASGC and TWAREN (Taiwan), KREONET2 and KISTI (Korea), TIFR (India), CUDI (Mexico), and CERN (Geneva) – interconnected at regional R&E communication nexus points such as Seattle, Chicago, New York, Washington, Amsterdam, and Geneva. End sites are LHC Tier 2 or Tier 3 unless indicated as Tier 1 (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1a/c, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1). Data communication links are 10, 20, and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance
• See lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
CERN → T1        miles    kms
France             350      565
Italy              570      920
UK                 625     1000
Netherlands        625     1000
Germany            700     1185
Spain              850     1400
Nordic            1300     2100
USA – New York    3900     6300
USA – Chicago     4400     7100
Canada – BC       5200     8400
Taiwan            6100     9850
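These distances set a floor on the round-trip times between CERN and the Tier 1 centers, and therefore on the TCP buffer sizes (bandwidth-delay products) that transfers must be tuned for. A rough illustrative calculation, assuming ~2×10^8 m/s signal speed in fiber and a 10 Gb/s LHCOPN circuit (real fiber routes are longer than the distances in the table, so real RTTs are higher):

# Rough lower bounds on RTT and bandwidth-delay product (BDP) for CERN -> Tier 1 paths.
FIBER_SPEED_M_PER_S = 2.0e8   # approximate signal speed in optical fiber
LINK_GBPS = 10                # LHCOPN circuits are 10 Gb/s

paths_km = {"Italy": 920, "Germany": 1185, "USA - New York": 6300, "Taiwan": 9850}

for name, km in paths_km.items():
    rtt_s = 2 * km * 1000 / FIBER_SPEED_M_PER_S
    bdp_mbytes = LINK_GBPS * 1e9 * rtt_s / 8 / 1e6   # buffer needed to keep the pipe full
    print(f"{name:15s} RTT >= {rtt_s * 1000:5.1f} ms   BDP at 10 Gb/s >= {bdp_mbytes:6.1f} MBytes")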
[Figure: A Network Centric View of the LHC. Data flows from the detector (~1 PB/s) through the Level 1 and 2 triggers (O(1–10) meter), the Level 3 trigger (O(10–100) meters), and into the CERN Computer Center (O(1) km). CERN sends 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) over the LHC Optical Private Network (LHCOPN, 500–10,000 km) to the LHC Tier 1 Data Centers: Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN. The LHC Open Network Environment (LHCONE) connects the Tier 1 centers to the LHC Tier 2 Analysis Centers – the universities and physics groups – indicating that the physics groups now get their data wherever it is most readily available.]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed application systems in order to
– couple existing pockets of code, data, and expertise into "systems of systems"
– break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
– see https://www.es.net/about/science-requirements/
A commonly identified need to support this is that networking must be provided as a "service":
– schedulable with guaranteed bandwidth – as is done with CPUs and disks
– traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
– some network path characteristics may also be specified – e.g. diversity
– available in a Web Services / Grid Services paradigm
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
– This is typically done by using a "static" routing mechanism
• e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
– MPLS and OpenFlow are examples of this, and both can transport IP packets
– Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
– The virtual circuits can be directed to specific physical network paths when they are set up
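At its core, a schedulable guaranteed-bandwidth service is an admission-control problem: a reservation is accepted only if every link on the chosen path has enough unreserved capacity for the requested time window. The sketch below is purely illustrative – the field names, link names, and capacities are hypothetical, and this is not the OSCARS or NSI interface:

# Illustrative admission-control check for a guaranteed-bandwidth reservation
# (hypothetical data model -- not the OSCARS or NSI interface).
from dataclasses import dataclass

@dataclass
class Reservation:
    path: list      # ordered list of link names along the circuit
    gbps: float     # requested guaranteed bandwidth
    start: float    # reservation start time (hours, arbitrary epoch)
    end: float      # reservation end time

LINK_CAPACITY_GBPS = {"siteA-esnet": 100, "esnet-geant": 100, "geant-siteB": 10}
accepted = []

def overlaps(a, b):
    return a.start < b.end and b.start < a.end

def admit(req):
    # Accept only if every link can carry the request on top of overlapping reservations.
    for link in req.path:
        already = sum(r.gbps for r in accepted if link in r.path and overlaps(r, req))
        if already + req.gbps > LINK_CAPACITY_GBPS[link]:
            return False
    accepted.append(req)
    return True

print(admit(Reservation(["siteA-esnet", "esnet-geant", "geant-siteB"], 8, 0, 4)))  # True
print(admit(Reservation(["geant-siteB"], 5, 2, 6)))  # False: the 10 Gb/s link is nearly full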
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead, Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference, 2011, in the references
• OSCARS received a 2013 "R&D 100" award
60
End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part
• How are the circuits used?
– End system to end system IP
• Almost never – very hard unless private address space is used
– Using public address space can result in leaking routes
– Using private address space with multi-homed hosts risks allowing backdoors into secure networks
– End system to end system Ethernet (or other) over VLAN – a pseudowire
• Relatively common
• Interesting example: RDMA over VLAN is likely to be popular in the future
– SC11 demo of 40G RDMA over WAN was very successful
– CPU load for RDMA is a small fraction of that of IP
– The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks)
– Point-to-point connection between routing instances – e.g. BGP at the end points
• Essentially this is how all current circuits are used, from one site router to another site router
– Typically site-to-site, or to advertise subnets that host clusters, e.g. LHC analysis or data management clusters
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering
Cross-Domain Virtual Circuit Service
Network domains (administrative units):
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
– e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains
FNAL (AS3152) [US]
ESnet (AS293) [US]
GEANT (AS20965) [Europe]
DFN (AS680) [Germany]
DESY (AS1754) [Germany]
Topology exchange
VC setup request
Local InterDomain Controller
Local IDC
Local IDC
Local IDC
Local IDC
VC setup request
VC setup request
VC setup request
OSCARS
User source
User destination
VC setup request
data plane connection helper at each domain ingress/egress point
data plane connection helper at each domain ingress/egress point
1. The domains exchange topology information containing at least potential VC ingress and egress points
2. VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved
3. Data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process
The end-to-end virtual circuit
AutoBAHN
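The essential behavior shown in the diagram – each domain controller reserves its own segment and forwards the request to the next domain, and the end-to-end circuit exists only if every domain succeeds – can be illustrated with a toy model. The names, capacities, and rollback logic below are hypothetical simplifications, not the actual IDC or NSI protocol:

# Toy model of inter-domain virtual circuit setup (hypothetical -- not the real IDC/NSI protocol).
# Each domain reserves its local segment and passes the request on; if any domain
# refuses, the partial reservations are released.
def setup_circuit(domains, vc_request):
    reserved = []
    for domain in domains:                      # e.g. FNAL -> ESnet -> GEANT -> DFN -> DESY
        if not domain["reserve"](vc_request):
            for d in reversed(reserved):        # roll back segments reserved so far
                d["release"](vc_request)
            return False
        reserved.append(domain)
    return True                                 # the end-to-end virtual circuit is in place

def make_domain(name, available_gbps):
    state = {"free": available_gbps}
    def reserve(req):
        if state["free"] >= req["gbps"]:
            state["free"] -= req["gbps"]
            print(f"{name}: segment reserved")
            return True
        print(f"{name}: request refused (insufficient capacity)")
        return False
    def release(req):
        state["free"] += req["gbps"]
        print(f"{name}: segment released")
    return {"name": name, "reserve": reserve, "release": release}

chain = [make_domain(n, cap) for n, cap in
         [("FNAL", 10), ("ESnet", 100), ("GEANT", 100), ("DESY", 5)]]
print(setup_circuit(chain, {"gbps": 8}))  # False: the last domain cannot carry 8 Gb/s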
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grid framework, so that computing, data access, and data movement can all work together as a predictable system
– Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net
65
8) Provide R&D consulting and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
– With each generation of network transport technology:
• 155 Mb/s was the norm for high-speed networks in 1995
• 100 Gb/s – 650 times greater – is the norm today
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
• and then do the development necessary for applications to make use of the new capabilities
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
66
Provide R&D consulting and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations
67
The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
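Many of the host-tuning items above reduce to checking a handful of kernel settings. Below is a small Linux-specific sketch that reports the TCP buffer limits most relevant to long-RTT transfers (the /proc/sys paths are standard on Linux; the recommended values themselves are documented at fasterdata.es.net):

# Report the Linux kernel settings that limit TCP buffer sizes on long-RTT paths (Linux only).
from pathlib import Path

SETTINGS = [
    "net/core/rmem_max",               # maximum receive socket buffer (bytes)
    "net/core/wmem_max",               # maximum send socket buffer (bytes)
    "net/ipv4/tcp_rmem",               # min / default / max TCP receive buffer (bytes)
    "net/ipv4/tcp_wmem",               # min / default / max TCP send buffer (bytes)
    "net/ipv4/tcp_congestion_control", # congestion control algorithm in use
]

for setting in SETTINGS:
    path = Path("/proc/sys") / setting
    value = path.read_text().strip() if path.exists() else "not available on this system"
    print(f"{setting:34s} {value}")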
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
69
Infrastructure Critical to Science
• The combination of:
– new network architectures in the wide area
– new network services (such as guaranteed bandwidth virtual circuits)
– cross-domain network error detection and correction
– redesigning the site LAN to handle high data throughput
– automation of data movement systems
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKA
The lessons:
The science data product (output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
– A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository:
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
• high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
militate against a single large data center
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply:
– There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
• It might be that, in the case of the SKA, the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
– If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical: the Science DMZ
Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
75
The Message
Again: a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
petabytes of datayear in order to accomplish its science
bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial
44
DDMAgent
DDMAgent
ATLAS production
jobs
Regional production
jobs
User Group analysis jobs
Data Service
Task Buffer(job queue)
Job Dispatcher
PanDA Server(task management)
Job Broker
Policy(job type priority)
ATLA
S Ti
er 1
Data
Cen
ters
11 s
ites
scat
tere
d ac
ross
Euro
pe N
orth
Am
erica
and
Asia
in
aggr
egat
e ho
ld 1
copy
of a
ll dat
a an
d pr
ovide
the
work
ing
data
set f
or d
istrib
ution
to T
ier 2
cen
ters
for a
nalys
isDistributed
Data Manager
Pilot Job(Panda job
receiver running under the site-
specific job manager)
Grid Scheduler
Site Capability Service
CERNATLAS detector
Tier 0 Data Center(1 copy of all data ndash
archival only)
Job resource managerbull Dispatch a ldquopilotrdquo job manager - a
Panda job receiver - when resources are available at a site
bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA
bull Similar to the Condor Glide-in approach
Site status
ATLAS analysis sites(eg 70 Tier 2 Centers in
Europe North America and SE Asia)
DDMAgent
DDMAgent
1) Schedules jobs initiates data movement
2) DDM locates data and moves it to sites
This is a complex system in its own right called DQ2
3) Prepares the local resources to receive Panda jobs
4) Jobs are dispatched when there are resources available and when the required data is
in place at the site
Thanks to Michael Ernst US ATLAS technical lead for his assistance with this
diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)
The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday
CERN
Try to move the job to where the data is else move data and job to where
resources are available
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe N America and SE Asia generate network
data movement of 730 TByday ~68Gbs
Accumulated data volume on disk
730 TBytesday
PanDA manages 120000ndash140000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movementgoing on 24 hrday 9+ monthsyr
that networks must support in order to enable the large-scale science of the LHC
0
50
100
150
Peta
byte
s
four years
0
50000
100000
type
2jo
bs
0
50000
type
1jo
bs
one year
one year
one year
46
Building an LHC-scale production analysis system In order to debug and optimize the distributed system that
accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in
ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC
production
47
Ramp-up of LHC traffic in ESnet
(est of ldquosmallrdquo scale traffic)
LHC
turn
-on
LHC data systemtesting
LHC operationThe transition from testing to operation
was a smooth continuum due toat-scale testing ndash a process that took
more than 5 years
48
6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument
to data centers ndash a dedicated purpose-built infrastructure is needed
bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to
the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the
Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward
exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community
bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by
bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
ndash that is only LHC data and compute servers are connected to the OPN
50
The LHC OPN ndash Optical Private Network
UK-T1_RAL
NDGF
FR-CCIN2P3
ES-PIC
DE-KIT
NL-T1
US-FNAL-CMS
US-T1-BNL
CA-TRIUMF
TW-ASCG
IT-NFN-CNAF
CH-CERNLHCOPN physical (abbreviated)
LHCOPN architecture
51
The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1
centers data transfer was to use dedicated physical 10G circuits
Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than
5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)
ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)
bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism
bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose
53
The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure
designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each
involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc
ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure
that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC
54
ESnetUSA
Chicago
New YorkBNL-T1
Internet2USA
Harvard
CANARIECanada
UVic
SimFraU
TRIUMF-T1
UAlb UTorMcGilU
Seattle
TWARENTaiwan
NCU NTU
ASGCTaiwan
ASGC-T1
KERONET2Korea
KNU
LHCONE VRF domain
End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1
Regional RampE communication nexus
Data communication links 10 20 and 30 Gbs
See httplhconenet for details
NTU
Chicago
LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity
NORDUnetNordic
NDGF-T1aNDGF-T1a NDGF-T1c
DFNGermany
DESYGSI DE-KIT-T1
GARRItaly
INFN-Nap CNAF-T1
RedIRISSpain
PIC-T1
SARANetherlands
NIKHEF-T1
RENATERFrance
GRIF-IN2P3
Washington
CUDIMexico
UNAM
CC-IN2P3-T1Sub-IN2P3
CEA
CERNGeneva
CERN-T1
SLAC
GLakes
NE
MidWSoW
Geneva
KISTIKorea
TIFRIndia
India
Korea
FNAL-T1
MIT
CaltechUFlorida
UNebPurU
UCSDUWisc
UltraLightUMich
Amsterdam
GEacuteANT Europe
April 2012
55
The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because
ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for
use by the LHC collaboration that cannot be made available for general RampE traffic
bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance
bull See LHCONEnet
LHCONE is one part of the network infrastructure that supports the LHC
CERN rarrT1 miles kms
France 350 565
Italy 570 920
UK 625 1000
Netherlands 625 1000
Germany 700 1185
Spain 850 1400
Nordic 1300 2100
USA ndash New York 3900 6300
USA - Chicago 4400 7100
Canada ndash BC 5200 8400
Taiwan 6100 9850
CERN Computer Center
The LHC Optical Private Network
(LHCOPN)
LHC Tier 1Data Centers
LHC Tier 2 Analysis Centers
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups Universities
physicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
The LHC Open Network
Environment(LHCONE)
50 Gbs (25Gbs ATLAS 25Gbs CMS)
detector
Level 1 and 2 triggers
Level 3 trigger
O(1-10) meter
O(10-100) meters
O(1) km
1 PBs
500-10000 km
This is intended to indicate that the physics
groups now get their datawherever it is most readily
available
A Network Centric View of the LHC
Taiwan Canada USA-Atlas USA-CMS
Nordic
UK
Netherlands Germany Italy
Spain
FranceCERN
57
7) New network servicesPoint-to-Point Virtual Circuit Service
Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly
consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of
systemsrdquondash Break up the task of massive data analysis and use data compute and
storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements
A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work
well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm
58
Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual
circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism
bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path
ndash MPLS and OpenFlow are examples of this and both can transport IP packets
ndash Most modern Internet routers have this type of functionality
bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths
when they are set up
59
Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service
(For more information contact the project lead Chin Guok chinesnet)
bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references
bull OSCARS received a 2013 ldquoRampD 100rdquo award
60
End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo
ndash Sites for the most part
bull How are the circuits usedndash End system to end system IP
bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure
networks
ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future
ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP
protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
ndash Point-to-point connection between routing instance ndash eg BGP at the end points
bull Essentially this is how all current circuits are used from one site router to another site router
ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters
61
End User View of Circuits ndash How They Use Thembull When are the circuits used
ndash Mostly to solve a specific problem that the general infrastructure cannot
bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering
network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains
involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET
(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains
63
Inter-Domain Control Protocolbull There are two realms involved
1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains
2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains
FNAL (AS3152)[US]
ESnet (AS293)[US]
GEANT (AS20965)[Europe]
DFN (AS680)[Germany]
DESY (AS1754)[Germany]
Topology exchange
VC setup request
Local InterDomain
Controller
Local IDC
Local IDC
Local IDC
Local IDC
VC setup request
VC setup request
VC setup request
OSCARS
User source
User destination
VC setup request
data plane connection helper at each domain ingressegress point
data plane connection helper at each domain ingressegress point
1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to
domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process
The end-to-end virtual circuit
AutoBAHN
64
Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved
into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated
Facility - an international virtual organization that promotes the paradigm of lambda networking)
bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet
65
8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the
network to support data-intensive sciencendash With each generation of network transport technology
bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and
application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo
end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of
the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities
includebull experiments in the 1990s in using parallel disk IO and parallel network IO
together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths
bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs
66
Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects
are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build
a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at
httpfasterdataesnet and contains contributions from several organizations
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply:
– There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
– If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded:
– All high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.
Workflow management systems that automate the data movement will have to be designed and tested:
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
75
The Message
Again … A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document".
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOE's Office of Science
DOE Office of Science and ESnet – the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
4.1) System software tuning: Host tuning – TCP
System software tuning: Host tuning – TCP
System software tuning: Host tuning – TCP
4.2) System software tuning: Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
4.4) System software tuning: Other issues
5) Site infrastructure to support data-intensive science: The Science DMZ
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN – Optical Private Network
The LHC OPN – Optical Private Network (2)
The LHC OPN – Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHC's Open Network Environment – LHCONE
Slide 54
The LHC's Open Network Environment – LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits – How They Use Them
End User View of Circuits – How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide R&D consulting and knowledge base
Provide R&D consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
29
4) System software evolution and optimization
Once the network is error-free, there is still the issue of efficiently moving data from the application running on a user system onto the network:
• Host TCP tuning
• Modern TCP stack (see above)
• Other issues (MTU, etc.)
• Data transfer tools and parallelism
• Other data transfer issues (firewalls, etc.)
30
4.1) System software tuning: Host tuning – TCP
• "TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length.
• It is critical to use the optimal TCP send and receive socket buffer sizes for the path (RTT) you are using, end-to-end.
• Default TCP buffer sizes are typically much too small for today's high speed networks:
– Until recently, default TCP send/receive buffers were typically 64 KB.
– Tuned buffer to fill a CA-to-NY 1 Gb/s path: 10 MB,
• 150X bigger than the default buffer size.
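The required buffer is roughly the bandwidth-delay product (BDP) of the path. A minimal sketch of the arithmetic, assuming an illustrative 1 Gb/s path with ~80 ms round-trip time (consistent with the 10 MB figure above):

```python
# Rough bandwidth-delay product (BDP) estimate for TCP buffer sizing.
# Assumption: a 1 Gb/s path with ~80 ms RTT (CA to NY is illustrative).

def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Return the bandwidth-delay product in bytes."""
    return bandwidth_bps * rtt_seconds / 8.0

if __name__ == "__main__":
    path_bw = 1e9      # 1 Gb/s
    path_rtt = 0.080   # 80 ms round-trip time
    print(f"BDP ~ {bdp_bytes(path_bw, path_rtt) / 1e6:.0f} MB")  # ~10 MB
    # A 64 KB default buffer caps throughput at roughly
    # 64 KB / RTT, i.e. only ~6.5 Mb/s on this path.
```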
31
System software tuning: Host tuning – TCP
• Historically, TCP window size tuning parameters were host-global, with exceptions configured per-socket by applications.
– How to tune is a function of the application and the path to the destination, so potentially a lot of special cases.
• Auto-tuning TCP connection buffer size within pre-configured limits helps.
• Auto-tuning, however, is not a panacea, because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (e.g. international) paths.
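As an illustration of per-socket (application-level) tuning, a sketch that explicitly requests large send/receive buffers on a socket; the 32 MB value and the endpoint are assumptions, and the kernel will still clamp the request to its configured host-level maxima (e.g. net.core.rmem_max on Linux), which is why those limits matter for long paths:

```python
import socket

# Ask for large per-socket buffers (illustrative 32 MB); the OS silently
# caps these at its configured maximums.
BUF_BYTES = 32 * 1024 * 1024

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_BYTES)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_BYTES)

# Report what the OS actually granted before connecting.
print("send buffer:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
print("recv buffer:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))

# sock.connect(("dtn.example.org", 2811))  # hypothetical data transfer node
```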
32
System software tuning: Host tuning – TCP
[Figure: Throughput out to ~9,000 km on a 10 Gb/s network, comparing a 32 MBy (auto-tuned) and a 64 MBy (hand-tuned) TCP window size. X-axis: round-trip time in ms / path length, the far end corresponding roughly to San Francisco to London; Y-axis: throughput in Mb/s.]
33
4.2) System software tuning: Data transfer tools
Parallelism is key in data transfer tools:
– It is much easier to achieve a given performance level with multiple parallel connections than with one connection,
• because the OS is very good at managing multiple threads and less good at sustained, maximum performance of a single thread (the same is true for disks).
– Several tools offer parallel transfers (see below).
Latency tolerance is critical:
– Wide area data transfers have much higher latency than LAN transfers.
– Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds); for example, SCP/SFTP and the HPSS mover protocols work very poorly in long-path networks.
• Disk performance:
– In general, a RAID array or parallel disks (as in FDT) are needed to get more than about 500 Mb/s.
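To illustrate why parallel connections help, a toy sketch that fetches byte ranges of a file over several concurrent HTTP connections and reassembles them; the URL is a hypothetical placeholder, and real tools such as GridFTP and FDT implement this idea far more completely:

```python
import concurrent.futures
import urllib.request

URL = "https://example.org/large-dataset.bin"   # hypothetical placeholder
STREAMS = 4

def total_size(url: str) -> int:
    # HEAD request to learn the file size (assumes the server reports it).
    with urllib.request.urlopen(urllib.request.Request(url, method="HEAD")) as r:
        return int(r.headers["Content-Length"])

def fetch_range(url: str, start: int, end: int) -> bytes:
    # Fetch one byte range on its own TCP connection.
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as r:
        return r.read()

def parallel_fetch(url: str, streams: int) -> bytes:
    size = total_size(url)
    chunk = size // streams
    ranges = [(i * chunk, size - 1 if i == streams - 1 else (i + 1) * chunk - 1)
              for i in range(streams)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=streams) as pool:
        parts = pool.map(lambda r: fetch_range(url, *r), ranges)
    return b"".join(parts)

if __name__ == "__main__":
    data = parallel_fetch(URL, STREAMS)
    print(f"received {len(data)} bytes over {STREAMS} parallel streams")
```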
34
System software tuning: Data transfer tools
Using the right tool is very important.
Sample results, Berkeley, CA to Argonne, IL (RTT = 53 ms, network capacity = 10 Gbps):
Tool                    Throughput
• scp:                  140 Mbps
• patched scp (HPN):    1.2 Gbps
• ftp:                  1.4 Gbps
• GridFTP, 4 streams:   5.4 Gbps
• GridFTP, 8 streams:   6.6 Gbps
Note that to get more than about 1 Gbps (125 MB/s) disk-to-disk requires using RAID technology.
• PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSH:
– http://www.psc.edu/networking/projects/hpn-ssh
– Significant performance increase; this helps rsync too.
35
System software tuning: Data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems:
– Parallel streams and buffer tuning; help in getting through firewalls (open ports), ssh, etc.
– The newer Globus Online incorporates all of these plus small-file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP.
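As a usage illustration (not taken from the slides), a GridFTP transfer with parallel streams is typically launched with globus-url-copy; the endpoints below are hypothetical, and -p and -tcp-bs are the standard parallelism and TCP buffer-size options:

```python
import subprocess

# Sketch: drive a GridFTP third-party transfer with parallel streams.
# Endpoints are hypothetical placeholders.
cmd = [
    "globus-url-copy",
    "-p", "8",                   # 8 parallel TCP streams
    "-tcp-bs", "16777216",       # 16 MB TCP buffer per stream
    "gsiftp://dtn1.site-a.example.org/data/run001.tar",
    "gsiftp://dtn2.site-b.example.org/ingest/run001.tar",
]
subprocess.run(cmd, check=True)
```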
36
System software tuning: Data transfer tools
Also see Caltech's FDT (Fast Data Transfer) approach:
– Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node.
– Explicit parallel use of multiple disks.
– Can fill 100 Gb/s paths.
– See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT
37
4.4) System software tuning: Other issues
Firewalls are anathema to high-speed data flows:
– Many firewalls can't handle >1 Gb/s flows:
• designed for large numbers of low-bandwidth flows;
• some firewalls even strip out TCP options that allow for TCP buffers >64 KB.
– See Jason Zurawski's "Say Hello to your Frienemy – The Firewall".
– Stateful firewalls have inherent problems that inhibit high throughput:
• http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues:
– Large MTUs (several issues).
– NIC tuning:
• defaults are usually fine for 1GE, but 10GE often requires additional tuning.
– Other OS tuning knobs.
– See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk]).
5) Site infrastructure to support data-intensive science: The Science DMZ
With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck.
The site network (LAN) typically provides connectivity for local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science:
– Therefore a high performance interface between the wide area network and the local area site network is critical for large-scale data movement.
Campus network infrastructure is typically not designed to handle the flows of large-scale science:
– The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows:
• firewalls, proxy servers, low-cost switches, and so forth,
• none of which will allow high-volume, high-bandwidth, long-distance data flows.
39
The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large data volume, high round trip time (RTT) (international paths) of the wide area network (WAN) flows (see [DIS]):
– otherwise the site will impose poor performance on the entire high-speed data path, all the way back to the source.
40
The Science DMZ
The Science DMZ concept: the compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy:
– Outside the site firewall – hence the term "Science DMZ".
– With dedicated systems built and tuned for wide-area data transfer.
– With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below).
– A security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. that supports access control lists, private address space, etc.).
This is so important it was a requirement for the last round of NSF CC-NIE grants.
41
The Science DMZ
(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)
[Diagram: the border router connects the WAN to a Science DMZ router/switch (a WAN-capable device), providing a clean, high-bandwidth WAN data path to dedicated systems built and tuned for wide-area data transfer – a high-performance Data Transfer Node and a computing cluster – together with network monitoring and testing and per-service security policy control points. Campus/site access to the Science DMZ resources, and secured campus/site access to the Internet, go via the site firewall to the campus/site LAN; the Science DMZ sits outside the firewall, alongside the conventional site DMZ (Web, DNS, Mail).]
42
6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers.
• The Tier 2 sites get a comparable amount of data from the Tier 1s:
– Host the physics groups that analyze the data and do the science.
– Provide most of the compute resources for analysis.
– Cache the data (though this is evolving to remote I/O).
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management:
– The resources and data movement are centrally managed.
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations.
– The system manages tens of thousands of jobs a day:
• coordinates data movement of hundreds of terabytes/day, and
• manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science.
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial.
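The brokering and pilot-job pattern described here can be illustrated with a toy sketch (an illustration of the general idea only, not PanDA's actual code or APIs; the site and dataset names are hypothetical):

```python
import queue
import threading

# Toy illustration of PanDA-style brokering plus the pilot-job pattern:
# the broker matches each job with a site that already holds its dataset
# ("move the job to the data"); a pilot running under each site's batch
# system then pulls work from that site's queue.
SITE_DATASETS = {                      # hypothetical sites and local datasets
    "BNL-T1": {"data12.periodA"},
    "DESY-T2": {"mc12.ttbar"},
}
site_queues = {site: queue.Queue() for site in SITE_DATASETS}

def broker(jobs):
    """Send each job to a site where its input dataset is already resident."""
    for job in jobs:
        for site, datasets in SITE_DATASETS.items():
            if job["dataset"] in datasets:
                site_queues[site].put(job)
                break
        else:
            print(f"no replica of {job['dataset']}: would trigger data movement")

def pilot(site):
    """Pilot job: pull and 'run' work until the site queue is drained."""
    q = site_queues[site]
    while not q.empty():
        job = q.get()
        print(f"[{site}] running {job['name']} on {job['dataset']}")

if __name__ == "__main__":
    broker([{"name": "analysis-1", "dataset": "data12.periodA"},
            {"name": "analysis-2", "dataset": "mc12.ttbar"}])
    threads = [threading.Thread(target=pilot, args=(s,)) for s in SITE_DATASETS]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```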
44
[Diagram: The ATLAS PanDA "Production and Distributed Analysis" system uses distributed resources and layers of automation to manage several million jobs/day.
• ATLAS production jobs, regional production jobs, and user/group analysis jobs enter a Task Buffer (job queue) via the Data Service; the PanDA Server (task management) uses a Job Broker and Policy (job type, priority) to feed the Job Dispatcher.
• Job resource manager: dispatch a "pilot" job manager – a PanDA job receiver – when resources are available at a site; pilots run under the local site job manager (e.g. Condor, LSF, LCG, …) and accept jobs in a standard format from PanDA (similar to the Condor glide-in approach). A Grid Scheduler and a Site Capability Service track site status.
• The Distributed Data Manager (a complex system in its own right, called DQ2) and its DDM agents locate data and move it to sites.
• CERN (ATLAS detector) hosts the Tier 0 Data Center (1 copy of all data – archival only). The 11 ATLAS Tier 1 data centers, scattered across Europe, North America, and Asia, in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis. The ATLAS analysis sites are, e.g., 70 Tier 2 centers in Europe, North America, and SE Asia.
• Sequence: 1) PanDA schedules jobs and initiates data movement; 2) DDM locates data and moves it to sites; 3) the pilot prepares the local resources to receive PanDA jobs; 4) jobs are dispatched when there are resources available and when the required data is in place at the site. The system tries to move the job to where the data is, else moves data and job to where resources are available.
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s.
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, shown separately here).
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
[Charts: accumulated data volume on disk (0–150 petabytes over four years, growing at 730 TBytes/day); number of simultaneous type 1 jobs (0–50,000) and type 2 jobs (0–100,000) over one year.]
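As a quick consistency check on the stated rates (simple arithmetic on the figures above, not new data):

```python
# 730 TBytes/day expressed as an average rate in Gb/s.
tbytes_per_day = 730
bits = tbytes_per_day * 1e12 * 8
seconds = 24 * 3600
print(f"{bits / seconds / 1e9:.1f} Gb/s")  # ~67.6 Gb/s, i.e. the ~68 Gb/s quoted above
```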
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure:
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges".
– Successful testing was required for sites to participate in LHC production.
47
Ramp-up of LHC traffic in ESnet
[Chart: ESnet traffic over time, marking an estimate of the earlier "small" scale traffic, the LHC turn-on, the LHC data system testing period, and LHC operation.]
The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.
48
6 cont) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN:
– The LHCOPN is a collection of leased 10 Gb/s optical circuits.
– The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.
• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance:
– The security issues were the primary ones, and were addressed by using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec]),
– that is, only LHC data and compute servers are connected to the OPN.
50
The LHC OPN – Optical Private Network
[Diagram: LHCOPN physical topology (abbreviated) and LHCOPN architecture – CH-CERN connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network
N.B.:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays:
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic:
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic.
– (There are about 170 Tier 2 sites.)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems:
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.).
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).
In this way the LHC traffic will use circuits designated by the network engineers:
– To ensure continued good performance for the LHC and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.
54
[Map: LHCONE – a global infrastructure for the LHC Tier 1 data center – Tier 2 analysis center connectivity (April 2012). LHCONE VRF domains and regional R&E communication nexuses include ESnet and Internet2 (USA), CANARIE (Canada), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), GÉANT (Europe), CERN (Geneva), CUDI (Mexico), ASGC and TWAREN (Taiwan), KREONET2 and KISTI (Korea), and TIFR (India), interconnecting Tier 1 centers (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1) and many Tier 2/Tier 3 end sites. Data communication links are 10, 20, and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because:
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
[Diagram: A Network Centric View of the LHC. The detector feeds the Level 1 and 2 triggers (O(1-10) meters away), which feed the Level 3 trigger (O(10-100) meters), which feeds the CERN Computer Center (O(1) km); annotated data rates include 1 PB/s at the detector/trigger stages and 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) out of CERN. The LHC Optical Private Network (LHCOPN) carries this data 500-10,000 km to the LHC Tier 1 data centers (Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France), and the LHC Open Network Environment (LHCONE) connects these to the LHC Tier 2 analysis centers (universities and physics groups). This is intended to indicate that the physics groups now get their data wherever it is most readily available.]

Distances from CERN to the Tier 1 centers:
CERN → T1        miles    kms
France             350      565
Italy              570      920
UK                 625    1,000
Netherlands        625    1,000
Germany            700    1,185
Spain              850    1,400
Nordic           1,300    2,100
USA – New York   3,900    6,300
USA – Chicago    4,400    7,100
Canada – BC      5,200    8,400
Taiwan           6,100    9,850
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to:
– couple existing pockets of code, data, and expertise into "systems of systems";
– break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites;
– see https://www.es.net/about/science-requirements/
• A commonly identified need to support this is that networking must be provided as a "service":
– schedulable, with guaranteed bandwidth – as is done with CPUs and disks;
– traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure;
– some network path characteristics may also be specified – e.g. diversity;
– available in a Web Services / Grid Services paradigm.
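As a purely illustrative sketch of what "network as a schedulable service" means (this is not the OSCARS or NSI API; the field names and endpoints are hypothetical), a bandwidth reservation request essentially carries two endpoints, a guaranteed bandwidth, and a time window:

```python
import json

# Hypothetical bandwidth-reservation request, for illustration only; the
# real OSCARS/NSI interfaces differ, but the essential parameters are these.
reservation = {
    "src_endpoint": "site-a.example.net:xe-1/0/0:vlan=3012",   # hypothetical
    "dst_endpoint": "site-b.example.net:xe-7/2/0:vlan=3012",   # hypothetical
    "bandwidth_mbps": 10000,
    "start_time": "2014-04-01T00:00:00Z",
    "end_time": "2014-04-08T00:00:00Z",
    "description": "bulk data transfer with guaranteed bandwidth",
}
print(json.dumps(reservation, indent=2))
```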
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet:
– This is typically done by using a "static" routing mechanism,
• e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path.
– MPLS and OpenFlow are examples of this, and both can transport IP packets.
– Most modern Internet routers have this type of functionality.
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic:
– The virtual circuits can be directed to specific physical network paths when they are set up.
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.
60
End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part.
• How are the circuits used?
– End system to end system, IP:
• Almost never – very hard unless private address space is used:
– using public address space can result in leaking routes;
– using private address space with multi-homed hosts risks allowing backdoors into secure networks.
– End system to end system, Ethernet (or other) over VLAN – a pseudowire:
• Relatively common.
• Interesting example: RDMA over VLAN is likely to be popular in the future:
– the SC11 demo of 40G RDMA over WAN was very successful;
– CPU load for RDMA is a small fraction of that for IP;
– the guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks).
– Point-to-point connection between routing instances – e.g. BGP at the end points:
• Essentially this is how all current circuits are used, from one site router to another site router.
• Typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters.
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot.
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering.

Cross-Domain Virtual Circuit Service
Network domains (administrative units):
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits.
– E.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.
[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] across ESnet (AS293) [US], GÉANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local inter-domain controller (IDC) – OSCARS in ESnet, AutoBAHN in GÉANT – plus a data-plane connection helper at each domain ingress/egress point; topology exchange and VC setup requests pass between the domains.
1. The domains exchange topology information containing at least the potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. Ethernet VLAN-to-VLAN connection) is facilitated by a helper process.]
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group:
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system.
– Multi-domain circuit setup is not yet a robust production service, but progress is being made.
• See lhcone.net
65
8) Provide R&D consulting and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science:
– With each generation of network transport technology:
• 155 Mb/s was the norm for high speed networks in 1995;
• 100 Gb/s – 650 times greater – is the norm today.
– R&D groups involving hardware engineers, computer scientists, and application specialists worked to:
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
• and then do the development necessary for applications to make use of the new capabilities.
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths;
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s.
66
Provide R&D consulting and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional
centers) has worked wellndash It decentralizes costs and involves many countries directly in the
telescope infrastructurendash It divides up the network load especially on the expensive trans-
ocean linksndash It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKARegardless of distributed vs centralized working data
repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source
to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a
centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue
bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC
ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
78
References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations
httpwwwperfsonarnet
httppspsperfsonarnet
[REQ] httpswwwesnetaboutscience-requirements
[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010
(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )
[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
30
41) System software tuning Host tuning ndash TCPbull ldquoTCP tuningrdquo commonly refers to the proper configuration of
TCP windowing buffers for the path lengthbull It is critical to use the optimal TCP send and receive socket
buffer sizes for the path (RTT) you are using end-to-endDefault TCP buffer sizes are typically much too small for
todayrsquos high speed networksndash Until recently default TCP sendreceive buffers were typically 64 KBndash Tuned buffer to fill CA to NY 1 Gbs path 10 MB
bull 150X bigger than the default buffer size
31
System software tuning Host tuning ndash TCPbull Historically TCP window size tuning parameters were host-
global with exceptions configured per-socket by applicationsndash How to tune is a function of the application and the path to the
destination so potentially a lot of special cases
Auto-tuning TCP connection buffer size within pre-configured limits helps
Auto-tuning however is not a panacea because the upper limits of the auto-tuning parameters are typically not adequate for high-speed transfers on very long (eg international) paths
32
System software tuning Host tuning ndash TCP
Throughput out to ~9000 km on a 10Gbs network32 MBy (auto-tuned) vs 64 MBy (hand tuned) TCP window size
hand tuned to 64 MBy window
Roundtrip time ms (corresponds roughlyto San Francisco to London)
path length
10000900080007000600050004000300020001000
0
Thro
ughp
ut M
bs
auto tuned to 32 MBy window
33
42) System software tuning Data transfer toolsParallelism is key in data transfer tools
ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection
bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)
ndash Several tools offer parallel transfers (see below)
Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN
transfersndash Many tools and protocols assume latencies typical of a LAN
environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long
path networks
bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more
than about 500 Mbs
34
System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL
RTT = 53 ms network capacity = 10GbpsTool Throughput
bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology
bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase
bull this helps rsync too
35
System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-
performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open
ports) ssh etc The newer Globus Online incorporates all of these and small file
support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community
outside of HEP
36
System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach
ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node
ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and
httpmonalisacernchFDT
37
44) System software tuning Other issuesFirewalls are anathema to high-peed data flows
ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for
TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo
Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf
bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning
bull Defaults are usually fine for 1GE but 10GE often requires additional tuning
ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo
([HPBulk])
5) Site infrastructure to support data-intensive scienceThe Science DMZ
With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the
bottleneckThe site network (LAN) typically provides connectivity for local
resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network
and the local area site network is critical for large-scale data movement
Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks
for business and small data-flow purposes usually donrsquot work for large-scale data flows
bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data
flows
39
The Science DMZTo provide high data-rate access to local resources the site
LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high
speed data path all the way back to the source
40
The Science DMZThe ScienceDMZ concept
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and
rapid fault isolation typically perfSONAR (see [perfSONAR] and below)
A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)
This is so important it was a requirement for last round of NSF CC-NIE grants
41
The Science DMZ
(See httpfasterdataesnetscience-dmz
and [SDMZ] for a much more complete
discussion of the various approaches)
campus siteLAN
high performanceData Transfer Node
computing cluster
cleanhigh-bandwidthWAN data path
campussiteaccess to
Science DMZresources is via the site firewall
secured campussiteaccess to Internet
border routerWAN
Science DMZrouterswitch
campus site
Science DMZ
Site DMZ WebDNS
Mail
network monitoring and testing
A WAN-capable device
per-servicesecurity policycontrol points
site firewall
dedicated systems built and
tuned for wide-area data transfer
42
6) Data movement and management techniquesAutomated data movement is critical for moving 500
terabytesday between 170 international sites In order to effectively move large amounts of data over the
network automated systems must be used to manage workflow and error recovery
bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers
bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)
43
Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the
analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates
compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day
bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10
petabytes of datayear in order to accomplish its science
bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial
44
[Diagram: the ATLAS PanDA "Production and Distributed Analysis" system, which uses distributed resources and layers of automation to manage several million jobs/day. The PanDA server (task management) takes ATLAS production, regional production, and user/group analysis jobs into a task buffer (job queue) and, guided by policy (job type, priority) and site status, uses a job broker and job dispatcher to place work. The general strategy: try to move the job to where the data is, else move data and job to where resources are available.
1) PanDA schedules jobs and initiates data movement. 2) The Distributed Data Manager (DDM) agents locate data and move it to sites; the DDM is a complex system in its own right, called DQ2. 3) A grid scheduler and site capability service prepare the local resources to receive PanDA jobs. 4) Jobs are dispatched when resources are available and the required data is in place at the site.
Job resource manager: a "pilot" job manager (a PanDA job receiver) is dispatched when resources are available at a site. Pilots run under the local site job manager (e.g. Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA – similar to the Condor Glide-in approach.
The CERN ATLAS detector feeds the Tier 0 data center (1 copy of all data – archival only). The 11 ATLAS Tier 1 data centers, scattered across Europe, North America, and Asia, in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis. The ATLAS analysis sites include, e.g., 70 Tier 2 centers in Europe, North America, and SE Asia.
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBytes/day (~68 Gb/s; see the arithmetic check after the figure note below). PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately in the figure, along with the accumulated data volume on disk).
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
[Figure: three stacked time-series panels. Top: accumulated data volume on disk over four years, on a 0–150 petabyte scale, with the slope annotated as 730 TBytes/day. Middle: the number of type 2 PanDA jobs over one year (0–100,000 scale). Bottom: the number of type 1 PanDA jobs over one year (0–50,000 scale).]
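The quoted ~68 Gb/s follows directly from the daily volume; a one-line check (Python):

# 730 TBytes/day expressed as an average bit rate.
tbytes_per_day = 730
gbits_per_s = tbytes_per_day * 1e12 * 8 / 86400 / 1e9  # TB -> bits -> per second -> Gb
print(round(gbits_per_s, 1))  # ~67.6, i.e. about 68 Gb/s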
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure.
 – Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges."
 – Successful testing was required for sites to participate in LHC production.
47
Ramp-up of LHC traffic in ESnet
[Figure: ESnet traffic over time showing the ramp-up of LHC traffic through the "LHC data system testing" period, LHC turn-on, and LHC operation, with an estimate of the "small"-scale traffic.]
The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.
48
6 cont.) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN.
 – The LHCOPN is a collection of leased 10 Gb/s optical circuits.
 – The role of the LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.
• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance. The security issues were the primary ones, and they were addressed by:
 • using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
  – that is, only LHC data and compute servers are connected to the OPN.
50
The LHC OPN – Optical Private Network
[Diagram: LHCOPN physical topology (abbreviated) and LHCOPN architecture. CH-CERN is linked to the Tier 1 centers: UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network
N.B.:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
 – The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
 – However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
 – In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic.
 – (There are about 170 Tier 2 sites.)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170; see the short calculation below) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism.
• Special infrastructure is required for this. The LHC's Open Network Environment – LHCONE – was designed for this purpose.
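To see why a per-pair circuit mesh does not scale, here is the arithmetic behind the "potentially 170 x 170" remark; it simply assumes roughly 170 Tier 2 sites, as stated above.

# Rough count of Tier 2 <-> Tier 2 circuit requirements for ~170 sites.
n_tier2 = 170
unordered_pairs = n_tier2 * (n_tier2 - 1) // 2  # one circuit per site pair
print(unordered_pairs)  # 14365 distinct pairs -- far too many circuits to manage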
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems.
 – The clouds are mostly local to a network domain, e.g. one for each involved domain – ESnet, GEANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.
 – The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).
In this way the LHC traffic will use circuits designated by the network engineers:
 – to ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.
54
[Map: LHCONE – a global infrastructure for the LHC Tier 1 data center – Tier 2 analysis center connectivity (April 2012). LHCONE VRF domains (ESnet USA; Internet2 USA; CANARIE Canada; GÉANT Europe; NORDUnet Nordic; DFN Germany; GARR Italy; RedIRIS Spain; SARA Netherlands; RENATER France; CUDI Mexico; ASGC and TWAREN Taiwan; KREONET2 and KISTI Korea; India) interconnect end sites – LHC Tier 2 or Tier 3 unless indicated as Tier 1 (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1) – via regional R&E communication nexus points (Seattle, Chicago, New York, Washington, Amsterdam, Geneva). Data communication links are 10, 20, and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because:
 – the VRF technology is a standard capability in most core routers, and
 – there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
CERN → T1 distances:
  Tier 1             miles    km
  France               350    565
  Italy                570    920
  UK                   625   1000
  Netherlands          625   1000
  Germany              700   1185
  Spain                850   1400
  Nordic              1300   2100
  USA – New York      3900   6300
  USA – Chicago       4400   7100
  Canada – BC         5200   8400
  Taiwan              6100   9850

[Diagram: A Network Centric View of the LHC. The detector feeds the Level 1 and 2 triggers over O(1-10) meters, then the Level 3 trigger over O(10-100) meters at roughly 1 PB/s, then the CERN Computer Center over O(1) km at 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS). The LHC Optical Private Network (LHCOPN) carries the data 500-10000 km to the LHC Tier 1 data centers (Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN). The LHC Open Network Environment (LHCONE) connects the Tier 1 centers to the LHC Tier 2 analysis centers – the universities and physics groups. This is intended to indicate that the physics groups now get their data wherever it is most readily available.]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed application systems in order to:
 – couple existing pockets of code, data, and expertise into "systems of systems"
 – break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
 – see https://www.es.net/about/science-requirements/
• A commonly identified need to support this is that networking must be provided as a "service":
 – schedulable, with guaranteed bandwidth – as is done with CPUs and disks
 – traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
 – some network path characteristics may also be specified – e.g. diversity
 – available in the Web Services / Grid Services paradigm (a sketch of what such a request might look like follows this list)
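As a purely illustrative sketch of "networking as a schedulable service," the snippet below builds the kind of reservation request such a service might accept: endpoints, a guaranteed bandwidth, and a time window. The field names, endpoint hostnames, and request structure are invented for illustration; they are not the actual OSCARS or NSI message format.

# Illustrative guaranteed-bandwidth reservation request (invented schema,
# not the real OSCARS/NSI message format).
import json
from datetime import datetime, timedelta, timezone

start = datetime.now(timezone.utc) + timedelta(hours=1)
reservation = {
    "src_endpoint": "site-A-dtn.example.org",  # hypothetical endpoints
    "dst_endpoint": "site-B-dtn.example.org",
    "bandwidth_mbps": 5000,                    # guaranteed 5 Gb/s
    "start_time": start.isoformat(),
    "end_time": (start + timedelta(hours=8)).isoformat(),
    "path_constraints": {"diverse_from": []},  # e.g. path diversity requests
}
print(json.dumps(reservation, indent=2))       # would be submitted to the service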
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
 – This is typically done by using a "static" routing mechanism:
  • e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path (a toy model follows below)
 – MPLS and OpenFlow are examples of this, and both can transport IP packets.
 – Most modern Internet routers have this type of functionality.
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic.
 – The virtual circuits can be directed to specific physical network paths when they are set up.
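A toy model of the label-based switching idea described above: each switch maps (incoming port, label) to (outgoing port, label), and because the tables are installed in advance, the circuit is pinned to a specific path. This is a didactic sketch of the mechanism only, not MPLS or OpenFlow as actually implemented.

# Toy label-switched path: each hop maps (in_port, in_label) -> (out_port, out_label).
# Tables are installed in advance, which is what pins the circuit to a path.
forwarding_tables = {
    "switch1": {("port1", 100): ("port3", 200)},
    "switch2": {("port2", 200): ("port4", 300)},
    "switch3": {("port1", 300): ("port2", None)},  # None -> pop label, deliver
}
# The circuit path, as a packet would traverse it: (switch, ingress port).
path = [("switch1", "port1"), ("switch2", "port2"), ("switch3", "port1")]

label = 100
for switch, in_port in path:
    out_port, label = forwarding_tables[switch][(in_port, label)]
    print(f"{switch}: forward on {out_port}, next label {label}")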
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.
60
End User View of Circuits – How They Use Them
• Who are the "users"?
 – Sites, for the most part.
• How are the circuits used?
 – End system to end system IP:
  • Almost never – very hard unless private address space is used.
  • Using public address space can result in leaking routes.
  • Using private address space with multi-homed hosts risks allowing backdoors into secure networks.
 – End system to end system Ethernet (or other) over VLAN – a pseudowire:
  • Relatively common.
  • Interesting example: RDMA over VLAN, likely to be popular in the future.
   – The SC11 demo of 40G RDMA over the WAN was very successful.
   – CPU load for RDMA is a small fraction of that for IP.
   – The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks).
 – Point-to-point connection between routing instances – e.g. BGP at the end points:
  • Essentially this is how all current circuits are used, from one site router to another site router.
  • Typically site-to-site, or to advertise subnets that host clusters, e.g. LHC analysis or data management clusters.
61
End User View of Circuits – How They Use Them
• When are the circuits used?
 – Mostly to solve a specific problem that the general infrastructure cannot.
 – Most circuits are used for a guarantee of bandwidth or for user traffic engineering.

Cross-Domain Virtual Circuit Service
Network domains (administrative units):
 – For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits.
 – e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US Regionals, etc. are all different domains.
63
Inter-Domain Control Protocol
• There are two realms involved:
 1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
 2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.
[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] across ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GEANT – that exchanges topology information and passes the VC setup request along, with a data plane connection helper at each domain ingress/egress point.]
1. The domains exchange topology information containing at least potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process.
(A schematic sketch of this setup chain follows below.)
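A schematic sketch, in Python, of the domain-to-domain setup chain described in steps 1-3 above. The domain names reuse the diagram; the controller class and its methods are invented here to illustrate per-domain authorization and reservation, not the real IDC/NSI protocol.

# Schematic chain of per-domain reservations for an end-to-end virtual circuit.
# Class and method names are illustrative; real IDCs speak the IDC/NSI protocol.
class DomainController:
    def __init__(self, name, available_gbps):
        self.name = name
        self.available_gbps = available_gbps

    def reserve_segment(self, gbps):
        """Authorize and reserve a circuit segment within this domain."""
        if gbps > self.available_gbps:
            raise RuntimeError(f"{self.name}: insufficient capacity")
        self.available_gbps -= gbps
        print(f"{self.name}: reserved {gbps} Gb/s segment")

def setup_circuit(domains, gbps):
    """Pass the VC setup request from domain to domain; roll back if any segment fails."""
    reserved = []
    try:
        for dc in domains:
            dc.reserve_segment(gbps)
            reserved.append(dc)
        print("end-to-end circuit established")
    except RuntimeError as err:
        print(f"setup failed ({err}); releasing reserved segments")
        for dc in reserved:
            dc.available_gbps += gbps

path = [DomainController("ESnet", 40), DomainController("GEANT", 40),
        DomainController("DFN", 10)]
setup_circuit(path, gbps=10)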
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
 – Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system.
• Multi-domain circuit setup is not yet a robust production service, but progress is being made.
• See lhcone.net
65
8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science.
 – With each generation of network transport technology:
  • 155 Mb/s was the norm for high-speed networks in 1995;
  • 100 Gb/s – 650 times greater – is the norm today.
  • R&D groups involving hardware engineers, computer scientists, and application specialists worked to:
   • first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
   • and then do the development necessary for applications to make use of the new capabilities.
 – Examples of how this methodology drove toward today's capabilities include (see the sketch of the parallel-streams idea after this list):
  • experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths;
  • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s.
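The parallel I/O idea mentioned above – splitting one large transfer across several concurrent streams so that no single thread, socket, or disk is the bottleneck – can be sketched as follows. fetch_range() is a hypothetical per-stream transfer function standing in for whatever tool actually moves each byte range; it is an assumption for illustration only.

# Sketch of parallel-stream bulk transfer: split a file into byte ranges and
# move the ranges concurrently. fetch_range() is a hypothetical stand-in for
# the real per-stream transfer (e.g. one stream of a parallel transfer tool).
from concurrent.futures import ThreadPoolExecutor

def split_ranges(total_bytes, n_streams):
    """Divide [0, total_bytes) into n_streams contiguous (start, end) ranges."""
    step = total_bytes // n_streams
    ranges = [(i * step, (i + 1) * step) for i in range(n_streams - 1)]
    ranges.append(((n_streams - 1) * step, total_bytes))
    return ranges

def parallel_transfer(fetch_range, total_bytes, n_streams=8):
    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        futures = [pool.submit(fetch_range, start, end)
                   for start, end in split_ranges(total_bytes, n_streams)]
        return sum(f.result() for f in futures)  # total bytes moved

# Example with a dummy fetch_range that just reports the range size.
moved = parallel_transfer(lambda s, e: e - s, total_bytes=10**9, n_streams=8)
print(moved)  # 1000000000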
66
Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base: http://fasterdata.es.net topics
 – Network Architecture, including the Science DMZ model
 – Host Tuning
 – Network Tuning
 – Data Transfer Tools
 – Network Performance Testing
 – With special sections on:
  • Linux TCP Tuning
  • Cisco 6509 Tuning
  • perfSONAR Howto
  • Active perfSONAR Services
  • Globus overview
  • Say No to SCP
  • Data Transfer Nodes (DTN)
  • TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations.
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
69
Infrastructure Critical to Science
• The combination of:
 – new network architectures in the wide area
 – new network services (such as guaranteed bandwidth virtual circuits)
 – cross-domain network error detection and correction
 – redesigning the site LAN to handle high data throughput
 – automation of data movement systems
 – use of appropriate operating system tuning and data transfer tools
 now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated/sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.
71
LHC lessons of possible use to the SKA
The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
 – A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
 – The technical aspects of building and operating a centralized working data repository –
  • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time, and
  • high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites –
 militate against a single large data center.
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
 – It decentralizes costs and involves many countries directly in the telescope infrastructure.
 – It divides up the network load, especially on the expensive trans-ocean links.
 – It divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply:
 There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
 • It might be that in the case of the SKA the T1 links would come to a centralized, data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
 • In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
 If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
 – In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, e.g., are implementing LHCONE.
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
 – New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
Re-engineering the site LAN-WAN architecture is critical: the Science DMZ.
Workflow management systems that automate the data movement will have to be designed and tested.
 – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
75
The Message
Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet, Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOE's Office of Science
DOE Office of Science and ESnet – the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
4.1) System software tuning: Host tuning – TCP
System software tuning: Host tuning – TCP
System software tuning: Host tuning – TCP
4.2) System software tuning: Data transfer tools
System software tuning: Data transfer tools
System software tuning: Data transfer tools (2)
System software tuning: Data transfer tools (3)
4.4) System software tuning: Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN – Optical Private Network
The LHC OPN – Optical Private Network (2)
The LHC OPN – Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHC's Open Network Environment – LHCONE
Slide 54
The LHC's Open Network Environment – LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits – How They Use Them
End User View of Circuits – How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide R&D, consulting, and knowledge base
Provide R&D, consulting, and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
40
The Science DMZThe ScienceDMZ concept
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and
rapid fault isolation typically perfSONAR (see [perfSONAR] and below)
A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)
This is so important it was a requirement for last round of NSF CC-NIE grants
41
The Science DMZ
(See httpfasterdataesnetscience-dmz
and [SDMZ] for a much more complete
discussion of the various approaches)
campus siteLAN
high performanceData Transfer Node
computing cluster
cleanhigh-bandwidthWAN data path
campussiteaccess to
Science DMZresources is via the site firewall
secured campussiteaccess to Internet
border routerWAN
Science DMZrouterswitch
campus site
Science DMZ
Site DMZ WebDNS
Mail
network monitoring and testing
A WAN-capable device
per-servicesecurity policycontrol points
site firewall
dedicated systems built and
tuned for wide-area data transfer
42
6) Data movement and management techniquesAutomated data movement is critical for moving 500
terabytesday between 170 international sites In order to effectively move large amounts of data over the
network automated systems must be used to manage workflow and error recovery
bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers
bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)
43
Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the
analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates
compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day
bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10
petabytes of datayear in order to accomplish its science
bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial
44
DDMAgent
DDMAgent
ATLAS production
jobs
Regional production
jobs
User Group analysis jobs
Data Service
Task Buffer(job queue)
Job Dispatcher
PanDA Server(task management)
Job Broker
Policy(job type priority)
ATLA
S Ti
er 1
Data
Cen
ters
11 s
ites
scat
tere
d ac
ross
Euro
pe N
orth
Am
erica
and
Asia
in
aggr
egat
e ho
ld 1
copy
of a
ll dat
a an
d pr
ovide
the
work
ing
data
set f
or d
istrib
ution
to T
ier 2
cen
ters
for a
nalys
isDistributed
Data Manager
Pilot Job(Panda job
receiver running under the site-
specific job manager)
Grid Scheduler
Site Capability Service
CERNATLAS detector
Tier 0 Data Center(1 copy of all data ndash
archival only)
Job resource managerbull Dispatch a ldquopilotrdquo job manager - a
Panda job receiver - when resources are available at a site
bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA
bull Similar to the Condor Glide-in approach
Site status
ATLAS analysis sites(eg 70 Tier 2 Centers in
Europe North America and SE Asia)
DDMAgent
DDMAgent
1) Schedules jobs initiates data movement
2) DDM locates data and moves it to sites
This is a complex system in its own right called DQ2
3) Prepares the local resources to receive Panda jobs
4) Jobs are dispatched when there are resources available and when the required data is
in place at the site
Thanks to Michael Ernst US ATLAS technical lead for his assistance with this
diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)
The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday
CERN
Try to move the job to where the data is else move data and job to where
resources are available
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe N America and SE Asia generate network
data movement of 730 TByday ~68Gbs
Accumulated data volume on disk
730 TBytesday
PanDA manages 120000ndash140000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movementgoing on 24 hrday 9+ monthsyr
that networks must support in order to enable the large-scale science of the LHC
0
50
100
150
Peta
byte
s
four years
0
50000
100000
type
2jo
bs
0
50000
type
1jo
bs
one year
one year
one year
46
Building an LHC-scale production analysis system In order to debug and optimize the distributed system that
accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in
ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC
production
47
Ramp-up of LHC traffic in ESnet
(est of ldquosmallrdquo scale traffic)
LHC
turn
-on
LHC data systemtesting
LHC operationThe transition from testing to operation
was a smooth continuum due toat-scale testing ndash a process that took
more than 5 years
48
6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument
to data centers ndash a dedicated purpose-built infrastructure is needed
bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to
the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the
Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward
exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community
bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by
bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
ndash that is only LHC data and compute servers are connected to the OPN
50
The LHC OPN ndash Optical Private Network
UK-T1_RAL
NDGF
FR-CCIN2P3
ES-PIC
DE-KIT
NL-T1
US-FNAL-CMS
US-T1-BNL
CA-TRIUMF
TW-ASCG
IT-NFN-CNAF
CH-CERNLHCOPN physical (abbreviated)
LHCOPN architecture
51
The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1
centers data transfer was to use dedicated physical 10G circuits
Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than
5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)
ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)
bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism
bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose
53
The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure
designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each
involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc
ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure
that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC
54
ESnetUSA
Chicago
New YorkBNL-T1
Internet2USA
Harvard
CANARIECanada
UVic
SimFraU
TRIUMF-T1
UAlb UTorMcGilU
Seattle
TWARENTaiwan
NCU NTU
ASGCTaiwan
ASGC-T1
KERONET2Korea
KNU
LHCONE VRF domain
End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1
Regional RampE communication nexus
Data communication links 10 20 and 30 Gbs
See httplhconenet for details
NTU
Chicago
LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity
NORDUnetNordic
NDGF-T1aNDGF-T1a NDGF-T1c
DFNGermany
DESYGSI DE-KIT-T1
GARRItaly
INFN-Nap CNAF-T1
RedIRISSpain
PIC-T1
SARANetherlands
NIKHEF-T1
RENATERFrance
GRIF-IN2P3
Washington
CUDIMexico
UNAM
CC-IN2P3-T1Sub-IN2P3
CEA
CERNGeneva
CERN-T1
SLAC
GLakes
NE
MidWSoW
Geneva
KISTIKorea
TIFRIndia
India
Korea
FNAL-T1
MIT
CaltechUFlorida
UNebPurU
UCSDUWisc
UltraLightUMich
Amsterdam
GEacuteANT Europe
April 2012
55
The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because
ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for
use by the LHC collaboration that cannot be made available for general RampE traffic
bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance
bull See LHCONEnet
LHCONE is one part of the network infrastructure that supports the LHC
CERN rarrT1 miles kms
France 350 565
Italy 570 920
UK 625 1000
Netherlands 625 1000
Germany 700 1185
Spain 850 1400
Nordic 1300 2100
USA ndash New York 3900 6300
USA - Chicago 4400 7100
Canada ndash BC 5200 8400
Taiwan 6100 9850
CERN Computer Center
The LHC Optical Private Network
(LHCOPN)
LHC Tier 1Data Centers
LHC Tier 2 Analysis Centers
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups Universities
physicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
The LHC Open Network
Environment(LHCONE)
50 Gbs (25Gbs ATLAS 25Gbs CMS)
detector
Level 1 and 2 triggers
Level 3 trigger
O(1-10) meter
O(10-100) meters
O(1) km
1 PBs
500-10000 km
This is intended to indicate that the physics
groups now get their datawherever it is most readily
available
A Network Centric View of the LHC
Taiwan Canada USA-Atlas USA-CMS
Nordic
UK
Netherlands Germany Italy
Spain
FranceCERN
57
7) New network servicesPoint-to-Point Virtual Circuit Service
Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly
consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of
systemsrdquondash Break up the task of massive data analysis and use data compute and
storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements
A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work
well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm
58
Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual
circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism
bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path
ndash MPLS and OpenFlow are examples of this and both can transport IP packets
ndash Most modern Internet routers have this type of functionality
bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths
when they are set up
59
Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service
(For more information contact the project lead Chin Guok chinesnet)
bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references
bull OSCARS received a 2013 ldquoRampD 100rdquo award
60
End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo
ndash Sites for the most part
bull How are the circuits usedndash End system to end system IP
bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure
networks
ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future
ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP
protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
ndash Point-to-point connection between routing instance ndash eg BGP at the end points
bull Essentially this is how all current circuits are used from one site router to another site router
ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters
61
End User View of Circuits ndash How They Use Thembull When are the circuits used
ndash Mostly to solve a specific problem that the general infrastructure cannot
bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering
network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains
involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET
(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains
63
Inter-Domain Control Protocolbull There are two realms involved
1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains
2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains
FNAL (AS3152)[US]
ESnet (AS293)[US]
GEANT (AS20965)[Europe]
DFN (AS680)[Germany]
DESY (AS1754)[Germany]
Topology exchange
VC setup request
Local InterDomain
Controller
Local IDC
Local IDC
Local IDC
Local IDC
VC setup request
VC setup request
VC setup request
OSCARS
User source
User destination
VC setup request
data plane connection helper at each domain ingressegress point
data plane connection helper at each domain ingressegress point
1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to
domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process
The end-to-end virtual circuit
AutoBAHN
64
Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved
into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated
Facility - an international virtual organization that promotes the paradigm of lambda networking)
bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet
65
8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the
network to support data-intensive sciencendash With each generation of network transport technology
bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and
application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo
end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of
the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities
includebull experiments in the 1990s in using parallel disk IO and parallel network IO
together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths
bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs
66
Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects
are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build
a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at
httpfasterdataesnet and contains contributions from several organizations
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKA – The similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instruments take data continuously.
• The data is generated/sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.
71
LHC lessons of possible use to the SKA – The lessons
 The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
– A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
– The technical aspects of building and operating a centralized working data repository:
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time,
• high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites,
militate against a single large data center.
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply:
 There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that in the case of the SKA the T1 links would come to a centralized, data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
 If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
74
LHC lessons of possible use to the SKA
 All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high data volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (see the illustrative loss-rate calculation below).
 New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
 Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.
 Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
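To make the "fragile workhorse" point concrete, the well-known Mathis et al. approximation for loss-limited TCP throughput – rate ≈ (MSS/RTT)·(C/√p), with C ≈ 1.22 for standard TCP – shows how even tiny loss rates cap single-stream throughput on long paths. An illustrative calculation (the RTT and loss rates are example values):

from math import sqrt

# Mathis et al. approximation: achievable rate ~ (MSS / RTT) * C / sqrt(p),
# with C ~ 1.22 for standard TCP congestion control and p the packet loss rate.
def mathis_rate_bps(mss_bytes, rtt_s, loss_prob, c=1.22):
    return (mss_bytes * 8 / rtt_s) * c / sqrt(loss_prob)

mss = 1460    # bytes, a typical Ethernet MSS
rtt = 0.150   # 150 ms, e.g. a trans-Atlantic path

for p in (1e-3, 1e-5, 1e-7):
    print(f"loss {p:.0e}: {mathis_rate_bps(mss, rtt, p) / 1e9:.3f} Gb/s")

# Even a 1e-7 loss rate limits a single standard-TCP stream to a few hundred
# Mb/s at this RTT -- which is why long paths must be kept essentially
# error-free and constantly monitored.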
75
The Message
Again … A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document".
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C., Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010.
(May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf, and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
Foundations of data-intensive science: Technology and practice
Data-Intensive Science in DOE's Office of Science
DOE Office of Science and ESnet – the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport: The limitations of TCP must be addressed for
Transport
Transport: Impact of packet loss on TCP
Transport: Modern TCP stack
Transport: Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
4.1) System software tuning: Host tuning – TCP
System software tuning: Host tuning – TCP
System software tuning: Host tuning – TCP
4.2) System software tuning: Data transfer tools
System software tuning: Data transfer tools
System software tuning: Data transfer tools (2)
System software tuning: Data transfer tools (3)
4.4) System software tuning: Other issues
5) Site infrastructure to support data-intensive science: The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN – Optical Private Network
The LHC OPN – Optical Private Network (2)
The LHC OPN – Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHC's Open Network Environment – LHCONE
Slide 54
The LHC's Open Network Environment – LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits – How They Use Them
End User View of Circuits – How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide R&D consulting and knowledge base
Provide R&D consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
32
System software tuning: Host tuning – TCP
[Figure: Throughput out to ~9,000 km on a 10 Gb/s network – 32 MB (auto-tuned) vs. 64 MB (hand-tuned) TCP window size. X-axis: path length expressed as round-trip time, 0–10,000 km (corresponding roughly to San Francisco to London); Y-axis: throughput in Mb/s. The curve hand-tuned to a 64 MB window sustains substantially higher throughput than the one auto-tuned to a 32 MB window as the path length grows.]
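The reason the window size dominates at long RTT is that a TCP connection can have at most one window of data in flight per round trip, so throughput is capped at window/RTT. A small illustrative calculation (the 10 Gb/s capacity and window sizes come from the figure above; the RTT is an example value):

# Throughput ceiling of a single TCP stream: at most one window per RTT.
def window_limited_gbps(window_bytes, rtt_s):
    return window_bytes * 8 / rtt_s / 1e9

rtt = 0.150  # ~150 ms, roughly a US West Coast to Europe path
for mb in (32, 64):
    print(f"{mb} MB window: {window_limited_gbps(mb * 2**20, rtt):.2f} Gb/s max")

# Bandwidth-delay product: the window needed to fill a 10 Gb/s path at this RTT.
bdp_bytes = 10e9 * rtt / 8
print(f"BDP for 10 Gb/s at {rtt * 1e3:.0f} ms: {bdp_bytes / 2**20:.0f} MB")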
33
4.2) System software tuning: Data transfer tools
Parallelism is key in data transfer tools.
– It is much easier to achieve a given performance level with multiple parallel connections than with one connection (a minimal sketch follows this slide).
• This is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (the same is true for disks).
– Several tools offer parallel transfers (see below).
Latency tolerance is critical.
– Wide area data transfers have much higher latency than LAN transfers.
– Many tools and protocols assume latencies typical of a LAN environment (a few milliseconds).
• Examples: SCP/SFTP and the HPSS mover protocols work very poorly in long-path networks.
• Disk performance:
– In general, a RAID array or parallel disks (as in FDT) are needed to get more than about 500 Mb/s.
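A minimal sketch of the parallel-connections idea (illustrative only; real tools such as GridFTP also negotiate stream counts, checksums and restart markers, and the receiving host name here is hypothetical): split the file into byte ranges and push each range over its own TCP connection.

import os
import socket
import threading

def send_range(host, port, path, offset, length):
    """Send one byte range of the file over its own TCP connection."""
    with socket.create_connection((host, port)) as s, open(path, "rb") as f:
        # Simple header so the receiver knows where this range belongs.
        s.sendall(offset.to_bytes(8, "big") + length.to_bytes(8, "big"))
        f.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = f.read(min(4 * 1024 * 1024, remaining))
            s.sendall(chunk)
            remaining -= len(chunk)

def parallel_send(host, port, path, streams=8):
    """Split the file into equal ranges, one TCP stream per range."""
    size = os.path.getsize(path)
    step = (size + streams - 1) // streams
    threads = [threading.Thread(target=send_range,
                                args=(host, port, path, i * step,
                                      min(step, size - i * step)))
               for i in range(streams) if i * step < size]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Example (hypothetical data transfer node):
# parallel_send("dtn.example.org", 5000, "dataset.tar", streams=8)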
34
System software tuning: Data transfer tools
Using the right tool is very important.
Sample results, Berkeley, CA to Argonne, IL (RTT = 53 ms, network capacity = 10 Gbps):
Tool                     Throughput
• scp                    140 Mbps
• patched scp (HPN)      1.2 Gbps
• ftp                    1.4 Gbps
• GridFTP, 4 streams     5.4 Gbps
• GridFTP, 8 streams     6.6 Gbps
Note that to get more than about 1 Gbps (125 MB/s) disk to disk requires using RAID technology.
• PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSH:
– http://www.psc.edu/networking/projects/hpn-ssh
– Significant performance increase
• this helps rsync too
35
System software tuning: Data transfer tools
Globus GridFTP is the basis of most modern high-performance data movement systems.
 Parallel streams and buffer tuning help in getting through firewalls (open ports), ssh, etc.
 The newer Globus Online incorporates all of these and adds small-file support, pipelining, automatic error recovery, third-party transfers, etc. (a sketch follows this slide).
• This is a very useful tool, especially for the applications community outside of HEP.
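A hedged sketch of the third-party transfer model behind Globus Online, using the Globus Python SDK (globus_sdk) that the service now exposes; the access token, the endpoint UUIDs, and the paths are placeholders for illustration:

import globus_sdk

# Placeholders: a transfer-scoped access token and the two endpoint UUIDs.
TOKEN = "..."
SRC_ENDPOINT = "source-endpoint-uuid"
DST_ENDPOINT = "destination-endpoint-uuid"

tc = globus_sdk.TransferClient(authorizer=globus_sdk.AccessTokenAuthorizer(TOKEN))

# Third-party transfer: the two endpoints move the data directly between
# themselves; this client only submits and monitors the task.
tdata = globus_sdk.TransferData(tc, SRC_ENDPOINT, DST_ENDPOINT,
                                label="example dataset replication",
                                sync_level="checksum")
tdata.add_item("/data/run-a/", "/archive/run-a/", recursive=True)

task = tc.submit_transfer(tdata)
print("submitted task:", task["task_id"])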
36
System software tuning: Data transfer tools
Also see Caltech's FDT (Fast Data Transfer) approach:
– Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node.
– Explicit parallel use of multiple disks.
– Can fill 100 Gb/s paths.
– See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT
37
4.4) System software tuning: Other issues
 Firewalls are anathema to high-speed data flows.
– Many firewalls can't handle >1 Gb/s flows:
• designed for large numbers of low-bandwidth flows,
• some firewalls even strip out the TCP options that allow for TCP buffers > 64 KB.
 See Jason Zurawski's "Say Hello to your Frienemy – The Firewall".
 Stateful firewalls have inherent problems that inhibit high throughput.
• http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues:
– Large MTUs (several issues)
– NIC tuning
• Defaults are usually fine for 1GE, but 10GE often requires additional tuning.
– Other OS tuning knobs
– See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])
5) Site infrastructure to support data-intensive science: The Science DMZ
With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck.
The site network (LAN) typically provides connectivity for the local resources – compute, data, instrument, collaboration systems, etc. – needed by data-intensive science.
– Therefore a high performance interface between the wide area network and the local area site network is critical for large-scale data movement.
Campus network infrastructure is typically not designed to handle the flows of large-scale science.
– The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows:
• firewalls, proxy servers, low-cost switches, and so forth,
• none of which will allow high volume, high bandwidth, long distance data flows.
39
The Science DMZ
To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large data volume, high round trip time (RTT) (international paths) of the wide area network (WAN) flows (see [DIS]).
– Otherwise the site will impose poor performance on the entire high-speed data path, all the way back to the source.
40
The Science DMZ
The Science DMZ concept: the compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy:
 Outside the site firewall – hence the term "Science DMZ".
 With dedicated systems built and tuned for wide-area data transfer.
 With test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below).
 A security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. hardware that supports access control lists, private address space, etc.).
This is so important that it was a requirement for the last round of NSF CC-NIE grants.
41
The Science DMZ
(See http://fasterdata.es.net/science-dmz and [SDMZ] for a much more complete discussion of the various approaches.)
[Diagram: the border router connects the WAN to a Science DMZ router/switch – a WAN-capable device. The Science DMZ hosts a high-performance Data Transfer Node, network monitoring and testing systems, and per-service security policy control points, all on a clean, high-bandwidth WAN data path to the site's computing cluster; these are dedicated systems built and tuned for wide-area data transfer. The campus/site LAN, the site DMZ (Web, DNS, Mail), and secured campus/site access to the Internet sit behind the site firewall, and campus/site access to Science DMZ resources is via the site firewall, while the Science DMZ data path itself bypasses the firewall.]
42
6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
 In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers.
• The Tier 2 sites get a comparable amount of data from the Tier 1s:
– they host the physics groups that analyze the data and do the science,
– provide most of the compute resources for analysis,
– and cache the data (though this is evolving to remote I/O).
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management.
– The resources and data movement are centrally managed.
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations.
– The system manages tens of thousands of jobs a day:
• it coordinates data movement of hundreds of terabytes/day, and
• manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science.
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial.
44
[Diagram: The ATLAS PanDA "Production and Distributed Analysis" system, which uses distributed resources and layers of automation to manage several million jobs/day.
The PanDA Server (task management) at CERN receives ATLAS production jobs, regional production jobs, and user/group analysis jobs into a Task Buffer (job queue), supported by a Policy module (job type priority), a Job Broker, a Job Dispatcher, a Data Service, and a Distributed Data Manager (DDM) with agents at the sites.
1) PanDA schedules jobs and initiates data movement. 2) The DDM locates data and moves it to sites (the DDM is a complex system in its own right, called DQ2). 3) Pilot jobs prepare the local resources to receive PanDA jobs. 4) Jobs are dispatched when there are resources available and when the required data is in place at the site.
Job resource manager: a "pilot" job manager – a PanDA job receiver – is dispatched when resources are available at a site; pilots run under the local site job manager (e.g. Condor, LSF, LCG, …) and accept jobs in a standard format from PanDA (similar to the Condor glide-in approach). A Grid Scheduler and a Site Capability Service track site status.
Data sources and sinks: the CERN ATLAS detector and the Tier 0 Data Center (one copy of all data – archival only); the 11 ATLAS Tier 1 data centers scattered across Europe, North America and Asia, which in aggregate hold one copy of all data and provide the working data set for distribution to Tier 2 centers for analysis; and the ATLAS analysis sites (e.g. 70 Tier 2 centers in Europe, North America and SE Asia).
The general strategy: try to move the job to where the data is, else move data and job to where resources are available (a toy sketch of this decision follows).
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]
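A toy sketch of the "move the job to the data, else move the data" brokering decision described above (the site names and the data-movement call are hypothetical; PanDA's real brokerage weighs many more factors, such as queue depth, pledged shares, and site status):

# Toy broker: prefer a site that already holds the dataset and has free slots;
# otherwise pick the site with the most free slots and request a replica there.
def broker(job, sites, replicas, request_replica):
    """sites: {site: free_slots}; replicas: {dataset: set(sites)}"""
    holders = replicas.get(job["dataset"], set())
    with_data = [s for s in holders if sites.get(s, 0) > 0]
    if with_data:
        return max(with_data, key=lambda s: sites[s])   # job goes to the data
    target = max(sites, key=lambda s: sites[s])          # data goes to the job
    request_replica(job["dataset"], target)              # hypothetical DDM call
    return target

# Example with made-up sites and datasets:
sites = {"BNL": 120, "IN2P3": 0, "KIT": 40}
replicas = {"data14.slice.A": {"IN2P3"}}
print(broker({"dataset": "data14.slice.A"}, sites, replicas,
             lambda ds, site: print(f"replicate {ds} -> {site}")))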
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America and SE Asia generate network data movement of 730 TB/day, ~68 Gb/s.
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately here).
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
[Charts: accumulated data volume on disk at 730 TBytes/day (0–150 petabyte scale over a four-year span), and the number of concurrent PanDA jobs of each of the two types (scales of 0–50,000 and 0–100,000 over one-year spans).]
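As a sanity check on the quoted figures, 730 TB/day averaged over a day is indeed about 68 Gb/s:

tb_per_day = 730
avg_gbps = tb_per_day * 1e12 * 8 / 86400 / 1e9
print(f"{avg_gbps:.1f} Gb/s")   # ~67.6 Gb/s, consistent with the ~68 Gb/s above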
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure.
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges".
– Successful testing was required for sites to participate in LHC production.
47
Ramp-up of LHC traffic in ESnet
[Chart: ESnet traffic over time, showing an estimate of the early "small"-scale traffic, the LHC data system testing period, LHC turn-on, and LHC operation.]
The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.
48
6, cont.) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN.
– The LHCOPN is a collection of leased 10 Gb/s optical circuits.
 The role of the LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.
• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance.
 The security issues were the primary ones, and were addressed by:
• using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec]),
– that is, only LHC data and compute servers are connected to the OPN.
50
The LHC OPN – Optical Private Network
[Diagram: LHCOPN physical topology (abbreviated) and LHCOPN architecture – CH-CERN at the hub, connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network
N.B.
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
 Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic.
– (There are about 170 Tier 2 sites.)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
 The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to the subnets that are used by LHC systems.
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronting" for the NRENs), Internet2 (fronting for the US universities), etc.).
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).
 In this way the LHC traffic will use circuits designated by the network engineers:
– to ensure continued good performance for the LHC and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.
54
[Diagram: LHCONE – a global infrastructure for LHC Tier 1 data center to Tier 2 analysis center connectivity (April 2012). Routed VRF domains are shown for ESnet (USA), Internet2 (USA), CANARIE (Canada), GÉANT (Europe), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), CUDI (Mexico), TWAREN and ASGC (Taiwan), KERONET2 and KISTI (Korea), and India, interconnected at exchange points including Seattle, Chicago, New York, Washington, Amsterdam and Geneva. End sites are LHC Tier 2 or Tier 3 centers unless indicated as Tier 1 (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1). Data communication links are 10, 20 and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because:
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
Approximate distances from CERN to the Tier 1 centers:
CERN → T1          miles     kms
France               350      565
Italy                570      920
UK                   625     1000
Netherlands          625     1000
Germany              700     1185
Spain                850     1400
Nordic              1300     2100
USA – New York      3900     6300
USA – Chicago       4400     7100
Canada – BC         5200     8400
Taiwan              6100     9850
[Diagram: "A Network Centric View of the LHC." The detector feeds the Level 1 and 2 triggers (O(1-10) meters away), the Level 3 trigger (O(10-100) meters), and the CERN Computer Center (O(1) km), with ~1 PB/s coming off the detector and 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) leaving the CERN Computer Center. The LHC Optical Private Network (LHCOPN) carries this data over 500–10,000 km paths to the LHC Tier 1 data centers in Taiwan, Canada, USA (ATLAS and CMS), the Nordic countries, the UK, the Netherlands, Germany, Italy, Spain and France. The LHC Open Network Environment (LHCONE) connects the Tier 1 data centers to the LHC Tier 2 analysis centers and the many university physics groups; this is intended to indicate that the physics groups now get their data wherever it is most readily available.]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to:
– couple existing pockets of code, data and expertise into "systems of systems",
– break up the task of massive data analysis and use data, compute and storage resources that are located at the collaborators' sites,
– see https://www.es.net/about/science-requirements/
 A commonly identified need to support this is that networking must be provided as a "service":
 schedulable, with guaranteed bandwidth – as is done with CPUs and disks,
– traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure,
– some network path characteristics may also be specified – e.g. diversity,
– available in the Web Services / Grid Services paradigm.
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
– This is typically done by using a "static" routing mechanism.
• E.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path.
– MPLS and OpenFlow are examples of this, and both can transport IP packets.
– Most modern Internet routers have this type of functionality.
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic (a hypothetical reservation request is sketched below).
– The virtual circuits can be directed to specific physical network paths when they are set up.
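To make the "network as a schedulable service" idea concrete, a reservation request to a hypothetical OSCARS-like bandwidth-reservation API might look like the sketch below (the controller URL, field names and endpoints are invented for illustration; the real OSCARS and NSI interfaces differ):

import json
from datetime import datetime, timedelta, timezone
from urllib import request

# Hypothetical guaranteed-bandwidth circuit request between two site routers.
start = datetime.now(timezone.utc) + timedelta(hours=1)
reservation = {
    "description": "Tier 1 -> Tier 2 bulk replication",
    "src_endpoint": "site-a-router:xe-0/1/0",
    "dst_endpoint": "site-b-router:xe-2/0/3",
    "bandwidth_mbps": 5000,                       # guaranteed bandwidth
    "start_time": start.isoformat(),
    "end_time": (start + timedelta(hours=6)).isoformat(),
}

req = request.Request("https://idc.example.net/reservations",   # hypothetical controller
                      data=json.dumps(reservation).encode(),
                      headers={"Content-Type": "application/json"})
# with request.urlopen(req) as resp:
#     print(json.load(resp))   # e.g. a circuit ID to use at activation time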
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.
60
End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part.
• How are the circuits used?
– End system to end system, IP:
• almost never – very hard unless private address space is used,
– using public address space can result in leaking routes,
– using private address space with multi-homed hosts risks allowing backdoors into secure networks.
– End system to end system, Ethernet (or other) over VLAN – a pseudowire:
• relatively common,
• interesting example: RDMA over VLAN is likely to be popular in the future,
– the SC11 demo of 40G RDMA over WAN was very successful,
– CPU load for RDMA is a small fraction of that of IP,
– the guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks).
– Point-to-point connection between routing instances – e.g. BGP at the end points:
• essentially this is how all current circuits are used, from one site router to another site router,
– typically site-to-site, or to advertise subnets that host clusters, e.g. LHC analysis or data management clusters.
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot.
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering.

Cross-Domain Virtual Circuit Service
… network domains (administrative units):
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration, to provide end-to-end circuits.
– E.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling and resource commitment within network domains.
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.
[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] across ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GÉANT – plus a data plane connection helper at each domain ingress/egress point.
1. The domains exchange topology information containing at least the potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved (a toy model of this chained setup follows).
3. The data plane connection (e.g. an Ethernet VLAN-to-VLAN connection) is facilitated by a helper process.]
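A toy model of step 2 above – the setup request being passed from domain to domain along the circuit path (the domain names are taken from the diagram; the controller logic and capacities are invented for illustration):

# Each domain controller reserves its own segment, then forwards the request
# to the next domain; if any domain refuses, the reservation fails end to end.
DOMAINS = ["ESnet", "GEANT", "DFN"]       # path from source domain to destination domain

def reserve_segment(domain, bandwidth_mbps):
    # Stand-in for the local controller (e.g. OSCARS, AutoBAHN) checking capacity.
    available = {"ESnet": 10000, "GEANT": 10000, "DFN": 4000}[domain]
    return bandwidth_mbps <= available

def setup_circuit(bandwidth_mbps):
    reserved = []
    for domain in DOMAINS:
        if not reserve_segment(domain, bandwidth_mbps):
            print(f"{domain}: refused {bandwidth_mbps} Mb/s, tearing down {reserved}")
            return False
        reserved.append(domain)
        print(f"{domain}: segment reserved, forwarding request onward")
    print("end-to-end virtual circuit established")
    return True

setup_circuit(5000)   # DFN (4 Gb/s available in this toy example) will refuse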
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
– Testing is being coordinated in GLIF (the Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access and data movement can all work together as a predictable system.
 Multi-domain circuit setup is not yet a robust production service, but progress is being made.
• See lhcone.net
65
8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the
network to support data-intensive sciencendash With each generation of network transport technology
bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and
application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo
end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of
the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities
includebull experiments in the 1990s in using parallel disk IO and parallel network IO
together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths
bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs
66
Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects
are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build
a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at
httpfasterdataesnet and contains contributions from several organizations
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional
centers) has worked wellndash It decentralizes costs and involves many countries directly in the
telescope infrastructurendash It divides up the network load especially on the expensive trans-
ocean linksndash It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKARegardless of distributed vs centralized working data
repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source
to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a
centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue
bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC
ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
78
References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations
httpwwwperfsonarnet
httppspsperfsonarnet
[REQ] httpswwwesnetaboutscience-requirements
[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010
(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )
[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
33
42) System software tuning Data transfer toolsParallelism is key in data transfer tools
ndash It is much easier to achieve a given performance level with multiple parallel connections than with one connection
bull this is because the OS is very good at managing multiple threads and less good at sustained maximum performance of a single thread (same is true for disks)
ndash Several tools offer parallel transfers (see below)
Latency tolerance is criticalndash Wide area data transfers have much higher latency than LAN
transfersndash Many tools and protocols assume latencies typical of a LAN
environment (a few milliseconds) examples SCPSFTP and HPSS mover protocols work very poorly in long
path networks
bull Disk Performancendash In general need a RAID array or parallel disks (like FDT) to get more
than about 500 Mbs
34
System software tuning Data transfer toolsUsing the right tool is very importantSample Results Berkeley CA to Argonne IL
RTT = 53 ms network capacity = 10GbpsTool Throughput
bull scp 140 Mbpsbull patched scp (HPN) 12 Gbpsbull ftp 14 Gbpsbull GridFTP 4 streams 54 Gbpsbull GridFTP 8 streams 66 GbpsNote that to get more than about 1 Gbps (125 MBs) disk to disk requires using RAID technology
bull PSC (Pittsburgh Supercomputer Center) has a patch set that fixes problems with SSHndash httpwwwpscedunetworkingprojectshpn-sshndash Significant performance increase
bull this helps rsync too
35
System software tuning Data transfer toolsGlobus GridFTP is the basis of most modern high-
performance data movement systems Parallel streams buffer tuning help in getting through firewalls (open
ports) ssh etc The newer Globus Online incorporates all of these and small file
support pipelining automatic error recovery third-party transfers etcbull This is a very useful tool especially for the applications community
outside of HEP
36
System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach
ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node
ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and
httpmonalisacernchFDT
37
44) System software tuning Other issuesFirewalls are anathema to high-peed data flows
ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for
TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo
Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf
bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning
bull Defaults are usually fine for 1GE but 10GE often requires additional tuning
ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo
([HPBulk])
5) Site infrastructure to support data-intensive scienceThe Science DMZ
With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the
bottleneckThe site network (LAN) typically provides connectivity for local
resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network
and the local area site network is critical for large-scale data movement
Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks
for business and small data-flow purposes usually donrsquot work for large-scale data flows
bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data
flows
39
The Science DMZTo provide high data-rate access to local resources the site
LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high
speed data path all the way back to the source
40
The Science DMZThe ScienceDMZ concept
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and
rapid fault isolation typically perfSONAR (see [perfSONAR] and below)
A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)
This is so important it was a requirement for last round of NSF CC-NIE grants
41
The Science DMZ
(See httpfasterdataesnetscience-dmz
and [SDMZ] for a much more complete
discussion of the various approaches)
campus siteLAN
high performanceData Transfer Node
computing cluster
cleanhigh-bandwidthWAN data path
campussiteaccess to
Science DMZresources is via the site firewall
secured campussiteaccess to Internet
border routerWAN
Science DMZrouterswitch
campus site
Science DMZ
Site DMZ WebDNS
Mail
network monitoring and testing
A WAN-capable device
per-servicesecurity policycontrol points
site firewall
dedicated systems built and
tuned for wide-area data transfer
42
6) Data movement and management techniquesAutomated data movement is critical for moving 500
terabytesday between 170 international sites In order to effectively move large amounts of data over the
network automated systems must be used to manage workflow and error recovery
bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers
bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)
43
Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the
analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates
compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day
bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10
petabytes of datayear in order to accomplish its science
bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial
44
DDMAgent
DDMAgent
ATLAS production
jobs
Regional production
jobs
User Group analysis jobs
Data Service
Task Buffer(job queue)
Job Dispatcher
PanDA Server(task management)
Job Broker
Policy(job type priority)
ATLA
S Ti
er 1
Data
Cen
ters
11 s
ites
scat
tere
d ac
ross
Euro
pe N
orth
Am
erica
and
Asia
in
aggr
egat
e ho
ld 1
copy
of a
ll dat
a an
d pr
ovide
the
work
ing
data
set f
or d
istrib
ution
to T
ier 2
cen
ters
for a
nalys
isDistributed
Data Manager
Pilot Job(Panda job
receiver running under the site-
specific job manager)
Grid Scheduler
Site Capability Service
CERNATLAS detector
Tier 0 Data Center(1 copy of all data ndash
archival only)
Job resource managerbull Dispatch a ldquopilotrdquo job manager - a
Panda job receiver - when resources are available at a site
bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA
bull Similar to the Condor Glide-in approach
Site status
ATLAS analysis sites(eg 70 Tier 2 Centers in
Europe North America and SE Asia)
DDMAgent
DDMAgent
1) Schedules jobs initiates data movement
2) DDM locates data and moves it to sites
This is a complex system in its own right called DQ2
3) Prepares the local resources to receive Panda jobs
4) Jobs are dispatched when there are resources available and when the required data is
in place at the site
Thanks to Michael Ernst US ATLAS technical lead for his assistance with this
diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)
The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday
CERN
Try to move the job to where the data is else move data and job to where
resources are available
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America and SE Asia generate network data movement of 730 TBytes/day (~68 Gb/s).
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately here).
It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
[Charts: the accumulated data volume on disk grows to roughly 150 petabytes over four years (at 730 TBytes/day); over one year the two PanDA job types ramp to about 100,000 and 50,000 simultaneous jobs respectively.]
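As a sanity check on those numbers, a back-of-envelope conversion of the daily volume into an average rate (using only the values quoted on the slide):

  # 730 terabytes/day expressed as an average rate in gigabits/second
  tb_per_day = 730
  bits = tb_per_day * 1e12 * 8          # terabytes -> bits (decimal TB)
  gbps = bits / 86400 / 1e9             # per second, in Gb/s
  print(f"{gbps:.0f} Gb/s")             # ~68 Gb/s, matching the slide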
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges"
– Successful testing was required for sites to participate in LHC production
47
Ramp-up of LHC traffic in ESnet
[Chart: ESnet traffic growth, annotated with an estimate of the "small"-scale traffic, the LHC data system testing period, LHC turn-on, and LHC operation.]
The transition from testing to operation was a smooth continuum, due to at-scale testing – a process that took more than 5 years.
48
6 cont) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN
  – The LHCOPN is a collection of leased 10 Gb/s optical circuits
The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community
• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance
The security issues were the primary ones, and were addressed by
• Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
  – that is, only LHC data and compute servers are connected to the OPN
50
The LHC OPN – Optical Private Network
[Diagram: LHCOPN physical topology (abbreviated) and LHCOPN architecture – the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC and IT-INFN-CNAF, each connected to CH-CERN by dedicated circuits.]
51
The LHC OPN – Optical Private Network
N.B.
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays
  – The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact this is what ESnet's OSCARS virtual circuit system was originally designed for (see below)
  – However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic
– In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
– (there are about 170 Tier 2 sites)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170, see the count below) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose
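A quick count makes the scaling problem concrete (illustrative arithmetic only):

  # Potential flows among ~170 Tier 2 sites
  sites = 170
  print(sites * sites)               # 28,900 -- the "170 x 170" in the slide
  print(sites * (sites - 1) // 2)    # 14,365 distinct site pairs (excluding self-pairs)
  # Provisioning and operating tens of thousands of individual virtual circuits is far
  # heavier-weight than interconnecting a handful of per-domain routed "clouds" (VRFs).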
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GEANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineers
– To ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC
54
[Map (April 2012): LHCONE – a global infrastructure for the LHC Tier 1 data center – Tier 2 analysis center connectivity. LHCONE VRF domains (ESnet and Internet2 in the USA, CANARIE in Canada, GÉANT and the European NRENs – NORDUnet, DFN, GARR, RedIRIS, SARA, RENATER – plus ASGC and TWAREN in Taiwan, KERONET2 and KISTI in Korea, CUDI in Mexico, etc.) are interconnected at exchange points such as Seattle, Chicago, New York, Washington, Amsterdam and Geneva. End sites are LHC Tier 2 or Tier 3 unless indicated as Tier 1 (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, ASGC-T1, CERN-T1). Data communication links are 10, 20 and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because
  – the VRF technology is a standard capability in most core routers, and
  – there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance
• See lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
CERN → T1          miles    km
France               350     565
Italy                570     920
UK                   625    1000
Netherlands          625    1000
Germany              700    1185
Spain                850    1400
Nordic              1300    2100
USA – New York      3900    6300
USA – Chicago       4400    7100
Canada – BC         5200    8400
Taiwan              6100    9850
[Diagram: A Network Centric View of the LHC. The detector (producing of order 1 PB/s) feeds the Level 1 and 2 triggers over O(1-10) meters, then the Level 3 trigger over O(10-100) meters, and then the CERN Computer Center about O(1) km away. From CERN, 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) flows over the LHC Optical Private Network (LHCOPN), spanning 500-10,000 km, to the LHC Tier 1 data centers (Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN). The Tier 1 centers connect through the LHC Open Network Environment (LHCONE) to the LHC Tier 2 analysis centers – the universities and physics groups – which now get their data wherever it is most readily available.]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to
  – Couple existing pockets of code, data and expertise into "systems of systems"
  – Break up the task of massive data analysis and use data, compute and storage resources that are located at the collaborators' sites
  – See https://www.es.net/about/science-requirements/
A commonly identified need to support this is that networking must be provided as a "service"
  – Schedulable with guaranteed bandwidth – as is done with CPUs and disks
  – Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
  – Some network path characteristics may also be specified – e.g. diversity
  – Available in a Web Services / Grid Services paradigm
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
– This is typically done by using a "static" routing mechanism
  • E.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
– MPLS and OpenFlow are examples of this, and both can transport IP packets
– Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
– The virtual circuits can be directed to specific physical network paths when they are set up
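To make the "static switch tables set up in advance" idea concrete, here is a minimal sketch of label-based forwarding: each switch along the pre-computed path holds an entry mapping an incoming label to an outgoing port and label, and packets simply follow those entries. This is a toy model of the MPLS-style mechanism described above, with made-up switch and port names, not any vendor's implementation.

  # Toy label-switched path: per-switch tables installed in advance define the circuit.
  # table[switch][in_label] = (out_port, out_label)
  tables = {
      "sw-berkeley": {100: ("port7", 210)},
      "sw-chicago":  {210: ("port3", 305)},
      "sw-newyork":  {305: ("port1", None)},   # None -> pop label, deliver locally
  }
  path = ["sw-berkeley", "sw-chicago", "sw-newyork"]

  def forward(packet, first_label=100):
      label = first_label
      for sw in path:
          out_port, label = tables[sw][label]
          print(f"{sw}: sent {packet} out {out_port}")
          if label is None:
              return                            # reached the circuit's egress

  forward("big-data-frame")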
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead, Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references
• OSCARS received a 2013 "R&D 100" award
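For a sense of what "network as a schedulable service" looks like to a client, the sketch below shows the kind of request a virtual-circuit system such as OSCARS accepts: endpoints, a bandwidth guarantee, and a time window. The field names, endpoint names, and the submit_reservation function are hypothetical stand-ins, not the actual OSCARS API.

  # Hypothetical sketch of a guaranteed-bandwidth circuit reservation request.
  # (Field names and submit_reservation() are illustrative, not the OSCARS API.)
  from datetime import datetime, timedelta

  reservation = {
      "src_endpoint": "site-a.example.org",   # placeholder endpoint names
      "dst_endpoint": "site-b.example.org",
      "bandwidth_mbps": 10_000,               # guaranteed 10 Gb/s
      "start": datetime(2014, 4, 1, 0, 0),
      "end":   datetime(2014, 4, 1, 0, 0) + timedelta(hours=12),
      "vlan": "auto",                         # let the service pick the VLAN tag
  }

  def submit_reservation(req):
      """Stand-in for the call a domain controller would expose to schedule a circuit."""
      print(f"requesting {req['bandwidth_mbps']} Mb/s "
            f"{req['src_endpoint']} -> {req['dst_endpoint']} "
            f"from {req['start']} to {req['end']}")
      return {"status": "ACCEPTED", "circuit_id": "vc-0001"}

  print(submit_reservation(reservation))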
60
End User View of Circuits – How They Use Them
• Who are the "users"?
  – Sites, for the most part
• How are the circuits used?
  – End system to end system IP
    • Almost never – very hard unless private address space is used
      – Using public address space can result in leaking routes
      – Using private address space with multi-homed hosts risks allowing backdoors into secure networks
  – End system to end system Ethernet (or other) over VLAN – a pseudowire
    • Relatively common
    • Interesting example: RDMA over VLAN is likely to be popular in the future
      – SC11 demo of 40G RDMA over WAN was very successful
      – CPU load for RDMA is a small fraction of that of IP
      – The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks)
  – Point-to-point connection between routing instances – e.g. BGP at the end points
    • Essentially this is how all current circuits are used: from one site router to another site router
      – Typically site-to-site, or to advertise subnets that host clusters, e.g. LHC analysis or data management clusters
61
End User View of Circuits – How They Use Them
• When are the circuits used?
  – Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering

Cross-Domain Virtual Circuit Service
Large-scale collaborations span many network domains (administrative units)
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
– e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains
63
Inter-Domain Control Protocol
• There are two realms involved:
  1. Domain controllers, like OSCARS, for routing, scheduling and resource commitment within network domains
  2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains
[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] across ESnet (AS293) [US], GEANT (AS20965) [Europe] and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GEANT – plus a data plane connection helper at each domain ingress/egress point; topology exchange and VC setup requests pass from domain to domain.]
1. The domains exchange topology information containing at least potential VC ingress and egress points
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved
3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process at each domain boundary
The result is the end-to-end virtual circuit.
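The domain-to-domain chaining in steps 1–3 can be sketched as follows: each domain's controller reserves its own segment and then forwards the request onward, and the circuit exists only if every segment is granted. This is a schematic illustration of the IDC/NSI idea, not the actual protocol messages; the controller logic is assumed for the sketch.

  # Schematic chain of per-domain reservations for one end-to-end virtual circuit.
  # Domain names are taken from the diagram; the controller behavior is illustrative.
  domains = ["FNAL", "ESnet", "GEANT", "DFN", "DESY"]

  def reserve_segment(domain, bandwidth_mbps):
      """Stand-in for a local IDC (e.g. OSCARS, AutoBAHN) committing resources in its domain."""
      print(f"{domain}: segment reserved at {bandwidth_mbps} Mb/s")
      return True

  def setup_circuit(bandwidth_mbps=10_000):
      granted = []
      for d in domains:                       # request passed from domain to domain
          if not reserve_segment(d, bandwidth_mbps):
              for g in granted:               # any failure -> release what was reserved
                  print(f"{g}: segment released")
              return False
          granted.append(d)
      return True                             # all segments in place: end-to-end circuit exists

  print("circuit up:", setup_circuit())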
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
  – Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net
65
8) Provide R&D, consulting and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
  – With each generation of network transport technology:
    • 155 Mb/s was the norm for high speed networks in 1995
    • 100 Gb/s – 650 times greater – is the norm today
    • R&D groups involving hardware engineers, computer scientists and application specialists worked to
      • first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
      • and then do the development necessary for applications to make use of the new capabilities
  – Examples of how this methodology drove toward today's capabilities include
    • experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
    • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
66
Provide R&D, consulting and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical
Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone
The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations
67
The knowledge base
http://fasterdata.es.net topics:
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
  • Linux TCP Tuning
  • Cisco 6509 Tuning
  • perfSONAR Howto
  • Active perfSONAR Services
  • Globus overview
  • Say No to SCP
  • Data Transfer Nodes (DTN)
  • TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
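Much of the host-tuning advice in that knowledge base comes down to sizing TCP buffers to the bandwidth-delay product of the path. A quick worked example (the 10 Gb/s and 100 ms figures are illustrative values, not from the slide):

  # Bandwidth-delay product: the TCP buffer needed to keep a long, fast path full.
  bandwidth_bps = 10e9      # 10 Gb/s path
  rtt_s = 0.100             # 100 ms round-trip time, e.g. a transcontinental path
  bdp_bytes = bandwidth_bps * rtt_s / 8
  print(f"{bdp_bytes / 1e6:.0f} MB of buffer needed")   # ~125 MB
  # Default OS TCP buffer limits are far smaller, which is why host tuning (and tools
  # that use parallel streams) matters for long-RTT, high-rate transfers.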
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
69
Infrastructure Critical to Science
• The combination of
  – New network architectures in the wide area
  – New network services (such as guaranteed bandwidth virtual circuits)
  – Cross-domain network error detection and correction
  – Redesigning the site LAN to handle high data throughput
  – Automation of data movement systems
  – Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKA
The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
– A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository –
  • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
  • high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
– militate against a single large data center
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply:
There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
• It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
– In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, e.g., are implementing LHCONE
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical: the Science DMZ
Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
75
The Message
Again … a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf, and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
42
6) Data movement and management techniquesAutomated data movement is critical for moving 500
terabytesday between 170 international sites In order to effectively move large amounts of data over the
network automated systems must be used to manage workflow and error recovery
bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers
bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)
43
Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the
analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates
compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day
bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10
petabytes of datayear in order to accomplish its science
bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial
44
DDMAgent
DDMAgent
ATLAS production
jobs
Regional production
jobs
User Group analysis jobs
Data Service
Task Buffer(job queue)
Job Dispatcher
PanDA Server(task management)
Job Broker
Policy(job type priority)
ATLA
S Ti
er 1
Data
Cen
ters
11 s
ites
scat
tere
d ac
ross
Euro
pe N
orth
Am
erica
and
Asia
in
aggr
egat
e ho
ld 1
copy
of a
ll dat
a an
d pr
ovide
the
work
ing
data
set f
or d
istrib
ution
to T
ier 2
cen
ters
for a
nalys
isDistributed
Data Manager
Pilot Job(Panda job
receiver running under the site-
specific job manager)
Grid Scheduler
Site Capability Service
CERNATLAS detector
Tier 0 Data Center(1 copy of all data ndash
archival only)
Job resource managerbull Dispatch a ldquopilotrdquo job manager - a
Panda job receiver - when resources are available at a site
bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA
bull Similar to the Condor Glide-in approach
Site status
ATLAS analysis sites(eg 70 Tier 2 Centers in
Europe North America and SE Asia)
DDMAgent
DDMAgent
1) Schedules jobs initiates data movement
2) DDM locates data and moves it to sites
This is a complex system in its own right called DQ2
3) Prepares the local resources to receive Panda jobs
4) Jobs are dispatched when there are resources available and when the required data is
in place at the site
Thanks to Michael Ernst US ATLAS technical lead for his assistance with this
diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)
The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday
CERN
Try to move the job to where the data is else move data and job to where
resources are available
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe N America and SE Asia generate network
data movement of 730 TByday ~68Gbs
Accumulated data volume on disk
730 TBytesday
PanDA manages 120000ndash140000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movementgoing on 24 hrday 9+ monthsyr
that networks must support in order to enable the large-scale science of the LHC
0
50
100
150
Peta
byte
s
four years
0
50000
100000
type
2jo
bs
0
50000
type
1jo
bs
one year
one year
one year
46
Building an LHC-scale production analysis system In order to debug and optimize the distributed system that
accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in
ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC
production
47
Ramp-up of LHC traffic in ESnet
(est of ldquosmallrdquo scale traffic)
LHC
turn
-on
LHC data systemtesting
LHC operationThe transition from testing to operation
was a smooth continuum due toat-scale testing ndash a process that took
more than 5 years
48
6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument
to data centers ndash a dedicated purpose-built infrastructure is needed
bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to
the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the
Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward
exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community
bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by
bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
ndash that is only LHC data and compute servers are connected to the OPN
50
The LHC OPN ndash Optical Private Network
UK-T1_RAL
NDGF
FR-CCIN2P3
ES-PIC
DE-KIT
NL-T1
US-FNAL-CMS
US-T1-BNL
CA-TRIUMF
TW-ASCG
IT-NFN-CNAF
CH-CERNLHCOPN physical (abbreviated)
LHCOPN architecture
51
The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1
centers data transfer was to use dedicated physical 10G circuits
Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than
5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)
ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)
bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism
bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose
53
The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure
designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each
involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc
ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure
that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC
54
ESnetUSA
Chicago
New YorkBNL-T1
Internet2USA
Harvard
CANARIECanada
UVic
SimFraU
TRIUMF-T1
UAlb UTorMcGilU
Seattle
TWARENTaiwan
NCU NTU
ASGCTaiwan
ASGC-T1
KERONET2Korea
KNU
LHCONE VRF domain
End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1
Regional RampE communication nexus
Data communication links 10 20 and 30 Gbs
See httplhconenet for details
NTU
Chicago
LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity
NORDUnetNordic
NDGF-T1aNDGF-T1a NDGF-T1c
DFNGermany
DESYGSI DE-KIT-T1
GARRItaly
INFN-Nap CNAF-T1
RedIRISSpain
PIC-T1
SARANetherlands
NIKHEF-T1
RENATERFrance
GRIF-IN2P3
Washington
CUDIMexico
UNAM
CC-IN2P3-T1Sub-IN2P3
CEA
CERNGeneva
CERN-T1
SLAC
GLakes
NE
MidWSoW
Geneva
KISTIKorea
TIFRIndia
India
Korea
FNAL-T1
MIT
CaltechUFlorida
UNebPurU
UCSDUWisc
UltraLightUMich
Amsterdam
GEacuteANT Europe
April 2012
55
The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because
ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for
use by the LHC collaboration that cannot be made available for general RampE traffic
bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance
bull See LHCONEnet
LHCONE is one part of the network infrastructure that supports the LHC
CERN rarrT1 miles kms
France 350 565
Italy 570 920
UK 625 1000
Netherlands 625 1000
Germany 700 1185
Spain 850 1400
Nordic 1300 2100
USA ndash New York 3900 6300
USA - Chicago 4400 7100
Canada ndash BC 5200 8400
Taiwan 6100 9850
CERN Computer Center
The LHC Optical Private Network
(LHCOPN)
LHC Tier 1Data Centers
LHC Tier 2 Analysis Centers
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups Universities
physicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
The LHC Open Network
Environment(LHCONE)
50 Gbs (25Gbs ATLAS 25Gbs CMS)
detector
Level 1 and 2 triggers
Level 3 trigger
O(1-10) meter
O(10-100) meters
O(1) km
1 PBs
500-10000 km
This is intended to indicate that the physics
groups now get their datawherever it is most readily
available
A Network Centric View of the LHC
Taiwan Canada USA-Atlas USA-CMS
Nordic
UK
Netherlands Germany Italy
Spain
FranceCERN
57
7) New network servicesPoint-to-Point Virtual Circuit Service
Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly
consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of
systemsrdquondash Break up the task of massive data analysis and use data compute and
storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements
A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work
well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm
58
Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual
circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism
bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path
ndash MPLS and OpenFlow are examples of this and both can transport IP packets
ndash Most modern Internet routers have this type of functionality
bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths
when they are set up
59
Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service
(For more information contact the project lead Chin Guok chinesnet)
bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references
bull OSCARS received a 2013 ldquoRampD 100rdquo award
60
End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo
ndash Sites for the most part
bull How are the circuits usedndash End system to end system IP
bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure
networks
ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future
ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP
protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
ndash Point-to-point connection between routing instance ndash eg BGP at the end points
bull Essentially this is how all current circuits are used from one site router to another site router
ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters
61
End User View of Circuits ndash How They Use Thembull When are the circuits used
ndash Mostly to solve a specific problem that the general infrastructure cannot
bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering
network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains
involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET
(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains
63
Inter-Domain Control Protocolbull There are two realms involved
1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains
2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains
FNAL (AS3152)[US]
ESnet (AS293)[US]
GEANT (AS20965)[Europe]
DFN (AS680)[Germany]
DESY (AS1754)[Germany]
Topology exchange
VC setup request
Local InterDomain
Controller
Local IDC
Local IDC
Local IDC
Local IDC
VC setup request
VC setup request
VC setup request
OSCARS
User source
User destination
VC setup request
data plane connection helper at each domain ingressegress point
data plane connection helper at each domain ingressegress point
1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to
domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process
The end-to-end virtual circuit
AutoBAHN
64
Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved
into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated
Facility - an international virtual organization that promotes the paradigm of lambda networking)
bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet
65
8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the
network to support data-intensive sciencendash With each generation of network transport technology
bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and
application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo
end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of
the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities
includebull experiments in the 1990s in using parallel disk IO and parallel network IO
together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths
bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs
66
Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects
are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build
a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at
httpfasterdataesnet and contains contributions from several organizations
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional
centers) has worked wellndash It decentralizes costs and involves many countries directly in the
telescope infrastructurendash It divides up the network load especially on the expensive trans-
ocean linksndash It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKARegardless of distributed vs centralized working data
repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source
to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a
centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue
bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC
ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The Message – Again …
 A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
 Many of the technologies and the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References (2)
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References (3)
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
Foundations of data-intensive science: Technology and practice
Data-Intensive Science in DOE's Office of Science
DOE Office of Science and ESnet – the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport: The limitations of TCP must be addressed for
Transport
Transport: Impact of packet loss on TCP
Transport: Modern TCP stack
Transport: Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
4.1) System software tuning: Host tuning – TCP
System software tuning: Host tuning – TCP
System software tuning: Host tuning – TCP
4.2) System software tuning: Data transfer tools
System software tuning: Data transfer tools
System software tuning: Data transfer tools (2)
System software tuning: Data transfer tools (3)
4.4) System software tuning: Other issues
5) Site infrastructure to support data-intensive science: The Science DMZ
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN – Optical Private Network
The LHC OPN – Optical Private Network (2)
The LHC OPN – Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHC's Open Network Environment – LHCONE
Slide 54
The LHC's Open Network Environment – LHCONE (2)
LHCONE is one part of the network infrastructure that supports the LHC
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits – How They Use Them
End User View of Circuits – How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide R&D, consulting and knowledge base
Provide R&D, consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
35
System software tuning: Data transfer tools
 Globus GridFTP is the basis of most modern high-performance data movement systems.
 It provides parallel streams, buffer tuning, help in getting through firewalls (open ports), ssh-based control, etc.
 The newer Globus Online incorporates all of these and adds small-file support, pipelining, automatic error recovery, third-party transfers, etc.
• This is a very useful tool, especially for the applications community outside of HEP.
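As an illustration of the parallel-stream and buffer-tuning knobs mentioned above, the sketch below drives a transfer with the GridFTP command-line client. The endpoints and paths are placeholders, and the option values are examples to be sized to the actual path, not a recommended configuration.

```python
# A minimal sketch (not a production script) of a GridFTP transfer using
# parallel streams and explicit TCP buffer tuning via globus-url-copy.
import subprocess

SRC = "gsiftp://dtn01.example-lab.gov/data/run42/file.dat"   # hypothetical source DTN
DST = "gsiftp://dtn.example-uni.edu/scratch/file.dat"        # hypothetical destination DTN

cmd = [
    "globus-url-copy",
    "-vb",             # report transfer performance
    "-p", "8",         # 8 parallel TCP streams
    "-tcp-bs", "32M",  # per-stream TCP buffer; should be sized to the path's BDP
    SRC, DST,
]
subprocess.run(cmd, check=True)
```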
36
System software tuning: Data transfer tools
 Also see Caltech's FDT (Fast Data Transfer) approach.
– Not so much a tool as a hardware/software system designed to be a very high-speed data transfer node.
– Explicit parallel use of multiple disks.
– Can fill 100 Gb/s paths.
– See the SC 2011 bandwidth challenge results and http://monalisa.cern.ch/FDT
37
4.4) System software tuning: Other issues
 Firewalls are anathema to high-speed data flows.
– Many firewalls can't handle >1 Gb/s flows:
• they are designed for large numbers of low-bandwidth flows;
• some firewalls even strip out the TCP options that allow for TCP buffers > 64 KB.
 See Jason Zurawski's "Say Hello to your Frienemy – The Firewall."
 Stateful firewalls have inherent problems that inhibit high throughput.
• http://fasterdata.es.net/assets/fasterdata/Firewall-tcptrace.pdf
• Many other issues:
– Large MTUs (several issues)
– NIC tuning
• Defaults are usually fine for 1GE, but 10GE often requires additional tuning.
– Other OS tuning knobs
– See fasterdata.es.net and "High Performance Bulk Data Transfer" ([HPBulk])
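Two of the tuning items above (buffer sizing and OS limits) can be made concrete with a short sketch. The path numbers below are assumed examples, and the kernel will silently cap the requested socket buffers at its configured maxima.

```python
# Sizing TCP buffers to the bandwidth-delay product and requesting them from the OS.
import socket

def bdp_bytes(rate_gbps: float, rtt_ms: float) -> int:
    """Bandwidth-delay product: the buffering needed to keep the path full."""
    return int(rate_gbps * 1e9 / 8 * rtt_ms / 1000.0)

rate, rtt = 10.0, 90.0            # e.g. a 10 Gb/s path with a 90 ms RTT (assumed)
need = bdp_bytes(rate, rtt)
print(f"BDP for {rate} Gb/s x {rtt} ms RTT: {need / 1e6:.0f} MB")

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, need)   # capped by net.core.wmem_max
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, need)   # capped by net.core.rmem_max
print("granted:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
      s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
s.close()
```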
5) Site infrastructure to support data-intensive science: The Science DMZ
 With the wide area part of the network infrastructure addressed, the typical site/campus LAN becomes the bottleneck.
 The site network (LAN) typically provides connectivity for the local resources – compute, data, instruments, collaboration systems, etc. – needed by data-intensive science.
– Therefore a high-performance interface between the wide area network and the local area site network is critical for large-scale data movement.
 Campus network infrastructure is typically not designed to handle the flows of large-scale science.
– The devices and configurations typically deployed to build LAN networks for business and small data-flow purposes usually don't work for large-scale data flows:
• firewalls, proxy servers, low-cost switches, and so forth,
• none of which will allow high-volume, high-bandwidth, long-distance data flows.
39
The Science DMZ
 To provide high data-rate access to local resources, the site LAN infrastructure must be re-designed to match the high-bandwidth, large-data-volume, high round trip time (RTT) (international paths) wide area network (WAN) flows (see [DIS]).
– Otherwise the site will impose poor performance on the entire high-speed data path, all the way back to the source.
40
The Science DMZ
 The Science DMZ concept:
 The compute and data resources involved in data-intensive science should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy.
 It sits outside the site firewall – hence the term "Science DMZ."
 It has dedicated systems built and tuned for wide-area data transfer.
 It has test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below).
 It has a security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. hardware that supports access control lists, private address space, etc.).
 This is so important that it was a requirement for the last round of NSF CC-NIE grants.
41
The Science DMZ
 [Diagram: Science DMZ architecture. The border router connects the WAN to a Science DMZ router/switch (a WAN-capable device), which serves a high-performance Data Transfer Node, the computing cluster, and network monitoring and testing systems over a clean, high-bandwidth WAN data path, with per-service security policy control points. Campus/site access to the Science DMZ resources, the site DMZ (Web/DNS/Mail), and secured campus/site access to the Internet all pass through the site firewall; the DTNs are dedicated systems built and tuned for wide-area data transfer.]
 (See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)
42
6) Data movement and management techniques
 Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
 In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data rate of about 2.5 Gb/s is sent to 10 national Tier 1 data centers.
• The Tier 2 sites get a comparable amount of data from the Tier 1s.
– They host the physics groups that analyze the data and do the science.
– They provide most of the compute resources for analysis.
– They cache the data (though this is evolving to remote I/O).
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management.
– The resources and data movement are centrally managed.
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations.
– The system manages tens of thousands of jobs a day:
• it coordinates data movement of hundreds of terabytes/day, and
• it manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science.
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial.
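The matching step described above can be illustrated with a toy broker. This is not PanDA code, just the idea of preferring a site that already holds the dataset and otherwise scheduling a data movement; all site and dataset names are made up.

```python
# Toy brokering sketch (illustrative only): run where the data is, else move the data.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    free_slots: int
    datasets: set

def broker(job_dataset: str, sites: list) -> tuple:
    with_data = [s for s in sites if job_dataset in s.datasets and s.free_slots > 0]
    if with_data:
        best = max(with_data, key=lambda s: s.free_slots)
        return best.name, None                       # data already present, no transfer
    best = max(sites, key=lambda s: s.free_slots)    # else pick the freest site
    return best.name, f"replicate {job_dataset} -> {best.name}"

sites = [Site("T2_US_Example", 120, {"dsA"}), Site("T2_DE_Example", 400, {"dsB"})]
print(broker("dsA", sites))   # ('T2_US_Example', None)
print(broker("dsC", sites))   # ('T2_DE_Example', 'replicate dsC -> T2_DE_Example')
```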
44
 The ATLAS PanDA "Production and Distributed Analysis" system uses distributed resources and layers of automation to manage several million jobs/day.
 [Diagram: ATLAS production jobs, regional production jobs, and user/group analysis jobs enter the PanDA Server (task management) at CERN, which comprises a Task Buffer (job queue), Job Broker, Policy (job type, priority), Job Dispatcher, and Data Service, and works with the Distributed Data Manager (DDM agents), a Grid Scheduler, a Site Capability Service, and site status information. The CERN ATLAS detector feeds the Tier 0 Data Center (1 copy of all data – archival only). The ATLAS Tier 1 data centers – 11 sites scattered across Europe, North America and Asia – in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis. The ATLAS analysis sites are, e.g., 70 Tier 2 centers in Europe, North America and SE Asia.]
 Job resource manager:
• Dispatch a "pilot" job manager – a PanDA job receiver – when resources are available at a site.
• Pilots run under the local site job manager (e.g. Condor, LSF, LCG, …) and accept jobs in a standard format from PanDA.
• This is similar to the Condor Glide-in approach.
 Workflow:
1) PanDA schedules jobs and initiates data movement.
2) The DDM locates data and moves it to the sites (this is a complex system in its own right, called DQ2).
3) Pilot jobs prepare the local resources to receive PanDA jobs.
4) Jobs are dispatched when there are resources available and when the required data is in place at the site.
 The strategy is to try to move the job to where the data is, else move data and job to where resources are available.
 (Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. Both are at Brookhaven National Lab.)
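The pilot-job pattern in the diagram can also be sketched in a few lines. This is only an illustration of the idea (a pilot, started by the local batch system, pulls real work from a central queue when a CPU slot is actually free), not the ATLAS pilot code; the in-process queue stands in for the PanDA server.

```python
# Toy pilot-job sketch (illustrative only).
import queue, time

task_queue = queue.Queue()
for ds in ("dsA", "dsB"):
    task_queue.put({"dataset": ds, "transform": "analyze"})

def pilot(site: str) -> None:
    """Runs under the site's own job manager (Condor, LSF, ...) and pulls work."""
    while True:
        try:
            job = task_queue.get_nowait()
        except queue.Empty:
            print(f"[{site}] no work left, pilot exits")
            return
        print(f"[{site}] running {job['transform']} on {job['dataset']}")
        time.sleep(0.1)   # pretend to do the work

pilot("T2_Example_Site")
```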
45
Scale of ATLAS analysis driven data movement
 The PanDA jobs executing at centers all over Europe, N. America and SE Asia generate network data movement of 730 TBytes/day, ~68 Gb/s.
 PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, shown separately in the original charts).
 It is this scale of data movement, going on 24 hr/day, 9+ months/yr, that networks must support in order to enable the large-scale science of the LHC.
 [Charts: accumulated data volume on disk, rising to roughly 150 petabytes over four years, and the two PanDA job types, each running at roughly 50,000–100,000 concurrent jobs over a one-year window.]
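The headline rate above is easy to check: 730 TBytes/day corresponds to roughly 68 Gb/s of sustained traffic.

```python
# Quick check: 730 TB/day expressed as a sustained bit rate.
tb_per_day = 730
bits = tb_per_day * 1e12 * 8          # terabytes -> bits
gbps = bits / 86400 / 1e9             # per second, in Gb/s
print(f"{tb_per_day} TB/day ≈ {gbps:.1f} Gb/s sustained")   # ≈ 67.6 Gb/s
```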
46
Building an LHC-scale production analysis system
 In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure.
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges."
– Successful testing was required for sites to participate in LHC production.
47
Ramp-up of LHC traffic in ESnet
 [Chart: ESnet traffic over time, showing an estimate of the "small"-scale traffic before the LHC, the LHC data system testing period, LHC turn-on, and LHC operation.]
 The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.
48
6 cont) Evolution of network architectures
 For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called the LHCOPN.
– The LHCOPN is a collection of leased 10 Gb/s optical circuits.
 The role of the LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.
• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance.
 The security issues were the primary ones, and they were addressed by:
• using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec])
– that is, only LHC data and compute servers are connected to the OPN.
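The address-space discipline can be illustrated in a few lines; the prefix below is a placeholder (a TEST-NET block), not the real LHCOPN allocation.

```python
# Sketch: admit a host onto the OPN only if it falls inside the dedicated prefix.
import ipaddress

OPN_PREFIX = ipaddress.ip_network("192.0.2.0/24")   # placeholder prefix, not the real one

def allowed_on_opn(host: str) -> bool:
    return ipaddress.ip_address(host) in OPN_PREFIX

for h in ("192.0.2.17", "198.51.100.5"):
    print(h, "->", "accept" if allowed_on_opn(h) else "reject")
```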
50
The LHC OPN – Optical Private Network
 [Diagram: LHCOPN physical topology (abbreviated) and LHCOPN architecture – CH-CERN connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network
 NB:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
 Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.
Managing large-scale science traffic in a shared infrastructure
 The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic.
– (There are about 170 Tier 2 sites.)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.
53
The LHC's Open Network Environment – LHCONE
 LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
 The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to the subnets that are used by LHC systems.
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GÉANT ("fronting" for the NRENs), Internet2 (fronting for the US universities), etc.).
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).
 In this way the LHC traffic will use circuits designated by the network engineers.
– This ensures continued good performance for the LHC and ensures that other traffic is not impacted – critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.
54
 [Map: "LHCONE: a global infrastructure for the LHC Tier 1 data center – Tier 2 analysis center connectivity" (April 2012). LHCONE VRF domains in ESnet (USA), Internet2 (USA), CANARIE (Canada), GÉANT (Europe), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), ASGC and TWAREN (Taiwan), KERONET2 and KISTI (Korea), TIFR (India), and CUDI (Mexico) interconnect the Tier 1 centers and the LHC Tier 2/Tier 3 end sites through regional R&E communication nexus points (Seattle, Chicago, New York, Washington, Amsterdam, Geneva) over data communication links of 10, 20 and 30 Gb/s. End sites are LHC Tier 2 or Tier 3 unless indicated as Tier 1. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because:
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
 [Diagram: A Network Centric View of the LHC. The detector (1 PB/s raw output) feeds the Level 1 and 2 triggers over O(1-10) meters and the Level 3 trigger over O(10-100) meters; the CERN Computer Center, at O(1) km, exports about 5.0 Gb/s (2.5 Gb/s ATLAS, 2.5 Gb/s CMS) over the LHC Optical Private Network (LHCOPN) to the LHC Tier 1 Data Centers 500-10,000 km away (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN). The LHC Open Network Environment (LHCONE) then connects the Tier 1 centers to the LHC Tier 2 analysis centers at the universities and physics groups, which now get their data wherever it is most readily available.]
 Approximate CERN to Tier 1 distances:
   France            350 miles    565 km
   Italy             570 miles    920 km
   UK                625 miles   1000 km
   Netherlands       625 miles   1000 km
   Germany           700 miles   1185 km
   Spain             850 miles   1400 km
   Nordic           1300 miles   2100 km
   USA – New York   3900 miles   6300 km
   USA – Chicago    4400 miles   7100 km
   Canada – BC      5200 miles   8400 km
   Taiwan           6100 miles   9850 km
57
7) New network services: Point-to-Point Virtual Circuit Service
 Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to:
– couple existing pockets of code, data and expertise into "systems of systems";
– break up the task of massive data analysis and use data, compute and storage resources that are located at the collaborators' sites;
– see https://www.es.net/about/science-requirements/
 A commonly identified need to support this is that networking must be provided as a "service":
– schedulable with guaranteed bandwidth – as is done with CPUs and disks;
– traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure;
– some network path characteristics may also be specified – e.g. diversity;
– available in the Web Services / Grid Services paradigm.
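The "schedulable, guaranteed bandwidth" requirement amounts to admission control over time. The toy sketch below (not OSCARS code) checks a reservation request against the bandwidth already committed on a link during the requested window; the capacities and windows are made-up examples.

```python
# Toy bandwidth-reservation admission check (illustrative only).
from dataclasses import dataclass

@dataclass
class Reservation:
    start: float   # hours
    end: float
    gbps: float

def admit(existing, req, capacity_gbps):
    """Accept req only if committed + requested bandwidth fits the capacity at all times."""
    edges = sorted({req.start, req.end,
                    *[r.start for r in existing], *[r.end for r in existing]})
    for t0, t1 in zip(edges, edges[1:]):
        if t1 <= req.start or t0 >= req.end:
            continue                      # interval outside the requested window
        load = sum(r.gbps for r in existing if r.start < t1 and r.end > t0)
        if load + req.gbps > capacity_gbps:
            return False
    return True

booked = [Reservation(0, 4, 40), Reservation(2, 6, 30)]
print(admit(booked, Reservation(1, 3, 20), capacity_gbps=100))   # True
print(admit(booked, Reservation(1, 3, 40), capacity_gbps=100))   # False
```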
58
Point-to-Point Virtual Circuit Service
 The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
– This is typically done by using a "static" routing mechanism.
• E.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path.
– MPLS and OpenFlow are examples of this, and both can transport IP packets.
– Most modern Internet routers have this type of functionality.
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage and optimize the use of available network resources and to keep big data flows separate from general traffic.
– The virtual circuits can be directed to specific physical network paths when they are set up.
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.
60
End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part.
• How are the circuits used?
– End system to end system, IP:
• almost never – very hard unless private address space is used;
– using public address space can result in leaking routes;
– using private address space with multi-homed hosts risks allowing backdoors into secure networks.
– End system to end system, Ethernet (or other) over VLAN – a pseudowire:
• relatively common;
• interesting example: RDMA over VLAN is likely to be popular in the future
– the SC11 demo of 40G RDMA over the WAN was very successful;
– the CPU load for RDMA is a small fraction of that for IP;
– the guaranteed network characteristics of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fit nicely with circuit services (RDMA performs very poorly on best-effort networks).
– Point-to-point connection between routing instances – e.g. BGP at the end points:
• essentially this is how all current circuits are used, from one site router to another site router;
– typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters.
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot.
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering.
 Cross-Domain Virtual Circuit Service
 The end-to-end path of a large science collaboration crosses many network domains (administrative units).
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration, to provide end-to-end circuits.
– E.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, such as OSCARS, for routing, scheduling and resource commitment within network domains.
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.
 [Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] across ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GEANT – the domains exchange topology information, the VC setup request is passed from IDC to IDC, and a data plane connection helper operates at each domain ingress/egress point.]
1. The domains exchange topology information containing at least the potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process at each domain boundary.
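The three steps above can be caricatured in a few lines. This is only an illustration of the domain-by-domain setup chain, not the actual IDC/NSI protocol, and the per-domain admission decision is a stand-in for each domain's local controller.

```python
# Toy sketch of chained, domain-by-domain virtual circuit setup (illustrative only).
DOMAIN_CHAIN = ["FNAL", "ESnet", "GEANT", "DFN", "DESY"]   # from the diagram above

def reserve_segment(domain: str, gbps: float) -> bool:
    # Stand-in for the local controller's (e.g. OSCARS) admission decision.
    print(f"  {domain}: reserving {gbps} Gb/s segment ... ok")
    return True

def setup_circuit(chain, gbps):
    reserved = []
    for domain in chain:
        if not reserve_segment(domain, gbps):
            print(f"  {domain}: refused; tearing down {reserved}")
            return False
        reserved.append(domain)
    print("end-to-end virtual circuit established:", " -> ".join(reserved))
    return True

setup_circuit(DOMAIN_CHAIN, gbps=10)
```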
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
– Testing is being coordinated in GLIF (the Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access and data movement can all work together as a predictable system.
 Multi-domain circuit setup is not yet a robust production service, but progress is being made.
• See lhcone.net
65
8) Provide R&D, consulting and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science.
– With each generation of network transport technology:
• 155 Mb/s was the norm for high speed networks in 1995;
• 100 Gb/s – 650 times greater – is the norm today.
• R&D groups involving hardware engineers, computer scientists and application specialists worked:
• first, to demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible;
• and then to do the development necessary for applications to make use of the new capabilities.
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths;
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s.
66
Provide R&D, consulting and knowledge base
• Providing consulting on the problems that data-intensive projects are having in effectively using the network is critical.
 Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
 The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base
 http://fasterdata.es.net topics:
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations.
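In the spirit of the Linux TCP tuning section listed above, the sketch below reports a host's current TCP-related limits. The "suggested" values are illustrative placeholders; the authoritative recommendations for a given host and path are the ones on fasterdata.es.net.

```python
# Report the Linux kernel limits that matter for long-RTT TCP transfers.
from pathlib import Path

CHECKS = {
    "net/core/rmem_max": 67_108_864,                  # illustrative 64 MB ceiling
    "net/core/wmem_max": 67_108_864,
    "net/ipv4/tcp_congestion_control": "htcp",        # example alternative to the default
}

for key, suggested in CHECKS.items():
    p = Path("/proc/sys") / key
    current = p.read_text().strip() if p.exists() else "<not available>"
    print(f"{key:35s} current={current!s:15s} suggested~{suggested}")
```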
68
The Message
 A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
 Many of the technologies and the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
69
Infrastructure Critical to Science
• The combination of:
– new network architectures in the wide area;
– new network services (such as guaranteed bandwidth virtual circuits);
– cross-domain network error detection and correction;
– redesigning the site LAN to handle high data throughput;
– automation of data movement systems;
– use of appropriate operating system tuning and data transfer tools;
now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.
36
System software tuning Data transfer toolsAlso see Caltechs FDT (Faster Data Transfer) approach
ndash Not so much a tool as a hardwaresoftware system designed to be a very high-speed data transfer node
ndash Explicit parallel use of multiple disksndash Can fill 100 Gbs pathsndash See SC 2011 bandwidth challenge results and
httpmonalisacernchFDT
37
44) System software tuning Other issuesFirewalls are anathema to high-peed data flows
ndash many firewalls canrsquot handle gt1 Gbs flowsbull designed for large number of low bandwidth flowsbull some firewalls even strip out TCP options that allow for
TCP buffers gt 64 KB See Jason Zurawskirsquos ldquoSay Hello to your Frienemy ndash The Firewallrdquo
Stateful firewalls have inherent problems that inhibit high throughputbull httpfasterdataesnetassetsfasterdataFirewall-tcptracepdf
bull Many other issuesndash Large MTUs (several issues)ndash NIC tuning
bull Defaults are usually fine for 1GE but 10GE often requires additional tuning
ndash Other OS tuning knobsndash See fasterdataesnet and ldquoHigh Performance Bulk Data Transferrdquo
([HPBulk])
5) Site infrastructure to support data-intensive scienceThe Science DMZ
With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the
bottleneckThe site network (LAN) typically provides connectivity for local
resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network
and the local area site network is critical for large-scale data movement
Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks
for business and small data-flow purposes usually donrsquot work for large-scale data flows
bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data
flows
39
The Science DMZTo provide high data-rate access to local resources the site
LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high
speed data path all the way back to the source
40
The Science DMZThe ScienceDMZ concept
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and
rapid fault isolation typically perfSONAR (see [perfSONAR] and below)
A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)
This is so important it was a requirement for last round of NSF CC-NIE grants
41
The Science DMZ
(See httpfasterdataesnetscience-dmz
and [SDMZ] for a much more complete
discussion of the various approaches)
campus siteLAN
high performanceData Transfer Node
computing cluster
cleanhigh-bandwidthWAN data path
campussiteaccess to
Science DMZresources is via the site firewall
secured campussiteaccess to Internet
border routerWAN
Science DMZrouterswitch
campus site
Science DMZ
Site DMZ WebDNS
Mail
network monitoring and testing
A WAN-capable device
per-servicesecurity policycontrol points
site firewall
dedicated systems built and
tuned for wide-area data transfer
42
6) Data movement and management techniquesAutomated data movement is critical for moving 500
terabytesday between 170 international sites In order to effectively move large amounts of data over the
network automated systems must be used to manage workflow and error recovery
bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers
bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)
43
Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the
analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates
compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day
bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10
petabytes of datayear in order to accomplish its science
bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial
44
DDMAgent
DDMAgent
ATLAS production
jobs
Regional production
jobs
User Group analysis jobs
Data Service
Task Buffer(job queue)
Job Dispatcher
PanDA Server(task management)
Job Broker
Policy(job type priority)
ATLA
S Ti
er 1
Data
Cen
ters
11 s
ites
scat
tere
d ac
ross
Euro
pe N
orth
Am
erica
and
Asia
in
aggr
egat
e ho
ld 1
copy
of a
ll dat
a an
d pr
ovide
the
work
ing
data
set f
or d
istrib
ution
to T
ier 2
cen
ters
for a
nalys
isDistributed
Data Manager
Pilot Job(Panda job
receiver running under the site-
specific job manager)
Grid Scheduler
Site Capability Service
CERNATLAS detector
Tier 0 Data Center(1 copy of all data ndash
archival only)
Job resource managerbull Dispatch a ldquopilotrdquo job manager - a
Panda job receiver - when resources are available at a site
bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA
bull Similar to the Condor Glide-in approach
Site status
ATLAS analysis sites(eg 70 Tier 2 Centers in
Europe North America and SE Asia)
DDMAgent
DDMAgent
1) Schedules jobs initiates data movement
2) DDM locates data and moves it to sites
This is a complex system in its own right called DQ2
3) Prepares the local resources to receive Panda jobs
4) Jobs are dispatched when there are resources available and when the required data is
in place at the site
Thanks to Michael Ernst US ATLAS technical lead for his assistance with this
diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)
The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday
CERN
Try to move the job to where the data is else move data and job to where
resources are available
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe N America and SE Asia generate network
data movement of 730 TByday ~68Gbs
Accumulated data volume on disk
730 TBytesday
PanDA manages 120000ndash140000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movementgoing on 24 hrday 9+ monthsyr
that networks must support in order to enable the large-scale science of the LHC
0
50
100
150
Peta
byte
s
four years
0
50000
100000
type
2jo
bs
0
50000
type
1jo
bs
one year
one year
one year
46
Building an LHC-scale production analysis system In order to debug and optimize the distributed system that
accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in
ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC
production
47
Ramp-up of LHC traffic in ESnet
(est of ldquosmallrdquo scale traffic)
LHC
turn
-on
LHC data systemtesting
LHC operationThe transition from testing to operation
was a smooth continuum due toat-scale testing ndash a process that took
more than 5 years
48
6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument
to data centers ndash a dedicated purpose-built infrastructure is needed
bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to
the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the
Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward
exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community
bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by
bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
ndash that is only LHC data and compute servers are connected to the OPN
50
The LHC OPN ndash Optical Private Network
UK-T1_RAL
NDGF
FR-CCIN2P3
ES-PIC
DE-KIT
NL-T1
US-FNAL-CMS
US-T1-BNL
CA-TRIUMF
TW-ASCG
IT-NFN-CNAF
CH-CERNLHCOPN physical (abbreviated)
LHCOPN architecture
51
The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1
centers data transfer was to use dedicated physical 10G circuits
Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than
5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)
ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)
bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism
bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose
53
The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure
designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each
involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc
ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure
that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC
54
ESnetUSA
Chicago
New YorkBNL-T1
Internet2USA
Harvard
CANARIECanada
UVic
SimFraU
TRIUMF-T1
UAlb UTorMcGilU
Seattle
TWARENTaiwan
NCU NTU
ASGCTaiwan
ASGC-T1
KERONET2Korea
KNU
LHCONE VRF domain
End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1
Regional RampE communication nexus
Data communication links 10 20 and 30 Gbs
See httplhconenet for details
NTU
Chicago
LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity
NORDUnetNordic
NDGF-T1aNDGF-T1a NDGF-T1c
DFNGermany
DESYGSI DE-KIT-T1
GARRItaly
INFN-Nap CNAF-T1
RedIRISSpain
PIC-T1
SARANetherlands
NIKHEF-T1
RENATERFrance
GRIF-IN2P3
Washington
CUDIMexico
UNAM
CC-IN2P3-T1Sub-IN2P3
CEA
CERNGeneva
CERN-T1
SLAC
GLakes
NE
MidWSoW
Geneva
KISTIKorea
TIFRIndia
India
Korea
FNAL-T1
MIT
CaltechUFlorida
UNebPurU
UCSDUWisc
UltraLightUMich
Amsterdam
GEacuteANT Europe
April 2012
55
The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because
ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for
use by the LHC collaboration that cannot be made available for general RampE traffic
bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance
bull See LHCONEnet
LHCONE is one part of the network infrastructure that supports the LHC
CERN rarrT1 miles kms
France 350 565
Italy 570 920
UK 625 1000
Netherlands 625 1000
Germany 700 1185
Spain 850 1400
Nordic 1300 2100
USA ndash New York 3900 6300
USA - Chicago 4400 7100
Canada ndash BC 5200 8400
Taiwan 6100 9850
A Network Centric View of the LHC
[Figure: a schematic of the LHC data path – detector, Level 1 and 2 triggers (O(1-10) meters from the detector), Level 3 trigger (O(10-100) meters), and the CERN Computer Center (O(1) km, with a raw detector output of order 1 PB/s ahead of the triggers). From CERN the LHC Optical Private Network (LHCOPN) carries 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) over 500-10,000 km paths to the LHC Tier 1 Data Centers (Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN), and the LHC Open Network Environment (LHCONE) connects the Tier 1 centers to the LHC Tier 2 Analysis Centers (universities and physics groups). This is intended to indicate that the physics groups now get their data wherever it is most readily available.]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a Circuit Service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences in that they use distributed applications systems in order to
– Couple existing pockets of code, data, and expertise into "systems of systems"
– Break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
– See https://www.es.net/about/science-requirements
A commonly identified need to support this is that networking must be provided as a "service":
– Schedulable with guaranteed bandwidth – as is done with CPUs and disks
– Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
– Some network path characteristics may also be specified – e.g. diversity
– Available in a Web Services / Grid Services paradigm
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
– This is typically done by using a "static" routing mechanism
• e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
– MPLS and OpenFlow are examples of this, and both can transport IP packets
– Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
– The virtual circuits can be directed to specific physical network paths when they are set up
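To make the "static switch tables set up in advance" idea concrete, here is a toy sketch (plain Python, not MPLS or OpenFlow themselves) of how pre-installed label tables pin a flow to a specific path; all switch, port, and label names are invented for illustration:

```python
# Toy illustration of a label-switched virtual circuit: each switch has a static
# table, installed in advance, that maps (in_port, in_label) -> (out_port, out_label).
# Packets carry only a label; the path is fixed by the tables, not by IP routing.

# Hypothetical three-switch path A -> B -> C provisioned for one circuit.
tables = {
    "A": {("site1", 100): ("toB", 200)},
    "B": {("fromA", 200): ("toC", 300)},
    "C": {("fromB", 300): ("site2", None)},   # None = pop label, deliver locally
}
links = {("A", "toB"): ("B", "fromA"), ("B", "toC"): ("C", "fromB")}

def forward(switch, in_port, label):
    out_port, out_label = tables[switch][(in_port, label)]
    if out_label is None:
        return f"delivered at {switch}, port {out_port}"
    next_switch, next_port = links[(switch, out_port)]
    return forward(next_switch, next_port, out_label)

print(forward("A", "site1", 100))   # follows the pre-provisioned circuit end to end
```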
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead, Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references
• OSCARS received a 2013 "R&D 100" award
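As an illustration of what "network as a schedulable service" looks like from the client side, the sketch below shows the kind of reservation request such a service might accept. The field names, endpoint names, and workflow are assumptions for illustration only, not the actual OSCARS API:

```python
# Hypothetical sketch of a request to an OSCARS-like bandwidth reservation service.
# All fields and hostnames are illustrative assumptions, NOT the real OSCARS interface.
import json
from datetime import datetime, timedelta, timezone

start = datetime.now(timezone.utc) + timedelta(hours=1)
request = {
    "source":         "site-a-dtn.example.org",     # assumed data transfer node names
    "destination":    "site-b-dtn.example.org",
    "bandwidth_mbps": 5000,                          # guaranteed bandwidth, like CPU/disk scheduling
    "start":          start.isoformat(),
    "end":            (start + timedelta(hours=12)).isoformat(),
    "path_constraints": {"diverse_from": None},      # e.g. request path diversity
}
print(json.dumps(request, indent=2))
# A real deployment would submit this to the domain's reservation controller and,
# once every domain on the path commits, receive back a circuit (e.g. a VLAN id).
```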
60
End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part
• How are the circuits used?
– End system to end system IP
• Almost never – very hard unless private address space is used
– Using public address space can result in leaking routes
– Using private address space with multi-homed hosts risks allowing backdoors into secure networks
– End system to end system Ethernet (or other) over VLAN – a pseudowire
• Relatively common
• Interesting example: RDMA over VLAN is likely to be popular in the future
– SC11 demo of 40G RDMA over WAN was very successful
– CPU load for RDMA is a small fraction of that of IP
– The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks)
– Point-to-point connection between routing instances – e.g. BGP at the end points
• Essentially this is how all current circuits are used: from one site router to another site router
– Typically site-to-site, or to advertise subnets that host clusters, e.g. LHC analysis or data management clusters
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering
62
Cross-Domain Virtual Circuit Service
Science collaborations typically span multiple network domains (administrative units)
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
– e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US Regionals, etc. are all different domains
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains
[Figure: inter-domain virtual circuit setup across the chain FNAL (AS3152) [US] – ESnet (AS293) [US] – GÉANT (AS20965) [Europe] – DFN (AS680) [Germany] – DESY (AS1754) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GÉANT – plus a data plane connection helper at each domain ingress/egress point. Topology exchange and VC setup requests pass from the user source, domain to domain, to the user destination, producing the end-to-end virtual circuit.]
1. The domains exchange topology information containing at least potential VC ingress and egress points
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved
3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process
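The sequence above can be summarized in a small sketch: a setup request walks the chain of domains, each domain's local controller authorizes and reserves its own segment, and only then is the data plane stitched together. This is a simplified illustration, not the actual IDC or NSI protocol:

```python
# Minimal sketch (not the real IDC/NSI protocol) of multi-domain circuit setup:
# the request is passed domain to domain; each domain reserves its segment, and
# the end-to-end circuit exists only when every segment has been committed.
domains = ["ESnet", "GEANT", "DFN", "DESY"]      # chain from source toward destination

def reserve_segment(domain, gbps):
    # Stand-in for each domain's local controller (e.g. OSCARS, AutoBAHN)
    # checking policy and link capacity for its own segment.
    print(f"{domain}: segment authorized and reserved for {gbps} Gb/s")
    return True

def setup_circuit(chain, gbps):
    for domain in chain:                          # VC setup request passed along the chain
        if not reserve_segment(domain, gbps):
            return False                          # any refusal aborts the end-to-end setup
    print("data plane stitched (e.g. VLAN-to-VLAN) at each domain boundary")
    return True

setup_circuit(domains, gbps=10)
```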
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system
– Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net
65
8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
– With each generation of network transport technology
• 155 Mb/s was the norm for high speed networks in 1995
• 100 Gb/s – 650 times greater – is the norm today
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
• and then do the development necessary for applications to make use of the new capabilities
– Examples of how this methodology drove toward today's capabilities include
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
66
Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical
Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone
The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations
67
The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning (an illustrative buffer-size check is sketched after this list)
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
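As a small example of the kind of host-tuning guidance collected there, the sketch below checks whether a Linux host's maximum TCP receive buffer covers the bandwidth-delay product (BDP) of a long, fast path; the target rate and RTT are illustrative, and fasterdata.es.net should be consulted for current tuning recommendations:

```python
# Rough check of whether a Linux host's maximum TCP receive buffer can sustain a
# target rate over a long-RTT path: the buffer must be at least the
# bandwidth-delay product (BDP). Numbers are illustrative only.

def bdp_bytes(gbps, rtt_ms):
    return gbps * 1e9 / 8 * (rtt_ms / 1000.0)

target_gbps, rtt_ms = 10, 90                 # e.g. a 10 Gb/s transatlantic path
need = bdp_bytes(target_gbps, rtt_ms)        # ~112 MB of buffer required

try:
    with open("/proc/sys/net/ipv4/tcp_rmem") as f:
        tcp_rmem_max = int(f.read().split()[2])     # values are: min, default, max
    verdict = "OK" if tcp_rmem_max >= need else "likely under-tuned for this path"
    print(f"BDP {need/1e6:.0f} MB, tcp_rmem max {tcp_rmem_max/1e6:.0f} MB -> {verdict}")
except FileNotFoundError:                    # non-Linux host: just report the BDP
    print(f"BDP for {target_gbps} Gb/s at {rtt_ms} ms RTT: {need/1e6:.0f} MB")
```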
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
69
Infrastructure Critical to Science
• The combination of
– new network architectures in the wide area
– new network services (such as guaranteed bandwidth virtual circuits)
– cross-domain network error detection and correction
– redesigning the site LAN to handle high data throughput
– automation of data movement systems
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKA
The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
– A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
• high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
militate against a single large data center
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply
There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
• It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
– In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (a rough quantification of this sensitivity is sketched below)
New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical: the Science DMZ
Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
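The sensitivity behind the "fragile workhorse" lesson can be roughly quantified with the well-known Mathis et al. approximation for loss-limited TCP throughput, throughput ≈ MSS / (RTT × √loss). The sketch below uses Reno-style numbers for a 1500-byte MTU; modern stacks (e.g. H-TCP, CUBIC) do better, but the qualitative sensitivity to loss on long-RTT paths is the same:

```python
# Mathis et al. approximation of loss-limited TCP throughput:
#   throughput ~ MSS / (RTT * sqrt(loss_rate))
# Illustrative Reno-style numbers for MSS ~1460 bytes (1500-byte MTU).
from math import sqrt

def mathis_mbps(mss_bytes, rtt_ms, loss_rate):
    return (mss_bytes / ((rtt_ms / 1000.0) * sqrt(loss_rate))) * 8 / 1e6

for rtt in (10, 90, 300):          # e.g. regional, transatlantic, trans-Pacific paths
    for loss in (1e-6, 1e-4):      # one packet lost per million vs. per ten thousand
        print(f"RTT {rtt:3d} ms, loss {loss:g}: "
              f"~{mathis_mbps(1460, rtt, loss):8.1f} Mb/s per flow")
```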
75
The Message
Again … A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet, Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements
[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C., Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
Foundations of data-intensive science: Technology and practice
Data-Intensive Science in DOE's Office of Science
DOE Office of Science and ESnet – the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning: Host tuning – TCP
System software tuning: Host tuning – TCP
System software tuning: Host tuning – TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN – Optical Private Network
The LHC OPN – Optical Private Network (2)
The LHC OPN – Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHC's Open Network Environment – LHCONE
Slide 54
The LHC's Open Network Environment – LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits – How They Use Them
End User View of Circuits – How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide R&D, consulting, and knowledge base
Provide R&D, consulting, and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)
bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism
bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose
53
The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure
designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each
involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc
ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure
that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC
54
ESnetUSA
Chicago
New YorkBNL-T1
Internet2USA
Harvard
CANARIECanada
UVic
SimFraU
TRIUMF-T1
UAlb UTorMcGilU
Seattle
TWARENTaiwan
NCU NTU
ASGCTaiwan
ASGC-T1
KERONET2Korea
KNU
LHCONE VRF domain
End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1
Regional RampE communication nexus
Data communication links 10 20 and 30 Gbs
See httplhconenet for details
NTU
Chicago
LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity
NORDUnetNordic
NDGF-T1aNDGF-T1a NDGF-T1c
DFNGermany
DESYGSI DE-KIT-T1
GARRItaly
INFN-Nap CNAF-T1
RedIRISSpain
PIC-T1
SARANetherlands
NIKHEF-T1
RENATERFrance
GRIF-IN2P3
Washington
CUDIMexico
UNAM
CC-IN2P3-T1Sub-IN2P3
CEA
CERNGeneva
CERN-T1
SLAC
GLakes
NE
MidWSoW
Geneva
KISTIKorea
TIFRIndia
India
Korea
FNAL-T1
MIT
CaltechUFlorida
UNebPurU
UCSDUWisc
UltraLightUMich
Amsterdam
GEacuteANT Europe
April 2012
55
The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because
ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for
use by the LHC collaboration that cannot be made available for general RampE traffic
bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance
bull See LHCONEnet
LHCONE is one part of the network infrastructure that supports the LHC
CERN rarrT1 miles kms
France 350 565
Italy 570 920
UK 625 1000
Netherlands 625 1000
Germany 700 1185
Spain 850 1400
Nordic 1300 2100
USA ndash New York 3900 6300
USA - Chicago 4400 7100
Canada ndash BC 5200 8400
Taiwan 6100 9850
CERN Computer Center
The LHC Optical Private Network
(LHCOPN)
LHC Tier 1Data Centers
LHC Tier 2 Analysis Centers
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups Universities
physicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
The LHC Open Network
Environment(LHCONE)
50 Gbs (25Gbs ATLAS 25Gbs CMS)
detector
Level 1 and 2 triggers
Level 3 trigger
O(1-10) meter
O(10-100) meters
O(1) km
1 PBs
500-10000 km
This is intended to indicate that the physics
groups now get their datawherever it is most readily
available
A Network Centric View of the LHC
Taiwan Canada USA-Atlas USA-CMS
Nordic
UK
Netherlands Germany Italy
Spain
FranceCERN
57
7) New network servicesPoint-to-Point Virtual Circuit Service
Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly
consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of
systemsrdquondash Break up the task of massive data analysis and use data compute and
storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements
A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work
well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm
58
Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual
circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism
bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path
ndash MPLS and OpenFlow are examples of this and both can transport IP packets
ndash Most modern Internet routers have this type of functionality
bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths
when they are set up
59
Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service
(For more information contact the project lead Chin Guok chinesnet)
bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references
bull OSCARS received a 2013 ldquoRampD 100rdquo award
60
End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo
ndash Sites for the most part
bull How are the circuits usedndash End system to end system IP
bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure
networks
ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future
ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP
protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
ndash Point-to-point connection between routing instance ndash eg BGP at the end points
bull Essentially this is how all current circuits are used from one site router to another site router
ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters
61
End User View of Circuits ndash How They Use Thembull When are the circuits used
ndash Mostly to solve a specific problem that the general infrastructure cannot
bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering
network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains
involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET
(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains
63
Inter-Domain Control Protocolbull There are two realms involved
1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains
2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains
FNAL (AS3152)[US]
ESnet (AS293)[US]
GEANT (AS20965)[Europe]
DFN (AS680)[Germany]
DESY (AS1754)[Germany]
Topology exchange
VC setup request
Local InterDomain
Controller
Local IDC
Local IDC
Local IDC
Local IDC
VC setup request
VC setup request
VC setup request
OSCARS
User source
User destination
VC setup request
data plane connection helper at each domain ingressegress point
data plane connection helper at each domain ingressegress point
1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to
domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process
The end-to-end virtual circuit
AutoBAHN
64
Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved
into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated
Facility - an international virtual organization that promotes the paradigm of lambda networking)
bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet
65
8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the
network to support data-intensive sciencendash With each generation of network transport technology
bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and
application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo
end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of
the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities
includebull experiments in the 1990s in using parallel disk IO and parallel network IO
together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths
bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs
66
Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects
are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build
a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at
httpfasterdataesnet and contains contributions from several organizations
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional
centers) has worked wellndash It decentralizes costs and involves many countries directly in the
telescope infrastructurendash It divides up the network load especially on the expensive trans-
ocean linksndash It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKARegardless of distributed vs centralized working data
repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source
to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a
centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue
bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC
ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
78
References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations
httpwwwperfsonarnet
httppspsperfsonarnet
[REQ] httpswwwesnetaboutscience-requirements
[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010
(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )
[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
5) Site infrastructure to support data-intensive scienceThe Science DMZ
With the wide area part of the network infrastructure addressed the typical sitecampus LAN becomes the
bottleneckThe site network (LAN) typically provides connectivity for local
resources ndash compute data instrument collaboration system etc ndash needed by data-intensive sciencendash Therefore a high performance interface between the wide area network
and the local area site network is critical for large-scale data movement
Campus network infrastructure is typically not designed to handle the flows of large-scale sciencendash The devices and configurations typically deployed to build LAN networks
for business and small data-flow purposes usually donrsquot work for large-scale data flows
bull firewalls proxy servers low-cost switches and so forth bull none of which will allow high volume high bandwidth long distance data
flows
39
The Science DMZTo provide high data-rate access to local resources the site
LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high
speed data path all the way back to the source
40
The Science DMZ
The Science DMZ concept:
The compute and data resources involved in data-intensive science should be deployed in a separate portion of the site network that has a different packet forwarding path, uses WAN-like technology, and has a tailored security policy:
• outside the site firewall – hence the term "Science DMZ";
• with dedicated systems built and tuned for wide-area data transfer;
• with test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below);
• with a security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. hardware that supports access control lists, private address space, etc.).
This is so important that it was a requirement for the last round of NSF CC-NIE grants.
41
The Science DMZ
(See http://fasterdata.es.net/science-dmz and [SDMZ] for a much more complete discussion of the various approaches.)
[Diagram: a typical Science DMZ architecture. The border router connects the WAN to a Science DMZ router/switch (a WAN-capable device) serving the Science DMZ: a high performance Data Transfer Node, a computing cluster, and network monitoring and testing systems – dedicated systems built and tuned for wide-area data transfer – with per-service security policy control points providing a clean, high-bandwidth WAN data path. The campus/site LAN, the site DMZ (web/DNS/mail), and secured campus/site access to the Internet sit behind the site firewall; campus/site access to Science DMZ resources is via the site firewall.]
42
6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites.
In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers.
• The Tier 2 sites get a comparable amount of data from the Tier 1s. The Tier 2 sites:
– host the physics groups that analyze the data and do the science;
– provide most of the compute resources for analysis;
– cache the data (though this is evolving to remote I/O).
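As a toy illustration of the kind of automation referred to above – not the actual ATLAS/LHC tooling – the sketch below retries failed transfers with backoff and verifies a checksum before declaring success. The `my-transfer-tool` command and the single-host checksum step are hypothetical stand-ins for a site's real WAN-tuned transfer tool and its remote integrity check.

```python
import hashlib
import subprocess
import time

def checksum(path):
    """SHA-256 of a local file, used here to verify the transferred copy."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def transfer_file(src, dest):
    """Hypothetical wrapper around a WAN-tuned bulk transfer tool."""
    subprocess.run(["my-transfer-tool", src, dest], check=True)

def transfer_with_recovery(src, dest, expected_sum, max_attempts=5):
    """Retry a transfer with exponential backoff; report failure to operators."""
    for attempt in range(1, max_attempts + 1):
        try:
            transfer_file(src, dest)
            if checksum(dest) == expected_sum:
                return True                    # data arrived intact
        except subprocess.CalledProcessError:
            pass                               # transfer tool reported an error
        time.sleep(min(60, 2 ** attempt))      # back off before the next attempt
    return False                               # escalate after repeated failures
```

At 500 TB/day across 170 sites no human can babysit individual transfers; loops like this, plus bookkeeping of what has and has not arrived, are what the production workflow systems provide.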
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management.
– The resources and data movement are centrally managed.
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations.
– The system manages tens of thousands of jobs a day:
• it coordinates data movement of hundreds of terabytes/day, and
• it manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science.
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial.
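A greatly simplified sketch of the brokering idea described above – this is not PanDA code, and the site and dataset names are invented – preferring a site that already holds the dataset and otherwise queueing a data movement request:

```python
from dataclasses import dataclass, field

@dataclass
class Site:
    name: str
    free_slots: int
    datasets: set = field(default_factory=set)

def broker(dataset, sites):
    """Toy version of 'move the job to the data, else move the data to the job'."""
    # 1) Prefer a site that already caches the dataset and has free capacity.
    with_data = [s for s in sites if dataset in s.datasets and s.free_slots > 0]
    if with_data:
        return max(with_data, key=lambda s: s.free_slots), []   # no transfer needed
    # 2) Otherwise pick the least-loaded site and schedule a transfer first.
    available = [s for s in sites if s.free_slots > 0]
    if not available:
        return None, []                                         # job stays queued
    site = max(available, key=lambda s: s.free_slots)
    return site, [(dataset, site.name)]                         # dataset must be moved

# Hypothetical example: two Tier 2 sites, one already holding the dataset.
sites = [Site("T2_A", free_slots=10, datasets={"data12_8TeV.0001"}),
         Site("T2_B", free_slots=50)]
print(broker("data12_8TeV.0001", sites))
```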
44
[Diagram: the ATLAS PanDA "Production and Distributed Analysis" system, which uses distributed resources and layers of automation to manage several million jobs/day.
Components: the PanDA Server (task management) with Task Buffer (job queue), Job Broker, Job Dispatcher, and Policy (job type, priority); the Data Service and Distributed Data Manager (DDM) agents; a Grid Scheduler and Site Capability Service (site status); and Pilot Jobs (PanDA job receivers running under the site-specific job manager).
Inputs: ATLAS production jobs, regional production jobs, and user/group analysis jobs.
Data sources and sinks: the CERN ATLAS detector and Tier 0 Data Center (1 copy of all data – archival only); the 11 ATLAS Tier 1 data centers scattered across Europe, North America and Asia, which in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis; and the ATLAS analysis sites (e.g. 70 Tier 2 centers in Europe, North America and SE Asia).
Job resource manager: dispatches a "pilot" job manager – a PanDA job receiver – when resources are available at a site; pilots run under the local site job manager (e.g. Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA, similar to the Condor Glide-in approach.
Workflow: 1) PanDA schedules jobs and initiates data movement; 2) the DDM (itself a complex system in its own right, called DQ2) locates data and moves it to sites; 3) the local resources are prepared to receive PanDA jobs; 4) jobs are dispatched when resources are available and the required data is in place at the site. The general strategy: try to move the job to where the data is, else move data and job to where resources are available.
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America and SE Asia generate network data movement of 730 TBytes/day (~68 Gb/s).
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately in the accompanying plots along with the accumulated data volume on disk).
It is this scale of data movement – going on 24 hr/day, 9+ months/yr – that networks must support in order to enable the large-scale science of the LHC.
[Plots: accumulated data volume on disk (0–150 petabytes over four years), and the number of type 1 and type 2 PanDA jobs (up to ~50,000 and ~100,000 respectively), each shown over one year.]
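As a sanity check on the stated rate (simple arithmetic, assuming the 730 TBytes/day figure):

$$ \frac{730 \times 10^{12}\ \mathrm{bytes/day} \times 8\ \mathrm{bits/byte}}{86{,}400\ \mathrm{s/day}} \approx 6.8 \times 10^{10}\ \mathrm{b/s} \approx 68\ \mathrm{Gb/s} $$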
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure.
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges".
– Successful testing was required for sites to participate in LHC production.
47
Ramp-up of LHC traffic in ESnet
[Graph: ESnet traffic over time, showing an estimate of the "small" scale traffic before LHC turn-on, the LHC data system testing period, and LHC operation. The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.]
48
6 cont) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called the LHCOPN.
– The LHCOPN is a collection of leased 10 Gb/s optical circuits.
– The role of the LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.
• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance.
– The security issues were the primary ones, and they were addressed by using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec])
– that is, only LHC data and compute servers are connected to the OPN.
50
The LHC OPN – Optical Private Network
[Diagram: LHCOPN physical topology (abbreviated) and LHCOPN architecture – CH-CERN connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network
N.B.:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic.
– (There are about 170 Tier 2 sites.)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to the subnets that are used by LHC systems.
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GEANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.).
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).
In this way the LHC traffic will use circuits designated by the network engineers
– to ensure continued good performance for the LHC, and to ensure that other traffic is not impacted. This is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.
54
[Map (April 2012): LHCONE – a global infrastructure for LHC Tier 1 data center to Tier 2 analysis center connectivity. LHCONE VRF domains shown include ESnet (USA), Internet2 (USA), CANARIE (Canada), GÉANT (Europe), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), CUDI (Mexico), TWAREN and ASGC (Taiwan), KERONET2 and KISTI (Korea), and India (TIFR), interconnected at regional R&E communication nexuses such as Seattle, Chicago, New York, Washington, Amsterdam, and Geneva. End sites are LHC Tier 2 or Tier 3 centers unless indicated as Tier 1 (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, CERN-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, NDGF-T1, CC-IN2P3-T1, ASGC-T1). Data communication links are 10, 20 and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because:
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See http://lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
CERN → Tier 1 distances:
  France           350 miles /  565 km
  Italy            570 miles /  920 km
  UK               625 miles / 1000 km
  Netherlands      625 miles / 1000 km
  Germany          700 miles / 1185 km
  Spain            850 miles / 1400 km
  Nordic          1300 miles / 2100 km
  USA – New York  3900 miles / 6300 km
  USA – Chicago   4400 miles / 7100 km
  Canada – BC     5200 miles / 8400 km
  Taiwan          6100 miles / 9850 km
[Diagram: "A Network Centric View of the LHC." The detector feeds the Level 1 and 2 triggers (O(1-10) meters) and then the Level 3 trigger (O(10-100) meters), which feeds the CERN Computer Center (O(1) km) at about 1 PB/s. From CERN, the LHC Optical Private Network (LHCOPN) carries 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) over 500-10,000 km to the LHC Tier 1 data centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN). The LHC Open Network Environment (LHCONE) then connects the Tier 1 centers to the LHC Tier 2 analysis centers – the universities and physics groups. This is intended to indicate that the physics groups now get their data wherever it is most readily available.]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to:
– couple existing pockets of code, data, and expertise into "systems of systems";
– break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites;
– see https://www.es.net/about/science-requirements
A commonly identified need to support this is that networking must be provided as a "service":
– schedulable, with guaranteed bandwidth – as is done with CPUs and disks;
– traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure;
– some network path characteristics may also be specified – e.g. diversity;
– available in the Web Services / Grid Services paradigm.
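To make the "network as a schedulable service" idea concrete, here is a hypothetical reservation request against an OSCARS/IDC-style interface; the endpoint, host names, and field names are invented for illustration and are not the actual OSCARS API.

```python
import json
from urllib import request

# Hypothetical guaranteed-bandwidth circuit reservation (illustrative fields only).
reservation = {
    "src": "site-a-dtn.example.org",
    "dst": "site-b-dtn.example.org",
    "bandwidth_mbps": 10000,            # guaranteed 10 Gb/s
    "start": "2014-04-01T00:00:00Z",    # schedulable, like CPU and disk allocations
    "end":   "2014-04-01T06:00:00Z",
    "constraints": {"path_diversity": True},
}

req = request.Request(
    "https://idc.example.net/reservations",   # invented endpoint
    data=json.dumps(reservation).encode(),
    headers={"Content-Type": "application/json"},
)
# response = request.urlopen(req)   # submission not executed in this sketch
```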
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
– This is typically done by using a "static" routing mechanism.
• E.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path.
– MPLS and OpenFlow are examples of this, and both can transport IP packets.
– Most modern Internet routers have this type of functionality.
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage and optimize the use of available network resources and to keep big data flows separate from general traffic.
– The virtual circuits can be directed to specific physical network paths when they are set up.
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.
60
End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part.
• How are the circuits used?
– End system to end system, IP:
• almost never – very hard unless private address space is used;
– using public address space can result in leaking routes;
– using private address space with multi-homed hosts risks allowing backdoors into secure networks.
– End system to end system, Ethernet (or other) over VLAN – a pseudowire:
• relatively common;
• interesting example: RDMA over VLAN, likely to be popular in the future;
– the SC11 demo of 40G RDMA over the WAN was very successful;
– the CPU load for RDMA is a small fraction of that of IP;
– the guaranteed network service of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks).
– Point-to-point connection between routing instances – e.g. BGP at the end points:
• essentially this is how all current circuits are used, from one site router to another site router;
– typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters.
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot.
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering.
Cross-Domain Virtual Circuit Service
Network domains (administrative units):
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration, to provide end-to-end circuits.
– E.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.
[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] across ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local inter-domain controller (IDC) – OSCARS in ESnet, AutoBAHN in GEANT – and a data plane connection helper at each domain ingress/egress point; topology exchange takes place between the domains, and the VC setup request is passed from local IDC to local IDC along the path.]
1. The domains exchange topology information containing at least the potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process.
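The three-step flow above can be modelled as a chain of per-domain controllers, each reserving its own segment and forwarding the request toward the destination. This is an illustrative toy, not an implementation of the IDC or NSI protocols; the domain names are taken from the figure only as labels.

```python
class DomainController:
    """Toy per-domain controller: reserve a local segment, then forward the request."""
    def __init__(self, name, next_domain=None):
        self.name = name
        self.next_domain = next_domain
        self.reservations = []

    def setup(self, circuit_id, bandwidth_mbps):
        # Authorize and reserve the segment that crosses this domain ...
        self.reservations.append((circuit_id, bandwidth_mbps))
        print(f"{self.name}: reserved {bandwidth_mbps} Mb/s for {circuit_id}")
        # ... then pass the setup request to the next domain along the path.
        if self.next_domain:
            self.next_domain.setup(circuit_id, bandwidth_mbps)

# Example chain following the figure: FNAL -> ESnet -> GEANT -> DFN -> DESY.
desy  = DomainController("DESY")
dfn   = DomainController("DFN", next_domain=desy)
geant = DomainController("GEANT", next_domain=dfn)
esnet = DomainController("ESnet", next_domain=geant)
fnal  = DomainController("FNAL", next_domain=esnet)
fnal.setup("vc-42", bandwidth_mbps=5000)
```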
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
– Testing is being coordinated in GLIF (the Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system.
– Multi-domain circuit setup is not yet a robust production service, but progress is being made.
• See http://lhcone.net
65
8) Provide R&D consulting and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science.
– With each generation of network transport technology:
• 155 Mb/s was the norm for high speed networks in 1995;
• 100 Gb/s – 650 times greater – is the norm today.
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to:
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
• and then do the development necessary for applications to make use of the new capabilities.
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths;
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s.
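To see why parallel disk and network I/O had to be developed together (an illustrative calculation, not from the slides): a sustained 100 Gb/s disk-to-disk transfer requires each end to source and sink

$$ 100\ \mathrm{Gb/s} \div 8\ \mathrm{bits/byte} = 12.5\ \mathrm{GB/s}, $$

which is far beyond a single disk or single filesystem stream, so the data must be striped across many disks and carried over many parallel network streams.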
66
Provide R&D consulting and knowledge base
• Providing consulting on the problems that data-intensive projects are having in effectively using the network is critical.
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base
http://fasterdata.es.net topics:
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations.
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
69
Infrastructure Critical to Science
• The combination of:
– new network architectures in the wide area,
– new network services (such as guaranteed bandwidth virtual circuits),
– cross-domain network error detection and correction,
– redesigning the site LAN to handle high data throughput,
– automation of data movement systems, and
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated/sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.
71
LHC lessons of possible use to the SKA
The lessons:
• The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
– A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center); this is done at CERN for the LHC.
– The technical aspects of building and operating a centralized working data repository –
• a large mass storage system with very large cache disks in order to satisfy current requests in an acceptable time, and
• high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites –
militate against a single large data center.
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
– it decentralizes costs and involves many countries directly in the telescope infrastructure;
– it divides up the network load, especially on the expensive trans-ocean links;
– it divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that in the case of the SKA the Tier 1 links would come to a centralized, distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each Tier 1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
74
LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
• New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
• Re-engineering the site LAN-WAN architecture is critical: the Science DMZ.
• Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
75
The Message
Again: a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and much of the knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References (2)
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References (3)
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf, and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
39
The Science DMZTo provide high data-rate access to local resources the site
LAN infrastructure must be re-designed to match the high-bandwidth large data volume high round trip time (RTT) (international paths) of the wide area network (WAN) flows (See [DIS])ndash otherwise the site will impose poor performance on the entire high
speed data path all the way back to the source
40
The Science DMZThe ScienceDMZ concept
The compute and data resources involved in data-intensive sciences should be deployed in a separate portion of the site network that has a different packet forwarding path that uses WAN-like technology and has a tailored security policy Outside the site firewall ndash hence the term ldquoScienceDMZrdquo With dedicated systems built and tuned for wide-area data transfer With test and measurement systems for performance verification and
rapid fault isolation typically perfSONAR (see [perfSONAR] and below)
A security policy tailored for science traffic and implemented using appropriately capable hardware (eg that support access control lists private address space etc)
This is so important it was a requirement for last round of NSF CC-NIE grants
41
The Science DMZ
(See httpfasterdataesnetscience-dmz
and [SDMZ] for a much more complete
discussion of the various approaches)
campus siteLAN
high performanceData Transfer Node
computing cluster
cleanhigh-bandwidthWAN data path
campussiteaccess to
Science DMZresources is via the site firewall
secured campussiteaccess to Internet
border routerWAN
Science DMZrouterswitch
campus site
Science DMZ
Site DMZ WebDNS
Mail
network monitoring and testing
A WAN-capable device
per-servicesecurity policycontrol points
site firewall
dedicated systems built and
tuned for wide-area data transfer
42
6) Data movement and management techniquesAutomated data movement is critical for moving 500
terabytesday between 170 international sites In order to effectively move large amounts of data over the
network automated systems must be used to manage workflow and error recovery
bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers
bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)
43
Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the
analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates
compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day
bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10
petabytes of datayear in order to accomplish its science
bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial
44
DDMAgent
DDMAgent
ATLAS production
jobs
Regional production
jobs
User Group analysis jobs
Data Service
Task Buffer(job queue)
Job Dispatcher
PanDA Server(task management)
Job Broker
Policy(job type priority)
ATLA
S Ti
er 1
Data
Cen
ters
11 s
ites
scat
tere
d ac
ross
Euro
pe N
orth
Am
erica
and
Asia
in
aggr
egat
e ho
ld 1
copy
of a
ll dat
a an
d pr
ovide
the
work
ing
data
set f
or d
istrib
ution
to T
ier 2
cen
ters
for a
nalys
isDistributed
Data Manager
Pilot Job(Panda job
receiver running under the site-
specific job manager)
Grid Scheduler
Site Capability Service
CERNATLAS detector
Tier 0 Data Center(1 copy of all data ndash
archival only)
Job resource managerbull Dispatch a ldquopilotrdquo job manager - a
Panda job receiver - when resources are available at a site
bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA
bull Similar to the Condor Glide-in approach
Site status
ATLAS analysis sites(eg 70 Tier 2 Centers in
Europe North America and SE Asia)
DDMAgent
DDMAgent
1) Schedules jobs initiates data movement
2) DDM locates data and moves it to sites
This is a complex system in its own right called DQ2
3) Prepares the local resources to receive Panda jobs
4) Jobs are dispatched when there are resources available and when the required data is
in place at the site
Thanks to Michael Ernst US ATLAS technical lead for his assistance with this
diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)
The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday
CERN
Try to move the job to where the data is else move data and job to where
resources are available
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe N America and SE Asia generate network
data movement of 730 TByday ~68Gbs
Accumulated data volume on disk
730 TBytesday
PanDA manages 120000ndash140000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movementgoing on 24 hrday 9+ monthsyr
that networks must support in order to enable the large-scale science of the LHC
0
50
100
150
Peta
byte
s
four years
0
50000
100000
type
2jo
bs
0
50000
type
1jo
bs
one year
one year
one year
46
Building an LHC-scale production analysis system In order to debug and optimize the distributed system that
accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in
ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC
production
47
Ramp-up of LHC traffic in ESnet
(est of ldquosmallrdquo scale traffic)
LHC
turn
-on
LHC data systemtesting
LHC operationThe transition from testing to operation
was a smooth continuum due toat-scale testing ndash a process that took
more than 5 years
48
6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument
to data centers ndash a dedicated purpose-built infrastructure is needed
bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to
the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the
Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward
exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community
bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by
bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
ndash that is only LHC data and compute servers are connected to the OPN
50
The LHC OPN ndash Optical Private Network
UK-T1_RAL
NDGF
FR-CCIN2P3
ES-PIC
DE-KIT
NL-T1
US-FNAL-CMS
US-T1-BNL
CA-TRIUMF
TW-ASCG
IT-NFN-CNAF
CH-CERNLHCOPN physical (abbreviated)
LHCOPN architecture
51
The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1
centers data transfer was to use dedicated physical 10G circuits
Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than
5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)
ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)
bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism
bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose
53
The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure
designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each
involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc
ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure
that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC
54
ESnetUSA
Chicago
New YorkBNL-T1
Internet2USA
Harvard
CANARIECanada
UVic
SimFraU
TRIUMF-T1
UAlb UTorMcGilU
Seattle
TWARENTaiwan
NCU NTU
ASGCTaiwan
ASGC-T1
KERONET2Korea
KNU
LHCONE VRF domain
End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1
Regional RampE communication nexus
Data communication links 10 20 and 30 Gbs
See httplhconenet for details
NTU
Chicago
LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity
NORDUnetNordic
NDGF-T1aNDGF-T1a NDGF-T1c
DFNGermany
DESYGSI DE-KIT-T1
GARRItaly
INFN-Nap CNAF-T1
RedIRISSpain
PIC-T1
SARANetherlands
NIKHEF-T1
RENATERFrance
GRIF-IN2P3
Washington
CUDIMexico
UNAM
CC-IN2P3-T1Sub-IN2P3
CEA
CERNGeneva
CERN-T1
SLAC
GLakes
NE
MidWSoW
Geneva
KISTIKorea
TIFRIndia
India
Korea
FNAL-T1
MIT
CaltechUFlorida
UNebPurU
UCSDUWisc
UltraLightUMich
Amsterdam
GEacuteANT Europe
April 2012
55
The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because
ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for
use by the LHC collaboration that cannot be made available for general RampE traffic
bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance
bull See LHCONEnet
LHCONE is one part of the network infrastructure that supports the LHC
CERN rarrT1 miles kms
France 350 565
Italy 570 920
UK 625 1000
Netherlands 625 1000
Germany 700 1185
Spain 850 1400
Nordic 1300 2100
USA ndash New York 3900 6300
USA - Chicago 4400 7100
Canada ndash BC 5200 8400
Taiwan 6100 9850
CERN Computer Center
The LHC Optical Private Network
(LHCOPN)
LHC Tier 1Data Centers
LHC Tier 2 Analysis Centers
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups Universities
physicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
The LHC Open Network
Environment(LHCONE)
50 Gbs (25Gbs ATLAS 25Gbs CMS)
detector
Level 1 and 2 triggers
Level 3 trigger
O(1-10) meter
O(10-100) meters
O(1) km
1 PBs
500-10000 km
This is intended to indicate that the physics
groups now get their datawherever it is most readily
available
A Network Centric View of the LHC
Taiwan Canada USA-Atlas USA-CMS
Nordic
UK
Netherlands Germany Italy
Spain
FranceCERN
57
7) New network servicesPoint-to-Point Virtual Circuit Service
Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly
consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of
systemsrdquondash Break up the task of massive data analysis and use data compute and
storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements
A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work
well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm
58
Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual
circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism
bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path
ndash MPLS and OpenFlow are examples of this and both can transport IP packets
ndash Most modern Internet routers have this type of functionality
bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths
when they are set up
59
Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service
(For more information contact the project lead Chin Guok chinesnet)
bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references
bull OSCARS received a 2013 ldquoRampD 100rdquo award
60
End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo
ndash Sites for the most part
bull How are the circuits usedndash End system to end system IP
bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure
networks
ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future
ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP
protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
ndash Point-to-point connection between routing instance ndash eg BGP at the end points
bull Essentially this is how all current circuits are used from one site router to another site router
ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters
61
End User View of Circuits ndash How They Use Thembull When are the circuits used
ndash Mostly to solve a specific problem that the general infrastructure cannot
bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering
network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains
involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET
(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains
63
Inter-Domain Control Protocolbull There are two realms involved
1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains
2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains
FNAL (AS3152)[US]
ESnet (AS293)[US]
GEANT (AS20965)[Europe]
DFN (AS680)[Germany]
DESY (AS1754)[Germany]
Topology exchange
VC setup request
Local InterDomain
Controller
Local IDC
Local IDC
Local IDC
Local IDC
VC setup request
VC setup request
VC setup request
OSCARS
User source
User destination
VC setup request
data plane connection helper at each domain ingressegress point
data plane connection helper at each domain ingressegress point
1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to
domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process
The end-to-end virtual circuit
AutoBAHN
64
Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved
into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated
Facility - an international virtual organization that promotes the paradigm of lambda networking)
bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet
65
8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the
network to support data-intensive sciencendash With each generation of network transport technology
bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and
application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo
end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of
the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities
includebull experiments in the 1990s in using parallel disk IO and parallel network IO
together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths
bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs
66
Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects
are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build
a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at
httpfasterdataesnet and contains contributions from several organizations
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional
centers) has worked wellndash It decentralizes costs and involves many countries directly in the
telescope infrastructurendash It divides up the network load especially on the expensive trans-
ocean linksndash It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKA

Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
  – In the case of the SKA, the Tier 1 links might come to a centralized, distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each Tier 1. This choice is a cost and engineering issue.
  – In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
• If there are many science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
  – In fact, it might well be that the SKA could use the LHCONE infrastructure itself – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
74
LHC lessons of possible use to the SKA

• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation among the R&E networks providing parts of the path, etc. New transport protocols (many are available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
• Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.
• Workflow management systems that automate the data movement will have to be designed and tested.
  – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
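To make the "fragile workhorse" point concrete, the widely used Mathis et al. approximation for loss-limited TCP throughput (rate ≈ MSS / (RTT × √loss)) can be evaluated for a long-RTT path. The short Python sketch below is purely illustrative; the path parameters are assumptions chosen to resemble a trans-oceanic link, not numbers from the talk.

    # Illustrative only: loss-limited TCP throughput per the Mathis et al.
    # approximation, rate ~ MSS / (RTT * sqrt(loss)).
    import math

    def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
        """Approximate steady-state TCP throughput in bits/second."""
        return (mss_bytes * 8) / (rtt_s * math.sqrt(loss_rate))

    mss = 1460      # bytes, typical Ethernet MSS
    rtt = 0.120     # seconds, e.g. a US-Europe path (assumed)
    for loss in (1e-7, 1e-5, 1e-3):
        gbps = mathis_throughput_bps(mss, rtt, loss) / 1e9
        print(f"loss {loss:g}: ~{gbps:.3f} Gb/s per flow")

Even a loss rate of one packet in ten million limits a single flow to a few hundred Mb/s on a 120 ms path, which is exactly why the slide insists on error-free paths and constant monitoring.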
75
The Message

Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and much of the knowledge gained from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
76
References

[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/

[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/

[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more

[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History

[LHCONE] http://lhcone.net

[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."

[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References (2)

[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1–5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/

"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/

"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References (3)

[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
Also see http://www.perfsonar.net/ and http://psps.perfsonar.net/

[REQ] https://www.es.net/about/science-requirements/

[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)

[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
40
The Science DMZ

The Science DMZ concept: the compute and data resources involved in data-intensive science should be deployed in a separate portion of the site network that
  – has a different packet forwarding path that uses WAN-like technology and has a tailored security policy,
  – sits outside the site firewall – hence the term "Science DMZ,"
  – has dedicated systems built and tuned for wide-area data transfer,
  – has test and measurement systems for performance verification and rapid fault isolation, typically perfSONAR (see [perfSONAR] and below), and
  – has a security policy tailored for science traffic and implemented using appropriately capable hardware (e.g. hardware that supports access control lists, private address space, etc.).

This is so important that it was a requirement for the last round of NSF CC-NIE grants.
41
The Science DMZ

(See http://fasterdata.es.net/science-dmz/ and [SDMZ] for a much more complete discussion of the various approaches.)

[Diagram: Science DMZ architecture. A WAN-capable border router connects to a Science DMZ router/switch that provides a clean, high-bandwidth WAN data path to a high-performance Data Transfer Node and to network monitoring and testing systems, with per-service security policy control points. The dedicated systems are built and tuned for wide-area data transfer. The campus/site LAN, the computing cluster, and the site DMZ (web, DNS, mail) sit behind the site firewall; campus/site access to Science DMZ resources is via the site firewall, and secured campus/site access to the Internet is separate from the Science DMZ path.]
42
6) Data movement and management techniques

Automated data movement is critical for moving 500 terabytes/day between 170 international sites. In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers.
• The Tier 2 sites get a comparable amount of data from the Tier 1s.
  – They host the physics groups that analyze the data and do the science.
  – They provide most of the compute resources for analysis.
  – They cache the data (though this is evolving to remote I/O).
43
Highly distributed and highly automated workflow systems

• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management.
  – The resources and data movement are centrally managed.
  – Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations.
  – The system manages tens of thousands of jobs a day:
    • it coordinates data movement of hundreds of terabytes/day, and
    • it manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science.
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions, spread across three continents, involved in the LHC experiments is substantial.
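The core brokering idea described above – try to move the job to where the data is, else move the data and the job to where resources are available – can be sketched in a few lines. This is a deliberately minimal, hypothetical illustration in Python; the Site/Job/schedule names are invented for this sketch and it is not PanDA code.

    # Minimal, hypothetical sketch of the brokering policy described on this
    # slide: prefer running a job where its input dataset already resides;
    # otherwise pick a site with free CPU slots and queue a data transfer.
    from dataclasses import dataclass, field

    @dataclass
    class Site:
        name: str
        free_slots: int
        datasets: set = field(default_factory=set)

    @dataclass
    class Job:
        job_id: int
        dataset: str

    def schedule(job, sites):
        # 1) Try to move the job to where the data is.
        for site in sites:
            if job.dataset in site.datasets and site.free_slots > 0:
                site.free_slots -= 1
                return site.name, None
        # 2) Otherwise move data and job to where resources are available.
        for site in sites:
            if site.free_slots > 0:
                site.free_slots -= 1
                site.datasets.add(job.dataset)   # transfer handled by the data manager
                return site.name, f"transfer {job.dataset} -> {site.name}"
        return None, None                        # job waits in the queue

    sites = [Site("T2-A", 0, {"dset01"}), Site("T2-B", 2, {"dset02"})]
    print(schedule(Job(1, "dset01"), sites))     # ('T2-B', 'transfer dset01 -> T2-B')

The real system adds layers this sketch omits: priorities, pilot jobs, site status, error recovery, and a full distributed data manager, as the diagram on the next slide shows.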
44
[Diagram: The ATLAS PanDA "Production and Distributed Analysis" system, which uses distributed resources and layers of automation to manage several million jobs/day. Main elements: the PanDA Server (task management) with its Task Buffer (job queue), Job Dispatcher, Job Broker, and policy (job type, priority); a Data Service and Distributed Data Manager with DDM agents (a complex system in its own right, called DQ2); a Grid Scheduler and Site Capability Service; and pilot jobs (PanDA job receivers running under the site-specific job manager). Inputs are ATLAS production jobs, regional production jobs, and user/group analysis jobs. The CERN ATLAS detector feeds the Tier 0 data center (one copy of all data, archival only). The ATLAS Tier 1 data centers – 11 sites scattered across Europe, North America, and Asia – in aggregate hold one copy of all data and provide the working data set for distribution to Tier 2 centers for analysis. The ATLAS analysis sites are, e.g., 70 Tier 2 centers in Europe, North America, and SE Asia.

Workflow: 1) PanDA schedules jobs and initiates data movement. 2) The DDM locates data and moves it to sites. 3) The job resource manager prepares the local resources to receive PanDA jobs: a "pilot" job manager (a PanDA job receiver) is dispatched when resources are available at a site; pilots run under the local site job manager (e.g. Condor, LSF, LCG, ...) and accept jobs in a standard format from PanDA, similar to the Condor glide-in approach. 4) Jobs are dispatched when there are resources available and when the required data is in place at the site. PanDA tries to move the job to where the data is, else it moves data and job to where resources are available.

Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]
45
Scale of ATLAS analysis driven data movement

The PanDA jobs executing at centers all over Europe, North America, and SE Asia generate network data movement of 730 TBytes/day, roughly 68 Gb/s. PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, which are shown separately in the charts). It is this scale of data movement, going on 24 hours/day, 9+ months/year, that networks must support in order to enable the large-scale science of the LHC.

[Charts: accumulated data volume on disk over four years (scale 0–150 petabytes), growing at about 730 TBytes/day; and the counts of the two PanDA job types over one-year windows (scales up to roughly 100,000 and 50,000 jobs).]
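As a quick back-of-the-envelope check of the numbers quoted above (a simple computation, not part of the original slide), 730 TBytes/day does indeed correspond to roughly 68 Gb/s sustained:

    # Sanity check: 730 TBytes/day expressed as an average rate in Gb/s
    # (decimal terabytes, 86,400 seconds per day).
    tbytes_per_day = 730
    bits_per_day = tbytes_per_day * 1e12 * 8
    gbps = bits_per_day / 86400 / 1e9
    print(f"{tbytes_per_day} TB/day ~= {gbps:.1f} Gb/s sustained")   # ~67.6 Gb/s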
46
Building an LHC-scale production analysis system

In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure.
  – Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges."
  – Successful testing was required for sites to participate in LHC production.
47
Ramp-up of LHC traffic in ESnet

[Chart: ESnet traffic volume over time, annotated with an estimate of the "small"-scale traffic, the LHC data system testing period, the LHC turn-on, and LHC operation.]

The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.
48
6 cont.) Evolution of network architectures

For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN.
  – The LHCOPN is a collection of leased 10 Gb/s optical circuits.
  – The role of the LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously.
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN.
49
The LHC OPN – Optical Private Network

• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community.
• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance.
  – The security issues were the primary ones, and they were addressed by using a private address space that hosts only LHC Tier 1 systems (see [LHCOPN Sec]) – that is, only LHC data and compute servers are connected to the OPN.
50
The LHC OPN – Optical Private Network

[Diagram: abbreviated LHCOPN physical topology and architecture – CH-CERN at the hub, connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network

N.B.:
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits.
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays.
  – The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact, this is what ESnet's OSCARS virtual circuit system was originally designed for (see below).
  – However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role.
Managing large-scale science traffic in a shared infrastructure

The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic.
  – In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic.
  – There are about 170 Tier 2 sites.
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 × 170, roughly 29,000 site pairs) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism.
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose.
53
The LHC's Open Network Environment – LHCONE

LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).

The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems.
  – The clouds are mostly local to a network domain, e.g. one for each involved domain – ESnet, GÉANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.
  – The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved).

In this way the LHC traffic will use circuits designated by the network engineers, to ensure continued good performance for the LHC and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC.
54
[Map (April 2012): LHCONE – a global infrastructure for LHC Tier 1 data center to Tier 2 analysis center connectivity. LHCONE VRF domains shown include ESnet (USA), Internet2 (USA), CANARIE (Canada), GÉANT (Europe), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), CUDI (Mexico), TWAREN and ASGC (Taiwan), KERONET2 and KISTI (Korea), and connectivity to TIFR (India). End sites are LHC Tier 2 or Tier 3 centers unless indicated as Tier 1 (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, ASGC-T1, NDGF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, CC-IN2P3-T1, CERN-T1). Regional R&E communication nexus points include Seattle, Chicago, New York, Washington, Amsterdam, and Geneva. Data communication links are 10, 20, and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE

• LHCONE could be set up relatively "quickly" because
  – the VRF technology is a standard capability in most core routers, and
  – there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See lhcone.net.
LHCONE is one part of the network infrastructure that supports the LHC

[Diagram: A Network Centric View of the LHC. The detector's 1 PB/s output passes through the Level 1 and 2 triggers (O(1-10) meters), the Level 3 trigger (O(10-100) meters), and the CERN computer center (O(1) km) at 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS). The LHC Optical Private Network (LHCOPN) then carries the data 500-10,000 km to the LHC Tier 1 data centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN), and the LHC Open Network Environment (LHCONE) connects the Tier 1 centers to the LHC Tier 2 analysis centers, the university physics groups, which now get their data wherever it is most readily available.]

CERN to Tier 1 distances:

  Tier 1              miles     km
  France                350    565
  Italy                 570    920
  UK                    625   1000
  Netherlands           625   1000
  Germany               700   1185
  Spain                 850   1400
  Nordic               1300   2100
  USA – New York       3900   6300
  USA – Chicago        4400   7100
  Canada – BC          5200   8400
  Taiwan               6100   9850
57
7) New network services: Point-to-Point Virtual Circuit Service

Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed application systems in order to
  – couple existing pockets of code, data, and expertise into "systems of systems," and
  – break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites.
  – See https://www.es.net/about/science-requirements/
• A commonly identified need to support this is that networking must be provided as a "service":
  – schedulable, with guaranteed bandwidth – as is done with CPUs and disks;
  – traffic isolation that allows the use of non-standard protocols that will not work well in a shared infrastructure;
  – some network path characteristics may also be specified – e.g. diversity;
  – available in the Web Services / Grid Services paradigm.
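To make "networking as a schedulable service" concrete, the sketch below shows the kind of parameters such a request has to carry (endpoints, a time window, a bandwidth guarantee, optional path constraints). The request structure and names are hypothetical, invented for illustration – this is not the OSCARS or NSI interface.

    # Hypothetical example of a guaranteed-bandwidth circuit request, showing
    # the kind of parameters such a service must accept.  Illustrative only.
    from dataclasses import dataclass
    from datetime import datetime, timedelta
    from typing import Optional

    @dataclass
    class CircuitRequest:
        src_endpoint: str          # e.g. a site router port or DTN-facing VLAN
        dst_endpoint: str
        bandwidth_gbps: float      # guaranteed bandwidth
        start: datetime            # reservation window, like a batch job
        end: datetime
        diverse_from: Optional[str] = None   # optional path-diversity constraint

    req = CircuitRequest(
        src_endpoint="site-A:vlan-3001",
        dst_endpoint="site-B:vlan-3001",
        bandwidth_gbps=10.0,
        start=datetime(2014, 4, 1, 2, 0),
        end=datetime(2014, 4, 1, 2, 0) + timedelta(hours=6),
    )
    print(req)

The point of the schedulable window is that the circuit can be reserved alongside CPU and storage, so computing, data access, and data movement can be planned together.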
58
Point-to-Point Virtual Circuit Service

The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
  – This is typically done by using a "static" routing mechanism, e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path.
  – MPLS and OpenFlow are examples of this, and both can transport IP packets.
  – Most modern Internet routers have this type of functionality.
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage and optimize the use of available network resources and to keep big data flows separate from general traffic.
  – The virtual circuits can be directed to specific physical network paths when they are set up.
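A toy illustration of the "static switch tables set up in advance" idea: each node along the circuit maps an incoming (port, label) pair to an outgoing (port, label) pair, so the path is fixed by the pre-installed tables rather than by per-packet IP routing. The table format below is invented for this sketch; real MPLS/OpenFlow tables live in the network elements.

    # Toy label-switched forwarding: (in_port, in_label) -> (out_port, out_label).
    # The encapsulated payload (e.g. an IP packet) is carried unchanged.
    FORWARDING_TABLE = {
        ("eth0", 100): ("eth3", 210),   # circuit in the forward direction
        ("eth3", 310): ("eth0", 101),   # and the reverse direction
    }

    def forward(in_port, in_label, payload):
        out_port, out_label = FORWARDING_TABLE[(in_port, in_label)]
        return out_port, out_label, payload

    print(forward("eth0", 100, b"ip-packet-bytes"))   # ('eth3', 210, b'ip-packet-bytes')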
59
Point-to-Point Virtual Circuit Service

• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.
60
End User View of Circuits – How They Use Them

• Who are the "users"?
  – Sites, for the most part.
• How are the circuits used?
  – End system to end system, IP:
    • Almost never – very hard unless private address space is used.
    • Using public address space can result in leaking routes.
    • Using private address space with multi-homed hosts risks allowing backdoors into secure networks.
  – End system to end system, Ethernet (or other) over VLAN – a pseudowire:
    • Relatively common.
    • Interesting example: RDMA over VLAN is likely to be popular in the future.
      – The SC11 demo of 40G RDMA over the WAN was very successful.
      – CPU load for RDMA is a small fraction of that for IP.
      – The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks).
  – Point-to-point connection between routing instances – e.g. BGP at the end points:
    • Essentially this is how all current circuits are used, from one site router to another site router.
    • Typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters.
61
End User View of Circuits – How They Use Them

• When are the circuits used?
  – Mostly to solve a specific problem that the general infrastructure cannot.
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering.

Cross-Domain Virtual Circuit Service

Large-scale collaborations cross many network domains (administrative units).
  – For a circuit service to be useful, it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits.
  – E.g. ESnet and Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (Australia), the European NRENs, the US regionals, etc. are all different domains.
63
Inter-Domain Control Protocol

• There are two realms involved:
  1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
  2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.

[Diagram: an end-to-end virtual circuit from a user source at FNAL (AS3152) [US] to a user destination at DESY (AS1754) [Germany], crossing ESnet (AS293) [US], GÉANT (AS20965) [Europe], and DFN (AS680) [Germany]. Each domain runs a local Inter-Domain Controller (OSCARS in ESnet, AutoBAHN in GÉANT); the domains exchange topology information, the VC setup request is passed from controller to controller, and data plane connection helpers operate at each domain ingress/egress point.]

1. The domains exchange topology information containing at least the potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. an Ethernet VLAN-to-VLAN connection) is facilitated by a helper process at each domain boundary.
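The chained setup in steps 1–3 can be sketched schematically: the request traverses the domain controllers in path order, each reserves its own segment, and the end-to-end circuit exists only if every domain can commit the capacity. This is hypothetical illustration code, not the IDC or NSI protocol.

    # Schematic sketch of multi-domain circuit setup with rollback on failure.
    class DomainController:
        def __init__(self, name, available_gbps):
            self.name = name
            self.available_gbps = available_gbps

        def reserve(self, gbps):
            if gbps <= self.available_gbps:
                self.available_gbps -= gbps
                return True
            return False

    def setup_end_to_end(path, gbps):
        reserved = []
        for dc in path:                      # request passed domain to domain
            if dc.reserve(gbps):
                reserved.append(dc)
            else:                            # release earlier segments on failure
                for prev in reserved:
                    prev.available_gbps += gbps
                return False
        return True                          # all segments authorized and reserved

    path = [DomainController(n, 40) for n in ("FNAL", "ESnet", "GEANT", "DFN", "DESY")]
    print(setup_end_to_end(path, 10))        # True: every domain could commit 10 Gb/s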
64
Point-to-Point Virtual Circuit Service

• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
  – Testing is being coordinated in GLIF (the Global Lambda Integrated Facility, an international virtual organization that promotes the paradigm of lambda networking).
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grid framework, so that computing, data access, and data movement can all work together as a predictable system.
• Multi-domain circuit setup is not yet a robust production service, but progress is being made.
• See lhcone.net.
65
8) Provide R&D, consulting, and a knowledge base

• R&D drove most of the advances that make it possible for the network to support data-intensive science.
  – With each generation of network transport technology:
    • 155 Mb/s was the norm for high-speed networks in 1995;
    • 100 Gb/s – some 650 times greater – is the norm today.
  – R&D groups involving hardware engineers, computer scientists, and application specialists worked to
    • first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
    • and then do the development necessary for applications to make use of the new capabilities.
  – Examples of how this methodology drove toward today's capabilities include:
    • experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC-12 (622 Mb/s) wide area network paths, and
    • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s.
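The parallel disk I/O plus parallel network I/O idea mentioned above can be sketched as striping one large file across several concurrent TCP streams. This is an illustrative sketch only (host, port, and file names are invented); production tools such as GridFTP/Globus implement the same idea far more carefully.

    # Illustrative: stripe a large file across several parallel TCP streams.
    import socket
    from concurrent.futures import ThreadPoolExecutor

    def send_stripe(host, port, path, offset, length, chunk=1 << 20):
        """Send bytes [offset, offset+length) of the file over one TCP stream."""
        with socket.create_connection((host, port)) as s, open(path, "rb") as f:
            f.seek(offset)
            remaining = length
            while remaining > 0:
                data = f.read(min(chunk, remaining))
                if not data:
                    break
                s.sendall(data)
                remaining -= len(data)

    def parallel_send(host, base_port, path, total_size, streams=4):
        stripe = total_size // streams
        with ThreadPoolExecutor(max_workers=streams) as pool:
            for i in range(streams):
                length = stripe if i < streams - 1 else total_size - stripe * i
                pool.submit(send_stripe, host, base_port + i, path, i * stripe, length)

    # Example (assumes a receiver listening on ports 5000-5003):
    # parallel_send("dtn.example.org", 5000, "dataset.bin", total_size=8 << 30)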
66
Provide R&D, consulting, and a knowledge base

• Providing consulting on the problems that data-intensive projects are having in effectively using the network is critical.
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base: http://fasterdata.es.net topics

  – Network architecture, including the Science DMZ model
  – Host tuning
  – Network tuning
  – Data transfer tools
  – Network performance testing
  – With special sections on:
    • Linux TCP tuning
    • Cisco 6509 tuning
    • perfSONAR how-to
    • Active perfSONAR services
    • Globus overview
    • Say No to SCP
    • Data Transfer Nodes (DTN)
    • TCP issues explained

• fasterdata.es.net is a community project with contributions from several organizations.
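One of the recurring host-tuning items behind entries like "Linux TCP tuning" is sizing TCP socket buffers to the bandwidth-delay product (BDP) of the path. A small worked example follows; the 1 Gb/s / 50 ms and 10 Gb/s / 100 ms figures are example numbers chosen here, not values quoted from fasterdata.

    # Worked example: the bandwidth-delay product sets the minimum TCP buffer
    # needed to keep a long, fast path full.  Example numbers only.
    def bdp_bytes(bandwidth_gbps, rtt_ms):
        return bandwidth_gbps * 1e9 / 8 * (rtt_ms / 1e3)

    for gbps, rtt in ((1, 50), (10, 100)):
        mib = bdp_bytes(gbps, rtt) / 2**20
        print(f"{gbps} Gb/s x {rtt} ms -> buffer >= {mib:.0f} MiB")

    # 10 Gb/s over 100 ms needs roughly 120 MiB of socket buffer, far above
    # typical operating system defaults - which is why host tuning matters.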
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional
centers) has worked wellndash It decentralizes costs and involves many countries directly in the
telescope infrastructurendash It divides up the network load especially on the expensive trans-
ocean linksndash It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKARegardless of distributed vs centralized working data
repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source
to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a
centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue
bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC
ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
78
References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations
httpwwwperfsonarnet
httppspsperfsonarnet
[REQ] httpswwwesnetaboutscience-requirements
[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010
(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )
[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
41
The Science DMZ
(See httpfasterdataesnetscience-dmz
and [SDMZ] for a much more complete
discussion of the various approaches)
campus siteLAN
high performanceData Transfer Node
computing cluster
cleanhigh-bandwidthWAN data path
campussiteaccess to
Science DMZresources is via the site firewall
secured campussiteaccess to Internet
border routerWAN
Science DMZrouterswitch
campus site
Science DMZ
Site DMZ WebDNS
Mail
network monitoring and testing
A WAN-capable device
per-servicesecurity policycontrol points
site firewall
dedicated systems built and
tuned for wide-area data transfer
42
6) Data movement and management techniquesAutomated data movement is critical for moving 500
terabytesday between 170 international sites In order to effectively move large amounts of data over the
network automated systems must be used to manage workflow and error recovery
bull The filtered ATLAS data rate of about 25 Gbs is sent to 10 national Tier 1 data centers
bull The Tier 2 sites get a comparable amount of data from the Tier 1s ndash Host the physics groups that analyze the data and do the sciencendash Provide most of the compute resources for analysisndash Cache the data (though this is evolving to remote IO)
43
Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the
analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates
compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day
bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10
petabytes of datayear in order to accomplish its science
bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial
44
DDMAgent
DDMAgent
ATLAS production
jobs
Regional production
jobs
User Group analysis jobs
Data Service
Task Buffer(job queue)
Job Dispatcher
PanDA Server(task management)
Job Broker
Policy(job type priority)
ATLA
S Ti
er 1
Data
Cen
ters
11 s
ites
scat
tere
d ac
ross
Euro
pe N
orth
Am
erica
and
Asia
in
aggr
egat
e ho
ld 1
copy
of a
ll dat
a an
d pr
ovide
the
work
ing
data
set f
or d
istrib
ution
to T
ier 2
cen
ters
for a
nalys
isDistributed
Data Manager
Pilot Job(Panda job
receiver running under the site-
specific job manager)
Grid Scheduler
Site Capability Service
CERNATLAS detector
Tier 0 Data Center(1 copy of all data ndash
archival only)
Job resource managerbull Dispatch a ldquopilotrdquo job manager - a
Panda job receiver - when resources are available at a site
bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA
bull Similar to the Condor Glide-in approach
Site status
ATLAS analysis sites(eg 70 Tier 2 Centers in
Europe North America and SE Asia)
DDMAgent
DDMAgent
1) Schedules jobs initiates data movement
2) DDM locates data and moves it to sites
This is a complex system in its own right called DQ2
3) Prepares the local resources to receive Panda jobs
4) Jobs are dispatched when there are resources available and when the required data is
in place at the site
Thanks to Michael Ernst US ATLAS technical lead for his assistance with this
diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)
The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday
CERN
Try to move the job to where the data is else move data and job to where
resources are available
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe N America and SE Asia generate network
data movement of 730 TByday ~68Gbs
Accumulated data volume on disk
730 TBytesday
PanDA manages 120000ndash140000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movementgoing on 24 hrday 9+ monthsyr
that networks must support in order to enable the large-scale science of the LHC
0
50
100
150
Peta
byte
s
four years
0
50000
100000
type
2jo
bs
0
50000
type
1jo
bs
one year
one year
one year
46
Building an LHC-scale production analysis system In order to debug and optimize the distributed system that
accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in
ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC
production
47
Ramp-up of LHC traffic in ESnet
(est of ldquosmallrdquo scale traffic)
LHC
turn
-on
LHC data systemtesting
LHC operationThe transition from testing to operation
was a smooth continuum due toat-scale testing ndash a process that took
more than 5 years
48
6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument
to data centers ndash a dedicated purpose-built infrastructure is needed
bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to
the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the
Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward
exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community
bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by
bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
ndash that is only LHC data and compute servers are connected to the OPN
50
The LHC OPN ndash Optical Private Network
UK-T1_RAL
NDGF
FR-CCIN2P3
ES-PIC
DE-KIT
NL-T1
US-FNAL-CMS
US-T1-BNL
CA-TRIUMF
TW-ASCG
IT-NFN-CNAF
CH-CERNLHCOPN physical (abbreviated)
LHCOPN architecture
51
The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1
centers data transfer was to use dedicated physical 10G circuits
Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than
5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)
ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)
bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism
bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose
53
The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure
designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each
involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc
ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure
that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC
54
ESnetUSA
Chicago
New YorkBNL-T1
Internet2USA
Harvard
CANARIECanada
UVic
SimFraU
TRIUMF-T1
UAlb UTorMcGilU
Seattle
TWARENTaiwan
NCU NTU
ASGCTaiwan
ASGC-T1
KERONET2Korea
KNU
LHCONE VRF domain
End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1
Regional RampE communication nexus
Data communication links 10 20 and 30 Gbs
See httplhconenet for details
NTU
Chicago
LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity
NORDUnetNordic
NDGF-T1aNDGF-T1a NDGF-T1c
DFNGermany
DESYGSI DE-KIT-T1
GARRItaly
INFN-Nap CNAF-T1
RedIRISSpain
PIC-T1
SARANetherlands
NIKHEF-T1
RENATERFrance
GRIF-IN2P3
Washington
CUDIMexico
UNAM
CC-IN2P3-T1Sub-IN2P3
CEA
CERNGeneva
CERN-T1
SLAC
GLakes
NE
MidWSoW
Geneva
KISTIKorea
TIFRIndia
India
Korea
FNAL-T1
MIT
CaltechUFlorida
UNebPurU
UCSDUWisc
UltraLightUMich
Amsterdam
GEacuteANT Europe
April 2012
55
The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because
ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for
use by the LHC collaboration that cannot be made available for general RampE traffic
bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance
bull See LHCONEnet
LHCONE is one part of the network infrastructure that supports the LHC
CERN rarrT1 miles kms
France 350 565
Italy 570 920
UK 625 1000
Netherlands 625 1000
Germany 700 1185
Spain 850 1400
Nordic 1300 2100
USA ndash New York 3900 6300
USA - Chicago 4400 7100
Canada ndash BC 5200 8400
Taiwan 6100 9850
CERN Computer Center
The LHC Optical Private Network
(LHCOPN)
LHC Tier 1Data Centers
LHC Tier 2 Analysis Centers
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups Universities
physicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
The LHC Open Network
Environment(LHCONE)
50 Gbs (25Gbs ATLAS 25Gbs CMS)
detector
Level 1 and 2 triggers
Level 3 trigger
O(1-10) meter
O(10-100) meters
O(1) km
1 PBs
500-10000 km
This is intended to indicate that the physics
groups now get their datawherever it is most readily
available
A Network Centric View of the LHC
Taiwan Canada USA-Atlas USA-CMS
Nordic
UK
Netherlands Germany Italy
Spain
FranceCERN
57
7) New network servicesPoint-to-Point Virtual Circuit Service
Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly
consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of
systemsrdquondash Break up the task of massive data analysis and use data compute and
storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements
A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work
well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm
58
Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual
circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism
bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path
ndash MPLS and OpenFlow are examples of this and both can transport IP packets
ndash Most modern Internet routers have this type of functionality
bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths
when they are set up
59
Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service
(For more information contact the project lead Chin Guok chinesnet)
bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references
bull OSCARS received a 2013 ldquoRampD 100rdquo award
60
End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo
ndash Sites for the most part
bull How are the circuits usedndash End system to end system IP
bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure
networks
ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future
ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP
protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
ndash Point-to-point connection between routing instance ndash eg BGP at the end points
bull Essentially this is how all current circuits are used from one site router to another site router
ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering
Cross-Domain Virtual Circuit Service
Science collaborations span multiple network domains (administrative units)
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
– e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains
[Figure: End-to-end virtual circuit setup across five domains – user source at FNAL (AS3152) [US], through ESnet (AS293) [US] (OSCARS), GÉANT (AS20965) [Europe] (AutoBAHN), and DFN (AS680) [Germany], to the user destination at DESY (AS1754) [Germany]. Each domain runs a local InterDomain Controller (IDC), and a data plane connection helper operates at each domain ingress/egress point.]
1. The domains exchange topology information containing at least potential VC ingress and egress points
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved
3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process
The result is the end-to-end virtual circuit. (A toy illustration of this setup pattern follows.)
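The following Python sketch illustrates the domain-to-domain setup pattern in the three steps above (topology known, request passed along the chain, each domain authorizing and reserving its own segment). It is a simplified illustration of the pattern only, not the actual IDC/NSI protocol.

```python
# Toy illustration of inter-domain virtual circuit setup: a VC request initiated
# at one end is passed from domain controller to domain controller, each
# authorizing and reserving its own segment.  Not the actual IDC/NSI protocol.
from dataclasses import dataclass, field

@dataclass
class DomainController:
    name: str                      # e.g. "ESnet (AS293)"
    reserved: list = field(default_factory=list)

    def reserve_segment(self, request):
        """Authorize and reserve the segment of the circuit inside this domain."""
        segment = (request["src"], request["dst"], request["gbps"], self.name)
        self.reserved.append(segment)
        return True

def setup_circuit(path, request):
    """Pass the VC setup request along the chain of domain controllers."""
    segments = []
    for idc in path:                       # domain to domain, in order
        if not idc.reserve_segment(request):
            # In a real protocol the already-reserved segments would be torn down.
            raise RuntimeError(f"reservation refused in {idc.name}")
        segments.append(idc.name)
    return segments                        # the end-to-end virtual circuit

# Example chain corresponding to the figure (controller objects are hypothetical)
chain = [DomainController(n) for n in
         ["FNAL (AS3152)", "ESnet (AS293)", "GEANT (AS20965)",
          "DFN (AS680)", "DESY (AS1754)"]]
print(setup_circuit(chain, {"src": "FNAL", "dst": "DESY", "gbps": 10}))
```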
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system (a small co-scheduling sketch follows)
• Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net
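As an illustration of combining the circuit service with CPU and storage scheduling into a predictable system, here is a minimal co-scheduling sketch in Python. All three reserve_* functions are hypothetical placeholders standing in for the respective schedulers' real interfaces.

```python
# Minimal sketch of the "predictable system" idea: co-schedule compute, storage,
# and a guaranteed-bandwidth circuit for the same time window, so a transfer-and-
# analyze campaign can be planned end to end.  All reserve_* functions are
# hypothetical placeholders, not real scheduler APIs.
from datetime import datetime, timedelta

def reserve_cpu(site, cores, start, end):            # placeholder: batch scheduler
    return {"site": site, "cores": cores, "start": start, "end": end}

def reserve_storage(site, terabytes, start, end):     # placeholder: storage system
    return {"site": site, "TB": terabytes, "start": start, "end": end}

def reserve_network(src, dst, gbps, start, end):      # placeholder: circuit service
    return {"src": src, "dst": dst, "gbps": gbps, "start": start, "end": end}

def schedule_campaign(src, dst, dataset_tb, gbps, cores, start):
    """Book all three resources for the window implied by data volume and rate."""
    hours = dataset_tb * 8e12 / (gbps * 1e9) / 3600   # transfer time at reserved rate
    end = start + timedelta(hours=hours)
    return (reserve_network(src, dst, gbps, start, end),
            reserve_storage(dst, dataset_tb, start, end),
            reserve_cpu(dst, cores, end, end + timedelta(hours=12)))

print(schedule_campaign("tier1.example.org", "tier2.example.org",
                        dataset_tb=100, gbps=20, cores=2000,
                        start=datetime(2014, 4, 1)))
```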
65
8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
– With each generation of network transport technology:
• 155 Mb/s was the norm for high-speed networks in 1995
• 100 Gb/s – 650 times greater – is the norm today
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
• and then do the development necessary for applications to make use of the new capabilities
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
66
Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations
67
The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR How-to
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
69
Infrastructure Critical to Science
• The combination of:
– New network architectures in the wide area
– New network services (such as guaranteed bandwidth virtual circuits)
– Cross-domain network error detection and correction
– Redesigning the site LAN to handle high data throughput
– Automation of data movement systems
– Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKA
The lessons:
• The science data product (output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location
– A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository:
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
• high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
militate against a single large data center
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites
73
LHC lessons of possible use to the SKA
Regardless of distributed vs. centralized working data repository, all of the attendant network lessons will apply:
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
• It might be that in the case of the SKA the T1 links would come to a centralized, data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE
74
LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
• Re-engineering the site LAN-WAN architecture is critical: the Science DMZ
• Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
75
The Message
Again … a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010.
(May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
42
6) Data movement and management techniques
Automated data movement is critical for moving 500 terabytes/day between 170 international sites. In order to effectively move large amounts of data over the network, automated systems must be used to manage workflow and error recovery.
• The filtered ATLAS data rate of about 25 Gb/s is sent to 10 national Tier 1 data centers
• The Tier 2 sites get a comparable amount of data from the Tier 1s
– Host the physics groups that analyze the data and do the science
– Provide most of the compute resources for analysis
– Cache the data (though this is evolving to remote I/O)
43
Highly distributed and highly automated workflow systems
• The ATLAS experiment system (PanDA) coordinates the analysis resources and the data management
– The resources and data movement are centrally managed
– Analysis jobs are submitted to the central manager, which locates compute resources and matches these with dataset locations
– The system manages tens of thousands of jobs a day
• coordinates data movement of hundreds of terabytes/day, and
• manages (analyzes, generates, moves, stores) of order 10 petabytes of data/year in order to accomplish its science
• The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial (a simplified sketch of the central-broker / pilot-job pattern follows)
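The sketch below illustrates, in simplified form, the pattern described above: a central broker with a task queue, data-locality-aware matching of jobs to sites, and pilot jobs at the sites that pull work when resources are free. It is a toy illustration of the pattern only, not PanDA's actual implementation or API.

```python
# Toy sketch of a highly automated workflow pattern: a central broker holds a
# task queue, matches jobs to sites that already hold the input dataset, and
# pilot jobs running at the sites pull work when local resources are free.
import queue

class Broker:
    def __init__(self, dataset_locations):
        self.tasks = queue.Queue()
        self.dataset_locations = dataset_locations   # dataset -> set of sites

    def submit(self, job_id, dataset):
        self.tasks.put((job_id, dataset))

    def next_job_for(self, site):
        """Hand a queued job to a pilot if the site already has the data."""
        deferred = []
        job = None
        while not self.tasks.empty():
            job_id, dataset = self.tasks.get()
            if site in self.dataset_locations.get(dataset, set()):
                job = (job_id, dataset)               # data is local: run here
                break
            deferred.append((job_id, dataset))        # else leave for another site
        for item in deferred:                         # put unmatched jobs back
            self.tasks.put(item)
        return job

# Example: two datasets replicated at different sites (names are hypothetical)
broker = Broker({"data-A": {"site-1", "site-3"}, "data-B": {"site-2"}})
broker.submit("job-001", "data-A")
broker.submit("job-002", "data-B")
print(broker.next_job_for("site-2"))   # pilot at site-2 gets job-002
print(broker.next_job_for("site-1"))   # pilot at site-1 gets job-001
```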
44
[Figure: The ATLAS PanDA "Production and Distributed Analysis" system, which uses distributed resources and layers of automation to manage several million jobs/day.
– Job sources: ATLAS production jobs, regional production jobs, and user/group analysis jobs feed the PanDA Server (task management), which comprises a Data Service, Task Buffer (job queue), Job Broker, Job Dispatcher, and Policy (job type priority) components, plus Distributed Data Manager (DDM) agents.
– CERN: the ATLAS detector and the Tier 0 Data Center (1 copy of all data – archival only).
– ATLAS Tier 1 Data Centers: 11 sites scattered across Europe, North America, and Asia that in aggregate hold 1 copy of all data and provide the working data set for distribution to Tier 2 centers for analysis.
– ATLAS analysis sites: e.g. 70 Tier 2 Centers in Europe, North America, and SE Asia.
– Job resource manager: dispatches a "pilot" job manager – a PanDA job receiver – when resources are available at a site. Pilots run under the local site job manager (e.g. Condor, LSF, LCG, …) and accept jobs in a standard format from PanDA, similar to the Condor Glide-in approach.
Workflow: 1) PanDA schedules jobs and initiates data movement; 2) the DDM system (a complex system in its own right, called DQ2) locates data and moves it to sites; 3) the pilot prepares the local resources to receive PanDA jobs; 4) jobs are dispatched when there are resources available and when the required data is in place at the site. The strategy is to try to move the job to where the data is, else move data and job to where resources are available.
Thanks to Michael Ernst, US ATLAS technical lead, for his assistance with this diagram, and to Torre Wenaus, whose view graphs provided the starting point. (Both are at Brookhaven National Lab.)]
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe, N. America, and SE Asia generate network data movement of 730 TBy/day, ~68 Gb/s.
PanDA manages 120,000–140,000 simultaneous jobs (PanDA manages two types of jobs, shown separately in the accompanying charts of accumulated data volume on disk and job counts).
It is this scale of data movement – going on 24 hr/day, 9+ months/yr – that networks must support in order to enable the large-scale science of the LHC.
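As a quick consistency check of the two figures quoted above (assuming decimal terabytes and a 24-hour day):
730 × 10^12 bytes/day × 8 bits/byte ÷ 86,400 s/day ≈ 6.8 × 10^10 b/s ≈ 68 Gb/s.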
[Charts: accumulated data volume on disk (petabytes, 0–150 scale, over four years) and the number of type 1 and type 2 PanDA jobs (0–50,000 and 0–100,000 scales, shown per year).]
46
Building an LHC-scale production analysis system
In order to debug and optimize the distributed system that accomplishes the scale of the ATLAS analysis, years were spent building and testing the required software and hardware infrastructure.
– Once the systems were in place, systematic testing was carried out in "service challenges" or "data challenges"
– Successful testing was required for sites to participate in LHC production
47
Ramp-up of LHC traffic in ESnet
[Chart: ESnet traffic over time, annotated with an estimate of "small" scale traffic, the LHC turn-on, the LHC data system testing period, and LHC operation.]
The transition from testing to operation was a smooth continuum due to at-scale testing – a process that took more than 5 years.
48
6 cont) Evolution of network architectures
For sustained high data-rate transfers – e.g. from instrument to data centers – a dedicated, purpose-built infrastructure is needed.
• The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPN
– The LHCOPN is a collection of leased 10 Gb/s optical circuits
– The role of LHCOPN is to ensure that all data moves from CERN to the national Tier 1 data centers continuously
• In addition to providing the working dataset for the analysis groups, the Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN – Optical Private Network
• While the LHCOPN was a technically straightforward exercise – establishing 10 Gb/s links between CERN and the Tier 1 data centers for distributing the detector output data – there were several aspects that were new to the R&E community
• The issues related to the fact that most sites connected to the R&E WAN infrastructure through a site firewall, and the OPN was intended to bypass site firewalls in order to achieve the necessary performance
– The security issues were the primary ones, and were addressed by using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
– that is, only LHC data and compute servers are connected to the OPN
50
The LHC OPN – Optical Private Network
[Figure: LHCOPN physical topology (abbreviated) and LHCOPN architecture – CH-CERN connected to the Tier 1 centers UK-T1_RAL, NDGF, FR-CCIN2P3, ES-PIC, DE-KIT, NL-T1, US-FNAL-CMS, US-T1-BNL, CA-TRIUMF, TW-ASGC, and IT-INFN-CNAF.]
51
The LHC OPN – Optical Private Network
N.B.
• In 2005 the only way to handle the CERN (T0) to Tier 1 centers data transfer was to use dedicated physical 10G circuits
• Today, in most R&E networks (where 100 Gb/s links are becoming the norm), the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlays
– The ESnet part of the LHCOPN has used this approach for more than 5 years – in fact this is what ESnet's OSCARS virtual circuit system was originally designed for (see below)
– However, such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic
– In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic
– (there are about 170 Tier 2 sites)
• Managing this with all possible combinations of Tier 2 – Tier 2 flows (potentially 170 × 170, roughly 29,000 site pairs) cannot be done just using a virtual circuit service – it is a relatively heavy-weight mechanism
• Special infrastructure is required for this: the LHC's Open Network Environment – LHCONE – was designed for this purpose
53
The LHC's Open Network Environment – LHCONE
LHCONE provides a private, managed infrastructure designed for LHC Tier 2 traffic (and likely other large-data science projects in the future).
The approach is an overlay network whose architecture is a collection of routed "clouds" using address spaces restricted to subnets that are used by LHC systems
– The clouds are mostly local to a network domain (e.g. one for each involved domain – ESnet, GEANT ("fronts" for the NRENs), Internet2 (fronts for the US universities), etc.)
– The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineers
– To ensure continued good performance for the LHC, and to ensure that other traffic is not impacted – this is critical because, apart from the LHCOPN, the R&E networks are funded for the benefit of the entire R&E community, not just the LHC
54
[Figure: LHCONE – a global infrastructure for the LHC Tier 1 data center – Tier 2 analysis center connectivity (April 2012). The map shows LHCONE VRF domains in the participating R&E networks – ESnet and Internet2 (USA), CANARIE (Canada), TWAREN and ASGC (Taiwan), KREONET2 and KISTI (Korea), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), CUDI (Mexico), GÉANT (Europe), and CERN (Geneva) – interconnected at exchange points such as Seattle, Chicago, New York, Washington, Amsterdam, and Geneva. End sites are LHC Tier 2 or Tier 3 unless indicated as Tier 1 (e.g. BNL-T1, FNAL-T1, TRIUMF-T1, DE-KIT-T1, CNAF-T1, PIC-T1, NIKHEF-T1, NDGF-T1, ASGC-T1, CC-IN2P3-T1, CERN-T1), attached via data communication links of 10, 20, and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance
• See lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
[Figure: A Network Centric View of the LHC. The detector (1 PB/s) feeds the Level 1 and 2 triggers at O(1-10) meters, the Level 3 trigger at O(10-100) meters, and the CERN Computer Center at O(1) km. From CERN, 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) flows over the LHC Optical Private Network (LHCOPN) to the LHC Tier 1 Data Centers (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN) at 500-10,000 km, and from there over the LHC Open Network Environment (LHCONE) to the LHC Tier 2 Analysis Centers (university physics groups). This is intended to indicate that the physics groups now get their data wherever it is most readily available.
Approximate CERN → Tier 1 distances (miles / km): France 350/565, Italy 570/920, UK 625/1000, Netherlands 625/1000, Germany 700/1185, Spain 850/1400, Nordic 1300/2100, USA–New York 3900/6300, USA–Chicago 4400/7100, Canada–BC 5200/8400, Taiwan 6100/9850.]
57
7) New network servicesPoint-to-Point Virtual Circuit Service
Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly
consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of
systemsrdquondash Break up the task of massive data analysis and use data compute and
storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements
A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work
well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm
58
Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual
circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism
bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path
ndash MPLS and OpenFlow are examples of this and both can transport IP packets
ndash Most modern Internet routers have this type of functionality
bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths
when they are set up
59
Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service
(For more information contact the project lead Chin Guok chinesnet)
bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references
bull OSCARS received a 2013 ldquoRampD 100rdquo award
60
End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo
ndash Sites for the most part
bull How are the circuits usedndash End system to end system IP
bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure
networks
ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future
ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP
protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
ndash Point-to-point connection between routing instance ndash eg BGP at the end points
bull Essentially this is how all current circuits are used from one site router to another site router
ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters
61
End User View of Circuits ndash How They Use Thembull When are the circuits used
ndash Mostly to solve a specific problem that the general infrastructure cannot
bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering
network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains
involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET
(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains
63
Inter-Domain Control Protocolbull There are two realms involved
1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains
2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains
FNAL (AS3152)[US]
ESnet (AS293)[US]
GEANT (AS20965)[Europe]
DFN (AS680)[Germany]
DESY (AS1754)[Germany]
Topology exchange
VC setup request
Local InterDomain
Controller
Local IDC
Local IDC
Local IDC
Local IDC
VC setup request
VC setup request
VC setup request
OSCARS
User source
User destination
VC setup request
data plane connection helper at each domain ingressegress point
data plane connection helper at each domain ingressegress point
1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to
domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process
The end-to-end virtual circuit
AutoBAHN
64
Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved
into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated
Facility - an international virtual organization that promotes the paradigm of lambda networking)
bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet
65
8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the
network to support data-intensive sciencendash With each generation of network transport technology
bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and
application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo
end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of
the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities
includebull experiments in the 1990s in using parallel disk IO and parallel network IO
together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths
bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs
66
Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects
are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build
a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at
httpfasterdataesnet and contains contributions from several organizations
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional
centers) has worked wellndash It decentralizes costs and involves many countries directly in the
telescope infrastructurendash It divides up the network load especially on the expensive trans-
ocean linksndash It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKARegardless of distributed vs centralized working data
repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source
to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a
centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue
bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC
ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
78
References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations
httpwwwperfsonarnet
httppspsperfsonarnet
[REQ] httpswwwesnetaboutscience-requirements
[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010
(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )
[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
43
Highly distributed and highly automated workflow systemsbull The ATLAS experiment system (PanDA) coordinates the
analysis resources and the data managementndash The resources and data movement are centrally managedndash Analysis jobs are submitted to the central manager that locates
compute resources and matches these with dataset locationsndash The system manages 10s of thousands of jobs a day
bull coordinates data movement of hundreds of terabytesday andbull manages (analyzes generates moves stores) of order 10
petabytes of datayear in order to accomplish its science
bull The complexity of the distributed systems that have to coordinate the computing and data movement for data analysis at the hundreds of institutions spread across three continents involved in the LHC experiments is substantial
44
DDMAgent
DDMAgent
ATLAS production
jobs
Regional production
jobs
User Group analysis jobs
Data Service
Task Buffer(job queue)
Job Dispatcher
PanDA Server(task management)
Job Broker
Policy(job type priority)
ATLA
S Ti
er 1
Data
Cen
ters
11 s
ites
scat
tere
d ac
ross
Euro
pe N
orth
Am
erica
and
Asia
in
aggr
egat
e ho
ld 1
copy
of a
ll dat
a an
d pr
ovide
the
work
ing
data
set f
or d
istrib
ution
to T
ier 2
cen
ters
for a
nalys
isDistributed
Data Manager
Pilot Job(Panda job
receiver running under the site-
specific job manager)
Grid Scheduler
Site Capability Service
CERNATLAS detector
Tier 0 Data Center(1 copy of all data ndash
archival only)
Job resource managerbull Dispatch a ldquopilotrdquo job manager - a
Panda job receiver - when resources are available at a site
bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA
bull Similar to the Condor Glide-in approach
Site status
ATLAS analysis sites(eg 70 Tier 2 Centers in
Europe North America and SE Asia)
DDMAgent
DDMAgent
1) Schedules jobs initiates data movement
2) DDM locates data and moves it to sites
This is a complex system in its own right called DQ2
3) Prepares the local resources to receive Panda jobs
4) Jobs are dispatched when there are resources available and when the required data is
in place at the site
Thanks to Michael Ernst US ATLAS technical lead for his assistance with this
diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)
The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday
CERN
Try to move the job to where the data is else move data and job to where
resources are available
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe N America and SE Asia generate network
data movement of 730 TByday ~68Gbs
Accumulated data volume on disk
730 TBytesday
PanDA manages 120000ndash140000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movementgoing on 24 hrday 9+ monthsyr
that networks must support in order to enable the large-scale science of the LHC
0
50
100
150
Peta
byte
s
four years
0
50000
100000
type
2jo
bs
0
50000
type
1jo
bs
one year
one year
one year
46
Building an LHC-scale production analysis system In order to debug and optimize the distributed system that
accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in
ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC
production
47
Ramp-up of LHC traffic in ESnet
(est of ldquosmallrdquo scale traffic)
LHC
turn
-on
LHC data systemtesting
LHC operationThe transition from testing to operation
was a smooth continuum due toat-scale testing ndash a process that took
more than 5 years
48
6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument
to data centers ndash a dedicated purpose-built infrastructure is needed
bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to
the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the
Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward
exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community
bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by
bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
ndash that is only LHC data and compute servers are connected to the OPN
50
The LHC OPN ndash Optical Private Network
UK-T1_RAL
NDGF
FR-CCIN2P3
ES-PIC
DE-KIT
NL-T1
US-FNAL-CMS
US-T1-BNL
CA-TRIUMF
TW-ASCG
IT-NFN-CNAF
CH-CERNLHCOPN physical (abbreviated)
LHCOPN architecture
51
The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1
centers data transfer was to use dedicated physical 10G circuits
Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than
5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)
ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)
bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism
bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose
53
The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure
designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each
involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc
ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure
that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC
54
ESnetUSA
Chicago
New YorkBNL-T1
Internet2USA
Harvard
CANARIECanada
UVic
SimFraU
TRIUMF-T1
UAlb UTorMcGilU
Seattle
TWARENTaiwan
NCU NTU
ASGCTaiwan
ASGC-T1
KERONET2Korea
KNU
LHCONE VRF domain
End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1
Regional RampE communication nexus
Data communication links 10 20 and 30 Gbs
See httplhconenet for details
NTU
Chicago
LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity
NORDUnetNordic
NDGF-T1aNDGF-T1a NDGF-T1c
DFNGermany
DESYGSI DE-KIT-T1
GARRItaly
INFN-Nap CNAF-T1
RedIRISSpain
PIC-T1
SARANetherlands
NIKHEF-T1
RENATERFrance
GRIF-IN2P3
Washington
CUDIMexico
UNAM
CC-IN2P3-T1Sub-IN2P3
CEA
CERNGeneva
CERN-T1
SLAC
GLakes
NE
MidWSoW
Geneva
KISTIKorea
TIFRIndia
India
Korea
FNAL-T1
MIT
CaltechUFlorida
UNebPurU
UCSDUWisc
UltraLightUMich
Amsterdam
GEacuteANT Europe
April 2012
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
CERN → T1           miles      km
France                350     565
Italy                 570     920
UK                    625    1000
Netherlands           625    1000
Germany               700    1185
Spain                 850    1400
Nordic               1300    2100
USA – New York       3900    6300
USA – Chicago        4400    7100
Canada – BC          5200    8400
Taiwan               6100    9850
[Figure: A Network Centric View of the LHC. Data flows from the detector (~1 PB/s) through the Level 1 and 2 triggers (O(1-10) meters) and the Level 3 trigger (O(10-100) meters) to the CERN Computer Center (O(1) km), then at 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) over the LHC Optical Private Network (LHCOPN, 500-10,000 km) to the LHC Tier 1 Data Centers (Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN), and from there over the LHC Open Network Environment (LHCONE) to the LHC Tier 2 Analysis Centers (universities and physics groups). The figure indicates that the physics groups now get their data wherever it is most readily available.]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a Circuit Service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed application systems in order to
– couple existing pockets of code, data, and expertise into "systems of systems"
– break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
– see https://www.es.net/about/science-requirements
• A commonly identified need to support this is that networking must be provided as a "service" (an illustrative reservation request is sketched after this list):
– Schedulable, with guaranteed bandwidth – as is done with CPUs and disks
– Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
– Some network path characteristics may also be specified – e.g. diversity
– Available in a Web Services / Grid Services paradigm
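As an illustration of what "network as a schedulable service" means in practice, the sketch below shows the kind of parameters a guaranteed-bandwidth circuit request carries (endpoints, bandwidth, reservation window, path constraints). The field names and the request_circuit helper are invented for illustration; they are not the actual OSCARS or NSI API.

    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class CircuitRequest:
        """Illustrative guaranteed-bandwidth reservation (not the real OSCARS/NSI schema)."""
        src_endpoint: str          # ingress port / VLAN, e.g. a site border router interface
        dst_endpoint: str
        bandwidth_mbps: int        # guaranteed rate to be policed/shaped by the network
        start: datetime
        end: datetime
        path_constraint: str = ""  # e.g. "diverse-from:circuit-1234"

    def request_circuit(req: CircuitRequest) -> str:
        """Hypothetical submission helper: validate and hand the request to a domain controller."""
        if req.end <= req.start or req.bandwidth_mbps <= 0:
            raise ValueError("reservation window and bandwidth must be positive")
        # A real service would check topology, per-link available capacity,
        # and existing reservations before committing resources.
        return f"reserved {req.bandwidth_mbps} Mb/s {req.src_endpoint} -> {req.dst_endpoint}"

    print(request_circuit(CircuitRequest(
        src_endpoint="site-A:xe-0/1/0.3012",
        dst_endpoint="site-B:et-3/0/0.3012",
        bandwidth_mbps=10_000,
        start=datetime(2014, 4, 1, 0, 0),
        end=datetime(2014, 4, 1, 0, 0) + timedelta(hours=12),
    )))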
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
– This is typically done by using a "static" routing mechanism (a minimal sketch follows below).
• E.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path.
– MPLS and OpenFlow are examples of this, and both can transport IP packets.
– Most modern Internet routers have this type of functionality.
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic.
– The virtual circuits can be directed to specific physical network paths when they are set up.
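A minimal sketch of the "static switch tables set up in advance" idea: each switch holds a label-forwarding table installed at circuit-provisioning time, and packets simply follow the pre-computed label path, with no per-packet routing decision (roughly how an MPLS label-switched path behaves). The topology, ports, and labels are invented for illustration.

    # Per-switch label forwarding tables, installed at circuit-setup time.
    # Key: (in_port, in_label) -> (out_port, out_label).
    LFIB = {
        "switch-A": {("p1", 100): ("p2", 200)},
        "switch-B": {("p1", 200): ("p3", 300)},
        "switch-C": {("p2", 300): ("p4", "pop")},   # last hop pops the label
    }
    # Physical wiring of the example path: (switch, egress port) -> (next switch, its ingress port).
    LINKS = {("switch-A", "p2"): ("switch-B", "p1"),
             ("switch-B", "p3"): ("switch-C", "p2")}

    def forward(packet: dict, switch: str, in_port: str) -> None:
        """Follow the pre-provisioned label path hop by hop."""
        while True:
            out_port, out_label = LFIB[switch][(in_port, packet["label"])]
            if out_label == "pop":
                print(f"{switch}: delivered payload {packet['payload']!r} on {out_port}")
                return
            packet["label"] = out_label
            switch, in_port = LINKS[(switch, out_port)]

    # A packet entering the circuit at switch-A, port p1, with label 100:
    forward({"label": 100, "payload": "1 MB data block"}, "switch-A", "p1")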
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.
60
End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part.
• How are the circuits used?
– End system to end system, IP:
• Almost never – very hard unless private address space is used.
– Using public address space can result in leaking routes.
– Using private address space with multi-homed hosts risks allowing backdoors into secure networks.
– End system to end system, Ethernet (or other) over VLAN – a pseudowire:
• Relatively common.
• Interesting example: RDMA over VLAN is likely to be popular in the future.
– The SC11 demo of 40G RDMA over the WAN was very successful.
– CPU load for RDMA is a small fraction of that for IP.
– The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks).
– Point-to-point connection between routing instances – e.g. BGP at the end points:
• Essentially this is how all current circuits are used, from one site router to another site router.
– Typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters.
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot.
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering.
Cross-Domain Virtual Circuit Service
• Large-scale science collaborations span many network domains (administrative units).
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits.
– E.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.
[Figure: An end-to-end virtual circuit built across five domains – FNAL (AS3152) and ESnet (AS293) in the US, GEANT (AS20965) in Europe, and DFN (AS680) and DESY (AS1754) in Germany. Each domain runs a local inter-domain controller (IDC) – OSCARS in ESnet, AutoBAHN in GÉANT – plus a data plane connection helper at each domain ingress/egress point; topology exchange and VC setup requests pass from domain to domain between the IDCs, connecting the user source to the user destination.]
1. The domains exchange topology information containing at least the potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. Ethernet VLAN to VLAN connection) is facilitated by a helper process.
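A toy model of the setup flow just described: the request enters at one end and each domain's controller authorizes and reserves its own segment before passing the request on; the end-to-end circuit exists only if every domain along the path succeeds. The domain list and capacities (in Gb/s) are invented for the example.

    # Each domain controller (OSCARS, AutoBAHN, ...) manages only its own segment.
    DOMAINS = [
        {"name": "FNAL",  "available_gbps": 20},
        {"name": "ESnet", "available_gbps": 40},
        {"name": "GEANT", "available_gbps": 30},
        {"name": "DFN",   "available_gbps": 20},
        {"name": "DESY",  "available_gbps": 10},
    ]

    def setup_interdomain_circuit(path: list, gbps: float) -> bool:
        """Pass the VC setup request domain to domain, reserving each segment in turn."""
        reserved = []
        for dom in path:
            if dom["available_gbps"] < gbps:
                # One domain cannot authorize/reserve its segment: roll back the others.
                for d in reserved:
                    d["available_gbps"] += gbps
                print(f"setup failed in {dom['name']}; reservations rolled back")
                return False
            dom["available_gbps"] -= gbps
            reserved.append(dom)
            print(f"{dom['name']}: segment reserved ({gbps} Gb/s)")
        print("end-to-end virtual circuit established")
        return True

    setup_interdomain_circuit(DOMAINS, 10)   # succeeds across all five domains
    setup_interdomain_circuit(DOMAINS, 15)   # fails where capacity is lacking, rolls back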
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system.
• Multi-domain circuit setup is not yet a robust production service, but progress is being made.
• See lhcone.net
65
8) Provide R&D, consulting, and a knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science.
– With each generation of network transport technology:
• 155 Mb/s was the norm for high-speed networks in 1995.
• 100 Gb/s – 650 times greater – is the norm today.
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
• and then do the development necessary for applications to make use of the new capabilities.
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
66
Provide R&D, consulting, and a knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning (see the sketch after this list)
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
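As a small, hedged example of the kind of host-tuning check the knowledge base describes: a long, fast path needs TCP buffers at least as large as the bandwidth-delay product, and on Linux the relevant ceilings can be read from /proc. The sysctl names below are the standard Linux ones; the target path (10 Gb/s, 90 ms RTT) is just an example, not a value from the slides.

    # Compare the host's maximum TCP receive buffer with the bandwidth-delay
    # product of an example path (10 Gb/s, 90 ms RTT). Run on Linux.
    BANDWIDTH_BPS = 10e9
    RTT_S = 0.090
    bdp_bytes = int(BANDWIDTH_BPS / 8 * RTT_S)          # ~112 MB for this path

    def read_int(path: str, field: int = 0) -> int:
        with open(path) as f:
            return int(f.read().split()[field])

    # net.ipv4.tcp_rmem is "min default max"; net.core.rmem_max caps what
    # an application can request with setsockopt(SO_RCVBUF).
    tcp_rmem_max = read_int("/proc/sys/net/ipv4/tcp_rmem", 2)
    core_rmem_max = read_int("/proc/sys/net/core/rmem_max")

    print(f"bandwidth-delay product: {bdp_bytes/1e6:.0f} MB")
    print(f"tcp_rmem max:  {tcp_rmem_max/1e6:.0f} MB")
    print(f"core rmem_max: {core_rmem_max/1e6:.0f} MB")
    if min(tcp_rmem_max, core_rmem_max) < bdp_bytes:
        print("buffers are smaller than the BDP - a single TCP stream cannot fill this path")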
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
69
Infrastructure Critical to Science
• The combination of
– new network architectures in the wide area,
– new network services (such as guaranteed bandwidth virtual circuits),
– cross-domain network error detection and correction,
– redesigning the site LAN to handle high data throughput,
– automation of data movement systems, and
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.
70
LHC lessons of possible use to the SKA: The similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated/sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.
71
LHC lessons of possible use to the SKA: The lessons
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
– A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
– The technical aspects of building and operating a centralized working data repository –
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time, and
• high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites –
militate against a single large data center.
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well.
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that in the case of the SKA the T1 links would come to a centralized, distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (The impact of even small loss rates on long paths is illustrated in the sketch below.)
New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
Re-engineering the site LAN-WAN architecture is critical: the Science DMZ.
Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
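To show why the "fragile workhorse" lesson matters quantitatively, the sketch below applies the well-known Mathis et al. approximation for loss-limited throughput of Reno-style TCP, rate ≈ (MSS/RTT)·(1.22/√p), to a long path; the RTT and loss rates are example values chosen here, not numbers from the slides (modern TCP stacks do better, but the qualitative conclusion stands).

    from math import sqrt

    MSS_BYTES = 1460          # standard Ethernet-sized TCP segments
    RTT_S = 0.150             # ~150 ms, a typical trans-ocean round-trip time

    def mathis_throughput_gbps(loss_rate: float) -> float:
        """Loss-limited TCP throughput estimate (Mathis et al. approximation)."""
        bytes_per_s = (MSS_BYTES / RTT_S) * (1.22 / sqrt(loss_rate))
        return bytes_per_s * 8 / 1e9

    for p in (1e-7, 1e-5, 1e-3):
        print(f"loss {p:.0e}: ~{mathis_throughput_gbps(p):6.3f} Gb/s per TCP stream")
    # Even a 1-in-10-million loss rate caps a single stream at a tiny fraction of a
    # 10-100 Gb/s path at this RTT, which is why long paths must be kept essentially
    # error-free (or many parallel streams must be used).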
75
The Message
Again … A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
44
DDMAgent
DDMAgent
ATLAS production
jobs
Regional production
jobs
User Group analysis jobs
Data Service
Task Buffer(job queue)
Job Dispatcher
PanDA Server(task management)
Job Broker
Policy(job type priority)
ATLA
S Ti
er 1
Data
Cen
ters
11 s
ites
scat
tere
d ac
ross
Euro
pe N
orth
Am
erica
and
Asia
in
aggr
egat
e ho
ld 1
copy
of a
ll dat
a an
d pr
ovide
the
work
ing
data
set f
or d
istrib
ution
to T
ier 2
cen
ters
for a
nalys
isDistributed
Data Manager
Pilot Job(Panda job
receiver running under the site-
specific job manager)
Grid Scheduler
Site Capability Service
CERNATLAS detector
Tier 0 Data Center(1 copy of all data ndash
archival only)
Job resource managerbull Dispatch a ldquopilotrdquo job manager - a
Panda job receiver - when resources are available at a site
bull Pilots run under the local site job manager (eg Condor LSF LCGhellip) and accept jobs in a standard format from PanDA
bull Similar to the Condor Glide-in approach
Site status
ATLAS analysis sites(eg 70 Tier 2 Centers in
Europe North America and SE Asia)
DDMAgent
DDMAgent
1) Schedules jobs initiates data movement
2) DDM locates data and moves it to sites
This is a complex system in its own right called DQ2
3) Prepares the local resources to receive Panda jobs
4) Jobs are dispatched when there are resources available and when the required data is
in place at the site
Thanks to Michael Ernst US ATLAS technical lead for his assistance with this
diagram and to Torre Wenaus whose view graphs provided the starting point (Both are at Brookhaven National Lab)
The ATLAS PanDA ldquoProduction and Distributed Analysisrdquo system uses distributed resources and layers of automation to manage several million jobsday
CERN
Try to move the job to where the data is else move data and job to where
resources are available
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe N America and SE Asia generate network
data movement of 730 TByday ~68Gbs
Accumulated data volume on disk
730 TBytesday
PanDA manages 120000ndash140000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movementgoing on 24 hrday 9+ monthsyr
that networks must support in order to enable the large-scale science of the LHC
0
50
100
150
Peta
byte
s
four years
0
50000
100000
type
2jo
bs
0
50000
type
1jo
bs
one year
one year
one year
46
Building an LHC-scale production analysis system In order to debug and optimize the distributed system that
accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in
ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC
production
47
Ramp-up of LHC traffic in ESnet
(est of ldquosmallrdquo scale traffic)
LHC
turn
-on
LHC data systemtesting
LHC operationThe transition from testing to operation
was a smooth continuum due toat-scale testing ndash a process that took
more than 5 years
48
6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument
to data centers ndash a dedicated purpose-built infrastructure is needed
bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to
the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the
Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward
exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community
bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by
bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
ndash that is only LHC data and compute servers are connected to the OPN
50
The LHC OPN ndash Optical Private Network
UK-T1_RAL
NDGF
FR-CCIN2P3
ES-PIC
DE-KIT
NL-T1
US-FNAL-CMS
US-T1-BNL
CA-TRIUMF
TW-ASCG
IT-NFN-CNAF
CH-CERNLHCOPN physical (abbreviated)
LHCOPN architecture
51
The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1
centers data transfer was to use dedicated physical 10G circuits
Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than
5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)
ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)
bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism
bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose
53
The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure
designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each
involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc
ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure
that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC
54
ESnetUSA
Chicago
New YorkBNL-T1
Internet2USA
Harvard
CANARIECanada
UVic
SimFraU
TRIUMF-T1
UAlb UTorMcGilU
Seattle
TWARENTaiwan
NCU NTU
ASGCTaiwan
ASGC-T1
KERONET2Korea
KNU
LHCONE VRF domain
End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1
Regional RampE communication nexus
Data communication links 10 20 and 30 Gbs
See httplhconenet for details
NTU
Chicago
LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity
NORDUnetNordic
NDGF-T1aNDGF-T1a NDGF-T1c
DFNGermany
DESYGSI DE-KIT-T1
GARRItaly
INFN-Nap CNAF-T1
RedIRISSpain
PIC-T1
SARANetherlands
NIKHEF-T1
RENATERFrance
GRIF-IN2P3
Washington
CUDIMexico
UNAM
CC-IN2P3-T1Sub-IN2P3
CEA
CERNGeneva
CERN-T1
SLAC
GLakes
NE
MidWSoW
Geneva
KISTIKorea
TIFRIndia
India
Korea
FNAL-T1
MIT
CaltechUFlorida
UNebPurU
UCSDUWisc
UltraLightUMich
Amsterdam
GEacuteANT Europe
April 2012
55
The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because
ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for
use by the LHC collaboration that cannot be made available for general RampE traffic
bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance
bull See LHCONEnet
LHCONE is one part of the network infrastructure that supports the LHC
CERN rarrT1 miles kms
France 350 565
Italy 570 920
UK 625 1000
Netherlands 625 1000
Germany 700 1185
Spain 850 1400
Nordic 1300 2100
USA ndash New York 3900 6300
USA - Chicago 4400 7100
Canada ndash BC 5200 8400
Taiwan 6100 9850
CERN Computer Center
The LHC Optical Private Network
(LHCOPN)
LHC Tier 1Data Centers
LHC Tier 2 Analysis Centers
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups Universities
physicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
The LHC Open Network
Environment(LHCONE)
50 Gbs (25Gbs ATLAS 25Gbs CMS)
detector
Level 1 and 2 triggers
Level 3 trigger
O(1-10) meter
O(10-100) meters
O(1) km
1 PBs
500-10000 km
This is intended to indicate that the physics
groups now get their datawherever it is most readily
available
A Network Centric View of the LHC
Taiwan Canada USA-Atlas USA-CMS
Nordic
UK
Netherlands Germany Italy
Spain
FranceCERN
57
7) New network servicesPoint-to-Point Virtual Circuit Service
Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly
consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of
systemsrdquondash Break up the task of massive data analysis and use data compute and
storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements
A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work
well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm
58
Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual
circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism
bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path
ndash MPLS and OpenFlow are examples of this and both can transport IP packets
ndash Most modern Internet routers have this type of functionality
bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths
when they are set up
59
Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service
(For more information contact the project lead Chin Guok chinesnet)
bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references
bull OSCARS received a 2013 ldquoRampD 100rdquo award
60
End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo
ndash Sites for the most part
bull How are the circuits usedndash End system to end system IP
bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure
networks
ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future
ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP
protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
ndash Point-to-point connection between routing instance ndash eg BGP at the end points
bull Essentially this is how all current circuits are used from one site router to another site router
ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters
61
End User View of Circuits ndash How They Use Thembull When are the circuits used
ndash Mostly to solve a specific problem that the general infrastructure cannot
bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering
network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains
involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET
(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains
63
Inter-Domain Control Protocolbull There are two realms involved
1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains
2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains
FNAL (AS3152)[US]
ESnet (AS293)[US]
GEANT (AS20965)[Europe]
DFN (AS680)[Germany]
DESY (AS1754)[Germany]
Topology exchange
VC setup request
Local InterDomain
Controller
Local IDC
Local IDC
Local IDC
Local IDC
VC setup request
VC setup request
VC setup request
OSCARS
User source
User destination
VC setup request
data plane connection helper at each domain ingressegress point
data plane connection helper at each domain ingressegress point
1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to
domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process
The end-to-end virtual circuit
AutoBAHN
64
Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved
into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated
Facility - an international virtual organization that promotes the paradigm of lambda networking)
bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet
65
8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the
network to support data-intensive sciencendash With each generation of network transport technology
bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and
application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo
end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of
the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities
includebull experiments in the 1990s in using parallel disk IO and parallel network IO
together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths
bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs
66
Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects
are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build
a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at
httpfasterdataesnet and contains contributions from several organizations
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional
centers) has worked wellndash It decentralizes costs and involves many countries directly in the
telescope infrastructurendash It divides up the network load especially on the expensive trans-
ocean linksndash It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKARegardless of distributed vs centralized working data
repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source
to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a
centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue
bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC
ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
78
References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations
httpwwwperfsonarnet
httppspsperfsonarnet
[REQ] httpswwwesnetaboutscience-requirements
[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010
(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )
[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
45
Scale of ATLAS analysis driven data movement
The PanDA jobs executing at centers all over Europe N America and SE Asia generate network
data movement of 730 TByday ~68Gbs
Accumulated data volume on disk
730 TBytesday
PanDA manages 120000ndash140000 simultaneous jobs (PanDA manages two types of jobs that are shown separately here)
It is this scale of data movementgoing on 24 hrday 9+ monthsyr
that networks must support in order to enable the large-scale science of the LHC
0
50
100
150
Peta
byte
s
four years
0
50000
100000
type
2jo
bs
0
50000
type
1jo
bs
one year
one year
one year
46
Building an LHC-scale production analysis system In order to debug and optimize the distributed system that
accomplishes the scale of the ATLAS analysis years were spent building and testing the required software and hardware infrastructurendash Once the systems were in place systematic testing was carried out in
ldquoservice challengesrdquo or ldquodata challengesrdquondash Successful testing was required for sites to participate in LHC
production
47
Ramp-up of LHC traffic in ESnet
(est of ldquosmallrdquo scale traffic)
LHC
turn
-on
LHC data systemtesting
LHC operationThe transition from testing to operation
was a smooth continuum due toat-scale testing ndash a process that took
more than 5 years
48
6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument
to data centers ndash a dedicated purpose-built infrastructure is needed
bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to
the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the
Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward
exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community
bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by
bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
ndash that is only LHC data and compute servers are connected to the OPN
50
The LHC OPN ndash Optical Private Network
UK-T1_RAL
NDGF
FR-CCIN2P3
ES-PIC
DE-KIT
NL-T1
US-FNAL-CMS
US-T1-BNL
CA-TRIUMF
TW-ASCG
IT-NFN-CNAF
CH-CERNLHCOPN physical (abbreviated)
LHCOPN architecture
51
The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1
centers data transfer was to use dedicated physical 10G circuits
Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than
5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)
ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)
bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism
bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose
53
The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure
designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each
involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc
ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure
that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC
54
LHCONE: A global infrastructure for the LHC Tier 1 data center – Tier 2 analysis center connectivity (April 2012)
[Figure: map of the LHCONE VRF domains – ESnet (USA), Internet2 (USA), CANARIE (Canada), GÉANT (Europe), NORDUnet (Nordic), DFN (Germany), GARR (Italy), RedIRIS (Spain), SARA (Netherlands), RENATER (France), ASGC and TWAREN (Taiwan), KERONET2 and KISTI (Korea), CUDI (Mexico) – interconnected at regional R&E communication nexus points (e.g., Seattle, Chicago, New York, Washington, Amsterdam, Geneva) and serving Tier 1 centers (BNL, FNAL, TRIUMF, NDGF, DE-KIT, CNAF, PIC, NIKHEF, CC-IN2P3, ASGC, CERN) and many Tier 2/Tier 3 end sites. Data communication links are 10, 20, and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment – LHCONE
• LHCONE could be set up relatively "quickly" because:
– the VRF technology is a standard capability in most core routers, and
– there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic.
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic.
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance.
• See http://lhcone.net
LHCONE is one part of the network infrastructure that supports the LHC
CERN → T1          miles     km
France               350     565
Italy                570     920
UK                   625    1000
Netherlands          625    1000
Germany              700    1185
Spain                850    1400
Nordic              1300    2100
USA – New York      3900    6300
USA – Chicago       4400    7100
Canada – BC         5200    8400
Taiwan              6100    9850
[Figure: A Network Centric View of the LHC – from the detector (~1 PB/s) through the Level 1 and 2 triggers (O(1-10) meters), the Level 3 trigger (O(10-100) meters), and the CERN Computer Center (O(1) km), over the LHC Optical Private Network (LHCOPN) at 50 Gb/s (25 Gb/s ATLAS + 25 Gb/s CMS) to the LHC Tier 1 data centers 500-10,000 km away (Taiwan, Canada, USA-Atlas, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France/CERN), and on via the LHC Open Network Environment (LHCONE) to the LHC Tier 2 analysis centers (universities and physics groups). This is intended to indicate that the physics groups now get their data wherever it is most readily available.]
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed application systems in order to:
– couple existing pockets of code, data, and expertise into "systems of systems"
– break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
– see https://www.es.net/about/science-requirements
• A commonly identified need to support this is that networking must be provided as a "service":
– schedulable, with guaranteed bandwidth – as is done with CPUs and disks
– traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
– some network path characteristics may also be specified – e.g., diversity
– available in the Web Services / Grid Services paradigm
(A sketch of such a reservation request follows.)
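As a minimal sketch of what such a service request might carry – the field names are hypothetical, not the actual OSCARS or NSI schema:

from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class CircuitReservation:
    """Hypothetical guaranteed-bandwidth reservation, illustrating the kinds of
    parameters a schedulable network 'service' has to accept."""
    src_endpoint: str                 # e.g., a site border router port or VLAN
    dst_endpoint: str
    bandwidth_mbps: int               # guaranteed bandwidth, like a CPU or disk allocation
    start: datetime                   # schedulable in advance
    end: datetime
    path_constraints: List[str] = field(default_factory=list)  # e.g., ["diverse-from:circuit-42"]

# Example: reserve 10 Gb/s for an overnight transfer window.
request = CircuitReservation(
    src_endpoint="site-A:xe-1/0/0.3001",
    dst_endpoint="site-B:xe-2/1/0.3001",
    bandwidth_mbps=10_000,
    start=datetime(2014, 3, 30, 22, 0),
    end=datetime(2014, 3, 31, 6, 0),
)
print(request)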
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
– This is typically done by using a "static" routing mechanism:
• e.g., some variation of label-based switching, with the static switch tables set up in advance to define the circuit path (see the sketch below)
– MPLS and OpenFlow are examples of this, and both can transport IP packets.
– Most modern Internet routers have this type of functionality.
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic.
– The virtual circuits can be directed to specific physical network paths when they are set up.
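A toy model of the "static switch tables set up in advance" idea; the switch names, ports, and labels are invented for illustration (real MPLS/OpenFlow tables are installed in router hardware):

# Each switch maps (incoming port, incoming label) -> (outgoing port, outgoing label).
# Installing these entries in advance, end to end, is what defines the circuit path.
FORWARDING_TABLES = {
    "switch-A": {(1, 100): (5, 200)},
    "switch-B": {(2, 200): (7, 300)},
    "switch-C": {(3, 300): (9, 999)},   # 999: pop label, deliver to the end system
}

# The physical topology the circuit was pinned to when it was set up.
LINKS = {("switch-A", 5): ("switch-B", 2), ("switch-B", 7): ("switch-C", 3)}

def follow_circuit(switch, in_port, label):
    """Trace a packet along the pre-established label-switched path."""
    hops = []
    while True:
        out_port, out_label = FORWARDING_TABLES[switch][(in_port, label)]
        hops.append((switch, in_port, label, out_port, out_label))
        nxt = LINKS.get((switch, out_port))
        if nxt is None:                      # egress switch reached
            return hops
        (switch, in_port), label = nxt, out_label

for hop in follow_circuit("switch-A", 1, 100):
    print(hop)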
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information, contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.
60
End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part.
• How are the circuits used?
– End system to end system IP:
• Almost never – very hard unless private address space is used.
– Using public address space can result in leaking routes.
– Using private address space with multi-homed hosts risks allowing backdoors into secure networks.
– End system to end system Ethernet (or other) over a VLAN – a pseudowire:
• Relatively common.
• Interesting example: RDMA over VLAN is likely to be popular in the future.
– The SC11 demo of 40G RDMA over the WAN was very successful.
– The CPU load for RDMA is a small fraction of that for IP.
– The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks).
– Point-to-point connection between routing instances – e.g., BGP at the end points:
• Essentially this is how all current circuits are used, from one site router to another site router.
– Typically site-to-site, or advertising subnets that host clusters, e.g., LHC analysis or data management clusters.
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot.
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering.

Cross-Domain Virtual Circuit Service
Large-scale science collaborations span multiple network domains (administrative units).
– For a circuit service to be useful, it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits.
– E.g., ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains.
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains.
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains.
[Figure: An end-to-end virtual circuit spanning five domains – FNAL (AS3152) [US], ESnet (AS293) [US], GEANT (AS20965) [Europe], DFN (AS680) [Germany], and DESY (AS1754) [Germany]. Each domain runs a local InterDomain Controller (IDC) – OSCARS and AutoBAHN are two such implementations. The domains exchange topology information, the VC setup request is passed from IDC to IDC between the user source and the user destination, and a data plane connection helper operates at each domain ingress/egress point.]
1. The domains exchange topology information containing at least the potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g., an Ethernet VLAN-to-VLAN connection) is facilitated by a helper process at each domain ingress/egress point.
The result is the end-to-end virtual circuit. (A sketch of this domain-to-domain setup chain follows.)
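The domain-to-domain setup in steps 1-3 can be sketched as follows; the domain names reuse the figure, and the controller logic is a deliberate simplification of what IDC/NSI implementations such as OSCARS and AutoBAHN actually do:

# Order of domains along the desired circuit, as in the figure above.
DOMAIN_CHAIN = ["FNAL", "ESnet", "GEANT", "DFN", "DESY"]

def reserve_segment(domain: str, bandwidth_gbps: float) -> bool:
    """Stand-in for a local domain controller authorizing and reserving its segment.
    A real controller would check policy, schedule, and available link capacity."""
    print(f"{domain}: segment of {bandwidth_gbps} Gb/s authorized and reserved")
    return True

def setup_end_to_end_circuit(bandwidth_gbps: float) -> bool:
    """Pass the VC setup request from domain to domain; fail (and, in a real system,
    roll back) if any domain cannot reserve its segment."""
    for domain in DOMAIN_CHAIN:
        if not reserve_segment(domain, bandwidth_gbps):
            print(f"{domain}: reservation failed - circuit not established")
            return False
    print("End-to-end virtual circuit established; the data plane can now be stitched "
          "(e.g., VLAN-to-VLAN) at each domain boundary.")
    return True

setup_end_to_end_circuit(10.0)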
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group.
– Testing is being coordinated in GLIF (the Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking).
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g., CPU and storage scheduling, in a Web Services / Grid framework, so that computing, data access, and data movement can all work together as a predictable system.
• Multi-domain circuit setup is not yet a robust production service, but progress is being made.
• See http://lhcone.net
65
8) Provide R&D consulting and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science.
– With each generation of network transport technology:
• 155 Mb/s was the norm for high-speed networks in 1995;
• 100 Gb/s – 650 times greater – is the norm today.
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to:
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
• and then do the development necessary for applications to make use of the new capabilities.
– Examples of how this methodology drove toward today's capabilities include:
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide-area network paths;
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s.
66
Provide R&D consulting and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR How-to
• Active perfSONAR Services
• Globus Overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations.
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
69
Infrastructure Critical to Science
• The combination of:
– new network architectures in the wide area,
– new network services (such as guaranteed-bandwidth virtual circuits),
– cross-domain network error detection and correction,
– redesigning the site LAN to handle high data throughput,
– automation of data movement systems, and
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.
70
LHC lessons of possible use to the SKA: The similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated/sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.
71
LHC lessons of possible use to the SKA: The lessons
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
– A deep archive (tape-only) copy is probably practical in one location (e.g., the SKA supercomputer center), and this is done at CERN for the LHC.
– The technical aspects of building and operating a centralized working data repository:
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time, and
• high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
militate against a single large data center.
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well.
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that in the case of the SKA the T1 links would come to a centralized, data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
74
LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (A rough throughput model illustrating why is sketched below.)
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
• Re-engineering the site LAN-WAN architecture is critical: the Science DMZ.
• Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
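A rough way to see why error-free paths matter so much is the well-known Mathis et al. estimate, throughput ≈ (MSS/RTT) · 1/√loss. The sketch below applies it with illustrative (not measured) numbers:

from math import sqrt

def mathis_throughput_mbps(mss_bytes: float, rtt_ms: float, loss_rate: float) -> float:
    """Approximate (Reno-style) TCP throughput limit: (MSS/RTT) * 1/sqrt(loss)."""
    rtt_s = rtt_ms / 1000.0
    return (mss_bytes * 8 / rtt_s) * (1 / sqrt(loss_rate)) / 1e6

# Illustrative: a 1500-byte MTU path at a continental-scale RTT of 88 ms.
for loss in (1e-7, 1e-5, 1e-3):
    print(f"loss={loss:.0e}: ~{mathis_throughput_mbps(1460, 88, loss):,.0f} Mb/s")
# Even a small increase in the loss rate collapses achievable throughput on a
# long-RTT path, which is why constant monitoring and error-free links matter.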
75
The Message
Again ... A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
47
Ramp-up of LHC traffic in ESnet
(est of ldquosmallrdquo scale traffic)
LHC
turn
-on
LHC data systemtesting
LHC operationThe transition from testing to operation
was a smooth continuum due toat-scale testing ndash a process that took
more than 5 years
48
6 cont) Evolution of network architecturesFor sustained high data-rate transfers ndash eg from instrument
to data centers ndash a dedicated purpose-built infrastructure is needed
bull The transfer of LHC experiment data from CERN (Tier 0) to the 11 national data centers (Tier 1) uses a network called LHCOPNndash The LHCOPN is a collection of leased 10Gbs optical circuits The role of LHCOPN is to ensure that all data moves from CERN to
the national Tier 1 data centers continuouslybull In addition to providing the working dataset for the analysis groups the
Tier 1 centers in aggregate hold a duplicate copy of the data that is archived at CERN
49
The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward
exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community
bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by
bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
ndash that is only LHC data and compute servers are connected to the OPN
50
The LHC OPN ndash Optical Private Network
UK-T1_RAL
NDGF
FR-CCIN2P3
ES-PIC
DE-KIT
NL-T1
US-FNAL-CMS
US-T1-BNL
CA-TRIUMF
TW-ASCG
IT-NFN-CNAF
CH-CERNLHCOPN physical (abbreviated)
LHCOPN architecture
51
The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1
centers data transfer was to use dedicated physical 10G circuits
Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than
5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)
ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)
bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism
bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose
53
The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure
designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each
involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc
ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure
that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC
54
ESnetUSA
Chicago
New YorkBNL-T1
Internet2USA
Harvard
CANARIECanada
UVic
SimFraU
TRIUMF-T1
UAlb UTorMcGilU
Seattle
TWARENTaiwan
NCU NTU
ASGCTaiwan
ASGC-T1
KERONET2Korea
KNU
LHCONE VRF domain
End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1
Regional RampE communication nexus
Data communication links 10 20 and 30 Gbs
See httplhconenet for details
NTU
Chicago
LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity
NORDUnetNordic
NDGF-T1aNDGF-T1a NDGF-T1c
DFNGermany
DESYGSI DE-KIT-T1
GARRItaly
INFN-Nap CNAF-T1
RedIRISSpain
PIC-T1
SARANetherlands
NIKHEF-T1
RENATERFrance
GRIF-IN2P3
Washington
CUDIMexico
UNAM
CC-IN2P3-T1Sub-IN2P3
CEA
CERNGeneva
CERN-T1
SLAC
GLakes
NE
MidWSoW
Geneva
KISTIKorea
TIFRIndia
India
Korea
FNAL-T1
MIT
CaltechUFlorida
UNebPurU
UCSDUWisc
UltraLightUMich
Amsterdam
GEacuteANT Europe
April 2012
55
The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because
ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for
use by the LHC collaboration that cannot be made available for general RampE traffic
bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance
bull See LHCONEnet
LHCONE is one part of the network infrastructure that supports the LHC
CERN rarrT1 miles kms
France 350 565
Italy 570 920
UK 625 1000
Netherlands 625 1000
Germany 700 1185
Spain 850 1400
Nordic 1300 2100
USA ndash New York 3900 6300
USA - Chicago 4400 7100
Canada ndash BC 5200 8400
Taiwan 6100 9850
[Figure: "A Network Centric View of the LHC." The detector (generating ~1 PB/s) feeds the Level 1 and 2 triggers over O(1-10) meters, the Level 3 trigger over O(10-100) meters, and the CERN Computer Center at O(1) km. The LHC Optical Private Network (LHCOPN), carrying 50 Gb/s (25 Gb/s ATLAS + 25 Gb/s CMS), connects CERN to the LHC Tier 1 data centers in Taiwan, Canada, the USA (ATLAS and CMS), the Nordic countries, the UK, the Netherlands, Germany, Italy, Spain, and France/CERN over 500-10,000 km paths. The LHC Open Network Environment (LHCONE) connects the Tier 1 centers to the many LHC Tier 2 analysis centers at universities and physics groups; this is intended to indicate that the physics groups now get their data wherever it is most readily available.]
57
7) New network services
Point-to-Point Virtual Circuit Service
Why a Circuit Service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences in that they use distributed applications systems in order to
– Couple existing pockets of code, data, and expertise into "systems of systems"
– Break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
– See https://www.es.net/about/science-requirements/
A commonly identified need to support this is that networking must be provided as a "service":
– Schedulable with guaranteed bandwidth – as is done with CPUs and disks
– Traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
– Some network path characteristics may also be specified – e.g. diversity
– Available in the Web Services / Grid Services paradigm (a sketch of such a request is given below)
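As a rough sketch of what "networking as a service" looks like to a client, the fragment below builds a schedulable, guaranteed-bandwidth circuit request. The field names, endpoints, and the idea of POSTing the document to a reservation interface are illustrative assumptions, not the actual OSCARS or NSI message format.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timedelta
import json

@dataclass
class CircuitRequest:
    """Hypothetical point-to-point virtual circuit reservation (illustrative only)."""
    src_endpoint: str        # ingress port/VLAN at the source site
    dst_endpoint: str        # egress port/VLAN at the destination site
    bandwidth_mbps: int      # guaranteed bandwidth
    start: str               # ISO 8601 schedule window
    end: str
    path_constraints: dict   # e.g. diversity from another circuit

start = datetime(2014, 4, 1, 20, 0)
req = CircuitRequest(
    src_endpoint="site-A:ge-1/0/0:vlan-3001",
    dst_endpoint="site-B:xe-2/1/0:vlan-3001",
    bandwidth_mbps=10_000,
    start=start.isoformat(),
    end=(start + timedelta(hours=8)).isoformat(),
    path_constraints={"diversity": "node-disjoint"},
)

# A Web/Grid Services style client would submit this document to the provider's
# reservation interface and poll for RESERVED / ACTIVE states.
print(json.dumps(asdict(req), indent=2))
```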
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet
– This is typically done by using a "static" routing mechanism
• E.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path
– MPLS and OpenFlow are examples of this, and both can transport IP packets
– Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
– The virtual circuits can be directed to specific physical network paths when they are set up (a minimal label-switching sketch is given below)
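The sketch below illustrates the static label-switching idea: each switch forwards on an incoming (port, label) pair and rewrites the label, so the circuit's path is fixed entirely by pre-installed table entries. The topology and label values are invented; real MPLS or OpenFlow deployments install the equivalent state through their own control planes.

```python
# Static label-switched paths: each node maps (in_port, in_label) -> (out_port, out_label).
# The tables below are made-up values standing in for MPLS/OpenFlow forwarding state
# installed in advance by a circuit controller.
TABLES = {
    "switch-A": {("p1", 100): ("p3", 210)},
    "switch-B": {("p0", 210): ("p2", 310)},
    "switch-C": {("p1", 310): ("p4", 999)},   # 999: pop label, deliver to the end site
}
LINKS = {("switch-A", "p3"): ("switch-B", "p0"),
         ("switch-B", "p2"): ("switch-C", "p1")}

def forward(node, in_port, label):
    """Follow the pre-installed circuit state hop by hop and return the path taken."""
    path = []
    while True:
        out_port, out_label = TABLES[node][(in_port, label)]
        path.append((node, in_port, label, out_port, out_label))
        if (node, out_port) not in LINKS:        # reached the egress edge of the circuit
            return path
        node, in_port = LINKS[(node, out_port)]
        label = out_label

for hop in forward("switch-A", "p1", 100):
    print(hop)
```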
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead, Chin Guok, chin@es.net)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references
• OSCARS received a 2013 "R&D 100" award
60
End User View of Circuits – How They Use Them
• Who are the "users"?
– Sites, for the most part
• How are the circuits used?
– End system to end system, IP
• Almost never – very hard unless private address space is used
– Using public address space can result in leaking routes
– Using private address space with multi-homed hosts risks allowing backdoors into secure networks
– End system to end system, Ethernet (or other) over VLAN – a pseudowire
• Relatively common
• Interesting example: RDMA over VLAN, likely to be popular in the future
– The SC11 demo of 40G RDMA over the WAN was very successful
– CPU load for RDMA is a small fraction of that of IP
– The guaranteed network of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best-effort networks)
– Point-to-point connection between routing instances – e.g. BGP at the end points
• Essentially this is how all current circuits are used, from one site router to another site router
– Typically site-to-site, or to advertise subnets that host clusters, e.g. LHC analysis or data management clusters
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering
62
Cross-Domain Virtual Circuit Service
Circuits typically have to cross several network domains (administrative units)
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
– e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains
[Figure: an end-to-end virtual circuit set up across a chain of domains – FNAL (AS3152) [US], ESnet (AS293) [US], GÉANT (AS20965) [Europe], DFN (AS680) [Germany], and DESY (AS1754) [Germany]. Each domain has a local Inter-Domain Controller (IDC; OSCARS in ESnet, AutoBAHN in GÉANT). Topology exchange and VC setup requests pass from the user source to the user destination domain by domain, with a data plane connection helper at each domain ingress/egress point.]
1. The domains exchange topology information containing at least potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. an Ethernet VLAN-to-VLAN connection) is facilitated by a helper process.
(A toy walk-through of this domain-by-domain handoff is sketched below.)
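As a very rough illustration of realm 2, the sketch below passes a setup request along a chain of domains, with each domain's controller reserving its own segment before handing off to the next, and rolling back if any domain refuses. The domain names, capacities, and reservation logic are invented placeholders, not the IDC or NSI protocol.

```python
# Toy walk-through of a multi-domain VC setup request (illustrative only).
# Each entry stands in for a per-domain controller (OSCARS, AutoBAHN, ...);
# the chain and capacities are invented for the example.
DOMAIN_CHAIN = ["FNAL", "ESnet", "GEANT", "DFN", "DESY"]
AVAILABLE_GBPS = {"FNAL": 40, "ESnet": 100, "GEANT": 100, "DFN": 40, "DESY": 40}

def setup_circuit(requested_gbps):
    """Pass the setup request along the chain; every domain must authorize
    and reserve its segment or the whole end-to-end circuit fails."""
    reserved = []
    for domain in DOMAIN_CHAIN:
        if AVAILABLE_GBPS[domain] < requested_gbps:
            for d in reserved:                       # roll back partial reservations
                AVAILABLE_GBPS[d] += requested_gbps
            return f"REJECTED at {domain}"
        AVAILABLE_GBPS[domain] -= requested_gbps     # commit this domain's segment
        reserved.append(domain)
    return "end-to-end circuit RESERVED: " + " -> ".join(reserved)

print(setup_circuit(10))   # fits in every domain
print(setup_circuit(50))   # exceeds what some domains can offer
```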
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See http://lhcone.net
65
8) Provide R&D, consulting, and a knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
– With each generation of network transport technology:
• 155 Mb/s was the norm for high speed networks in 1995
• 100 Gb/s – 650 times greater – is the norm today
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
• and then to do the development necessary for applications to make use of the new capabilities
– Examples of how this methodology drove toward today's capabilities include
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths (a minimal parallel-stream sketch follows)
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
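The parallel disk/network I/O idea can be sketched briefly: split the dataset into chunks and move the chunks over several concurrent TCP connections so that no single stream limits the aggregate rate. The loopback endpoint, port, and chunking below are invented for a self-contained demo; production movers (GridFTP-style tools) implement the same pattern with far more care.

```python
import socket
import threading
from concurrent.futures import ThreadPoolExecutor

HOST, PORT, STREAMS = "127.0.0.1", 9009, 4       # illustrative values, not real endpoints
DATA = b"x" * (4 * 1024 * 1024)                  # stand-in for a large dataset

def drain(conn):
    # The "remote" side of one stream: read until the sender closes.
    with conn:
        while conn.recv(65536):
            pass

def accept_streams(srv, n):
    # Accept n parallel connections and drain each one in its own thread.
    threads = []
    for _ in range(n):
        conn, _addr = srv.accept()
        t = threading.Thread(target=drain, args=(conn,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

def send_chunk(chunk):
    # One parallel stream: its own TCP connection carrying one slice of the data.
    with socket.create_connection((HOST, PORT)) as s:
        s.sendall(chunk)
    return len(chunk)

srv = socket.create_server((HOST, PORT))         # bind before any client connects
sink = threading.Thread(target=accept_streams, args=(srv, STREAMS))
sink.start()

size = len(DATA) // STREAMS
slices = [DATA[i * size:(i + 1) * size] for i in range(STREAMS)]
with ThreadPoolExecutor(max_workers=STREAMS) as pool:
    total = sum(pool.map(send_chunk, slices))

sink.join()
srv.close()
print(f"moved {total} bytes over {STREAMS} parallel TCP streams")
```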
66
Provide R&D, consulting, and a knowledge base
• Providing consulting on the problems that data-intensive projects are having in effectively using the network is critical
Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone
The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations
67
The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations (a minimal Linux TCP tuning check is sketched below)
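In the spirit of the Linux TCP tuning pages on fasterdata.es.net, the check below reads a host's current TCP buffer limits and congestion control setting from /proc/sys so they can be compared against a site's chosen values; the threshold used here is an illustrative example, not an ESnet recommendation.

```python
from pathlib import Path

# Kernel settings commonly discussed in host-tuning guides (Linux-specific paths).
SETTINGS = [
    "net.core.rmem_max",
    "net.core.wmem_max",
    "net.ipv4.tcp_rmem",
    "net.ipv4.tcp_wmem",
    "net.ipv4.tcp_congestion_control",
]

# Illustrative minimum for a long, fat network path; tune to your own site policy.
MIN_BYTES = 64 * 1024 * 1024   # example: 64 MB maximum socket buffer

def read_sysctl(name):
    """Read a sysctl value via /proc/sys (works on Linux without extra tools)."""
    path = Path("/proc/sys") / name.replace(".", "/")
    return path.read_text().strip() if path.exists() else None

for name in SETTINGS:
    print(f"{name} = {read_sysctl(name)}")

rmem = read_sysctl("net.core.rmem_max")
if rmem is not None and int(rmem) < MIN_BYTES:
    print(f"warning: net.core.rmem_max is below the example threshold of {MIN_BYTES} bytes")
```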
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
69
Infrastructure Critical to Science
• The combination of
– New network architectures in the wide area
– New network services (such as guaranteed bandwidth virtual circuits)
– Cross-domain network error detection and correction
– Redesigning the site LAN to handle high data throughput
– Automation of data movement systems
– Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
• Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKA
The lessons:
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one location
– A deep archive (tape only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository:
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
• high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
militate against a single large data center
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply:
There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s)
• It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC
– In fact it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical: the Science DMZ
Workflow management systems that automate the data movement will have to be designed and tested
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on
75
The Message
Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C., Ciena Corp., Ottawa, ON, Canada. Communications Magazine, IEEE, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
49
The LHC OPN ndash Optical Private Networkbull While the LHCOPN was a technically straightforward
exercise ndash establishing 10 Gbs links between CERN and the Tier 1 data centers for distributing the detector output data ndash there were several aspects that were new to the RampE community
bull The issues related to the fact that most sites connected to the RampE WAN infrastructure through a site firewall and the OPN was intended to bypass site firewalls in order to achieve the necessary performance The security issues were the primarily ones and were addressed by
bull Using a private address space that hosted only LHC Tier 1 systems (see [LHCOPN Sec])
ndash that is only LHC data and compute servers are connected to the OPN
50
The LHC OPN ndash Optical Private Network
UK-T1_RAL
NDGF
FR-CCIN2P3
ES-PIC
DE-KIT
NL-T1
US-FNAL-CMS
US-T1-BNL
CA-TRIUMF
TW-ASCG
IT-NFN-CNAF
CH-CERNLHCOPN physical (abbreviated)
LHCOPN architecture
51
The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1
centers data transfer was to use dedicated physical 10G circuits
Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than
5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)
ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)
bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism
bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose
53
The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure
designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each
involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc
ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure
that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC
54
ESnetUSA
Chicago
New YorkBNL-T1
Internet2USA
Harvard
CANARIECanada
UVic
SimFraU
TRIUMF-T1
UAlb UTorMcGilU
Seattle
TWARENTaiwan
NCU NTU
ASGCTaiwan
ASGC-T1
KERONET2Korea
KNU
LHCONE VRF domain
End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1
Regional RampE communication nexus
Data communication links 10 20 and 30 Gbs
See httplhconenet for details
NTU
Chicago
LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity
NORDUnetNordic
NDGF-T1aNDGF-T1a NDGF-T1c
DFNGermany
DESYGSI DE-KIT-T1
GARRItaly
INFN-Nap CNAF-T1
RedIRISSpain
PIC-T1
SARANetherlands
NIKHEF-T1
RENATERFrance
GRIF-IN2P3
Washington
CUDIMexico
UNAM
CC-IN2P3-T1Sub-IN2P3
CEA
CERNGeneva
CERN-T1
SLAC
GLakes
NE
MidWSoW
Geneva
KISTIKorea
TIFRIndia
India
Korea
FNAL-T1
MIT
CaltechUFlorida
UNebPurU
UCSDUWisc
UltraLightUMich
Amsterdam
GEacuteANT Europe
April 2012
55
The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because
ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for
use by the LHC collaboration that cannot be made available for general RampE traffic
bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance
bull See LHCONEnet
LHCONE is one part of the network infrastructure that supports the LHC
CERN rarrT1 miles kms
France 350 565
Italy 570 920
UK 625 1000
Netherlands 625 1000
Germany 700 1185
Spain 850 1400
Nordic 1300 2100
USA ndash New York 3900 6300
USA - Chicago 4400 7100
Canada ndash BC 5200 8400
Taiwan 6100 9850
CERN Computer Center
The LHC Optical Private Network
(LHCOPN)
LHC Tier 1Data Centers
LHC Tier 2 Analysis Centers
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups Universities
physicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
The LHC Open Network
Environment(LHCONE)
50 Gbs (25Gbs ATLAS 25Gbs CMS)
detector
Level 1 and 2 triggers
Level 3 trigger
O(1-10) meter
O(10-100) meters
O(1) km
1 PBs
500-10000 km
This is intended to indicate that the physics
groups now get their datawherever it is most readily
available
A Network Centric View of the LHC
Taiwan Canada USA-Atlas USA-CMS
Nordic
UK
Netherlands Germany Italy
Spain
FranceCERN
57
7) New network servicesPoint-to-Point Virtual Circuit Service
Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly
consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of
systemsrdquondash Break up the task of massive data analysis and use data compute and
storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements
A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work
well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm
58
Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual
circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism
bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path
ndash MPLS and OpenFlow are examples of this and both can transport IP packets
ndash Most modern Internet routers have this type of functionality
bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths
when they are set up
59
Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service
(For more information contact the project lead Chin Guok chinesnet)
bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references
bull OSCARS received a 2013 ldquoRampD 100rdquo award
60
End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo
ndash Sites for the most part
bull How are the circuits usedndash End system to end system IP
bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure
networks
ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future
ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP
protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
ndash Point-to-point connection between routing instance ndash eg BGP at the end points
bull Essentially this is how all current circuits are used from one site router to another site router
ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters
61
End User View of Circuits – How They Use Them
• When are the circuits used?
– Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering

Cross-Domain Virtual Circuit Service
Science collaborations span many network domains (administrative units)
– For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
– E.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains
63
Inter-Domain Control Protocol
• There are two realms involved:
1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains
[Figure: an end-to-end virtual circuit crossing five domains – FNAL (AS3152)[US], ESnet (AS293)[US], GEANT (AS20965)[Europe], DFN (AS680)[Germany], DESY (AS1754)[Germany] – each with a local Inter-Domain Controller (IDC, e.g. OSCARS or AutoBAHN). The domains exchange topology, the VC setup request is passed from IDC to IDC, and data plane connection helpers sit at each domain ingress/egress point between the user source and the user destination.]
1. The domains exchange topology information containing at least potential VC ingress and egress points
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved
3. The data plane connection (e.g. Ethernet VLAN-to-VLAN connection) is facilitated by a helper process at each domain ingress/egress point
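The three steps above can be pictured with a toy model in which a setup request is passed controller to controller along a pre-agreed path, each domain reserving (or refusing) its segment. The Domain class and the capacity numbers are invented for illustration; this is not the IDC or OGF NSI protocol.

```python
# Toy model of multi-domain circuit setup: a VC setup request is passed
# from domain controller to domain controller, each reserving its own
# segment of the end-to-end circuit. Purely illustrative.
from dataclasses import dataclass, field

@dataclass
class Domain:
    name: str
    free_capacity_gbps: float
    reservations: list = field(default_factory=list)

    def reserve_segment(self, circuit_id: str, gbps: float) -> bool:
        """Authorize and reserve this domain's segment of the circuit."""
        if gbps > self.free_capacity_gbps:
            return False
        self.free_capacity_gbps -= gbps
        self.reservations.append((circuit_id, gbps))
        return True

def setup_circuit(path: list, circuit_id: str, gbps: float) -> bool:
    """Pass the setup request domain to domain along the pre-agreed path.
    If any domain refuses, roll back the segments already reserved."""
    done = []
    for domain in path:
        if not domain.reserve_segment(circuit_id, gbps):
            for d in done:  # roll back the partial reservation
                d.free_capacity_gbps += gbps
                d.reservations.remove((circuit_id, gbps))
            return False
        done.append(domain)
    return True

if __name__ == "__main__":
    path = [Domain("FNAL", 10), Domain("ESnet", 40),
            Domain("GEANT", 40), Domain("DFN", 20), Domain("DESY", 10)]
    print(setup_circuit(path, "vc-42", gbps=5.0))  # True: all segments reserved
```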
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
– Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grid framework, so that computing, data access, and data movement can all work together as a predictable system
• Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net
65
8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
– With each generation of network transport technology:
• 155 Mb/s was the norm for high-speed networks in 1995
• 100 Gb/s – some 650 times greater – is the norm today
• R&D groups involving hardware engineers, computer scientists, and application specialists worked to:
• first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
• and then do the development necessary for applications to make use of the new capabilities
– Examples of how this methodology drove toward today's capabilities include (see the sketch below):
• experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC-12 (622 Mb/s) wide area network paths
• recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
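As a rough illustration of the parallel network I/O idea in the bullets above, the sketch below splits a file into byte ranges and pushes each range over its own TCP stream. A real receiver would have to reassemble the ranges, and production tools (e.g. GridFTP/Globus) handle that plus checksums and restart; all names and parameters here are illustrative.

```python
# Minimal sketch of parallel-stream transfer: split a file into N byte
# ranges and send each range over its own TCP connection, so that one
# stream's congestion window does not limit the whole transfer.
import socket
import threading

def send_range(host: str, port: int, path: str, offset: int, length: int) -> None:
    """Send one byte range of the file over its own TCP stream."""
    with open(path, "rb") as f, socket.create_connection((host, port)) as s:
        f.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = f.read(min(1 << 20, remaining))  # 1 MiB chunks
            if not chunk:
                break
            s.sendall(chunk)
            remaining -= len(chunk)

def parallel_send(host: str, port: int, path: str, size: int, streams: int = 4) -> None:
    """Launch one thread per stream, each carrying a contiguous byte range."""
    per_stream = size // streams
    threads = []
    for i in range(streams):
        offset = i * per_stream
        length = per_stream if i < streams - 1 else size - offset
        t = threading.Thread(target=send_range, args=(host, port, path, offset, length))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
```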
66
Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations
67
The knowledge base
• http://fasterdata.es.net topics:
– Network Architecture, including the Science DMZ model
– Host Tuning (see the sketch after this list)
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
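As an illustration of the host-tuning topic, this sketch compares a path's bandwidth-delay product with the Linux kernel's maximum TCP receive buffer. The sysctl path is standard Linux, but the example path numbers and the check itself are an illustration of the tuning reasoning, not a fasterdata.es.net recipe.

```python
# Illustrative host-tuning check (Linux): compare a path's
# bandwidth-delay product (BDP) with the kernel's maximum TCP buffer.
# Guidance is typically that the maximum buffer should be at least the
# BDP for a single stream to fill the path.

def bdp_bytes(bandwidth_gbps: float, rtt_ms: float) -> int:
    """Bandwidth-delay product = bandwidth x round-trip time."""
    return int(bandwidth_gbps * 1e9 / 8 * rtt_ms / 1e3)

def max_tcp_rmem() -> int:
    """Read the kernel's maximum TCP receive buffer ('min default max')."""
    with open("/proc/sys/net/ipv4/tcp_rmem") as f:
        return int(f.read().split()[2])

if __name__ == "__main__":
    # Example: a 10 Gb/s coast-to-coast path with ~90 ms RTT (hypothetical)
    needed = bdp_bytes(bandwidth_gbps=10, rtt_ms=90)
    have = max_tcp_rmem()
    print(f"BDP: {needed / 2**20:.0f} MiB, kernel max receive buffer: {have / 2**20:.0f} MiB")
    if have < needed:
        print("Buffer smaller than BDP - a single TCP stream cannot fill this path.")
```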
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
69
Infrastructure Critical to Science
• The combination of:
– new network architectures in the wide area
– new network services (such as guaranteed bandwidth virtual circuits)
– cross-domain network error detection and correction
– redesigning the site LAN to handle high data throughput
– automation of data movement systems
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instruments take data continuously
• The data is generated/sent to a single location and then distributed to science groups
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKA
The lessons:
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
– A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC
– The technical aspects of building and operating a centralized working data repository:
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
• high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
militate against a single large data center.
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
– It decentralizes costs and involves many countries directly in the telescope infrastructure
– It divides up the network load, especially on the expensive trans-ocean links
– It divides up the cache I/O load across distributed sites
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that in the case of the SKA the T1 links would come to a centralized, data-distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, e.g., are implementing LHCONE
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (see the throughput sketch after this list). New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
Re-engineering the site LAN-WAN architecture is critical: the Science DMZ.
Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
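To see why error-free long-RTT paths matter so much, the sketch below evaluates the well-known Mathis et al. estimate of loss-limited TCP throughput, rate ≈ (MSS/RTT) · (1.22/√p). The example path parameters are hypothetical.

```python
# Loss-limited TCP throughput estimate (Mathis et al., 1997):
#   rate <= (MSS / RTT) * (C / sqrt(p)),  C ~ 1.22 for Reno-style TCP.
# Shows why a long-RTT path must be kept essentially loss-free.
from math import sqrt

def mathis_rate_gbps(mss_bytes: int, rtt_ms: float, loss_prob: float) -> float:
    rate_bps = (mss_bytes * 8 / (rtt_ms / 1e3)) * (1.22 / sqrt(loss_prob))
    return rate_bps / 1e9

if __name__ == "__main__":
    # A hypothetical trans-Atlantic path: MSS ~ 1460 bytes, 150 ms RTT
    for p in (1e-7, 1e-5, 1e-3):
        print(f"loss {p:g}: ~{mathis_rate_gbps(1460, 150, p):.3f} Gb/s per stream")
```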
75
The Message
Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more/
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document"
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
7) New network servicesPoint-to-Point Virtual Circuit Service
Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly
consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of
systemsrdquondash Break up the task of massive data analysis and use data compute and
storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements
A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work
well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm
58
Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual
circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism
bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path
ndash MPLS and OpenFlow are examples of this and both can transport IP packets
ndash Most modern Internet routers have this type of functionality
bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths
when they are set up
59
Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service
(For more information contact the project lead Chin Guok chinesnet)
bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references
bull OSCARS received a 2013 ldquoRampD 100rdquo award
60
End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo
ndash Sites for the most part
bull How are the circuits usedndash End system to end system IP
bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure
networks
ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future
ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP
protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
ndash Point-to-point connection between routing instance ndash eg BGP at the end points
bull Essentially this is how all current circuits are used from one site router to another site router
ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters
61
End User View of Circuits ndash How They Use Thembull When are the circuits used
ndash Mostly to solve a specific problem that the general infrastructure cannot
bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering
network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains
involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET
(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains
63
Inter-Domain Control Protocolbull There are two realms involved
1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains
2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains
FNAL (AS3152)[US]
ESnet (AS293)[US]
GEANT (AS20965)[Europe]
DFN (AS680)[Germany]
DESY (AS1754)[Germany]
Topology exchange
VC setup request
Local InterDomain
Controller
Local IDC
Local IDC
Local IDC
Local IDC
VC setup request
VC setup request
VC setup request
OSCARS
User source
User destination
VC setup request
data plane connection helper at each domain ingressegress point
data plane connection helper at each domain ingressegress point
1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to
domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process
The end-to-end virtual circuit
AutoBAHN
64
Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved
into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated
Facility - an international virtual organization that promotes the paradigm of lambda networking)
bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet
65
8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the
network to support data-intensive sciencendash With each generation of network transport technology
bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and
application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo
end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of
the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities
includebull experiments in the 1990s in using parallel disk IO and parallel network IO
together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths
bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs
66
Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects
are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build
a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at
httpfasterdataesnet and contains contributions from several organizations
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional
centers) has worked wellndash It decentralizes costs and involves many countries directly in the
telescope infrastructurendash It divides up the network load especially on the expensive trans-
ocean linksndash It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKARegardless of distributed vs centralized working data
repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source
to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a
centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue
bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC
ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
78
References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations
httpwwwperfsonarnet
httppspsperfsonarnet
[REQ] httpswwwesnetaboutscience-requirements
[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010
(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )
[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
51
The LHC OPN ndash Optical Private NetworkNBbull In 2005 the only way to handle the CERN (T0) to Tier 1
centers data transfer was to use dedicated physical 10G circuits
Today in most RampE networks (where 100 Gbs links are becoming the norm) the LHCOPN could be provided using virtual circuits implemented with MPLS or OpenFlow network overlaysndash The ESnet part of the LHCOPN has used this approach for more than
5 years ndash in fact this is what ESnetrsquos OSCARS virtual circuit system was originally designed for (see below)
ndash However such an international-scale virtual circuit infrastructure would have to be carefully tested before taking over the LHCOPN role
Managing large-scale science traffic in a shared infrastructure
The traffic from the Tier 1 data centers to the Tier 2 sites (mostly universities) where the data analysis is done is now large enough that it must be managed separately from the general RampE trafficndash In aggregate the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1ndash (there are about 170 Tier 2 sites)
bull Managing this with all possible combinations of Tier 2 ndash Tier 2 flows (potentially 170 x 170) cannot be done just using a virtual circuit service ndash it is a relatively heavy-weight mechanism
bull Special infrastructure is required for this The LHCrsquos Open Network Environment ndash LHCONE ndash was designed for this purpose
53
The LHCrsquos Open Network Environment ndash LHCONELHCONE provides a private managed infrastructure
designed for LHC Tier 2 traffic (and likely other large-data science projects in the future)
The approach is an overlay network whose architecture is a collection of routed ldquocloudsrdquo using address spaces restricted to subnets that are used by LHC systemsndash The clouds are mostly local to a network domain (eg one for each
involved domain ndash ESnet GEANT (ldquofrontsrdquo for the NRENs) Internet2 (fronts for the US universities) etc
ndash The clouds (VRFs) are interconnected by point-to-point circuits provided by various entities (mostly the domains involved)
In this way the LHC traffic will use circuits designated by the network engineersndash To ensure continued good performance for the LHC and to ensure
that other traffic is not impacted ndash this is critical because apart from the LHCOPN the RampE networks are funded for the benefit of the entire RampE community not just the LHC
54
ESnetUSA
Chicago
New YorkBNL-T1
Internet2USA
Harvard
CANARIECanada
UVic
SimFraU
TRIUMF-T1
UAlb UTorMcGilU
Seattle
TWARENTaiwan
NCU NTU
ASGCTaiwan
ASGC-T1
KERONET2Korea
KNU
LHCONE VRF domain
End sites ndash LHC Tier 2 or Tier 3 unless indicated as Tier 1
Regional RampE communication nexus
Data communication links 10 20 and 30 Gbs
See httplhconenet for details
NTU
Chicago
LHCONE A global infrastructure for the LHC Tier1 data center ndash Tier 2 analysis center connectivity
NORDUnetNordic
NDGF-T1aNDGF-T1a NDGF-T1c
DFNGermany
DESYGSI DE-KIT-T1
GARRItaly
INFN-Nap CNAF-T1
RedIRISSpain
PIC-T1
SARANetherlands
NIKHEF-T1
RENATERFrance
GRIF-IN2P3
Washington
CUDIMexico
UNAM
CC-IN2P3-T1Sub-IN2P3
CEA
CERNGeneva
CERN-T1
SLAC
GLakes
NE
MidWSoW
Geneva
KISTIKorea
TIFRIndia
India
Korea
FNAL-T1
MIT
CaltechUFlorida
UNebPurU
UCSDUWisc
UltraLightUMich
Amsterdam
GEacuteANT Europe
April 2012
55
The LHCrsquos Open Network Environment ndash LHCONEbull LHCONE could be set up relatively ldquoquicklyrdquo because
ndash The VRF technology is a standard capability in most core routers andndash there is capacity in the RampE community that can be made available for
use by the LHC collaboration that cannot be made available for general RampE traffic
bull LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
bull From the point of view of the end sites they see a LHC-specific environment where they can reach all other LHC sites with good performance
bull See LHCONEnet
LHCONE is one part of the network infrastructure that supports the LHC
CERN rarrT1 miles kms
France 350 565
Italy 570 920
UK 625 1000
Netherlands 625 1000
Germany 700 1185
Spain 850 1400
Nordic 1300 2100
USA ndash New York 3900 6300
USA - Chicago 4400 7100
Canada ndash BC 5200 8400
Taiwan 6100 9850
CERN Computer Center
The LHC Optical Private Network
(LHCOPN)
LHC Tier 1Data Centers
LHC Tier 2 Analysis Centers
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups Universities
physicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
Universitiesphysicsgroups
The LHC Open Network
Environment(LHCONE)
50 Gbs (25Gbs ATLAS 25Gbs CMS)
detector
Level 1 and 2 triggers
Level 3 trigger
O(1-10) meter
O(10-100) meters
O(1) km
1 PBs
500-10000 km
This is intended to indicate that the physics
groups now get their datawherever it is most readily
available
A Network Centric View of the LHC
Taiwan Canada USA-Atlas USA-CMS
Nordic
UK
Netherlands Germany Italy
Spain
FranceCERN
57
7) New network servicesPoint-to-Point Virtual Circuit Service
Why a Circuit Servicebull Geographic distribution of resources is seen as a fairly
consistent requirement across the large-scale sciences in that they use distributed applications systems in order tondash Couple existing pockets of code data and expertise into ldquosystems of
systemsrdquondash Break up the task of massive data analysis and use data compute and
storage resources that are located at the collaboratorrsquos sitesndash See httpswwwesnetaboutscience-requirements
A commonly identified need to support this is that networking must be provided as a ldquoservicerdquo Schedulable with guaranteed bandwidth ndash as is done with CPUs and disksndash Traffic isolation that allows for using non-standard protocols that will not work
well in a shared infrastructurendash Some network path characteristics may also be specified ndash eg diversityndash Available in Web Services Grid Services paradigm
58
Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual
circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism
bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path
ndash MPLS and OpenFlow are examples of this and both can transport IP packets
ndash Most modern Internet routers have this type of functionality
bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths
when they are set up
59
Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service
(For more information contact the project lead Chin Guok chinesnet)
bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references
bull OSCARS received a 2013 ldquoRampD 100rdquo award
60
End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo
ndash Sites for the most part
bull How are the circuits usedndash End system to end system IP
bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure
networks
ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future
ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP
protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
ndash Point-to-point connection between routing instance ndash eg BGP at the end points
bull Essentially this is how all current circuits are used from one site router to another site router
ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters
61
End User View of Circuits ndash How They Use Thembull When are the circuits used
ndash Mostly to solve a specific problem that the general infrastructure cannot
bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering
network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains
involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET
(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains
63
Inter-Domain Control Protocolbull There are two realms involved
1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains
2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains
FNAL (AS3152)[US]
ESnet (AS293)[US]
GEANT (AS20965)[Europe]
DFN (AS680)[Germany]
DESY (AS1754)[Germany]
Topology exchange
VC setup request
Local InterDomain
Controller
Local IDC
Local IDC
Local IDC
Local IDC
VC setup request
VC setup request
VC setup request
OSCARS
User source
User destination
VC setup request
data plane connection helper at each domain ingressegress point
data plane connection helper at each domain ingressegress point
1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to
domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process
The end-to-end virtual circuit
AutoBAHN
64
Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved
into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated
Facility - an international virtual organization that promotes the paradigm of lambda networking)
bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet
65
8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the
network to support data-intensive sciencendash With each generation of network transport technology
bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and
application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo
end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of
the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities
includebull experiments in the 1990s in using parallel disk IO and parallel network IO
together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths
bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs
66
Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects
are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build
a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at
httpfasterdataesnet and contains contributions from several organizations
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKA: The lessons
• The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
– A deep archive (tape-only) copy is probably practical in one location (e.g., the SKA supercomputer center), and this is done at CERN for the LHC.
– The technical aspects of building and operating a centralized working data repository –
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time, and
• high-speed WAN connections to accept all data from the telescope site and then send all of that data out to the science sites –
militate against a single large data center (a back-of-envelope illustration follows below).
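A back-of-envelope calculation makes the WAN and cache-I/O side of that argument concrete. All numbers below are assumptions chosen only for illustration; they are not SKA or LHC requirements.

# Illustrative arithmetic only (all numbers assumed, not SKA/LHC figures):
# the WAN egress and cache I/O that a single centralized working-data
# repository would have to sustain.
INGEST_GBPS = 100        # assumed continuous flow arriving from the telescope site
SITES = 20               # assumed number of remote analysis sites
FRACTION_PER_SITE = 0.2  # assumed share of the data each site pulls, on average

egress_gbps = SITES * FRACTION_PER_SITE * INGEST_GBPS
cache_io_gbytes_per_s = (INGEST_GBPS + egress_gbps) / 8   # storage must feed in + out

print(f"single-site WAN egress ~ {egress_gbps:.0f} Gb/s "
      f"({egress_gbps / INGEST_GBPS:.0f}x the ingest rate)")
print(f"cache/storage I/O ~ {cache_io_gbytes_per_s:.0f} GB/s at one location")
# Spreading the working data over several regional centers divides both numbers
# across sites and across the expensive trans-ocean links.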
72
LHC lessons of possible use to the SKA: The LHC model of distributed data (multiple regional centers) has worked well
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA: Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that in the case of the SKA the T1 links would come to a centralized, distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue (a rough comparison is sketched below).
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
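The following sketch compares the two options in rough terms of provisioned capacity; the total rate, the number of Tier 1 centers, and the split across them are all assumed values for illustration only.

# Illustrative comparison (all numbers assumed) of the two options above:
#  A) one full-rate long-haul link to a centralized distribution-only node,
#     plus shorter onward links to each Tier 1; or
#  B) a separate long-haul virtual circuit from the telescope site to each Tier 1.
TOTAL_GBPS = 100                              # assumed flow leaving the telescope site
T1_SHARES = [0.30, 0.25, 0.20, 0.15, 0.10]    # assumed split across five Tier 1 centers

option_a_longhaul = [TOTAL_GBPS]                          # one 100 Gb/s long-haul link
option_a_onward = [TOTAL_GBPS * s for s in T1_SHARES]     # regional links from the hub
option_b_longhaul = [TOTAL_GBPS * s for s in T1_SHARES]   # per-T1 long-haul circuits

print("Option A:", len(option_a_longhaul), "long-haul link at", option_a_longhaul,
      "Gb/s; onward links", option_a_onward)
print("Option B:", len(option_b_longhaul), "long-haul circuits at", option_b_longhaul)
# Both options move the same total traffic; they differ in how many expensive
# long-haul circuits are provisioned and where the fan-out happens, which is
# why the slide calls this a cost and engineering issue.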
74
LHC lessons of possible use to the SKA: All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded
• All high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (the loss-vs.-throughput sketch below illustrates why). New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
• Re-engineering the site LAN-WAN architecture is critical: the Science DMZ.
• Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
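The "fragile workhorse" point can be quantified with the well-known Mathis et al. approximation for loss-limited TCP throughput, rate ≈ (MSS / RTT) × (1 / √loss), ignoring a constant of order one. The sketch below evaluates it for an assumed trans-oceanic path; the RTT and loss rates are illustrative assumptions.

# Mathis et al. approximation for loss-limited TCP throughput:
#   rate ~ (MSS / RTT) * (1 / sqrt(loss)), constant of order one ignored.
# The path parameters are assumptions resembling a trans-oceanic R&E path.
from math import sqrt

MSS_BYTES = 1460   # typical Ethernet TCP segment payload
RTT_S = 0.150      # assumed 150 ms round-trip time

def mathis_gbps(loss):
    """Approximate best-case single-stream TCP throughput (Gb/s) at a given loss rate."""
    return (MSS_BYTES * 8 / RTT_S) * (1 / sqrt(loss)) / 1e9

for loss in (1e-2, 1e-4, 1e-6, 1e-8):
    print(f"loss {loss:.0e}: ~{mathis_gbps(loss):6.2f} Gb/s per TCP stream")
# Even a very small loss rate caps a single stream far below 10-100 Gb/s on a
# long-RTT path, which is why these paths must be kept essentially error-free.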
75
The Message (again …)
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Much of the technology and knowledge from the LHC experience is applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
54
LHCONE: A global infrastructure for the LHC Tier 1 data center - Tier 2 analysis center connectivity
[Map, April 2012: the LHCONE VRF domains (ESnet and Internet2 in the USA, CANARIE in Canada, GEANT in Europe, NORDUnet in the Nordic countries, DFN in Germany, GARR in Italy, RedIRIS in Spain, SARA in the Netherlands, RENATER in France, ASGC and TWAREN in Taiwan, KERONET2 in Korea, CUDI in Mexico) interconnecting the Tier 1 centers and the Tier 2/Tier 3 end sites. Legend: end sites are LHC Tier 2 or Tier 3 unless indicated as Tier 1; marked nodes are regional R&E communication nexus points; data communication links are 10, 20, and 30 Gb/s. See http://lhcone.net for details.]
55
The LHC's Open Network Environment - LHCONE
• LHCONE could be set up relatively "quickly" because
  - the VRF technology is a standard capability in most core routers, and
  - there is capacity in the R&E community that can be made available for use by the LHC collaboration that cannot be made available for general R&E traffic
• LHCONE is essentially built as a collection of private overlay networks (like VPNs) that are interconnected by managed links to form a global infrastructure where Tier 2 traffic will get good service and not interfere with general traffic
• From the point of view of the end sites, they see an LHC-specific environment where they can reach all other LHC sites with good performance (a small illustrative sketch of the VRF idea follows)
• See lhcone.net
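To make the VRF idea concrete, the following minimal Python sketch (illustrative only; the VRF names, prefixes, and next hops are hypothetical, not actual LHCONE configuration) models how a router that keeps a separate forwarding table per VRF isolates LHC overlay traffic from the general-purpose table.

    import ipaddress

    # Illustrative only: one forwarding table per VRF, so routes learned for the
    # LHCONE overlay never mix with the general R&E table.
    tables = {
        "default": {},   # general R&E traffic
        "LHCONE": {},    # LHC Tier 2/3 overlay traffic
    }

    def add_route(vrf, prefix, next_hop):
        tables[vrf][ipaddress.ip_network(prefix)] = next_hop

    def lookup(vrf, dst):
        """Longest-prefix match within a single VRF only."""
        dst = ipaddress.ip_address(dst)
        matches = [p for p in tables[vrf] if dst in p]
        if not matches:
            return None
        return tables[vrf][max(matches, key=lambda p: p.prefixlen)]

    # Hypothetical prefixes, purely for illustration
    add_route("LHCONE", "192.0.2.0/24", "lhcone-peer-1")     # an LHC analysis cluster
    add_route("default", "198.51.100.0/24", "transit-peer")

    print(lookup("LHCONE", "192.0.2.10"))    # -> lhcone-peer-1
    print(lookup("default", "192.0.2.10"))   # -> None: not visible outside the VRF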
LHCONE is one part of the network infrastructure that supports the LHC

CERN → Tier 1        miles     km
France                 350     565
Italy                  570     920
UK                     625    1000
Netherlands            625    1000
Germany                700    1185
Spain                  850    1400
Nordic                1300    2100
USA - New York        3900    6300
USA - Chicago         4400    7100
Canada - BC           5200    8400
Taiwan                6100    9850

[Diagram: "A Network Centric View of the LHC." Data flows from the detector (~1 PB/s) through the Level 1 and 2 triggers (O(1-10) m), the Level 3 trigger (O(10-100) m), and the CERN Computer Center (O(1) km), then at 50 Gb/s (25 Gb/s ATLAS, 25 Gb/s CMS) over the LHC Optical Private Network (LHCOPN) to the LHC Tier 1 data centers 500-10,000 km away (Taiwan, Canada, USA-ATLAS, USA-CMS, Nordic, UK, Netherlands, Germany, Italy, Spain, France, CERN), and from there over the LHC Open Network Environment (LHCONE) to the LHC Tier 2 analysis centers at the universities and physics groups. The LHCONE links are intended to indicate that the physics groups now get their data wherever it is most readily available.]
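Since these 500-10,000 km path lengths dominate the behavior of the transport protocols discussed earlier, a quick back-of-the-envelope calculation (a sketch assuming signal propagation at roughly two-thirds of c in fiber, and ignoring equipment latency; distances are taken from the table above) shows why round-trip times approaching 100 ms or more are the norm on these paths.

    # Rough round-trip-time estimate for the CERN -> Tier 1 paths listed above.
    # Assumes propagation at ~2/3 the speed of light in fiber and ignores
    # router/switch latency, so real RTTs are somewhat higher.
    C_FIBER_KM_PER_MS = 300_000 / 1000 * (2 / 3)   # ~200 km per millisecond

    paths_km = {
        "Germany": 1185,
        "Nordic": 2100,
        "USA - New York": 6300,
        "USA - Chicago": 7100,
        "Taiwan": 9850,
    }

    for site, km in paths_km.items():
        rtt_ms = 2 * km / C_FIBER_KM_PER_MS
        print(f"CERN -> {site}: ~{rtt_ms:.0f} ms RTT")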
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a Circuit Service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed applications systems in order to
  - couple existing pockets of code, data, and expertise into "systems of systems"
  - break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
  - see https://www.es.net/about/science-requirements/
• A commonly identified need to support this is that networking must be provided as a "service":
  - schedulable with guaranteed bandwidth - as is done with CPUs and disks
  - traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
  - some network path characteristics may also be specified - e.g., diversity
  - available in the Web Services / Grid Services paradigm
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
  - This is typically done by using a "static" routing mechanism
    • e.g., some variation of label-based switching, with the static switch tables set up in advance to define the circuit path (a small illustrative sketch follows)
  - MPLS and OpenFlow are examples of this, and both can transport IP packets
  - Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" - that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
  - The virtual circuits can be directed to specific physical network paths when they are set up
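The "static switch tables set up in advance" idea can be illustrated with a small sketch (purely illustrative; the labels, node names, and ports are made up, and this is neither MPLS nor OpenFlow code): each switch along the circuit maps an incoming label to an outgoing port and label, so the path is fixed regardless of what normal IP routing would do.

    # Illustrative label-switched path: each node's table maps an incoming label
    # to (outgoing port, outgoing label). The circuit path is provisioned in
    # advance, independent of IP routing decisions.
    label_tables = {
        "ingress": {100: ("port2", 200)},
        "core-1":  {200: ("port5", 300)},
        "core-2":  {300: ("port1", 400)},
        "egress":  {400: ("site-port", None)},   # pop the label, deliver the frame
    }
    path_order = ["ingress", "core-1", "core-2", "egress"]

    def forward(label):
        """Follow the pre-provisioned circuit hop by hop."""
        hops = []
        for node in path_order:
            port, label = label_tables[node][label]
            hops.append((node, port))
            if label is None:
                break
        return hops

    print(forward(100))
    # [('ingress', 'port2'), ('core-1', 'port5'), ('core-2', 'port1'), ('egress', 'site-port')]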
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service (for more information contact the project lead, Chin Guok, chin@es.net); an illustrative sketch of what such a reservation carries follows
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references
• OSCARS received a 2013 "R&D 100" award
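As a rough illustration of what "schedulable with guaranteed bandwidth" means in practice, the sketch below shows the kind of information a circuit reservation has to carry (the field names and the admission check are hypothetical and are not the actual OSCARS or NSI schema): two endpoints, a time window, and a bandwidth guarantee that the service either commits to or rejects.

    from dataclasses import dataclass
    from datetime import datetime, timedelta

    # Hypothetical reservation record -- not the real OSCARS/NSI message format,
    # just the kind of information such a request carries.
    @dataclass
    class CircuitRequest:
        src_endpoint: str        # e.g. a site's border-router port + VLAN
        dst_endpoint: str
        bandwidth_mbps: int      # guaranteed rate for the lifetime of the circuit
        start: datetime
        end: datetime

    def admit(request, link_capacity_mbps, existing):
        """Accept the request only if the guaranteed rates fit on the link
        for the whole requested window (a simple admission-control check)."""
        overlapping = [r for r in existing
                       if r.start < request.end and request.start < r.end]
        committed = sum(r.bandwidth_mbps for r in overlapping)
        return committed + request.bandwidth_mbps <= link_capacity_mbps

    now = datetime(2014, 3, 30, 12, 0)
    existing = [CircuitRequest("siteA:vlan301", "siteB:vlan301", 4000,
                               now, now + timedelta(hours=6))]
    new = CircuitRequest("siteC:vlan410", "siteB:vlan410", 7000,
                         now + timedelta(hours=1), now + timedelta(hours=3))
    print(admit(new, link_capacity_mbps=10000, existing=existing))  # False: 4 + 7 > 10 Gb/s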
60
End User View of Circuits - How They Use Them
• Who are the "users"?
  - Sites, for the most part
• How are the circuits used?
  - End system to end system, IP
    • Almost never - very hard unless private address space is used
      - Using public address space can result in leaking routes
      - Using private address space with multi-homed hosts risks allowing backdoors into secure networks
  - End system to end system, Ethernet (or other) over VLAN - a pseudowire
    • Relatively common
    • Interesting example: RDMA over VLAN, likely to be popular in the future
      - SC11 demo of 40G RDMA over the WAN was very successful
      - CPU load for RDMA is a small fraction of that for IP
      - The guaranteed network characteristics of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fit nicely with circuit services (RDMA performs very poorly on best-effort networks)
  - Point-to-point connection between routing instances - e.g., BGP at the end points
    • Essentially this is how all current circuits are used: from one site router to another site router
    • Typically site-to-site, or advertising subnets that host clusters, e.g., LHC analysis or data management clusters
61
End User View of Circuits - How They Use Them
• When are the circuits used?
  - Mostly to solve a specific problem that the general infrastructure cannot
• Most circuits are used for a guarantee of bandwidth or for user traffic engineering

Cross-Domain Virtual Circuit Service
• Circuits typically cross several network domains (administrative units)
  - For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration to provide end-to-end circuits
  - e.g., ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains
63
Inter-Domain Control Protocol
• There are two realms involved:
  1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
  2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains
[Diagram: the end-to-end virtual circuit runs from a user source at FNAL (AS3152) [US] across ESnet (AS293) [US], GEANT (AS20965) [Europe], and DFN (AS680) [Germany] to a user destination at DESY (AS1754) [Germany]. Each domain runs a local inter-domain controller (IDC) - for example OSCARS and AutoBAHN - that exchanges topology information and passes the VC setup request along, with a data plane connection helper at each domain ingress/egress point.]
1. The domains exchange topology information containing at least the potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g., Ethernet VLAN to VLAN connection) is facilitated by a helper process at each domain boundary (see the sketch below).
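A minimal sketch of the three steps above (illustrative only; the domain names follow the figure, and the "protocol" here is just function calls, not the actual IDC/NSI message exchange): the request is handed from domain controller to domain controller, each one reserving its own segment, and the end-to-end circuit exists only if every domain along the path agrees.

    # Illustrative inter-domain setup: each domain's controller reserves its own
    # segment; the end-to-end circuit exists only if every domain succeeds.
    DOMAIN_PATH = ["FNAL", "ESnet", "GEANT", "DFN", "DESY"]   # from the figure above

    def reserve_segment(domain, ingress, egress, bandwidth_mbps):
        """Stand-in for a local controller (e.g. OSCARS, AutoBAHN) committing
        resources inside its own domain."""
        print(f"{domain}: reserving {bandwidth_mbps} Mb/s {ingress} -> {egress}")
        return True   # in reality this may fail and trigger tear-down upstream

    def setup_circuit(path, bandwidth_mbps):
        segments = []
        for i, domain in enumerate(path):
            ingress = "user-source" if i == 0 else f"{path[i - 1]}/{domain} exchange point"
            egress = "user-destination" if i == len(path) - 1 else f"{domain}/{path[i + 1]} exchange point"
            if not reserve_segment(domain, ingress, egress, bandwidth_mbps):
                return None   # any refusal means no end-to-end circuit
            segments.append((domain, ingress, egress))
        return segments

    setup_circuit(DOMAIN_PATH, 5000)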
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
  - Testing is being coordinated in GLIF (Global Lambda Integrated Facility - an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g., CPU and storage scheduling, in a Web Services / Grids framework, so that computing, data access, and data movement can all work together as a predictable system
• Multi-domain circuit setup is not yet a robust production service, but progress is being made
• See lhcone.net
65
8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
  - With each generation of network transport technology:
    • 155 Mb/s was the norm for high-speed networks in 1995
    • 100 Gb/s - 650 times greater - is the norm today
  - R&D groups involving hardware engineers, computer scientists, and application specialists worked to
    • first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible
    • and then do the development necessary for applications to make use of the new capabilities
  - Examples of how this methodology drove toward today's capabilities include:
    • experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC12 (622 Mb/s) wide area network paths
    • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
66
Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base
http://fasterdata.es.net topics:
  - Network Architecture, including the Science DMZ model
  - Host Tuning
  - Network Tuning
  - Data Transfer Tools
  - Network Performance Testing
  - With special sections on:
    • Linux TCP Tuning
    • Cisco 6509 Tuning
    • perfSONAR Howto
    • Active perfSONAR Services
    • Globus overview
    • Say No to SCP
    • Data Transfer Nodes (DTN)
    • TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations (a small buffer-sizing sketch follows)
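One of the recurring host-tuning items in the knowledge base is sizing TCP buffers to the bandwidth-delay product. The small calculation below (a sketch with assumed example numbers, not values taken from the knowledge base) shows why the default buffer limits of a general-purpose operating system fall far short on 10-100 Gb/s intercontinental paths.

    # Bandwidth-delay product: the TCP window/buffer needed to keep a path "full".
    def bdp_bytes(bandwidth_gbps, rtt_ms):
        return bandwidth_gbps * 1e9 / 8 * (rtt_ms / 1e3)

    # Example numbers: a single 10 Gb/s flow on a ~90 ms trans-Atlantic path.
    needed = bdp_bytes(10, 90)
    typical_default = 4 * 1024 * 1024   # a default maximum of a few MB (assumed)
    print(f"buffer needed: {needed / 1e6:.0f} MB, typical default: {typical_default / 1e6:.0f} MB")
    # ~112 MB is needed, so an untuned host cannot come close to filling the path.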
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment - SKA, ITER, ...
69
Infrastructure Critical to Science
• The combination of
  - new network architectures in the wide area
  - new network services (such as guaranteed bandwidth virtual circuits)
  - cross-domain network error detection and correction
  - redesigning the site LAN to handle high data throughput
  - automation of data movement systems
  - use of appropriate operating system tuning and data transfer tools
  now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated/sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.
71
LHC lessons of possible use to the SKA
The lessons:
• The science data product (the output of the supercomputer center, in the SKA case) is likely too large to have the working data set in one location.
  - A deep archive (tape-only) copy is probably practical in one location (e.g., the SKA supercomputer center), and this is done at CERN for the LHC.
  - The technical aspects of building and operating a centralized working data repository -
    • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time, and
    • high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites -
    militate against a single large data center.
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
  - It decentralizes costs and involves many countries directly in the telescope infrastructure.
  - It divides up the network load, especially on the expensive trans-ocean links.
  - It divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply:
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
  - It might be that in the case of the SKA the T1 links would come to a centralized data-distribution-only node - say in the UK - that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
  - In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
  - In fact, it might well be that the SKA could use the LHCONE infrastructure - that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
74
LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high bandwidth, high data volume, long RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (see the sketch after this list). New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
• Re-engineering the site LAN-WAN architecture is critical: the Science DMZ.
• Workflow management systems that automate the data movement will have to be designed and tested.
  - Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" - simulated operation - building up to at-scale data movement well before instrument turn-on.
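The "fragile workhorse" point can be quantified with the well-known Mathis et al. throughput model for classic (Reno-style) TCP, throughput ≈ MSS / (RTT x sqrt(loss)). The sketch below (illustrative numbers; modern TCP stacks do better, as discussed earlier in this presentation) shows how even a tiny loss rate caps a single stream on a ~100 ms path, which is why these paths must be kept essentially error-free.

    import math

    def tcp_throughput_mbps(mss_bytes, rtt_ms, loss_rate):
        # Mathis et al. model: achievable throughput ~ MSS / (RTT * sqrt(loss))
        bytes_per_sec = mss_bytes / ((rtt_ms / 1e3) * math.sqrt(loss_rate))
        return bytes_per_sec * 8 / 1e6

    # Example: 1460-byte segments on a 100 ms path (roughly CERN to the US).
    for loss in (1e-3, 1e-5, 1e-7):
        print(f"loss rate {loss:g}: ~{tcp_throughput_mbps(1460, 100, loss):.0f} Mb/s per stream")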
75
The Message
Again: a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment - SKA, ITER, ...
76
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
78
References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations
httpwwwperfsonarnet
httppspsperfsonarnet
[REQ] httpswwwesnetaboutscience-requirements
[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010
(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )
[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
57
7) New network services: Point-to-Point Virtual Circuit Service
Why a circuit service?
• Geographic distribution of resources is seen as a fairly consistent requirement across the large-scale sciences, in that they use distributed application systems in order to:
  – couple existing pockets of code, data, and expertise into "systems of systems"
  – break up the task of massive data analysis and use data, compute, and storage resources that are located at the collaborators' sites
  – see https://www.es.net/about/science-requirements/
• A commonly identified need to support this is that networking must be provided as a "service":
  – schedulable with guaranteed bandwidth – as is done with CPUs and disks
  – traffic isolation that allows for using non-standard protocols that will not work well in a shared infrastructure
  – some network path characteristics may also be specified – e.g. diversity
  – available in a Web Services / Grid Services paradigm
58
Point-to-Point Virtual Circuit Service
The way that networks provide such a service is with "virtual circuits" (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internet.
  – This is typically done by using a "static" routing mechanism
    • e.g. some variation of label-based switching, with the static switch tables set up in advance to define the circuit path (a toy sketch below illustrates the idea)
  – MPLS and OpenFlow are examples of this, and both can transport IP packets
  – Most modern Internet routers have this type of functionality
• Such a service channels big data flows into virtual circuits in ways that also allow network operators to do "traffic engineering" – that is, to manage/optimize the use of available network resources and to keep big data flows separate from general traffic
  – The virtual circuits can be directed to specific physical network paths when they are set up
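The "static switch tables" idea can be illustrated with a toy label-switching table – a simplified sketch, not an MPLS or OpenFlow implementation. Each node maps an incoming (port, label) pair to an outgoing (port, label) pair installed in advance by the provisioning system, so traffic follows the pre-provisioned circuit path regardless of IP routing. All switch names, ports, and labels here are hypothetical.

# Toy illustration of label switching along a pre-provisioned circuit path.
# Simplified sketch only -- not MPLS or OpenFlow.

# Per-switch forwarding tables installed in advance:
# (in_port, in_label) -> (out_port, out_label)
tables = {
    "switch-A": {(1, 100): (2, 200)},
    "switch-B": {(7, 200): (3, 300)},
    "switch-C": {(5, 300): (9, 0)},   # label 0: pop label, deliver to the end site
}

# Static view of how the switches are cabled together:
# (switch, out_port) -> (next_switch, next_in_port)
links = {
    ("switch-A", 2): ("switch-B", 7),
    ("switch-B", 3): ("switch-C", 5),
}

def forward(switch, in_port, label):
    """Follow the circuit hop by hop using only the pre-installed tables."""
    path = []
    while True:
        out_port, out_label = tables[switch][(in_port, label)]
        path.append((switch, out_port, out_label))
        if out_label == 0:                      # end of the circuit
            return path
        switch, in_port = links[(switch, out_port)]
        label = out_label

print(forward("switch-A", in_port=1, label=100))
# [('switch-A', 2, 200), ('switch-B', 3, 300), ('switch-C', 9, 0)]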
59
Point-to-Point Virtual Circuit Service
• OSCARS is ESnet's implementation of a virtual circuit service. (For more information contact the project lead, Chin Guok, chin@es.net.)
• See "Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," TERENA Networking Conference 2011, in the references.
• OSCARS received a 2013 "R&D 100" award.
60
End User View of Circuits – How They Use Them
• Who are the "users"?
  – Sites, for the most part
• How are the circuits used?
  – End system to end system, IP:
    • Almost never – very hard unless private address space is used
      – Using public address space can result in leaking routes
      – Using private address space with multi-homed hosts risks allowing backdoors into secure networks
  – End system to end system, Ethernet (or other) over VLAN – a pseudowire:
    • Relatively common
    • Interesting example: RDMA over VLAN is likely to be popular in the future
      – The SC11 demo of 40G RDMA over the WAN was very successful
      – CPU load for RDMA is a small fraction of that for IP
      – The guaranteed network characteristics of circuits (zero loss, no reordering, etc.) required by non-IP protocols like RDMA fit nicely with circuit services (RDMA performs very poorly on best-effort networks)
  – Point-to-point connection between routing instances – e.g. BGP at the end points:
    • Essentially this is how all current circuits are used: from one site router to another site router
    • Typically site-to-site, or advertising subnets that host clusters, e.g. LHC analysis or data management clusters
61
End User View of Circuits – How They Use Them
• When are the circuits used?
  – Mostly to solve a specific problem that the general infrastructure cannot
  – Most circuits are used for a guarantee of bandwidth or for user traffic engineering

Cross-Domain Virtual Circuit Service
Large-scale collaborations span many network domains (administrative units)
  – For a circuit service to be useful it must operate across all R&E domains involved in the science collaboration, to provide end-to-end circuits
  – e.g. ESnet, Internet2 (USA), CANARIE (Canada), GÉANT (EU), SINET (Japan), CERNET and CSTNET (China), KREONET (Korea), TWAREN (Taiwan), AARNet (AU), the European NRENs, the US regionals, etc. are all different domains
63
Inter-Domain Control Protocol
• There are two realms involved:
  1. Domain controllers, like OSCARS, for routing, scheduling, and resource commitment within network domains
  2. The inter-domain protocol that the domain controllers use between network domains, where resources (link capacity) are likely shared and managed by pre-agreements between domains
[Figure: an example multi-domain virtual circuit from a user source at FNAL (AS3152) [US] to a user destination at DESY (AS1754) [Germany], crossing ESnet (AS293) [US], GÉANT (AS20965) [Europe], and DFN (AS680) [Germany]. Each domain runs a local Inter-Domain Controller (IDC) – OSCARS in ESnet, AutoBAHN in GÉANT – which exchanges topology information and passes the VC setup request along, with a data plane connection helper at each domain ingress/egress point.]
1. The domains exchange topology information containing at least potential VC ingress and egress points.
2. A VC setup request (via the IDC protocol) is initiated at one end of the circuit and passed from domain to domain as the VC segments are authorized and reserved.
3. The data plane connection (e.g. Ethernet VLAN-to-VLAN connection) is facilitated by a helper process, producing the end-to-end virtual circuit.
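The domain-by-domain sequence in steps 1–3 can be sketched in a few lines of Python. This is a simplified illustration of the idea, not the actual OSCARS/IDC or NSI message formats, and the domain names and capacities are placeholders modeled on the figure: the request is forwarded along the chain of domain controllers, each one reserving its own segment, and the end-to-end circuit exists only if every domain succeeds.

# Simplified sketch of inter-domain VC setup: each domain controller reserves
# its own segment and forwards the request to the next domain in the path.
# Illustrative only -- not the real OSCARS/IDC/NSI protocol.

class DomainController:
    def __init__(self, name, capacity_gbps):
        self.name = name
        self.available = capacity_gbps   # capacity left on this domain's segment

    def reserve(self, bandwidth_gbps):
        """Authorize and reserve this domain's segment of the circuit."""
        if bandwidth_gbps > self.available:
            return False
        self.available -= bandwidth_gbps
        return True

def setup_circuit(path, bandwidth_gbps):
    """Pass the VC setup request domain to domain; roll back if any segment fails."""
    reserved = []
    for controller in path:
        if controller.reserve(bandwidth_gbps):
            reserved.append(controller)
        else:
            for c in reserved:                    # release segments already held
                c.available += bandwidth_gbps
            return f"setup failed at {controller.name}"
    return "end-to-end circuit established: " + " -> ".join(c.name for c in reserved)

# Hypothetical path modeled on the figure above.
path = [DomainController("ESnet", 100),
        DomainController("GEANT", 40),
        DomainController("DFN", 40)]
print(setup_circuit(path, bandwidth_gbps=10))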
64
Point-to-Point Virtual Circuit Service
• The Inter-Domain Control Protocol work has largely moved into the Open Grid Forum Network Services Interface (NSI) Working Group
  – Testing is being coordinated in GLIF (Global Lambda Integrated Facility – an international virtual organization that promotes the paradigm of lambda networking)
• To recap: the virtual circuit service provides the network as a "service" that can be combined with other services, e.g. CPU and storage scheduling, in a Web Services / Grid framework, so that computing, data access, and data movement can all work together as a predictable system
• Multi-domain circuit setup is not yet a robust production service, but progress is being made
  – See lhcone.net
65
8) Provide R&D, consulting, and knowledge base
• R&D drove most of the advances that make it possible for the network to support data-intensive science
  – With each generation of network transport technology:
    • 155 Mb/s was the norm for high-speed networks in 1995
    • 100 Gb/s – 650 times greater – is the norm today
  – R&D groups involving hardware engineers, computer scientists, and application specialists worked to
    • first demonstrate in a research environment that "filling the network pipe" end-to-end (application to application) was possible,
    • and then do the development necessary for applications to make use of the new capabilities
  – Examples of how this methodology drove toward today's capabilities include:
    • experiments in the 1990s in using parallel disk I/O and parallel network I/O together to achieve 600 Mb/s over OC-12 (622 Mb/s) wide area network paths (illustrated below)
    • recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gb/s
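Why parallel network I/O was needed can be seen with a simple window-limited throughput model – a back-of-the-envelope sketch in which the RTT and per-stream socket buffer are illustrative assumptions, not measurements from those experiments. A single TCP stream is limited to roughly one window per round trip, so several streams, each fed from its own disk, are needed to fill a long-distance OC-12 path.

# Back-of-the-envelope model of why parallel streams were needed to fill an
# OC-12 path in the 1990s.  Illustrative parameters, not historical measurements.
from math import ceil

def stream_throughput_mbps(window_bytes, rtt_s):
    """Window-limited TCP throughput: at most one window per round trip."""
    return window_bytes * 8 / rtt_s / 1e6

oc12_mbps = 622
rtt_s     = 0.070        # ~70 ms cross-country round trip (assumed)
window    = 512 * 1024   # 512 KB socket buffer per stream (assumed)

per_stream     = stream_throughput_mbps(window, rtt_s)   # ~60 Mb/s per stream
streams_needed = ceil(600 / per_stream)

print(f"per-stream limit: {per_stream:.0f} Mb/s")
print(f"streams needed to reach 600 Mb/s of the {oc12_mbps} Mb/s path: {streams_needed}")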
66
Provide R&D, consulting, and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base: http://fasterdata.es.net topics
  – Network Architecture, including the Science DMZ model
  – Host Tuning
  – Network Tuning
  – Data Transfer Tools
  – Network Performance Testing
  – With special sections on:
    • Linux TCP Tuning (example below)
    • Cisco 6509 Tuning
    • perfSONAR Howto
    • Active perfSONAR Services
    • Globus overview
    • Say No to SCP
    • Data Transfer Nodes (DTN)
    • TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations
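As an example of the kind of host-tuning guidance collected there, the Linux TCP tuning section is largely about enlarging socket buffer limits so that a single flow can fill a long, high-bandwidth path. The sysctl fragment below is a representative sketch; the specific values are illustrative and depend on kernel version and the path's bandwidth-delay product, so check fasterdata.es.net for the currently recommended settings.

# Representative /etc/sysctl.conf fragment for a data transfer host on a
# 10G+ path (illustrative values; see fasterdata.es.net for current guidance).
net.core.rmem_max = 67108864               # allow up to 64 MB receive socket buffers
net.core.wmem_max = 67108864               # allow up to 64 MB send socket buffers
net.ipv4.tcp_rmem = 4096 87380 33554432    # min / default / max receive buffer
net.ipv4.tcp_wmem = 4096 65536 33554432    # min / default / max send buffer
net.ipv4.tcp_congestion_control = htcp     # a congestion control suited to fast, long paths
net.core.netdev_max_backlog = 250000       # deeper NIC-to-kernel queue at 10G+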
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
69
Infrastructure Critical to Science
• The combination of:
  – new network architectures in the wide area
  – new network services (such as guaranteed bandwidth virtual circuits)
  – cross-domain network error detection and correction
  – redesigning the site LAN to handle high data throughput
  – automation of data movement systems
  – use of appropriate operating system tuning and data transfer tools
  now provides the LHC science collaborations with the data communications underpinnings for a unique, large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.
70
LHC lessons of possible use to the SKA
The similarities:
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated/sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.
71
LHC lessons of possible use to the SKA
The lessons:
• The science data product (the output of the supercomputer center, in the SKA case) is likely too large to keep the working data set in one location.
  – A deep archive (tape-only) copy is probably practical in one location (e.g. the SKA supercomputer center), and this is done at CERN for the LHC.
  – The technical aspects of building and operating a centralized working data repository:
    • a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
    • high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
    argue against a single large data center (see the rough sizing sketch below).
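A rough sizing argument makes the point. The ingest rate and site count below are hypothetical placeholders, not SKA specifications: a single working repository must provision WAN capacity for the full ingest rate plus the aggregate egress to every analysis site, whereas regional centers split that egress load, and the cost of the trans-ocean links, among themselves.

# Rough sizing sketch: WAN capacity a single working data repository would need
# versus the per-site load in a distributed (regional center) model.
# The ingest rate and site count are hypothetical placeholders, not SKA figures.

ingest_gbps    = 100   # sustained rate from the telescope/supercomputer site (assumed)
analysis_sites = 10    # science sites that each pull a full working copy (assumed)

centralized_wan  = ingest_gbps + analysis_sites * ingest_gbps  # ingest plus egress to all sites
per_regional_wan = ingest_gbps                                 # each center ingests one copy

print(f"single repository WAN load : {centralized_wan} Gb/s at one site")
print(f"regional center WAN load   : ~{per_regional_wan} Gb/s each, spread across many links")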
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
  – It decentralizes costs and involves many countries directly in the telescope infrastructure.
  – It divides up the network load, especially on the expensive trans-ocean links.
  – It divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
• There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
  – It might be that in the case of the SKA the T1 links would come to a centralized, data-distribution-only node – say, in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
  – In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
• If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
  – In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
74
LHC lessons of possible use to the SKA
• All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (The impact of even small loss rates is quantified in the sketch below.)
• New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
• Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.
• Workflow management systems that automate the data movement will have to be designed and tested.
  – Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
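The "fragile workhorse" point can be made quantitative with the well-known loss-rate bound on standard TCP throughput (the Mathis et al. model). The sketch below uses illustrative path parameters – the 1460-byte MSS and 150 ms RTT are assumptions, not measurements – to show how a tiny packet loss rate collapses single-stream throughput on a long-RTT path.

# Approximate single-stream TCP throughput bound (Mathis et al. model):
#   throughput <= (MSS / RTT) * (C / sqrt(loss_rate)),  with C ~= 1.22
# Path parameters below are illustrative, not measurements.
from math import sqrt

def tcp_throughput_gbps(mss_bytes, rtt_s, loss_rate, c=1.22):
    return (mss_bytes * 8 / rtt_s) * (c / sqrt(loss_rate)) / 1e9

mss = 1460      # bytes, a typical Ethernet MSS
rtt = 0.150     # 150 ms, a representative trans-ocean round trip (assumed)

for loss in (1e-7, 1e-5, 1e-3):
    print(f"loss rate {loss:.0e}  ->  at most {tcp_throughput_gbps(mss, rtt, loss):6.3f} Gb/s")

# loss rate 1e-07  ->  at most  ~0.300 Gb/s
# loss rate 1e-05  ->  at most  ~0.030 Gb/s
# loss rate 1e-03  ->  at most  ~0.003 Gb/s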
75
The Message
Again ... a significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, ...
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C., Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
58
Point-to-Point Virtual Circuit ServiceThe way that networks provide such a service is with ldquovirtual
circuitsrdquo (also called pseudowires) that emulate point-to-point connections in a packet-switched network like the Internetndash This is typically done by using a ldquostaticrdquo routing mechanism
bull Eg some variation of label based switching with the static switch tables set up in advance to define the circuit path
ndash MPLS and OpenFlow are examples of this and both can transport IP packets
ndash Most modern Internet routers have this type of functionality
bull Such a service channels big data flows into virtual circuits in ways that also allow network operators to do ldquotraffic engineeringrdquo ndash that is to manageoptimize the use of available network resources and to keep big data flows separate from general trafficndash The virtual circuits can be directed to specific physical network paths
when they are set up
59
Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service
(For more information contact the project lead Chin Guok chinesnet)
bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references
bull OSCARS received a 2013 ldquoRampD 100rdquo award
60
End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo
ndash Sites for the most part
bull How are the circuits usedndash End system to end system IP
bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure
networks
ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future
ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP
protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
ndash Point-to-point connection between routing instance ndash eg BGP at the end points
bull Essentially this is how all current circuits are used from one site router to another site router
ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters
61
End User View of Circuits ndash How They Use Thembull When are the circuits used
ndash Mostly to solve a specific problem that the general infrastructure cannot
bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering
network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains
involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET
(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains
63
Inter-Domain Control Protocolbull There are two realms involved
1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains
2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains
FNAL (AS3152)[US]
ESnet (AS293)[US]
GEANT (AS20965)[Europe]
DFN (AS680)[Germany]
DESY (AS1754)[Germany]
Topology exchange
VC setup request
Local InterDomain
Controller
Local IDC
Local IDC
Local IDC
Local IDC
VC setup request
VC setup request
VC setup request
OSCARS
User source
User destination
VC setup request
data plane connection helper at each domain ingressegress point
data plane connection helper at each domain ingressegress point
1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to
domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process
The end-to-end virtual circuit
AutoBAHN
64
Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved
into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated
Facility - an international virtual organization that promotes the paradigm of lambda networking)
bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet
65
8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the
network to support data-intensive sciencendash With each generation of network transport technology
bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and
application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo
end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of
the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities
includebull experiments in the 1990s in using parallel disk IO and parallel network IO
together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths
bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs
66
Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects
are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build
a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at
httpfasterdataesnet and contains contributions from several organizations
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional
centers) has worked wellndash It decentralizes costs and involves many countries directly in the
telescope infrastructurendash It divides up the network load especially on the expensive trans-
ocean linksndash It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKARegardless of distributed vs centralized working data
repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source
to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a
centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue
bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC
ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
78
References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations
httpwwwperfsonarnet
httppspsperfsonarnet
[REQ] httpswwwesnetaboutscience-requirements
[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010
(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )
[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
59
Point-to-Point Virtual Circuit Servicebull OSCARS is ESnetrsquos implementation of a virtual circuit service
(For more information contact the project lead Chin Guok chinesnet)
bull See ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo in TERENA Networking Conference 2011 in the references
bull OSCARS received a 2013 ldquoRampD 100rdquo award
60
End User View of Circuits ndash How They Use Thembull Who are the ldquousersrdquo
ndash Sites for the most part
bull How are the circuits usedndash End system to end system IP
bull Almost never ndash very hard unless private address space usedndash Using public address space can result in leaking routesndash Using private address space with multi-homed hosts risks allowing backdoors into secure
networks
ndash End system to end system Ethernet (or other) over VLAN ndash a pseudowirebull Relatively commonbull Interesting example RDMA over VLAN likely to be popular in the future
ndash SC11 demo of 40G RDMA over WAN was very successfulndash CPU load for RDMA is a small fraction that of IPndash The guaranteed network of circuits (zero loss no reordering etc) required by non-IP
protocols like RDMA fits nicely with circuit services (RDMA performs very poorly on best effort networks)
ndash Point-to-point connection between routing instance ndash eg BGP at the end points
bull Essentially this is how all current circuits are used from one site router to another site router
ndash Typically site-to-site or advertise subnets that host clusters eg LHC analysis or data management clusters
61
End User View of Circuits ndash How They Use Thembull When are the circuits used
ndash Mostly to solve a specific problem that the general infrastructure cannot
bull Most circuits are used for a guarantee of bandwidth or for user traffic engineering
network domains (administrative units)ndash For a circuit service to be useful it must operate across all RampE domains
involved in the science collaboration to provide and-to-end circuitsndash eg ESnet Internet2 (USA) CANARIE (Canada) GEacuteANT (EU) SINET
(Japan) CERNET and CSTNET (China) KREONET (Korea) TWAREN (Taiwan) AARNet (AU) the European NRENs the US Regionals etc are all different domains
63
Inter-Domain Control Protocolbull There are two realms involved
1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains
2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains
FNAL (AS3152)[US]
ESnet (AS293)[US]
GEANT (AS20965)[Europe]
DFN (AS680)[Germany]
DESY (AS1754)[Germany]
Topology exchange
VC setup request
Local InterDomain
Controller
Local IDC
Local IDC
Local IDC
Local IDC
VC setup request
VC setup request
VC setup request
OSCARS
User source
User destination
VC setup request
data plane connection helper at each domain ingressegress point
data plane connection helper at each domain ingressegress point
1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to
domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process
The end-to-end virtual circuit
AutoBAHN
64
Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved
into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated
Facility - an international virtual organization that promotes the paradigm of lambda networking)
bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet
65
8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the
network to support data-intensive sciencendash With each generation of network transport technology
bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and
application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo
end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of
the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities
includebull experiments in the 1990s in using parallel disk IO and parallel network IO
together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths
bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs
66
Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects
are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build
a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at
httpfasterdataesnet and contains contributions from several organizations
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKA: The lessons
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to keep the working data set in one location.
– A deep archive (tape-only) copy is probably practical in one location (e.g., the SKA supercomputer center), and this is done at CERN for the LHC.
– The technical aspects of building and operating a centralized working data repository:
  • a large mass storage system with very large cache disks in order to satisfy current requests in an acceptable time
  • high-speed WAN connections to accept all data from the telescope site and then send all of that data out to the science sites
  militate against a single large data center (a rough sizing sketch follows below).
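A back-of-envelope sketch (all inputs below are illustrative assumptions, not SKA or LHC design figures) makes the point concrete: a single centralized working repository has to sustain the full instrument ingest rate plus the aggregate egress to every community that pulls data from it.

# Rough WAN sizing for a hypothetical single centralized working data repository.
# Every number here is an assumption chosen only to illustrate the scaling.

ingest_gbps   = 100    # sustained flow in from the telescope / supercomputer site
remote_sites  = 10     # analysis centers that pull data from the repository
fraction_each = 0.5    # fraction of the data each remote site pulls
overhead      = 1.5    # protocol overhead, retransfers, re-reads of hot data

egress_gbps = ingest_gbps * remote_sites * fraction_each * overhead
total_gbps  = ingest_gbps + egress_gbps

print("ingest:                     %4d Gb/s" % ingest_gbps)
print("egress to %2d sites:         %4.0f Gb/s" % (remote_sites, egress_gbps))
print("sustained WAN load, 1 site: %4.0f Gb/s" % total_gbps)
# -> 100 Gb/s in and 750 Gb/s out, i.e. ~0.85 Tb/s of sustained WAN capacity
#    (plus the matching mass-storage cache I/O) concentrated at one data center;
#    this is exactly the load that a distributed, multi-center model spreads out.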
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well.
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
– There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
  • It might be that, in the case of the SKA, the T1 links would come to a centralized distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
  • In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
– If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
  – In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded.
– All high-bandwidth, high-data-volume, long-RTT paths must be kept error-free (the sketch below quantifies why), with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc.
– New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
Re-engineering the site LAN/WAN architecture is critical: the Science DMZ.
Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
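The "fragile workhorse" point can be made quantitative with the Mathis et al. bound on loss-limited, Reno-style TCP throughput, rate <= (MSS/RTT) * (1/sqrt(p)). The sketch below evaluates it for an assumed 100 ms intercontinental round-trip time and a standard 1500-byte MTU; the specific path parameters are illustrative assumptions.

from math import sqrt

# Mathis et al. bound for loss-limited, Reno-style TCP throughput:
#   rate <= (MSS / RTT) * (C / sqrt(p)),  with the constant C taken as 1 here.
# Assumed path: 100 ms RTT (roughly intercontinental), 1500-byte MTU.
MSS_BITS = 1460 * 8     # TCP payload bits per 1500-byte packet
RTT_S = 0.100           # round-trip time in seconds

def mathis_gbps(loss_rate):
    # Upper bound on single-stream throughput, in Gb/s, for a given packet loss rate.
    return (MSS_BITS / RTT_S) / sqrt(loss_rate) / 1e9

for p in (1e-2, 1e-4, 1e-6, 1e-8):
    print("packet loss %7.0e  ->  at most %8.3f Gb/s per stream" % (p, mathis_gbps(p)))
# Even one loss in 10^8 packets caps a single stream near ~1.2 Gb/s on this path,
# and one loss in 10^6 caps it near ~0.12 Gb/s -- hence the emphasis on keeping
# long paths loss-free, monitoring them constantly, and cooperating across domains.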
75
The Message
Again … A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References (2)
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1–5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References (3)
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
63
Inter-Domain Control Protocolbull There are two realms involved
1 Domains controllers like OSCARS for routing scheduling and resource commitment within network domains
2 The inter-domain protocol that the domain controllers use between network domains where resources (link capacity) are likely shared and managed by pre-agreements between domains
FNAL (AS3152)[US]
ESnet (AS293)[US]
GEANT (AS20965)[Europe]
DFN (AS680)[Germany]
DESY (AS1754)[Germany]
Topology exchange
VC setup request
Local InterDomain
Controller
Local IDC
Local IDC
Local IDC
Local IDC
VC setup request
VC setup request
VC setup request
OSCARS
User source
User destination
VC setup request
data plane connection helper at each domain ingressegress point
data plane connection helper at each domain ingressegress point
1The domains exchange topology information containing at least potential VC ingress and egress points2VC setup request (via IDC protocol) is initiated at one end of the circuit and passed from domain to
domain as the VC segments are authorized and reserved3Data plane connection (eg Ethernet VLAN to VLAN connection) is facilitated by a helper process
The end-to-end virtual circuit
AutoBAHN
64
Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved
into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated
Facility - an international virtual organization that promotes the paradigm of lambda networking)
bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet
65
8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the
network to support data-intensive sciencendash With each generation of network transport technology
bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and
application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo
end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of
the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities
includebull experiments in the 1990s in using parallel disk IO and parallel network IO
together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths
bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs
66
Provide R&D consulting and knowledge base
• Providing consulting on problems that data-intensive projects are having in effectively using the network is critical.
• Using the knowledge gained from the problem solving to build a community knowledge base benefits everyone.
• The knowledge base maintained by ESnet is at http://fasterdata.es.net and contains contributions from several organizations.
67
The knowledge base: http://fasterdata.es.net topics
– Network Architecture, including the Science DMZ model
– Host Tuning
– Network Tuning
– Data Transfer Tools
– Network Performance Testing
– With special sections on:
• Linux TCP Tuning
• Cisco 6509 Tuning
• perfSONAR Howto
• Active perfSONAR Services
• Globus overview
• Say No to SCP
• Data Transfer Nodes (DTN)
• TCP Issues Explained
• fasterdata.es.net is a community project with contributions from several organizations.
(The motivation for the host- and TCP-tuning sections is sketched below.)
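The host-tuning entries in the list above exist mostly because default TCP socket buffers cannot cover the bandwidth-delay product of long, fast paths. The sketch below shows that calculation; the rates and RTTs are example values, and the actual sysctl settings to apply are the ones documented on fasterdata.es.net.

# Why the "Host Tuning" / "Linux TCP Tuning" sections matter: a TCP connection
# cannot run faster than (socket buffer) / RTT, so the buffer must cover the
# bandwidth-delay product (BDP) of the path. Example numbers, not a recipe.
def bdp_bytes(gbps: float, rtt_ms: float) -> int:
    """Bytes in flight needed to keep a path of `gbps` and `rtt_ms` full."""
    return int(gbps * 1e9 / 8 * rtt_ms / 1e3)

for gbps, rtt in [(1, 20), (10, 80), (10, 150)]:   # LAN-ish, US cross-country, trans-Atlantic
    print(f"{gbps:>2} Gb/s @ {rtt:>3} ms RTT -> {bdp_bytes(gbps, rtt)/2**20:7.1f} MiB of buffer")
# A stock Linux host with a few MB of maximum socket buffer caps such a
# transfer far below line rate, which is why net.core.rmem_max and related
# settings are raised on data transfer nodes (see the fasterdata.es.net pages).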
68
The Message
A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
69
Infrastructure Critical to Science
• The combination of
– new network architectures in the wide area,
– new network services (such as guaranteed-bandwidth virtual circuits),
– cross-domain network error detection and correction,
– redesigning the site LAN to handle high data throughput,
– automation of data movement systems, and
– use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale, widely distributed, very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC.
• Other disciplines that involve data-intensive science will face most of these same issues.
70
LHC lessons of possible use to the SKA – the similarities
• The SKA collaboration, like the LHC, is a large international collaboration involving science groups in many different countries.
• The science data generation rates of the LHC and the SKA are roughly comparable, as are the stored data volumes, which grow with time as the instrument takes data continuously.
• The data is generated/sent to a single location and then distributed to science groups.
• The data usage model for the SKA may be similar to the LHC: a large data set that many different science groups may access in different ways to accomplish science.
71
LHC lessons of possible use to the SKA – the lessons
The science data product (the output of the supercomputer center, in the SKA case) is likely too large to keep the working data set in one location.
– A deep archive (tape-only) copy is probably practical in one location (e.g., the SKA supercomputer center), and this is done at CERN for the LHC.
– The technical aspects of building and operating a centralized working data repository:
• a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time, and
• high-speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites (a rough bandwidth illustration follows)
militate against a single large data center.
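A rough illustration of the WAN-connection point above, using invented numbers (they are not SKA parameters): a single working-data repository has to carry the full instrument rate inbound and roughly the instrument rate times the export fan-out outbound.

# Why a single working-data repository concentrates WAN load: it must ingest
# the full instrument rate and re-export data to every analysis region.
# The rates and region count below are placeholders, not SKA figures.
def central_site_wan_gbps(instrument_gbps: float, export_fraction: float,
                          n_regions: int) -> float:
    """Total ingress + egress bandwidth needed at one central repository."""
    ingress = instrument_gbps
    egress = instrument_gbps * export_fraction * n_regions
    return ingress + egress

# 100 Gb/s in, with 6 regions each pulling half of the new data again:
print(central_site_wan_gbps(100, 0.5, 6))   # 400 Gb/s concentrated at one site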
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that, in the case of the SKA, the T1 links would come to a centralized data-distribution-only node – say, in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers; or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue (a toy comparison of the two options follows).
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
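The hub-versus-per-T1 choice above can be framed as a simple counting exercise. The sketch below is purely illustrative: the circuit counts and the share fractions are invented, and a real decision would also weigh circuit pricing, resilience, and the engineering of the telescope-site exit.

# Toy framing of the two options: (a) one 100 Gb/s flow to a distribution hub
# that splits it among Tier 1s, or (b) a virtual circuit from the SKA site to
# each Tier 1. Total exported bandwidth is the same; what differs is how many
# long-haul circuits leave the telescope site. All numbers are made up.
def hub_model(n_t1: int) -> dict:
    return {"long_haul_circuits_from_site": 1, "hub_to_t1_circuits": n_t1}

def per_t1_model(n_t1: int) -> dict:
    return {"long_haul_circuits_from_site": n_t1, "hub_to_t1_circuits": 0}

def t1_shares_gbps(total_gbps: float, shares: list) -> list:
    """Split the telescope flow among Tier 1 centers by agreed fractions."""
    return [round(total_gbps * s, 1) for s in shares]

print(hub_model(5), per_t1_model(5))
print(t1_shares_gbps(100, [0.3, 0.25, 0.2, 0.15, 0.1]))   # per-T1 rates in Gb/s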
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded: all high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (the throughput model sketched below shows why). New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
Re-engineering the site LAN-WAN architecture is critical: the Science DMZ.
Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
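The fragility point can be quantified with the well-known Mathis et al. single-stream TCP model, in which throughput scales roughly as MSS / (RTT · sqrt(loss rate)). The simplified form below drops the constant factor; the loss rate and RTT values are examples chosen to show the trend.

# Why long-RTT paths must be kept error-free: a simplified Mathis et al. model
# bounds a single TCP stream at roughly MSS / (RTT * sqrt(loss_rate)).
from math import sqrt

def tcp_bound_gbps(mss_bytes: int, rtt_ms: float, loss_rate: float) -> float:
    """Approximate upper bound on one TCP stream's steady-state throughput."""
    return (mss_bytes * 8 / 1e9) / ((rtt_ms / 1e3) * sqrt(loss_rate))

# The same tiny loss rate (1 packet in 100,000) at increasing RTT:
for rtt in (1, 10, 100):                       # ms: metro, continental, trans-oceanic
    print(f"{rtt:>3} ms RTT -> {tcp_bound_gbps(1460, rtt, 1e-5):6.2f} Gb/s")
# On a 100 ms path even this small loss rate caps a stream near 0.04 Gb/s,
# which is why constant monitoring (perfSONAR) and prompt error repair matter.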
75
The Message
Again … A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz/
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR/
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
77
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1–5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," Roberts, K., Beckett, D., Boertjes, D., Berthold, J., Laperle, C. (Ciena Corp., Ottawa, ON, Canada). IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
64
Point-to-Point Virtual Circuit Servicebull The Inter-Domain Control Protocol work has largely moved
into the Open Grid Forum Network Services Interface (NSI) Working Groupndash Testing is being coordinated in GLIF (Global Lambda Integrated
Facility - an international virtual organization that promotes the paradigm of lambda networking)
bull To recap The virtual circuit service provides the network as a ldquoservicerdquo that can be combined with other services eg cpu and storage scheduling in a Web Services Grids framework so that computing data access and data movement can all work together as a predictable system
Multi-domain circuit setup is not yet a robust production service but progress is being madebull See lhconenet
65
8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the
network to support data-intensive sciencendash With each generation of network transport technology
bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and
application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo
end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of
the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities
includebull experiments in the 1990s in using parallel disk IO and parallel network IO
together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths
bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs
66
Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects
are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build
a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at
httpfasterdataesnet and contains contributions from several organizations
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional
centers) has worked wellndash It decentralizes costs and involves many countries directly in the
telescope infrastructurendash It divides up the network load especially on the expensive trans-
ocean linksndash It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKARegardless of distributed vs centralized working data
repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source
to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a
centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue
bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC
ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
78
References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations
httpwwwperfsonarnet
httppspsperfsonarnet
[REQ] httpswwwesnetaboutscience-requirements
[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010
(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )
[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
65
8) Provide RampD consulting and knowledge base bull RampD drove most of the advances that make it possible for the
network to support data-intensive sciencendash With each generation of network transport technology
bull 155 Mbs was the norm for high speed networks in 1995bull 100 Gbs ndash 650 times greater ndash is the norm todaybull RampD groups involving hardware engineers computer scientists and
application specialists worked tobull first demonstrate in a research environment that ldquofilling the network piperdquo
end-to-end (application to application) was possiblebull and then to do the development necessary for applications to make use of
the new capabilitiesndash Examples of how this methodology drove toward todayrsquos capabilities
includebull experiments in the 1990s in using parallel disk IO and parallel network IO
together to achieve 600 Mbs over OC12 (622 Mbs) wide area network paths
bull recent demonstrations of this technology to achieve disk-to-disk WAN data transfers at 100 Gbs
66
Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects
are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build
a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at
httpfasterdataesnet and contains contributions from several organizations
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional
centers) has worked wellndash It decentralizes costs and involves many countries directly in the
telescope infrastructurendash It divides up the network load especially on the expensive trans-
ocean linksndash It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKARegardless of distributed vs centralized working data
repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source
to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a
centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue
bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC
ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
78
References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations
httpwwwperfsonarnet
httppspsperfsonarnet
[REQ] httpswwwesnetaboutscience-requirements
[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010
(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )
[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
66
Provide RampD consulting and knowledge base bull Providing consulting on problems that data-intensive projects
are having in effectively using the network is criticalUsing the knowledge gained from the problem solving to build
a community knowledge base benefits everyoneThe knowledge base maintained by ESnet is at
httpfasterdataesnet and contains contributions from several organizations
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional
centers) has worked wellndash It decentralizes costs and involves many countries directly in the
telescope infrastructurendash It divides up the network load especially on the expensive trans-
ocean linksndash It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKARegardless of distributed vs centralized working data
repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source
to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a
centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue
bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC
ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
78
References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations
httpwwwperfsonarnet
httppspsperfsonarnet
[REQ] httpswwwesnetaboutscience-requirements
[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010
(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )
[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
67
The knowledge base httpfasterdataesnet topics
ndash Network Architecture including the Science DMZ modelndash Host Tuningndash Network Tuningndash Data Transfer Toolsndash Network Performance Testingndash With special sections on
bull Linux TCP Tuningbull Cisco 6509 Tuningbull perfSONAR Howtobull Active perfSONAR Servicesbull Globus overviewbull Say No to SCPbull Data Transfer Nodes (DTN)bull TCP Issues Explained
bull fasterdataesnet is a community project with contributions from several organizations
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKA
The LHC model of distributed data (multiple regional centers) has worked well:
– It decentralizes costs and involves many countries directly in the telescope infrastructure.
– It divides up the network load, especially on the expensive trans-ocean links.
– It divides up the cache I/O load across distributed sites.
73
LHC lessons of possible use to the SKA
Regardless of a distributed vs. centralized working data repository, all of the attendant network lessons will apply.
There will have to be an LHCOPN-like network from the data source to the Tier 1 site(s).
• It might be that, in the case of the SKA, the T1 links would come to a centralized, distribution-only node – say in the UK – that accepts the 100 Gb/s flow from the telescope site and then divides it up among the Tier 1 data centers, or there might be virtual circuits from the SKA site to each T1. This choice is a cost and engineering issue.
• In any event, the LHCOPN-like network will need to go from the SKA site to the Tier 1 data center(s).
If there are a lot of science analysis sites that draw heavily on the Tier 1 data, then an LHCONE-like infrastructure is almost certainly going to be needed, for the same reasons as for the LHC.
– In fact, it might well be that the SKA could use the LHCONE infrastructure – that is the specific intent of how the R&E networks in the US, for example, are implementing LHCONE.
74
LHC lessons of possible use to the SKA
All of the lessons from the 'TCP is a "fragile workhorse"' observation must be heeded:
All high-bandwidth, high-data-volume, long-RTT paths must be kept error-free, with constant monitoring, close cooperation of the R&E networks involved in providing parts of the path, etc. (a worked example of why follows this slide).
New transport protocols (there are lots of them available) could address this problem, but they are very difficult to introduce into all of the different environments serving an international collaboration.
Re-engineering the site LAN-WAN architecture is critical: the Science DMZ.
Workflow management systems that automate the data movement will have to be designed and tested.
– Readiness for instrument turn-on can only be achieved by using synthetic data and "service challenges" – simulated operation – building up to at-scale data movement well before instrument turn-on.
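The "fragile workhorse" point can be quantified with the standard loss-based model of single-stream TCP throughput (the Mathis et al. approximation, rate ≈ MSS / (RTT · √p), constant of order one omitted). The sketch below evaluates it for an assumed long-RTT path; the 1460-byte MSS and 120 ms RTT are illustrative values, not measurements from this material.

```python
# Loss-limited TCP throughput per the Mathis et al. approximation:
#   rate ~= MSS / (RTT * sqrt(p)),  p = packet loss probability.
# Assumed example path: 1460-byte MSS (standard 1500-byte MTU) and a
# 120 ms RTT, i.e. roughly an intercontinental R&E path.
import math

MSS_BYTES = 1460
RTT_S = 0.120  # assumed round-trip time

def mathis_gbps(loss_rate: float, mss: int = MSS_BYTES, rtt: float = RTT_S) -> float:
    """Approximate single-stream TCP throughput in Gb/s at a given loss rate."""
    return (mss * 8) / (rtt * math.sqrt(loss_rate)) / 1e9

if __name__ == "__main__":
    for loss in (1e-8, 1e-7, 1e-6, 1e-5, 1e-4):
        print(f"loss {loss:7.0e}  ->  ~{mathis_gbps(loss):6.3f} Gb/s per stream")
    # Even one lost packet in ten million holds a single stream on this
    # path to well under 1 Gb/s, which is why long paths must be kept
    # essentially error-free and continuously monitored (e.g. perfSONAR).
```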
75
The Message
Again … A significant collection of issues must all be addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments. But once this is done, international high-speed data management can be done on a routine basis.

Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment – SKA, ITER, …
76
References
[DIS] "Infrastructure for Data Intensive Science – a bottom-up approach," Eli Dart and William Johnston, Energy Sciences Network (ESnet), Lawrence Berkeley National Laboratory. To be published in Future of Data Intensive Science, Kerstin Kleese van Dam and Terence Critchlow, eds. Also see http://fasterdata.es.net/fasterdata/science-dmz
[fasterdata] See http://fasterdata.es.net/fasterdata/perfSONAR
[HPBulk] "High Performance Bulk Data Transfer," Brian Tierney and Joe Metzger, ESnet. Joint Techs, July 2010. Available at fasterdata.es.net/fasterdata-home/learn-more
[Jacobson] For an overview of this issue see http://en.wikipedia.org/wiki/Network_congestion#History
[LHCONE] http://lhcone.net
[LHCOPN Sec] At https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome see "LHCOPN security policy document."
[NetServ] "Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory. In The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
References
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1-5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12-15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference, 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
68
The MessageA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experiments But once this is done international high-speed data management can
be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional
centers) has worked wellndash It decentralizes costs and involves many countries directly in the
telescope infrastructurendash It divides up the network load especially on the expensive trans-
ocean linksndash It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKARegardless of distributed vs centralized working data
repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source
to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a
centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue
bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC
ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
78
References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations
httpwwwperfsonarnet
httppspsperfsonarnet
[REQ] httpswwwesnetaboutscience-requirements
[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010
(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )
[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
69
Infrastructure Critical to Sciencebull The combination of
ndash New network architectures in the wide areandash New network services (such as guaranteed bandwidth virtual circuits)ndash Cross-domain network error detection and correctionndash Redesigning the site LAN to handle high data throughputndash Automation of data movement systemsndash Use of appropriate operating system tuning and data transfer tools
now provides the LHC science collaborations with the data communications underpinnings for a unique large-scale widely distributed very high performance data management and analysis infrastructure that is an essential component in scientific discovery at the LHC
bull Other disciplines that involve data-intensive science will face most of these same issues
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional
centers) has worked wellndash It decentralizes costs and involves many countries directly in the
telescope infrastructurendash It divides up the network load especially on the expensive trans-
ocean linksndash It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKARegardless of distributed vs centralized working data
repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source
to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a
centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue
bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC
ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
78
References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations
httpwwwperfsonarnet
httppspsperfsonarnet
[REQ] httpswwwesnetaboutscience-requirements
[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010
(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )
[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
70
LHC lessons of possible use to the SKAThe similarities
bull The SKA collaboration like the LHC is a large international collaboration involving science groups in many different countries
bull The science data generation rates of the LHC and the SKA are roughly comparable as are the stored data volume which grow with time as the instrument takes data continuously
bull The data is generatedsent to a single location and then distributed to science groups
bull The data usage model for the SKA may be similar to the LHC A large data set that many different science groups may access in different ways to accomplish science
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional
centers) has worked wellndash It decentralizes costs and involves many countries directly in the
telescope infrastructurendash It divides up the network load especially on the expensive trans-
ocean linksndash It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKARegardless of distributed vs centralized working data
repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source
to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a
centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue
bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC
ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
78
References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations
httpwwwperfsonarnet
httppspsperfsonarnet
[REQ] httpswwwesnetaboutscience-requirements
[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010
(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )
[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
71
LHC lessons of possible use to the SKAThe lessons
The science data product (output of the supercomputer center in the SKA case) is likely too large to have the working data set in one locationndash A deep archive (tape only) copy is probably practical in one
location (eg the SKA supercomputer center) and this is done at CERN for the LHC
ndash The technical aspects of building and operating a centralized working data repository
bull a large mass storage system that has very large cache disks in order to satisfy current requests in an acceptable time
bull high speed WAN connections to accept all data from the telescope site and then send all of that data out to science sites
mitigates against a single large data center
72
LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional
centers) has worked wellndash It decentralizes costs and involves many countries directly in the
telescope infrastructurendash It divides up the network load especially on the expensive trans-
ocean linksndash It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKARegardless of distributed vs centralized working data
repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source
to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a
centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue
bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC
ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
References[OSCARS] ldquoIntra and Interdomain Circuit Provisioning Using the OSCARS Reservation Systemrdquo Chin Guok Robertson D Thompson M Lee J Tierney B Johnston W Energy Sci Network Lawrence Berkeley National Laboratory In BROADNETS 2006 3rd International Conference on Broadband Communications Networks and Systems 2006 ndash IEEE 1-5 Oct 2006 Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory Berkeley California USA The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering12-15 April 2011 Ajaccio - Corsica ndash France Available at httpesnetnews-and-publicationspublications-and-presentations
ldquoMotivation Design Deployment and Evolution of a Guaranteed Bandwidth Network Servicerdquo William E Johnston Chin Guok and Evangelos Chaniotakis ESnet and Lawrence Berkeley National Laboratory Berkeley California USA In TERENA Networking Conference 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
78
References[perfSONAR] See ldquoperfSONAR Instantiating a Global Network Measurement Frameworkrdquo B Tierney J Metzger J Boote A Brown M Zekauskas J Zurawski M Swany M Grigoriev In proceedings of 4th Workshop on Real Overlays and Distributed Systems (ROADS09) Co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP) October 2009 Available at httpesnetnews-and-publicationspublications-and-presentations
httpwwwperfsonarnet
httppspsperfsonarnet
[REQ] httpswwwesnetaboutscience-requirements
[Rob1] ldquo100G and beyond with digital coherent signal processingrdquo Roberts K Beckett D Boertjes D Berthold J Laperle C Ciena Corp Ottawa ON Canada Communications Magazine IEEE July 2010
(may be available at httpstaffwebcmsgreacuk~gm73com-magCOMG_20100701_Jul_2010PDF )
[SDMZ] see lsquoAchieving a Science DMZldquorsquo at httpfasterdataesnetassetsfasterdataScienceDMZ-Tutorial-Jan2012pdf and the podcast of the talk at httpeventsinternet2edu2012jt-loniagendacfmgo=sessionampid=10002160ampevent=1223
Foundations of data-intensive science Technology and practice
Data-Intensive Science in DOErsquos Office of Science
DOE Office of Science and ESnet ndash the ESnet Mission
HEP as a Prototype for Data-Intensive Science
HEP as a Prototype for Data-Intensive Science (2)
HEP as a Prototype for Data-Intensive Science (3)
HEP as a Prototype for Data-Intensive Science (4)
HEP as a Prototype for Data-Intensive Science (5)
The LHC data management model involves a world-wide collection
Scale of ATLAS analysis driven data movement
HEP as a Prototype for Data-Intensive Science (6)
HEP as a Prototype for Data-Intensive Science (7)
SKA data flow model is similar to the LHC
Foundations of data-intensive science
1) Underlying network issues
We face a continuous growth of data transport
1a) Optical Network Technology
Optical Network Technology
1b) Network routers and switches
The Energy Sciences Network ESnet5 (Fall 2013)
2) Data transport The limitations of TCP must be addressed for
Transport
Transport Impact of packet loss on TCP
Transport Modern TCP stack
Transport Modern TCP stack (2)
3) Monitoring and testing
perfSONAR
perfSONAR (2)
4) System software evolution and optimization
41) System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
System software tuning Host tuning ndash TCP
42) System software tuning Data transfer tools
System software tuning Data transfer tools
System software tuning Data transfer tools (2)
System software tuning Data transfer tools (3)
44) System software tuning Other issues
5) Site infrastructure to support data-intensive science The Sc
The Science DMZ
The Science DMZ (2)
The Science DMZ (3)
6) Data movement and management techniques
Highly distributed and highly automated workflow systems
Slide 44
Scale of ATLAS analysis driven data movement (2)
Building an LHC-scale production analysis system
Ramp-up of LHC traffic in ESnet
6 cont) Evolution of network architectures
The LHC OPN ndash Optical Private Network
The LHC OPN ndash Optical Private Network (2)
The LHC OPN ndash Optical Private Network (3)
Managing large-scale science traffic in a shared infrastructure
The LHCrsquos Open Network Environment ndash LHCONE
Slide 54
The LHCrsquos Open Network Environment ndash LHCONE (2)
LHCONE is one part of the network infrastructure that supports
7) New network services
Point-to-Point Virtual Circuit Service
Point-to-Point Virtual Circuit Service (2)
End User View of Circuits ndash How They Use Them
End User View of Circuits ndash How They Use Them (2)
Cross-Domain Virtual Circuit Service
Inter-Domain Control Protocol
Point-to-Point Virtual Circuit Service (3)
8) Provide RampD consulting and knowledge base
Provide RampD consulting and knowledge base
The knowledge base
The Message
Infrastructure Critical to Science
LHC lessons of possible use to the SKA
LHC lessons of possible use to the SKA (2)
LHC lessons of possible use to the SKA (3)
LHC lessons of possible use to the SKA (4)
LHC lessons of possible use to the SKA (5)
The Message (2)
References
References (2)
References (3)
72
LHC lessons of possible use to the SKAThe LHC model of a distributed data (multiple regional
centers) has worked wellndash It decentralizes costs and involves many countries directly in the
telescope infrastructurendash It divides up the network load especially on the expensive trans-
ocean linksndash It divides up the cache IO load across distributed sites
73
LHC lessons of possible use to the SKARegardless of distributed vs centralized working data
repository all of the attendant network lessons will apply There will have to be an LHCOPN-like network from the data source
to Tier 1 site(s)bull It might be that in the case of the SKA the T1 links would come to a
centralized data distribution only node ndash say in the UK ndash that accepts the 100 Gbs flow from the telescope site and then divides it up among the Tier 1 data centersor there might be virtual circuits from the SKA site to each T1 This choice is a cost and engineering issue
bull In any event the LHCOPN-like network will need to go from SKA site to Tier 1 data center(s)
If there are a lot of science analysis sites that draw heavily on the Tier 1 data then an LHCONE-like infrastructure is almost certainly going to be needed for the same reasons as for the LHC
ndash In fact it might well be that the SKA could use the LHCONE infrastructure ndash that is the specific intent of the how the RampE networks in the US eg are implementing LHCONE
74
LHC lessons of possible use to the SKAAll of the lessons from lsquoTCP is a ldquofragile workhorserdquorsquo
observation must be heeded All high bandwidth high data volume long RTT paths must be kept error-free with constant monitoring close cooperation of the RampE networks involved in providing parts of the path etc New transport protocols (there are lots of them available) could
address this problem but they are very difficult to introduce in to all of the different environments serving an international collaboration
Re-engineering the site LAN-WAN architecture is critical The ScienceDMZ
Workflow management systems that automate the data movement will have to designed and testedndash Readiness for instrument turn-on can only be achieved by using
synthetic data and ldquoservice challengesrdquo ndash simulated operation ndash building up to at-scale data movement well before instrument turn-on
75
The MessageAgain hellipA significant collection of issues must all be
addressed in order to achieve the sustained data movement needed to support data-intensive science such as the LHC experimentsBut once this is done international high-speed data
management can be done on a routine basis
Many of the technologies and knowledge from the LHC experience are applicable to other science disciplines that must manage a lot of data in a widely distributed environment ndash SKA ITER helliphellip
76
References[DIS] ldquoInfrastructure for Data Intensive Science ndash a bottom-up approach ldquoEli Dart and William Johnston
Energy Sciences Network (ESnet) Lawrence Berkeley National Laboratory To be published in Future of Data Intensive Science Kerstin Kleese van Dam and Terence Critchlow eds Also see httpfasterdataesnetfasterdatascience-dmz
[fasterdata] See httpfasterdataesnetfasterdataperfSONAR
[HPBulk] ldquoHigh Performance Bulk Data Transferrdquo Brian Tierney and Joe Metzger ESnet Joint Techs July 2010 Available at fasterdataesnetfasterdata-homelearn-more
[Jacobson] For an overview of this issue see httpenwikipediaorgwikiNetwork_congestionHistory
[LHCONE] httplhconenet
[LHCOPN Sec] at httpstwikicernchtwikibinviewLHCOPNWebHome see ldquoLHCOPN security policy documentrdquo
[NetServ] ldquoNetwork Services for High Performance Distributed Computing and Data Managementrdquo W E Johnston C Guok J Metzger and B Tierney ESnet and Lawrence Berkeley National Laboratory In The Second International Conference on Parallel Distributed Grid and Cloud Computing for Engineering 12‐15 April 2011 Available at httpesnetnews-and-publicationspublications-and-presentations
77
References (2)
[OSCARS] "Intra and Interdomain Circuit Provisioning Using the OSCARS Reservation System," Chin Guok, D. Robertson, M. Thompson, J. Lee, B. Tierney, and W. Johnston, Energy Sciences Network, Lawrence Berkeley National Laboratory. In BROADNETS 2006: 3rd International Conference on Broadband Communications, Networks and Systems, IEEE, 1–5 Oct. 2006. Available at http://es.net/news-and-publications/publications-and-presentations/
"Network Services for High Performance Distributed Computing and Data Management," W. E. Johnston, C. Guok, J. Metzger, and B. Tierney, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, 12–15 April 2011, Ajaccio, Corsica, France. Available at http://es.net/news-and-publications/publications-and-presentations/
"Motivation, Design, Deployment and Evolution of a Guaranteed Bandwidth Network Service," William E. Johnston, Chin Guok, and Evangelos Chaniotakis, ESnet and Lawrence Berkeley National Laboratory, Berkeley, California, USA. In TERENA Networking Conference 2011. Available at http://es.net/news-and-publications/publications-and-presentations/
78
References (3)
[perfSONAR] See "perfSONAR: Instantiating a Global Network Measurement Framework," B. Tierney, J. Metzger, J. Boote, A. Brown, M. Zekauskas, J. Zurawski, M. Swany, and M. Grigoriev. In Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS'09), co-located with the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Available at http://es.net/news-and-publications/publications-and-presentations/
http://www.perfsonar.net
http://psps.perfsonar.net
[REQ] https://www.es.net/about/science-requirements/
[Rob1] "100G and beyond with digital coherent signal processing," K. Roberts, D. Beckett, D. Boertjes, J. Berthold, and C. Laperle, Ciena Corp., Ottawa, ON, Canada. IEEE Communications Magazine, July 2010. (May be available at http://staffweb.cms.gre.ac.uk/~gm73/com-mag/COMG_20100701_Jul_2010.PDF)
[SDMZ] See "Achieving a Science DMZ" at http://fasterdata.es.net/assets/fasterdata/ScienceDMZ-Tutorial-Jan2012.pdf and the podcast of the talk at http://events.internet2.edu/2012/jt-loni/agenda.cfm?go=session&id=10002160&event=1223