Advanced Network Services for Experiments (ANSE)
Introduction: Progress, Outlook and Issues
Harvey B Newman, California Institute of Technology
ANSE Annual Meeting, Vanderbilt
May 14, 2014
ANSE: Advanced Network Services for (LHC) Experiments
NSF CC-NIE Funded, 4 US Institutes: Caltech, Vanderbilt, Michigan, UT Arlington. A US ATLAS / US CMS Collaboration
Goal: provide more efficient, deterministic workflows
Method: Interface advanced network services, including dynamic circuits, with the LHC data management systems: PanDA in (US) ATLAS, PhEDEx in (US) CMS
Includes leading personnel for the data production systems: Kaushik De (PanDA Lead), Tony Wildish (PhEDEx Lead)
Performance measurements with PhEDEx and FDT for CMS
FDT sustained rates: ~1500 MB/sec; average over 24 hrs: ~1360 MB/sec
Difference due to delay in starting jobs; bumpy plot due to binning and job size
[Plots: 24-hour throughput reported by PhEDEx and throughput as reported by MonALISA, in MB/sec, 1h moving average]
PhEDEx testbed in ANSE
[Diagram: PhEDEx central agents and Oracle DB @ CERN; multiple front-ends @ CERN and Vanderbilt; generic PhEDEx sites (T2_ANSE_aaa ... T2_ANSE_zzz) with site agents, storage elements and FTS/FDT/SRM backends; storage nodes sandy01-ams / wood1-ams (T2_ANSE_Amsterdam) and sandy01-gva / hermes2 (T2_ANSE_Geneva) connected by a high speed WAN link]
T2_ANSE_Geneva & T2_ANSE_Amsterdam
• High capacity link with dynamic circuit creation between storage nodes
• PhEDEx and storage nodes separate
• 4x4 SSD RAID 0 arrays, 16 physical CPU cores / machine
CMS: PhEDEx and Dynamic Circuits
[Diagram: storage nodes sandy01-ams / wood1-ams (T2_ANSE_Amsterdam) and sandy01-gva / hermes2 (T2_ANSE_Geneva), connected by both a shared path and a dedicated high speed WAN circuit]
[Plot: PhEDEx throughput on a shared path, with 5 Gbps of UDP cross traffic: seamless switchover, no interruption of service]
Latest efforts: integrating circuit awareness into the FileDownload agent (a minimal sketch of the control logic follows below):
• Prototype is backend agnostic; no modifications to the PhEDEx DB
• All control logic is in the FileDownload agent
• Transparent to all other PhEDEx instances
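The slides do not show the agent code itself; the following is a minimal Python sketch of the control-logic idea described above, i.e. deciding inside the download agent whether to request a dynamic circuit for a transfer backlog and falling back to the shared path otherwise. The names (CircuitManager-style request_circuit, queued file objects, the 50 GB threshold) are hypothetical illustrations, not the actual PhEDEx/FDT interfaces, which live in the FileDownload agent (Perl).

# Hypothetical sketch: circuit-aware decision inside a download agent.
# All names and thresholds are illustrative; the real prototype is
# backend agnostic and keeps all control logic in the FileDownload agent.

CIRCUIT_THRESHOLD_BYTES = 50 * 1024**3   # only worth a circuit for large backlogs
CIRCUIT_SETUP_TIMEOUT_S = 120            # fall back if provisioning is too slow


def choose_path(transfer_queue, circuit_mgr, src, dst):
    """Return the path (circuit or shared) to use for this batch of transfers."""
    backlog = sum(f.size for f in transfer_queue)

    if backlog < CIRCUIT_THRESHOLD_BYTES:
        return "shared"                  # small backlog: not worth the setup cost

    try:
        # Ask a circuit controller (e.g. an NSI/OSCARS front-end) for a
        # point-to-point circuit between the two storage endpoints.
        circuit = circuit_mgr.request_circuit(src, dst,
                                              timeout=CIRCUIT_SETUP_TIMEOUT_S)
        return circuit                   # transfers are re-routed onto the circuit
    except TimeoutError:
        # Seamless fallback: keep transferring on the shared path,
        # exactly as if no circuit infrastructure existed.
        return "shared"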
[Plot: PhEDEx transfer rates (MB/sec, 1h moving average) on a dedicated path, testing circuit integration into the Download agent]
Using dynamic circuits in PhEDEx allows for more deterministic workflows, useful for co-scheduling CPU with data movement
Vlad Lapadatescu, Tony Wildish
PanDA: Production and Distributed Analysis (Kaushik De)
[Plot: 25M jobs at > 100 sites now completed each month; 6x growth in 3 years (2010-13); reaching a new plateau]
STEP 1: Import network information into PanDA (a minimal aggregation sketch follows below)
STEP 2: Use network information directly to optimize workflow for data transfer/access, at a higher level than individual transfers alone
Start with simple use cases leading to measurable improvements in workflow / user experience
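STEP 1 is about getting network measurements into a form the workload manager can consume. Below is a minimal Python sketch, not PanDA code, that folds per-measurement throughput samples into a source/destination table; the sample values, site names and the aggregation choice (median of recent samples) are all illustrative.

# Hypothetical sketch of STEP 1: condense raw throughput measurements
# (e.g. from perfSONAR or transfer logs) into a site-pair table that a
# workload manager could consult. Sample values are made up.
from collections import defaultdict
from statistics import median

# (source, destination, MB/s) samples, e.g. from the last 24 hours
samples = [
    ("BNL",  "MWT2",  610.0),
    ("BNL",  "MWT2",  580.0),
    ("BNL",  "AGLT2", 300.0),
    ("CERN", "MWT2",  120.0),
    ("CERN", "MWT2",  140.0),
]

by_pair = defaultdict(list)
for src, dst, rate in samples:
    by_pair[(src, dst)].append(rate)

# One robust number per site pair; a real system would also keep
# timestamps, sample counts and a quality flag.
cost_table = {pair: median(rates) for pair, rates in by_pair.items()}

for (src, dst), rate in sorted(cost_table.items()):
    print("%-5s -> %-6s : %6.1f MB/s" % (src, dst, rate))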
USE CASES (Kaushik De); a minimal site-selection sketch follows this list
1. Faster User Analysis
   Analysis jobs normally go to sites with local data: sometimes leads to long wait times due to queuing
   Could use network information to assign work to 'nearby' sites with idle CPUs and good connectivity
2. Optimal Cloud Selection
   Tier2s are connected to Tier1 "Clouds" manually by the ops team (may be attached to multiple Tier1s)
   To be automated using network info: algorithm under test
3. PD2P = PanDA Dynamic Data Placement
   Asynchronous, usage-based: repeated use of data or backlog in processing → make additional copies; re-brokerage of queues → new data locations
   PD2P is well suited for network integration: use network info for strategic replication + site selection (to be tested soon); try SDN provisioning since this usually involves large datasets
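Use case 1 lends itself to a short illustration. Below is a minimal Python sketch, not PanDA code, of brokering work to a 'nearby' site when the sites holding the data are busy; the site list, queue depths and the cost function combining throughput and queue length are hypothetical.

# Hypothetical sketch of network-aware brokerage (use case 1).
# Inputs would come from the workload manager (queue depths) and from
# network monitoring (site-pair throughput); values here are made up.

sites = {
    #  name        queued jobs   has input data   MB/s from data source
    "SITE_A": {"queued": 900, "has_data": True,  "throughput": None},
    "SITE_B": {"queued":  20, "has_data": False, "throughput": 800.0},
    "SITE_C": {"queued":  50, "has_data": False, "throughput":  40.0},
}

MIN_THROUGHPUT = 100.0   # MB/s: below this, remote access is not worthwhile
QUEUE_PENALTY = 2.0      # minutes of estimated wait per queued job (illustrative)


def broker(sites):
    """Pick the site with the lowest estimated time-to-start, letting
    well-connected sites without the data compete with busy data hosts."""
    best, best_cost = None, float("inf")
    for name, s in sites.items():
        if not s["has_data"] and (s["throughput"] or 0.0) < MIN_THROUGHPUT:
            continue                        # poorly connected: require local data
        cost = s["queued"] * QUEUE_PENALTY  # waiting cost
        if not s["has_data"]:
            cost += 15.0                    # flat penalty for staging / remote read
        if cost < best_cost:
            best, best_cost = name, cost
    return best


print(broker(sites))   # -> SITE_B: idle and well connected beats the busy data host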
DYNES (NSF MRI-R2): Dynamic Circuits Nationwide: Led by Internet2 with Caltech
DYNES is extending circuit capabilities to ~50 US campuses; this turns out to be nontrivial
Will be an integral part of the point-to-point service in LHCONE
Partners: I2, Caltech, Michigan, Vanderbilt; working with I2 and ESnet on dynamic circuits issues and software
http://internet2.edu/dynes
Extended the OSCARS scope; Transition: DRAGON to PSS, OESS
Challenges Encountered
perfSONAR deployment status
   For meaningful results, we need most LHC computing sites equipped with perfSONAR nodes; this is work in progress
   Easy-to-use perfSONAR API: was missing, but a REST API has been made available recently (a minimal query sketch follows at the end of this list)
Inter-domain dynamic circuits
   Intra-domain systems have been in production for some time, e.g. ESnet has used OSCARS as a production tool for several years; OESS (OpenFlow-based) is also in production, single domain
   Inter-domain circuit provisioning continues to be hard: implementations are fragile; error recovery tends to require manual intervention
   Holistic approach needed: pervasive monitoring + tracking of configuration state changes; intelligent clean-up and timeout handling
   The NSI framework needs faster standardization, adoption and implementation among the major networks, or a future SDN-based solution: for example OpenFlow and OpenDaylight
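As an illustration of the REST API mentioned above, here is a minimal Python sketch that pulls recent throughput results for one site pair from a perfSONAR measurement archive (esmond). The archive host and test endpoints are made up, and the endpoint/parameter names follow the esmond archive convention but should be checked against the deployed version.

# Minimal sketch: query a perfSONAR measurement archive (esmond) REST API
# for throughput between two test hosts. Host names are illustrative.
import requests

ARCHIVE = "http://ps-archive.example.org/esmond/perfsonar/archive/"

params = {
    "source": "ps-bw.site-a.example.org",
    "destination": "ps-bw.site-b.example.org",
    "event-type": "throughput",
    "time-range": 86400,          # last 24 hours
}

for metadata in requests.get(ARCHIVE, params=params, timeout=30).json():
    # Each metadata record lists the event types measured for this test;
    # follow the throughput one to get the actual time series.
    for event in metadata.get("event-types", []):
        if event.get("event-type") == "throughput":
            series = requests.get("http://ps-archive.example.org" +
                                  event["base-uri"], timeout=30).json()
            for point in series:
                print(point["ts"], point["val"])   # unix time, bits/sec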
Some of the DYNES Challenges Encountered; Approaches to a Solution
Some of the issues encountered in both the control and data planes came from immaturity of the implementations at the time:
   A failed request left configuration on switches, causing subsequent failures
   Failure notifications took too long to arrive, blocking serialized requests
   Error messages were often erratic, making it hard to find the root cause of a problem
   End-to-end VLAN translation did not always result in a functional data plane
   Static data plane configuration needs changes upon upgrades
Grid certificate validity (1 year) across 40+ sites led to frequent expiration issues (not DYNES specific!)
   Solution: we use Nagios to monitor certificate state at all DYNES sites, generating early warnings to the local administrators (a minimal expiry-check sketch follows below)
   An alternate solution would be to create a DYNES CA and administer certificates in a coordinated way; this requires a responsible party
DYNES path forward:
   Working with a selected subset of sites on getting automated tests failure free
   Taking input from these, propagate changes to other sites, and/or deploy NSI
   If funding allows (future proposal): an SDN based multi-domain solution
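The actual monitoring is done with Nagios checks; the snippet below is a minimal stand-alone Python sketch of the same idea, reading a host certificate from disk and warning when it is close to expiry. The certificate path and the 30-day threshold are illustrative.

# Minimal sketch of a certificate-expiry check (the production setup uses
# Nagios plugins instead). Path and warning threshold are illustrative.
import datetime
import sys

from cryptography import x509

CERT_PATH = "/etc/grid-security/hostcert.pem"   # typical grid host cert location
WARN_DAYS = 30                                  # warn a month before expiry

with open(CERT_PATH, "rb") as f:
    cert = x509.load_pem_x509_certificate(f.read())

days_left = (cert.not_valid_after - datetime.datetime.utcnow()).days

if days_left < 0:
    print("CRITICAL: host certificate expired %d days ago" % -days_left)
    sys.exit(2)    # Nagios exit code for CRITICAL
elif days_left < WARN_DAYS:
    print("WARNING: host certificate expires in %d days" % days_left)
    sys.exit(1)    # Nagios exit code for WARNING
else:
    print("OK: host certificate valid for %d more days" % days_left)
    sys.exit(0)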
Efficient Long Distance End-to-End Throughput from the Campus Over 100G Networks
Harvey Newman, Artur Barczyk, Azher Mughal
California Institute of Technology
NSF CC-NIE Meeting, Washington DC
April 30, 2014
SC06 BWC: Fast Data Transfer (FDT) http://monalisa.cern.ch/FDT
An easy-to-use open source Java application that runs on all major platforms
Uses an asynchronous multithreaded system to achieve smooth, linear data flow (a minimal sketch of the streaming idea follows below):
   Streams a dataset (list of files) continuously through an open TCP socket
   No protocol start/stops between files
   Sends buffers at a rate matched to the monitored capability of the end-to-end path
   Uses independent threads to read & write on each physical device
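FDT itself is Java; the following is a minimal Python sketch of the core idea named above, streaming a list of files back to back through one open TCP socket instead of negotiating per file, with a reader thread feeding a bounded queue so disk reads and network writes proceed independently. The host, port, buffer size and file names are illustrative, and this omits FDT's rate matching and per-device threading.

# Minimal sketch of FDT's streaming idea (FDT itself is Java and far more
# complete): one TCP connection, files streamed back to back with no per-file
# protocol start/stop, and a reader thread decoupled from the sender.
import queue
import socket
import threading

HOST, PORT = "receiver.example.org", 7000   # illustrative endpoint
BUF = 4 * 1024 * 1024                       # 4 MB buffers

buffers = queue.Queue(maxsize=16)           # bounded: disk reads cannot run far ahead


def reader(paths):
    """Read every file into fixed-size buffers; FDT would use one thread per
    physical device, a single thread keeps this sketch short."""
    for path in paths:
        with open(path, "rb") as f:
            while chunk := f.read(BUF):
                buffers.put(chunk)
    buffers.put(None)                        # end-of-stream marker


def send(paths):
    threading.Thread(target=reader, args=(paths,), daemon=True).start()
    with socket.create_connection((HOST, PORT)) as sock:
        while (chunk := buffers.get()) is not None:
            sock.sendall(chunk)              # continuous flow on one open socket


# Illustrative usage; endpoint and file names are made up.
send(["/data/file1.root", "/data/file2.root"])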
SC06 BWC: stable disk-to-disk flows Tampa-Caltech: 10-to-10 and 8-to-8 1U server pairs for 9 + 7 = 16 Gbps; then solid overnight, using one 10G link
17.77 Gbps BWC peak; + 8.6 Gbps to and from Korea
By SC07: ~70-100 Gbps per rack of low cost 1U servers
I. Legrand
Forward to 2014: Long-distance Wide Area 100G Data Transfers
Caltech SC'13 Demo: solid 99-100G throughput on one 100G wave; up to 325G WAN traffic
BUT: using 100G infrastructure efficiently for production revealed several challenges, mostly end-system related (an IRQ-affinity sketch follows below):
   Need IRQ affinity tuning
   Multi-core support (multi-threaded applications)
   Storage controller limitations – mainly SW driver
   + CPU-controller-NIC flow control?
It is increasingly easy to saturate 100G infrastructure with well-prepared demonstration equipment, using aggregated traffic of several hosts
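IRQ affinity tuning is normally done with vendor scripts or by hand; the snippet below is a minimal Python sketch of the idea, pinning each of a NIC's queue interrupts to its own core on the NUMA node closest to the card. The interface name and core list are illustrative, and the naming in /proc/interrupts varies by driver.

# Minimal sketch of IRQ affinity tuning for a high-speed NIC (normally done
# with vendor scripts such as the Mellanox/Intel set_irq_affinity helpers).
# Interface name and core list are illustrative; run as root on the host.
import re

IFACE = "eth2"                      # the 40G/100G NIC (illustrative)
CORES = [0, 1, 2, 3, 4, 5, 6, 7]    # cores on the NUMA node closest to the NIC


def nic_irqs(iface):
    """Find IRQ numbers whose /proc/interrupts entry mentions the interface
    (e.g. 'eth2-TxRx-0'); naming depends on the NIC driver."""
    irqs = []
    with open("/proc/interrupts") as f:
        for line in f:
            if iface in line:
                irqs.append(int(re.match(r"\s*(\d+):", line).group(1)))
    return irqs


for i, irq in enumerate(nic_irqs(IFACE)):
    core = CORES[i % len(CORES)]
    # smp_affinity_list takes a CPU list; spread one queue's interrupts per core
    with open("/proc/irq/%d/smp_affinity_list" % irq, "w") as f:
        f.write(str(core))
    print("IRQ %d -> core %d" % (irq, core))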
70-74 Gbps Caltech – Internet2 – ANA100 – CERN
Note: single server, multiple TCP streams using the FDT tool
Network Path Layout: Caltech (CHOPIN: CC-NIE) – CENIC – Internet2 – ANA100 – Amsterdam (SURFnet) – CERN (US LHCNet)
[Diagram: Caltech, Pasadena (CHOPIN 100G: servers Sandy01, Sandy03, Echo-6 and Echo-7 on a Brocade 100GE switch, with 100G and 2-3 x 40G links, Cisco 15454) – CENIC – Internet2 AL2S – ANA – Caltech @ CERN (Cisco 15454, 100G)]
Azher Mughal, Ramiro Voicu
CHOPIN: 100G Advanced Networking + Science
Driver targets for 2014-15: LIGO Scientific Collab.; astro sky surveys; VOs; geodetic + seismic nets; genomics: on-chip gene sequencing
100G TCP Tests CERN – Caltech This Week (Cont'd)
Peaks ~83 Gbps on some AL2S segments
You need strong and willing partners, + engagement: Caltech Campus, CENIC, Internet2, ANA-100, SURFnet, CERN
100G TCP Tests CERN – Caltech; An Issue
Server 1: ~58 Gbps; Server 2: only ~12 Gbps
Server 2: newer generation (E5 2690V2 Ivy Bridge), same chassis as Server 1; issue with the newer CPUs and the Mellanox 40GE NICs; engaged with the vendors (Mellanox, Intel, LSI)
Expect further improvements once this issue is resolved
Lessons learned: need a strong team with the right talents, a systems approach and especially strong partnerships: regional, national, global; manufacturers
CHOPIN Network Layout (CC-NIE Grant)
100GE backbone capacity, operational
External connectivity to major carriers, including CENIC, ESnet, Internet2 and PacWave
LIGO and IPAC are in the process of joining, using 10G and 40G links
CHOPIN WAN Connections
   External connectivity to CENIC, ESnet, Internet2 and PacWave
   Able to create Layer 2 paths using the Internet2 OESS portal over the AL2S US footprint
   Dynamic circuits through Internet2 ION over the 100GE path
CHOPIN – CMS Tier2 Integration
   Caltech CMS fully integrated with the 100GE backbone
   IP peering with Internet2 and UFL at 100GE ... ready for the next LHC run
   Current peaks are around 8 Gbps
Key Issue and Approach to a Solution: Next Generation System for LHC + Other Fields
Present Solutions will not scale Beyond LHC Run2
We need: an agile architecture exploiting globally distributed grid, cloud, specialized (e.g. GPU) & opportunistic computing resources
A Services System that moves the data flexibly and dynamically, and behaves coherently
Examples do exist, with smaller but still very large scope
A pervasive, agile autonomous agent architecture that deals with complexity
Developed by talented system developers with a deep appreciation of networks
[Screenshots: MonALISA views of grid job lifelines, grid topology, and automated transfers on dynamic networks in the ALICE Grid]
Key Issues for ANSE
ANSE is in its second year. We should develop the timeline and milestones for the remainder of the project
We should identify and clearly delineate a strong set of deliverables to improve data management and processing during LHC Run 2
We should communicate and discuss these with the experiments, to get their feedback and engagement
We need to deal with a few key issues:
   Dynamic circuits: DYNES. We need a clear path forward; can we solve some of the problems with SDN?
   perfSONAR, and filling in our monitoring needs (e.g. with ML?)
   How to integrate high throughput methods in PanDA, and get high throughput (via PhEDEx) into production in CMS
We have some bigger issues arising, and we need to discuss these among the PIs and project leaders during this meeting