Advanced Network Services for Experiments (ANSE)
Introduction: Progress, Outlook and Issues
Harvey B Newman, California Institute of Technology
ANSE Annual Meeting, Vanderbilt
May 14, 2014
ANSE: Advanced Network Services for (LHC) Experiments
NSF CC-NIE Funded, 4 US Institutes: Caltech, Vanderbilt, Michigan, UT Arlington. A US ATLAS / US CMS Collaboration
Goal: provide more efficient, deterministic workflows
Method: Interface advanced network services, including dynamic circuits, with the LHC data management systems: PanDA in (US) ATLAS, PhEDEx in (US) CMS
Includes leading personnel for the data production systems: Kaushik De (PanDA Lead), Tony Wildish (PhEDEx Lead)
Performance measurements with PhEDEx and FDT for CMS
FDT sustained rates: ~1500 MB/sec; average over 24 hrs: ~1360 MB/sec
Difference due to delay in starting jobs; bumpy plot due to binning and job size
[Plots: 24-hour throughput reported by PhEDEx and throughput as reported by MonALISA, in MB/sec, 1h moving average]
PhEDEx testbed in ANSE
[Diagram: PhEDEx central agents and Oracle DB @ CERN; multiple front-ends @ CERN and Vanderbilt; generic PhEDEx sites (T2_ANSE_aaa ... T2_ANSE_zzz) with site agents, storage elements and FTS/FDT/SRM backends; storage nodes sandy01-ams / wood1-ams (T2_ANSE_Amsterdam) and sandy01-gva / hermes2 (T2_ANSE_Geneva) connected by a high speed WAN link]
T2_ANSE_Geneva & T2_ANSE_Amsterdam
• High capacity link with dynamic circuit creation between storage nodes
• PhEDEx and storage nodes separate
• 4x4 SSD RAID 0 arrays, 16 physical CPU cores / machine
CMS: PhEDEx and Dynamic Circuits
[Diagram: storage nodes sandy01-ams / wood1-ams (T2_ANSE_Amsterdam) and sandy01-gva / hermes2 (T2_ANSE_Geneva), connected by both a shared path and a dedicated high speed WAN circuit]
[Plot: PhEDEx throughput on a shared path, with 5 Gbps of UDP cross traffic: seamless switchover, no interruption of service]
Latest efforts: integrating circuit awareness into the FileDownload agent (a minimal sketch of the control logic follows below):
• Prototype is backend agnostic; no modifications to the PhEDEx DB
• All control logic is in the FileDownload agent
• Transparent to all other PhEDEx instances
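The slides do not show the agent code itself; the following is a minimal Python sketch of the control-logic idea described above, i.e. deciding inside the download agent whether to request a dynamic circuit for a transfer backlog and falling back to the shared path otherwise. The names (CircuitManager-style request_circuit, queued file objects, the 50 GB threshold) are hypothetical illustrations, not the actual PhEDEx/FDT interfaces, which live in the FileDownload agent (Perl).

# Hypothetical sketch: circuit-aware decision inside a download agent.
# All names and thresholds are illustrative; the real prototype is
# backend agnostic and keeps all control logic in the FileDownload agent.

CIRCUIT_THRESHOLD_BYTES = 50 * 1024**3   # only worth a circuit for large backlogs
CIRCUIT_SETUP_TIMEOUT_S = 120            # fall back if provisioning is too slow


def choose_path(transfer_queue, circuit_mgr, src, dst):
    """Return the path (circuit or shared) to use for this batch of transfers."""
    backlog = sum(f.size for f in transfer_queue)

    if backlog < CIRCUIT_THRESHOLD_BYTES:
        return "shared"                  # small backlog: not worth the setup cost

    try:
        # Ask a circuit controller (e.g. an NSI/OSCARS front-end) for a
        # point-to-point circuit between the two storage endpoints.
        circuit = circuit_mgr.request_circuit(src, dst,
                                              timeout=CIRCUIT_SETUP_TIMEOUT_S)
        return circuit                   # transfers are re-routed onto the circuit
    except TimeoutError:
        # Seamless fallback: keep transferring on the shared path,
        # exactly as if no circuit infrastructure existed.
        return "shared"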
[Plot: PhEDEx transfer rates (MB/sec, 1h moving average) on a dedicated path, testing circuit integration into the Download agent]
Using dynamic circuits in PhEDEx allows for more deterministic workflows, useful for co-scheduling CPU with data movement
Vlad Lapadatescu, Tony Wildish
PanDA: Production and Distributed Analysis (Kaushik De)
[Plot: 25M jobs at > 100 sites now completed each month; 6x growth in 3 years (2010-13); reaching a new plateau]
STEP 1: Import network information into PanDA (a minimal aggregation sketch follows below)
STEP 2: Use network information directly to optimize workflow for data transfer/access, at a higher level than individual transfers alone
Start with simple use cases leading to measurable improvements in workflow / user experience
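STEP 1 is about getting network measurements into a form the workload manager can consume. Below is a minimal Python sketch, not PanDA code, that folds per-measurement throughput samples into a source/destination table; the sample values, site names and the aggregation choice (median of recent samples) are all illustrative.

# Hypothetical sketch of STEP 1: condense raw throughput measurements
# (e.g. from perfSONAR or transfer logs) into a site-pair table that a
# workload manager could consult. Sample values are made up.
from collections import defaultdict
from statistics import median

# (source, destination, MB/s) samples, e.g. from the last 24 hours
samples = [
    ("BNL",  "MWT2",  610.0),
    ("BNL",  "MWT2",  580.0),
    ("BNL",  "AGLT2", 300.0),
    ("CERN", "MWT2",  120.0),
    ("CERN", "MWT2",  140.0),
]

by_pair = defaultdict(list)
for src, dst, rate in samples:
    by_pair[(src, dst)].append(rate)

# One robust number per site pair; a real system would also keep
# timestamps, sample counts and a quality flag.
cost_table = {pair: median(rates) for pair, rates in by_pair.items()}

for (src, dst), rate in sorted(cost_table.items()):
    print("%-5s -> %-6s : %6.1f MB/s" % (src, dst, rate))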
USE CASES (Kaushik De); a minimal site-selection sketch follows this list
1. Faster User Analysis
   Analysis jobs normally go to sites with local data: sometimes leads to long wait times due to queuing
   Could use network information to assign work to 'nearby' sites with idle CPUs and good connectivity
2. Optimal Cloud Selection
   Tier2s are connected to Tier1 "Clouds" manually by the ops team (may be attached to multiple Tier1s)
   To be automated using network info: algorithm under test
3. PD2P = PanDA Dynamic Data Placement
   Asynchronous, usage-based: repeated use of data or backlog in processing → make additional copies; re-brokerage of queues → new data locations
   PD2P is well suited for network integration: use network info for strategic replication + site selection (to be tested soon); try SDN provisioning since this usually involves large datasets
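Use case 1 lends itself to a short illustration. Below is a minimal Python sketch, not PanDA code, of brokering work to a 'nearby' site when the sites holding the data are busy; the site list, queue depths and the cost function combining throughput and queue length are hypothetical.

# Hypothetical sketch of network-aware brokerage (use case 1).
# Inputs would come from the workload manager (queue depths) and from
# network monitoring (site-pair throughput); values here are made up.

sites = {
    #  name        queued jobs   has input data   MB/s from data source
    "SITE_A": {"queued": 900, "has_data": True,  "throughput": None},
    "SITE_B": {"queued":  20, "has_data": False, "throughput": 800.0},
    "SITE_C": {"queued":  50, "has_data": False, "throughput":  40.0},
}

MIN_THROUGHPUT = 100.0   # MB/s: below this, remote access is not worthwhile
QUEUE_PENALTY = 2.0      # minutes of estimated wait per queued job (illustrative)


def broker(sites):
    """Pick the site with the lowest estimated time-to-start, letting
    well-connected sites without the data compete with busy data hosts."""
    best, best_cost = None, float("inf")
    for name, s in sites.items():
        if not s["has_data"] and (s["throughput"] or 0.0) < MIN_THROUGHPUT:
            continue                        # poorly connected: require local data
        cost = s["queued"] * QUEUE_PENALTY  # waiting cost
        if not s["has_data"]:
            cost += 15.0                    # flat penalty for staging / remote read
        if cost < best_cost:
            best, best_cost = name, cost
    return best


print(broker(sites))   # -> SITE_B: idle and well connected beats the busy data host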
DYNES (NSF MRI-R2): Dynamic Circuits Nationwide: Led by Internet2 with Caltech
DYNES is extending circuit capabilities to ~50 US campuses; this turns out to be nontrivial
Will be an integral part of the point-to-point service in LHCONE
Partners: I2, Caltech, Michigan, Vanderbilt; working with I2 and ESnet on dynamic circuits issues and software
http://internet2.edu/dynes
Extended the OSCARS scope; Transition: DRAGON to PSS, OESS
Challenges Encountered
perfSONAR deployment status
   For meaningful results, we need most LHC computing sites equipped with perfSONAR nodes; this is work in progress
   Easy-to-use perfSONAR API: was missing, but a REST API has been made available recently (a minimal query sketch follows at the end of this list)
Inter-domain dynamic circuits
   Intra-domain systems have been in production for some time, e.g. ESnet has used OSCARS as a production tool for several years; OESS (OpenFlow-based) is also in production, single domain
   Inter-domain circuit provisioning continues to be hard: implementations are fragile; error recovery tends to require manual intervention
   Holistic approach needed: pervasive monitoring + tracking of configuration state changes; intelligent clean-up and timeout handling
   The NSI framework needs faster standardization, adoption and implementation among the major networks, or a future SDN-based solution: for example OpenFlow and OpenDaylight
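As an illustration of the REST API mentioned above, here is a minimal Python sketch that pulls recent throughput results for one site pair from a perfSONAR measurement archive (esmond). The archive host and test endpoints are made up, and the endpoint/parameter names follow the esmond archive convention but should be checked against the deployed version.

# Minimal sketch: query a perfSONAR measurement archive (esmond) REST API
# for throughput between two test hosts. Host names are illustrative.
import requests

ARCHIVE = "http://ps-archive.example.org/esmond/perfsonar/archive/"

params = {
    "source": "ps-bw.site-a.example.org",
    "destination": "ps-bw.site-b.example.org",
    "event-type": "throughput",
    "time-range": 86400,          # last 24 hours
}

for metadata in requests.get(ARCHIVE, params=params, timeout=30).json():
    # Each metadata record lists the event types measured for this test;
    # follow the throughput one to get the actual time series.
    for event in metadata.get("event-types", []):
        if event.get("event-type") == "throughput":
            series = requests.get("http://ps-archive.example.org" +
                                  event["base-uri"], timeout=30).json()
            for point in series:
                print(point["ts"], point["val"])   # unix time, bits/sec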
Some of the DYNES Challenges Encountered; Approaches to a Solution
Some of the issues encountered in both the control and data planes came from immaturity of the implementations at the time:
   A failed request left configuration on switches, causing subsequent failures
   Failure notifications took too long to arrive, blocking serialized requests
   Error messages were often erratic, making it hard to find the root cause of a problem
   End-to-end VLAN translation did not always result in a functional data plane
   Static data plane configuration needs changes upon upgrades
Grid certificate validity (1 year) across 40+ sites led to frequent expiration issues (not DYNES specific!)
   Solution: we use Nagios to monitor certificate state at all DYNES sites, generating early warnings to the local administrators (a minimal expiry-check sketch follows below)
   An alternate solution would be to create a DYNES CA and administer certificates in a coordinated way; this requires a responsible party
DYNES path forward:
   Working with a selected subset of sites on getting automated tests failure free
   Taking input from these, propagate changes to other sites, and/or deploy NSI
   If funding allows (future proposal): an SDN based multi-domain solution
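The actual monitoring is done with Nagios checks; the snippet below is a minimal stand-alone Python sketch of the same idea, reading a host certificate from disk and warning when it is close to expiry. The certificate path and the 30-day threshold are illustrative.

# Minimal sketch of a certificate-expiry check (the production setup uses
# Nagios plugins instead). Path and warning threshold are illustrative.
import datetime
import sys

from cryptography import x509

CERT_PATH = "/etc/grid-security/hostcert.pem"   # typical grid host cert location
WARN_DAYS = 30                                  # warn a month before expiry

with open(CERT_PATH, "rb") as f:
    cert = x509.load_pem_x509_certificate(f.read())

days_left = (cert.not_valid_after - datetime.datetime.utcnow()).days

if days_left < 0:
    print("CRITICAL: host certificate expired %d days ago" % -days_left)
    sys.exit(2)    # Nagios exit code for CRITICAL
elif days_left < WARN_DAYS:
    print("WARNING: host certificate expires in %d days" % days_left)
    sys.exit(1)    # Nagios exit code for WARNING
else:
    print("OK: host certificate valid for %d more days" % days_left)
    sys.exit(0)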
Efficient Long Distance End-to-End Throughput from the Campus Over 100G Networks
Harvey Newman, Artur Barczyk, Azher Mughal
California Institute of Technology
NSF CC-NIE Meeting, Washington DC
April 30, 2014
SC06 BWC: Fast Data Transfer (FDT) http://monalisa.cern.ch/FDT
An easy-to-use open source Java application that runs on all major platforms
Uses an asynchronous multithreaded system to achieve smooth, linear data flow (a minimal sketch of the streaming idea follows below):
   Streams a dataset (list of files) continuously through an open TCP socket
   No protocol start/stops between files
   Sends buffers at a rate matched to the monitored capability of the end-to-end path
   Uses independent threads to read & write on each physical device
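FDT itself is Java; the following is a minimal Python sketch of the core idea named above, streaming a list of files back to back through one open TCP socket instead of negotiating per file, with a reader thread feeding a bounded queue so disk reads and network writes proceed independently. The host, port, buffer size and file names are illustrative, and this omits FDT's rate matching and per-device threading.

# Minimal sketch of FDT's streaming idea (FDT itself is Java and far more
# complete): one TCP connection, files streamed back to back with no per-file
# protocol start/stop, and a reader thread decoupled from the sender.
import queue
import socket
import threading

HOST, PORT = "receiver.example.org", 7000   # illustrative endpoint
BUF = 4 * 1024 * 1024                       # 4 MB buffers

buffers = queue.Queue(maxsize=16)           # bounded: disk reads cannot run far ahead


def reader(paths):
    """Read every file into fixed-size buffers; FDT would use one thread per
    physical device, a single thread keeps this sketch short."""
    for path in paths:
        with open(path, "rb") as f:
            while chunk := f.read(BUF):
                buffers.put(chunk)
    buffers.put(None)                        # end-of-stream marker


def send(paths):
    threading.Thread(target=reader, args=(paths,), daemon=True).start()
    with socket.create_connection((HOST, PORT)) as sock:
        while (chunk := buffers.get()) is not None:
            sock.sendall(chunk)              # continuous flow on one open socket


# Illustrative usage; endpoint and file names are made up.
send(["/data/file1.root", "/data/file2.root"])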
SC06 BWC: stable disk-to-disk flows Tampa-Caltech: 10-to-10 and 8-to-8 1U server pairs for 9 + 7 = 16 Gbps; then solid overnight, using one 10G link
17.77 Gbps BWC peak; + 8.6 Gbps to and from Korea
By SC07: ~70-100 Gbps per rack of low cost 1U servers
I. Legrand
Forward to 2014: Long-distance Wide Area 100G Data Transfers
Caltech SC'13 Demo: solid 99-100G throughput on one 100G wave; up to 325G WAN traffic
BUT: using 100G infrastructure efficiently for production revealed several challenges, mostly end-system related (an IRQ-affinity sketch follows below):
   Need IRQ affinity tuning
   Multi-core support (multi-threaded applications)
   Storage controller limitations – mainly SW driver
   + CPU-controller-NIC flow control?
It is increasingly easy to saturate 100G infrastructure with well-prepared demonstration equipment, using aggregated traffic of several hosts
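IRQ affinity tuning is normally done with vendor scripts or by hand; the snippet below is a minimal Python sketch of the idea, pinning each of a NIC's queue interrupts to its own core on the NUMA node closest to the card. The interface name and core list are illustrative, and the naming in /proc/interrupts varies by driver.

# Minimal sketch of IRQ affinity tuning for a high-speed NIC (normally done
# with vendor scripts such as the Mellanox/Intel set_irq_affinity helpers).
# Interface name and core list are illustrative; run as root on the host.
import re

IFACE = "eth2"                      # the 40G/100G NIC (illustrative)
CORES = [0, 1, 2, 3, 4, 5, 6, 7]    # cores on the NUMA node closest to the NIC


def nic_irqs(iface):
    """Find IRQ numbers whose /proc/interrupts entry mentions the interface
    (e.g. 'eth2-TxRx-0'); naming depends on the NIC driver."""
    irqs = []
    with open("/proc/interrupts") as f:
        for line in f:
            if iface in line:
                irqs.append(int(re.match(r"\s*(\d+):", line).group(1)))
    return irqs


for i, irq in enumerate(nic_irqs(IFACE)):
    core = CORES[i % len(CORES)]
    # smp_affinity_list takes a CPU list; spread one queue's interrupts per core
    with open("/proc/irq/%d/smp_affinity_list" % irq, "w") as f:
        f.write(str(core))
    print("IRQ %d -> core %d" % (irq, core))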
70-74 Gbps Caltech – Internet2 – ANA100 – CERN
Note: single server, multiple TCP streams using the FDT tool
Network Path Layout: Caltech (CHOPIN: CC-NIE) – CENIC – Internet2 – ANA100 – Amsterdam (SURFnet) – CERN (US LHCNet)
[Diagram: Caltech, Pasadena (CHOPIN 100G: servers Sandy01, Sandy03, Echo-6 and Echo-7 on a Brocade 100GE switch, with 100G and 2-3 x 40G links, Cisco 15454) – CENIC – Internet2 AL2S – ANA – Caltech @ CERN (Cisco 15454, 100G)]
Azher Mughal, Ramiro Voicu
CHOPIN: 100G Advanced Networking + Science
Driver targets for 2014-15: LIGO Scientific Collab.; astro sky surveys; VOs; geodetic + seismic nets; genomics: on-chip gene sequencing
100G TCP Tests CERN – Caltech This Week (Cont'd)
Peaks ~83 Gbps on some AL2S segments
You need strong and willing partners, + engagement: Caltech Campus, CENIC, Internet2, ANA-100, SURFnet, CERN
100G TCP Tests CERN – Caltech; An Issue
Server 1: ~58 Gbps; Server 2: only ~12 Gbps
Server 2: newer generation (E5 2690V2 Ivy Bridge), same chassis as Server 1; issue with the newer CPUs and the Mellanox 40GE NICs; engaged with the vendors (Mellanox, Intel, LSI)
Expect further improvements once this issue is resolved
Lessons learned: need a strong team with the right talents, a systems approach and especially strong partnerships: regional, national, global; manufacturers
CHOPIN Network Layout (CC-NIE Grant)
100GE backbone capacity, operational
External connectivity to major carriers, including CENIC, ESnet, Internet2 and PacWave
LIGO and IPAC are in the process of joining, using 10G and 40G links
CHOPIN WAN Connections
   External connectivity to CENIC, ESnet, Internet2 and PacWave
   Able to create Layer 2 paths using the Internet2 OESS portal over the AL2S US footprint
   Dynamic circuits through Internet2 ION over the 100GE path
CHOPIN – CMS Tier2 Integration
   Caltech CMS fully integrated with the 100GE backbone
   IP peering with Internet2 and UFL at 100GE ... ready for the next LHC run
   Current peaks are around 8 Gbps
Key Issue and Approach to a Solution: Next Generation System for LHC + Other Fields
Present Solutions will not scale Beyond LHC Run2
We need: an agile architecture exploiting globally distributed grid, cloud, specialized (e.g. GPU) & opportunistic computing resources
A Services System that moves the data flexibly and dynamically, and behaves coherently
Examples do exist, with smaller but still very large scope
A pervasive, agile autonomous agent architecture that deals with complexity
Developed by talented system developers with a deep appreciation of networks
[Screenshots: MonALISA views of grid job lifelines, grid topology, and automated transfers on dynamic networks in the ALICE Grid]
Key Issues for ANSE
ANSE is in its second year. We should develop the timeline and milestones for the remainder of the project
We should identify and clearly delineate a strong set of deliverables to improve data management and processing during LHC Run 2
We should communicate and discuss these with the experiments, to get their feedback and engagement
We need to deal with a few key issues:
   Dynamic circuits: DYNES. We need a clear path forward; can we solve some of the problems with SDN?
   perfSONAR, and filling in our monitoring needs (e.g. with ML?)
   How to integrate high throughput methods in PanDA, and get high throughput (via PhEDEx) into production in CMS
We have some bigger issues arising, and we need to discuss these among the PIs and project leaders during this meeting