The Multicore-aware Data Transfer Middleware (MDTM) Project
L. Zhang, T. Li, S. Jin, D. Katramatos, L. Carpenter, P. DeMar, D. Yu (Co-PI), W. Wu (PI)
2015 Technology Exchange, Cleveland, OH, Oct 4-7, 2015
Funded by: ASCR/DOE Network Research Program
Agenda
• The MDTM Project background
• Part 1: Multicore-Aware Data Transfer Middleware (MDTM) – Liang Zhang, FNAL
• Part 2: MDTM Data Transfer Applications (mdtmBBCP) – Dantong Yu, BNL
• Integration: Part 1 + Part 2
• Future work
Problem Space
Multicore/manycore has become the norm for high-performance computing.
Existing data movement tools (e.g., BBCP, GridFTP) suffer from major inefficiencies when run on multicore systems.
These inefficiencies ultimately result in performance bottlenecks on end systems. Such bottlenecks also impede the effective use of advanced high-bandwidth networks.
A simple inefficiency case …
[Figure: two views of a Data Transfer Node (DTN) with two NUMA nodes joined by an interconnect; the NIC and storage hang off IOH1 on NUMA node 1, a GPU off IOH2 on NUMA node 2. Scheduling without I/O locality places the data transfer thread on NUMA node 2, forcing remote I/O access. How can we improve? Scheduling with I/O locality places the thread on NUMA node 1, giving local I/O access.]
General-purpose OSes have only limited support for I/O locality!
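For contrast, here is the kind of work an application is otherwise left to do by hand. A minimal sketch, assuming Linux with libnuma installed (link with -lnuma) and a NIC named eth0 (a hypothetical interface name): it pins the calling thread and its memory allocations to the NIC's local NUMA node.

```c
/* Sketch only (not MDTM code): pin the calling thread to the NUMA
 * node local to a NIC, via Linux sysfs and libnuma. Assumes Linux,
 * libnuma (-lnuma), and a NIC named "eth0" (hypothetical). */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }

    /* sysfs exposes the NUMA node a PCI NIC is attached to. */
    FILE *f = fopen("/sys/class/net/eth0/device/numa_node", "r");
    int node = -1;
    if (f) {
        fscanf(f, "%d", &node);
        fclose(f);
    }
    if (node < 0)
        node = 0;   /* -1 means "no locality info"; fall back to node 0 */

    /* Restrict this thread's CPUs and memory allocations to that node. */
    numa_run_on_node(node);
    numa_set_preferred(node);

    printf("bound to NUMA node %d (local to eth0)\n", node);
    return 0;
}
```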
Our solution
• The Multicore-aware Data Transfer Middleware (MDTM) Project
– Collaborative effort by Fermilab and Brookhaven National Laboratory
– Funded by DOE's Office of Advanced Scientific Computing Research (ASCR)
– A three-year research project
MDTM aims to accelerate data movement toolkits on multicore systems
MDTM Architecture
[Figure: layered architecture — MDTM Data Transfer Applications/Tools sit on MDTM Middleware Services, which sit on OS Services, which sit on Hardware; each layer reaches the one below through access services.]
MDTM data transfer application
• Data transfer applications that use MDTM middleware services
MDTM middleware services
• A user-space scheduler that schedules and assigns system resources based on the needs of data transfer applications. It also takes into account other factors, including NUMA topology, I/O locality, and QoS.
MDTM targets Data Transfer Nodes (DTNs)
[Figure: DTN hardware layout — NUMA nodes 1 through n, each with a processor (multiple cores), memory, and an IOH connected via QPI; per-node PCIe-attached NICs lead to WAN networks (front end) and to local storage (back end); a PCI-E controller attaches local disks; the nodes are joined by a system bus/switching fabric.]
Each DTN features one or multiple NUMA nodes. Each NUMA node features one or multiple processors, each consisting of multiple cores (a topology-listing sketch follows).
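A minimal sketch of enumerating that topology, assuming the hwloc library is available (link with -lhwloc); this is illustrative, not the MDTM profiler itself.

```c
/* Sketch, assuming hwloc (-lhwloc): count the NUMA nodes and cores
 * of a DTN -- the topology MDTM needs to know about. */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int nodes = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);
    int cores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    printf("DTN: %d NUMA node(s), %d core(s)\n", nodes, cores);

    hwloc_topology_destroy(topo);
    return 0;
}
```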
An MDTM-based DTN: Storage and Networking Architecture
[Figure: an MDTM-based DTN running the MDTM middleware and MDTM apps, attached to local storage (RAID, SSD), direct-connected storage (Fibre Channel, Infiniband), and a distributed file system (Infiniband, Ethernet) over Infiniband or 10/40 GE links, with one or multiple 10/40 GE links through a switch/router to the WAN.]
MDTM Storage Architecture
• Local storage (RAID, SSD)
• Direct-connected storage (FC, IB)
• Distributed file system (IB, 10/40 GE)
MDTM Networking Architecture
• One or multiple WAN links for data transfer
– Via 10/40 GE NICs
• One or multiple LAN links for storage access
– Via 10/40 GE NICs, IB adaptors, FC adaptors
Part I: Multicore-Aware Data Transfer Middleware (MDTM)
L. Zhang, FNAL
MDTM Middleware
• A user-space resource scheduler that harnesses multicore parallelism to scale data movement toolkits on multicore systems
– Data transfer-centric scheduling and resource management based on the needs and requirements of data transfer applications
– NUMA topology-aware scheduler
– Enabling efficient network I/O on multicore systems
– Supporting QoS mechanisms to allow differentiated data transfers
MDTM Middleware: System Profiling
• Hardware topology and system configuration
– System calls
– Third-party libraries such as libpci
• System status
– Core workload detection (see the sketch after this list)
– Intensive thread detection
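As an illustration of core workload detection (one plausible approach, not MDTM's actual profiler), the sketch below samples Linux's /proc/stat twice and reports each core's busy fraction over the interval.

```c
/* Sketch: per-core busy fraction from two samples of /proc/stat.
 * Linux only; not MDTM's profiler. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAX_CORES 256

/* Fill busy[] and total[] jiffy counters for each per-core "cpuN" line. */
static int read_cores(unsigned long long busy[], unsigned long long total[])
{
    FILE *f = fopen("/proc/stat", "r");
    char line[512];
    int n = 0;
    if (!f) return -1;
    while (fgets(line, sizeof line, f) && n < MAX_CORES) {
        unsigned long long v[8] = {0};
        /* Skip the aggregate "cpu " line; keep per-core "cpuN" lines. */
        if (strncmp(line, "cpu", 3) != 0 || line[3] < '0' || line[3] > '9')
            continue;
        sscanf(line, "cpu%*d %llu %llu %llu %llu %llu %llu %llu %llu",
               &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]);
        total[n] = v[0]+v[1]+v[2]+v[3]+v[4]+v[5]+v[6]+v[7];
        busy[n]  = total[n] - v[3] - v[4];     /* minus idle and iowait */
        n++;
    }
    fclose(f);
    return n;
}

int main(void)
{
    unsigned long long b0[MAX_CORES], t0[MAX_CORES];
    unsigned long long b1[MAX_CORES], t1[MAX_CORES];
    read_cores(b0, t0);
    sleep(1);                                  /* sampling interval */
    int n = read_cores(b1, t1);
    for (int i = 0; i < n; i++) {
        unsigned long long dt = t1[i] - t0[i];
        printf("core %d: %.1f%% busy\n", i,
               dt ? 100.0 * (b1[i] - b0[i]) / dt : 0.0);
    }
    return 0;
}
```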
MDTM Middleware: Scheduling
[Figure: device graph — vertices are devices (CPUs/cores, PCI hubs/bridges, NICs, disks); edges are the connections between devices.]
• Each connection is associated with a cost value reflecting scheduling factors such as distance and traffic throughput
• Apply Dijkstra's algorithm to find the lowest-cost path from CPU cores to NICs/disks
• Pick the core associated with the lowest-cost path (a toy sketch follows)
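A toy sketch of this scheduling step: a hand-made device graph (the vertices and cost values are invented for illustration, not MDTM's real graph) with two cores, two PCI hubs, a NIC, and a disk. Dijkstra's algorithm finds the cheapest core-to-NIC path, and the core on the cheapest path wins.

```c
/* Toy device graph + Dijkstra: pick the core with the cheapest
 * path to the NIC. Costs are invented for illustration. */
#include <limits.h>
#include <stdio.h>

#define V 6            /* 0,1: cores; 2,3: PCI hubs; 4: NIC; 5: disk */
#define INF INT_MAX

static const int cost[V][V] = {       /* 0 = no direct connection */
    /*        c0  c1  h0  h1 nic dsk */
    /* c0 */ { 0,  0, 10, 40,  0,  0},
    /* c1 */ { 0,  0, 40, 10,  0,  0},
    /* h0 */ {10, 40,  0,  0,  5,  0},
    /* h1 */ {40, 10,  0,  0,  0,  5},
    /* nic*/ { 0,  0,  5,  0,  0,  0},
    /* dsk*/ { 0,  0,  0,  5,  0,  0},
};

/* Plain O(V^2) Dijkstra; returns the shortest distance src -> dst. */
static int dijkstra(int src, int dst)
{
    int dist[V], done[V] = {0};
    for (int i = 0; i < V; i++) dist[i] = INF;
    dist[src] = 0;
    for (int k = 0; k < V; k++) {
        int u = -1;
        for (int i = 0; i < V; i++)
            if (!done[i] && (u < 0 || dist[i] < dist[u])) u = i;
        if (dist[u] == INF) break;
        done[u] = 1;
        for (int v = 0; v < V; v++)
            if (cost[u][v] && dist[u] + cost[u][v] < dist[v])
                dist[v] = dist[u] + cost[u][v];
    }
    return dist[dst];
}

int main(void)
{
    int nic = 4, best = -1, bestcost = INF;
    for (int core = 0; core < 2; core++) {
        int c = dijkstra(core, nic);
        printf("core %d -> NIC: cost %d\n", core, c);
        if (c < bestcost) { bestcost = c; best = core; }
    }
    printf("schedule network I/O thread on core %d\n", best);
    return 0;
}
```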
MDTM Middleware Scheduling
[Figure: the MDTM daemon and MDTM applications exchange scheduling information through shared memory.]
MDTM Middleware Modules
• MDTM Daemon
– Acquiring and publishing system information
– Scheduling and binding application threads
– Communicating with MDTM consoles and apps
• MDTM API
– Interfacing with MDTM consoles and apps
– Communicating with the MDTM daemon
– Requesting and reading system information
• MDTM Console
– Giving customers access to system information and status
– Monitoring and development utility
[Figure: applications call the MDTM API; the MDTM API, MDTM daemon, and MDTM console form the middleware layer, running on the OS (Linux).]
MDTM Middleware R&D
– Multicore system profiling
– Data transfer-centric scheduling and resource management
– NUMA topology-aware scheduler
– Supporting core affinity for network and disk I/O
– Supporting NUMA-aware buffer pools (see the sketch after this list)
– Core partitioning on NUMA systems
– Intelligent memory management on NUMA systems
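A minimal sketch of the NUMA-aware buffer pool idea, assuming Linux with libnuma (-lnuma): one pool per NUMA node, pre-allocated with numa_alloc_onnode so that buffers are physically local to the node whose I/O threads will use them. Buffer sizes and counts here are arbitrary assumptions.

```c
/* Sketch: per-NUMA-node buffer pools so an I/O thread on node N
 * works with memory local to N. Assumes Linux + libnuma (-lnuma). */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

#define POOL_BUFS 8
#define BUF_SIZE  (4 << 20)   /* 4 MiB per buffer (arbitrary) */

typedef struct {
    void *buf[POOL_BUFS];
    int   node;
} buffer_pool;

static int pool_init(buffer_pool *p, int node)
{
    p->node = node;
    for (int i = 0; i < POOL_BUFS; i++) {
        /* Allocate physically on the requested NUMA node. */
        p->buf[i] = numa_alloc_onnode(BUF_SIZE, node);
        if (!p->buf[i]) return -1;
    }
    return 0;
}

static void pool_free(buffer_pool *p)
{
    for (int i = 0; i < POOL_BUFS; i++)
        if (p->buf[i]) numa_free(p->buf[i], BUF_SIZE);
}

int main(void)
{
    if (numa_available() < 0) return 1;
    int nodes = numa_max_node() + 1;
    buffer_pool *pools = calloc(nodes, sizeof *pools);
    for (int n = 0; n < nodes; n++)
        if (pool_init(&pools[n], n) == 0)
            printf("node %d: %d x %d MiB buffers ready\n",
                   n, POOL_BUFS, BUF_SIZE >> 20);
    for (int n = 0; n < nodes; n++) pool_free(&pools[n]);
    free(pools);
    return 0;
}
```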
Part II: MDTM Data Transfer Applications (mdtmBBCP)
Dantong Yu, BNL
mdtmBBCP Design Requirements
• Multicore aware ⇒ fine-granularity design, i.e., end-to-end data transfer must be split into a sequence of tasks, each of which is handled by dedicated threads
• I/O devices reside on different NUMA nodes ⇒ must minimize data migration overheads from storage to networks
• Users, transfer requests, and file transfers must be optimized globally and parallelized!
• Resource-aware scheduling and pre-allocation
mdtmBBCP Design
[Figure: mdtmBBCP pipeline — request/data preprocessing, thread/flow management, and data access and transmission behind a data transfer service interface; a storage I/O interface covers (a) local disks, (b) SAN (block devices over a Fibre Channel SAN), and (c) memdisk/flash disks (SSD/memdisk); an MDTM interface connects to the middleware; a control channel links a control agent to the destination host, and a data channel runs through the network stack.]
Key techniques:
• Metadata access
– Automatic preprocessing for various types of storage
– Knowledge of storage system performance via tests
• Obtain knowledge of the system layout (cores, disks, NICs, etc.)
• File grouping, sorting, load balancing
• Interfaces: file systems, storage, MDTM for layout
• Data structures: lists, sets, layout table, various statistics
Major Features in mdtmBBCP
• Resource pre-allocation
– I/O-centric thread allocation, for storage and network
– Shared buffer space
– NUMA awareness: cores, disks, NICs
• Request preprocessing (more details in extra slides)
– File grouping by I/O device type and location
– File sorting by disk offset (see the sketch after this list)
– Post-transfer data write reordering optimization
• Different methods for handling large and small files
– Large file striping: parallel processing of the data of a single large file
– Small file pipelining: one-by-one processing of small files. Note: multiple groups of files are processed using multiple pipelines
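A hedged sketch of file sorting by disk offset: on Linux, the FIEMAP ioctl reports a file's physical extents, and ordering small files by the physical offset of their first extent turns scattered reads into a near-sequential disk pass. This is one plausible realization of the idea, not mdtmBBCP's actual code.

```c
/* Sketch: sort file names by the physical offset of each file's
 * first extent (FIEMAP ioctl). Linux only; not mdtmBBCP's code. */
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Physical byte offset of the file's first extent, or 0 on failure. */
static unsigned long long first_extent_offset(const char *path)
{
    struct {
        struct fiemap fm;             /* request header */
        struct fiemap_extent ext;     /* room for one extent record */
    } req;
    unsigned long long off = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0) return 0;
    memset(&req, 0, sizeof req);
    req.fm.fm_length = ~0ULL;         /* map the whole file */
    req.fm.fm_extent_count = 1;       /* we only need the first extent */
    if (ioctl(fd, FS_IOC_FIEMAP, &req.fm) == 0 && req.fm.fm_mapped_extents > 0)
        off = req.fm.fm_extents[0].fe_physical;
    close(fd);
    return off;
}

static int by_offset(const void *a, const void *b)
{
    unsigned long long oa = first_extent_offset(*(const char **)a);
    unsigned long long ob = first_extent_offset(*(const char **)b);
    return (oa > ob) - (oa < ob);
}

int main(int argc, char **argv)
{
    /* Usage: ./sortfiles file1 file2 ...  prints the files in disk order. */
    qsort(argv + 1, argc - 1, sizeof(char *), by_offset);
    for (int i = 1; i < argc; i++)
        puts(argv[i]);
    return 0;
}
```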
Other Features in mdtmBBCP
• Third-party support and client/server mode
• Security with SSH support
• Automatic host system configuration setup
• Data transfer progress reporting
• Support for different I/O modes: direct I/O, asynchronous I/O (see the sketch after this list)
• Event-driven data transfer task processing
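A minimal direct I/O sketch, assuming Linux: O_DIRECT bypasses the page cache but requires the buffer address and transfer length to be aligned; the 4096-byte alignment used here is an assumption about the device's logical block size.

```c
/* Sketch: read a file with O_DIRECT (bypassing the page cache).
 * Linux-specific; alignment requirement is device-dependent. */
#define _GNU_SOURCE              /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ALIGN 4096               /* assumed logical block size */

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    /* Both the buffer address and the length must be aligned. */
    if (posix_memalign(&buf, ALIGN, ALIGN * 256)) return 1;

    ssize_t n = read(fd, buf, ALIGN * 256);
    printf("read %zd bytes, bypassing the page cache\n", n);

    free(buf);
    close(fd);
    return 0;
}
```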
mdtmBBCP R&D
• Metadata access
– Automatic preprocessing for various types of storage
– Knowledge of storage system performance via tests
• Retrieve system layout (cores, disks, NICs, etc.) for scheduling
• Implemented request preprocessing
– Request decomposition and regrouping into smaller tasks
– Task grouping for affinity binding and concurrency
– Task sorting for I/O locality and optimization
– Load balancing
– Improved performance on different storage media
• Implemented interfaces: file systems, storage, MDTM for layout
• Software design and data structures: object-oriented, lists and sets, layout table, various statistics
mdtmBBCP R&D
• Asynchronous request processing
– Serve all requests with pre-allocated and reusable thread pools (a toy sketch follows)
– Maximize file transfer concurrency
• Support for both large file striping and small file pipelining
• Progress reporting for data transfer jobs
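A toy sketch of the pre-allocated, reusable thread pool idea: a fixed set of POSIX worker threads drains a queue of transfer tasks, so no threads are created per request. Queue and worker counts are arbitrary; this is not mdtmBBCP's implementation.

```c
/* Sketch: a fixed worker pool draining a task queue (the
 * "pre-allocated, reusable thread pool" idea). POSIX threads. */
#include <pthread.h>
#include <stdio.h>

#define WORKERS 4
#define TASKS   16

static int queue[TASKS], head = 0, tail = 0;
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

static void *worker(void *arg)
{
    long id = (long)arg;
    for (;;) {
        pthread_mutex_lock(&mu);
        while (head == tail)              /* wait for work */
            pthread_cond_wait(&cv, &mu);
        int task = queue[head++ % TASKS];
        pthread_mutex_unlock(&mu);
        if (task < 0) return NULL;        /* shutdown sentinel */
        printf("worker %ld: transferring chunk %d\n", id, task);
    }
}

static void enqueue(int task)
{
    pthread_mutex_lock(&mu);
    queue[tail++ % TASKS] = task;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mu);
}

int main(void)
{
    pthread_t t[WORKERS];
    for (long i = 0; i < WORKERS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);

    /* Enqueue transfer tasks, then one shutdown sentinel per worker. */
    for (int i = 0; i < TASKS - WORKERS; i++)
        enqueue(i);
    for (int i = 0; i < WORKERS; i++)
        enqueue(-1);

    for (int i = 0; i < WORKERS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```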
Part III: Integration
How does MDTM work?
An MDTM application spawns three types of threads (see the sketch after this list):
– Management threads to handle user requests and management-related functions
– Dedicated disk/storage I/O threads to read/write from/to disks/storage
– Dedicated network I/O threads to send/receive data
An MDTM data transfer application accesses MDTM middleware services explicitly via APIs. In operation, an MDTM middleware daemon is launched; it supports two types of services:
– A query service allows an MDTM app to access system configuration and status
– A scheduling service assigns system resources based on the requirements of data transfer applications
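As a toy illustration of the three thread types, the sketch below spawns one management, one disk I/O, and one network I/O thread and pins each to its own core with POSIX thread affinity. The core numbers are arbitrary assumptions standing in for what the MDTM scheduler would actually assign; pthread_setaffinity_np is Linux-specific.

```c
/* Sketch: the three MDTM thread types, each pinned to a core.
 * Core numbers are placeholders for the scheduler's assignments. */
#define _GNU_SOURCE              /* for pthread_setaffinity_np */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

static void *mgmt_thread(void *arg)
{
    (void)arg;
    pin_to_core(0);                       /* assumed assignment */
    puts("mgmt: handling user requests");
    return NULL;
}

static void *disk_io_thread(void *arg)
{
    (void)arg;
    pin_to_core(1);                       /* assumed assignment */
    puts("disk: reading from storage");
    return NULL;
}

static void *net_io_thread(void *arg)
{
    (void)arg;
    pin_to_core(2);                       /* assumed assignment */
    puts("net: sending data");
    return NULL;
}

int main(void)
{
    pthread_t t[3];
    pthread_create(&t[0], NULL, mgmt_thread, NULL);
    pthread_create(&t[1], NULL, disk_io_thread, NULL);
    pthread_create(&t[2], NULL, net_io_thread, NULL);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```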
MDTM Logical Functions and Modules
[Figure: logical modules — the data transfer application's native functions (user interface, authentication & access control, request/data preprocessing, thread/flow management, data access and transmission, statistics store) plus MDTM-based functions (data transfer service interface, MDTM app interface) exchange data transfer profiles, resource scheduling requests/responses, and status queries/responses with the MDTM middleware (resource scheduler, thread load estimation, system monitor, QoS/policy manager, NUMA access cost modelling, admin user input), all running above the OS kernel and hardware.]
• I/O-centric architecture; parallel data transfer
• Data layout preprocessing; disk/network I/O optimization
• Data flow-centric scheduling; NUMA-aware scheduling
• I/O locality optimization; maximizing parallelism
Major Activities
• The MDTM Project demo at SC14, New Orleans, November 2014
– http://scdoe.info/demo-station-descriptions/
• L. Zhang, T. Li, Y. Ren, P. DeMar, S. Jin, D. Yu, W. Wu, "The MDTM Project", SC'14 poster session, New Orleans, LA, 2014.
• ESCC Winter 2015 talk
– https://escc.es.net/?q=node/7/107
• MDTM deployment on the ESnet 100G testbed, July 2015
Initial Results
• We evaluated mdtmBBCP on the ESnet 100G testbed, comparing it with GridFTP and BBCP. For a fair comparison, all tools were configured with the same parameters (I/O block size and number of parallel streams). We use Time-to-Completion (TTC) as the performance metric. The test transfers a 100 GB file from nersc-tbn-2 to nersc-tbn-1.

        mdtmBBCP   GridFTP   BBCP
TTC     55 s       101 s     95 s
Future work
• MDTM R&D
– Production-quality distribution kit
– QoS
• MDTM field test and deployment
– Reaching out to potential MDTM users
– Alpha-release users, e.g., ESnet network engineers
– Beta-release users: CMS, ATLAS
MDTM Source Code
The latest MDTM source code is available at https://cdcvs.fnal.gov/redmine/projects/mdtm
Features supported:
– Multicore system profiling
– Thread/process scheduling
– Thread binding with I/O locality and load balancing
– NUMA-aware memory pre-allocation and binding
– Network I/O affinity
MDTM Project Website
http://mdtm.fnal.gov
Questions?
Demo: http://mdtm-server.fnal.gov:1337