High speed networks and Java (Ryan Sciampacone)
High Speed Networks: Free Performance or New Bottlenecks?
Ryan Sciampacone – IBM Java Runtime Lead, 1st October 2012
Important Disclaimers
THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY.
WHILST EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.
ALL PERFORMANCE DATA INCLUDED IN THIS PRESENTATION HAVE BEEN GATHERED IN A CONTROLLED ENVIRONMENT. YOUR OWN TEST RESULTS MAY VARY BASED ON HARDWARE, SOFTWARE OR INFRASTRUCTURE DIFFERENCES.
ALL DATA INCLUDED IN THIS PRESENTATION ARE MEANT TO BE USED ONLY AS A GUIDE.
IN ADDITION, THE INFORMATION CONTAINED IN THIS PRESENTATION IS BASED ON IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE.
IBM AND ITS AFFILIATED COMPANIES SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION.
NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, OR SHALL HAVE THE EFFECT OF:
- CREATING ANY WARRANTY OR REPRESENTATION FROM IBM, ITS AFFILIATED COMPANIES OR ITS OR THEIR SUPPLIERS AND/OR LICENSORS
Introduction to the speaker
■ 15 years experience developing and deploying Java SDKs
■ Recent work focus:
– Managed Runtime Architecture
– Java Virtual Machine improvements
– Multi-tenancy technology
– Native data access and heap density
– Footprint and performance
– Garbage Collection
– Scalability and pause time reduction
– Advanced GC technology
■ My contact information: Ryan_Sciampacone@ca.ibm.com
What should you get from this talk?
■ Understand the current state of high speed networks in the context of Java development and take away a clear view of the issues involved. Learn practical approaches to achieving great performance, including how to understand results that initially don’t make sense.
Life In The Fast Lane
■ “Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.” -- Andrew S. Tanenbaum, Computer Networks, 4th ed., p. 91
■ Networks are often thought of as just a simple interconnect between systems
■ No real differentiators
– WAN vs. LAN
– Wired vs. Wireless
■ APIs traditionally make this invisible
– The socket API is good at hiding things (SDP, SMC-R, TCP/IP)
■ Can today’s network offerings be exploited to improve existing performance?
Network Overview
Network Speeds Over Time
■ Consistent advancement in speeds over the years
■ Networks have come a long way in that time
[Chart: Comparison of Network Speeds – 10Mb/s, 100Mb/s, 1GigE, 10GigE, InfiniBand]
Network Speeds Over Time
■ Oh sorry – that was a logarithmically scaled chart!
[Chart: Comparison of Network Speeds, linear scale – 10Mb/s, 100Mb/s, 1GigE, 10GigE, InfiniBand]
Network Speeds vs. The World
■ Bandwidth differences between memory and InfiniBand are still a ways off
■ But the gap is getting smaller!
[Chart: Networks vs. Other Storage Bandwidth – 1GigE, 10GigE, InfiniBand, SSD, Core i7 memory]
Networks Now vs. Yesterday
■ Real opportunity to look at decentralized systems
■ Already true:
– Cloud computing
– Data grids
– Distributed computation
■ Network distance isn’t as far as it used to be!
What is InfiniBand?
■ Originated in 1999 from the merger of two competing designs
■ Features
– High throughput
– Low latency
– Quality of service
– Failover
– Designed to be scalable
■ Offers low latency RDMA (Remote Direct Memory Access)
■ Uses a different programming model than traditional sockets
– No “standard” API – de facto: OFED (OpenFabrics Enterprise Distribution)
– Upper layer protocols (ULPs) exist to ease the pain of development
IB vs. IPoIB vs. SDP – InfiniBand
[Diagram: a modified application uses an IB-specific communication mechanism to talk to IB Services over the IB core and device driver – a bypass of kernel facilities, effectively a “zero hop” to the communication layer]
■ Handles all transmission aspects (guarantees, transmission units, etc.)
■ Extremely low CPU cost

IB vs. IPoIB vs. SDP – IP over InfiniBand
[Diagram: the application uses standard socket APIs; the entire TCP/IP stack is used but resides on a mapping / conversion layer (IPoIB) above the IB core and device driver]
■ Effectively the TCP/IP stack using a “device driver” to interface to the IB layer
■ High CPU cost

IB vs. IPoIB vs. SDP – Sockets Direct Protocol
[Diagram: the application uses standard socket APIs, but SDP provides its own lighter-weight mechanisms and mappings to leverage the IB core and device driver directly]
■ Largely bypasses the kernel but still incurs an extra hop during transmission
■ Medium CPU cost
Throughput vs. Latency
■ Data unit – the unit used for measuring throughput and latency
■ Latency – the length of time for a data unit to travel from the start point to the end point (e.g., 10ms)
■ Throughput – the number of data units that arrive per time measurement (e.g., 10Gb/s)
■ Shower analogy
– Diameter of the pipe gives you water throughput
– Length determines the time it takes for a drop to travel from end to end
Throughput vs. Latency
■ Motivations can characterize priorities
– Throughput and latency are not necessarily related!
■ Higher throughput rates offer interesting optimization possibilities
– Reduced pressure on compressing data
– Reduced pressure on being selective about what data to send
■ For something like RDMA… just send the entire page
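To make the two measurements concrete, here is a minimal sketch (not from the presentation; the host, port, and the assumption that the server echoes the one-byte probe are all illustrative):

    // Round-trip latency vs. throughput measured on one blocking connection.
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    public class LatencyVsThroughput {
        public static void main(String[] args) throws Exception {
            SocketChannel ch = SocketChannel.open(new InetSocketAddress("server", 9999));

            // Latency: time for one data unit to travel from start to end point
            // (and back, since the server is assumed to echo it).
            ByteBuffer unit = ByteBuffer.allocateDirect(1);
            long t0 = System.nanoTime();
            ch.write(unit);
            unit.clear();
            ch.read(unit);
            System.out.println("round trip: " + (System.nanoTime() - t0) + " ns");

            // Throughput: how many data units arrive per time measurement.
            ByteBuffer bulk = ByteBuffer.allocateDirect(1 << 20); // 1MB unit
            long bytes = 0;
            long t1 = System.nanoTime();
            for (int i = 0; i < 1000; i++) {
                bulk.clear();
                bytes += ch.write(bulk);
            }
            double secs = (System.nanoTime() - t1) / 1e9;
            System.out.println((bytes * 8 / secs / 1e9) + " Gb/s");
            ch.close();
        }
    }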
Simple Test using IB
Simple Test using IB – Background
■ Experiment: Can Java exploit RDMA to get better performance?
■ Tests conducted
– Send different sized packets from a client to a server (sketched below)
– Measure the time required to complete the write
– Test variations include a communication layer with RDMA
■ Conditions
– Single threaded
– 40Gb/s InfiniBand
■ Goal being to look at
– Network speeds
– Baseline overhead that Java imposes over equivalent C programs
– Existing issues that may not have been predicted
■ Also going to look at very basic Java overhead
– Comparisons will be made against an equivalent C program
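As a rough picture of the harness, a minimal sketch of the single-threaded write test (the class name, host, and port are placeholders, not the actual benchmark code):

    // Time how long pushing a payload of a given size to the server takes.
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    public class WriteTest {
        public static void main(String[] args) throws Exception {
            int payloadSize = Integer.parseInt(args[0]); // 1 byte .. 256MB in the tests
            SocketChannel ch = SocketChannel.open(new InetSocketAddress("server", 9999));

            // DirectByteBuffer avoids the JNI marshalling copy noted above.
            ByteBuffer buf = ByteBuffer.allocateDirect(payloadSize);

            long start = System.nanoTime();
            while (buf.hasRemaining()) {
                ch.write(buf); // time required to complete the write
            }
            System.out.println(payloadSize + " bytes in " + (System.nanoTime() - start) + " ns");
            ch.close();
        }
    }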
Simple Test using IB – IPoIB Comparison
■ DirectByteBuffer (NIO socket channel) used to avoid marshalling costs (JNI)
■ Observations
– C code is initially faster than the Java implementation
– Generally even after a 128k payload size
[Chart: Throughput comparison for C / Java – payload sizes 1 byte to 256m; series: C IPoIB, Java DBB IPoIB; higher is better]
Simple Test using IB – SDP Comparison
■ DirectByteBuffer (NIO socket channel) used to avoid marshalling costs (JNI)
■ Observations
– C code is initially faster than the Java implementation
– Generally even after a 128k payload size
[Chart: Throughput comparison for C / Java – payload sizes 1 byte to 256m; series: C IPoIB, Java DBB IPoIB, C SDP, Java DBB SDP; higher is better]
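One reason SDP is attractive here: JDK 7 documented socket-level SDP support that needs no code changes, only a rules file named by the com.sun.sdp.conf system property. A hedged sketch follows; the rule format is from that documentation, but the addresses are placeholders and platform / OFED support must be verified:

    # sdp.conf – route matching binds / connects over SDP instead of TCP/IP
    bind    192.0.2.10   *
    connect 192.0.2.*    *

    java -Dcom.sun.sdp.conf=sdp.conf YourApp   # YourApp is a placeholder class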
Interlude – Zero Copy 64k Boundary
■ Classic networking (java.net) package
– A byte[] written from Java is copied across JNI into native memory, then copied again into the kernel before transmit
– 2 copies before data gets transmitted
– Lots of CPU burn, lots of memory being consumed
■ Using DirectByteBuffer with SDP
– The data already lives in native memory, so only 1 copy (into the kernel) before data gets transmitted
– Less CPU burn, less memory being consumed
■ But when the payload hits the “zero copy” threshold in SDP (>64KB)…
– Memory is “registered” for use with RDMA (direct send from user space memory) – this is extremely expensive / slow!
– Memory is “unregistered” when the send completes
– Only 1 copy before data gets transmitted, but the register / unregister is prohibitively expensive (every transmit!)
(A sketch contrasting the byte[] and DirectByteBuffer write paths follows.)
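A minimal sketch (not the benchmark code; the host name, port, and sizes are illustrative) contrasting the two write paths just described:

    // Heap byte[] vs. direct buffer writes over an NIO socket channel.
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    public class WritePaths {
        public static void main(String[] args) throws Exception {
            SocketChannel ch = SocketChannel.open(new InetSocketAddress("server", 9999));

            // Heap path: the payload lives on the Java heap, so the runtime
            // must copy it out to native memory before the kernel sees it.
            byte[] payload = new byte[32 * 1024];
            ch.write(ByteBuffer.wrap(payload));

            // Direct path: allocate native memory once and reuse it, so the
            // JNI marshalling copy disappears from every transmit.
            ByteBuffer direct = ByteBuffer.allocateDirect(32 * 1024);
            direct.put(payload);
            direct.flip();
            while (direct.hasRemaining()) {
                ch.write(direct);
            }
            ch.close();
        }
    }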
Simple Test using IB – SDP Comparison (continued)
■ Post zero copy threshold there is a sharp drop
– Cost of memory register / unregister
■ Eventual climb and plateau
– Benefits of zero copy cannot outweigh the drawbacks
[Chart: Throughput comparison for C / Java – payload sizes 1 byte to 256m; series: C IPoIB, Java DBB IPoIB, C SDP, Java DBB SDP; higher is better]
Simple Test using IB – RDMA Comparison
■ No “zero copy” threshold issues
– Always zero copy
– Memory registered once, reused
■ Throughput does finally plateau
– Single thread – the pipe is hardly saturated
[Chart: Throughput comparison for C / Java – payload sizes 1 byte to 256m; series: C IPoIB, Java DBB IPoIB, C SDP, Java DBB SDP, C RDMA/W, Java DBB RDMA/W; higher is better]
Simple Test using IB – What about that Zero Copy Threshold?
■ SDP ultimately has a plateau here
– Possibly other deeper tuning aspects available
■ Pushing the zero copy threshold out has no advantage
■ Claw back is still ultimately limited
– Likely gated by some other aspect of the system
■ The 64KB threshold (default) seems to be the “sweet spot”
[Chart: Zero Copy Threshold Comparison – payload sizes 1 byte to 256m; thresholds 4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k; higher is better]
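Note that this threshold is an OS-level knob rather than a Java one. A hedged sketch, assuming the OFED ib_sdp kernel module and its sdp_zcopy_thresh module parameter (verify the name and units against your OFED release):

    # /etc/modprobe.d/ib_sdp.conf – keep the default 64KB zero copy threshold
    options ib_sdp sdp_zcopy_thresh=65536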
Simple Test using IB – Summary
■ Simple steps to start using
– IPoIB lets you use your application ‘as is’
■ Increased speed can potentially involve significant application changes
– Potential need for deeper technical knowledge
– SDP is an interesting stop gap
■ There are hidden gotchas!
– Increased load changes the game – but this is standard when dealing with computers
ORB and High Speed Networks
Benchmarking the ORB – Background
■ Experiment: How does the ORB perform over InfiniBand?
■ Tests conducted
– Send different sized packets from a client to a server
– Measure the time required for a write followed by a read
– Compare standard Ethernet to SDP / IPoIB
■ Conditions
– 500 client threads
– Echo style test (send to server, server echoes data back; sketched below)
– byte[] payload
– 40Gb/s InfiniBand
■ Goal being to look at
– ORB performance when the data pipe isn’t the bottleneck (time to complete the benchmark)
– Threading performance
■ Realistically expecting to discover bottlenecks in the ORB
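As a rough picture of the echo-style client, a minimal sketch (class, host, and port are illustrative; the real benchmark drives the ORB rather than raw sockets, and one connection per thread is shown only for simplicity – see the thread pool discussion below):

    // 500 client threads, each sending a byte[] and reading the echo back.
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.net.Socket;

    public class EchoClient implements Runnable {
        private static final int THREADS = 500;
        private final byte[] payload = new byte[16 * 1024]; // varied per run

        public void run() {
            try (Socket s = new Socket("server", 9999)) {
                DataOutputStream out = new DataOutputStream(s.getOutputStream());
                DataInputStream in = new DataInputStream(s.getInputStream());
                out.write(payload);                      // send to server
                in.readFully(new byte[payload.length]);  // server echoes data back
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static void main(String[] args) {
            for (int i = 0; i < THREADS; i++) {
                new Thread(new EchoClient()).start();
            }
        }
    }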
Benchmarking the ORB – Ethernet Results
■ Standard Ethernet with the classic java.net package
[Chart: ORB Echo Test Performance – time to complete vs. payload size (1k–1m); series: ETH; lower is better]
Benchmarking the ORB – SDP
■ …And this is with SDP (could be better)
[Chart: ORB Echo Test Performance – time to complete vs. payload size (1k–1m); series: ETH, SDP; lower is better]
Benchmarking the ORB – ORB Transmission Buffers
■ To write a byte[], the ORB fragments the data into small internal transmission buffers (e.g., a 3KB write becoming three 1KB chunks), copies the data in, and only then transmits
■ Many additional costs being incurred (per thread!) to transmit a byte array
Benchmarking the ORB – ORB Transmission Buffers
■ Existing bottlenecks outside the ORB (buffer management in the socket layer)
■ Throughput couldn’t be pushed much further
■ 3KB to 4KB ORB buffer sizes were sufficient for Ethernet
Benchmarking the ORB – ORB Transmission Buffers
■ 64KB was the best buffer size for SDP
– The zero copy threshold again!
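If your ORB exposes its transmission buffer size, aligning it with the zero copy threshold is a one-flag experiment. The IBM ORB documents a fragment size property, but treat the exact property name and its behaviour on your SDK level as assumptions to verify; OrbEchoServer is a placeholder class:

    java -Dcom.ibm.CORBA.FragmentSize=65536 com.example.OrbEchoServer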
Benchmarking the ORB – Garbage Collector Impact
■ Allocating large objects (e.g., buffers) can be a costly operation
– Even with free memory in the heap, the collector has to answer “allocate where?” for a large contiguous buffer
■ Premature garbage collections occur in order to “clear space” for large allocations
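A common mitigation is to allocate large buffers once and reuse them, keeping the big allocations (and any transport-level registration) out of the steady state. A minimal sketch; the per-thread strategy and sizes are assumptions, not the ORB’s actual design:

    // Reuse one direct buffer per thread instead of allocating per transmit.
    import java.nio.ByteBuffer;

    public final class BufferPool {
        private static final int BUFFER_SIZE = 64 * 1024; // the SDP sweet spot above

        private static final ThreadLocal<ByteBuffer> LOCAL =
                ThreadLocal.withInitial(() -> ByteBuffer.allocateDirect(BUFFER_SIZE));

        public static ByteBuffer acquire() {
            ByteBuffer buf = LOCAL.get();
            buf.clear(); // reset position / limit rather than allocating anew
            return buf;
        }
    }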
Benchmarking the ORB – Thread Pools
■ Thread and connection count ratios are a factor
■ 500 client threads sharing 1 connection
– Highly contended resource
– Couldn’t saturate the communication channel
■ 500 client threads across 500 connections
– Context switching disaster
– Threads queued and unable to complete transmit
– Memory / resource consumption nightmare
■ 500 client threads across 10 connections
– 2-5% of the client thread count appeared to be best
– Saturates the communication pipe enough to achieve best throughput
– Keeps resource consumption and context switches to a minimum
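A minimal sketch of the funneling pattern implied by that ratio – many worker threads sharing a small fixed set of connections (class names, host, and port are illustrative):

    // Workers borrow one of N pooled connections instead of owning their own.
    import java.net.InetSocketAddress;
    import java.nio.channels.SocketChannel;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public final class ConnectionPool {
        private final BlockingQueue<SocketChannel> channels;

        public ConnectionPool(String host, int port, int size) throws Exception {
            channels = new ArrayBlockingQueue<>(size); // e.g., 10 for 500 threads
            for (int i = 0; i < size; i++) {
                channels.add(SocketChannel.open(new InetSocketAddress(host, port)));
            }
        }

        // Callers block here, bounding contention and context switching.
        public SocketChannel acquire() throws InterruptedException {
            return channels.take();
        }

        public void release(SocketChannel ch) {
            channels.add(ch);
        }
    }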
Benchmarking the ORB – Post Optimization Round
■ Hey, great! Still not super (or the difference you’d expect) but it’s a good start
■ NOTE: the 64k zero copy threshold is definitely a big part of the whole thing
■ No surprises – IPoIB has higher overhead than SDP
■ The 64KB numbers are actually quite close – so there are still issues to discover and fix
[Chart: ORB Echo Test Performance – time to complete vs. payload size (1k–1m); series: ETH, SDP, SDP (New), IPoIB; lower is better]
Benchmarking the ORB – Summary
■ It’s not as easy as “stepping on the gas”
– High speed networks alone don’t resolve your problems
– Software layers are going to have bottlenecks
– Improvements for high speed networks can help traditional ones as well
■ The benefit is not always clear cut
And after all that…
Conclusion
■ High speed networks are a game changer
■ Simple to use, hard to use effectively
■ Expectations based on past results need to be re-evaluated
■ Existing applications / frameworks may need tuning or optimization
■ They open up potentially new possibilities
Questions?
References
■ Get Products and Technologies:
– IBM Java Runtimes and SDKs: https://www.ibm.com/developerworks/java/jdk/
– IBM Monitoring and Diagnostic Tools for Java: https://www.ibm.com/developerworks/java/jdk/tools/
■ Learn:
– IBM Java InfoCenter: http://publib.boulder.ibm.com/infocenter/java7sdk/v7r0/index.jsp
■ Discuss:
– IBM Java Runtimes and SDKs Forum: http://www.ibm.com/developerworks/forums/forum.jspa?forumID=367&start=0
Copyright and Trademarks
© IBM Corporation 2012. All Rights Reserved.
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide.
Other product and service names might be trademarks of IBM or other companies.
A current list of IBM trademarks is available on the Web – see the IBM “Copyright and trademark information” page at URL: www.ibm.com/legal/copytrade.shtml