High speed networks and Java (Ryan Sciampacone)
High Speed Networks: Free Performance or New Bottlenecks?
Ryan Sciampacone – IBM Java Runtime Lead, 1st October 2012
Important Disclaimers
THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY.
WHILST EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.
ALL PERFORMANCE DATA INCLUDED IN THIS PRESENTATION HAVE BEEN GATHERED IN A CONTROLLED ENVIRONMENT. YOUR OWN TEST RESULTS MAY VARY BASED ON HARDWARE, SOFTWARE OR INFRASTRUCTURE DIFFERENCES.
ALL DATA INCLUDED IN THIS PRESENTATION ARE MEANT TO BE USED ONLY AS A GUIDE.
IN ADDITION, THE INFORMATION CONTAINED IN THIS PRESENTATION IS BASED ON IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE.
IBM AND ITS AFFILIATED COMPANIES SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION.
NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, OR SHALL HAVE THE EFFECT OF:
- CREATING ANY WARRANTY OR REPRESENTATION FROM IBM, ITS AFFILIATED COMPANIES OR ITS OR THEIR SUPPLIERS AND/OR LICENSORS
Introduction to the speaker
■ 15 years experience developing and deploying Java SDKs
■ Recent work focus:
– Managed Runtime Architecture
– Java Virtual Machine improvements
– Multi-tenancy technology
– Native data access and heap density
– Footprint and performance
– Garbage Collection
– Scalability and pause time reduction
– Advanced GC technology
■ My contact information: Ryan_Sciampacone@ca.ibm.com
What should you get from this talk?
■ Understand the current state of high speed networks in the context of Java development and take away a clear view of the issues involved. Learn practical approaches to achieving great performance, including how to understand results that initially don’t make sense.
Life In The Fast Lane
■ “Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.” -- Andrew S. Tanenbaum, Computer Networks, 4th ed., p. 91
■ Networks are often thought of as just a simple interconnect between systems
■ No real differentiators
– WAN vs. LAN
– Wired vs. Wireless
■ APIs traditionally make this invisible
– The socket API is good at hiding things (SDP, SMC-R, TCP/IP)
■ Can today’s network offerings be exploited to improve existing performance?
Network Overview
Network Speeds Over Time
■ Consistent advancement in speeds over the years
■ Networks have come a long way in that time
[Chart: Comparison of Network Speeds – 10Mb/s, 100Mb/s, 1GigE, 10GigE, InfiniBand]
Network Speeds Over Time
■ Oh sorry – that was a logarithmically scaled chart!
[Chart: Comparison of Network Speeds, linear scale – 10Mb/s, 100Mb/s, 1GigE, 10GigE, InfiniBand]
Network Speeds vs. The World
■ Bandwidth differences between memory and InfiniBand are still a ways off
■ But the gap is getting smaller!
[Chart: Networks vs. Other Storage Bandwidth – 1GigE, 10GigE, InfiniBand, SSD, Core i7 memory]
Networks Now vs. Yesterday
■ Real opportunity to look at decentralized systems
■ Already true:
– Cloud computing
– Data grids
– Distributed computation
■ Network distance isn’t as far as it used to be!
What is InfiniBand?
■ Originated in 1999 from the merger of two competing designs
■ Features
– High throughput
– Low latency
– Quality of service
– Failover
– Designed to be scalable
■ Offers low latency RDMA (Remote Direct Memory Access)
■ Uses a different programming model than traditional sockets
– No “standard” API – de facto: OFED (OpenFabrics Enterprise Distribution)
– Upper layer protocols (ULPs) exist to ease the pain of development
IB vs. IPoIB vs. SDP – InfiniBand
[Diagram: a modified application uses an IB-specific communication mechanism to talk to IB Services over the IB core and device driver – a bypass of kernel facilities, effectively a “zero hop” to the communication layer]
■ Handles all transmission aspects (guarantees, transmission units, etc.)
■ Extremely low CPU cost

IB vs. IPoIB vs. SDP – IP over InfiniBand
[Diagram: the application uses standard socket APIs; the entire TCP/IP stack is used but resides on a mapping / conversion layer (IPoIB) above the IB core and device driver]
■ Effectively the TCP/IP stack using a “device driver” to interface to the IB layer
■ High CPU cost

IB vs. IPoIB vs. SDP – Sockets Direct Protocol
[Diagram: the application uses standard socket APIs, but SDP provides its own lighter-weight mechanisms and mappings to leverage the IB core and device driver directly]
■ Largely bypasses the kernel but still incurs an extra hop during transmission
■ Medium CPU cost
Throughput vs. Latency
■ Data unit – the unit used for measuring throughput and latency
■ Latency – the length of time for a data unit to travel from the start point to the end point (e.g., 10ms)
■ Throughput – the number of data units that arrive per time measurement (e.g., 10Gb/s)
■ Shower analogy
– Diameter of the pipe gives you water throughput
– Length determines the time it takes for a drop to travel from end to end
Throughput vs. Latency
■ Motivations can characterize priorities
– Throughput and latency are not necessarily related!
■ Higher throughput rates offer interesting optimization possibilities
– Reduced pressure on compressing data
– Reduced pressure on being selective about what data to send
■ For something like RDMA… just send the entire page
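To make the two measurements concrete, here is a minimal sketch (not from the presentation; the host, port, and the assumption that the server echoes the one-byte probe are all illustrative):

    // Round-trip latency vs. throughput measured on one blocking connection.
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    public class LatencyVsThroughput {
        public static void main(String[] args) throws Exception {
            SocketChannel ch = SocketChannel.open(new InetSocketAddress("server", 9999));

            // Latency: time for one data unit to travel from start to end point
            // (and back, since the server is assumed to echo it).
            ByteBuffer unit = ByteBuffer.allocateDirect(1);
            long t0 = System.nanoTime();
            ch.write(unit);
            unit.clear();
            ch.read(unit);
            System.out.println("round trip: " + (System.nanoTime() - t0) + " ns");

            // Throughput: how many data units arrive per time measurement.
            ByteBuffer bulk = ByteBuffer.allocateDirect(1 << 20); // 1MB unit
            long bytes = 0;
            long t1 = System.nanoTime();
            for (int i = 0; i < 1000; i++) {
                bulk.clear();
                bytes += ch.write(bulk);
            }
            double secs = (System.nanoTime() - t1) / 1e9;
            System.out.println((bytes * 8 / secs / 1e9) + " Gb/s");
            ch.close();
        }
    }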
Simple Test using IB
Simple Test using IB – Background
■ Experiment: Can Java exploit RDMA to get better performance?
■ Tests conducted
– Send different sized packets from a client to a server (sketched below)
– Measure the time required to complete the write
– Test variations include a communication layer with RDMA
■ Conditions
– Single threaded
– 40Gb/s InfiniBand
■ Goal being to look at
– Network speeds
– Baseline overhead that Java imposes over equivalent C programs
– Existing issues that may not have been predicted
■ Also going to look at very basic Java overhead
– Comparisons will be made against an equivalent C program
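As a rough picture of the harness, a minimal sketch of the single-threaded write test (the class name, host, and port are placeholders, not the actual benchmark code):

    // Time how long pushing a payload of a given size to the server takes.
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    public class WriteTest {
        public static void main(String[] args) throws Exception {
            int payloadSize = Integer.parseInt(args[0]); // 1 byte .. 256MB in the tests
            SocketChannel ch = SocketChannel.open(new InetSocketAddress("server", 9999));

            // DirectByteBuffer avoids the JNI marshalling copy noted above.
            ByteBuffer buf = ByteBuffer.allocateDirect(payloadSize);

            long start = System.nanoTime();
            while (buf.hasRemaining()) {
                ch.write(buf); // time required to complete the write
            }
            System.out.println(payloadSize + " bytes in " + (System.nanoTime() - start) + " ns");
            ch.close();
        }
    }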
Simple Test using IB – IPoIB Comparison
■ DirectByteBuffer (NIO socket channel) used to avoid marshalling costs (JNI)
■ Observations
– C code is initially faster than the Java implementation
– Generally even after a 128k payload size
[Chart: Throughput comparison for C / Java – payload sizes 1 byte to 256m; series: C IPoIB, Java DBB IPoIB; higher is better]
Simple Test using IB – SDP Comparison
■ DirectByteBuffer (NIO socket channel) used to avoid marshalling costs (JNI)
■ Observations
– C code is initially faster than the Java implementation
– Generally even after a 128k payload size
[Chart: Throughput comparison for C / Java – payload sizes 1 byte to 256m; series: C IPoIB, Java DBB IPoIB, C SDP, Java DBB SDP; higher is better]
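One reason SDP is attractive here: JDK 7 documented socket-level SDP support that needs no code changes, only a rules file named by the com.sun.sdp.conf system property. A hedged sketch follows; the rule format is from that documentation, but the addresses are placeholders and platform / OFED support must be verified:

    # sdp.conf – route matching binds / connects over SDP instead of TCP/IP
    bind    192.0.2.10   *
    connect 192.0.2.*    *

    java -Dcom.sun.sdp.conf=sdp.conf YourApp   # YourApp is a placeholder class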
Interlude – Zero Copy 64k Boundary
■ Classic networking (java.net) package
– A byte[] written from Java is copied across JNI into native memory, then copied again into the kernel before transmit
– 2 copies before data gets transmitted
– Lots of CPU burn, lots of memory being consumed
■ Using DirectByteBuffer with SDP
– The data already lives in native memory, so only 1 copy (into the kernel) before data gets transmitted
– Less CPU burn, less memory being consumed
■ But when the payload hits the “zero copy” threshold in SDP (>64KB)…
– Memory is “registered” for use with RDMA (direct send from user space memory) – this is extremely expensive / slow!
– Memory is “unregistered” when the send completes
– Only 1 copy before data gets transmitted, but the register / unregister is prohibitively expensive (every transmit!)
(A sketch contrasting the byte[] and DirectByteBuffer write paths follows.)
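A minimal sketch (not the benchmark code; the host name, port, and sizes are illustrative) contrasting the two write paths just described:

    // Heap byte[] vs. direct buffer writes over an NIO socket channel.
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    public class WritePaths {
        public static void main(String[] args) throws Exception {
            SocketChannel ch = SocketChannel.open(new InetSocketAddress("server", 9999));

            // Heap path: the payload lives on the Java heap, so the runtime
            // must copy it out to native memory before the kernel sees it.
            byte[] payload = new byte[32 * 1024];
            ch.write(ByteBuffer.wrap(payload));

            // Direct path: allocate native memory once and reuse it, so the
            // JNI marshalling copy disappears from every transmit.
            ByteBuffer direct = ByteBuffer.allocateDirect(32 * 1024);
            direct.put(payload);
            direct.flip();
            while (direct.hasRemaining()) {
                ch.write(direct);
            }
            ch.close();
        }
    }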
Simple Test using IB – SDP Comparison (continued)
■ Post zero copy threshold there is a sharp drop
– Cost of memory register / unregister
■ Eventual climb and plateau
– Benefits of zero copy cannot outweigh the drawbacks
[Chart: Throughput comparison for C / Java – payload sizes 1 byte to 256m; series: C IPoIB, Java DBB IPoIB, C SDP, Java DBB SDP; higher is better]
Simple Test using IB – RDMA Comparison
■ No “zero copy” threshold issues
– Always zero copy
– Memory registered once, reused
■ Throughput does finally plateau
– Single thread – the pipe is hardly saturated
[Chart: Throughput comparison for C / Java – payload sizes 1 byte to 256m; series: C IPoIB, Java DBB IPoIB, C SDP, Java DBB SDP, C RDMA/W, Java DBB RDMA/W; higher is better]
Simple Test using IB – What about that Zero Copy Threshold?
■ SDP ultimately has a plateau here
– Possibly other deeper tuning aspects available
■ Pushing the zero copy threshold out has no advantage
■ Claw back is still ultimately limited
– Likely gated by some other aspect of the system
■ The 64KB threshold (default) seems to be the “sweet spot”
[Chart: Zero Copy Threshold Comparison – payload sizes 1 byte to 256m; thresholds 4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k; higher is better]
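Note that this threshold is an OS-level knob rather than a Java one. A hedged sketch, assuming the OFED ib_sdp kernel module and its sdp_zcopy_thresh module parameter (verify the name and units against your OFED release):

    # /etc/modprobe.d/ib_sdp.conf – keep the default 64KB zero copy threshold
    options ib_sdp sdp_zcopy_thresh=65536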
Simple Test using IB – Summary
■ Simple steps to start using
– IPoIB lets you use your application ‘as is’
■ Increased speed can potentially involve significant application changes
– Potential need for deeper technical knowledge
– SDP is an interesting stop gap
■ There are hidden gotchas!
– Increased load changes the game – but this is standard when dealing with computers
ORB and High Speed Networks
Benchmarking the ORB – Background
■ Experiment: How does the ORB perform over InfiniBand?
■ Tests conducted
– Send different sized packets from a client to a server
– Measure the time required for a write followed by a read
– Compare standard Ethernet to SDP / IPoIB
■ Conditions
– 500 client threads
– Echo style test (send to server, server echoes data back; sketched below)
– byte[] payload
– 40Gb/s InfiniBand
■ Goal being to look at
– ORB performance when the data pipe isn’t the bottleneck (time to complete the benchmark)
– Threading performance
■ Realistically expecting to discover bottlenecks in the ORB
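As a rough picture of the echo-style client, a minimal sketch (class, host, and port are illustrative; the real benchmark drives the ORB rather than raw sockets, and one connection per thread is shown only for simplicity – see the thread pool discussion below):

    // 500 client threads, each sending a byte[] and reading the echo back.
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.net.Socket;

    public class EchoClient implements Runnable {
        private static final int THREADS = 500;
        private final byte[] payload = new byte[16 * 1024]; // varied per run

        public void run() {
            try (Socket s = new Socket("server", 9999)) {
                DataOutputStream out = new DataOutputStream(s.getOutputStream());
                DataInputStream in = new DataInputStream(s.getInputStream());
                out.write(payload);                      // send to server
                in.readFully(new byte[payload.length]);  // server echoes data back
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static void main(String[] args) {
            for (int i = 0; i < THREADS; i++) {
                new Thread(new EchoClient()).start();
            }
        }
    }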
Benchmarking the ORB – Ethernet Results
■ Standard Ethernet with the classic java.net package
[Chart: ORB Echo Test Performance – time to complete vs. payload size (1k–1m); series: ETH; lower is better]
Benchmarking the ORB – SDP
■ …And this is with SDP (could be better)
[Chart: ORB Echo Test Performance – time to complete vs. payload size (1k–1m); series: ETH, SDP; lower is better]
Benchmarking the ORB – ORB Transmission Buffers
■ To write a byte[], the ORB fragments the data into small internal transmission buffers (e.g., a 3KB write becoming three 1KB chunks), copies the data in, and only then transmits
■ Many additional costs being incurred (per thread!) to transmit a byte array
Benchmarking the ORB – ORB Transmission Buffers
■ Existing bottlenecks outside the ORB (buffer management in the socket layer)
■ Throughput couldn’t be pushed much further
■ 3KB to 4KB ORB buffer sizes were sufficient for Ethernet
Benchmarking the ORB – ORB Transmission Buffers
■ 64KB was the best buffer size for SDP
– The zero copy threshold again!
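If your ORB exposes its transmission buffer size, aligning it with the zero copy threshold is a one-flag experiment. The IBM ORB documents a fragment size property, but treat the exact property name and its behaviour on your SDK level as assumptions to verify; OrbEchoServer is a placeholder class:

    java -Dcom.ibm.CORBA.FragmentSize=65536 com.example.OrbEchoServer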
Benchmarking the ORB – Garbage Collector Impact
■ Allocating large objects (e.g., buffers) can be a costly operation
– Even with free memory in the heap, the collector has to answer “allocate where?” for a large contiguous buffer
■ Premature garbage collections occur in order to “clear space” for large allocations
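A common mitigation is to allocate large buffers once and reuse them, keeping the big allocations (and any transport-level registration) out of the steady state. A minimal sketch; the per-thread strategy and sizes are assumptions, not the ORB’s actual design:

    // Reuse one direct buffer per thread instead of allocating per transmit.
    import java.nio.ByteBuffer;

    public final class BufferPool {
        private static final int BUFFER_SIZE = 64 * 1024; // the SDP sweet spot above

        private static final ThreadLocal<ByteBuffer> LOCAL =
                ThreadLocal.withInitial(() -> ByteBuffer.allocateDirect(BUFFER_SIZE));

        public static ByteBuffer acquire() {
            ByteBuffer buf = LOCAL.get();
            buf.clear(); // reset position / limit rather than allocating anew
            return buf;
        }
    }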
Benchmarking the ORB – Thread Pools
■ Thread and connection count ratios are a factor
■ 500 client threads sharing 1 connection
– Highly contended resource
– Couldn’t saturate the communication channel
■ 500 client threads across 500 connections
– Context switching disaster
– Threads queued and unable to complete transmit
– Memory / resource consumption nightmare
■ 500 client threads across 10 connections
– 2-5% of the client thread count appeared to be best
– Saturates the communication pipe enough to achieve best throughput
– Keeps resource consumption and context switches to a minimum
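A minimal sketch of the funneling pattern implied by that ratio – many worker threads sharing a small fixed set of connections (class names, host, and port are illustrative):

    // Workers borrow one of N pooled connections instead of owning their own.
    import java.net.InetSocketAddress;
    import java.nio.channels.SocketChannel;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public final class ConnectionPool {
        private final BlockingQueue<SocketChannel> channels;

        public ConnectionPool(String host, int port, int size) throws Exception {
            channels = new ArrayBlockingQueue<>(size); // e.g., 10 for 500 threads
            for (int i = 0; i < size; i++) {
                channels.add(SocketChannel.open(new InetSocketAddress(host, port)));
            }
        }

        // Callers block here, bounding contention and context switching.
        public SocketChannel acquire() throws InterruptedException {
            return channels.take();
        }

        public void release(SocketChannel ch) {
            channels.add(ch);
        }
    }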
Benchmarking the ORB – Post Optimization Round
■ Hey, great! Still not super (or the difference you’d expect) but it’s a good start
■ NOTE: the 64k zero copy threshold is definitely a big part of the whole thing
■ No surprises – IPoIB has higher overhead than SDP
■ The 64KB numbers are actually quite close – so there are still issues to discover and fix
[Chart: ORB Echo Test Performance – time to complete vs. payload size (1k–1m); series: ETH, SDP, SDP (New), IPoIB; lower is better]
Benchmarking the ORB – Summary
■ It’s not as easy as “stepping on the gas”
– High speed networks alone don’t resolve your problems
– Software layers are going to have bottlenecks
– Improvements for high speed networks can help traditional ones as well
■ The benefit is not always clear cut
And after all that…
Conclusion
■ High speed networks are a game changer
■ Simple to use, hard to use effectively
■ Expectations based on past results need to be re-evaluated
■ Existing applications / frameworks may need tuning or optimization
■ They open up potentially new possibilities
Questions?
References
■ Get Products and Technologies:
– IBM Java Runtimes and SDKs: https://www.ibm.com/developerworks/java/jdk/
– IBM Monitoring and Diagnostic Tools for Java: https://www.ibm.com/developerworks/java/jdk/tools/
■ Learn:
– IBM Java InfoCenter: http://publib.boulder.ibm.com/infocenter/java7sdk/v7r0/index.jsp
■ Discuss:
– IBM Java Runtimes and SDKs Forum: http://www.ibm.com/developerworks/forums/forum.jspa?forumID=367&start=0
Copyright and Trademarks
© IBM Corporation 2012. All Rights Reserved.
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide.
Other product and service names might be trademarks of IBM or other companies.
A current list of IBM trademarks is available on the Web – see the IBM “Copyright and trademark information” page at URL: www.ibm.com/legal/copytrade.shtml