Page 1
University of Colorado at Boulder, Core Research Lab
FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue
John Giacomoni, Tipp Moseley, and Manish Vachharajani
University of Colorado at Boulder
2008.02.21
Page 2
Why Pipelines?
• Multicore systems are the future
• Many apps can be pipelined if the granularity is fine enough
  – ≈ < 1 µs
  – ≈ 3.5 x interrupt handler
Page 3
Fine-Grain Pipelining Examples
• Network processing:
  – Intrusion detection (NID)
  – Traffic filtering (e.g., P2P filtering)
  – Traffic shaping (e.g., packet prioritization)
Page 4
Network Processing Scenarios
Link      Mbps        fps           ns/frame
T-1       1.5         2,941         340,000
T-3       45.0        90,909        11,000
OC-3      155.0       333,333       3,000
OC-12     622.0       1,219,512     820
GigE      1,000.0     1,488,095     672
OC-48     2,500.0     5,000,000     200
10 GigE   10,000.0    14,925,373    67
OC-192    9,500.0     19,697,843    51
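The ns/frame column is the reciprocal of the frame rate: one second (10^9 ns) divided by fps. A minimal sketch of that arithmetic follows; the helper names `ns_per_frame` and `print_budget` are illustrative and do not come from the slides.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: per-frame time budget in nanoseconds for a link
 * that delivers `fps` frames per second. */
double ns_per_frame(double fps) {
    assert(fps > 0.0);
    return 1e9 / fps;
}

/* Print one row in the style of the table above. */
void print_budget(const char *link, double fps) {
    printf("%-8s %12.0f fps %10.1f ns/frame\n", link, fps, ns_per_frame(fps));
}
```

For example, `ns_per_frame(1488095.0)` gives roughly 672 ns, which matches the GigE row. Every pipeline stage, including its queue operations, has to fit inside that budget.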
Page 5
Core-Placements
4x4 NUMA Organization (ex: AMD Opteron Barcelona)
[Figure: placements of pipeline stages (IP, APP, OP, Dec, Enc) on cores]
Page 6
Example: 3 Stage Pipeline
Page 7
Example: 3 Stage Pipeline
Page 8
Communication Overhead
Page 9
Communication Overhead
Locks 320 ns
GigE
Page 10
Communication Overhead
Locks 320 ns
GigE
Lamport 160 ns
Page 11
Communication Overhead
Locks 320 ns
Lamport 160 ns
Hardware 10 ns
GigE
Page 12
Communication Overhead
Locks 320 ns
Lamport 160 ns
Hardware 10 ns
FastForward 28 ns
GigE
Page 13
More Fine-Grain Pipelining Examples
• Network processing:
  – Intrusion detection (NID)
  – Traffic filtering (e.g., P2P filtering)
  – Traffic shaping (e.g., packet prioritization)
• Signal Processing
  – Media transcoding/encoding/decoding
  – Software Defined Radios
• Encryption
  – Counter-Mode AES
• Other Domains
  – Fine-grain kernels extracted from sequential applications
Page 14
FastForward
• Cache-optimized point-to-point CLF (concurrent lock-free) queue
  1. Fast
  2. Robust against unbalanced stages
  3. Hides die-to-die communication
  4. Works with strong to weak memory consistency models
Page 15
Lamport's CLF Queue (1)
lamp_enqueue(data) {
  NH = NEXT(head);
  while (NH == tail) {};   /* spin while the queue is full */
  buf[head] = data;
  head = NH;
}

lamp_dequeue(*data) {
  while (head == tail) {}  /* spin while the queue is empty */
  *data = buf[tail];
  tail = NEXT(tail);
}
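The pseudocode above can be sketched in C11 roughly as follows. This is an illustration under stated assumptions, not the authors' code: the capacity `LAMP_SIZE`, the `lamp_queue` type, the non-blocking `lamp_try_*` names, and the acquire/release orderings are choices made here so the example is well-defined on weakly ordered machines.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define LAMP_SIZE 8u                          /* assumed capacity */
#define NEXT(i) (((i) + 1u) & (LAMP_SIZE - 1u))
static_assert((LAMP_SIZE & (LAMP_SIZE - 1u)) == 0, "size must be a power of two");

typedef struct {
    _Atomic unsigned head;                    /* written only by the producer */
    _Atomic unsigned tail;                    /* written only by the consumer */
    void *buf[LAMP_SIZE];
} lamp_queue;

/* Non-blocking variant of lamp_enqueue: returns false instead of spinning. */
bool lamp_try_enqueue(lamp_queue *q, void *data) {
    unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned nh = NEXT(head);
    /* Reading tail pulls in the index the consumer keeps writing. */
    if (nh == atomic_load_explicit(&q->tail, memory_order_acquire))
        return false;                         /* full */
    q->buf[head] = data;
    atomic_store_explicit(&q->head, nh, memory_order_release);
    return true;
}

/* Non-blocking variant of lamp_dequeue: returns false when empty. */
bool lamp_try_dequeue(lamp_queue *q, void **data) {
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    /* Reading head pulls in the index the producer keeps writing. */
    if (tail == atomic_load_explicit(&q->head, memory_order_acquire))
        return false;                         /* empty */
    *data = q->buf[tail];
    atomic_store_explicit(&q->tail, NEXT(tail), memory_order_release);
    return true;
}
```

A zero-initialized `lamp_queue` is empty. One slot is always left unused so that a full queue can be told apart from an empty one, so this sketch holds at most `LAMP_SIZE - 1` items.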
Page 16
Lamport's CLF Queue (2)
lamp_enqueue(data) {
  NH = NEXT(head);
  while (NH == tail) {};
  buf[head] = data;
  head = NH;
}
[Figure: cacheline layout of head, tail, and buf]
head tail
buf[0] buf[1] buf[2] buf[3] buf[4] buf[5] buf[6] buf[7]
buf[ ] buf[ ] buf[ ] buf[n]
Page 17
AMD Opteron Cache Example
[Figure: AMD Opteron cache example]
Page 18
Lamport's CLF Queue (2)
lamp_enqueue(data) {
  NH = NEXT(head);
  while (NH == tail) {};
  buf[head] = data;
  head = NH;
}
[Figure: cacheline layout of head, tail, and buf]
head tail
buf[0] buf[1] buf[2] buf[3] buf[4] buf[5] buf[6] buf[7]
buf[ ] buf[ ] buf[ ] buf[n]
Observe the mandatory cacheline ping-ponging for each enqueue and dequeue operation.
Page 19
Lamport's CLF Queue (3)
lamp_enqueue(data) {
  NH = NEXT(head);
  while (NH == tail) {};
  buf[head] = data;
  head = NH;
}
[Figure: cacheline layout of head, tail, and buf]
head
tail
buf[0] buf[1] buf[2] buf[3] buf[4] buf[5] buf[6] buf[7]
buf[ ] buf[ ] buf[ ] buf[n]
Observe how cachelines will still ping-pong.
What if the head/tail comparison were eliminated?
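One way to get the layout on this slide, with head and tail on separate cachelines, is explicit alignment. The sketch below assumes a 64-byte cacheline; the `padded_queue` type and `cacheline_index` helper are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHELINE 64   /* assumed line size in bytes; check the target CPU */

typedef struct {
    alignas(CACHELINE) unsigned head;   /* starts the producer's line */
    alignas(CACHELINE) unsigned tail;   /* starts the consumer's line */
    alignas(CACHELINE) void *buf[8];    /* data slots start on their own line */
} padded_queue;

/* Compile-time check that head and tail cannot share a cacheline. */
static_assert(offsetof(padded_queue, tail) - offsetof(padded_queue, head) >= CACHELINE,
              "head and tail must sit on different cachelines");

/* Hypothetical helper: which cacheline of the struct a byte offset falls in. */
size_t cacheline_index(size_t offset) {
    return offset / CACHELINE;
}
```

Padding stops head and tail from sharing a line, but Lamport's enqueue still reads tail and dequeue still reads head. Each index line therefore keeps moving between the two cores, which is the ping-ponging the slide points out.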
Page 20
FastForward CLF Queue (1)
lamp_enqueue(data) {
  NH = NEXT(head);
  while (NH == tail) {};
  buf[head] = data;
  head = NH;
}

ff_enqueue(data) {
  while (0 != buf[head]);
  buf[head] = data;
  head = NEXT(head);
}
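A hedged C11 sketch of the FastForward idea follows. The slides show only `ff_enqueue`; the dequeue side here is an assumption that mirrors it (a NULL slot means "empty", and the consumer writes NULL back after reading). The names `ff_queue`, `ff_try_enqueue`, `ff_try_dequeue`, and `FF_SIZE`, and the choice of memory orderings, are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define FF_SIZE 8u                            /* assumed capacity */
#define NEXT(i) (((i) + 1u) & (FF_SIZE - 1u))
static_assert((FF_SIZE & (FF_SIZE - 1u)) == 0, "size must be a power of two");

typedef struct {
    unsigned head;                 /* private to the producer, never shared */
    unsigned tail;                 /* private to the consumer, never shared */
    _Atomic(void *) buf[FF_SIZE];  /* NULL marks an empty slot */
} ff_queue;

/* Non-blocking form of ff_enqueue: the slot itself says whether there is room. */
bool ff_try_enqueue(ff_queue *q, void *data) {
    assert(data != NULL);          /* NULL is reserved as the empty marker */
    if (atomic_load_explicit(&q->buf[q->head], memory_order_acquire) != NULL)
        return false;              /* slot not yet consumed: queue is full */
    atomic_store_explicit(&q->buf[q->head], data, memory_order_release);
    q->head = NEXT(q->head);
    return true;
}

/* Assumed dequeue counterpart: read the slot, then mark it empty again. */
bool ff_try_dequeue(ff_queue *q, void **data) {
    void *d = atomic_load_explicit(&q->buf[q->tail], memory_order_acquire);
    if (d == NULL)
        return false;              /* nothing produced yet: queue is empty */
    atomic_store_explicit(&q->buf[q->tail], NULL, memory_order_release);
    q->tail = NEXT(q->tail);
    *data = d;
    return true;
}
```

Neither side reads the other's index, so head and tail stay in their owners' caches. In a real layout each index would also live on its own cacheline next to its owner's other private data. Unlike the Lamport sketch, all `FF_SIZE` slots are usable, at the cost of reserving NULL as a sentinel value.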
Page 21
FastForward CLF Queue (2)
ff_enqueue(data) {
  while (0 != buf[head]);
  buf[head] = data;
  head = NEXT(head);
}
[Figure: cacheline layout of head, tail, and buf]
head
tail
buf[0] buf[1] buf[2] buf[3] buf[4] buf[5] buf[6] buf[7]
buf[ ] buf[ ] buf[ ] buf[n]
Observe how head/tail cachelines will NOT ping-pong.
BUT, "buf" will still cause the cachelines to ping-pong.
Page 22
FastForward CLF Queue (3)
ff_enqueue(data) {
  while (0 != buf[head]);
  buf[head] = data;
  head = NEXT(head);
}
[Figure: cacheline layout of head, tail, and buf]
head
tail
buf[0] buf[1] buf[2] buf[3] buf[4] buf[5] buf[6] buf[7]
buf[ ] buf[ ] buf[ ] buf[n]
Solution: Temporally slip stages by a cacheline.
N:1 reduction in coherence misses per stage.
Page 23
Slip Timing
Page 24
Slip Timing Lost
Page 25
Maintaining Slip (Concepts)
• Use distance as the quality metric
  – Explicitly compare head/tail
  – Causes cache ping-ponging
  – Perform rarely
Page 26
Maintaining Slip (Method)
adjust_slip() {
  dist = distance(producer, consumer);
  if (dist < DANGER) {           /* DANGER and OK are tunable distance thresholds */
    dist_old = 0;
    do {
      dist_old = dist;
      spin_wait(avg_stage_time * (OK - dist));
      dist = distance(producer, consumer);
    } while (dist < OK && dist > dist_old);
  }
}
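The method above can be made concrete as a small C sketch. Everything beyond the slide's control flow is an assumption: the `slip_cfg` structure, the callback signatures, the return value (number of waits), and the toy `sim_*` stand-ins that simulate a producer in a single thread so the logic can be exercised without real queues.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative configuration; all names and fields are hypothetical. */
typedef struct {
    size_t danger;                    /* below this distance, slip is considered lost */
    size_t ok;                        /* target producer/consumer distance */
    unsigned long avg_stage_ns;       /* average time one stage needs per item */
    size_t (*distance)(void *ctx);    /* compares head and tail (expensive) */
    void (*spin_wait)(void *ctx, unsigned long ns);
    void *ctx;
} slip_cfg;

/* Run occasionally by the consumer. Returns how many times it waited. */
unsigned adjust_slip(const slip_cfg *c) {
    unsigned waits = 0;
    assert(c->danger <= c->ok);
    size_t dist = c->distance(c->ctx);
    if (dist < c->danger) {
        size_t dist_old;
        do {
            dist_old = dist;
            /* Wait roughly long enough for the producer to open the gap to `ok`. */
            c->spin_wait(c->ctx, c->avg_stage_ns * (c->ok - dist));
            waits++;
            dist = c->distance(c->ctx);
        } while (dist < c->ok && dist > dist_old);  /* stop if no progress was made */
    }
    return waits;
}

/* Toy stand-ins for demonstration only: the simulated producer advances
 * one slot per `stage_ns` nanoseconds that the consumer spends waiting. */
typedef struct { size_t dist; unsigned long stage_ns; } sim_state;

size_t sim_distance(void *ctx) { return ((sim_state *)ctx)->dist; }

void sim_spin_wait(void *ctx, unsigned long ns) {
    sim_state *s = ctx;
    s->dist += ns / s->stage_ns;
}
```

The `dist > dist_old` test is what keeps the consumer from waiting forever: if the producer made no progress during the wait (for example, because the stream has ended), the loop exits and normal dequeueing resumes.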
Page 27
Comparative Performance
[Figure: performance plots for Lamport and FastForward]
Page 28
Thrashing and Auto-Balancing
[Figure: FastForward (Thrashing) vs. FastForward (Balanced)]
Page 29
Cache Verification
[Figure: FastForward (Thrashing) vs. FastForward (Balanced)]
Page 30
On/Off Die Communications
[Figure: on-die communication vs. off-die communication]
Page 31
On/Off-die Performance
[Figure: FastForward (On-Die) vs. FastForward (Off-Die)]
Page 32
Proven Property
• "In the program order of the consumer, the consumer dequeues values in the same order that they were enqueued in the producer's program order."
Page 33
Work in Progress
• Operating Systems
  – 27.5 ns/op
    • 3.1 % cost reduction vs. reported 28.5 ns
  – Reduced jitter
• Applications
  – 128-bit AES encrypting filter
    • Ethernet layer encryption at 1.45 Mfps
    • IP layer encryption at 1.51 Mfps
    • ~10 lines of code for each.
Page 34
Gazing into the Crystal Ball
Locks 320 ns
Lamport 160 ns
Hardware 10 ns
FastForward 28 ns
GigE
Page 35
Shared Memory Accelerated Queues
Now Available!
http://ce.colorado.edu/core
[email protected]