Recursive Design of Hardware Priority Queues
Liron Schiff* (TAU)
Joint work with Yehuda Afek (TAU) and Anat Bremler-Barr (IDC)
*Supported by European Research Council (ERC) Starting Grant no. 259085
Feb 16, 2016
Priority Queue (PQ)
• Interface:
– PQ.Insert(x)
– PQ.GetMin(): remove and return the minimum
– PQ.Delete(): just remove
– PQ.Peek(): just return the minimum
• The higher the priority of an item, the smaller its value
[Figure: a priority queue with Insert and GetMin operations]
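As a point of reference for the interface above, here is a minimal software sketch built on Python's standard `heapq` module. The class name `HeapPQ` is hypothetical, and `delete()` is assumed to remove the current minimum, since the slide does not say which item Delete removes.

```python
import heapq


class HeapPQ:
    """Software priority queue; smaller values mean higher priority."""

    def __init__(self):
        self._heap = []

    def insert(self, x):
        heapq.heappush(self._heap, x)

    def get_min(self):
        # Remove and return the minimum.
        return heapq.heappop(self._heap)

    def delete(self):
        # Assumption: Delete removes the minimum without returning it.
        heapq.heappop(self._heap)

    def peek(self):
        # Return the minimum without removing it.
        return self._heap[0]
```

Each operation here costs O(log n) in software, which is the kind of cost the hardware designs in this talk aim to avoid.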
Priority Queue Applications
• Networking: scheduling packets
– Many flows (1M)
– High rate (100 Mpps)
• More applications: scientific simulators, databases
[Figure: a priority queue used as a packet scheduler]
Two Existing Approaches
• Dedicated hardware solutions: fast, non-scalable
• Common software solutions: slow, scalable
Our Approach: The Powering Technique
• Merge-Sort concept
• 3 × HW PQ of size √N (the Base Priority Queue, BPQ) + RAM = PQ of size N
[Figure: a Sort stage and a Merge stage, each labeled √N]
The Powering Technique
• Insert(x) uses Input
• When Input gets full, move its contents to Exit
• Get_min() extracts the min of Exit or Input, and we update the Exit (if needed)
[Figure: animation in which items 3, 0, 5, 4, 7, 8, 1, 2, 6, 9 are inserted through the Input BPQ; each full batch of √N items moves toward the Exit BPQ, and Get_min() returns the overall minimum]
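The steps above can be modeled in a few lines of software. This is only a sketch under stated assumptions: the two small queues are simulated with `heapq`, the class and attribute names are hypothetical, and the Exit queue is assumed to hold the head of each sorted list stored in RAM, which is one natural reading of the merge-sort concept.

```python
import heapq
import math
from collections import deque


class SimplePoweredPQ:
    """Toy model of the simple powering idea (all names are hypothetical)."""

    def __init__(self, capacity):
        self.block = max(1, math.isqrt(capacity))  # size of a small queue
        self.input = []     # models the Input BPQ
        self.exit = []      # models the Exit BPQ: (head value, list id)
        self.ram = {}       # list id -> rest of that sorted list
        self.next_id = 0

    def insert(self, x):
        heapq.heappush(self.input, x)
        if len(self.input) == self.block:
            self._move_input_to_exit()

    def _move_input_to_exit(self):
        # Input is full: its sorted contents become a new list in RAM,
        # and the head of that list is placed in Exit.
        run = deque(sorted(self.input))
        self.input = []
        heapq.heappush(self.exit, (run.popleft(), self.next_id))
        self.ram[self.next_id] = run
        self.next_id += 1

    def get_min(self):
        if not self.exit and not self.input:
            raise IndexError("get_min from an empty queue")
        if self.exit and (not self.input or self.exit[0][0] <= self.input[0]):
            value, list_id = heapq.heappop(self.exit)
            rest = self.ram[list_id]
            if rest:
                # Update Exit with the next item of the same list.
                heapq.heappush(self.exit, (rest.popleft(), list_id))
            else:
                del self.ram[list_id]
            return value
        return heapq.heappop(self.input)
```

Inserting 3, 0, 5, 4, 7, 8, 1, 2, 6, 9 with `capacity=9` and then calling `get_min()` ten times yields the items in increasing order. Nothing in this toy version limits how many list heads Exit holds, and moving a full Input takes time proportional to its size.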
Outline
• Difficulties with the simple idea
• Applying the construction recursively
• Exemplifying on TCAM base units
• Evaluation
Two difficulties with the simple idea
1. More than √N lists in the Exit module (as lists are emptied while capacity N is maintained)
2. Moving a list from Input to Exit in O(1) operations
[Figure: Input and Exit modules, each labeled √N]
Difficulty 1
• Maintaining capacity N while lists are shrinking
• We continually merge inactive lists during Insert
[Figure: animation of the Input BPQ, the Exit BPQ, and the lists in RAM while items 9, 10 and 11 are inserted]
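One way to picture merging spread across Insert operations is an incremental merge that does a bounded amount of work per call. The sketch below is an illustration only: the slides do not define which lists count as inactive or how many merge steps each Insert performs, so the `IncrementalMerge` helper and its `budget` parameter are assumptions.

```python
from collections import deque


class IncrementalMerge:
    """Merges two sorted lists a few items at a time (hypothetical helper)."""

    def __init__(self, a, b):
        self.a = deque(a)
        self.b = deque(b)
        self.out = []

    def step(self, budget=2):
        """Move up to `budget` items to the output; return True when done."""
        for _ in range(budget):
            if self.a and (not self.b or self.a[0] <= self.b[0]):
                self.out.append(self.a.popleft())
            elif self.b:
                self.out.append(self.b.popleft())
            else:
                break
        return not self.a and not self.b
```

Calling `step()` once per Insert keeps the cost of every Insert bounded by the budget, while two short lists are eventually replaced by one longer list.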
Difficulty 2
• Moving all items from Input to RAM in O(1) time
– Use two Input BPQs and switch between them
[Figure: animation of the Exit BPQ, two Input BPQs, and buffers; the two Input BPQs alternate roles]
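The switching idea can be sketched as double buffering: one Input queue accepts new items while the other is emptied into RAM one item per Insert. This is a simplified model with hypothetical names; it covers only the insert path and assumes one drain step per Insert, which is enough because a drained queue holds at most as many items as it takes to fill the active one.

```python
import heapq


class DoubleBufferedInput:
    """Two small input queues that alternate roles (hypothetical model)."""

    def __init__(self, block):
        self.block = block
        self.active = []       # receives new items
        self.draining = []     # being emptied into RAM
        self.current_run = []  # sorted list under construction in RAM
        self.runs = []         # completed sorted lists in RAM

    def insert(self, x):
        heapq.heappush(self.active, x)
        self._drain_one()
        if len(self.active) == self.block:
            # The switch is a constant-time role swap; no items are copied.
            # By this point the draining queue is already empty.
            self.active, self.draining = self.draining, self.active

    def _drain_one(self):
        if self.draining:
            self.current_run.append(heapq.heappop(self.draining))
            if not self.draining:
                self.runs.append(self.current_run)
                self.current_run = []
```

With `block=3`, inserting 3, 0, 5, 4, 7, 8, 1, 2, 6 leaves two completed lists in RAM, `[0, 3, 5]` and `[4, 7, 8]`, with 1, 2 and 6 waiting in the draining queue.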
Block Size – Time Tradeoff
• Apply the construction recursively
– We used Exit and Input of size √N
– We can use Exit of size ∛N and Input of size N^(2/3)
– We can build each Input recursively, from Exit and Input BPQs of size ∛N
• A systolic-array-like design
[Figure: a chain of Exit BPQ stages, each with its own RAM and buffers, fed by Insert and ending in two Input BPQs; stage sizes are given in terms of N and the BPQ size x]
Resulting Tradeoffs
[Table columns: Recursion Levels | #Queues × Size | #BPQ Ops. (per op.) | Parallel op. Time (Latency)]
TCAM example
Ternary CAMs (TCAMs)
• Associative memory chips
• Properties:
– Ternary values ('0', '1' and '*')
– Already used in routers (IP lookup, classification)
– High throughput (300M ops per sec for 1Mb TCAM)
– Latency and costs increase dramatically with size
[Figure: a TCAM holding ternary entries at indices 0 to m; a key goes in, and the index of the matching entry comes out]
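The lookup behavior in the figure can be imitated in software. In the sketch below, entries are strings over '0', '1' and '*', and the lowest-index matching entry wins, which is how TCAM lookups are commonly described. The function name `tcam_lookup` is hypothetical, and a real TCAM compares all entries in parallel rather than in a loop.

```python
def tcam_lookup(entries, key):
    """Return the index of the first entry matching `key`, or None."""
    for index, entry in enumerate(entries):
        if len(entry) != len(key):
            continue
        # '*' is a wildcard that matches either bit value.
        if all(e == '*' or e == k for e, k in zip(entry, key)):
            return index
    return None


entries = ["0*10", "1***", "0101"]
print(tcam_lookup(entries, "0110"))
```

Because of the wildcard in entry 0, the key `0110` matches it, while the key `0101` only matches entry 2.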
TCAM based Priority Queue
• Implied by Panigrahy & Sharma (2003)
• Three versions:
A. O(1) time but O(w) entries per item (where w is the width of a priority value in bits)
B. O(log w) time
C. "Empirical O(1)" time but O(w) in the worst case
• Our results:
[Table: Space (TCAM bits), Time (TCAM ops.) and Latency (TCAM ops.) for the original BPQ and for two Powering variants]
Powering the TCAM BPQ
• Using small TCAM-based PQs
– Faster TCAM access
– Feasible even when N is large
• Suits backbone routers well
– TCAMs are already used for IP lookup
Results for TCAM-based PQ
[Figure: TCAM Space (Kb) vs. N (thousands of items), with the size limit marked]
[Figure: Throughput (Mpps) vs. N (thousands of items); legend entries: k=2, k=1, A, B, C]
Applying to Shift-Registers
• Considering a HW PQ implementation of R. Chandra and O. Sinnen.
[Figure: Throughput (Mpps) vs. N (thousands of items) for SR-BPQ, SR-PPQ(2) and SR-PPQ(3), with the size limit marked; legend entries: Original, K=1, K=2]
Summary
• The Powering Technique
– Combine small HW queues and RAM
– Allows space-time tradeoffs
• Powering TCAMs
– Smaller TCAMs, shorter operation time
– Matches lower bound for sorting with TCAM
– Also works for Shift Registers