Sorting on the Cray X1 Helene Kulsrud CCR-P
Sorting Methods
• Heap Sort
• Bubble Sort
• Insertion Sort
• Shellsort
• Mergesort
• Quicksort
• Radix Sort
• Radix Exchange Sort
• etc…
• Parallel Sorts etc.
Sorting on the Cray X1
• QKSORT – A scalar sort (radix exchange)
• PSORT - A vectorized sort (radix)
• MSORT - A multiprocessor sort
[Figure: step-by-step animation of radix exchange sort applied to a column of 4-bit keys such as 1110, 1011, 1010, 0100.]
QKSORT on Cray X1
• A version of standard radix sort.
• Basically a scalar code.
• Code has two distinct parts.
• Part 1: Interchange, which uses vector registers.
• Part 2: Short sort in vector registers.
• Two routines in assembly language.
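The two-part structure above can be sketched in portable scalar C. This is only an illustration, not the X1 code: the names qk_sort_sketch, radix_exchange, short_sort and the cutoff SHORT_SORT_LIMIT are hypothetical, an insertion sort stands in for the vector-register short sort, and keys are assumed to fit in key_bits bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHORT_SORT_LIMIT 16   /* assumed cutoff, not taken from the talk */

/* Part 2 stand-in: insertion sort for small partitions. */
static void short_sort(uint64_t *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        uint64_t key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}

/* Part 1: partition a[0..n-1] on one bit, then recurse on the next lower bit. */
static void radix_exchange(uint64_t *a, size_t n, int bit)
{
    if (n < 2 || bit < 0)
        return;
    if (n <= SHORT_SORT_LIMIT) {
        short_sort(a, n);
        return;
    }
    uint64_t mask = (uint64_t)1 << bit;
    size_t lo = 0, hi = n;   /* a[0..lo) has the bit clear, a[hi..n) has it set */
    while (lo < hi) {
        if ((a[lo] & mask) == 0) {
            lo++;
        } else {
            hi--;
            uint64_t t = a[lo];
            a[lo] = a[hi];
            a[hi] = t;
        }
    }
    radix_exchange(a, lo, bit - 1);
    radix_exchange(a + lo, n - lo, bit - 1);
}

/* Sort n keys that each fit in key_bits bits (1..64). */
void qk_sort_sketch(uint64_t *a, size_t n, int key_bits)
{
    assert(key_bits >= 1 && key_bits <= 64);
    radix_exchange(a, n, key_bits - 1);
}
```

Calling qk_sort_sketch(keys, n, 4) on 4-bit keys partitions on the leftmost bit first and hands any partition of 16 keys or fewer to the short sort.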
Partitioning in the Vector Registers
[Figure: vector registers V0, V1, V2, V4, V6, and V7 holding 64-word blocks, with incoming blocks marked "new".]
Short Sort
[Figure: two-step example of the short sort on small integer keys held in vector registers V1 and V2.]
Improving QKSORT
Version / Count    5000   10000   50000   100000   500000   1000000   4000000
Port               .016    .037    .217     .467    2.784     5.987    27.208
New                .002    .003    .018     .038     .205      .425     1.824
QKSORT in 2003
Machine / Size    10000   50000   100000   500000   1000000   5000000   10000000
X1                 .003    .018     .038     .204      .425     2.331      4.805
Alpha EV6.8        .001    .009     .016     .085      .180     1.273      2.888
Alpha EV6.7        .002    .020     .025     .153      .342     2.242      4.756
T90                .008    .041     .083     .427      .855     4.367      9.507
T3E                .006    .037     .079     .510     1.024         -          -
[Figure: the 14 four-bit keys to be sorted: 1110, 1011, 1101, 0100, 1100, 0110, 1111, 0010, 0001, 0101, 1000, 0100, 0000, 0111.]
Counts in right 2-bit pocket
Value   Count   Address
00      5       0
01      3       5
10      3       8
11      3       11
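Each address in the table is the running total of the counts before it: pocket 00 starts at 0, pocket 01 at 0 + 5 = 5, pocket 10 at 5 + 3 = 8, and pocket 11 at 8 + 3 = 11. A minimal C sketch of that step follows; the function name counts_to_addresses is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Turn per-pocket counts into starting addresses (exclusive running sum). */
void counts_to_addresses(const size_t *count, size_t *address, size_t pockets)
{
    assert(count != NULL && address != NULL);
    size_t next = 0;
    for (size_t v = 0; v < pockets; v++) {
        address[v] = next;   /* first slot for keys whose digit value is v */
        next += count[v];
    }
}
```

With the counts {5, 3, 3, 3} for values 00 to 11, the function fills in the addresses {0, 5, 8, 11}.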
[Figure: the keys being moved into their pockets according to the right 2 bits.]
Counts in left 2-bit pocket
Value   Count   Address
00      3       0
01      5       3
10      2       8
11      4       10
[Figure: the keys being moved into pockets 00.., 01.., 10.., 11.. according to the left 2 bits.]
PSORT on the Cray X1
• A radix sort.
• It requires double memory.
• Three major parts to the code.
• A vector code.
• The address calculation used most time: word[j+1] = word[j] + word[j+1]
• Write address calculation in assembly language.
• Two versions – scalar and 2 vectors
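For orientation, here is a minimal scalar sketch of a radix sort with a second buffer. It assumes the three parts are counting, address calculation, and moving the keys, which the slide does not spell out. The names p_sort_sketch and radix_pass are hypothetical, and the 2-bit pocket width is only chosen to match the small worked example; nothing here reflects the vectorized X1 code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIGIT_BITS 2                 /* 2-bit pockets, as in the small example */
#define POCKETS (1u << DIGIT_BITS)

/* One pass: stable distribution of src into dst by the digit at `shift`. */
static void radix_pass(const uint64_t *src, uint64_t *dst, size_t n, int shift)
{
    size_t count[POCKETS] = {0};
    size_t address[POCKETS];

    for (size_t i = 0; i < n; i++)               /* part 1 (assumed): count */
        count[(src[i] >> shift) & (POCKETS - 1)]++;

    size_t next = 0;                             /* part 2 (assumed): addresses */
    for (size_t v = 0; v < POCKETS; v++) {
        address[v] = next;
        next += count[v];
    }

    for (size_t i = 0; i < n; i++) {             /* part 3 (assumed): move keys */
        size_t v = (src[i] >> shift) & (POCKETS - 1);
        dst[address[v]++] = src[i];
    }
}

/* Sort keys[0..n-1]; scratch is the second block of memory, also n words.
   key_bits must be a positive multiple of DIGIT_BITS, at most 64. */
void p_sort_sketch(uint64_t *keys, uint64_t *scratch, size_t n, int key_bits)
{
    assert(key_bits > 0 && key_bits <= 64 && key_bits % DIGIT_BITS == 0);
    uint64_t *src = keys, *dst = scratch;
    for (int shift = 0; shift < key_bits; shift += DIGIT_BITS) {
        radix_pass(src, dst, n, shift);          /* rightmost digit first */
        uint64_t *t = src;
        src = dst;
        dst = t;
    }
    if (src != keys)                             /* odd number of passes */
        memcpy(keys, src, n * sizeof *keys);
}
```

The scratch array is where the double memory requirement comes from: every pass reads one buffer and writes the other.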
Two vectors Address Update
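k = (n-1)/63 + 1;
/* Pass 1: running sum inside each of the 63 blocks of k words (vector loop over i) */
for (j = 1; j < k; j++)
    for (i = 0; i < 63; i++)
        word[k*i+j+1] = word[k*i+j+1] + word[k*i+j];
/* Pass 2: add word[k*i], the total of all earlier blocks, to block i.
   j runs through k so that the last word of the block, which is the base
   for the next block, is updated as well. */
for (i = 1; i < 63; i++)
    for (j = 1; j <= k; j++)
        word[k*i+j] = word[k*i] + word[k*i+j];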
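The same two-pass idea can be written as a self-contained, 0-based C sketch and checked against the plain scalar recurrence. The function names and the LANES constant are hypothetical, and the bounds checks for n that is not a multiple of 63 are an addition, not part of the slide.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 63   /* number of blocks, as on the slide */

/* In-place inclusive running sum of word[0..n-1], done in two passes. */
void address_update_blocked(long *word, size_t n)
{
    assert(word != NULL || n == 0);
    if (n == 0)
        return;
    size_t k = (n - 1) / LANES + 1;      /* block length, rounded up */

    /* Pass 1: running sum inside every block. The inner loop touches one
       word per block, so its iterations do not depend on each other. */
    for (size_t j = 1; j < k; j++) {
        for (size_t i = 0; i < LANES; i++) {
            size_t p = k * i + j;
            if (p < n)
                word[p] += word[p - 1];
        }
    }

    /* Pass 2: add the final total of block i-1 to every word of block i. */
    for (size_t i = 1; i < LANES; i++) {
        size_t start = k * i;
        if (start >= n)
            break;
        long base = word[start - 1];
        for (size_t j = 0; j < k && start + j < n; j++)
            word[start + j] += base;
    }
}

/* Plain scalar version: word[j] = word[j-1] + word[j]. */
void address_update_scalar(long *word, size_t n)
{
    for (size_t j = 1; j < n; j++)
        word[j] += word[j - 1];
}
```

Both functions produce the same array. The blocked form trades one dependent loop of length n for k steps that each work on up to 63 independent words, which is the shape a vector unit can use.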
Improving PSORT
Version / Count    5000    10000   50000   100000   500000   1000000   4000000
Ported             .0049   .0047   .0360    .0722    .4142     .4972    1.2252
V-S Loop           .0013   .0028   .0159    .0327    .1725     .2484     .7838
2V Loops           .0012   .0027   .0092    .0210    .1273     .2015     .7080
PSORT in 2003
Machine / Size    10000   50000   100000   500000   1000000   5000000   10000000
X1                 .003    .009     .021     .127      .202      .866      1.684
Alpha EV6.8        .003    .023     .070     .394      .934     7.636     16.536
Alpha EV6.7        .004    .040     .120    1.490     3.016    12.801     27.604
T90                .003    .017     .019     .090      .183      .702       1.31
T3E                .037    .377     .747    3.307     6.371         -          -
[Figure: MSORT across six processors (PE 0 to PE 5) in three stages labeled Sort, Tiles, and Merge.]
Improving MSORT
• Slow time due to SHMEM.
• May be due to barrier code.
• PSORT is faster than QKSORT.
• We need a double block of memory for the merge.
• However, times are similar since most of the time is spent in moving data.
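The sort-then-merge pattern, and why the merge needs a second block of memory, can be sketched serially in C. This is a rough stand-in only: the PEs are simulated by a loop, qsort replaces the per-PE sort, and SHMEM, barriers, and the tiling step are not modeled. All names (m_sort_sketch, merge_runs, cmp_u64) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int cmp_u64(const void *x, const void *y)
{
    uint64_t a = *(const uint64_t *)x, b = *(const uint64_t *)y;
    return (a > b) - (a < b);
}

/* Merge sorted runs a[0..na) and b[0..nb) into out. */
static void merge_runs(const uint64_t *a, size_t na,
                       const uint64_t *b, size_t nb, uint64_t *out)
{
    size_t i = 0, j = 0, o = 0;
    while (i < na && j < nb)
        out[o++] = (b[j] < a[i]) ? b[j++] : a[i++];
    while (i < na) out[o++] = a[i++];
    while (j < nb) out[o++] = b[j++];
}

/* Serial stand-in for a multiprocessor sort with `pes` simulated PEs.
   scratch is the second block of memory and must hold n words. */
void m_sort_sketch(uint64_t *keys, uint64_t *scratch, size_t n, size_t pes)
{
    assert(keys != NULL && scratch != NULL);
    if (n == 0 || pes == 0)
        return;
    size_t chunk = (n + pes - 1) / pes;

    /* Step 1: every simulated PE sorts its own chunk. */
    for (size_t start = 0; start < n; start += chunk) {
        size_t len = (n - start < chunk) ? n - start : chunk;
        qsort(keys + start, len, sizeof *keys, cmp_u64);
    }

    /* Step 2: merge neighbouring runs, doubling the run length each round. */
    uint64_t *src = keys, *dst = scratch;
    for (size_t run = chunk; run < n; run *= 2) {
        for (size_t start = 0; start < n; start += 2 * run) {
            size_t mid = (start + run < n) ? start + run : n;
            size_t end = (start + 2 * run < n) ? start + 2 * run : n;
            merge_runs(src + start, mid - start, src + mid, end - mid, dst + start);
        }
        uint64_t *t = src;
        src = dst;
        dst = t;
    }
    if (src != keys)
        memcpy(keys, src, n * sizeof *keys);
}
```

Every merge round copies all n keys from one buffer to the other, which is consistent with the observation above that most of the time goes into moving data.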
MSORT in 2003
Machine - PEs / Size   10,000   100,000   1,000,000   10,000,000
X1 - 2                   .013      .091        .955            -
X1 - 4                   .014      .064        .650            -
X1 - 8                   .033      .050        .492        4.011
X1 - 16                  .042      .048        .348        2.292
X1 - 24                  .109      .026        .191        1.674
T3E - 2                  .003      .040        .488            -
T3E - 4                  .002      .023        .263            -
T3E - 8                  .002      .025        .142        1.661
T3E - 16                 .004      .068        .099        1.098
T3E - 24                 .009      .085        .074         .755
What next?
• Can we think of a way to use multistreaming?
• If so – should we do it?
• Algorithms using SPP mode.
• Suggest compiler improvements to Cray.
What was the Point?
• Early exercise on the X1.
• Produce good sorts for the X1.
• Found compiler weaknesses.
• Found machine strengths.