Sorting on the Cray X1 - Cray User Group

Sorting on the Cray X1

Helene Kulsrud

CCR-P

Sorting Methods

• Heap Sort

• Bubble Sort

• Insertion Sort

• Shellsort

• Mergesort

• Quicksort

• Radix Sort

• Radix Exchange Sort

• etc…

• Parallel Sorts etc.

Sorting on the Cray X1

• QKSORT – A scalar sort (radix exchange)

• PSORT - A vectorized sort (radix)

• MSORT - A multiprocessor sort

[Figure: animation frames of a sort on fifteen 4-bit binary keys. The first frame lists 1110 1011 1010 0100 1100 0110 1111 0011 0001 0101 1101 1000 0100 0000 0111; later frames show keys being exchanged and the list being split into sublists.]
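The frames above can be read as a radix exchange sort: split the keys on the leading bit by exchanging from both ends, then repeat on each part with the next bit. A minimal scalar sketch in C follows. The names radix_exchange and radix_exchange_sort are illustrative only and are not taken from QKSORT, which the slides describe as using vector registers and assembly routines.

```c
#include <assert.h>
#include <stddef.h>

/* Sort a[lo..hi] (inclusive) on bit number `bit` and all lower bits. */
static void radix_exchange(unsigned *a, long lo, long hi, int bit)
{
    if (lo >= hi || bit < 0)
        return;

    unsigned mask = 1u << bit;
    long i = lo, j = hi;

    while (i <= j) {
        while (i <= j && (a[i] & mask) == 0) i++;  /* 0-bit keys stay left  */
        while (i <= j && (a[j] & mask) != 0) j--;  /* 1-bit keys stay right */
        if (i < j) {                               /* misplaced pair: exchange */
            unsigned t = a[i]; a[i] = a[j]; a[j] = t;
            i++; j--;
        }
    }
    /* a[lo..j] now has the bit clear, a[i..hi] has it set. */
    radix_exchange(a, lo, j, bit - 1);
    radix_exchange(a, i, hi, bit - 1);
}

/* Hypothetical wrapper: sort n keys that fit in `bits` bits. */
static void radix_exchange_sort(unsigned *a, size_t n, int bits)
{
    assert(bits >= 1 && bits <= 32);
    if (n > 1)
        radix_exchange(a, 0, (long)n - 1, bits - 1);
}
```

Calling radix_exchange_sort(keys, 15, 4) on the fifteen keys of the first frame leaves them in ascending order.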

QKSORT on Cray X1

• A version of standard radix sort.

• Basically a scalar code.

• Code has two distinct parts.

• Part 1: Interchange which uses vector registers.

• Part 2: Short sort in vector registers.

• Two routines in assembly language.

Partitioning in the Vector Registers

[Figure: diagram of 64-word blocks ("64 wds") and incoming "new" blocks held in vector registers V0, V1, V2, V4, V6 and V7.]

Short Sort

[Figure: two frames of the short sort operating on keys held in vector registers V1 and V2.]

Improving QKSORT

Version \ Count   4000000  1000000   500000   100000    50000    10000     5000
Port               27.208    5.987    2.784     .467     .217     .037     .016
New                 1.824     .425     .205     .038     .018     .003     .002

QKSORT in 2003

Machine \ Size    10000000   5000000   1000000    500000    100000     50000     10000
X1                   4.805     2.331      .425      .204      .038      .018      .003
Alpha EV6.8          2.888     1.273      .180      .085      .016      .009      .001
Alpha EV6.7          4.756     2.242      .342      .153      .025      .020      .002
T90                  9.507     4.367      .855      .427      .083      .041      .008

T3E (five values only; their columns are uncertain): 1.024  .510  .079  .037  .006

[Figure: the fourteen unsorted 4-bit keys 1110 1011 1101 0100 1100 0110 1111 0010 0001 0101 1000 0100 0000 0111, shown in two frames.]

Counts in Right 2 bit pocket

Value   Count   Address
00      5       0
01      3       5
10      3       8
11      3       11
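The table can be reproduced by counting the keys in each pocket and then taking a running sum of the counts to get each pocket's start address. A small sketch in C, assuming 2-bit pockets as on the slide; pocket_addresses is a hypothetical helper name, not a routine from PSORT.

```c
#include <assert.h>
#include <stddef.h>

#define POCKETS 4   /* a 2-bit digit selects one of 4 pockets */

/* Count keys per 2-bit digit at `shift`, then turn counts into start addresses. */
static void pocket_addresses(const unsigned *keys, size_t n, unsigned shift,
                             size_t count[POCKETS], size_t address[POCKETS])
{
    for (int v = 0; v < POCKETS; v++)
        count[v] = 0;

    for (size_t i = 0; i < n; i++)
        count[(keys[i] >> shift) & 3u]++;

    size_t running = 0;                 /* exclusive running sum of the counts */
    for (int v = 0; v < POCKETS; v++) {
        address[v] = running;
        running += count[v];
    }
    assert(running == n);               /* every key landed in some pocket */
}
```

For the fourteen keys in the figure, shift 0 (right 2 bits) gives counts 5, 3, 3, 3 and addresses 0, 5, 8, 11. Shift 2 (left 2 bits) gives counts 3, 5, 2, 4 and addresses 0, 3, 8, 10.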

[Figure: animation frames in which the keys are moved, one at a time, into their pockets according to the right 2 bits.]

Counts in left 2 bit pocket

Value   Count   Address
00      3       0
01      5       3
10      2       8
11      4       10

[Figure: animation frames in which the keys are moved into the pockets labelled 00.., 01.., 10.. and 11.. according to the left 2 bits.]

PSORT on the Cray X1

• A radix sort.

• It requires double memory.

• Three major parts to the code.

• A vector code.

• The address calculation used most time: word[j+1] = word[j] + word[j+1]

• Write address calculation in assembly language.

• Two versions – scalar and 2 vectors
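As a rough illustration of why a radix sort of this kind needs double memory, here is a scalar sketch in C: each pass counts the pockets, converts counts to addresses, and scatters the keys into a second buffer. The 2-bit pocket width follows the earlier slides; the function names and the split into a count, an address and a move step are assumptions, since the slides do not name the three parts of PSORT.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One stable pass: scatter src into dst by the 2-bit digit at `shift`. */
static void radix_pass(const unsigned *src, unsigned *dst, size_t n, unsigned shift)
{
    size_t next[4] = {0, 0, 0, 0};

    for (size_t i = 0; i < n; i++)          /* count each pocket */
        next[(src[i] >> shift) & 3u]++;

    size_t running = 0;                     /* counts -> start addresses */
    for (int v = 0; v < 4; v++) {
        size_t c = next[v];
        next[v] = running;
        running += c;
    }
    assert(running == n);

    for (size_t i = 0; i < n; i++)          /* move keys, keeping their order */
        dst[next[(src[i] >> shift) & 3u]++] = src[i];
}

/* Sort keys that fit in `bits` bits (an even number); tmp is the second buffer. */
static void lsd_radix_sort(unsigned *keys, unsigned *tmp, size_t n, unsigned bits)
{
    assert(bits % 2 == 0);
    unsigned *src = keys, *dst = tmp;

    for (unsigned shift = 0; shift < bits; shift += 2) {
        radix_pass(src, dst, n, shift);
        unsigned *t = src; src = dst; dst = t;   /* swap roles of the buffers */
    }
    if (src != keys)                        /* odd number of passes: copy back */
        memcpy(keys, src, n * sizeof *keys);
}
```

With 4-bit keys the sort takes two passes, first on the right 2 bits and then on the left 2 bits, which is the order shown in the pocket tables.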

Two vectors Address Update

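k = (n-1)/63 + 1;

/* For a fixed j the 63 updates touch different words, so the inner loop vectorizes. */
for (j = 1; j < k; j++)
    for (i = 0; i < 63; i++)
        word[k*i+j+1] = word[k*i+j+1] + word[k*i+j];

for (i = 1; i < 63; i++)
    for (j = 1; j < k; j++)
        word[k*i+j] = word[k*i] + word[k*i+j];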
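The recurrence word[j+1] = word[j] + word[j+1] is a running sum, which is serial as written. One common way to expose vector work is to cut the array into segments, sum within each segment, and then add each segment's carry. The sketch below uses 63 segments of length k as on the slide, but its indexing is chosen for clarity and is not copied from the slide; prefix_sum_two_sweeps is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

#define SEGS 63   /* number of segments, taken from the slide's loop bound */

/* In-place inclusive running sum of word[0..n-1], done in two sweeps. */
static void prefix_sum_two_sweeps(long *word, size_t n)
{
    if (n == 0)
        return;
    size_t k = (n - 1) / SEGS + 1;      /* segment length, as on the slide */
    assert(k * SEGS >= n);

    /* Sweep 1: running sum inside every segment. For a fixed j the updates
       for different segments i are independent of one another. */
    for (size_t j = 1; j < k; j++)
        for (size_t i = 0; i < SEGS; i++) {
            size_t p = k * i + j;
            if (p < n)
                word[p] += word[p - 1];
        }

    /* Sweep 2: segments in order. The last word of the previous segment is
       already a global sum, so add it to every word of segment i. */
    for (size_t i = 1; i < SEGS && k * i < n; i++) {
        long carry = word[k * i - 1];
        for (size_t j = 0; j < k; j++) {
            size_t p = k * i + j;
            if (p < n)
                word[p] += carry;
        }
    }
}
```

In both sweeps the inner loop has no dependence between its iterations, which is the property a vectorizing compiler or hand-written vector code needs.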

Improving PSORT

Version \ Count   4000000  1000000   500000   100000    50000    10000     5000
Ported             1.2252    .4972    .4142    .0722    .0360    .0047    .0049
V-S Loop            .7838    .2484    .1725    .0327    .0159    .0028    .0013
2V Loops            .7080    .2015    .1273    .0210    .0092    .0027    .0012

PSORT in 2003

Machine \ Size    10000000   5000000   1000000    500000    100000     50000     10000
X1                   1.684      .866      .202      .127      .021      .009      .003
Alpha EV6.8         16.536     7.636      .934      .394      .070      .023      .003
Alpha EV6.7         27.604    12.801     3.016     1.490      .120      .040      .004
T90                   1.31      .702      .183      .090      .019      .017      .003

T3E (five values only; their columns are uncertain): 6.371  3.307  .747  .377  .037

[Figure: MSORT layout. Each of PE 0 through PE 5 runs SORT on its own tile of the data; the sorted tiles are then merged on PE 0.]
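The sort-then-merge layout in the figure can be sketched serially in C. This is only an outline of the data flow: the standard library qsort stands in for the per-PE sort, the loop over tiles stands in for the PEs, and no SHMEM calls are made. The name tile_sort_merge and the limit of 64 tiles are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static int cmp_unsigned(const void *a, const void *b)
{
    unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
    return (x > y) - (x < y);
}

/* Sort n keys as `tiles` independently sorted tiles, then merge them into out[]. */
static void tile_sort_merge(unsigned *keys, unsigned *out, size_t n, size_t tiles)
{
    assert(tiles > 0 && tiles <= 64);
    size_t start[64], end[64];
    size_t per = (n + tiles - 1) / tiles;       /* tile length; last tiles may be short */

    for (size_t t = 0; t < tiles; t++) {        /* each tile stands for one PE */
        start[t] = t * per < n ? t * per : n;
        end[t] = (t + 1) * per < n ? (t + 1) * per : n;
        qsort(keys + start[t], end[t] - start[t], sizeof *keys, cmp_unsigned);
    }

    for (size_t k = 0; k < n; k++) {            /* merge: take the smallest tile head */
        size_t best = tiles;
        for (size_t t = 0; t < tiles; t++)
            if (start[t] < end[t] &&
                (best == tiles || keys[start[t]] < keys[start[best]]))
                best = t;
        out[k] = keys[start[best]++];
    }
}
```

The merge writes into a second array of the same size, so the sketch needs twice the memory of the input.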





Improving MSORT

• Slow time due to SHMEM.

• May be due to barrier code.

• PSORT is faster than QKSORT.

• We need a double block of memory for the merge.

• However, times are similar since most of the time is spent in moving data.

MSORT in 2003

Size          10,000,000   1,000,000     100,000      10,000
T3E - 8            1.661        .142        .025        .002
T3E - 16           1.098        .099        .068        .004
T3E - 24            .755        .074        .085        .009
X1 - 8             4.011        .492        .050        .033
X1 - 16            2.292        .348        .048        .042
X1 - 24            1.674        .191        .026        .109

Rows with only three values (their columns are uncertain):

T3E - 2    .488  .040  .003
T3E - 4    .263  .023  .002
X1 - 2     .955  .091  .013
X1 - 4     .650  .064  .014

What next?

• Can we think of a way to use multistreaming?

• If so – should we do it?

• Algorithms using SPP mode.

• Suggest compiler improvements to Cray.

What was the Point?

• Early exercise on the X1.

• Produce good sorts for the X1.

• Found compiler weaknesses.

• Found machine strengths.
