The Group Runtime Optimization for High-Performance Computing An Install-Time System for Automatic Generation of Optimized Parallel Sorting Algorithms Marek Olszewski and Michael Voss ECE Department University of Toronto


Dec 26, 2015


The Group Runtime Optimization for High-Performance Computing

An Install-Time System for Automatic Generation of Optimized Parallel Sorting Algorithms

Marek Olszewski and Michael Voss

ECE Department

University of Toronto


PDPTA 2004

Motivation

Sorting is a fundamental algorithm
Many algorithmic choices for sorting
Performance heavily influenced by
  Data being sorted (type, entropy)
  Target machine being used
How can we build the best sort for a given machine?
  An empirical install-time system


Outline of Talk

Motivation
An Overview of Sorting Algorithms
Our install-time empirical system
  An adaptive hybrid sequential sort
  An adaptive hybrid parallel sort
An Evaluation
Related Work
Conclusions


An overview of sorting algorithms

The Art of Computer Programming, Vol. 3 (Knuth)
  25 algorithms comprehensively studied
Comparison sorts
  Lower bound shown to be Ω(n log n)
  Examples include: insertion sort, quick sort and merge sort
Non-comparison sorts
  Can be linear time, i.e. O(n)
  But require knowing the range of the data
  Examples include: radix sort and bucket sort
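
To make the "requires knowing the range" point concrete, here is a minimal counting-sort sketch. It is not taken from the talk; the function name `counting_sort` and the `max_value` parameter are illustrative assumptions. It runs in O(n + k) time for keys in [0, k), which is linear only when the range is known and small relative to n.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counting sort for integers known to lie in [0, max_value].
// Hypothetical helper for illustration; not part of the system described here.
std::vector<int> counting_sort(const std::vector<int>& input, int max_value) {
    assert(max_value >= 0);
    std::vector<std::size_t> counts(static_cast<std::size_t>(max_value) + 1, 0);
    for (int v : input) {
        assert(v >= 0 && v <= max_value);  // the key range must be known up front
        ++counts[static_cast<std::size_t>(v)];
    }
    // Emit each key as many times as it was seen, in increasing key order.
    std::vector<int> output;
    output.reserve(input.size());
    for (int key = 0; key <= max_value; ++key) {
        output.insert(output.end(), counts[static_cast<std::size_t>(key)], key);
    }
    return output;
}
```

No element is ever compared with another, which is why the Ω(n log n) comparison bound does not apply.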


An overview of sorting algorithms

Hybrid sorts
  Divide and conquer sorts are recursive
  May be beneficial to switch algorithms
Most C++ STL sorts are hybrid sorts
  GNU std::sort is a hybrid sort with pre-defined points to switch between heap sort, quick sort, merge sort and insertion sort
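
A hybrid sort with one fixed switch point can be sketched as below. This is an illustration only, not the GNU implementation: the names `hybrid_sort` and `insertion_sort` and the cutoff of 16 are assumptions, and a tuned system would pick the switch point by measurement.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical switch point; chosen arbitrarily for this sketch.
constexpr int kInsertionCutoff = 16;

// Insertion sort: rotate each element leftwards into the sorted prefix.
template <typename It>
void insertion_sort(It first, It last) {
    for (It i = first; i != last; ++i) {
        std::rotate(std::upper_bound(first, i, *i), i, i + 1);
    }
}

// Quick sort that hands small sub-ranges to insertion sort.
template <typename It>
void hybrid_sort(It first, It last) {
    using T = typename std::iterator_traits<It>::value_type;
    if (last - first <= kInsertionCutoff) {
        insertion_sort(first, last);
        return;
    }
    const T pivot = *(first + (last - first) / 2);
    // Three-way partition: [< pivot) [== pivot) [> pivot).
    It mid1 = std::partition(first, last, [&](const T& x) { return x < pivot; });
    It mid2 = std::partition(mid1, last, [&](const T& x) { return !(pivot < x); });
    assert(mid1 != mid2);  // the pivot itself always lands in the middle band
    hybrid_sort(first, mid1);
    hybrid_sort(mid2, last);
}
```

Calling `hybrid_sort(v.begin(), v.end())` on a `std::vector<int>` sorts it in place. The recursion always shrinks because the middle band is never empty.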


An overview of parallel sorts

Ideally, O((n log n) / p)
  If p = n, then O(log n)
  Several parallel sorts demonstrate this bound, e.g. Column sort
Parallelized sequential sorts often better for low numbers of processors (our focus)
Parallelized divide and conquer algorithms
  Effective for small numbers of processors
  Use a work-queue model
    Tasks are placed in a shared work-queue
    Idle processors remove tasks from the queue
  Good load balance
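
The work-queue model can be sketched as follows, assuming C++11 threads rather than whatever threading layer the original system used. The names `WorkQueue`, `worker` and `work_queue_sort`, and the `cutoff` parameter, are hypothetical. Each task is an index range; a worker either sorts a small range itself or partitions it and queues both halves for any idle thread.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// A task is a half-open index range [first, second) that still needs sorting.
using Task = std::pair<std::size_t, std::size_t>;

class WorkQueue {
public:
    void push(Task t) {
        { std::lock_guard<std::mutex> lock(m_); tasks_.push_back(t); ++pending_; }
        cv_.notify_one();
    }
    // Blocks until a task is available; returns false once all work is done.
    bool pop(Task& out) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !tasks_.empty() || pending_ == 0; });
        if (tasks_.empty()) return false;
        out = tasks_.front();
        tasks_.pop_front();
        return true;
    }
    // Called when a task has finished, after it has pushed any subtasks.
    void done() {
        bool finished;
        { std::lock_guard<std::mutex> lock(m_); finished = (--pending_ == 0); }
        if (finished) cv_.notify_all();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<Task> tasks_;
    std::size_t pending_ = 0;  // tasks that are queued or currently running
};

void worker(std::vector<int>& a, WorkQueue& q, std::size_t cutoff) {
    int* base = a.data();
    Task t;
    while (q.pop(t)) {
        const std::size_t lo = t.first, hi = t.second;
        if (hi - lo <= cutoff) {
            std::sort(base + lo, base + hi);  // small task: finish it locally
        } else {
            const int pivot = base[lo + (hi - lo) / 2];
            int* mid1 = std::partition(base + lo, base + hi,
                                       [pivot](int x) { return x < pivot; });
            int* mid2 = std::partition(mid1, base + hi,
                                       [pivot](int x) { return x == pivot; });
            // Share both halves; ranges are disjoint, so tasks never overlap.
            q.push(Task(lo, static_cast<std::size_t>(mid1 - base)));
            q.push(Task(static_cast<std::size_t>(mid2 - base), hi));
        }
        q.done();
    }
}

void work_queue_sort(std::vector<int>& a, unsigned num_threads, std::size_t cutoff) {
    assert(num_threads > 0);
    WorkQueue q;
    q.push(Task(0, a.size()));
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < num_threads; ++i)
        pool.emplace_back(worker, std::ref(a), std::ref(q), cutoff);
    for (auto& t : pool) t.join();
}
```

Load balance comes from the shared queue: whichever thread is idle takes the next range, so an unlucky pivot does not leave one processor with most of the work.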


Our install-time system

[Flowchart]
Start
Sample input data provided to installer
Time sorts: random algorithms at each recursive step
Calculate best sorting algorithm for each data set size
C4.5 creates decision tree
Convert tree to C++
Specialized decision function placed in library
Parallel?
  No: End
  Yes: Time sorts: different input sizes and work-share points
       Work-share cutoff point tree and C++ functions generated
       End


Algorithms available to our hybrid sort:

Algorithm            Description
Insertion Sort       O(n^2), but with small lower-order terms. Efficient for small lists.
Merge Sort           O(n log n). Subtasks evenly divided, but has higher lower-order terms than quick sort.
Quick Sort           O(n log n) on average, but O(n^2) worst-case. Has smaller lower-order terms than merge sort.
In-place Merge Sort  O(n log n). Higher constant coefficient than merge sort, but uses less memory.
Heap Sort            O(n log n). Non-recursive algorithm. Can do well on medium-sized lists. Higher lower-order terms than quick sort.


Hybrid Adaptive Sequential Sort

Use random data to train system
  Up to 10 million elements
  Insertion sort not used for large inputs
  Not all inputs sorted to completion
Dynamic programming used to find best choice
  Assume best sort at each subsequent step
  Per-step timings were measured
C4.5 decision tree used to analyze this data
  C4.5 tree converted to C++ template code
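
The slides do not show the generated code, so the following is only a guess at its general shape: a decision tree over the input size flattened into nested comparisons. The enum `SortChoice`, the function `choose_sort` and every threshold are invented for illustration; real thresholds would come from the timings collected at install time.

```cpp
#include <cstddef>

// The five algorithms listed as available to the hybrid sort.
enum class SortChoice { Insertion, Quick, Merge, InPlaceMerge, Heap };

// Hand-written stand-in for a tree learned by C4.5.
// All split points below are made up; they are NOT measured values.
template <typename T>
SortChoice choose_sort(std::size_t n) {
    if (n <= 32) {
        return SortChoice::Insertion;   // small lists: low overhead wins
    }
    if (n <= 4096) {
        return SortChoice::Quick;
    }
    if (n <= 1000000) {
        return SortChoice::Merge;
    }
    return SortChoice::Quick;
}
```

A hybrid sort would call such a function at each recursive step, for example `choose_sort<int>(hi - lo)`, and dispatch to the returned algorithm for that step only. Making it a template leaves room for per-type trees, which is one plausible reading of "C++ template code".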


Hybrid Adaptive Parallel Sort

Start with sequential hybrid sort
Determine work-sharing cutoff point
  When should a thread execute its own tasks?
  When should a thread place tasks in the work queue?
Determines the point below which tasks are too small to amortize synchronization costs
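
The effect of a work-share cutoff can be sketched with a recursive merge sort, substituting `std::async` for the shared work queue to keep the example short. The function name `cutoff_sort` and the `cutoff` parameter are assumptions; in the system described here the cutoff would be found by timing different work-share points.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Sort [lo, hi) of `a`. Ranges longer than `cutoff` hand their left half to
// another thread; shorter ranges are sorted entirely by the current thread.
void cutoff_sort(std::vector<int>& a, std::size_t lo, std::size_t hi,
                 std::size_t cutoff) {
    assert(lo <= hi && hi <= a.size());
    if (hi - lo < 2) return;
    int* base = a.data();
    if (hi - lo <= cutoff) {
        // Below the cutoff, sharing would cost more than it saves.
        std::sort(base + lo, base + hi);
        return;
    }
    const std::size_t mid = lo + (hi - lo) / 2;
    // Above the cutoff, share the left half and keep the right half.
    auto left = std::async(std::launch::async, cutoff_sort,
                           std::ref(a), lo, mid, cutoff);
    cutoff_sort(a, mid, hi, cutoff);
    left.get();
    std::inplace_merge(base + lo, base + mid, base + hi);
}
```

A very small cutoff makes this sketch start a thread for almost every sub-range, which is exactly the synchronization overhead the cutoff is meant to avoid; a very large one leaves processors idle.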


Methodology: Platforms

Sequential platforms
  Linux 2.4.18 on a 1.6 GHz Intel Pentium 4 Xeon
  Linux 2.4.24 on an AMD Athlon XP 1700+
  SunOS 5.8 on a 600 MHz Sparc workstation
Parallel platform
  4-processor 1.6 GHz Intel Xeon SMP
  Modified 2.4.18-smp kernel (allowed binding)


Methodology: Comparisons

Adaptive Hybrid Sequential Sort
Adaptive Hybrid Parallel Sort
GNU G++ 2.96 std::sort and std::stable_sort
  Also hybrid sorts
  Complex – not easily parallelized
8 equally sized merge sorts that called std::sort and std::stable_sort in parallel
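
One reading of this parallel STL baseline is: split the input into 8 nearly equal chunks, sort each chunk with `std::sort` on its own thread, then merge the sorted runs. The sketch below follows that reading; it is an interpretation, and the name `chunked_parallel_sort` is hypothetical. The merge phase here is sequential, which the original baseline may or may not have been.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

void chunked_parallel_sort(std::vector<int>& a, std::size_t chunks = 8) {
    assert(chunks > 0);
    const std::size_t n = a.size();
    // Chunk i covers [bounds[i], bounds[i + 1]); sizes differ by at most one.
    std::vector<std::size_t> bounds(chunks + 1);
    for (std::size_t i = 0; i <= chunks; ++i) bounds[i] = i * n / chunks;

    int* p = a.data();
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < chunks; ++i) {
        // Each thread sorts its own disjoint chunk with the library sort.
        workers.emplace_back([p, &bounds, i] {
            std::sort(p + bounds[i], p + bounds[i + 1]);
        });
    }
    for (auto& w : workers) w.join();

    // Merge neighbouring sorted runs pairwise until a single run remains.
    for (std::size_t width = 1; width < chunks; width *= 2) {
        for (std::size_t i = 0; i + width < chunks; i += 2 * width) {
            const std::size_t end = std::min(i + 2 * width, chunks);
            std::inplace_merge(p + bounds[i], p + bounds[i + width], p + bounds[end]);
        }
    }
}
```

Replacing `std::sort` with `std::stable_sort` in the worker gives the stable variant of the same baseline.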


Serial Non-Optimized (w/o -O) Results


Serial Optimized (with -O) Results


Parallel Work-share Cutoff Point


Parallel Non-Optimized (w/o -O) Results


Parallel Optimized (with -O) Results


Parallel Sort Speedups


Related Work

Install-time empirical optimization systems
  ATLAS: Level 3 BLAS
  FFTW: FFT
STAPL: Adaptive Parallel C++ Library
  Uses decision trees like our approach
  Uses only single-level sorts, not hybrids
  Not available for comparison
A Dynamically Tuned Sorting Library (CGO’04)
  Install-time tuning of sequential sorts
  Only single-level sorts, not hybrid


Conclusion

Presented an install-time system for empirically constructing a “best” sorting algorithm for a target machine
Competitive with STL sort on 1 processor
Better than a parallelized STL sort on multiple processors