1

Parallel Applications

15-740

Computer Architecture

Ning Hu, Stefan Niculescu & Vahe Poladian

November 22, 2002


Application and Architectural Bottlenecks in Distributed Shared Memory Multiprocessors


4

Question

Can realistic applications achieve reasonable performance on large scale DSM machines?

Problem Size

Programming effort and Optimization

Main architectural bottlenecks


5

Metrics

Minimum problem size to achieve a desired level of parallel efficiency

Parallel Efficiency >= 70%

Assumption: as problem size increases, performance increases

Question: Is the assumption always true? Why or Why not?

Parallel Efficiency = Speedup over Sequential Implementation / Number of Processors used
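The parallel-efficiency metric above is straightforward to compute; a minimal sketch (the timings below are hypothetical, for illustration only):

```python
def parallel_efficiency(t_serial, t_parallel, n_procs):
    """Parallel efficiency = speedup over sequential implementation / #processors."""
    speedup = t_serial / t_parallel
    return speedup / n_procs

# Hypothetical example: serial run takes 120 s, 16-processor run takes 10 s.
eff = parallel_efficiency(120.0, 10.0, 16)
print(f"{eff:.0%}")  # 12x speedup on 16 processors -> 75%, above the 70% target
```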


6

Programming Difficulty and Optimization

Techniques already employed

Balance the workload

Reduce inherent communication

Incorporate major forms of data locality (temporal and/or spatial)

Further optimization

Place the data appropriately in physically distributed memory instead of allowing pages of data to be placed round-robin across memories

Modify major data structures substantially to reduce unnecessary communication and to facilitate proper data placement

Algorithmic enhancements to further improve the load imbalance or reduce the amount of necessary communication

Prefetching

Software-controlled

Insert prefetches by hand
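A hedged sketch of why the data-placement optimization above matters: count remote page accesses when each of P nodes sweeps its own contiguous slice of an array, under round-robin versus blocked placement (array and page counts are hypothetical):

```python
def remote_fraction(n_pages, n_nodes, placement):
    """Fraction of page accesses that hit a remote memory, for one full sweep
    in which each node touches only the pages of its contiguous 1/P slice."""
    pages_per_node = n_pages // n_nodes
    remote = 0
    for page in range(n_pages):
        accessor = page // pages_per_node      # node whose slice contains this page
        if placement == "round_robin":
            owner = page % n_nodes             # pages dealt out across memories
        else:                                  # blocked placement matching the sweep
            owner = page // pages_per_node
        if owner != accessor:
            remote += 1
    return remote / n_pages

print(remote_fraction(1024, 16, "round_robin"))  # 0.9375: almost all accesses remote
print(remote_fraction(1024, 16, "blocked"))      # 0.0: every access is local
```

With round-robin placement only 1/P of each slice happens to land in the accessor's memory, which is why placing data to match the access pattern is worth the effort.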


7

Simulation Environment and Validation

Simulated Architecture

Stanford FLASH multiprocessor

Validation

Stanford DASH machine vs. Simulator

Speedups: Simulator ≈ DASH machine


8

Applications

Subset of SPLASH-2

Important scientific & engineering computations

Different communication patterns & requirements

Indicative of several types of applications running on large scale DSMs


9

Results – With & w/o prefetching


10

Architectural Bottleneck


11

Conclusion

Problem size: possible to get good performance on large-scale DSMs using problem sizes that are often surprisingly small, except for Radix

Programming difficulty: in most cases not difficult to program; scalable performance can be achieved without changing the code too much

Architectural bottleneck: end-point contention, which requires extremely efficient communication controllers


Scaling Application Performance on a Cache-coherent Multiprocessor


14

The Question

Can distributed shared memory, cache-coherent, non-uniform memory access (cc-NUMA) architectures scale on parallel apps?

What do we mean by scale:

Achieve parallel efficiency of 60%,

For a fixed problem size,

Increasing the number of processors,
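Scaling at a fixed problem size is the hard case: under a simple Amdahl-style model (the serial fraction s = 0.05 is a hypothetical value, not from the paper), parallel efficiency falls below the 60% target quickly as processors are added:

```python
def efficiency(p, s=0.05):
    """Parallel efficiency under Amdahl's law with serial fraction s
    (hypothetical value), at a fixed problem size, on p processors."""
    speedup = 1.0 / (s + (1.0 - s) / p)
    return speedup / p

for p in (4, 16, 64, 256):
    print(p, round(efficiency(p), 2))   # efficiency drops steadily with p
```

Even a 5% serial fraction puts efficiency under 60% well before 64 processors, which is why restructuring (rather than just adding processors) becomes necessary.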


15

DSM, cc-NUMA

Each processor has private cache,

Shared address space constructed from “public” memory of each processor,

Loads / stores used to access memory,

Hardware ensures cache coherence,

Non-uniform: miss penalty for remote data higher,

SGI Origin2000 chosen as an aggressive representative in this architectural family,


16

Origin 2000 overview

Nodes placed as vertices of hypercubes:

Ensures that communication latency grows only linearly as the number of nodes doubles (i.e., logarithmically in the node count),

Each node has two 195 MHz processors, each with its own 32 KB first-level cache and a 4 MB second-level cache

Total addressable memory is 32 GB

Most aggressive in terms of remote to local memory access latency ratio
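The hypercube property can be checked directly: the hop count between two nodes is the Hamming distance of their IDs, so the worst case is the dimension, log2(N). A small sketch:

```python
def hops(a, b):
    """Hop count between hypercube nodes a and b: bits that differ,
    since each link flips exactly one bit of the node ID."""
    return bin(a ^ b).count("1")

def worst_case_hops(n_nodes):
    """Network diameter: farthest node from node 0."""
    return max(hops(0, n) for n in range(n_nodes))

for n in (2, 4, 8, 16, 32):
    print(n, worst_case_hops(n))   # 1, 2, 3, 4, 5: +1 hop per doubling
```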


17

Benchmarks

SPLASH-2: Barnes-Hut, Ocean, Radix Sort, etc,

3 new: Shear Warp, Infer, and Protein,

Range of communication-to-computation ratio, temporal and spatial locality,

Initial sizes of problems determined from earlier experiments:

Simulation with 256 processors,

Implementation with 32 processors,


18

Initial Experiments


19

Avg. Exec. Time Breakdown


20

Problem Size

Idea:

increase problem size until desired level of efficiency is achieved,

Question:

Feasible?

Question:

Even if feasible, is it desirable?


21

Changing Problem Size


22

Why problem size helps

Communication to computation ratio improved

Less load imbalance, both in computation and communication costs

Less waiting at synchronization

Superlinearity effects of cache size

Helps larger processor counts, hurts smaller processor counts

Less false sharing
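The first point on this slide can be made concrete for a 2-D nearest-neighbor grid code like Ocean (grid sizes below are hypothetical): per processor, computation scales with the n²/P interior points while communication scales with the 4n/√P partition boundary, so the ratio shrinks as n grows.

```python
import math

def comm_to_comp(n, p):
    """Communication-to-computation ratio for an n x n grid split into
    p square subgrids: boundary points / interior points per processor."""
    compute = n * n / p                 # points updated per processor
    communicate = 4 * n / math.sqrt(p)  # boundary points exchanged per processor
    return communicate / compute        # simplifies to 4*sqrt(p)/n

print(comm_to_comp(512, 64))    # ratio at a small problem size
print(comm_to_comp(4096, 64))   # 8x larger grid -> 8x smaller ratio
```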


23

Application Restructuring

What kind of restructuring: algorithmic changes, data partitioning

Ways restructuring helps: reduced communication,

Better data placement,

Static partitioning for better load balance,

Restructuring is app-specific and complex,

Bonus side-effect: scales well on Shared Virtual Memory (clustered workstations) systems,


24

Application Restructuring


25

Conclusions

Original versions not scalable on cc-NUMA

Simulation not accurate for quantitative results; implementation needed

Increasing size a poor solution

App restructuring works: restructured apps also perform well on SVM,

Parallel efficiency of these versions better

However, to validate the results, it would be a good idea to run the restructured apps on a larger number of processors


A Comparison of the MPI, SHMEM and Cache-coherent Programming Models on the SGI Origin2000


28

Purpose

Compare the three programming models on the Origin2000

We focus on scientific applications that access data regularly or predictably,

Or that do not require fine-grained replication of irregularly accessed data


29

SGI Origin2000

Cache coherent, NUMA machine

64 processors: 32 nodes, with 2 MIPS R10000 processors each

512 MB memory per node

Interconnection network: 16-vertex hypercube

Pair of nodes associated with each vertex


30

Three Programming Models

CC-SAS

Linear address space for shared memory

MP

Communicate with other processes explicitly via message passing interface (MPI)

SHMEM

Shared-memory one-sided communication library

Via get and put primitives
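A toy sketch of the one-sided get/put style (this is not the real SHMEM API; the names and buffers are illustrative): any process can read or write another process's exposed buffer directly, without the target participating, unlike two-sided MPI send/recv.

```python
class PE:
    """A 'processing element' exposing a symmetric, remotely accessible buffer."""
    def __init__(self, pe_id, size):
        self.pe_id = pe_id
        self.mem = [0] * size

def shmem_put(src_data, dest_pe, offset):
    # One-sided write: the destination PE takes no action.
    dest_pe.mem[offset:offset + len(src_data)] = src_data

def shmem_get(src_pe, offset, nelems):
    # One-sided read: the source PE takes no action.
    return src_pe.mem[offset:offset + nelems]

pes = [PE(i, 8) for i in range(4)]
shmem_put([7, 7], pes[1], 0)      # PE 0 writes into PE 1's memory
print(shmem_get(pes[1], 0, 2))    # [7, 7]
```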


31

Applications and Algorithms

FFT: all-to-all communication (regular)

Ocean: nearest-neighbor communication

Radix: all-to-all communication (irregular)

LU: one-to-many communication


32

Questions to be answered

Can parallel algorithms be structured in the same way for good performance in all three models?

If there are substantial differences in performance under three models, where are the key bottlenecks?

Do we need to change the data structures or algorithms substantially to solve those bottlenecks?


33

Performance Result


34

Questions:

Anything unusual in the previous slide?

Working sets fit in the cache for multiprocessors but not for uniprocessors

Why is MP much worse than CC-SAS and SHMEM?


35

Analysis:

Execution time = BUSY + LMEM + RMEM + SYNC

where

BUSY: CPU computation time

LMEM: CPU stall time for local cache miss

RMEM: CPU stall time for sending/receiving remote data

SYNC: CPU time spent at synchronization events
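This decomposition is easy to normalize into fractions of total execution time; a minimal sketch (the component values are made up, for illustration only):

```python
def breakdown(busy, lmem, rmem, sync):
    """Normalize the four-way execution-time decomposition
    (BUSY + LMEM + RMEM + SYNC) into fractions of the total."""
    total = busy + lmem + rmem + sync
    return {name: value / total for name, value in
            [("BUSY", busy), ("LMEM", lmem), ("RMEM", rmem), ("SYNC", sync)]}

# Hypothetical numbers: a run dominated by remote stalls and synchronization.
print(breakdown(busy=40.0, lmem=10.0, rmem=30.0, sync=20.0))
```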


36

Time breakdown for MP


37

Improving MP performance

Remove extra data copy

Allocate all data involved in communication in shared address space

Reduce SYNC time

Use lock-free queue management in communication instead
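The idea behind lock-free queue management can be sketched as a single-producer/single-consumer ring buffer in which each side updates only its own index, so no lock is needed. This is illustrative only; a real implementation needs memory barriers and atomic index publication.

```python
class SPSCQueue:
    """Single-producer/single-consumer ring buffer (lock-free in spirit)."""
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0              # advanced only by the consumer
        self.tail = 0              # advanced only by the producer

    def push(self, item):          # producer side
        nxt = (self.tail + 1) % len(self.buf)
        if nxt == self.head:
            return False           # full (one slot kept empty)
        self.buf[self.tail] = item
        self.tail = nxt            # publish only after the slot is written
        return True

    def pop(self):                 # consumer side
        if self.head == self.tail:
            return None            # empty
        item = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return item

q = SPSCQueue(4)
q.push("msg0"); q.push("msg1")
print(q.pop(), q.pop())            # msg0 msg1
```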


38

Speedups under Improved MP


39

Question:

Why does CC-SAS perform best for small problem size?

Extra packing/unpacking operations in MP and SHMEM

Extra packet queue management in MP


40

Speedups for Ocean


41

Speedups for Radix


42

Speedups for LU


43

Conclusions

Good algorithm structures are portable among programming models.

MP is much worse than CC-SAS and SHMEM.

MP can achieve similar performance if the extra data copies and queue synchronization are properly addressed.