COSC 6385
Computer Architecture
Introduction and Organizational Issues
Edgar Gabriel
Spring 2018
Organizational issues (I)
• Classes:
– Tuesday, 11.30am – 1.00pm, F162
– Thursday, 11.30am – 1.00pm, F162
• Evaluation as planned right now
– 1 homework: 25%
– 3 quizzes: 75% (25% each)
• In case of questions:
– email: [email protected]
– Tel: (713) 743 3358
– Office hours: PGH 228, Monday, 11am-11.45am or by appointment
• All slides available on the website:
– http://www.cs.uh.edu/~gabriel/courses/cosc6385_s18/
– Videos of some lectures will be posted on the course web page
Organizational Issues (III)
• TAs for the course:
– Guangli Dai, email: [email protected]
• Dates for the quizzes:
– 1st quiz: Thursday, Feb 15
– 2nd quiz: Tuesday, March 27
– 3rd quiz: Thursday, April 26
• Homework
– Announced: Tuesday, Feb 13
– Due on: Tuesday, March 6
Contents
• Textbook:
John L. Hennessy,
David A. Patterson
“Computer Architecture –
A Quantitative Approach”
6th Edition
Morgan Kaufmann Publishers
Contents (II)
• Most of chapters 1 – 5, and 7
– Memory Hierarchy Design
– Instruction Level Parallelism
– Data Level Parallelism
– Thread Level Parallelism
– Domain Specific Architectures
• Appendix B, C
– Review of Memory Hierarchies
– Pipelining
• Selected literature on multi-core processors, GPUs, and
storage systems
Why learn about Computer Architecture?

for (i=0; i<n; i++ ) {
  c[i] = a[i] + b[i];
}

• Every loop iteration requires 3 memory operations
– 2 loads
– 1 store
• For a micro-processor having a frequency of 2 GHz this loop requires

  3 * 4 Bytes * 2 * 10^9 /s = 24 GBytes/s

  to satisfy one Floating Point Unit (FPU)
• Most modern processors have 2 FPUs and two or more Integer Units
which could work in parallel
• Most modern processors have more than one core that can operate
in parallel
Memory technology (www.kingston.com/newtech)
• Memory Bandwidth

  SB_max = SB_Bus * f_Bus * Op/Cycle

  with
  SB_max : max. memory bandwidth
  SB_Bus : bandwidth of the memory bus (64 Bit = 8 Bytes)
  f_Bus  : frequency of the memory bus
Memory modules
Source: https://www.kingston.com/us/memory/resources/ddr3_1600
Memory hierarchies

                              Size          Access time [cycles]
Backup (tape)                 TB, PB, EB
Primary data storage (disk)   ~ 10s TB      > 10^6
Main memory                   ~ 4-512 GB    100 - 1000
Caches                        ~ 1-32 MB     2 - 50
Register                      < 256 Words   1 - 2
Memory hierarchies
• Do I have to care about memory hierarchies?
• Example: Matrix-multiply of two dense matrices
– “Trivial” code
for ( i=0; i<dim; i++ ) {
for ( j=0; j<dim; j++ ) {
for ( k=0; k<dim; k++) {
c[i][j] += a[i][k] * b[k][j];
}
}
}
Matrix-multiply
• Performance of the trivial implementation on a 2.2 GHz AMD Opteron:

Matrix dimension   Execution time [sec]   Performance [MFLOPS]
256x256            0.118                  284
512x512            2.05                   130
Matrix-multiply (II)
• Peak floating point performance of the processor:

  2 * (2.2 * 10^9) floating point operations/sec = 4.4 * 10^9 = 4.4 GFLOPS

  – 2: number of floating point units
  – 2.2 * 10^9: frequency of the processor, assuming that each FPU can
    finish one operation per cycle
  – 4.4 GFLOPS: theoretical floating point peak performance of the
    processor
• Where are the missing FLOPS between theoretical peak and achieved
  performance?
  – Memory wait time
Blocked code
for ( i=0; i<dim; i+=block ) {
for ( j=0; j<dim; j+=block ) {
for ( k=0; k<dim; k+=block) {
for (ii=i; ii<(i+block); ii++) {
for (jj=j; jj<(j+block); jj++) {
for (kk=k; kk<(k+block); kk++) {
c[ii][jj] += a[ii][kk] * b[kk][jj];
}
}
}
}
}
}
Performance of the blocked code

Matrix      block   Execution time   Performance   "trivial"
dimension           [sec]            [MFLOPS]      [MFLOPS]
256x256       4     0.065            513           284
              8     0.046            726
             16     0.051            657
             32     0.043            777
             64     0.049            677
            128     0.113            296
512x512       4     0.686            391           130
              8     0.422            635
             16     0.447            599
             32     0.501            535
             64     1.00             266
            128     0.994            269
Trends: Cores and Threads per Chip

[Figures omitted]
Source: SICS Multicore Day '14
Slide source: Andy Semin, 'Intel processors and platforms roadmap for energy efficient HPC solutions',
http://academy.hpc-russia.ru/files/intel_hpc_public.pptx
"Big Core" – "Small Core"

Intel® Xeon® Processor:
– Simply aggregating more cores generation after generation is not
  sufficient
– Performance per core/thread must increase each generation, be as fast
  as possible
– Power envelopes should stay flat or go down each generation
– Balanced platform (Memory, I/O, Compute)
– Cores, Threads, Caches, SIMD

Intel® Xeon Phi™ Coprocessor:
– Optimized for highest compute per watt
– Willing to trade performance per core/thread for aggregate performance
– Power envelopes should also stay flat or go down every generation
– Optimized for highly parallel workloads
– Cores, Threads, Caches, SIMD

Different Optimization Points, Common Programming Models and Architectural
Elements (for illustration only)

Slide source: Andy Semin, 'Intel processors and platforms roadmap for energy efficient HPC solutions',
http://academy.hpc-russia.ru/files/intel_hpc_public.pptx
Slide source: A. Ramirez et al., 'Are mobile processors ready for HPC?'
http://www.montblanc-project.eu/sites/default/files/publications/Are%20mobile%20processors%20ready%20for%20HPC.pdf