COSC 6385
Computer Architecture
Introduction and Organizational Issues
Edgar Gabriel
Spring 2018
Organizational issues (I)
• Classes:
– Tuesday, 11.30am – 1.00pm, F162
– Thursday, 11.30am – 1.00pm, F162
• Evaluation as planned right now
– 1 homework: 25%
– 3 quizzes: 75% (25% each)
• In case of questions:
– email: [email protected]
– Tel: (713) 743 3358
– Office hours: PGH 228, Monday, 11am-11.45am or by appointment
• All slides available on the website:
– http://www.cs.uh.edu/~gabriel/courses/cosc6385_s18/
– Videos of some lectures will be posted on the course web page
Organizational Issues (III)
• TAs for the course:
– Guangli Dai, email: [email protected]
• Dates for the quizzes:
– 1st quiz: Thursday, Feb 15
– 2nd quiz: Tuesday, March 27
– 3rd quiz: Thursday, April 26
• Homework
– Announced: Tuesday, Feb 13
– Due on: Tuesday, March 6
Contents
• Textbook:
John L. Hennessy,
David A. Patterson
“Computer Architecture –
A Quantitative Approach”
6th Edition
Morgan Kaufmann Publishers
Contents (II)
• Most of chapters 1 – 5, and 7
– Memory Hierarchy Design
– Instruction Level Parallelism
– Data Level Parallelism
– Thread Level Parallelism
– Domain Specific Architectures
• Appendix B, C
– Review of Memory Hierarchies
– Pipelining
• Selected literature on multi-core processors, GPUs, and
storage systems
Why learn about Computer Architecture?

for (i=0; i<n; i++ ) {
  c[i] = a[i] + b[i];
}

• Every loop iteration requires 3 memory operations
– 2 loads
– 1 store
• For a micro-processor having a frequency of 2 GHz this loop requires

  3 * 4 Bytes * 2 * 10^9 /s = 24 GBytes/s

  to satisfy one Floating Point Unit (FPU)
• Most modern processors have 2 FPUs and two or more Integer Units
which could work in parallel
• Most modern processors have more than one core that can operate
in parallel
Memory technology (www.kingston.com/newtech)
• Memory Bandwidth

  SB_max = SB_Bus * f_Bus * Op/Cycle

  with
  SB_max : max. memory bandwidth
  SB_Bus : bandwidth of the memory bus (64 Bit = 8 Bytes)
  f_Bus  : frequency of the memory bus
Memory modules
Source: https://www.kingston.com/us/memory/resources/ddr3_1600
Memory hierarchies

                              Size          Access time [cycles]
Backup (tape)                 TB, PB, EB
Primary data storage (disk)   ~ 10s TB      > 10^6
Main memory                   ~ 4-512 GB    100 - 1000
Caches                        ~ 1-32 MB     2 - 50
Register                      < 256 Words   1 - 2
Memory hierarchies
• Do I have to care about memory hierarchies?
• Example: Matrix-multiply of two dense matrices
– “Trivial” code
for ( i=0; i<dim; i++ ) {
for ( j=0; j<dim; j++ ) {
for ( k=0; k<dim; k++) {
c[i][j] += a[i][k] * b[k][j];
}
}
}
Matrix-multiply
• Performance of the trivial implementation on a 2.2 GHz AMD Opteron:

Matrix dimension   Execution time [sec]   Performance [MFLOPS]
256x256            0.118                  284
512x512            2.05                   130
Matrix-multiply (II)
• Peak floating point performance of the processor:

  2 * (2.2 * 10^9) floating point operations/sec = 4.4 * 10^9 = 4.4 GFLOPS

  – 2: number of floating point units
  – 2.2 * 10^9: frequency of the processor, assuming that each FPU can
    finish one operation per cycle
  – 4.4 GFLOPS: theoretical floating point peak performance of the
    processor
• Where are the missing FLOPS between theoretical peak and achieved
  performance?
  – Memory wait time
Blocked code
for ( i=0; i<dim; i+=block ) {
for ( j=0; j<dim; j+=block ) {
for ( k=0; k<dim; k+=block) {
for (ii=i; ii<(i+block); ii++) {
for (jj=j; jj<(j+block); jj++) {
for (kk=k; kk<(k+block); kk++) {
c[ii][jj] += a[ii][kk] * b[kk][jj];
}
}
}
}
}
}
Performance of the blocked code

Matrix      block   Execution time   Performance   "trivial"
dimension           [sec]            [MFLOPS]      [MFLOPS]
256x256       4     0.065            513           284
              8     0.046            726
             16     0.051            657
             32     0.043            777
             64     0.049            677
            128     0.113            296
512x512       4     0.686            391           130
              8     0.422            635
             16     0.447            599
             32     0.501            535
             64     1.00             266
            128     0.994            269
Trends: Cores and Threads per Chip

[Figures omitted]
Source: SICS Multicore Day '14
Slide source: Andy Semin, 'Intel processors and platforms roadmap for energy efficient HPC solutions',
http://academy.hpc-russia.ru/files/intel_hpc_public.pptx
"Big Core" – "Small Core"

Intel® Xeon® Processor:
– Simply aggregating more cores generation after generation is not
  sufficient
– Performance per core/thread must increase each generation, be as fast
  as possible
– Power envelopes should stay flat or go down each generation
– Balanced platform (Memory, I/O, Compute)
– Cores, Threads, Caches, SIMD

Intel® Xeon Phi™ Coprocessor:
– Optimized for highest compute per watt
– Willing to trade performance per core/thread for aggregate performance
– Power envelopes should also stay flat or go down every generation
– Optimized for highly parallel workloads
– Cores, Threads, Caches, SIMD

Different Optimization Points, Common Programming Models and Architectural
Elements (for illustration only)

Slide source: Andy Semin, 'Intel processors and platforms roadmap for energy efficient HPC solutions',
http://academy.hpc-russia.ru/files/intel_hpc_public.pptx
Slide source: A. Ramirez et al., 'Are mobile processors ready for HPC?'
http://www.montblanc-project.eu/sites/default/files/publications/Are%20mobile%20processors%20ready%20for%20HPC.pdf