
1

Multithreaded Programming Concepts

2010. 3. 12

Myongji University

Sugwon Hong


2

Why Multi-Core?

Until recently, increasing the clock frequency was the holy grail for processor designers trying to boost performance.

But raising clock speed has hit a dead end because of power consumption and overheating.

Designers have realized that it is much more efficient to run several cores at a lower frequency than a single core at a much higher frequency.

3

Power and Frequency

(source : Intel Academy program)

4

A little bit of history

In the past, performance scaling in single-core processors was achieved by increasing the clock frequency.

As processors shrank and clock frequencies rose:

Excess power consumption and overheating

Memory access time failed to keep pace with increasing clock frequencies

5

Instruction/data-level parallelism

Since 1993, processor designers have supported parallel execution at the instruction and data level.

Instruction-level parallelism

Out-of-order execution pipelines and multiple functional units to execute instructions in parallel

Data-level parallelism

Multimedia Extension (MMX) in 1997

Streaming SIMD Extension (SSE)
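Data-level parallelism is exposed to programmers through SIMD intrinsics. Below is a minimal sketch in C, assuming an x86 compiler that provides <xmmintrin.h>; the function name add4 is invented for illustration.

#include <xmmintrin.h>

/* Add two arrays of 4 floats; each step is one SSE instruction. */
void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);             /* load 4 floats from a */
    __m128 vb = _mm_loadu_ps(b);             /* load 4 floats from b */
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* 4 additions at once  */
}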

6

Hyper-Threading

In 2002, Intel utilized additional copies of execution resources to execute two separate threads simultaneously on the same processor core.

This multi-threading idea eventually led to the introduction of dual-core processors in 2005.

7

Evolution of Multi-Core Technology

(source : Intel Academy program)

8

Multi-processor Architecture

Shared memory multiprocessor (SMP)

Non-shared memory architecture

Massively Parallel Processor (MPP)

Cluster

[Diagram: SMP, several CPUs attached to a single shared memory; MPP, several CPUs each with its own memory, linked by an interconnect]

9

Multi-processors vs. Multi-cores

Shared memory multi-processors (SMP)

Multiple threads on a single core (SMT)

Multiple threads on multi-cores (CMT)

Tricky acronyms

CMP (Chip Multi-Processor)

SMT (Simultaneous MultiThreading)

CMT (Chip-level MultiThreading)

10

CMT processor products

1st generation: Sun Microsystems (late 2005)

Intel Dual-Core Xeon (2005)

Intel Quad-Core Xeon (late 2006)

AMD Quad-Core Opteron (2007)

8-Core (??)

11

Thread

A thread is a sequential flow of instructions executed within a program.

Thread vs. Process

A single process always has one main thread, which initializes the process and begins executing the instructions.

Any thread can create other threads within a process; the threads share the code and data segments, but each thread has its own stack.
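As a concrete illustration, here is a minimal sketch using POSIX threads; the names worker and shared are invented for illustration. Both threads see the same global, while each has its own stack.

#include <pthread.h>
#include <stdio.h>

int shared = 0;                  /* lives in the shared data segment */

void *worker(void *arg)
{
    int local = 42;              /* lives on the worker's own stack */
    shared = local;
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);  /* main thread spawns a second thread */
    pthread_join(t, NULL);                   /* wait for it to finish */
    printf("shared = %d\n", shared);
    return 0;
}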

12

Thread in a Processprocess

13

Why use threads?

Threads are intended to improve performance and responsiveness of a program.

Quick turnaround time

Completing a single job in the smallest amount of time possible

High throughput

Finishing the most tasks in a fixed amount of time

14

Risks of using Threads

If threads are not used properly, they can degrade performance and sometimes cause unpredictable behavior and error conditions:

Data races (race conditions)

Deadlock

They also impose extra burdens:

Code complexity

Portability issues

Testing and debugging difficulty

15

Race condition

It happens when two or more threads access a shared variable at the same time and at least one of them writes it.

“It is nondeterministic!”

For example, suppose Thread A and Thread B both execute the statement

area = area + 4.0 / (1.0 + x*x)

16

(source : Intel Academy program)
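To make the lost update concrete, here is a minimal sketch in C with POSIX threads; racy_worker and the input values are invented for illustration. The single statement compiles into a load, an add, and a store, and the two threads' steps can interleave between the load and the store.

#include <pthread.h>
#include <stdio.h>

double area = 0.0;                    /* shared variable */

void *racy_worker(void *arg)
{
    double x = *(double *)arg;
    /* load area, add, store: the other thread can run between
       the load and the store, and its update is then lost */
    area = area + 4.0 / (1.0 + x * x);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    double xa = 0.25, xb = 0.75;
    pthread_create(&a, NULL, racy_worker, &xa);
    pthread_create(&b, NULL, racy_worker, &xb);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("area = %f (may differ from run to run)\n", area);
    return 0;
}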

17

How to deal with race conditions

Synchronization

Critical region

Mutual exclusion
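A minimal sketch of mutual exclusion with a POSIX mutex, applied to the racy update from the previous slide; the helper name add_term is invented for illustration. The lock turns the read-modify-write into a critical region that only one thread can execute at a time.

#include <pthread.h>

double area = 0.0;
pthread_mutex_t area_lock = PTHREAD_MUTEX_INITIALIZER;

void add_term(double x)
{
    pthread_mutex_lock(&area_lock);    /* enter the critical region */
    area = area + 4.0 / (1.0 + x * x);
    pthread_mutex_unlock(&area_lock);  /* leave the critical region */
}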

18

Concurrency vs. Parallelism

Generally the two terms are used interchangeably, but conventional wisdom draws the following distinction.

Concurrency

Two or more threads are in progress simultaneously, normally on a single processor.

Parallelism

Two or more threads are executed simultaneously on multiple cores.

19

Performance criteria

Speedup

Efficiency

Granularity

Load balance

20

Speedup

The most common quantitative measure compares the execution time of the best serial algorithm with that of the parallel algorithm.

Speedup = Ts/Tp

Ts = Serial Time, Tp = Parallel Time

Amdahl’s Law

Speedup = 1/[S+(1-S)/n + H(n)]

S: percentage of time spent on executing the serial portion

H(n) : parallel overhead

n: the number of cores
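For a quick numerical feel (an illustrative instance, not from the slides): with a serial fraction S = 0.1, n = 4 cores, and negligible overhead H(n) ≈ 0,

Speedup = 1/[0.1 + 0.9/4] = 1/0.325 ≈ 3.1

So even with 4 cores, a 10% serial portion caps the speedup well below 4.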

21

Example

Consider painting a fence. Suppose it takes 30 min to get ready to paint and 30 min for cleanup after painting. Assume that it takes 1 min to paint one single picket and there are 300 pickets. What are the speedups when 1, 2, 10, 100 painters do this job respectively? What is the maximum speedup?
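A worked solution sketch, assuming the 30 min of preparation and 30 min of cleanup stay serial while the 300 pickets are split evenly among n painters:

T(n) = 30 + 300/n + 30 min, so Speedup(n) = 360/(60 + 300/n)

Speedup(1) = 1, Speedup(2) ≈ 1.71, Speedup(10) = 4, Speedup(100) ≈ 5.71

The maximum speedup is 360/60 = 6, since the 60 serial minutes remain no matter how many painters are hired, exactly as Amdahl's Law predicts.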

What if you use a spray gun to paint the fence? What happens if the fence owner uses a spray gun to paint all 300 pickets in 1 hour?

22

Parallel Efficiency

A measure of how efficiently core resources are used during parallel computations

In the previous example, assume you knew that all painters were busy for an average of less than 6% of the total job time but were still paid for the whole time. Do you think you got your money's worth from the 100 painters?

Efficiency = (Speedup / Number of Threads) * 100%
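A quick check using the fence numbers worked out above (an illustration, not from the slides):

Efficiency(100) = (5.71/100) * 100% ≈ 5.7%

which matches the statement that the 100 painters were busy less than 6% of the time.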

23

Granularity

The ratio of computation to synchronization

Coarse-grained

Concurrent threads have a large amount of computation between synchronization events.

Fine-grained

Concurrent threads have very little computation between synchronization events.

24

Load Balance

Balancing the workloads among multiple threads

If more work is assigned to some threads, the lightly loaded threads will sit idle until the heavily loaded ones finish.

All the cores must be busy to get max. performance.

For load balancing, which size of task will be better? Large-sized or small-sized?

26

Computer Memory Hierarchy

[Diagram: CPU → L1 cache → L2 cache → Main memory → Disk]

L1 cache: ~1 cycle
L2 cache: ~1–10 cycles
Main memory: ~100s of cycles
Disk: ~1000s of cycles

27

Architecture consideration (1)

To obtain better performance, we need to understand how the work is done inside.

Cache

Cache line (cache block, e.g. 64 bytes)

Data moves between memory and caches in cache-line units.

Shared caches or separate caches between cores

Cache misses are very costly.

Cache coherency is required when caches are separate.

Replacement policies such as LRU

28

Architecture consideration (2)

Memory management

Paging

Translation Lookaside Buffer (TLB)

Inside the CPU

Registers

29

False sharing

Assume the cache line is 64 bytes. What happens if the two threads below execute at the same time?

Thread 1

int a[1000];
int b[1000];

while (…)
    a[998] = i * 1000;   /* writes near the end of a[] */

Thread 2

int a[1000];
int b[1000];

while (…)
    b[0] = i;            /* writes the start of b[] */

a[998] and b[0] are adjacent in memory, so they can land on the same 64-byte cache line. Each thread's write then invalidates that line in the other core's cache, even though the two threads never touch the same variable.
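One common remedy, sketched below assuming a C11 compiler, is to align each thread's data to its own 64-byte cache line so the writes no longer collide:

#include <stdalign.h>

/* Each element is aligned to 64 bytes, so it occupies its own
   cache line; one thread's writes no longer invalidate the line
   the other thread is using. */
struct padded {
    alignas(64) long value;
};

struct padded counters[2];   /* counters[0] for thread 1, counters[1] for thread 2 */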

30

Poor cache utilization

What is the difference between the following two code fragments?

int a[1000][1000];

/* Row-major traversal: consecutive j values touch adjacent
   memory, so each 64-byte cache line is used in full. */
for (i = 0; i < 1000; ++i)
    for (j = 0; j < 1000; ++j)
        a[i][j] = i*j;

int b[1000][1000];

/* Column-major traversal: b[j][i] strides 4000 bytes between
   iterations, so almost every access misses the cache. */
for (i = 0; i < 1000; ++i)
    for (j = 0; j < 1000; ++j)
        b[j][i] = i*j;

31

Poor Cache Utilization - with eggs

(source : Intel Academy program)

32

Good Cache Utilization – with eggs

(source : Intel Academy program)
