8/7/2019 Multicore Processor Report
1. INTRODUCTION

A processor is a unit that reads, decodes, and executes program instructions. Processors were originally developed with only one core. A core is the part of a processor that actually performs:
1. Fetching
2. Decoding
3. Executing
an instruction, as shown in Fig 1.1.
Fig. 1.1 (Single Core Computer)
A multicore processor, by contrast, is an integrated circuit (IC) onto which two or more individual and independent cores are attached on a single die. Placing two or more powerful computing cores on a single processor opens up a world of important new possibilities. Each core has its own complete set of resources and may share on-die cache layers.
A single-core processor can process only one instruction at a time. To improve efficiency, processors commonly use internal pipelines, which allow several instructions to be processed together; however, instructions still enter the pipeline one at a time.
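The fetch-decode-execute cycle described above can be sketched as a toy interpreter loop. This is a minimal illustration only: the LOAD/ADD/STORE instruction set and the accumulator design are invented for this sketch, not taken from any real ISA.

```python
def run_program(program, memory):
    """Toy fetch-decode-execute loop over a made-up accumulator machine."""
    acc, pc = 0, 0
    while pc < len(program):
        instruction = program[pc]   # 1. fetch the next instruction
        op, arg = instruction       # 2. decode it into opcode and operand
        if op == "LOAD":            # 3. execute it
            acc = memory[arg]
        elif op == "ADD":
            acc += memory[arg]
        elif op == "STORE":
            memory[arg] = acc
        pc += 1
    return memory

mem = {"x": 2, "y": 3, "z": 0}
run_program([("LOAD", "x"), ("ADD", "y"), ("STORE", "z")], mem)
# mem["z"] is now 5
```

A single-core processor runs exactly one such loop; the sections below describe how multiple cores run several of them at once.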
Fig. 1.2 (Single Core)
This led to the evolution of the multicore processor: placing two or more powerful computing cores on a single processor opens up a world of important new possibilities for increasing system performance.
Need for multicore processors:
The difficulties in using a single-core CPU gave birth to the multicore processor.
1. It is difficult and costly to push a single core's clock frequency even higher.
Fig 1.3 (Clock Frequency and Performance)
Raising the clock frequency further improves performance but decreases reliability, and it has become increasingly difficult. Doubling the frequency, which in practice also requires raising the supply voltage, causes roughly a fourfold increase in power consumption, since dynamic power is approximated as
Power ≈ Capacitance × Voltage² × Frequency.
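This scaling can be checked numerically with the classic CMOS dynamic-power approximation P ≈ C · V² · f; the values below are normalized illustrations, not measurements of any real chip.

```python
def dynamic_power(capacitance, voltage, frequency):
    # Classic CMOS dynamic-power approximation: P ≈ C * V^2 * f
    return capacitance * voltage ** 2 * frequency

base = dynamic_power(1.0, 1.0, 1.0)
# Doubling frequency alone only doubles dynamic power, but a higher
# frequency typically requires a higher supply voltage; raising V by
# about 41% as well yields roughly the fourfold increase quoted above.
scaled = dynamic_power(1.0, 1.414, 2.0)
ratio = scaled / base  # ≈ 4.0
```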
2. Many new applications are multithreaded.
3. The general trend in computer architecture nowadays is a shift towards parallelism. Deeply pipelined circuits lead to:
a. Heat problems
b. Speed-of-light problems
c. Large design teams being necessary
d. Server farms needing expensive air conditioning
To overcome the above drawbacks of the single-core processor, and to increase the performance of the system without increasing power consumption and with less complexity, the need for the multicore processor arose.
2. PROCESSOR HISTORY
Intel manufactured the first microprocessor, the 4-bit 4004, in the early 1970s; it was basically just a number-crunching machine. Shortly afterwards Intel developed the 8008 and 8080, both 8-bit, and Motorola followed suit with its 6800, which was equivalent to Intel's 8080. The companies then fabricated 16-bit microprocessors: Motorola had its 68000, and Intel the 8086 and 8088; the 8086 would be the basis for Intel's 32-bit 80386 and later its popular Pentium line-up, which were in the first consumer-based PCs. [18, 19] Each generation of processors grew smaller and faster, but dissipated more heat and consumed more power.
Fig 2.1 (Generations of processors)
2.1 MOORE'S LAW
One of the guiding principles of computer architecture is known as Moore's Law. In 1965 Gordon Moore stated that the number of transistors on a chip would roughly double each year. What is often quoted as Moore's Law is Dave House's revision that computer performance will double every 18 months. [20] The graph in Figure 2.2 plots many of the early microprocessors against the number of transistors per chip.
Fig 2.2 (Microprocessors against the number of transistors per chip)
Throughout the 1990s and the earlier part of this decade, microprocessor frequency was synonymous with performance; higher frequency meant a faster, more capable computer. Since processor frequency has reached a plateau, we must now consider other aspects of the overall performance of a system: power consumption, heat dissipation, frequency, and number of cores. Multicore processors often run at slower frequencies but have much better performance than a single-core processor, because two heads are better than one.
3. ARCHITECTURE

The processor cores are implemented on a single die or chip. Each core has its own complete set of resources, and may share the on-die cache layers.
Fig 3.1 (Architecture of a multicore processor)
Core:
The individual processors that are implemented on the integrated die or chip are
called the cores. The core is the part of the processor that actually performs the reading
and executing of instructions.
Register file:
A register file is an array of processor registers in a central processing unit (CPU). Modern integrated-circuit-based register files are usually implemented by way of fast static RAMs with multiple ports. Such RAMs are distinguished by having dedicated read and write ports, whereas ordinary multiported SRAMs will usually read and write through the same ports.
Bus:
The back-side bus connects the processor with cache memory.
Cache:
Closest to the processor is the Level 1 (L1) cache; this is very fast memory used to store data frequently used by the processor. The Level 2 (L2) cache is just off-chip, slower than the L1 cache, but still much faster than main memory; the L2 cache is larger than the L1 cache and used for the same purpose. The cores do not necessarily share the cache.
Cross bar:
A crossbar switch connects multiple inputs to multiple outputs in a matrix manner. Here the crossbar switch connects the system request queue (SRQ) and the integrated memory controller. It directly connects both CPU cores to the HyperTransport link, as well as to the integrated memory controller, for I/O to and from the outside world. Think of it like a train-track switch: signals can pass to or from either core and the outside world, but not at the same time.
HyperTransport link:
HyperTransport is a technology for interconnecting computer processors. It is a bidirectional serial/parallel, high-bandwidth, low-latency point-to-point link, and a replacement for the front-side bus.
Fig 3.2 (Multicore processor implemented with 4 independent processors)
Integrated memory controller:
The memory controller is a digital circuit which manages the flow of data going to and from the main memory.
System request queue:
The System Request Queue provides an interface for the CPU cores to the crossbar, and it
is what keeps things operating smoothly. The System Request Queue manages and
prioritizes both CPU cores' access to the crossbar switch, minimizing contention for the
system bus. The result is a very efficient use of system resources.
3.1 CORE COMPONENTS

Pipeline:
One widely accepted technique for improving the performance of serial software tasks is pipelining. Simply put, pipelining is the process of dividing a serial task into concrete stages that can be executed in assembly-line fashion. In order to gain the most performance increase possible from pipelining, individual stages must be carefully balanced so that no single stage takes much longer to complete than the other stages. A deeper pipeline buys frequency at the expense of an increased cache-miss penalty and lower instructions per clock. A shallow pipeline gives better instructions per clock at the expense of frequency scaling. Maximum frequency per core requires deeper pipelines.
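The trade-offs above can be sketched with a back-of-the-envelope timing model. The stage times below are made up for illustration; the point is that pipelined throughput is limited by the slowest stage, which is why stage balancing matters.

```python
def serial_time(stage_times, n_items):
    """Each item passes through every stage before the next item starts."""
    return n_items * sum(stage_times)

def pipelined_time(stage_times, n_items):
    """Stages overlap: once the pipeline is full, one item completes
    per slowest-stage interval."""
    return sum(stage_times) + (n_items - 1) * max(stage_times)

balanced = [1, 1, 1, 1]      # four equal stages
unbalanced = [1, 1, 4, 1]    # one slow stage dominates throughput

print(serial_time(balanced, 100))      # 400
print(pipelined_time(balanced, 100))   # 103
print(pipelined_time(unbalanced, 100)) # 403 (serial would take 700)
```

With balanced stages, 100 items finish in roughly a quarter of the serial time; one slow stage erases most of that gain.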
Cache:
With the rising gap between processor and memory speed, maximizing on-chip cache capacity is crucial to attaining good performance. Memory system designers employ hierarchies of caches to manage latency. Many of today's multicore processors assume private L1 caches and a shared L2 cache. At some point, however, a single shared L2 cache will require additional levels in the hierarchy. One option designers can consider is implementing a physical hierarchy that consists of multiple clusters, where each cluster consists of a group of processor cores that share an L2 cache. The effectiveness of such a physical hierarchy, however, may depend on how well the applications map to the hierarchy. Cache size buys performance at the expense of die size; larger caches also reduce the miss penalties of deep pipelines.
4. SIMULTANEOUS MULTITHREADING WITH MULTICORE ARCHITECTURE

Simultaneous multithreading, often abbreviated as SMT, is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures.
In simultaneous multithreading, instructions from more than one thread can be
executing in any given pipeline stage at a time. This is done without great changes to the
basic processor architecture: the main additions needed are the ability to fetch instructions
from multiple threads in a cycle, and a larger register file to hold data from multiple
threads. The number of concurrent threads can be decided by the chip designers, but
practical restrictions on chip complexity have limited the number to two for most SMT implementations.
Without simultaneous multithreading, only a single thread can run at a time, even though the processor contains multiple functional units, i.e. an integer unit and a floating-point unit. With simultaneous multithreading, the two functional units can execute two different threads concurrently. More important, however, two programs can now run simultaneously on a processor without having to be swapped in and out. To induce the
operating system to recognize one processor as two possible execution pipelines, the new
chips were made to appear as two logical processors to the operating system.
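As a software analogy (not the hardware mechanism itself), the sketch below shows two threads of a single process making progress concurrently. The thread names are invented for illustration, and `time.sleep` stands in for the stalls that SMT hardware fills with the other thread's instructions.

```python
import threading
import time

results = {}

def worker(name, delay):
    # Simulated work: while this thread waits, the other thread runs.
    time.sleep(delay)
    results[name] = True

t1 = threading.Thread(target=worker, args=("integer_thread", 0.1))
t2 = threading.Thread(target=worker, args=("float_thread", 0.1))
start = time.monotonic()
t1.start(); t2.start()
t1.join(); t2.join()
elapsed = time.monotonic() - start
# The two threads overlap: wall time is roughly 0.1 s, not the 0.2 s
# that running them one after the other would take.
```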
Fig 4.1 (Early processor: one chip, one core, one executing thread)
Fig 4.2 (Processor with simultaneous multithreading: one chip, one core, two executing threads)
The performance of simultaneous multithreading was limited by the availability of shared
resources to the two executing threads. As a result, SMT Technology cannot approach the
processing throughput of two distinct processors because of the contention for these
shared resources. To achieve greater performance gains on a single chip, a processor would require two or more separate cores, such that each thread would have its own
complete set of execution resources.
Fig 4.3 (Multicore: one chip, many cores, several threads of execution)
In this design, each core has its own execution pipeline. And each core has the resources
required to run without blocking resources needed by the other software threads. While
the example in Fig 4.3 shows a four-core design, there is no inherent limitation in the
number of cores that can be placed on a single chip. The multi-core design enables two or
more cores to run at somewhat slower speeds and at much lower temperatures. The
combined throughput of these cores delivers processing power greater than the maximum
available today on single-core processors and at a much lower level of power
consumption.
5. PERFORMANCE ANALYSIS
A multicore arrangement that provides two or more low-clock-speed cores could be designed to provide excellent performance while minimizing power consumption and delivering lower heat output than configurations that rely on a single high-clock-speed core.
The following example shows how multicore technology could manifest in a
standard server configuration and how multiple low-clock-speed cores could deliver
greater performance than a single high-clock-speed core for networked applications. This
example uses some simple math and basic assumptions about the scaling of multiple
processors and is included for demonstration purposes only. Until multicore processors are
available, scaling and performance can only be estimated based on technical models. The
example described in this article shows one possible method of addressing relative
performance levels as the industry begins to move from platforms based on single-core
processors to platforms based on multicore processors. Other methods are possible, and
actual processor performance and processor scalability are tied to a variety of platform
variables, including the specific configuration and application environment. Several
factors can potentially affect the internal scalability of multiple cores, such as the system
compiler as well as architectural considerations including memory, I/O, front side bus
(FSB), chip set, and so on.
For instance, enterprises can buy a dual-processor server today to provide e-mail,
calendaring, and messaging functions. Dual-processor servers are designed to deliver
excellent price/performance for messaging applications. In our example we use dual 3.6
GHz processors supporting simultaneous multithreading technology.
The following simple example can help explain the relative performance of a low-
clock-speed, dual-core processor versus a high-clock-speed, dual-processor counterpart.
Dual-processor systems available today offer a scalability of roughly 80 percent for the
second processor, depending on the OS, application, compiler, and other factors. That
means the first processor may deliver 100 percent of its processing power, but the second
processor typically suffers some overhead from multiprocessing activities. As a result, the
two processors do not scale linearly; that is, a dual-processor system does not achieve a
200 percent performance increase over a single-processor system, but instead provides
approximately 180 percent of the performance that a single-processor system provides. In
this article, the single-core scalability factor is referred to as external, or socket-to-socket,
scalability. When comparing two single-core processors in two individual sockets, the
dual 3.6 GHz processors would result in an effective performance level of approximately
6.48 GHz (see Figure).
Fig 5.1 (Sample core speed and anticipated total relative power in a system using two
single-core processors)
For multicore processors, administrators must take into account not only socket-to-socket
scalability but also internal, or core-to-core, scalability: the scalability between multiple
cores that reside within the same processor module. In this example, core-to-core
scalability is estimated at 70 percent, meaning that the second core delivers 70 percent of
its processing power.
Thus, in the example system using 2.8 GHz dual-core processors, each dual-core processor would behave more like a 4.76 GHz processor when the performance of the two cores (2.8 GHz plus 1.96 GHz) is combined. For demonstration purposes, this example assumes that, in a server that combines two such dual-core processors within the same system architecture, the socket-to-socket scalability of the two dual-core processors would be similar to that in a server containing two single-core processors: 80 percent scalability. This would lead to an effective performance level of 8.57 GHz (see Fig 5.2).
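The arithmetic of this example can be captured in a small model. This is a sketch of the scaling assumptions used above; the function name and parameter names are ours, not from any published model.

```python
def effective_perf(core_ghz, cores_per_socket, sockets,
                   core_scaling=0.70, socket_scaling=0.80):
    """Relative performance: the first core and first socket count at
    100%; each additional core or socket counts at its scaling factor."""
    per_socket = core_ghz * (1 + core_scaling * (cores_per_socket - 1))
    return per_socket * (1 + socket_scaling * (sockets - 1))

dual_single_core = effective_perf(3.6, 1, 2)  # two 3.6 GHz single-core CPUs
dual_dual_core = effective_perf(2.8, 2, 2)    # two 2.8 GHz dual-core CPUs
# dual_single_core ≈ 6.48 and dual_dual_core ≈ 8.57: the lower-clocked
# dual-core system comes out ahead, matching the figures above.
```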
Fig 5.2 (Sample core speed and anticipated total relative power in a system using two
dual-core processors)
6. MAPPING OF AN APPLICATION TO MULTICORE PROCESSOR

Task parallelism is the concurrent execution of independent tasks in software. On a single-core processor, separate tasks must share the same processor. On a multicore processor, tasks essentially run independently of one another, resulting in more efficient execution.
For mapping first comes Identifying a Parallel Task Implementation. Identifying
the task parallelism in an application is a challenge that, for now, must be tackled
manually. After identifying parallel tasks, mapping and scheduling the tasks across a
multicore system requires careful planning. A four-step process, derived from Software
Decomposition for Multicore Architectures [1], is proposed to guide the design of the
application:
1. Partitioning: Partitioning of a design is intended to expose opportunities for parallel execution. The focus is on defining a large number of small tasks in order to yield a fine-grained decomposition of a problem.
2. Communication: The tasks generated by a partition are intended to execute concurrently but cannot, in general, execute independently. The computation to be performed in one task will typically require data associated with another task. Data must then be transferred between tasks to allow computation to proceed. This information flow is specified in the communication phase of a design.
3. Combining: Decisions made in the partitioning and communication phases are reviewed to identify a grouping that will execute efficiently on the multicore architecture.
4. Mapping: This stage consists of determining where each task is to execute.
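The mapping step (4) can be sketched as a greedy load-balancing scheduler: assign each task, longest first, to the core that currently has the least work. This is an illustrative heuristic of ours, not the method proposed in [1], and the task durations are made up.

```python
import heapq

def map_tasks(task_times, n_cores):
    """Greedy longest-processing-time mapping of tasks onto n_cores."""
    cores = [(0.0, i, []) for i in range(n_cores)]  # (load, core id, tasks)
    heapq.heapify(cores)
    for t in sorted(task_times, reverse=True):
        load, i, tasks = heapq.heappop(cores)        # least-loaded core
        heapq.heappush(cores, (load + t, i, tasks + [t]))
    return sorted(cores, key=lambda c: c[1])

schedule = map_tasks([4, 3, 3, 2, 2, 2], n_cores=2)
makespan = max(load for load, _, _ in schedule)
# 16 time units of work finish in 8 on two cores instead of 16 serially.
```

In practice the combining step (3) would first group fine-grained tasks so that communication between cores stays cheap relative to the work mapped onto each one.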
7. APPLICATIONS
- Database servers
- Web servers (Web commerce)
- Compilers
- Video editing and encoding
- 3D gaming
- Powerful graphics solutions
- Multimedia applications
- Scientific applications, CAD/CAM
- In general, applications with thread-level parallelism
The full effect and advantage of a multicore processor are obtained when it is used together with a multithreading operating system.
Fig 7.1 (two different processes running concurrently in two different cores)
8. ADVANTAGES AND DISADVANTAGES
8.1 ADVANTAGES
(i) Cache coherency: The proximity of multiple CPU cores on the same die allows the cache-coherency circuitry to operate at a much higher clock rate than is possible if the signals have to travel off-chip. Combining equivalent CPUs on a single die significantly improves the performance of cache-snoop (bus-snooping) operations. Put simply, signals between different CPUs travel shorter distances and therefore degrade less. These higher-quality signals allow more data to be sent in a given time period, since individual signals can be shorter and do not need to be repeated as often.
(ii) Improved response time: The largest boost in performance will likely be noticed in improved response time while running CPU-intensive processes, like antivirus scans, ripping/burning media (requiring file conversion), or file searching. For example, if an automatic virus scan runs while a movie is being watched, the application running the movie is far less likely to be starved of processor power, as the antivirus program will be assigned to a different processor core than the one running the movie playback.
(iii) Reduced printed-circuit-board size and power: Assuming that the die can physically fit into the package, multicore CPU designs require much less printed circuit board (PCB) space than multi-chip SMP designs. Also, a dual-core processor uses slightly less power than two coupled single-core processors, principally because of the decreased power required to drive signals external to the chip. Furthermore, the cores share some circuitry, like the L2 cache and the interface to the front-side bus (FSB). In terms of competing technologies for the available silicon die area, multicore design can make use of proven CPU core library designs and produce a product with lower risk of design error than devising a new, wider core design. Also, adding more cache suffers from diminishing returns.
(iv) Multitasking productivity: Users of multicore PCs will experience exceptional performance while executing multiple tasks simultaneously. The ability to run complex, multitasked workloads, such as creating professional digital content while checking and writing e-mails in the foreground, and also running firewall software or downloading audio files from the Web in the background, will allow consumers and workers to do more work in less time.
(v) Enhanced security: PC security can be enhanced because multicore processors can run more sophisticated virus, spam, and hacker protection in the background without performance penalties.
(vi) Cool and quiet: The enhanced performance offered by multicore processors will come without the additional heat and fan noise that would likely accompany performance increases in single-core processor machines.
8.2 DISADVANTAGES
Maximizing the utilization of the computing resources provided by multi-core
processors requires adjustments both to the operating system (OS) support and to existing
application software.
Also, the ability of multi-core processors to increase application performance
depends on the use of multiple threads within applications.
Integration of a multicore chip drives chip-production yields down, and such chips are more difficult to manage thermally than lower-density single-chip designs.
From an architectural point of view, ultimately, single CPU designs may make
better use of the silicon surface area than multiprocessing cores, so a development
commitment to this architecture may carry the risk of obsolescence.
Using a multicore processor to its full potential is another issue. If programmers don't write applications that take advantage of multiple cores, there is no gain, and in some cases there is a loss of performance. Applications need to be written so that different parts can run concurrently. A multicore processor does not deliver n times the performance of a single-core processor, because of socket-to-socket and core-to-core scalability losses.
9. CONCLUSION
In the coming years the trend will move more and more towards multicore processors. The main reason is that they are faster than single-core processors and can still be improved, although they introduce interesting new problems. In the future there will still be some applications for single-core processors, because not every system needs a fast processor. Several new multicore chips are in their design phases, and parallel programming techniques are likely to gain importance.
REFERENCES
[1] L. Hammond, B. A. Nayfeh, K. Olukotun, "A Single-Chip Multiprocessor," IEEE, Sept. 1997.
[2] P. Frost Gorder, "Multicore Processors for Science and Engineering," IEEE CS, March/April 2007.
[3] D. Geer, "Chip Makers Turn to Multicore Processors," Computer, IEEE Computer Society, May 2005.
[4] R. Merritt, "CPU Designers Debate Multi-core Future," EETimes Online, February 2008, http://www.eetimes.com/showArticle.jhtml?articleID=206105179
[5] R. Merritt, "X86 Cuts to the Cores," EETimes Online, September 2007, http://www.eetimes.com/showArticle.jtml?articleID=202100022