Computer Architecture
Clip 7: Architectural Parallelism
School of Computer Science & Communications
P. Ienne (charts), Ph. Janson (commentary)
Information, Computing & Communication
ICC Module 3 Lesson 1: Computer Architecture (2015, Ph. Janson)

Commentary: This video clip is part of the E.P.F.L. introductory course on Information, Computing, and Communication. It is the seventh and last one in a set of video clips on computer architecture.
Outline

Clip 0: Introduction
Clip 1: Software technology
  Assembler language
  Algorithms
  Registers
  Data instructions
  Instruction numbering
  Control instructions
Clip 2: Hardware architecture
  Von Neumann's stored-program computer architecture
  Data storage and processing
  Control storage and processing
Clip 3: Hardware design
  Instruction encoding
  Hardware implementation
  Transistor technology
Clip 4: Computing circuits
Clip 5: Memory circuits
  Hardware performance
Clip 6: Logic parallelism
Clip 7: Architecture parallelism
Commentary: Like the previous video clips, this one gives another example of how computer performance can be gained, in this case by architectural rather than logic tricks.

Two simple examples of performance increase:
- At the circuit level: reducing the delay of an adder
- At the processor structure level: increasing the throughput of instructions (this clip)

How can one increase performance beyond transistor speed?
- Reduce delay: the time spent waiting to get a result
- Increase throughput: the number of results per time unit
Commentary: The performance of anything can always be increased in one of two ways:
- either by reducing the delay, the time it takes to do something,
- or by increasing the throughput, the number of things done per unit of time.
The previous clip gave an example of computer performance gain through reduced delay of logic operations. This clip gives an example of computer performance gain through increased throughput at the architectural level.
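The two levers can be made concrete with a toy calculation in Python. All numbers here are illustrative, not taken from the clip:

```python
# Illustrative numbers only: two ways to get more results per second
# out of a unit that takes 4 ns per operation.

delay_ns = 4.0
base_throughput = 1e9 / delay_ns      # results per second, one at a time

# Lever 1: reduce the delay (the logic trick of the previous clip),
# e.g. a faster adder that finishes in 2 ns.
faster_throughput = 1e9 / 2.0

# Lever 2: keep the delay, but run two units in parallel (this clip).
parallel_throughput = 2 * base_throughput

print(base_throughput, faster_throughput, parallel_throughput)
```

Both levers double the rate in this toy setting; the difference is that the second one pays in extra hardware rather than in faster circuits.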
Our processor

103: load r1, 0
104: load r2, -21
105: add  r3, r7, r4
106: mult r2, r5, r9
107: sub  r8, r7, r9
108: load r9, r4
109: add  r3, r2, r1
110: sub  r5, r3, r4
111: load r2, r3
112: add  r1, r2, -1
113: add  r8, r1, -1
114: div  r4, r1, r7
115: load r2, r4

The arithmetic unit normally executes one instruction at a time. Can we do better?
Commentary: A simple Von Neumann processor, such as the one discussed in the preceding video clips, normally executes instructions one at a time, as shown here.

Doubling the throughput of our processor

With two arithmetic units, we could imagine executing two instructions at a time! Do you see the problem?!
Commentary: One can however imagine an enhanced processor in which more transistors are invested to build a second arithmetic unit, so as to execute two operations at a time in parallel as shown here, thus doubling the throughput of the system. But do you then foresee a problem?

Doubling the throughput of our processor

add r3, r2, r1
add r5, r3, r4

The problem is that the 2nd instruction needs a value computed by the 1st instruction! Unless one is careful, the result will be wrong!
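This kind of dependency can be checked mechanically. Below is a minimal Python sketch, not from the clip, assuming the destination-first assembler format used on the slides (i.e. `add r3, r2, r1` writes r3 and reads r2 and r1):

```python
# Sketch of read-after-write dependency detection between two
# instructions that a dual-issue processor would like to pair.
# Assumes the slides' format: first operand is the destination.

def writes(instr):
    """Register written by an instruction, e.g. 'add r3, r2, r1' -> 'r3'."""
    op, operands = instr.split(None, 1)
    return operands.split(",")[0].strip()

def reads(instr):
    """Registers read by an instruction (source operands only)."""
    op, operands = instr.split(None, 1)
    sources = [s.strip() for s in operands.split(",")[1:]]
    return [s for s in sources if s.startswith("r")]   # skip constants

def depends_on(second, first):
    """True if `second` needs a value computed by `first`."""
    return writes(first) in reads(second)

# The pair from the slide: the 2nd add reads r3, which the 1st writes.
print(depends_on("add r5, r3, r4", "add r3, r2, r1"))   # True
print(depends_on("mult r2, r5, r9", "add r3, r7, r4"))  # False
```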
Commentary: Do you now see the problem? There is a sequencing dependency between these two instructions: the second one follows the first one in the originally sequential program for the very good reason that it needs a result from the first one. Unless one pays attention to that, the computation will produce wrong results, possibly crash the computer, or worse, cause something to break beyond the computer.

Doubling the throughput of our processor

add  r3, r2, r1    NOTHING
sub  r5, r3, r4    load r2, r3
add  r1, r2, -1    NOTHING
add  r8, r1, -1    div  r4, r1, r7

In practice one executes between one and two instructions at a time, and then the result is correct.
Commentary: The solution requires an intelligent arbiter between the two arithmetic units that detects when the second instruction depends on a result of the first one, and in those cases executes them one after the other, as in a basic computer with a single arithmetic unit. Thus in practice one cannot quite double the throughput of the computer: one can only approach a doubling, while conceding a small waste every time the computer hits a sequential dependency.
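The arbiter's behaviour can be sketched in a few lines of Python. This is a deliberate simplification under two assumptions: the slides' destination-first assembler format, and only the read-after-write dependency between the two instructions of a pair is checked (a real arbiter must handle more cases):

```python
# Sketch of the "intelligent arbiter": try to issue instructions two at
# a time, but fall back to single issue (leaving the second unit idle,
# shown as "NOTHING") whenever the second instruction of a pair reads a
# register that the first one writes.

def dest(instr):
    """Destination register, e.g. 'add r3, r2, r1' -> 'r3'."""
    return instr.split(None, 1)[1].split(",")[0].strip()

def sources(instr):
    """Source registers; constants such as 0 or -1 are skipped."""
    ops = [s.strip() for s in instr.split(None, 1)[1].split(",")[1:]]
    return [s for s in ops if s.startswith("r")]

def dual_issue(program):
    """Group a sequential program into issue cycles of one or two slots."""
    cycles, i = [], 0
    while i < len(program):
        if i + 1 < len(program) and dest(program[i]) not in sources(program[i + 1]):
            cycles.append((program[i], program[i + 1]))  # both units busy
            i += 2
        else:
            cycles.append((program[i], "NOTHING"))       # second unit idle
            i += 1
    return cycles

# The six instructions shown on the slide.
program = [
    "add r3, r2, r1",
    "sub r5, r3, r4",   # needs r3 from the previous add
    "load r2, r3",
    "add r1, r2, -1",   # needs r2 from the previous load
    "add r8, r1, -1",   # needs r1 from the previous add
    "div r4, r1, r7",
]
for pair in dual_issue(program):
    print(pair)
```

Six instructions complete in four cycles with two idle slots: between one and two instructions per cycle, as the slide says.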
A superscalar processor

Four arithmetic units, a register bank, and dependency detection.

All modern processors, for portable computers as well as servers, include this. In addition, they reorder and execute instructions before knowing whether they need to be executed (for instance after an instruction such as jump_lte).
Commentary: So-called superscalar processors can be built with two, four, or even more parallel units to increase their throughput and thus speed up their computation. All modern computers do this. In addition, since they cannot know in advance what the outcome of a conditional jump might be, when they hit one they perform in parallel the testing of the condition as well as BOTH of the instructions that could follow the test, whether it comes out true or false. After the test and both instructions have been executed, the result of the instruction not corresponding to the actual outcome of the test is thrown away, as if it had never been computed.
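The speculation idea can be caricatured as follows. All names here, and the dictionary used as register state, are illustrative, not a real microarchitecture:

```python
# Sketch of branch speculation: while the condition of a jump is still
# being evaluated, execute BOTH possible continuations on copies of the
# state, then keep only the result of the path actually taken and throw
# the other one away.

def speculate(condition_fn, taken_fn, not_taken_fn, state):
    # Execute both candidate continuations as if the outcome were known...
    taken_result = taken_fn(dict(state))
    not_taken_result = not_taken_fn(dict(state))
    # ...then resolve the condition and discard the wrong result.
    return taken_result if condition_fn(state) else not_taken_result

state = {"r1": 5, "r2": 7}
result = speculate(
    condition_fn=lambda s: s["r1"] <= s["r2"],            # e.g. jump_lte
    taken_fn=lambda s: {**s, "r3": s["r1"] + s["r2"]},    # path if taken
    not_taken_fn=lambda s: {**s, "r3": s["r1"] - s["r2"]},  # path if not
    state=state,
)
print(result["r3"])  # 12: r1 <= r2 holds, so the "taken" result is kept
```

Note that the discarded computation is exactly the wasted power the commentary mentions: it was executed, then thrown away.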
Commentary (continued): As in the case of investing more transistors and power consumption to achieve faster operations, achieving higher throughput with more arithmetic units also costs more transistors and of course consumes more power, some of which is even wasted.

Performance engineering (2)

One can modify the structure of a system to execute programs faster.
One can add resources to processors to make them faster.
Or one can use simpler processors to spare energy.

This is an example of computer architecture, which is another branch of computer engineering.
Commentary: In conclusion of this last clip: one can modify the architecture of a system to increase its throughput, at the cost of additional transistors and power consumption. On the other hand, one can also build simpler processor structures to save power. These techniques belong to computer architecture, which is another branch of computer engineering.