Chapter 5 General Architecture Issues: Real Computers
Page 1: Chapter 5 General Architecture Issues: Real Computers.

Chapter 5

General Architecture Issues:

Real Computers

Page 2.

5.1 The limitations of a Virtual Machine

● The JVM has a very simple and easily understandable architecture, but it ignores some of the real-world limitations of actual computer chips.

● On a physical computer, there is normally only one CPU and one bank of main memory, which means that two functions running at the same time inside the CPU might compete for registers, memory storage, and so forth.

Page 3.

5.1 The limitations of a Virtual Machine

● Machine capacity: The PowerPC has only 32 registers; the Windows (Pentium) PC has even fewer.

Page 4.

5.2.1 Building a Better Mousetrap

● Increasing the word size of the computer will increase the overall performance numbers. Increasing the clock speed should result in increased machine performance. In practical terms, this is rarely effective.

● Almost all machines today are 32 bits.

● Increasing to a 64-bit register would let the programmer do operations involving numbers in the quadrillions more quickly—but how often do you need a quadrillion of anything?

● Making a faster CPU chip might not help if the CPU now can process data faster than the memory and bus can deliver it.

Page 5.

5.2.1 Building a Better Mousetrap

● Increasing performance this way is expensive and difficult. Performance improvements can be made within the same general technological framework.

Page 6.

5.2.2 Multiprocessing

● One fundamental way to make computers more useful is to allow them to run more than one program at a time: CPU time-sharing.

● You get to use the equipment for one time slice. After that slice is done, someone else's program comes in.

● The computer must be prepared to stop the program at any point, copy all the program-relevant information (the state of the stack, local variables, the current program counter, etc.) into main memory somewhere, then load another program's relevant information from a different area.
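The save-and-restore sequence described above can be sketched in a few lines of Python. This is a minimal illustration, not how a real OS works: the `ProgramContext` fields are limited to the items the slide lists (stack, locals, program counter), and a real kernel saves far more state.

```python
# Minimal sketch of time-sharing context switches. The fields chosen
# (pc, stack, locals) follow the slide's list; real operating systems
# save far more (all registers, flags, memory-mapping state, etc.).

class ProgramContext:
    def __init__(self, name):
        self.name = name
        self.pc = 0          # current program counter
        self.stack = []      # state of the stack
        self.locals = {}     # local variables

def context_switch(cpu, saved_contexts, next_name):
    """Copy the running program's state out, load another's in."""
    saved_contexts[cpu["running"].name] = cpu["running"]  # save old state
    cpu["running"] = saved_contexts.pop(next_name)        # restore new state
    return cpu["running"].name

# Two programs sharing one CPU
a, b = ProgramContext("A"), ProgramContext("B")
cpu = {"running": a}
waiting = {"B": b}

a.pc = 42                       # program A has made some progress
context_switch(cpu, waiting, "B")
print(cpu["running"].name)      # B now owns the CPU
context_switch(cpu, waiting, "A")
print(cpu["running"].pc)        # A resumes exactly where it stopped: 42
```

Because the program counter is part of the saved context, program A resumes exactly where it left off, which is the whole point of the scheme.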

Page 7.

5.2.2 Multiprocessing

● As long as the time slices are kept separate and the memory areas are kept separate, the computer appears to be running several different programs at once.

● Each separate program needs to be able to run independently of the others, and each separate program needs to be prevented from influencing other programs.

● The computer needs a programmatic way to swap user programs in and out of the CPU at appropriate times.

Page 8.

5.2.2 Multiprocessing

● The operating system's primary job is to act as a control program and enforcer of the security rules.

● The operating system is granted privileges and powers, including the ability to interrupt a running program, the ability to write to an area of memory irrespective of the program using it, and so forth.

● These powers are often formalized as programming models and define the difference between supervisor-level and user-level privileges and capacities.

Page 9.

5.2.3 Instruction set Optimization

● A particular instruction that occurs very frequently might be “tuned” in hardware to run faster than the rest of the instruction set would lead you to expect. For example, iload_0 is shorter (one byte vs. two) and faster than the equivalent iload 0.

● On a multiprogramming system, “save all local variables to main memory” might be a commonly performed action.

Page 10.

5.2.3 Instruction set Optimization

● Good graphics performance demands a fast way of moving a large block of data directly from memory to the graphics card.

● The ability to perform arithmetic operations on entire blocks of memory (for example, to turn the entire screen orange in a single operation) is part of the basic instruction set of some of the later Intel chips.

Page 11.

5.2.3 Instruction set Optimization

● By permitting parallel operations to proceed at the same time (this kind of parallel operation is called SIMD parallelism, an acronym for “Single Instruction, Multiple Data”), the effective speed of a program can be greatly increased.
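The SIMD idea can be mimicked in plain Python: one logical operation applied to an entire block of data. This is only a sketch of the programming model; real SIMD is done in hardware with vector registers and instructions (for example Intel's SSE/AVX extensions), and the tiny 8-pixel "screen" here is a made-up example.

```python
# SIMD ("Single Instruction, Multiple Data") applies one operation to
# every element of a block of data at once. Real hardware does this
# with vector registers; this sketch only mimics the model in Python.

def simd_fill(block, value):
    """One logical operation that sets every element at once."""
    return [value] * len(block)

def simd_add(block, n):
    """One logical operation applied to every element of the block."""
    return [x + n for x in block]

screen = [0x000000] * 8                  # tiny hypothetical 8-pixel screen
screen = simd_fill(screen, 0xFFA500)     # turn the whole screen orange
print(screen[0] == 0xFFA500 and screen[7] == 0xFFA500)  # True

brightened = simd_add([1, 2, 3, 4], 10)
print(brightened)                        # [11, 12, 13, 14]
```

The speedup comes from the hardware performing all element operations in the same clock cycles, rather than looping over them one at a time.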

Page 12.

5.2.4 Pipelining

● To perform more than one instruction at a time, the CPU has a much more complex, pipelined fetch-execute cycle that allows it to process several different instructions at once.

● Operations can be processed in a sort of assembly-line fashion: instead of putting cars together one at a time, everyone has a single well-defined job, and thousands of cars are put together via tiny steps.

Page 13.

5.2.4 Pipelining

● While part of the CPU is actually executing one instruction, a different part of the CPU can already be fetching a different instruction. By the time the first instruction finishes executing, the next instruction is already there and available to be executed.

● Instruction pre-fetch is where an instruction is fetched before the CPU actually needs it, so it's available at once.

Page 14.

5.2.4 Pipelining

● This doesn't improve the latency—each operation still takes the same amount of time from start to finish—but it can substantially improve the throughput, the number of instructions that can be handled per second by the CPU as a whole.
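The latency/throughput distinction can be checked with a little arithmetic. The numbers below (a 5-stage pipeline, 1 ns per stage) are illustrative assumptions, not figures from any particular chip.

```python
# Latency vs. throughput in a pipeline, using made-up numbers:
# a 5-stage pipeline where every stage takes 1 ns.

stages = 5
stage_time_ns = 1

# Latency: one instruction still passes through all five stages.
latency_ns = stages * stage_time_ns

def unpipelined_total_ns(n):
    """Without pipelining, each instruction runs start-to-finish alone."""
    return n * stages * stage_time_ns

def pipelined_total_ns(n):
    """Once the pipeline is full, one instruction finishes per stage time,
    so n instructions take (stages - 1 + n) stage times overall."""
    return (stages - 1 + n) * stage_time_ns

print(latency_ns)                 # 5  -- unchanged by pipelining
print(unpipelined_total_ns(100))  # 500
print(pipelined_total_ns(100))    # 104 -- nearly a 5x throughput gain
```

For long instruction streams the pipelined time approaches one instruction per stage time, which is why throughput improves by roughly the number of stages even though each individual instruction is no faster.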

Page 15.

5.2.4 Pipelining

● Unpipelined laundry

Page 16.

5.2.4 Pipelining

● Pipelined laundry

Page 17.

5.2.4 Pipelining

● Fetch stage: The instruction is loaded from main memory.

● Dispatch stage: Analyze what kind of instruction it is.

● Get the source arguments from the appropriate locations.

● Prepare the instruction for actual execution by the third (execute) stage of the pipeline.

● Complete/writeback stage: Transfer the results of the computation to the appropriate registers. Update the overall machine state as necessary.

Page 18.

5.2.4 Pipelining

Page 19.

5.2.4 Pipelining

● A pipeline can only run as fast as its slowest stage. When one of these instructions needs to be executed, it can cause a blockage (sometimes called a “bubble”) in the pipeline as other instructions pile up behind it.

● Each pipeline stage should take the same amount of time.

Page 20.

5.2.4 Pipelining

● “Jump if less than”: Once this instruction has been encountered, the next instruction will come either from the next instruction in sequence, or else from the instruction at the target of the jump—we may not know which.

● The condition depends on the results of a computation somewhere ahead of us in the pipeline, and is therefore unavailable.

Page 21.

5.2.4 Pipelining

● Returns from subroutines create their own problems. The computer may have no choice but to stall the pipeline until it is empty.

● Branch prediction is the art of guessing whether or not the computer will take a given branch (and to where).

● If the guess is wrong, these locations (and the pipeline) are flushed and the computer restarts with an empty pipeline. This is no worse than having to stall the pipeline.

Page 22.

5.2.4 Pipelining

● Since most loops are executed more than once, the branch will be taken many, many times and not taken once. A guess of “take the branch” in this case could be accurate 99.9% of the time without much effort.

● By adapting the amount and kind of information available, engineers have gotten very good (well above 90%) at their guessing—enough to make pipelining a crucial aspect of modern design.

Page 23.

5.2.5 Superscalar Architecture

● Superscalar processing performs multiple different instructions at once.

● Instruction queue: Instead of just loading one instruction at a time, we have a queue of instructions waiting to be processed.

● This is an example of MIMD (Multiple Instruction Multiple Data) parallelism—while one pipeline is performing one instruction (perhaps a floating point multiplication) on a piece of data, another pipeline can be doing an entirely different operation on entirely different data.
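A toy model of the MIMD idea: two independent "pipelines" each execute a different instruction on different data in the same cycle. The pipeline contents and opcode names (fmul, iadd, etc.) are invented for illustration; a real superscalar CPU dispatches from one instruction queue into multiple functional units.

```python
# MIMD sketch: in each "cycle" (one zip step), pipeline A and
# pipeline B execute different instructions on different data.
# The opcodes and operands here are hypothetical examples.

pipeline_a = [("fmul", 2.5, 4.0), ("fadd", 1.0, 2.0)]   # floating point unit
pipeline_b = [("iadd", 3, 4), ("isub", 10, 7)]          # integer unit

ops = {
    "fmul": lambda x, y: x * y,
    "fadd": lambda x, y: x + y,
    "iadd": lambda x, y: x + y,
    "isub": lambda x, y: x - y,
}

results = []
for (op_a, xa, ya), (op_b, xb, yb) in zip(pipeline_a, pipeline_b):
    # Both instructions "execute" in the same cycle, on separate units.
    results.append((ops[op_a](xa, ya), ops[op_b](xb, yb)))

print(results)  # [(10.0, 7), (3.0, 3)]
```

Contrast with SIMD: here each unit runs a *different* instruction, so two instructions retire per cycle instead of one.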

Page 24.

5.3 Optimizing Memory

● Data the computer needs should be available as quickly as possible.

● The memory should be protected from accidental re-writing.

Page 25.

5.3.1 Cache Memory

● With a 32-bit word size, each register can hold 2^32 different values. This allows up to about four gigabytes of memory to be used.

● A program generally uses only a small fraction of memory at any given instant.

● Because speed is valuable, the fastest memory chips also cost the most. Most real computers use a multi-level memory structure.

● CPUs run at 2 or 3 gigahertz, but most memory chips are substantially slower—as much as four hundred times slower than the CPU.

Page 26.

5.3.1 Cache Memory

● Cache memory: Frequently and recently used memory locations are copied into cache memory so that they are available more quickly when the CPU needs them.

● Level one (L1) cache is built into the CPU chip itself and runs at CPU speed.

● Level two (L2) cache is a special set of high-speed memory chips placed next to the CPU on the motherboard.
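The multi-level lookup can be sketched as a chain of dictionaries: try L1, then L2, then main memory, filling the caches on a miss. The relative access costs (1, 10, and 400 cycles) are illustrative assumptions in the spirit of the "hundreds of times slower" figure above, not measurements of any real chip.

```python
# Sketch of a multi-level memory lookup. Access costs are assumed
# relative cycle counts, not real measurements. Eviction policy and
# cache lines are deliberately omitted to keep the model minimal.

L1_TIME, L2_TIME, RAM_TIME = 1, 10, 400

def access(address, l1, l2, ram):
    """Return (value, cost): try L1, then L2, then main memory,
    copying the value into the caches on a miss."""
    if address in l1:
        return l1[address], L1_TIME
    if address in l2:
        l1[address] = l2[address]           # promote into L1
        return l1[address], L1_TIME + L2_TIME
    value = ram[address]
    l2[address] = value                     # fill both cache levels
    l1[address] = value
    return value, L1_TIME + L2_TIME + RAM_TIME

ram = {0x1000: 42}
l1, l2 = {}, {}
first = access(0x1000, l1, l2, ram)
second = access(0x1000, l1, l2, ram)
print(first)   # (42, 411)  first touch misses everywhere
print(second)  # (42, 1)    now it hits in L1
```

The second access is two orders of magnitude cheaper, which is why caching pays off whenever programs reuse the same locations—as they usually do.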

Page 27.

5.3.2 Memory Management

● Rather than referring to specific physical locations in memory, the program refers to a particular logical address, which is reinterpreted by the memory manager to a particular physical location—or possibly even to one on the hard disk.

● Many computers provide hardware support for memory management in the interest of speed, portability, and security.

Page 28.

5.3.2 Memory Management

● User-level programs can just assume that logical addresses are identical to physical addresses, and that any bit pattern of appropriate length represents a memory location somewhere in physical memory, even if the actual physical memory is considerably larger or smaller than the logical address space.

● Under the hood is a sophisticated way of converting logical memory addresses into appropriate physical addresses.

Page 29.

5.3.3 Direct Address Translation

● Direct address translation occurs when hardware address translation has been turned off (only the supervisor can do this). Only 4 GB of memory can be accessed.

● This is done only in the interests of speed, on a special-purpose computer expected to be running only one program at once.

Page 30.

5.3.4 Page Address Translation

● Virtual address space: We could define a set of 24-bit segment registers to extend the address value.

● The top four bits of the logical address select a particular segment register.

● The value stored in this register defines a particular virtual segment identifier (VSID) of 24 bits.

● The virtual address is obtained by concatenating the 24-bit VSID with the lower 28 bits of the logical address.

● This creates a new 52-bit address.

Page 31.

5.3.4 Page Address Translation

● Example: For the 32-bit logical address 0x13572468, the top four bits select segment register #1, which holds the VSID 0xAAAAAA.

● The resulting 52-bit virtual address is 0xAAAAAA3572468.

● Two different programs accessing the same logical location would nevertheless get two separate VSIDs, and hence two different virtual addresses.

Page 32.

5.3.4 Page Address Translation

● This gives 2^52 bytes of memory (4 million gigabytes, or 4 petabytes).

● Physical memory is divided into pages of 4096 (2^12) bytes each.

● Each 52-bit virtual address can be thought of as a 40-bit page identifier plus a 12-bit offset within the page.

● The computer stores a set of “page tables,” in essence a hash table that stores the physical location of each page as a 20-bit number.

● The 40-bit page identifier is thus converted, via a table lookup, to a 20-bit physical page address.

● The 32-bit physical address is the page address plus the offset.
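The whole translation can be followed end to end in a short Python sketch, using the slide's own numbers (logical address 0x13572468 and segment value 0xAAAAAA). The page-table entry 0x12345 is an assumption chosen only to complete the example; a real page table is maintained by the operating system.

```python
# Worked example of the segment + page address translation described
# above. Segment register contents and the page-table entry (0x12345)
# are assumed values for illustration.

def translate(logical, segment_registers, page_table):
    seg_num = logical >> 28                           # top 4 bits pick a register
    vsid = segment_registers[seg_num]                 # 24-bit virtual segment id
    virtual = (vsid << 28) | (logical & 0x0FFFFFFF)   # 52-bit virtual address
    offset = virtual & 0xFFF                          # low 12 bits (4096-byte page)
    vpage = virtual >> 12                             # 40-bit page identifier
    ppage = page_table[vpage]                         # 20-bit physical page number
    return (ppage << 12) | offset                     # 32-bit physical address

segment_registers = {0x1: 0xAAAAAA}                   # register #1 holds the VSID

# Step 1: build the 52-bit virtual address, matching the slide's example.
virtual = (0xAAAAAA << 28) | (0x13572468 & 0x0FFFFFFF)
print(hex(virtual))                                   # 0xaaaaaa3572468

# Step 2: look up the page and reattach the offset.
page_table = {virtual >> 12: 0x12345}                 # hypothetical entry
physical = translate(0x13572468, segment_registers, page_table)
print(hex(physical))                                  # 0x12345468
```

Note how the 12-bit offset (0x468) passes through translation untouched; only the page identifier is remapped.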

Page 33.

5.3.4 Page Address Translation

Page 34.

5.4.1 The Problem with busy-waiting.

● To get the best performance out of peripherals, they should not be permitted to prevent the CPU from doing other useful work.

● A good human typist can type at about 120 words per minute. A 1 GHz computer can add 100,000,000 numbers together between two keystrokes.

● Polling checks at periodic intervals to see if anything useful has happened.

Page 35.

5.4.1 The Problem with busy-waiting.

While (no key is pressed)
    Wait a little bit
Figure out what the key was and do something

Page 36.

5.4.1 The Problem with busy-waiting.

● Polling (or busy-waiting) is an inefficient use of the CPU, because the computer is kept “busy” waiting for the key to be pressed and can't do anything else useful.
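The waste is easy to quantify with a sketch of the polling loop above. The figure of one million polls per keystroke is an illustrative assumption (in the spirit of the typist arithmetic earlier), not a measurement.

```python
# Sketch of why busy-waiting is wasteful: count how many times the CPU
# checks for a key that only arrives on the millionth poll. The arrival
# time is an assumed number chosen for illustration.

def busy_wait(key_arrives_at_poll):
    polls = 0
    key_pressed = False
    while not key_pressed:
        polls += 1                                # "wait a little bit" -- wasted work
        key_pressed = (polls == key_arrives_at_poll)
    return polls

print(busy_wait(1_000_000))  # 1000000 checks to catch one keystroke
```

Every poll except the last finds nothing, which is exactly the CPU time the interrupt mechanism in the next section reclaims for useful work.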

Page 37.

5.4.2 Interrupt Handling

● Set up a procedure to follow when the event occurs, and then do whatever else needs doing in the meantime. When the event happens, the CPU interrupts the current task to deal with the event using the previously established procedure.

● The CPU establishes several different kinds of interrupt signals that are generated under pre-established circumstances, such as the press of a key.

Page 38.

5.4.2 Interrupt Handling

● The normal fetch-execute cycle is changed slightly. Instead of loading and executing the “next” instruction, the CPU will consult a table of interrupts. Control is then transferred to that location, and the special interrupt handler will be executed to do whatever is needful.

● At the end of the interrupt handler, the computer will return to the main task at hand.

Page 39.

5.4.2 Interrupt Handling

● The possible interrupts for a given chip are numbered from zero to a small value like 10.

● These numbers also correspond to locations programmed into the interrupt vector—when interrupt number 0 occurs, the CPU will jump to location 0x00 and execute whatever code is stored there.

● Usually all that is stored in the actual interrupt location itself is a single JMP instruction to transfer control (still inside the interrupt handler) to a larger block of code that does the real work.
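An interrupt vector is, at heart, a table mapping interrupt numbers to handler code, which can be sketched in a few lines. The handler bodies and the assignment of numbers to devices are hypothetical stand-ins for the JMP-to-real-code pattern just described.

```python
# Sketch of an interrupt vector: a table from interrupt number to
# handler. Handler bodies and interrupt-number assignments here are
# hypothetical; real vectors hold machine addresses, not functions.

log = []

def keyboard_handler():
    log.append("handle keypress")

def timer_handler():
    log.append("swap program contexts")

interrupt_vector = {0: timer_handler, 1: keyboard_handler}

def raise_interrupt(number):
    # The CPU looks up the vector entry, transfers control there,
    # and returns to the interrupted task when the handler finishes.
    interrupt_vector[number]()

raise_interrupt(0)
raise_interrupt(1)
print(log)  # ['swap program contexts', 'handle keypress']
```

The indirection is the key design point: the hardware only needs to know a fixed small table of locations, while the code each entry jumps to can be as large as the OS likes.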

Page 40.

5.4.2 Interrupt Handling

● The interrupt-handling mechanism can also handle system-internal events. For example, the time-sharing aspect of the CPU can be controlled by setting an internal timer.

● When the timer expires, an interrupt will be generated, causing the machine, first, to switch from user to supervisor mode, and second, to branch to an interrupt handler that swaps the programming context for the current program out and the context for the next program in.

● The timer can then be reset and computation resumed for the new program.

Page 41.

5.4.3 Communicating with the peripherals: using the Bus

● Data must move between the CPU, memory, and peripherals using one or more buses, and you would like this to be as fast as possible.

● A bus is usually just a set of wires, and so connects all the components together at the same time: every peripheral gets the same message at the same time.

● Only one device can be using the bus at once. Using a bus successfully requires discipline from all parties involved.

Page 42.

5.4.3 Communicating with the peripherals: using the Bus

● A typical bus protocol might involve the CPU sending a START and then an identifier for a particular device.

● Only the specific device will respond with some sort of ACKNOWLEDGE message.

● All other devices have been warned by this START message not to attempt to communicate until the CPU finishes and sends a similar STOP message.
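The START/ACKNOWLEDGE/STOP discipline can be sketched as a broadcast to a list of devices, where only the addressed one answers. The device names and the class layout are invented for illustration; the message names are the slide's own.

```python
# Sketch of the START / ACKNOWLEDGE / STOP bus discipline described
# above. Every device sees every message (a bus is shared wires);
# only the addressed device answers, and the rest stay quiet until
# STOP. Device names here are hypothetical examples.

class Device:
    def __init__(self, device_id):
        self.device_id = device_id
        self.selected = False

    def on_message(self, msg, target):
        if msg == "START":
            self.selected = (target == self.device_id)
            return "ACKNOWLEDGE" if self.selected else None
        if msg == "STOP":
            self.selected = False        # transaction over, bus is free
        return None

def broadcast(devices, msg, target=None):
    """A bus delivers the same message to every device at once."""
    replies = [d.on_message(msg, target) for d in devices]
    return [r for r in replies if r is not None]

devices = [Device("disk"), Device("keyboard"), Device("printer")]
reply = broadcast(devices, "START", "keyboard")
print(reply)                     # ['ACKNOWLEDGE'] -- exactly one responder
broadcast(devices, "STOP")       # everyone deselects; the bus is free again
```

Exactly one ACKNOWLEDGE per START is the invariant the protocol exists to guarantee: it is how a shared set of wires avoids two devices driving it at once.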

Page 43.

5.5 Chapter Review

● The JVM, as a virtual machine, is freed from some practical limitations of real computers.

● Engineers have found many techniques to squeeze better performance out of their chips.

● One way to get better user-level performance is by improving the basic numbers of the chip, but this is usually a difficult and expensive process.

Page 44.

5.5 Chapter Review

● Another way to improve the performance of the system is to allow it to run more than one program at a time.

● Engineers create special-purpose instructions and hardware specifically to support those programs.

● Performance can also be increased by parallelism. We can distinguish SIMD parallelism from MIMD parallelism in terms of the flexibility of what kind of instructions can be simultaneously executed.

Page 45.

5.5 Chapter Review

● Pipelining is a technique where the fetch/execute cycle is broken down into several stages, each of which is independently executed.

● Superscalar architecture provides another way to speed up processing by doing the same thing several times over.

● Memory access times can be improved by using cache memory.

Page 46.

5.5 Chapter Review

● Virtual memory and paging can provide computers with access to greater amounts of memory more quickly and securely.

● The use of interrupts can give substantial performance increases when using peripherals.

● A suitable design of a bus protocol can speed up how fast data moves around the computer.