Advanced Computer Organization

Seminar on: ADVANCED COMPUTER ARCHITECTURE

GUIDED BY: PROF. S. R. WANKHADE

DELIVERED BY: DHARANA PODDAR (13007057), PRANOTI TARAPURE (13007070), SHIVANI BORGAONKAR (13007071)

ADVANCED COMPUTER ARCHITECTURE

Block Diagram

AMD MULTICORE OPTERON

•Opteron is AMD's x86 server and workstation processor line, and was the first processor which supported the AMD64 instruction set architecture.

•In April 2005, AMD introduced its first multi-core Opterons. At the time, AMD's use of the term multi-core in practice meant dual-core; each physical Opteron chip contained two processor cores.

•This effectively doubled the computing performance available to each motherboard processor socket.

•One socket could then deliver the performance of two processors, two sockets could deliver the performance of four processors, and so on.

•Because motherboard costs increase dramatically as the number of CPU sockets increases, multicore CPUs enable a multiprocessing system to be built at lower cost.

•Second-generation Opterons are offered in three series: the 1000 Series (single socket only), the 2000 Series (dual socket-capable), and the 8000 Series (quad or octo socket-capable).

•AMD announced its third-generation quad-core Opteron chips on September 10, 2007.

•The chips were based on a core design codenamed Barcelona, for which new power and thermal management techniques were planned. Earlier dual-core DDR2-based platforms were upgradeable to the quad-core chips.

Quad-core "Barcelona" Opteron

Six-core "Istanbul" Opteron

•The fourth generation was announced in June 2009 with the Istanbul hexa-cores.

•It introduced HT Assist, an additional directory for data location, reducing the overhead for probing and broadcasts. HT Assist uses 1 MB L3 cache per CPU when activated.

ULTRASPARC T1

Introduction:

•Sun Microsystems' UltraSPARC T1 microprocessor, known until its 14 November 2005 announcement by its development codename "Niagara", is a multithreading, multicore CPU.

•Designed to lower the energy consumption of server computers, the CPU typically uses 72 W of power at 1.4 GHz.

•Sun has produced two previous multicore processors (UltraSPARC IV and IV+), but the UltraSPARC T1 is its first microprocessor that is both multicore and multithreaded.

•The processor is available with four, six or eight CPU cores, each core able to handle four threads concurrently. Thus the processor is capable of processing up to 32 threads concurrently.

Cores:

•The UltraSPARC T1 was designed from scratch as a multi-threaded, special-purpose processor, and thus introduces a whole new architecture for obtaining high performance.

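•The T1 cores largely side-step the issue of cache misses by multithreading. Each core is a barrel processor, meaning it switches between available threads each cycle, so a thread that is waiting on a cache miss is simply skipped while the others keep the core busy.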
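The thread switching described above can be sketched as a small simulation. This is only an illustration under stated assumptions: the function name `barrel_schedule`, the `"op"`/`"miss"` instruction labels, and the fixed `miss_latency` are hypothetical, and plain round-robin selection is a simplification rather than the T1's actual thread-selection logic.

```python
def barrel_schedule(programs, miss_latency, cycles):
    """Simulate one core issuing at most one instruction per cycle.

    programs: one list per hardware thread, each entry "op" or "miss"
              (hypothetical labels; "miss" stands for a cache miss).
    Returns, per cycle, the index of the thread that issued (None = idle).
    """
    n = len(programs)
    pc = [0] * n          # next instruction index for each thread
    ready_at = [0] * n    # first cycle at which each thread may issue again
    trace = []
    last = -1             # thread that issued most recently
    for cycle in range(cycles):
        issued = None
        # Scan the threads in round-robin order, starting after the last issuer.
        for step in range(1, n + 1):
            t = (last + step) % n
            if pc[t] < len(programs[t]) and ready_at[t] <= cycle:
                if programs[t][pc[t]] == "miss":
                    # The thread becomes unavailable while its miss is serviced.
                    ready_at[t] = cycle + 1 + miss_latency
                pc[t] += 1
                issued = last = t
                break
        trace.append(issued)
    return trace


four_threads = [["op", "miss", "op"], ["op"] * 3, ["op"] * 3, ["op"] * 3]
print(barrel_schedule(four_threads, miss_latency=5, cycles=12))
print(barrel_schedule([["miss", "op"]], miss_latency=2, cycles=4))
```

With four threads, the core never idles in this run because the other three threads fill the cycles during which thread 0 waits; with a single thread, the same kind of miss leaves idle cycles.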

Physical Characteristics:

•The UltraSPARC T1 contained 279 million transistors and had an area of 378 mm².

•It was fabricated by Texas Instruments (TI) in their 90 nm complementary metal–oxide–semiconductor (CMOS) process with nine levels of copper interconnect.

•Each core has a 16 KB L1 instruction cache and an 8 KB L1 data cache. The L2 cache is 3 MB, and there is no L3 cache.

IBM CELL BROADBAND ENGINE

•The Cell Broadband Engine (Cell BE) processor is the first implementation of the Cell Broadband Engine Architecture (CBEA), developed jointly by Sony, Toshiba, and IBM.

•The Cell BE includes one Power Processing Element (PPE) and eight Synergistic Processing Elements (SPEs).

•The approach taken by the Cell BE design was to focus on improving performance/area and performance/power ratios.

•These goals are largely achieved by using powerful, yet simple cores that use area more efficiently with less power dissipation.

Fig. IBM CBE Block Diagram

The POWER Processing Element:

•The PPE consists of a POWER Processing Unit (PPU) connected to a 512 KB L2 cache.

•The PPE is the main processor of the Cell BE, and is responsible for running the operating system and coordinating the SPEs.

•The key design goals of the PPE are to maximize the performance/power ratio as well as the performance/area ratio.

•The PPE core can fetch four instructions at a time, and issue two.

Synergistic Processing Element:

•The SPE is a modular design consisting of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC).

•The SPU implements a new SIMD instruction set, the SPU Instruction Set Architecture, that is specific to the Broadband Processor Architecture.

•Each SPU is an independent processor with its own program counter and is optimized to run SPE threads spawned by the PPE.

GRAPHICS PROCESSING UNIT:

WHAT IS A GPU?

•It is a processor optimized for 2D/3D graphics, video, visual computing, and display.

•It is a highly parallel, highly multithreaded multiprocessor optimized for visual computing.

•It provides real-time visual interaction with computed objects via graphics, images, and video.

•It serves as both a programmable graphics processor and a scalable parallel computing platform.

•Heterogeneous systems combine a GPU with a CPU.

GPU vs CPU:

•A GPU is tailored for highly parallel operation, while a CPU executes programs serially.

•For this reason, GPUs have many parallel execution units and higher transistor counts, while CPUs have few execution units and higher clock speeds.

•A GPU is for the most part deterministic in its operation (though this is quickly changing).

•GPUs have much deeper pipelines (several thousand stages vs 10-20 for CPUs).

•GPUs have significantly faster and more advanced memory interfaces, as they need to shift around a lot more data than CPUs.

THE GPU PIPELINE

•The GPU receives geometry information from the CPU as an input and provides a picture as an output.

•Let’s see how that happens.

host interface → vertex processing → triangle setup → pixel processing → memory interface

HOST INTERFACE

•The host interface is the communication bridge between the CPU and the GPU.

•It receives commands from the CPU and also pulls geometry information from system memory.

•It outputs a stream of vertices in object space with all their associated information (normals, texture coordinates, per-vertex color, etc.).

VERTEX PROCESSING

•The vertex processing stage receives vertices from the host interface in object space and outputs them in screen space.

•This may be a simple linear transformation, or a complex operation involving morphing effects.

•Normals, texcoords, etc. are also transformed.

•No new vertices are created in this stage, and no vertices are discarded (input/output has 1:1 mapping).
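For the simple linear case, the object-space to screen-space step can be sketched as follows. The sketch assumes a single combined 4x4 matrix (called `mvp` here), a perspective divide, and a viewport with the origin at the top-left; all function and parameter names are illustrative and not taken from any particular GPU or graphics API.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def to_screen(vertex, mvp, width, height):
    """Transform one object-space vertex (x, y, z) to screen space."""
    x, y, z, w = mat_vec(mvp, [vertex[0], vertex[1], vertex[2], 1.0])
    # Perspective divide: clip space -> normalized device coordinates in [-1, 1].
    ndc_x, ndc_y, ndc_z = x / w, y / w, z / w
    # Viewport mapping (assumed convention: y grows downward on screen).
    screen_x = (ndc_x + 1.0) * 0.5 * width
    screen_y = (1.0 - ndc_y) * 0.5 * height
    return (screen_x, screen_y, ndc_z)


def process_vertices(vertices, mvp, width, height):
    """One output vertex per input vertex: nothing is created or discarded."""
    return [to_screen(v, mvp, width, height) for v in vertices]


identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
print(process_vertices([(0, 0, 0), (-1, 1, 0), (1, -1, 0.5)], identity, 640, 480))
```

The list comprehension in `process_vertices` mirrors the 1:1 mapping noted on the slide: the output always has exactly as many vertices as the input.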

TRIANGLE SETUP

•In this stage geometry information becomes raster information (screen-space geometry is the input, pixels are the output).

•Prior to rasterization, triangles that are back-facing or are located outside the viewing frustum are rejected.

•Some GPUs also do some hidden surface removal at this stage.
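A minimal sketch of the rejection step, working on 2D screen-space triangles. Two assumptions are made for illustration: a positive signed area is treated as front-facing (the actual winding convention is configurable in real pipelines), and the frustum test is reduced to a bounding-box check against the viewport rather than true clip-space clipping. The function names are hypothetical.

```python
def signed_area(a, b, c):
    """Signed area of triangle abc; the sign encodes the winding order."""
    return 0.5 * ((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))


def is_backfacing(tri):
    # Assumed convention: positive area = front-facing.
    # Zero-area (degenerate) triangles are rejected as well.
    return signed_area(*tri) <= 0


def is_offscreen(tri, width, height):
    """True if the triangle's bounding box lies entirely outside the viewport."""
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    return max(xs) < 0 or min(xs) >= width or max(ys) < 0 or min(ys) >= height


def cull(triangles, width, height):
    """Keep only the triangles that should go on to rasterization."""
    return [t for t in triangles
            if not is_backfacing(t) and not is_offscreen(t, width, height)]


front = ((0, 0), (10, 0), (0, 10))
back = ((0, 0), (0, 10), (10, 0))            # same triangle, opposite winding
outside = ((700, 0), (710, 0), (700, 10))    # beyond a 640-pixel-wide viewport
print(cull([front, back, outside], 640, 480))
```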

TRIANGLE SETUP (CONT)

•A fragment is generated if and only if its center is inside the triangle.

•Every fragment generated has its attributes computed to be the perspective correct interpolation of the three vertices that make up the triangle
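Both rules above can be sketched with edge functions. The sketch assumes each vertex is given as `(x, y, w, attr)`, where `w` is the vertex's clip-space w and `attr` is a single scalar attribute; these names and the layout are illustrative. Pixel centers that fall exactly on an edge are accepted here, whereas real hardware applies a tie-breaking fill rule.

```python
def edge(a, b, p):
    """Edge function: twice the signed area of triangle (a, b, p)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(v0, v1, v2, width, height):
    """Return (px, py, attr) for every pixel whose center is inside the triangle."""
    area = edge(v0, v1, v2)
    if area == 0:
        return []  # degenerate triangle covers no pixels
    fragments = []
    for py in range(height):
        for px in range(width):
            center = (px + 0.5, py + 0.5)
            # Barycentric weights of the pixel center (sum to 1).
            b0 = edge(v1, v2, center) / area
            b1 = edge(v2, v0, center) / area
            b2 = edge(v0, v1, center) / area
            if b0 < 0 or b1 < 0 or b2 < 0:
                continue  # center lies outside: no fragment
            # Perspective-correct interpolation: interpolate attr/w and 1/w
            # linearly in screen space, then divide.
            inv_w = b0 / v0[2] + b1 / v1[2] + b2 / v2[2]
            attr_over_w = (b0 * v0[3] / v0[2]
                           + b1 * v1[3] / v1[2]
                           + b2 * v2[3] / v2[2])
            fragments.append((px, py, attr_over_w / inv_w))
    return fragments


flat = rasterize((0, 0, 1, 0.0), (4, 0, 1, 1.0), (0, 4, 1, 0.0), 4, 4)
deep = rasterize((0, 0, 1, 0.0), (4, 0, 2, 1.0), (0, 4, 1, 0.0), 4, 4)
print(len(flat), flat[0], deep[0])
```

When all three `w` values are equal, the result reduces to plain linear interpolation; giving one vertex a larger `w` (farther from the viewer) reduces its influence on nearby fragments, which is the effect the slide calls perspective correct.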

FRAGMENT PROCESSING

•Each fragment provided by triangle setup is fed into fragment processing as a set of attributes (position, normal, texcoord, etc.), which are used to compute the final color for this pixel.

•The computations taking place here include texture mapping and math operations.

•This stage is typically the bottleneck in modern applications.
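As a rough illustration of "texture mapping and math operations", the sketch below combines a nearest-neighbor texture lookup with a simple diffuse lighting term. The fragment layout (a dict with `normal` and `texcoord`), the nearest-neighbor filter, and the lighting model are all assumptions chosen for brevity; real fragment programs are far more varied.

```python
def sample_nearest(texture, u, v):
    """Nearest-neighbor lookup in a texture stored as rows of (r, g, b) texels."""
    height, width = len(texture), len(texture[0])
    u = min(max(u, 0.0), 1.0)  # clamp coordinates to [0, 1]
    v = min(max(v, 0.0), 1.0)
    x = min(int(u * width), width - 1)
    y = min(int(v * height), height - 1)
    return texture[y][x]


def shade_fragment(fragment, texture, light_dir):
    """Compute a final color from a fragment's interpolated attributes."""
    nx, ny, nz = fragment["normal"]
    lx, ly, lz = light_dir
    # Math operation: diffuse term, clamped so back-lit surfaces go black.
    diffuse = max(0.0, nx * lx + ny * ly + nz * lz)
    # Texture mapping: fetch the base color at the fragment's texcoord.
    r, g, b = sample_nearest(texture, *fragment["texcoord"])
    return (r * diffuse, g * diffuse, b * diffuse)


texture = [[(1, 0, 0), (0, 1, 0)],
           [(0, 0, 1), (1, 1, 1)]]
fragment = {"normal": (0.0, 0.0, 1.0), "texcoord": (0.75, 0.25)}
print(shade_fragment(fragment, texture, (0.0, 0.0, 1.0)))
```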

MEMORY INTERFACE

•Fragment colors provided by the previous stage are written to the framebuffer.

•This stage used to be the biggest bottleneck before fragment processing took over.

•Before the final write occurs, some fragments are rejected by the z-buffer, stencil, and alpha tests.

•On modern GPUs, z and color are compressed to reduce framebuffer bandwidth (but not size).
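The rejection tests that precede the framebuffer write can be sketched as below. For brevity the sketch covers only the alpha and z-buffer tests (the stencil test is omitted), assumes a "less than" depth comparison with smaller values being closer, and uses hypothetical names throughout.

```python
def write_fragment(color_buf, depth_buf, x, y, depth, rgba, alpha_ref=0.0):
    """Write a fragment to the framebuffer unless a test rejects it.

    Returns True if the fragment was written, False if it was rejected.
    """
    # Alpha test: discard fragments that are not opaque enough.
    if rgba[3] <= alpha_ref:
        return False
    # Z-buffer test (assumed "less" comparison): discard fragments that are
    # not closer than what is already stored at this pixel.
    if depth >= depth_buf[y][x]:
        return False
    depth_buf[y][x] = depth
    color_buf[y][x] = rgba
    return True


width, height = 2, 2
color_buf = [[(0, 0, 0, 0)] * width for _ in range(height)]
depth_buf = [[1.0] * width for _ in range(height)]  # 1.0 = farthest depth

print(write_fragment(color_buf, depth_buf, 0, 0, 0.5, (1, 0, 0, 1)))  # closer: written
print(write_fragment(color_buf, depth_buf, 0, 0, 0.7, (0, 1, 0, 1)))  # hidden: rejected
print(color_buf[0][0])
```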

THANK YOU!!
