Prof. Dr. José Antonio de Frutos Redondo. Academic year 2013-2014
GRADO EN INGENIERÍA DE COMPUTADORES (Degree in Computer Engineering)
Arquitectura e Ingeniería de Computadores (Computer Architecture and Engineering)
- Data parallelism.
- Vector computers.
  - Vector processing.
  - Vector computer architecture.
  - Vector organization.
  - Notable vector computers.
  - Memory systems.
      INITIALIZE I = 1
10    READ B(I)
      READ C(I)
      ADD B(I) + C(I)
      STORE A(I) <= B(I) + C(I)
      READ A(I+1)
      MULTIPLY 2 * A(I+1)
      STORE B(I) <= 2 * A(I+1)
      INCREMENT I <= I + 1
      IF I .LE. N GO TO 10
      STOP
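Read as Fortran-style pseudocode, each iteration computes A(I) = B(I) + C(I) and then B(I) = 2*A(I+1), where A(I+1) still holds its value from before the loop. A direct Python translation (illustrative only; the array contents are made up) makes the sequential semantics explicit:

```python
# Direct, element-by-element translation of the scalar loop above.
# Note that b[i] = 2 * a[i+1] reads a[i+1] BEFORE iteration i+1
# overwrites it, so a naive "compute all of a first" vectorization
# would change the result (an antidependence on a).
b = [1, 2, 3]
c = [10, 20, 30]
a = [100, 200, 300, 400]   # one extra element so a[i+1] exists at i = n-1
n = 3

for i in range(n):
    a[i] = b[i] + c[i]       # STORE A(I) <= B(I)+C(I)
    b[i] = 2 * a[i + 1]      # STORE B(I) <= 2*A(I+1), old value of A(I+1)

print(a)  # [11, 22, 33, 400]
print(b)  # [400, 600, 800]
```

A vectorizing compiler must preserve exactly this ordering of reads and writes when it converts the loop into vector instructions.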
- Vector registers:
  - Hold the vector operands in register-based machines.
  - Do not exist if the machine is memory-memory.
  - Typical vector lengths are 64 or 128 elements.
  - Must have at least two read ports and one write port.
- Vector functional units:
  - Execute the vector operations.
  - Are pipelined and can typically start a new operation every cycle.
  - A control unit watches for dependences.
Elements of the architecture (II)
- Load/store unit:
  - Manages vector transfers to and from memory.
  - May be pipelined.
  - May also handle scalar data.
- Scalar registers:
  - Hold the scalar operands.
  - Used in vector operations and for address computation.
  - Several read and write ports are needed.
- Scalar functional units:
  - May exist for purely scalar operations.
  - May be absent if scalar operations use the vector functional units.
- Solution: the stride is specified in a dedicated register or in a data register. The instruction set can either include specific instructions for strided vector access, or make all vector loads and stores take a stride.
- Example: the vector version of DLX provides the instructions:
  - LVWS V1,R1,R2: loads into V1 from the address in R1, with the stride given by R2.
  - SVWS R1,R2,V1: stores V1 starting at the address in R1, with the stride given by R2.
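The semantics of these strided accesses can be sketched in a few lines of Python (a simulation for illustration only; the names lvws and svws mirror the DLX mnemonics and are not real APIs):

```python
def lvws(memory, base, stride, vlen):
    """Strided vector load: gather vlen elements, `stride` words apart,
    starting at word address `base` (simulates LVWS V1,R1,R2)."""
    return [memory[base + i * stride] for i in range(vlen)]

def svws(memory, base, stride, vector):
    """Strided vector store: scatter `vector` back to memory,
    `stride` words apart (simulates SVWS R1,R2,V1)."""
    for i, x in enumerate(vector):
        memory[base + i * stride] = x

memory = list(range(64))      # word-addressed toy memory
v1 = lvws(memory, 5, 8, 4)    # elements at addresses 5, 13, 21, 29
print(v1)                     # [5, 13, 21, 29]
svws(memory, 0, 8, v1)        # write them back at addresses 0, 8, 16, 24
```

With stride 1 these degenerate into the ordinary (unit-stride) vector load and store.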
- In 1979 an improved version, the CRAY 1S, appeared.
- First computer based on ECL logic.
- Clock period of 12.5 ns (80 MHz).
- 10 functional pipelines, two of them for addresses.
- Software: COS (Cray Operating System) and Fortran 77.
- 133 Mflops.
- 4 CPUs, each equivalent to a CRAY 1, with shared memory.
- Contains a set of shared registers to speed up communication between CPUs.
- Clock period of 8.5 ns (about 118 MHz).
- 840 Mflops.
- UNIX, UNICOS.
- Vector-processor CPU delivering 9 Gflops.
- Up to 8 CPUs and 128 GB of shared memory per SMP node.
- Memory bandwidth of 36 GB/s.
- Scalable to 128 nodes (1024 CPUs, 9.2 Tflops, 16 TB).
- High-performance crossbar network between nodes (8 GB/s bidirectional).
- Scalable I/O architecture with independent I/O processors.
- SUPER-UX, a UNIX-based operating system with features specific to supercomputing.
- Parallel programming environment with compilers able to vectorize and optimize Fortran 95 and C/C++.
- Distributed-memory models and tools for ...
- Machine type: distributed-memory vector multiprocessor.
- Operating system: UXP/V (a Unix variant based on V5.4).
- Interconnect: crossbar.
- Compilers: Fortran 90/VP (vectorizing Fortran 90 compiler), Fortran 90/VPP (parallel vectorizing Fortran 90 compiler), C/VP (vectorizing C compiler), C, C++.
- Clock cycle: 3.3 ns.
- Theoretical peak performance:
  - 9.6 Gflop/s per processor (64 bits).
  - 1.22 Tflop/s maximum (64 bits).
- Main memory:
  - Memory per node: 16 GB.
  - Maximum memory: 2 TB.
  - Number of processors: 4-128.
- Memory bandwidth: 38.4 GB/s.
- Communication bandwidth: 1.6 GB/s.
- In a vector machine, accesses are not to individual data items but to collections of them (vectors).
- The layout of these data in memory usually follows a simple equation. For example, in a matrix stored row by row:
  - reading a row vector means reading consecutive memory locations,
  - reading a column vector means reading locations spaced n apart, where n is the number of elements in a row.
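For a row-major matrix this address equation can be checked directly (illustrative Python, assuming word addressing and a zero base address):

```python
def address(base, row, col, n_cols):
    # Address of A[row][col] in a row-major matrix: base + row*n_cols + col
    return base + row * n_cols + col

n = 4  # elements per row
row_addrs = [address(0, 1, c, n) for c in range(n)]   # a row vector
col_addrs = [address(0, r, 2, n) for r in range(3)]   # a column vector

print(row_addrs)  # [4, 5, 6, 7]  -> consecutive locations, stride 1
print(col_addrs)  # [2, 6, 10]    -> locations spaced n = 4 apart
```

With column-major storage (as in Fortran) the roles are reversed: columns are unit-stride and rows are spaced n apart.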
- The memory system is designed so that:
  - several elements can be accessed at once,
  - the access start-up time is paid only for the first element.
- Effect on the memory system:
  - If the stride is 1, an interleaved memory, or a design with independent banks, gives the best performance.
  - With a non-unit stride, however, optimal performance may not be reached. Example: if the stride is 8 and there are 8 banks, every access goes to the same bank (accesses are serialized).
- For this reason, it is desirable for the stride and the number of banks to be relatively prime.
- Since the stride is a property of the problem, the compiler can intervene when a coincidence like the one in the example occurs (for instance, by adding a "dummy" column to the matrix).
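The bank-conflict example, and the padding fix, can be reproduced with a simplified model in which bank number = word address mod number of banks (low-order interleaving):

```python
from collections import Counter

def banks_hit(stride, n_elems, n_banks, start=0):
    """Count how many of the n_elems strided accesses fall in each bank
    (low-order interleaving: bank = address mod n_banks)."""
    return Counter((start + i * stride) % n_banks for i in range(n_elems))

# Stride 8 with 8 banks: every access lands in the same bank -> serialized.
conflict = banks_hit(stride=8, n_elems=8, n_banks=8)
print(conflict)          # one bank takes all 8 accesses

# Padding the 8-column matrix with a dummy column makes the column
# stride 9, which is relatively prime to 8: accesses spread over all banks.
padded = banks_hit(stride=9, n_elems=8, n_banks=8)
print(padded)            # every bank takes exactly one access
```

The same check explains the general rule: when gcd(stride, n_banks) = 1, consecutive elements of the vector cycle through all the banks before reusing any of them.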
The two domains of synchronization in OpenCL are work-items within a single work-group and command-queue(s) in a single context. Work-group barriers enable synchronization of the work-items in a work-group: each work-item must execute the barrier before any work-item can execute past it. Either all of the work-items in a work-group encounter the barrier, or none of them do. As currently defined in the OpenCL Specification, global synchronization (across work-groups) is not allowed.
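The barrier semantics can be mimicked on the host with Python threads (a simulation only; real work-items run on the device and use the OpenCL barrier() built-in, not threading):

```python
import threading

LOCAL_SIZE = 4
barrier = threading.Barrier(LOCAL_SIZE)   # plays the role of barrier()
phase1 = [0] * LOCAL_SIZE
phase2 = [0] * LOCAL_SIZE

def work_item(lid):
    phase1[lid] = lid * lid        # step 1: each work-item writes its slot
    barrier.wait()                 # no work-item proceeds until all arrive
    phase2[lid] = sum(phase1)      # step 2: safe to read every slot

threads = [threading.Thread(target=work_item, args=(i,))
           for i in range(LOCAL_SIZE)]
for t in threads: t.start()
for t in threads: t.join()

print(phase2)   # [14, 14, 14, 14] -- 0+1+4+9, seen identically by all
```

Without the barrier, a fast work-item could read phase1 slots that slower work-items had not yet written.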
There are two types of synchronization between commands in a command-queue:
- command-queue barrier: enforces ordering within a single queue. Any resulting changes to memory are available to subsequent commands in the queue.
- events: enforce ordering between or within queues. Commands enqueued in OpenCL return an event identifying the command as well as the memory objects it updates. Subsequent commands waiting on that event are guaranteed to see the updated memory objects before they execute.
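A toy model of in-order queues plus events shows how cross-queue ordering is enforced (a hypothetical Python sketch, not the OpenCL API; the class and function names are invented):

```python
class Event:
    def __init__(self):
        self.complete = False

class CommandQueue:
    """In-order queue: commands run in enqueue order; a command may
    additionally wait on events from any queue."""
    def __init__(self):
        self.pending = []                 # list of (fn, wait_events, event)

    def enqueue(self, fn, wait_for=()):
        ev = Event()
        self.pending.append((fn, tuple(wait_for), ev))
        return ev

    def step(self):
        """Run the head command if its event dependencies are complete."""
        if not self.pending:
            return False
        fn, waits, ev = self.pending[0]
        if all(w.complete for w in waits):
            self.pending.pop(0)
            fn()
            ev.complete = True
            return True
        return False

def flush(queues):
    # Simplified runtime scheduler: keep stepping until all queues drain.
    while any(q.pending for q in queues):
        if not any(q.step() for q in queues):
            raise RuntimeError("dependency cycle")

log = []
q1, q2 = CommandQueue(), CommandQueue()
write_done = q1.enqueue(lambda: log.append("write buffer"))
q2.enqueue(lambda: log.append("kernel B"), wait_for=[write_done])
q1.enqueue(lambda: log.append("kernel A"))
flush([q1, q2])
print(log)   # "write buffer" always precedes "kernel B"
```

The event returned by the first enqueue is what guarantees that "kernel B" on the second queue cannot run until the buffer write on the first queue has completed.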
1.2 Hardware Overview

Figure 1.1 shows a simplified block diagram of a generalized GPU compute device.
Figure 1.2 is a simplified diagram of an ATI Stream GPU compute device. Different GPU compute devices have different characteristics (such as the number of compute units), but follow a similar design pattern.
GPU compute devices comprise groups of compute units (see Figure 1.1). Each compute unit contains numerous stream cores, which are responsible for executing kernels, each operating on an independent data stream. Stream cores,
in turn, contain numerous processing elements, which are the fundamental, programmable computational units that perform integer, single-precision floating-point, double-precision floating-point, and transcendental operations. All stream cores within a compute unit execute the same instruction sequence; different compute units can execute different instructions.
Figure 1.2 Simplified Block Diagram of the GPU Compute Device1
A stream core is arranged as a five-way very long instruction word (VLIW)
processor (see bottom of Figure 1.2). Up to five scalar operations can be co-issued in a VLIW instruction, each of which is executed on one of the corresponding five processing elements. Processing elements can execute single-precision floating-point or integer operations. One of the five processing elements also can perform transcendental operations (sine, cosine, logarithm, etc.)1. Double-precision floating-point operations are processed by connecting two or four of the processing elements (excluding the transcendental core) to perform a single double-precision operation. The stream core also contains one branch execution unit to handle branch instructions.
Different GPU compute devices have different numbers of stream cores. For example, the ATI Radeon™ HD 5870 GPU has 20 compute units, each with 16 stream cores, and each stream core contains five processing elements; this yields 1600 physical processing elements.
1.3 The ATI Stream Computing Implementation of OpenCL
ATI Stream Computing harnesses the tremendous processing power of GPUs for high-performance, data-parallel computing in a wide range of applications. The ATI Stream Computing system includes a software stack and the ATI Stream GPUs. Figure 1.3 illustrates the relationship of the ATI Stream Computing components.
Figure 1.3 ATI Stream Software Ecosystem
The ATI Stream Computing software stack provides end-users and developers with a complete, flexible suite of tools to leverage the processing power in ATI Stream GPUs. ATI software embraces open-systems, open-platform standards.
1. For a more detailed explanation of operations, see the ATI Compute Abstraction Layer (CAL) Programming Guide.
parallel programming such as memory fence operations and barriers. Figure 1.11 illustrates this model with queues of commands, reading/writing data, and executing kernels for specific devices.
Figure 1.11 OpenCL Programming Model
The devices are capable of running data- and task-parallel work. A kernel can be executed as a function of multi-dimensional domains of indices. Each element is called a work-item; the total number of indices is defined as the global work-size. The global work-size can be divided into sub-domains, called work-groups, and individual work-items within a group can communicate through global or locally shared memory. Work-items are synchronized through barrier or fence operations. Figure 1.11 is a representation of the host/device architecture with a single platform, consisting of a GPU and a CPU.
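The index-space arithmetic behind work-groups can be illustrated in Python (an illustration of the NDRange decomposition, not OpenCL API code):

```python
global_size = 8      # total number of work-items (1D for simplicity)
local_size = 4       # work-items per work-group
num_groups = global_size // local_size

ids = []
for group_id in range(num_groups):
    for local_id in range(local_size):
        # get_global_id(0) == group_id * local_size + local_id
        global_id = group_id * local_size + local_id
        ids.append((group_id, local_id, global_id))

print(ids[5])   # (1, 1, 5): work-group 1, local id 1, global id 5
```

On a real device the work-items run concurrently rather than in this loop order, which is why barriers and fences are needed for any communication between them.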
An OpenCL application is built by first querying the runtime to determine which platforms are present. There can be any number of different OpenCL implementations installed on a single system. The next step is to create a context. As shown in Figure 1.11, an OpenCL context has associated with it a number of compute devices (for example, CPU or GPU devices). Within a context, OpenCL guarantees a relaxed consistency between these devices. This means that memory objects, such as buffers or images, are allocated per context; but changes made by one device are only guaranteed to be visible to another device at well-defined synchronization points. For this, OpenCL provides events, with the ability to synchronize on a given event to enforce the correct order of execution.
Many operations are performed with respect to a given context; there also are many operations that are specific to a device. For example, program compilation and kernel execution are done on a per-device basis. Performing work with a device, such as executing kernels or moving data to and from the device’s local