Dec 13, 2015

Performance Analysis of a JPEG Encoder Mapped To a Virtual MPSoC-NoC Architecture Using TLM 2.0.1

林孟諭

Dept. of Electrical Engineering, National Cheng Kung University

Tainan, Taiwan, R.O.C.

Outline

Abstract
Introduction
NoC Architecture
Encoder Task Graph
Task Profiling
App. Performance on NoC
App. Mapping on Processors
Results Analysis
Conclusion

Abstract

Networks on Chip (NoCs) are commonly used to integrate complex embedded systems and multiprocessor platforms due to their scalability and versatility.

With modeling tools that describe such architectures at the functional level, co-design and error correction are now performed concurrently.

This work takes a JPEG encoder and maps it onto a configurable M×N NoC architecture that implements Message Passing Interface (MPI) communication between cores.

Introduction

Complexity, scalability and portability are becoming essential issues to address when designing digital systems.

While advances in fabrication technology have allowed embedded platforms to integrate a large amount of hardware resources, the technology used to interconnect them has been moving from typical hierarchical bus connections to network-based solutions called Networks on Chip (NoCs).

To ease and optimize the flow of information in many-core architectures, one way to interconnect cores is through networks.

Designing NoCs also poses challenges, in both the HW and SW fields:

  Regarding HW, choices of topology, router architecture and network interface structure can lead to considerably different results.

  On the SW side, the main obstacle is defining the programming model for the NoC-based system, as both shared and distributed memory approaches have their drawbacks.

The paper finds the distributed memory model more suitable for a network-based architecture and uses it with a message-passing structure, namely the Message Passing Interface (MPI).

The MPI approach allows several mappings to be performed with little programming effort.

NoC Architecture (1/2)

The core of the NoC is composed of routers and network interface cards (NICs):

  Routers are in charge of delivering the information, in the form of packets (flits), from source to destination.

  Network cards receive transactions from end modules, translate them into flits and send them to the router network for distribution.

The router model is defined with the following structure:

1) Switching Technique: Wormhole packet-based.
2) Routing Algorithm: Either XY, West-First or North-Last.
3) Flow Control: Handshaking ACK/NACK signals.
4) Virtual Circuits: Four at each input; one per output port. Variable depth.
5) Link width: 32 bits.
6) Output Arbitration: Round-Robin.
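As an illustration of the XY option in the routing-algorithm entry, the following minimal sketch shows dimension-ordered routing on a 2D mesh: a flit first travels along X until the column matches, then along Y. The coordinate convention (Y grows towards "NORTH"), the port names and the function names are assumptions made for this example; the slides do not give the router's actual interface.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (x, y) position of a router in the mesh (assumed convention)


def xy_next_hop(current: Coord, dest: Coord) -> str:
    """Return the output port chosen by dimension-ordered XY routing."""
    cx, cy = current
    dx, dy = dest
    if cx != dx:                      # first correct the X coordinate
        return "EAST" if dx > cx else "WEST"
    if cy != dy:                      # then correct the Y coordinate
        return "NORTH" if dy > cy else "SOUTH"
    return "LOCAL"                    # arrived: deliver to the attached NIC


def xy_route(src: Coord, dest: Coord) -> List[Coord]:
    """List every router visited from src to dest (both included)."""
    step = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1)}
    path = [src]
    while path[-1] != dest:
        port = xy_next_hop(path[-1], dest)
        mx, my = step[port]
        path.append((path[-1][0] + mx, path[-1][1] + my))
    return path


print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because the X offset is always resolved before the Y offset, every packet between the same pair of routers follows the same path, which is what makes XY routing deterministic.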

NoC Architecture (2/2)

As the application has to be written in MPI, every call to mpi_send() on one core must match one mpi_receive() on another.

End-to-end flow control is handled as follows:

1) Call to mpi_send(): The core notifies the NIC to start packing data and keep it in a local buffer, ready to be sent.
2) Call to mpi_receive(): The core asks the NIC to send a data-request message (1 flit long) to the corresponding address so that the transfer starts.

A timer is set to re-send the request after a while if no data is received.
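The two-step handshake above can be sketched as a toy model. Everything here except the names mpi_send and mpi_receive is hypothetical: the `Nic` class, its `outbox` buffer, `on_request` and `pack_flits` are introduced only for illustration, the timer is reduced to a retry counter, and the 4-byte flit size is taken from the 32-bit link width listed for the router.

```python
FLIT_BYTES = 4  # 32-bit link width, as listed in the router parameters


def pack_flits(payload: bytes) -> list:
    """Split a payload into 32-bit flits, zero-padding the last one."""
    padded = payload + b"\x00" * (-len(payload) % FLIT_BYTES)
    return [padded[i:i + FLIT_BYTES] for i in range(0, len(padded), FLIT_BYTES)]


class Nic:
    """Hypothetical NIC model: holds packed data until a request arrives."""

    def __init__(self, address: int):
        self.address = address
        self.outbox = {}  # destination address -> flits waiting to be sent

    def mpi_send(self, dest: int, payload: bytes) -> None:
        # Step 1: pack the data and keep it in a local buffer.
        self.outbox[dest] = pack_flits(payload)

    def on_request(self, requester: int):
        # A data-request flit arrived: release the buffered flits, if any.
        return self.outbox.pop(requester, None)


def mpi_receive(receiver: Nic, sender: Nic, max_retries: int = 3):
    """Step 2: send a one-flit request; re-send it if no data comes back."""
    for attempt in range(1, max_retries + 1):
        flits = sender.on_request(receiver.address)
        if flits is not None:
            return flits, attempt
        # No data yet: in the modelled system a timer expiry triggers the retry.
    return None, max_retries


sender, receiver = Nic(0), Nic(1)
print(mpi_receive(receiver, sender, max_retries=2))  # (None, 2): nothing buffered yet
sender.mpi_send(dest=1, payload=b"JPEGDATA!")
flits, attempts = mpi_receive(receiver, sender)
print(len(flits), attempts)  # 3 1
```

The receiver-initiated request means data only enters the network when the destination is ready for it, at the price of one extra flit per transfer.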

Fig. 1. NoC parameterizable proposed architecture.

Encoder Task Graph

To obtain a detailed and optimized functional partitioning, a task graph was created to identify parallelism and temporal dependencies.

Fig. 2. JPEG Encoding Algorithm Task Graph.

Task Profiling

Some criteria are needed before mapping each task to the NoC platform; therefore, profiling each task is suggested to identify heavy computations and algorithm bottlenecks.

Associated costs were assigned to measure processor time:

  1 time unit for sums, loads, stores and logical operations
  2 time units for multiplications and divisions

For fixed tasks such as RGB to YUV, DCT and quantization, it is possible to estimate the number of operations.

Encoding and bit-stream writing are block-dependent operations, so their computing cost depends on the amount of redundant information in the image.

Table 1. Average cost of the JPEG encoding algorithm (per iteration).
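The cost rule above amounts to a weighted sum of operation counts. The short sketch below applies it to one made-up task; the operation counts are illustrative only and are not the values of Table 1, and the dictionary keys and function name are assumptions chosen for this example.

```python
# Cost per operation class, in processor time units (from the profiling rule).
UNIT_COST = {
    "add": 1, "load": 1, "store": 1, "logic": 1,
    "mul": 2, "div": 2,
}


def task_cost(op_counts: dict) -> int:
    """Weighted sum of operation counts for one task iteration."""
    return sum(UNIT_COST[op] * n for op, n in op_counts.items())


# Illustrative (made-up) counts, e.g. a 3x3 matrix-vector product on one pixel.
# These are NOT values taken from Table 1.
example_task = {"load": 3, "mul": 9, "add": 6, "store": 3}
print(task_cost(example_task))  # 3 + 18 + 6 + 3 = 30
```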

App. Performance on NoC

This work builds on the task graph and the profiling to perform different mappings of the JPEG encoding application onto the NoC and analyze its performance.

Each of the listed tasks was manually assigned to the processing units according to its cost.

App. Mapping on Processors (1/4)

Three parallel branches compose the JPEG encoding:

1) DCT
2) Quantization
3) Huffman encoding

There is also sequential behavior at two points:

1) RGB to YUV
2) Bit-stream file writing

The mappings shown in Fig. 3, on 2×2, 3×2 and 3×3 NoCs, were proposed for evaluation.

Fig. 3. JPEG Encoder Evaluated Mappings. Tests were carried out with 4, 6 and 8 processors. Each processor computes one of the tasks shown in Fig. 2 for specific image components.

App. Mapping on Processors (2/4)

A simulation was conducted for each mapping with a 512×512 BMP image.

The parameters set during simulation were: mesh topology, XY routing, virtual circuit depth 2~10 and network speed half the processor's.

In all cases, increasing the VC depth only slightly reduces the execution time of the algorithm. This implies that, for the proposed router architecture, a depth of 2 flits per virtual circuit is more than enough.

Fig. 4. JPEG encoder performance on mesh NoCs with XY routing, 2 flits/VC and network speed equal to half the processor's. Changes in router parameters such as routing algorithm, topology and VC depth don't yield significant improvements.

App. Mapping on Processors (3/4)

To analyse the impact of synthesis technology on the NoC, the router and NIC speed was lowered to -3X and -4X, where X is the processor speed.

From Fig. 5, the mapping still improves the encoding:

  with 6 and 8 processors when the network is 3X slower
  with 8 processors when the network is 4X slower

Fig. 5. JPEG relative performance for network speeds -3X & -4X (X is processor speed). Image size was 512x512 pixels.

App. Mapping on Processors (4/4)

To generalize the results, a final simulation was performed with different image sizes (see Fig. 6).

For the proposed task partitioning and mapping, the gain is around 24-25% with 4 processors, 45-46% with 6 and 49-50% with 8, irrespective of the image size.

Fig. 6. Application performance for different image sizes.

Results Analysis

There is one consistent behavior in the previous results: performance improves (execution time decreases) as the number of cores increases.

From Fig. 4:

  the gain obtained by going from 1 to 4 and from 4 to 6 processors is around 25~27% each
  the enhancement from 6 to 8 cores is only 8~9%, but the area cost is very high

Even though an attempt was made to cover the most significant simulation aspects at a high level, it is not clear which criterion should be considered best: latency, execution time, computation/communication rate, traffic distribution, area consumption, etc.

There is no single criterion to resolve such a crossroads; only design restrictions and specifications can provide a guide to a satisfactory answer.
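The diminishing return from 6 to 8 cores can be checked with a little arithmetic. The sketch below assumes that "gain" means the relative reduction in execution time with respect to a single processor, gain = 1 - T_n / T_1; the slides do not state this definition, so it is an interpretation. The cumulative values are the mid-points of the ranges quoted for 4, 6 and 8 processors, and the function name is hypothetical.

```python
def incremental_gain(gain_from: float, gain_to: float) -> float:
    """Relative time reduction when moving between two configurations.

    Both arguments are cumulative gains versus one processor, as fractions.
    Assumes gain = 1 - T_n / T_1 (an interpretation, not stated in the slides).
    """
    t_from = 1.0 - gain_from      # normalized execution time before
    t_to = 1.0 - gain_to          # normalized execution time after
    return (t_from - t_to) / t_from


# Mid-points of the cumulative gains quoted for 4, 6 and 8 processors.
cumulative = {1: 0.0, 4: 0.245, 6: 0.455, 8: 0.495}
cores = sorted(cumulative)
for a, b in zip(cores, cores[1:]):
    step = incremental_gain(cumulative[a], cumulative[b])
    print(f"{a} -> {b} cores: {step:.1%}")
```

Under that assumption the first two steps each come out close to a quarter of the execution time, while the step from 6 to 8 cores stays below 10%, which is of the same order as the figures quoted above.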

Conclusion

It was possible to correctly validate the design at the functional and architectural levels.

Several simulations were executed in a short time, which allowed numerous analyses to be performed.

The previous results give the designer an overview of the number of variables that have to be taken into account when dealing with multiprocessor platforms on NoC structures.