Simulation and Evaluation Framework for Manycore Architectures
Andreas Savva, UCY
Final Project Report
Republic of Cyprus, European Union
Feb 23, 2016
OUTLINE
• Introduction to many-core architectures.
• Main technical objectives of the project.
• Project Breakdown.
• Work Packages.
• Using the developed framework: Case Studies.
• Simulation and Results.
• Project Outcomes / Deliverables.
Manycore Architectures
• Emerging dominant trend in general-purpose CPUs
• Expected to be interconnected using on-chip networks
• Tens to hundreds of cores
• Simple cores, large parallelism
• Several design parameters:
  • I/O system
  • Processor architecture
  • Interconnection network architecture
• This project aims to:
  • Develop a simulation and evaluation framework so that researchers can explore the design parameters listed above
Main Technical Objectives – Achieved
1. Developed a simulation and evaluation framework for many-core architectures in the Java programming language.
2. Developed benchmarks to evaluate many-core architectures.
3. Developed an on-chip network simulator that supports different architectures, routing algorithms, and traffic patterns.
4. Developed a cross-compiler in C/C++ that translates programs into instructions executable by the architectures under evaluation.
5. Developed new architectures in order to evaluate the framework.
Project Breakdown
• Work Packages:
  • Progress and Result Dissemination (WP1, WP2).
  • Develop a simulator to interconnect cores (WP3).
  • Develop models for the execution units and the cores (WP4).
  • Develop Cross-Compiler (WP5).
  • Create benchmarks to measure performance (WP6).
  • Develop new architectures to evaluate the framework (WP7).
Implementation Strategy
[Figure: implementation strategy. WP1 + WP2: progress and results dissemination. Overlapping work packages: WP3 (develop many-core simulator), WP4 (develop execution units), WP5 (cross-compiler), WP6 (benchmarks), WP7 (evaluate framework).]
Project Management (WP1)
• Kick-Off Meeting December 2008
  • Targeted Application Models Developed
  • Application Design Trade-Offs
  • Roles
• Six-Month Progress Reports
• 18-Month (Interim) Progress Report
• Financial Issues
• Final Progress Report
  • Final Financial Issues
Dissemination of Results (WP2)
• Project Website
  • http://www.ece.ucy.ac.cy/labs/easoc/Research/SEFMA/home.html
• Publications
  • Publications in selected journals and conferences.
WP3: Simulator for Interconnecting Cores
• Determine specifications for the many-core network simulator.
• Evaluate existing simulation frameworks:
  • POPNET simulator (C++ programming language).
  • GPNOC simulator (Java programming language).
• Adapt the simulation framework to simulate our many-core systems.
• Develop traffic models based on many-core applications for future evaluation:
  • Random Traffic Pattern.
  • Tornado Traffic Pattern.
  • Transpose Traffic Pattern.
  • Neighbor Traffic Pattern.
COMPLETED!
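The four traffic patterns named above can be sketched as destination functions on a k×k mesh. This is a minimal illustration using common textbook definitions, not the framework's own Java code: the function names, the (x, y) coordinate convention, and the choice to apply the tornado and neighbor shifts in both dimensions are assumptions.

```python
import math
import random

# Hypothetical helpers: each maps a source node (x, y) on a k x k mesh
# to the destination node that the pattern assigns to it.

def transpose(src, k):
    """Transpose: node (x, y) sends to node (y, x)."""
    x, y = src
    return (y, x)

def tornado(src, k):
    """Tornado: shift by ceil(k/2) - 1 in each dimension, wrapping around.
    Applying the shift to both dimensions is an assumption."""
    x, y = src
    shift = math.ceil(k / 2) - 1
    return ((x + shift) % k, (y + shift) % k)

def neighbor(src, k):
    """Neighbor: send to the next node in each dimension, wrapping around."""
    x, y = src
    return ((x + 1) % k, (y + 1) % k)

def uniform_random(src, k, rng=random):
    """Random: pick any node other than the source with equal probability."""
    while True:
        dst = (rng.randrange(k), rng.randrange(k))
        if dst != src:
            return dst
```

For an 8×8 mesh, `tornado((0, 0), 8)` gives `(3, 3)` and `neighbor((7, 7), 8)` wraps around to `(0, 0)`.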
WP4: Core and Execution Unit Models
• Develop communication protocol between units and network
• Design and develop unit models:
  • Cores.
  • Memory.
  • Input/output data models.
• Framework to develop models based on the specifications.
COMPLETED!
WP5: Cross-Compiler
• Create instruction set architecture.
• Study existing compilers for RISC processors.
• Adapt existing compiler to translate programs into machine instructions.
• Integrate the compiler into the framework.
COMPLETED!
WP6: Benchmarks
• Define and evaluate all possible functions of the system based on:
  • Performance
  • Power consumption
  • Reliability
• Develop algorithms to measure performance, power consumption, and reliability.
• Develop benchmarks for many-core processors in Assembly language.
COMPLETED!
WP7: Framework Evaluation
• WP Goals:
  • Develop and evaluate novel many-core architectures.
  • Develop and evaluate algorithms for work distribution in many-core processors.
  • Cross-evaluation of the developed framework based on the new many-core architectures.
COMPLETED!
USING/EVALUATING THE FRAMEWORK
Case Studies
Reducing power consumption
• Power consumption: a major limitation in NoCs.
• Links and NoC routers: the most power-hungry components.
• Intel’s Teraflop NoC prototype suggests that link power consumption could be as high as 17%, with the rest of the power consumed by the routers.
• Goal: reduce both static and dynamic power consumption.
• Previously proposed works focus on simple static threshold mechanisms.
Need for a new intelligent dynamic power management policy for NoCs.
Reducing power consumption
Threshold-based algorithm for turning links off/on:
• Run a simulation and check link utilization.
• Choose a threshold.
• Run the simulation.
• If the new utilization of a link is smaller than the threshold, turn the link off for a period of time.
• After x cycles, turn the link back on.
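The static threshold scheme can be sketched as one update step per monitoring interval. This is only an illustration under stated assumptions: the function name `step_links`, the dictionary bookkeeping, and the rule that a link always comes back on when its off period elapses are hypothetical, not taken from the simulator.

```python
def step_links(utilization, off_until, threshold, now, off_period):
    """Return the set of links that are on for the interval starting at `now`.

    utilization: link -> utilization measured in the previous interval
    off_until:   link -> cycle at which an off link is switched back on
                 (hypothetical bookkeeping, updated in place)
    """
    on_links = set()
    for link, util in utilization.items():
        wake = off_until.get(link)
        if wake is not None:
            if now < wake:
                continue              # still inside its off period
            del off_until[link]       # x cycles elapsed: turn the link back on
            on_links.add(link)
        elif util < threshold:
            off_until[link] = now + off_period   # under-used: turn it off
        else:
            on_links.add(link)
    return on_links
```

With a threshold of 0.10 and an off period of 100 cycles, a link measured at 0.05 utilization at cycle 0 stays off until the step at cycle 100, when it is switched back on.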
NEXT: A new intelligent dynamic on/off link management scheme for NoCs based on ANNs.
Artificial Neural Networks
• Information processing paradigm inspired by the way biological neurons process information.
• Composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems.
• Used as prediction and forecasting mechanisms in several application areas.
• Able to determine hidden and strongly non-linear dependencies.
Reducing power consumption
[Figure: ANN structure with an input layer, a hidden layer, and an output neuron.]
Reducing power consumption
Intelligent ANN algorithm:
• Pre-training.
  • Choose links with minimum link utilization
  • Size of network more manageable
• Prediction scheme based on ANN
  • Divide network into smaller nets
  • Pass chosen links as inputs to the ANNs
  • Output: links to turn off
ANNs can be used for prediction since they can discover hidden dependencies.
[Figure: Power savings for 8x8 mesh and torus networks]
Reducing power consumption
[Figure: ANN predictor within the NoC (a grid of PEs), and an 8×8 network partitioned into four 4×4 networks, each with its own ANN (ANN 1 to ANN 4).]
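The partitioning in the figure can be expressed as a mapping from router coordinates to the region (and hence the ANN) that monitors it. This is a hedged sketch: the function names and the row-major, zero-based region numbering are assumptions, so region 0 here would correspond to ANN 1 in the figure only if the figure uses the same ordering.

```python
def region_of(x, y, region=4, mesh=8):
    """Index of the region x region partition that contains router (x, y).
    Regions are numbered row-major from 0 (an assumed convention)."""
    per_row = mesh // region
    return (y // region) * per_row + (x // region)

def partition(mesh=8, region=4):
    """Group every router of a mesh x mesh network by its region index."""
    groups = {}
    for y in range(mesh):
        for x in range(mesh):
            groups.setdefault(region_of(x, y, region, mesh), []).append((x, y))
    return groups
```

For the 8×8 case, `partition()` yields four groups of 16 routers each, one group per ANN.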
Reducing power consumption
• Experiments with several NoC region sizes.
• Compare hardware overheads and corresponding power savings.
• A 4×4 NoC region offers satisfactory power savings and lower ANN overhead than a 5×5 NoC region.
• A 3×3 NoC region does not provide enough information for the ANN to make accurate predictions.
• We therefore designed the ANN-based system to monitor 4×4 NoC regions.
[Figure: Power savings and hardware overheads for 3×3, 4×4, and 5×5 NoC regions]
Reducing power consumption
Prediction scheme based on ANN
• The ANN mechanism receives the average link utilizations from all the links of the 4×4 NoC partition.
• The ANN uses the utilization values to find the optimal threshold.
• It then determines whether each link is going to be turned off or on for the next n-cycle interval.
[Figure: flowchart of the ANN mechanism. Monitor link utilization → received from ALL links? (No: keep monitoring; Yes/timeout: continue) → receive link utilization for a 4x4 NoC partition → neural network → intelligently computed threshold → choose links based on threshold → output control packets to turn links on/off → next time interval.]
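The core of this loop, from utilization values to a threshold to a list of links to turn off, can be sketched as a small feed-forward pass. The sigmoid activation, the absence of bias terms, and all function and parameter names are assumptions made for illustration; the slides do not specify the activation function or how the weights are obtained.

```python
import math

def sigmoid(z):
    """Logistic activation (an assumed choice)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_threshold(inputs, w_hidden, w_out):
    """One forward pass: input layer -> hidden layer -> single output neuron.

    inputs:   utilization values fed to the input neurons
    w_hidden: one weight row per hidden neuron (hypothetical trained weights)
    w_out:    one weight per hidden neuron for the output neuron
    """
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs)))
              for row in w_hidden]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))

def links_to_turn_off(utilization, threshold):
    """Links whose utilization falls below the computed threshold."""
    return sorted(link for link, u in utilization.items() if u < threshold)
```

In use, the threshold returned by `predict_threshold` for one interval would be passed to `links_to_turn_off` together with the utilization of each link in the 4×4 partition, and the resulting list would drive the control packets for the next interval.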
Reducing power consumption
ANN hardware optimization
• A 4x4 ANN monitors 16 routers => at least 8 input neurons.
• Eight neurons at the input layer of the ANN => the hidden layer should have five neurons.
• This follows the rule of thumb that a satisfactory number of hidden-layer neurons equals half the number of input neurons plus one.
Try to minimize the size of the hidden layer…
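The rule of thumb, and why a smaller hidden layer reduces hardware, can be checked with a few lines of arithmetic. The weight count below ignores bias terms and assumes a single output neuron; both are simplifying assumptions, and the function names are hypothetical.

```python
def rule_of_thumb_hidden(n_inputs):
    """Half the number of input neurons plus one."""
    return n_inputs // 2 + 1

def weight_count(n_inputs, n_hidden, n_outputs=1):
    """Number of weights (one multiply-accumulate each), biases excluded."""
    return n_inputs * n_hidden + n_hidden * n_outputs

# Eight inputs give five hidden neurons under the rule of thumb.
hidden = rule_of_thumb_hidden(8)

# Weight counts for hidden layers of five, four and three neurons.
counts = {h: weight_count(8, h) for h in (5, 4, 3)}
```

Under these assumptions the weight count drops from 45 with five hidden neurons to 36 with four and 27 with three.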
Reducing power consumption
• Choose an appropriate size for the hidden layer of the ANN.
• Three different ANNs were developed, with five, four, and three neurons in the hidden layer.
• Using four neurons (instead of five) in the hidden layer yields the best power savings for all the traffic patterns.
[Figure: Power savings for different numbers of neurons in the hidden layer]
Reducing power consumption
• How does the bit representation of the training weights affect the threshold computation?
• 24-, 16-, 8-, 6-, and 4-bit representations were used.
• 24, 16, 8, and 6 bits show similar power savings, but the savings are significantly reduced when 4 bits are used, due to reduced training accuracy.
• => 6 bits were chosen, which makes the multiply-accumulate hardware very small.
[Figure: Power savings for different training weight bit representations]
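The effect of the weight bit width can be illustrated with uniform signed fixed-point quantization. This is a sketch only: the slides do not state the number format, so the assumed weight range of [-1, 1), the rounding rule, and the function name `quantize` are all hypothetical.

```python
def quantize(w, bits, w_max=1.0):
    """Round weight w to the nearest value representable with `bits` bits,
    assuming signed fixed point over [-w_max, w_max)."""
    levels = 2 ** (bits - 1)      # codes available on each side of zero
    step = w_max / levels         # spacing between representable values
    code = round(w / step)
    code = max(-levels, min(levels - 1, code))   # clamp to the code range
    return code * step
```

Under these assumptions a weight of 0.3 is stored as 0.25 with 4 bits but as 0.3125 with 6 bits, so the rounding error shrinks by a factor of four, which is in line with the reduced accuracy reported for 4 bits.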
Simulation and Results...
• Power savings of the ANN-based mechanism are better than the savings in the other cases.
• The ANN-based mechanism can identify a significant amount of future behavior in the observed traffic patterns.
• It can intelligently select the threshold necessary for the next timing interval.
[Figure: Power savings for 8x8 mesh and torus networks]
Simulation and Results...
• Measure throughput for each mechanism.
• Having no on/off mechanism yields a higher throughput; however, the ANN-based technique shows better throughput than the statically determined threshold techniques.
[Figure: Throughput for 8x8 mesh and torus networks]
Simulation and Results...
• Measure energy for each mechanism.
• Energy consumed using the ANN mechanism is less than in the other cases.
• The ANN exhibits a reduction in overall energy, because of a balanced performance-to-power-savings ratio, when compared to not having on/off links or to static threshold computation.
[Figure: Normalized energy for 8x8 torus networks]
Simulation and Results...
• Measure packet latency for each mechanism.
• The ANN-based mechanism incurs more delay, but we believe that the delay penalty is acceptable when compared to the associated power savings.
[Figure: Average packet latency]
Reducing power consumption
New intelligent ANN algorithm:
• Pre-training.
  • Choose router ports with minimum port utilization
  • Size of network more manageable
• Prediction scheme based on ANN
  • Divide network into smaller nets
  • Pass chosen ports as inputs to the ANNs
  • Output: ports to turn off
[Figure: flowchart of the ANN mechanism for router ports. Monitor port utilization → received from ALL ports? (No: keep monitoring; Yes/timeout: continue) → receive port utilization for a 4x4 NoC partition → neural network → intelligently computed threshold → choose ports based on threshold → output control packets to turn ports on/off → next time interval.]
Reducing power consumption
• When router ports become unavailable, temporarily or permanently, X-Y routing cannot guarantee a deadlock-free system.
• Since router ports are turned off in our work, a new routing algorithm must be developed to make sure that there are no deadlocks.
• Fully adaptive routing algorithms perform better in the presence of faults, but they are very difficult to implement because of their higher overhead in silicon area and energy consumption.
• For this reason, a partially adaptive routing algorithm was chosen to achieve a certain degree of fault tolerance in our system.
Reducing power consumption
• The Fault-Tolerant Negative-First algorithm is based on the turn model.
• It forbids certain turns so that deadlock can be avoided.
• A packet is first routed in the negative direction of each dimension and then in the positive direction: the message first moves west or south until the negative offsets are zero, and only after that moves north or east.
[Figure: Negative-First routing algorithm in an 8x8 mesh network]
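The negative-first rule can be sketched for a 2D mesh as follows. This shows only the minimal form of the rule with a simple check for switched-off ports; it is not the fault-tolerant variant used in the project, and the coordinate convention (x grows to the east, y grows to the north), the function names, and the `port_is_on` callback are assumptions.

```python
def negative_first_ports(cur, dst):
    """Output ports allowed by minimal negative-first routing at node `cur`."""
    (cx, cy), (dx, dy) = cur, dst
    negative = []
    if dx < cx:
        negative.append("W")
    if dy < cy:
        negative.append("S")
    if negative:
        return negative           # all negative hops must be taken first
    positive = []
    if dx > cx:
        positive.append("E")
    if dy > cy:
        positive.append("N")
    return positive               # empty list: the packet has arrived

def route(src, dst, port_is_on=lambda node, port: True):
    """Follow the first allowed port that is switched on at each hop."""
    step = {"W": (-1, 0), "S": (0, -1), "E": (1, 0), "N": (0, 1)}
    path, cur = [src], src
    while cur != dst:
        options = [p for p in negative_first_ports(cur, dst)
                   if port_is_on(cur, p)]
        if not options:
            raise RuntimeError("no permitted port is on at %r" % (cur,))
        ox, oy = step[options[0]]
        cur = (cur[0] + ox, cur[1] + oy)
        path.append(cur)
    return path
```

Routing from (5, 5) to (2, 7) first takes three hops west and then two hops north. If the west port at (3, 3) is off, a packet going to (1, 1) takes the south port there instead, which is the limited adaptivity the turn model permits.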
Simulation Results
• The power savings of the ANN-based mechanism are better than those of the statically determined case and of the case without any on/off ports, for all the traffic models.
[Figure: Power savings for 8x8 mesh and torus networks]
Simulation Results...
• Having no on/off mechanism yields a higher throughput; however, the ANN-based technique yields better throughput than the statically determined threshold.
[Figure: Normalized throughput for 8x8 mesh and torus networks]
Results from the framework use
• The framework can be used by researchers to evaluate many-core architectures.
• It helps to compare how the number of cores affects the total power consumption of the network.
• Intel showed that the number of cores may be affected by power consumption because of the increased number of routers, interconnects, and data travelling through the network.
• Researchers can do parameter exploration related to many-core architectures.
• This new Network-on-Chip framework helps researchers solve different NoC tasks through simulation.
Project Outcomes
• Smooth flow of work
  • Some simulator problems have been overcome
  • Help from Dr. Soteriou and Drs. Michael and Hadjicostis
• Results dissemination on target with project goals.
• Publications in conferences/journals
  • Participation in ISVLSI Conference, July 2011, Chennai, India.
  • Publication in Journal of Electrical and Computer Engineering, Hindawi Publishing Corporation, 2012.
  • Submission to ISVLSI 2012: paper on turning router ports on/off (under review).
Publications
ARTICLES:
• A. Savva, T. Theocharides, V. Soteriou, “Intelligent On/Off Link Management for On-Chip Networks”, In Proc. IEEE Annual Symposium on VLSI, pp. 343–344, July 2011.
• Under Review: A. Savva, T. Theocharides, V. Soteriou, “Intelligent On/Off Router Ports Management for Networks on Chip”, ISVLSI Conference 2012.
JOURNALS:
• Andreas G. Savva, T. Theocharides, V. Soteriou, “Intelligent On/Off Dynamic Link Management for On-Chip Networks”, Journal of Electrical and Computer Engineering, vol. 2012, Article ID 107821, 2012.
POSTER:
• Poster at HiPEAC Ph.D. Student Poster Presentation, Paphos, Cyprus, January 2009.
WORKSHOP:
• Results of this work were presented in a workshop at KIOS Research Centre, 30 Nov. 2011.
Project Deliverables:
• D1: Six-Month, Interim, and Final Reports; Financial Reports
• D2: Project Website, Publications
• D3: Network communication simulator in Java; four traffic models for simulation and evaluation of the network (source code available)
• D4: RISC processor models, memory models, core models, input/output models (VHDL/C++ code)
• D5: Cross-compiler
• D6: Benchmarks; algorithms for power consumption and performance measurements.
• D7: Many-core architectures; evaluation of the developed framework.
Acknowledgements to:
• Dr. Maria K. Michael, for the feedback on the verification and automation algorithms.
• Dr. Christoforos Hadjicostis, for the reliability aspects and the discrete-event algorithms employed in building the simulator.
• Dr. Vassos Soteriou, for the feedback on the interconnect.
• Dr. Theocharis Theocharides, for the coordination of this project and all the help.
This work falls under the Cyprus Research Promotion Foundation’s Framework Programme for Research, Technological Development and Innovation 2008 (DESMI 2008), co-funded by the Republic of Cyprus and the European Regional Development Fund, and specifically under Grant PENEK/ENISX/0308.
THANK YOU!
Project Host Organization
University of Cyprus
Andreas Savva, Theocharis Theocharides, Maria K. Michael, Christoforos Hadjicostis
Collaborating Partners
Cyprus University of Technology
Vassos Soteriou