Multi-Core System on Chip, Seminar 2, December 2003
Feb 04, 2016
2
What is Software Radio
- A transceiver in which all aspects of its operation are determined using versatile general purpose hardware whose configuration is under software control
- Flexible all-purpose radios that can implement new and different standards or protocols through reprogramming.
- Same hardware for all air interfaces and modulation schemes
3
Key Technological Constraints
• High-speed wideband ADCs
• High-speed DSPs
• Real-time operating systems (isochronous software)
• Power consumption
4
Research and Commercialization
• DARPA’s Adaptive computing system project
• Virginia Tech – algorithms and architecture ; multi user receiver based on reconfigurable computing ; generic soft radio architecture for reconfigurable hardware
• UC Berkeley – Pleiades, ultra low power, high performance multimedia computing ; high power efficiency by providing programmability
• Sirius Inc – Software Reconfigurable Code Division Multiple Access (CDMAx)
5
Research and Commercialization
• Brigham Young University – Development of JHDL to facilitate hardware synthesis in reconfigurable processors
• Chameleon Systems- Reconfigurable Platform Architecture for wireless base station
• MorphIC Inc -Programmable hardware reconfigurable code using DRL
• Quicksilver Tech. Inc – Universal Wireless `Ngine (WunChip) baseband algorithms
6
Applications
• User Applications and Base Station Applications
• Evolve as a universal terminal
• Spectrum management: reconfigurability is a big advantage
• Application updates, service enhancements and personalization
7
Programmable OFDM-CDMA Transceiver
• CDMA suffers from multiple-access interference (MAI) and ISI.
• OFDM reduces interference, improves spectrum utilization, and helps attain a satisfactory BER.
• It is proposed that such a transceiver be implemented using SDR.
8
SDR Architecture
[Block diagram: an RF unit (duplicated receive/transmit chains with PA, LNA, and Rx/Tx synthesizers) connects to a signal processing/control unit (data converter, quadrature MODEM, baseband MODEM, interface/control) over a Compact PCI (C-PCI) bus; an HMI terminal handles input/output.]
Hitachi Kokusai Electric Inc., [email protected]
9
Signal processing/control unit
• The signal processing/control unit consists of the following modules:
– Data converter
– Quadrature modem
– Baseband modem
– Interface/control
• The modules are connected to each other by a PCI bus, and each provides a CPU in addition to its FPGA and DSP devices.
10
Quadrature modem module
• The quadrature modem uses FPGAs for processing at the baseband sampling rate:
– Quadrature modulation
– Quadrature detection
– Sampling rate conversion
– Filtering
11
Baseband modem module
• The baseband modem processes:
– Multi-channel modulation
– Multi-channel demodulation
• It uses four floating-point DSP devices.
• An individual DSP is assigned to each channel; therefore, even while one channel is executing, a program can be downloaded to another channel.
12
An SDR/Multimedia Solution: W-CDMA / DAB / DVB / IEEE 802.11x; MPEG / JPEG codecs
15
Architecture Goals
• Provide a template for the exploration of a range of architectures
• Retarget the compiler and simulator to the architecture
• Enable the compiler to exploit the architecture
• Concurrency
– Multiple instructions per processing element
– Multiple threads per and across processing elements
– Multiple processes per and across processing elements
• Support for efficient computation
– Special-purpose functional units, intelligent memory, processing elements
• Support for efficient communication
– Configurable network topology
– Combined shared memory and message passing
16
Architecture Template
• Prototyping template for an array of processing elements
– Configure the processing element for efficient computation
– Configure memory elements for efficient retiming
– Configure the network topology for efficient communication
[Diagram: each processing element contains functional units (FUs), a register file, memory, and an I-cache, optionally with specialized DCT and Huffman (HUF) units. Configure the PE, configure the memory elements, and configure the PEs and network to match the application.]
17
Future Processing Element
• Specialized memory systems for efficient memory utility
– Multi-ported, banked, multi-level, and intelligent memory
• Split register file allows greater register bandwidth to FUs
– Groups of functional units have dedicated register files
• Multiple contexts per processing element provide latency tolerance
– Hardware for efficient context switching to fill empty instruction slots
• Specialized functional units and processing elements
– SIMD instructions
– Reconfigurable fabrics for bit-level operations
– Reused IP blocks for more efficient computation
– Custom hardware for the highest performance
18
Initial Distributed Architecture
• Array of concurrent PEs and supporting network
• Malleable network topology
– Topology matches application
• Efficient communication
• Memory organized around a PE
– Each PE has physical memory
– Message passing between PEs
[Diagram: an array of PEs connected by a configurable network.]
19
Future Distributed Architecture
• Multiple processing elements share a memory space
– Shared-memory communication
• Snooping cache-coherency protocol
• A directory-based protocol is required if the number of PEs in a shared memory space is large
• Introspective processing elements
– Use processing elements to analyze the computation or communication
• Identify dynamic bottlenecks and remove them on the fly
• Reschedule and bind tasks as the introspective elements report
20
So What’s Different?
• Traditional application hw/sw design requires
– Hand selection of traditional general-purpose OS components
– Hand-written customization of
• device drivers
• memory management…
• Instead…
– Application-specific synthesis of OS components
• scheduling
• synchronization…
– Automatic synthesis of hardware-specific code from specifications
• device drivers
• memory management…
21
ASIP Design
• Given a set of applications, determine the microarchitecture of the ASIP (i.e., the configuration of functional units in the datapaths, and the instruction set)
• To accurately evaluate the performance of the processor on a given application, one needs to compile the application program onto the processor datapath and simulate the object code.
• The microarchitecture of the processor is a design parameter!
23
Compiler Goals
• Develop a retargetable compiler infrastructure that enables a set of interesting applications to be efficiently mapped onto a family of fully programmable architectures and microarchitectures.
• 10-Year Vision:
– Fully automatically retargetable compilation, OS synthesis, and simulation for a class of architectures consisting of multiple heterogeneous processing elements with specialized functional units / memories
– Compiled code size and performance within 10% of hand coding
24
Compiler Research Issues
• Synthesis of RTOS elements in the compiler
– On the application side: generation of an efficient application-specific static/run-time scheduler and synchronization
– On the hardware side: generation of device drivers, memory management primitives, etc. using hardware specifications
• Automatic retargetability for a family of target architectures while preserving aggressive optimization
• Automatic application partitioning
– Mapping of process/task-level concurrency onto multiple PEs using programmer guidance in the programmer's model
• Effective visualization for a family of target architectures
25
An Efficient Architecture Model for Systematic Design of Application-Specific Multiprocessor SoC
DATE 2001
Amer Baghdadi, Damien Lyonnard, Nacer-E. Zergainoh, Ahmed A. Jerraya
TIMA Laboratory, Grenoble, France
26
Efficient application-specific multiprocessor design
• Modularity
• Flexibility
• Scalability
27
A multiprocessor architecture platform for application-specific SoC design(1)
Figure 1. A multiprocessor architecture platform
28
A multiprocessor architecture platform for application-specific SoC design(2)
• Architecture platform parameters
1. Number of CPUs
2. Memory sizes for each processor
3. I/O ports for each processor
4. Interconnections between processors
5. Communication protocols and the external connections (peripherals)
29
Application-specific multiprocessor SoC design flow (1)
Figure 2. The Y-chart: MFSAM-based architecture generation scheme
30
Application-specific multiprocessor SoC design flow(2)
Figure 3. MFSAM-based architecture generation flow for multiprocessor SoC
32
Architecture design(2)
Figure 5. Block diagram of the packet routing switch (Point to Point network)
33
Architecture validation
Figure 6. A 4-processor cosimulation architecture of the packet routing switch
34
Analyzing the design cycle (1)
Figure 7. A 4-processor cosimulation architecture of the IS-95 CDMA
35
Analyzing the design cycle (2)
Table 1. Time needed to fit the IS95 CDMA on the multiprocessor platform
36
Conclusion
1. Presented a generic architecture model for application-specific multiprocessor system-on-chip design
2. The proposed model is modular, flexible and scalable.
3. Defined the architecture model and a systematic design flow that can be automated.
37
A Single-Chip Multiprocessor
• Currently, processor designs dynamically extract parallelism by executing many instructions within a single, sequential program in parallel.
• Future performance improvements will require processors to be enlarged to execute more instructions per clock cycle.
• Two alternative micro-architectures that exploit multiple threads of control
– SMT: simultaneous multithreading
– CMP: chip multiprocessor
38
A Single-Chip Multiprocessor
• Exploiting parallelism
– Loop-level parallelism results when the instruction-level parallelism comes from data-independent loop iterations.
– Some compilers can also divide a program into multiple threads of control, exposing thread-level parallelism.
– A third form of very coarse parallelism, process-level parallelism, involves completely independent applications running in independent processes controlled by the operating system.
39
Exploiting Program Parallelism
[Figure: levels of parallelism (instruction, loop, thread, process) plotted against grain size, from 1 to 1M instructions.]
40
SMT (simultaneous multithreading)
• SMT processors augment wide (issuing many instructions at once) superscalar processors with hardware that allows the processor to execute instructions from multiple threads of control concurrently
• Dynamically selecting and executing instructions from many active threads simultaneously.
• Higher utilization of the processor’s execution resources
• Provides latency tolerance in case a thread stalls due to cache misses or data dependencies.
• When multiple threads are not available, however, the SMT simply looks like a conventional wide-issue superscalar.
41
Single- vs. Multi-threaded
• single-threaded/blocking: the CPU waits for the accelerator
• multithreaded/non-blocking: the CPU continues to execute along with the accelerator
42
Multithreading
– Multiple threads share the functional units of a single processor in an overlapping fashion.
– The processor must duplicate the independent state of each thread (register file, a separate PC, page table).
– Memory can be shared through the virtual memory mechanisms, which already support multiprocessing.
– Needs hardware support for switching between threads.
43
Single-Chip Multiprocessor
• CMPs use relatively simple single-thread processor cores to exploit only moderate amounts of parallelism within any one thread, while executing multiple threads in parallel across multiple processor cores.
• If an application cannot be effectively decomposed into threads, CMPs will be underutilized.
47
Superscalar Issue
Superscalar leads to more performance, but lower utilization
48
Simultaneous Multithreading
Maximum utilization of functional units by independent operations
50
SMT Architecture
• 8 separate PCs; executes instructions from 8 different threads concurrently
• Multi-bank caches
51
Chip multiprocessor architecture
8 small 2-issue superscalar processors; depends on TLP
52
Single-chip multiprocessor (Kunle Olukotun, http://www-hydra.stanford.edu)
• Four processors
• Separate primary caches; write-through data caches to maintain coherence
• Shared 2nd-level cache
• Low-latency interprocessor communication (10 cycles)
• Separate read and write buses
[Diagram: four CPUs, each with L1 instruction and data caches and a memory controller, connected by a 64-bit write-through bus and a 256-bit read/replace bus to an on-chip L2 cache; a Rambus memory interface to DRAM main memory, an I/O bus interface to I/O devices, and centralized bus arbitration mechanisms.]
53
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor
54
CMP and Memory
• A 12-issue superscalar or SMT processor can place large demands on the memory system.
• The CMP architecture features sixteen 16-Kbyte caches.
– The small cache size and tight connection to these caches allow single-cycle access.
55
CMP Solution
• A short cycle time can be targeted with relatively little design effort, since the hardware is naturally clustered: each of the small CPUs is already a very small, fast cluster of components.
• The OS allocates a single software thread of control to each processor, so no hardware is required to dynamically allocate instructions to different clusters.
• Heavy reliance on software to direct instructions to clusters limits the ILP a CMP can exploit, but allows the clusters within the CMP to be small and fast.
56
A Single-Chip Multiprocessor
• Relative performance of superscalar, simultaneous multithreading, and chip multiprocessor architectures
57
Multi-core SoC Platform Integration using AMBA
DesignCon 2002 System on Chip and IP Design Conference
Robert L. Veal, Levon Petrosian, Neal Stollon
58
Outline
• Overview of AMBA AHB
• AMBA Application to Multiprocessor Systems (RAMA)
• Summary
• Core integration is a significant part of SoC design
– Including both RISC and signal processing engines
– Well-defined bus strategies make it easier
• AHB is being adopted based on both features and standardization
– Low overhead for core-to-memory communication
– Standard interface increases IP value
– RAMA integrates RADcore and OMNIcore using AHB, along with memory blocks, arbiters and external interfaces
AMBA Based Integration for SoC Platforms
Overview of AMBA AHB
• AMBA: Advanced Microcontroller Bus Architecture
• AHB: Advanced High-performance Bus
• RAMA: Reconfigurable Array Multimedia Architecture
• RADcore: Infinite Technology Corporation's proprietary cores for reconfigurable signal processing
• OMNIcore: Infinite Technology Corporation's proprietary core for general-purpose RISC processing
Key to AHB
– Definition of master and slave AHB components
– Master: initiates operations by sourcing the address and control signals for a bus operation
– Slave: responds and performs operations under the control of a master; memories and peripherals
Attractive key features of AMBA AHB
– Configurable data bus size (8 to 1024 bits)
– Dedicated request/grant and bus-locking signals
– Flexible (user-defined) arbiter-based bus control
– State-based handshaking between master and slave
– No tri-stated busses; mux-based unidirectional operation
Value of AMBA Interfaces in Core Integration
AHB Principle of Operations
Specific datapath structure and signaling of a multiplexed bus
– Interconnection of multiple masters and slaves is handled by multiplexors
– On-chip bussing based on an arbitrated request/grant approach
– Bussing of two types of interface:
• Master interfaces: initiate transactions through granted requests; source of the address and communication parameters of a data transfer
• Slave interfaces: respond to master requests and provide the status of requested transactions
High-performance system bus
– Supports multiple bussed cores and provides high-bandwidth operation
– Single-edge-timed, multiplexed data bus controlled by arbitration logic
– All busses and signals are unidirectional, as an on-chip bus structure
AHB Variants
Specifics of interconnection structure - Open to the user
Different bus structures and levels of transfer bandwidth - Characterized by number of masters and bus layers (sub-buses) - Efficient customization of the architecture within the standardized platform framework
Usage for multi-processor core platforms - Several types of busses are concurrently used for control and high data transfer in inter-core communications
Single-layer/Single-master AHB
– Known as AHB-Lite, a reduced-complexity version
– A single master: no contention for bus ownership, no arbitration
– No arbitration: no implementation of request and grant signals
Single-layer/Multi-master AHB
- Ensure that a given master gains and maintains access to the bus - Increase the performance of data transfers between multiple signal processors and memories
Multi-layer/Single-master AHB
- Concurrently accessing common slave resources - The number of masters determines the number of bus layers - Each master has a dedicated bus
Multi-layer/Multi-master AHB
- Each master has a dedicated bus in multi-layer - Both masters and slaves access a common set of bus resources - The number of bus layers defined by the number of slaves requiring concurrent data transfer
Master/Slave Communication
Both the AHB master and slave have embedded (4-state) state machines
– Allow communication for master-slave and multiple-master status
Specifics of the FSM operation
– Driven by the features of the processor (transfer FSM) and memory (response FSM) blocks being used
RAMA Block Diagram
AMBA Applications to Multiprocessor Systems
RAMA
High-performance multi-core platform for addressing datapath applications
– Standardizes the on-chip bus operation by adopting AMBA AHB
Integration of ITC's RADcore and OMNIcore processor cores
RADcore
– Signal processing engine: parallel processing, Reconfigurable Arithmetic Datapath (RAD) features
– Data interface: initialization I/O EXU, memory bus interfaces, RADbus interface
OMNIcore
– 32-bit cryptographic/RISC architecture
– High-performance RISC processor with a dual memory bus interface
– Uses AHB as its central bus structure
Other elements
– Memory blocks, an external memory interface core, arbitration logic
RAMA Multi-layer using AHB
Inter-core communication is based on two AHB busses
– Separate and reduce any interdependence of control and data access
– Control interface: a single-layer AHB with the OMNIcore control/ROM port and the external memory DMA as masters
– Slaves are the boot and local (instruction) memory and the RADcore control interfaces
– Since the inter-core control and memory update operations are intermittent
The data transfer AHB has up to six masters
– OMNIcore data/RAM port and the external memory DMA, along with up to four RADcore I/O ports
– To facilitate high-bandwidth multi-core performance, the data transfer AHB is a multi-layer AHB structure
Interfaces of Multiple System Domains in RAMA
Key communications interfaces of RAMA
① RADcore to on-chip memory array data read/write operations
② OMNIcore to on-chip memory array data read/write operations
③ RADcore to External Memory Buffer read/write operations
④ OMNIcore to External Memory Buffer read/write operations
⑤ RADcore-to-RADcore data transfers
⑥ RADcore to external logic data transfers
⑦ External memory (DMA) to internal memory array read/write operations
⑧ OMNIcore to RADcore control read/write operations
⑨ OMNIcore to local (scratch) RAM read/write operations
⑩ OMNIcore to (boot) ROM read operations
RADcore Overview
A High Performance Reconfigurable Signal Processor with Distributed IW Architecture
A core controller/sequencer block, a DIW instruction memory, a set of Execution Units (EXUs), data I/O, and an external logic interface
The initialization busses, the Reconfigurable Channel Bus (RCB) and the supporting flags encapsulate and interconnect each EXU
Key features
• 15-channel reconfigurable data-bus-based architecture
• Reduces register-based operations
• User-definable pipeline depth
• Distributed-instruction-word-driven parallel operation
• Supports highly pipelined dataflow
• Configuration selectable by designer (up to 11 EXUs)
• AMBA-compatible memory and core-to-core busses
• Spreadsheet-based RADware programming environment
RADcore Interfaces
– Controller interface: between the RADcore and host processor
– Memory interface: both on-chip RAM block and off-chip memory interfaces
– RADbus interface: RADcore-to-RADcore, initialization I/O EXU
– External Logic Buffer: co-processing with arbitrary external logic
OMNIcore Overview
Key features
• 32-bit RISC engine
• Cryptographic support
• AHB-compliant control and RAM busses → user-selectable 8- to 32-bit operation
• 4-stage pipeline → low interrupt latency
• Two privilege levels (user, system) → supports smart-card applications
Two primary interfaces, for instruction operation (Ctrl) and data read/write (RAM)
– Access to the memory bus for on-chip memory and external memory operation using its RAM interface
– Access to a local control bus, for loading instruction data into the instruction cache and for supervisory and status communications with RADcore control blocks, using its Control interface
Dual-master AHB interface to integrate control and data functions
– Data output bus is shared
– An instruction cache internal to the OMNIcore subsystem is used to avoid stalling
OMNIcore Cryptographic Features
Public-private key cryptographic algorithms - DES, RSA, DSA and Diffie-Hellman - Controlled by a set of cryptographic instructions
Cryptographic Instruction supports for - Compression Permutation - Expansion Permutation - Initial Permutation - Final Permutation - Key Permutation - Key Rotation - P-Box Permutation - S-Box Permutation
RAMA Memory Subsystem
Distributed memory block architecture, consisting of dual port memory blocks
Key features
– Dual-port RAM blocks
– Multi-layer AHB for simultaneous memory access
– Dual-mode external memory interface
→ DMA interface for internal-to-external memory transfers (AHB master)
→ Buffer for processor-to-external memory transfers (AHB slaves)
– Multi-layer arbiter → priority-based
AHB Arbitration
Multi-layer arbitration scheme
– To coordinate concurrent processor-memory transfers between masters (OMNIcore, multiple RADcores, external memory DMA) and slaves (memory, external memory buffer)
A Configurable Master/Slave Port
RADbus AMBA AHB features
– Allows direct processor-to-processor communication
– Hybrid (configurable) master/slave interface
– Mode-dependent changes in AHB operation
– All write operations in master mode
– All read operations in slave mode
– Uses a first-come, first-served method for arbitration
– Low overhead ensures fast operation
Structure of the RADbus AHB scheme
– All out-ports are defined as bus masters, structured as write-only masters
– All in-ports are defined as bus slaves, structured as write-only slaves
– The number of RADcores connecting to the RADbus determines the size of the address
– Selection of which bus channel (A, B, C) is read into the RADcore is defined as a function of decoded address bits from the master in conjunction with the state of the slave
– The selection algorithm is based on a "first-come, first-served" mechanism at the read mux, controlled by an address-decoded select signal (a, b, c) for each bus
Summary
• RAMA discussed as a platform-based solution
• Uses multiple AHBs for core-to-core integration
• AHB is easily integrated into the RAMA architecture
• AHB provides well-understood, flexible interfaces
• The RADbus example shows AHB can be flexible
• The combination of OMNIcore and RADcore provides enhanced DSP and data processing
• Extends the platform to reach emerging SoC applications

Cores + Infrastructure + Integration = SoC Platform
(OMNIcore + RADcore) + (RAM) + AMBA AHB = RAMA
81
Lightweight Implementation of the POSIX Threads API for an On-Chip MIPS
Multiprocessor with VCI Interconnect
82
Contents
• Target architecture
• MIPS CPU properties
• The architecture needs
• Pthread specification
• Implementation
• Experimental setup
• Conclusion
83
Target architecture
General VCI based SoC architecture
• The system consists of one or more MIPS R3000 CPUs
• Virtual Chip Interconnect compliant interconnect
84
MIPS CPU properties
• Two separate caches for instructions and data.
• Direct-mapped caches.
• Write buffer with write-update and write-through policy.
• No memory management unit (MMU); logical addresses are physical addresses (the total memory is fixed at design time).
85
The architecture needs
• Protected access to shared data: use spin locks
– A spin lock is acquired using pthread_spin_lock
– A spin lock is released using pthread_spin_unlock
• Cache coherency
– If the interconnect is a shared bus, use snoopy caches.
• Reduce main memory traffic.
– If the interconnect is VCI-compliant (bus or network), caches need to be flushed.
• Processor identification
– CPUs must have an internal register allowing their identification within the system.
86
Pthread specification
• The main kernel objects are the threads and the scheduler.
• Executing the thread: the 'start' function call
• Thread attributes: stack size, stack address, scheduling policy
• Unique identifier for the thread
87
Pthread specification
• Changing state is done using some pthread function on a shared object.
• The transition from RUNNABLE to RUN is performed by the scheduler; the transition back from RUN to RUNNABLE uses sched_yield.
• A thread structure contains the execution context of a thread and pointers to other threads.
88
Pthread specification
• The scheduler manages 5 lists of threads.
– Symmetric Multi-Processor (SMP): the scheduler may be shared by all processors.
– Distributed: a scheduler exists on every processor.
• Access to the scheduler must be performed in a critical section, under the protection of a lock.
• Other implemented objects
– Spin lock: the low-level test-and-set access
– Mutex: serializes access to shared data
– Semaphore: sem_post is the only function that can be called in interruption handlers.
89
Implementation
• The scheduler_created variable must be declared with the volatile type qualifier to ensure that compiler will not optimize this seemingly infinite loop.
◈ Booting sequence
90
Implementation
• Context switch
– Save the current values of the CPU registers into the context variable of the thread that is currently executing.
– Set the values of the CPU registers to the values of the context variable of the new thread to execute.
– The return address of the function is a register of the context.
– Restoring a context sends the program back to where the context was saved, not to the current caller of the context-switching routine.
92
Experimental setup
• Review several types of scheduler
– Symmetric Multiprocessor (SMP)
• A unique scheduler, shared by all processors and protected
• Threads can run on any processor, and migrate
– Centralized Non-SMP (NON_SMP_CS)
• A unique scheduler, shared by all processors and protected
• Every thread is assigned to a given processor and can run only on it
– Distributed Non-SMP (NON_SMP_DS)
• As many schedulers as processors, and as many locks as schedulers
• Every thread is assigned to a given processor and can run only on it
93
Experimental setup
[Figures: execution times of the MJPEG application; cycles spent in the CPU idle loop]
◈ Motion JPEG application
94
Experimental setup ◈ COMM application
• Does not exchange data between processors.
• The only resource shared here is the bus
• The application uses the processors at about full power.
95
Conclusion
• The implementation is a bit tricky, but quite compact and efficient.
• Experimentations have shown that a POSIX compliant SMP kernel allowing task migration is an acceptable solution in terms of generality, performance and memory footprint for SoC.