
Hardware reconfigurability: from concept to implementation

Schagaev Igor1*, Castano Victor2, Kaegi Thomas2

1 London Metropolitan University 166-220, Holloway Road, N7 8DB, London, UK 2 ITACS Ltd, 157 Shephall View, SG1 1RR, Stevenage, England

* Correspondence should be addressed: [email protected], [email protected]

Abstract. Hardware reconfigurability is an essential property of the next generation of computer architectures. We introduce and explain it, showing that reconfigurability is not a static property but a process that should be supported at all stages of design and subsequent operation. A novel instrument for system reconfigurability, the Syndrome, is described, together with its application to the control of all three zones of computer architectures: active, interfacing and passive. At the lower hardware level, a special reconfiguration element, the so-called T-logic, is introduced and its operation explained. Using the Syndrome, we illustrate the control of reliability and the support of graceful degradation. Applications of the Syndrome to reconfigurability control for performance-, reliability- and energy-wise operation are discussed.

Keywords: reconfigurability; performance-, reliability-, energy-smart functioning; hardware; computer system; hardware syndrome

1. Existing myths and what is really required

The 21st century has started with a repetition of the previous century's mantra about technology limitations and the need for parallelization of computing. Parallelization is discussed at the level of tasks, software and hardware, as well as in schemes of implementation and support. Accordingly, a simple Google search (on 20.02.15) for “parallel programming” returns 6,510,000 entries, “parallel programming model” reaches 2,340,000 entries, and “parallel computing” is healthy at 13,400,000.

At the same time, typical desktops with 4 or 8 processors, now renamed “cores”, consume from 350 to 850 Watts and, surprisingly, do not provide an order of magnitude higher performance.

Similarly, the transfer of a similar number of “cores” to mobile devices did not gain much either: we now recharge our new mobile phones every day. Add to this that our new devices fail to operate much more often: they hang for no reason, restart themselves and lose battery charge inexplicably. All this indicates that new designs do not really address the performance, reliability or power-efficiency challenges of new applications.

Accounting, banking, health monitoring and similar applications are becoming critical and more widespread; router clusters and network centers consume a visible share of the power generated by power stations. One might ask: “what is missing?” and “why are our designs not greatly efficient?”

Clearly, the myth of parallelization and the success of its implementation did not play the role of a “silver bullet”. Surprisingly, the term “reconfigurability” is much less popular in a Google search - 243,000 entries.

What is missing? What properties do we expect computer systems to have in the near future? One of those properties is flexibility,


the ability to evolve [2]. In turn, system flexibility assumes the presence of hardware flexibility and system software flexibility.

System flexibility can be considered as several connected properties and supporting features, as described in [1],[2],[3],[4]. We list: flexibility (of hardware and system software), resilience, scalability (task-wise, frequency-wise, technology-wise), and performance-, reliability- and energy-smart functioning. We aggregate these properties under the name of PRE-smart systems (performance-, reliability-, energy-). We argue that all the mentioned properties are required and should be implemented within a computer system. The good point is that this is possible, as all of them are based, at some level, on reconfigurability.

Therefore, efficient design and implementation of reconfigurability becomes a task of utmost importance for the next generation of computer systems.

2. Reconfigurability: a concept

As defined in [1], [2], [3], an evolving system (EvSy), in terms of evolving properties, might be considered and designed as a pair of hardware and system software; formally:

EvSy := <HW, SSW>

Both main elements, hardware and system software, must possess their own reconfigurability features and support each other: a) Hardware support of system software reconfigurability and b) System software support of hardware reconfigurability.
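The pair and its two mutual-support directions can be sketched in code. This is a minimal structural illustration only; all class and method names below are our own assumptions, not part of the EvSy definition in [1], [2], [3].

```python
# Minimal structural sketch of EvSy := <HW, SSW>; class and method
# names are assumptions made for illustration.
class Hardware:
    def __init__(self):
        self.configuration = "default"

    def reconfigure(self, configuration):
        # (a) hardware carries out reconfiguration on behalf of the SSW
        self.configuration = configuration


class SystemSoftware:
    def request_reconfiguration(self, hw, configuration):
        # (b) system software decides when and why the hardware reconfigures
        hw.reconfigure(configuration)


class EvSy:
    """An evolving system as the pair <HW, SSW>."""
    def __init__(self):
        self.hw = Hardware()
        self.ssw = SystemSoftware()


evsy = EvSy()
evsy.ssw.request_reconfiguration(evsy.hw, "energy-saving")
```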

System reconfigurability must be introduced at the design level and pursued at all other levels, especially for critical applications. System reconfigurability should also be considered throughout the entire life cycle.

Therefore, system reconfigurability becomes not only a feature or property of a system, but a process. This process serves the need of introducing and maintaining reconfigurability, including ways of changing configurations and the ability to reconfigure.

For example, during critical missions, reconfigurability should be executed within real-time constraints, invisibly to applications.

In turn, during regular operation, system reconfigurability should be used to adapt the system to different requirements in terms of efficient performance, reliability and power consumption. We call this PRE-smart functioning as defined in [2].

Reconfigurability for reliability should be implemented in a way that supports the ability of the system to recover with minimum time overheads. It is worth mentioning here that reconfiguration might have internal and external reasons. For instance, the system might exclude or isolate some hardware elements from the configuration due to a transient/permanent hardware fault detected via checking schemes (external reason). This isolation should be considered as a process and be “fine-tuned” by minimizing the hardware loss. Additionally, with energy saving in mind, the system could set up a simpler hardware configuration for a particular task execution (internal reason).

In energy saving functioning, reconfigurability has to provide a mechanism to disconnect or switch to a lower consumption mode all hardware elements that are not required for the active program processes.

3. Reconfigurability implementation

Hardware of computer systems in terms of information processing consists of three semantically different zones: active, passive and interfacing, as Figure 1 illustrates.

Fig. 1 Computer zones of information processing

First, the information transformation area - further called the active zone (AZ); second, the information storage area - called the passive zone (PZ). The interconnection of these zones is the interfacing zone (IZ).

Active Zone: The active zone consists of the microprocessor elements including the arithmetic unit (AU) and logic unit (LU). Both units are separated for better fault isolation and easier implementation of hardware tests and, primarily, reconfigurability.

Interfacing Zone: This includes all communication components such as the memory buses and the reconfiguration logic. A configurable bus allows the reconfiguration of the hardware to exclude any failed hardware components and switch into a degraded state, or to replace the failed component with a working one.

Passive Zone: This includes basic storage systems, such as memory, that do not act by themselves but are handled by controllers or devices that save data.

All three zones have different properties and might use different redundancy mechanisms to tolerate internal faults and to reconfigure efficiently for performance, reliability or energy-saving purposes. The proposed computer structure of each zone is shown in Fig. 2.

Fig. 2 Reconfigurability of computer system

Note that efficient reconfigurability can be achieved when it is implemented with minimum deliberate redundancy being introduced into the system [4]. In our case hardware redundancy exists in the form of buffers, register files, replicated memory modules, majority voting schemes and interfacing logic.

With regards to SSW, some extra elements required to support reconfigurability and fault tolerance are: checkpoint monitor, recovery point monitor, process synchronization and reconfiguration monitor. These are named monitors to express their uninterruptible mode of operation [1].

All three zones (see Fig. 1 and Fig. 2) must be reconfigurable for their own purposes as well as at the request of other zones. Each zone might have different reconfiguration properties. Interactions between zones define the level of reconfigurability and flexibility of the architecture.

4. System Syndrome

4.1 Definitions

As mentioned earlier, the new system property must be supported by hardware and system software implementation of all the processes that constitute this property. For this purpose we introduce a special hardware scheme called a Syndrome.

The term syndrome is New Latin (origin 1535-45), derived from the Greek syndromē, where “syn-” stands for combination, concurrence. For our purposes a Syndrome is not just passive, i.e. presenting “a snapshot status” of a system, but also active, serving as a tool to control the system configuration. For us a Syndrome is

“a group of related or coincident things, events, actions, signs and symptoms that characterize a particular abnormal condition”.

A Syndrome might also help to answer the questions that have been omitted in the vast majority of research on fault tolerance and performance: “what provides the fault tolerance of the system?” and “how big a performance, reliability or energy-saving gain might be achieved?”


It is usually assumed that the hardware core logic is ultra-reliable and guarantees control of configuration and reconfiguration. Unfortunately, using homogeneous redundancy limits the reliability gain - since techniques based on the same type of redundancy are vulnerable to the same threats. Hybrid techniques based on heterogeneous redundancy can be more effective.

Thus, even when memory or processor checking schemes detect an error and transfer information to the Syndrome, this information might not be useful if the system does not include either one or both of the following. “External elements” are responsible for exercising reconfiguration and making decisions on configuration/reconfiguration; reconfiguration might be initiated externally, by other system elements, to create the best-fit configuration for a task. “Internal elements”, if necessary, are capable of initiating the required sequence of reconfiguration for internal purposes and reasons - faults, errors or power saving.

Indeed, in regular computing systems, when there are faults in the processing logic, to expect that it is able to perform self-healing and then control and monitor the configuration of the rest of the system looks like part of a fairy tale, not engineering.

There is a solution though, as described below. To be able to absorb any trustworthy information about the status of system elements we have to aggregate all checking and status signals about the condition of registers, memory, AU and LU as well as control unit. This aggregating scheme we call a Syndrome.
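The aggregation step can be sketched as follows. The list of checked elements comes from the text above, but the bit assignments and function names are hypothetical illustrations, not the actual register layout.

```python
# Hypothetical sketch of the aggregating scheme: one error bit per checked
# element; bit positions and names are our own illustration.
CHECKED_ELEMENTS = ["registers", "memory", "AU", "LU", "control_unit"]

def aggregate_syndrome(status):
    """Pack the checking/status signals into a single syndrome word."""
    word = 0
    for bit, name in enumerate(CHECKED_ELEMENTS):
        if status.get(name, False):   # True = checking scheme flagged an error
            word |= 1 << bit
    return word

# A non-zero word would be latched in the Syndrome register and
# trigger a reconfiguration interrupt.
fault_word = aggregate_syndrome({"memory": True})
```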

Clearly, the reconfigurability of the passive, interfacing and active zones differs. Therefore, a scheme of implementation of reconfigurability should separate the passive zone and the active zone of the proposed architecture.

A clear separation of the functions of processing (data operations) and storing (memory) makes it possible to apply various checking, recovery and reconfiguration solutions, making the system more flexible. The Syndrome acts as a control center for three main functions: fault monitoring, reconfigurability and recovery.

Fig. 3 illustrates the application of the Syndrome.

Fig. 3 Syndrome functions for reconfigurability


These three functions serve for the purpose of performance, reliability and power efficiency.

The principle and function of the Syndrome from the system software point of view are presented using our evolving system architecture (ERA), as described in [3]; here we address the hardware configuration concept that might help to implement reconfigurability as an essential property of the next generation of computer systems.

4.2 Implementation

From a hardware point of view the Syndrome is represented as a special register that interacts with the system via hardware interrupt schemes. Semantically, the structure of the Syndrome is subdivided into three different areas (Fig. 4): the Fault control, Configuration control and Power control areas.

Fig. 4 Syndrome for reconfigurable architecture

4.3 Memory Reconfigurability

The configuration area of the Syndrome reflects the memory mode that ERA is currently using. The Bit mode field defines whether the addressing mode is 16- or 32-bit, whereas the L/R field defines whether the memory banks are in linear or redundant mode. Bit mode “0” means RAM is used in 16-bit addressing mode; mode “1” is 32-bit addressing mode. RAM modes define how memory can be used: main (“0”) or redundant (“1”). RAM Module 1, 2, 3 and 4 represent whether the respective memory module is powered: “0” = Power Off; “1” = Power On. The power management area reflects the status of the modules in terms of power.
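The configuration-area fields described above can be sketched as bit masks. The paper names the fields (Bit mode, L/R, RAM Module 1-4) but not their bit positions, so the layout below is an assumption for illustration only.

```python
# Assumed bit layout for the configuration area; the actual positions
# in the 32-bit Syndrome register are not specified in the text.
BIT_MODE   = 1 << 0                              # 0 = 16-bit, 1 = 32-bit addressing
LINEAR_RED = 1 << 1                              # 0 = linear (main), 1 = redundant
RAM_MODULE = [1 << (2 + i) for i in range(4)]    # 0 = power off, 1 = power on

def describe(config):
    """Decode a configuration-area value into readable form."""
    return {
        "addressing": 32 if config & BIT_MODE else 16,
        "mode": "redundant" if config & LINEAR_RED else "main",
        "powered_modules": [i + 1 for i in range(4) if config & RAM_MODULE[i]],
    }

# Example: 32-bit addressing, redundant mode, modules 1 and 2 powered
cfg = BIT_MODE | LINEAR_RED | RAM_MODULE[0] | RAM_MODULE[1]
```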

The combination of the three areas of the Syndrome - Fault management, Configuration management and Power management - defines and controls the state of the system. For example, a memory module could be in the following states: faulty, failed, stand-by, idle and off-power. The failed state is assigned to a malfunctioning element by the checking schemes, when the real reason behind the malfunction and the ability to operate further are still not clarified. When reconfiguration is initiated by software, the states of the hardware elements as well as the state of the Syndrome might be mirrored in system memory.

Without a doubt, the Syndrome is one of the most critical parts of the system. For reliability purposes, the Syndrome should be made virtually failure-free, for example by implementing three copies of the 32-bit Syndrome register connected to a voter within the processing element.
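The triplicated-register-plus-voter idea reduces, bit by bit, to the classic majority function; a minimal sketch:

```python
# Bitwise majority vote over three copies of the 32-bit Syndrome register;
# a single corrupted copy is out-voted independently in every bit position.
def vote(a, b, c):
    return (a & b) | (b & c) | (a & c)

good = 0x000000C3
corrupted = good ^ 0x00000010   # a single-bit flip in one copy
assert vote(good, corrupted, good) == good
```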

Another option that addresses the complexity would be low-level hardening techniques and/or the use of different technologies (such as flash memory) just for the Syndrome register.

The Syndrome scheme makes it possible to monitor the configuration of memory in a flexible manner, allowing the platform to adapt to different application requirements.

For aerospace applications, for example, a flight control system requires the highest reliability, which can be achieved, for example, by using duplication for ROM and triplication with a spare for RAM. On the downside of this brute-force replication approach we face an efficiency problem: the hardware resources available for program execution become much smaller, i.e. only one fourth of the total amount of available memory is used.

This also implies that only the most critical programs should run on this system, all non-safety critical programs should be moved to another system, making flexibility and efficiency even harder to achieve.

When we act on redundant schemes only when we need them - using a Syndrome that monitors the chosen configuration - we are able to exploit the system much more efficiently: permanent faults are tolerated by reconfiguring the memory, excluding the faulty unit and, if possible, including a spare one. A Syndrome allows the standard classic triplicated scheme of memory or processor to be used with much higher efficiency.

Thus the Syndrome allows expansion of the working states for memory and other hardware elements of the architecture.

In contrast to the classic triplicated configuration with 3 working states, the reconfigurability of the system supported by the proposed Syndrome makes overall reliability much higher than known schemes. Our proposed architecture might implement and be operational in 14 different working states, as the Markov reliability diagram in Fig. 5 illustrates. The dotted lines illustrate the toleration of malfunctions; λ and μ stand for the fault and recovery rates respectively. For example, a modified triplicated scheme [5] combined with system reconfiguration for malfunction tolerance [6] enables a reliability boost of the order of five. Markov analysis of reliability is useful for the pre-design phase of computer systems. In turn, real-time functioning requires an implementation of reconfigurability during the mission, meeting reliability and performance requirements. Therefore, using a Syndrome makes the PRE requirements achievable.
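As a toy illustration of such Markov analysis (deliberately not the 14-state model of Fig. 5), the sketch below numerically integrates a 3-state chain for triplicated RAM with reconfiguration: all modules healthy, degraded after a faulty module is excluded, and failed. The rates λ and μ used here are assumed values, not taken from the paper.

```python
# Toy Markov sketch, not the paper's 14-state model. States: 0 = three RAM
# modules healthy, 1 = degraded (faulty module excluded), 2 = failed.
# lam and mu are assumed per-module fault and recovery rates (per hour).
lam, mu = 1e-4, 1e-1

def reliability(t_end, dt=1.0):
    """Euler-integrate the Chapman-Kolmogorov equations up to t_end hours."""
    p0, p1, p2 = 1.0, 0.0, 0.0
    for _ in range(int(t_end / dt)):
        d0 = (-3 * lam * p0 + mu * p1) * dt              # a fault hits one of 3 modules
        d1 = (3 * lam * p0 - (2 * lam + mu) * p1) * dt   # degraded: recover or fail
        d2 = (2 * lam * p1) * dt                         # second fault before recovery
        p0, p1, p2 = p0 + d0, p1 + d1, p2 + d2
    return 1.0 - p2   # probability the system has not failed by t_end

print(reliability(10_000.0))
```

With μ much larger than λ, the chain spends almost all of its time in the healthy state, which is exactly the gain that recovery-driven reconfiguration buys over a static triplicated scheme.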

Figure 5 Reliability diagram for reconfigurable RAM using Syndrome

One of the schemes for making the memory of the system really reliable and flexible is presented in Fig. 6. The Syndrome register is directly connected to the Memory Management Unit (MMU), which is an extended memory controller with reconfiguration support at runtime. The MMU manages the connectivity of the memory, and configures and reconfigures the working mode to 16-bit single memory, 32-bit double memory with master/slave configuration, or any of the memory addressing schemes available. The Configuration and Power management flags of the Syndrome describe the different states of the memory modules. Different values in the configuration area of the Syndrome select the bank used and the mode. The output memory lines of the processor determine a location within a memory bank, whereas the Configuration and Power areas of the Syndrome specify which banks are to be used and in which mode. By using this method we can increase the independence of software/hardware configurations for the PRE purposes. Memory addresses within the code do not need to be rearranged, as code integrity is a crucial requirement for safety-critical systems. Special logic schemes - configurators, as shown in Fig. 2 - are physically included in the MMU.
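The decode step - processor address lines pick the location, Syndrome flags pick the banks and the mode - can be sketched as below. The bank-size constant and the linear-mode interleaving rule are our own assumptions for illustration; the actual MMU decode is not specified in the text.

```python
# Hypothetical decode: the Syndrome's configuration/power flags choose the
# bank set and mode; the processor's address lines choose the location.
BANK_SIZE = 0x10000   # assumed bank size

def select_banks(address, powered_banks, redundant):
    """Return the banks an access goes to (redundant = mirrored mode)."""
    if redundant:
        return list(powered_banks)   # mirrored access to every powered bank
    index = (address // BANK_SIZE) % len(powered_banks)
    return [powered_banks[index]]    # linear mode: one bank per address range

def offset_in_bank(address):
    return address % BANK_SIZE       # location within the chosen bank
```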

Figure 6 Syndrome use for memory control

5. Reconfigurability - executive element

Using the proposed concept and implementation of the Syndrome, one might implement reconfiguration of the system providing interconnection and dynamic inclusion or exclusion of hardware components from the working configuration.

For this, we suggest using a so-called “T-logic inter-connector”, illustrated in Figures 2, 3 and 6: an idealized concept of a hardware switch that, from the system point of view, behaves like a switch in the form of a “T”.

The “T” can connect or disconnect hardware elements for the purposes of fault containment, power saving or performance gain. The “rotating” of the “T” is virtual and used for illustrative purposes. A detailed description of the T-element design is beyond the purpose of this paper for reasons of IPR.


This T-logic serves in the hardware architecture as a scheme that executes a configuration defined by the software or hardware. For example, let us suppose that a fault is detected in a hardware element of a system, so that this element cannot be involved in further program execution, either on a temporary or permanent basis. It should be excluded from the working configuration “until further notice”. The four T-logic inter-connectors, one per memory module, are physically contained in the T-logic Management Unit (TLMU). Using the TLMU enables the memory to be configured and reconfigured according to all supported modes shown in Table 1 and supports module isolation and power management.

Table 1 System configuration using T-logic scheme

(The Configuration column of Table 1 shows the T-element positions; its diagrams are not reproduced here.)

Maximum reliability: the “T” logic connects all three components with the processor. The top system component acts as the leading element; the remaining system elements compare the results and participate in voting.

Maximum energy saving: the “T” element connects only one system component with the processor, while the rest are idle.

Maximum performance: all three components are used for maximum hardware capacity. When application performance is the main priority, this configuration fits the purpose.
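The three Table 1 configurations can be captured as data for a reconfiguration monitor to select from; the mode names and field choices below are illustrative only.

```python
# The three Table 1 configurations as data; names are illustrative.
T_LOGIC_MODES = {
    "reliability": {"connected": 3, "voting": True,  "idle": 0},  # max reliability
    "energy":      {"connected": 1, "voting": False, "idle": 2},  # max energy saving
    "performance": {"connected": 3, "voting": False, "idle": 0},  # max capacity
}

def configure(goal):
    """Pick the T-logic configuration matching the current PRE priority."""
    return T_LOGIC_MODES[goal]
```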

6. Conclusion

- System architectures are considered as having new properties such as higher efficiency in terms of performance, reliability and energy-smart functioning.
- The structural organization of computer systems is introduced in terms of information processing, including active, passive and interfacing zones.
- System-level reconfigurability can be implemented using a new concept and implementation of a hardware element called the Syndrome; the Syndrome aggregates essential information about hardware conditions.
- The functions of the Syndrome for reliability, performance and energy-smart functioning have been described and explained.
- Reconfigurability of a real-time architecture at the system level was proposed and analyzed in the context of each zone. With regard to the interfacing zone, but not limited to it, we propose a new hardware element (T-logic) as a basic element of execution of reconfiguration, making different configurations possible.
- We explained how flexibility, reliability and power-smartness can be achieved using the T-elements.
- Taking into account that memory usage has, by design, a high impact on system reliability and power consumption, reconfigurability of the passive zone has been analyzed and described, with an explanation of configuration control and the phases of hardware degradation.

7. References

[1] Kaegi T., Schagaev I. System Software Support of Hardware Efficiency. IT-ACS Ltd, 2013, ISBN 978-0-9575049-0-5.
[2] Monkman S., Blaeser L., Schagaev I. Evolving systems. Proc. FCS'14, ISBN 1-60132-270-4, Editors: Hamid R. Arabnia, George A. Gravvanis, George Jandieri, Ashu M. G. Solo, Fernando G. Tinetti, pp. 169-179, 2014. http://worldcomp-proceedings.com/proc/p2014/FCS3102.pdf
[3] Castano V., Schagaev I. Resilient Computer System Design. Springer, 2015, ISBN 978-3-319-15068-0.
[4] http://faculti.net/video?v=68
[5] Buhanova G., Schagaev I. Comparative Study of Fault Tolerant RAM Structures. Proc. IEEE Dependable Systems and Networks Conference, Göteborg, July 2001.
[6] Schagaev I. Reliability of malfunction tolerance. Proc. Comp. Science and Information Technology Conf., IMCIT08, pp. 733-737.