Hardware Fault Tolerance and Redundancy

Most real-time systems must function with very high availability, even under hardware fault conditions. This article covers several techniques used to minimize the impact of hardware faults.

Redundancy Schemes

Real-time systems are equipped with redundant hardware modules. Whenever a fault is encountered, the redundant modules take over the functions of the failed hardware module. Hardware redundancy may be provided in one of the following ways:

One for One Redundancy

Here, each hardware module has a redundant hardware module. The hardware module that performs the functions under normal conditions is called Active, and the redundant unit is called Standby. The standby monitors the active unit at all times and takes over as active if the active unit fails. Because the standby has to take over under fault conditions, it has to keep itself synchronized with the active unit's operations.

Since the probability of both units failing at the same time is very low, this technique provides the highest level of availability. The main disadvantage is that it doubles the hardware cost.
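
The monitoring and takeover behaviour described above can be sketched in Python. This is only an illustration under stated assumptions: the Unit and StandbyMonitor classes, the poll() method and the three-missed-checks threshold are hypothetical names and values chosen for the example, not part of any particular system.

```python
class Unit:
    """One hardware module in an active/standby pair (hypothetical model)."""

    def __init__(self, name, role):
        self.name = name
        self.role = role          # "active", "standby" or "failed"
        self.healthy = True


class StandbyMonitor:
    """Standby-side logic: check the active unit and take over after repeated misses."""

    def __init__(self, active, standby, max_missed=3):
        self.active = active
        self.standby = standby
        self.max_missed = max_missed   # assumed threshold, not taken from the article
        self.missed = 0

    def poll(self):
        """Call once per monitoring interval; returns True if a takeover happened."""
        if self.active.healthy:
            self.missed = 0
            return False
        self.missed += 1
        if self.missed >= self.max_missed and self.standby.role == "standby":
            self.active.role = "failed"
            self.standby.role = "active"
            return True
        return False
```

Requiring several consecutive missed checks before switching is a design choice made here to avoid a takeover on a single transient glitch; a real system would tune this against its recovery time target.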

N + X Redundancy

In this scheme, if N hardware modules are required to perform system functions, the system is configured with N + X hardware modules; typically X is much smaller than N. Whenever any of the N modules fails, one of the X modules takes over its functions. Since health monitoring of N units by X units at all times is not practical, a higher level module monitors the health of the N units. If one of the N units fails, it selects one of the X units to take over. (Note that one for one redundancy is a special case of N + X.)

The advantage lies in the reduced hardware cost of the system, as only X units are required to back up N units. However, in case of multiple failures, this scheme provides lower system availability, because once more than X units have failed there is no spare left to take over.
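
A minimal sketch of the higher level module that hands out spares, assuming a simple in-memory model. The SparePool class and its report_failure() method are hypothetical names introduced for illustration.

```python
class SparePool:
    """Higher level module that tracks N working units and X spares (hypothetical)."""

    def __init__(self, working, spares):
        # Each function slot is named after the unit that originally served it.
        self.assignment = {unit: unit for unit in working}
        self.spares = list(spares)

    def report_failure(self, failed_unit):
        """Replace a failed unit with a spare; return the spare, or None if none is left."""
        for slot, unit in self.assignment.items():
            if unit == failed_unit:
                if not self.spares:
                    self.assignment[slot] = None   # slot stays out of service
                    return None
                spare = self.spares.pop(0)
                self.assignment[slot] = spare
                return spare
        raise ValueError(f"{failed_unit} is not serving any slot")
```

With four working units and a single spare, the first failure is covered, but a second failure finds the pool empty and leaves that slot unserved, which is the reduced availability under multiple failures noted above.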

Load Sharing

In this scheme, under zero fault conditions, all the hardware modules that are equipped to perform system functions share the load. A higher level module performs the load distribution. It also maintains the health status of the hardware units. If one of the load sharing modules fails, the higher level module starts distributing the load among the rest of the units. Performance degrades gracefully as hardware fails.

Here, there is almost no extra hardware cost to provide the redundancy. The main disadvantage is that if a hardware failure happens during the busy hour, the system will perform at a sub-optimal level until the failed module is replaced.
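
The load distribution described above might look like the following sketch. The LoadDistributor class is hypothetical, and round-robin is just one possible distribution policy picked for the example; the article does not prescribe one.

```python
class LoadDistributor:
    """Higher level module that spreads work over the healthy units (hypothetical)."""

    def __init__(self, units):
        self.health = {unit: True for unit in units}   # health status per unit
        self._next = 0

    def mark_failed(self, unit):
        self.health[unit] = False

    def dispatch(self, request):
        """Return the unit chosen for this request, round-robin over healthy units."""
        healthy = [u for u, ok in self.health.items() if ok]
        if not healthy:
            raise RuntimeError("no healthy units left")
        unit = healthy[self._next % len(healthy)]
        self._next += 1
        return unit
```

After mark_failed() is called for a unit, later requests are spread over the remaining units only, so each survivor carries a larger share of the load.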

Network Load Balancing

Network load balancing is a different flavor of load sharing where there is no higher level processor to perform load distribution. Instead, the load distribution is achieved by hashing on the source address bits. For example, many high traffic websites perform load sharing by broadcasting the HTTP GET request over the Ethernet to all the load sharing machines. The network cards on the load sharing machines are configured to pass a certain portion of the HTTP GET requests to the main computer. The remaining requests are filtered out, as they will be handled by other machines. If one of the load sharing machines fails, the filter settings on all the active machines are modified to redistribute the traffic.
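
The filtering decision can be sketched as follows. This is an assumption-laden illustration: the function names are hypothetical, and taking the source address modulo the number of live machines stands in for whatever hash a real network card filter would use.

```python
import ipaddress


def owner_index(source_ip, live_machines):
    """Map a source address to one of the live machines (simple modulo hash)."""
    return int(ipaddress.ip_address(source_ip)) % len(live_machines)


def accepts(machine, source_ip, live_machines):
    """Filter decision made independently on each machine for a broadcast request."""
    return live_machines[owner_index(source_ip, live_machines)] == machine
```

Every machine evaluates the same function over the same list of live machines, so exactly one of them accepts any given request. When a machine fails, all survivors switch to the shortened list, which is the filter update described above.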

Standby Synchronization

For redundancy to work, the standby unit needs to be kept synchronized with the active unit at all times. This is required so that the standby can step into the active's shoes if the active fails. Standby synchronization can be achieved in the following ways:

Bus Cycle Level Synchronization

In this scheme, the active and the standby are locked at the processor bus cycle level. To keep itself synchronized with the active unit, the standby unit watches each processor instruction that is performed by the active. Then, it performs the same instruction in the next bus cycle and compares the output with that of the active unit. If the outputs do not match, the standby might take over and become active. The main disadvantage here is that specialized hardware is needed to implement this scheme. Also, bus cycle level synchronization introduces wait states in bus cycle execution. This will lower the overall performance of the processor.
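
Real bus cycle level synchronization is done in hardware, but the compare-every-step idea can be modeled in software. In this sketch, run_lockstep and the two executor callables are hypothetical stand-ins for the two processors.

```python
def run_lockstep(instructions, active_exec, standby_exec):
    """Run each instruction on both units and stop at the first output mismatch.

    Returns the index of the mismatching instruction, or None if all outputs agree.
    """
    for index, instruction in enumerate(instructions):
        active_out = active_exec(instruction)
        standby_out = standby_exec(instruction)   # one bus cycle later in real hardware
        if active_out != standby_out:
            return index
    return None
```

A non-None result marks the point at which the standby would consider taking over. The sketch only detects the disagreement; deciding which of the two units is actually faulty needs additional information.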

Memory Mirroring

Here, the system is configured with two CPUs and two parity based memory cards. One of the CPUs is active and the other is standby. Both memory cards are driven by the active CPU. No memory is attached to the standby unit. Each memory write by the active is made to both memory cards. The data bits and the parity bits are updated individually on both memory cards. On every memory read, the outputs of the two memory cards are compared. If a mismatch is detected, the processor trusts the memory card whose parity bit is correct. The other memory card is marked as suspect and a fault trigger is generated.

The standby unit continuously monitors the health of the active unit by a sanity punching or watchdog mechanism. If a fault is detected, the standby takes over both memory cards. Since the application context is kept in memory, the new active processor inherits the application context.

The main disadvantage here is that specialized hardware is needed to implement this scheme. Also, memory mirroring introduces wait states in bus cycle execution. This will lower the overall performance of the processor.
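
The mirrored write and compare-on-read behaviour can be modeled in software as shown below. This is a sketch only: MirroredMemory and parity() are hypothetical names, the parity is a single even-parity bit per word, and the read path assumes that at most one card is corrupted at a time.

```python
def parity(value):
    """Even-parity bit for an integer data word."""
    return bin(value).count("1") % 2


class MirroredMemory:
    """Two parity-protected memory cards written together by the active CPU."""

    def __init__(self):
        self.cards = [{}, {}]        # each card maps address -> (data, parity bit)
        self.suspect = None          # index of the card marked as suspect, if any

    def write(self, address, data):
        for card in self.cards:
            card[address] = (data, parity(data))

    def read(self, address):
        """Compare both cards; on a mismatch, trust the card whose parity checks out."""
        (d0, p0), (d1, p1) = self.cards[0][address], self.cards[1][address]
        if (d0, p0) == (d1, p1):
            return d0
        good = 0 if parity(d0) == p0 else 1   # assumes only one card is bad
        self.suspect = 1 - good
        return self.cards[good][address][0]
```

A single parity bit catches an odd number of flipped bits in a word, so this model, like the scheme it illustrates, cannot arbitrate every possible corruption.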

Message Level Synchronization

In this scheme, the active unit passes all the messages received from external sources to the standby. The standby performs all the actions as though it were active, with the difference that no output is sent to the external world. The main advantage here is that no special hardware is required to implement this. The scheme is practical only where the processor is required to make fairly simple decisions. With complex decisions, synchronization can easily be lost if the two processors make different decisions on the same input message.
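
A toy model of this scheme, assuming a trivial application that acknowledges each message. The Processor class and the deliver() helper are hypothetical and exist only to show the standby processing the same input while suppressing its output.

```python
class Processor:
    """Toy application: counts messages per source and replies with an acknowledgement."""

    def __init__(self, is_active):
        self.is_active = is_active
        self.counts = {}
        self.sent = []               # replies that reached the external world

    def handle(self, source, payload):
        self.counts[source] = self.counts.get(source, 0) + 1
        reply = f"ack {source} #{self.counts[source]}"
        if self.is_active:           # the standby computes the reply but suppresses it
            self.sent.append(reply)
        return reply


def deliver(active, standby, source, payload):
    """The active handles an external message and passes the same message to the standby."""
    active.handle(source, payload)
    standby.handle(source, payload)
```

Because handle() is fully deterministic here, both units always end up with identical counts. Any decision that depends on timing, randomness or local resource state would break that property, which is the loss of synchronization risk mentioned above.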

Checkpoint Level Synchronization

To some extent, this scheme is like message level synchronization, as the active conveys synchronization information to the standby in the form of messages. The difference is that not all external messages are conveyed. Information is conveyed only about predefined milestones. For example, in a call processing system, checkpoints may be passed only when a call reaches conversation or is cleared. If the standby takes over, all the calls in conversation are retained, whereas all calls in transient states are lost. Resource information for the transient calls may be retrieved by running software audits with other modules. This scheme is not prone to loss of synchronization under normal conditions. Also, the message traffic to the standby is reduced, thus improving the overall performance of the active.
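
The call processing example can be sketched as follows. The ActiveCallProcessor class, the state names other than conversation and cleared, and the use of a shared dictionary in place of a real link to the standby are all assumptions made for illustration.

```python
class ActiveCallProcessor:
    """Active unit that checkpoints a call only at the two milestones named above."""

    def __init__(self, standby_calls):
        self.calls = {}                       # call id -> current state
        self.standby_calls = standby_calls    # stands in for the link to the standby

    def set_state(self, call_id, state):
        self.calls[call_id] = state
        if state == "conversation":
            self.standby_calls[call_id] = state       # checkpoint: call is stable
        elif state == "cleared":
            self.standby_calls.pop(call_id, None)     # checkpoint: call is gone
            del self.calls[call_id]
```

A call that is still in a transient state never appears in standby_calls, so it would not survive a takeover, matching the behaviour described in the text.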

Reconciliation on Takeover

In this scheme, there is no synchronization between the active and the standby. When the standby takes over, it recovers the processor context by requesting information from other modules in the system. The advantage of this scheme lies in its simplicity of implementation. Also, there is no performance overhead due to mate synchronization. The disadvantage is that the standby takeover may be delayed by the time needed for reconciliation.
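
A minimal sketch of reconciliation, assuming each peer module can report which of its resources are in use and by which call. PeerModule, report() and reconcile() are hypothetical names; a real system would exchange these reports as messages.

```python
class PeerModule:
    """Another module in the system that knows which of its resources are in use."""

    def __init__(self, name, resources):
        self.name = name
        self.resources = resources       # resource id -> call id using it

    def report(self):
        return dict(self.resources)


def reconcile(peers):
    """Rebuild the per-call context by asking every peer what it holds."""
    context = {}
    for peer in peers:
        for resource, call_id in peer.report().items():
            context.setdefault(call_id, []).append((peer.name, resource))
    return context
```

The new active unit cannot resume service until every peer has answered, which is the source of the takeover delay noted above.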