Fault Tolerance Techniques: An Easy Guide For 2021

Ajay Ohri

Introduction

Working in the era of ‘Computer Systems’ is not easy, either for users or for developers. A well-designed computer system can serve its purpose well, but no system can be designed to be error-free. Systems are prone to errors, bugs, and faults that can disrupt their functioning and so compromise the user’s productivity and the quality of the results.

‘Fault Tolerance’ is the property of a system that lets it keep functioning even after some of its components fail. A fault-tolerant design may reduce throughput or increase response time while a fault is present, but it makes sure that the entire system doesn’t fail. In short, it works as a coping mechanism that aims at self-stabilization.

In the early days of computing, fault-tolerant systems were designed to alert the user or operator to a possible failure. The operator was expected to act on the alarm and correct the problem before a major breakdown occurred, which meant human intervention. Today things have changed. Systems, whether hardware or software, are designed to resolve issues on their own, with little human intervention unless it’s a major issue requiring immediate attention.

In this article let us look at:

  1. Types of Faults
  2. Fault Tolerance Techniques and Methods
  3. Hardware Fault-tolerance Techniques
  4. Software Fault-tolerance Techniques
  5. Fault-tolerance in Cloud Computing

1. Types of Faults

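Failures range from safe to deadly. When a system keeps working at reduced capability after a failure, this is called graceful degradation: for example, elevators running slowly with dim lights on backup power when the main grid supply is cut off. In computing, a website that loads its basic version when internet connectivity is weak degrades in the same way. Progressive enhancement is the related web-design approach that starts from a basic version that works everywhere and adds features when the browser and connection allow.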
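
The website example can be sketched in a few lines of Python. This is only an illustration: `render_page`, `slow_network`, and the page strings are hypothetical names, and a weak connection is simulated by raising `ConnectionError`.

```python
def render_page(fetch_full, fetch_basic):
    """Return the full page if possible, otherwise fall back to the basic one."""
    try:
        return fetch_full()
    except ConnectionError:
        # Degrade gracefully instead of failing outright.
        return fetch_basic()


def slow_network():
    # Hypothetical stand-in for a request that fails on a weak connection.
    raise ConnectionError("connection too weak for the full page")


page = render_page(slow_network, lambda: "basic page")
```

The caller still gets a usable page; only the quality of the result drops.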

2. Fault Tolerance Techniques and Methods

Computer system fault tolerance is handled at two levels: hardware fault tolerance and software fault tolerance. Hardware fault tolerance is much easier to deal with than software fault tolerance. Fault-tolerance techniques require deep knowledge, interdisciplinary work, and a thorough critical examination of the systems and how they function. Any upgrade may require significant cost and time, and may also change the size, weight, and design of the system, depending on the complexity involved.

3. Hardware Fault-tolerance Techniques

  • BIST (Built-in Self-Test)

This technique lets the system test itself at specific intervals to detect faults before they propagate. When a test signals a fault, the system reconfigures itself to switch out the faulty component and switch in a redundant one.
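
The switch-out and switch-in behaviour might look roughly like the software model below. It is a sketch, not a hardware design: `Component`, `SelfTestingUnit`, and the `healthy` flag are hypothetical, and the flag stands in for the outcome of a real test-pattern comparison.

```python
class Component:
    """Hypothetical component with a built-in self-test."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def self_test(self):
        # A real BIST would apply test patterns and compare the responses
        # with known-good values; here a flag stands in for that result.
        return self.healthy


class SelfTestingUnit:
    def __init__(self, active, spares):
        self.active = active
        self.spares = list(spares)

    def run_self_test(self):
        """Meant to be called at intervals: swap in a spare on failure."""
        while not self.active.self_test():
            if not self.spares:
                raise RuntimeError("no healthy spare left")
            self.active = self.spares.pop(0)
        return self.active.name


unit = SelfTestingUnit(Component("primary", healthy=False), [Component("spare-1")])
active_name = unit.run_self_test()
```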

  • TMR (Triple Modular Redundancy)

In this technique, three redundant copies of a component are run simultaneously on the same input. A voter compares their outputs and selects the majority result. It can tolerate a single faulty copy at a time.
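
The voting step can be sketched in Python as a bitwise majority function, which mirrors how a hardware voter is commonly built from AND and OR gates. The replica functions `good` and `faulty` are hypothetical, and the fault is simulated by flipping one bit of the output.

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three integer outputs."""
    return (a & b) | (b & c) | (a & c)


def run_tmr(replicas, x):
    # Each replica computes the same function on the same input.
    a, b, c = (replica(x) for replica in replicas)
    return tmr_vote(a, b, c)


def good(x):
    return x + 1


def faulty(x):
    return (x + 1) ^ 0b100  # simulated fault: one flipped bit


result = run_tmr([good, faulty, good], 41)  # the single fault is outvoted
```

If two replicas fail in the same way, the voter passes the wrong value, which is why the scheme only covers one fault at a time.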

  • Circuit Breaker

This is a design pattern, named after the electrical circuit breaker, that stops calls to a failing component so that one failure does not cascade into a catastrophic failure across a distributed system.
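
A minimal sketch of a circuit breaker for calls between services, assuming it opens after a fixed number of consecutive failures and allows a trial call once a reset period has passed. The class, its parameters, and the thresholds are illustrative choices, not taken from any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit is open; call skipped")
            # Reset period elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping a remote call as `breaker.call(fetch, url)` means that, while the circuit is open, callers fail fast instead of waiting on a dependency that is already known to be down.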

4. Software Fault-tolerance Techniques

These techniques, if implemented, help make the software more reliable.

  • N-Version Programming

In this technique, n versions of a program are developed independently by n developers or teams from the same specification. All the versions are run simultaneously on the same input, and a decision algorithm, typically a majority vote, selects the output. The versions are written at the development stage, and a disagreement between them at run time serves to detect faults.
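
One common way to combine the outputs of the n versions is a majority vote. The sketch below assumes the outputs are hashable and directly comparable. The three `abs_*` functions are hypothetical stand-ins for independently written versions, and they run one after another here for simplicity.

```python
from collections import Counter


def n_version_run(versions, *args):
    """Run every version and return the majority output."""
    outputs = []
    for version in versions:
        try:
            outputs.append(version(*args))
        except Exception:
            continue  # a crashing version simply casts no vote
    if not outputs:
        raise RuntimeError("all versions failed")
    value, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("no majority among versions")
    return value


# Three hypothetical, independently written implementations of abs().
def abs_v1(x):
    return x if x >= 0 else -x


def abs_v2(x):
    return max(x, -x)


def abs_v3(x):
    return x  # buggy for negative inputs


answer = n_version_run([abs_v1, abs_v2, abs_v3], -4)
```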

  • Recovery Blocks

This technique is similar to the one above, except that the redundant copies are not run simultaneously. They are written with different algorithms and are run one by one: the result of each copy is checked by an acceptance test, and the next copy runs only if that check fails. This technique is used where task deadlines are longer than the computation time, so there is time to retry.
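
A minimal sketch of a recovery block, assuming each alternate is a pure function so that no state has to be restored between attempts. The acceptance test and the two sort routines are hypothetical examples.

```python
def recovery_block(acceptance_test, alternates, *args):
    """Try each alternate in order; return the first accepted result."""
    for alternate in alternates:
        try:
            result = alternate(*args)
        except Exception:
            continue  # a crash counts as a failed attempt
        if acceptance_test(result, *args):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def looks_sorted(result, data):
    same_length = len(result) == len(data)
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return same_length and ordered


def fast_but_buggy_sort(data):
    return list(data)  # simulated bug: forgets to sort


def simple_sort(data):
    return sorted(data)


output = recovery_block(looks_sorted, [fast_but_buggy_sort, simple_sort], [3, 1, 2])
```

The acceptance test here is deliberately cheap and only checks plausibility; it does not prove that the result is correct.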

  • Check-pointing and Roll-back recovery

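Through this technique, the state of the system is saved (check-pointed) at intervals during a computation. When a fault is detected, the system rolls back to the most recent checkpoint and resumes from there instead of starting over.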
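
A minimal sketch of check-pointing and roll-back, assuming the state is saved in memory after every successful step and restored when a step fails. Real systems usually write checkpoints to stable storage. `CheckpointedTask` and the step functions are hypothetical.

```python
import copy


class CheckpointedTask:
    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._checkpoint)

    def run_step(self, step):
        """Run one step; undo its partial effects if it fails."""
        try:
            step(self.state)
        except Exception:
            self.rollback()
            return False
        self.checkpoint()
        return True


def add_item(state):
    state["items"].append(len(state["items"]))


def failing_step(state):
    state["items"].append("partial")  # partial update before the crash
    raise RuntimeError("simulated fault")


task = CheckpointedTask({"items": []})
ok_first = task.run_step(add_item)
ok_second = task.run_step(failing_step)  # rolled back to the last checkpoint
```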

  • Failure-Oblivious Computing

This technique enables computer programs to continue execution despite memory errors. It handles an invalid memory read by returning a manufactured value to the program, which then uses that value and carries on. This is unlike earlier memory checkers, which aborted the program on an invalid access.
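
The idea can be imitated in Python with a buffer that never raises on a bad index. This is only an analogy, since the technique itself works on unsafe memory accesses in compiled programs. `ObliviousBuffer` and its default manufactured value of 0 are assumptions, and discarding invalid writes goes beyond the read case described above, although it is how the technique is usually described.

```python
class ObliviousBuffer:
    """Fixed-size buffer that tolerates out-of-bounds accesses."""

    def __init__(self, size, manufactured=0):
        self._data = [0] * size
        self._manufactured = manufactured

    def _valid(self, index):
        return isinstance(index, int) and 0 <= index < len(self._data)

    def __getitem__(self, index):
        if self._valid(index):
            return self._data[index]
        return self._manufactured  # invalid read: hand back a made-up value

    def __setitem__(self, index, value):
        if self._valid(index):
            self._data[index] = value
        # Invalid writes are silently discarded.


buf = ObliviousBuffer(3)
buf[1] = 7
buf[10] = 99  # out of bounds: ignored
values = [buf[1], buf[10], buf[-1]]
```

The program keeps running, but at the price of possibly computing with made-up data, so the approach suits workloads where availability matters more than exact results.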

  • Recovery Shepherding

This technique works with the just-in-time binary instrumentation framework Pin. It attaches to the application process when an error occurs, repairs the execution, tracks the effects of the repair as execution continues, and detaches from the process once all the repair effects have been flushed from the program state. All of this happens in the background, so the program continues its usual execution.

5. Fault-tolerance in Cloud Computing

Cloud computing lets users get robust performance without having to worry about the underlying components. It is a service built on the concept of virtualization. Common fault-tolerance techniques in the cloud include:

  • Reconfiguration eliminates the faulty component from the system.
  • Check-pointing enables the continuation of a task from where it was interrupted.
  • Job Migration enables the migration of a failed task to a different system.
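
These three ideas can be combined in one small sketch. It assumes a job is a list of independent items, the checkpoint is simply the list of results completed so far, and a node failure shows up as an exception. `NodeFailure`, `run_job`, and the node names are hypothetical.

```python
class NodeFailure(Exception):
    """Hypothetical signal that a node died while running a task."""


def run_job(items, nodes, process):
    """Process items in order, migrating to the next node on failure."""
    healthy = list(nodes)  # reconfiguration removes nodes from this list
    results = []           # checkpoint: results completed so far
    while len(results) < len(items):
        if not healthy:
            raise RuntimeError("no healthy nodes left")
        node = healthy[0]
        try:
            # Resume from the checkpoint rather than from the beginning.
            for item in items[len(results):]:
                results.append(process(node, item))
        except NodeFailure:
            healthy.pop(0)  # drop the faulty node; the job migrates
    return results, healthy


def process(node, item):
    if node == "node-a" and item == 3:
        raise NodeFailure(node)  # simulated crash on node-a
    return (node, item * 2)


results, remaining = run_job([1, 2, 3, 4], ["node-a", "node-b"], process)
```

Items 1 and 2 are not recomputed after the failure, because the job resumes on `node-b` from the point where it was interrupted.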

Conclusion

Fault-tolerance techniques have disadvantages as well as advantages.

The biggest disadvantage arises when fault tolerance in one component curtails the performance of another component that depends on it. Such fault tolerance can lead to inferior products and increased costs in the long run.

Jigsaw Academy’s Postgraduate Certificate Program In Cloud Computing brings Cloud aspirants closer to their dream jobs. The joint-certification course is 6 months long, is conducted online, and will help you become a complete Cloud Professional.
