Fault Tolerance

Learn to Deal with Service Failures

Overview

To keep providing a service with little downtime, you need a system that continues operating properly even when part of it fails. It is also important to reduce how often failures occur and to recover quickly when they do. Such a system is known as a fault-tolerant system.

Redundancy is an effective way to keep operating even if part of the system fails. In a redundant configuration, multiple servers with the same role are prepared so that if one server fails, the others can continue operating. When those servers share the workload rather than waiting idle as standbys, redundancy also increases processing capacity.
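As a rough illustration of redundancy, the sketch below tries each server with the same role in turn and returns the first successful response. The names call_with_failover, broken_server, and healthy_server are hypothetical, and the "servers" are plain functions standing in for real network calls.

```python
from typing import Callable, Sequence


class AllServersFailed(Exception):
    """Raised when every redundant server fails to handle the request."""


def call_with_failover(servers: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each server in order and return the first successful response."""
    errors = []
    for server in servers:
        try:
            return server(request)
        except ConnectionError as exc:
            # This server is unavailable; remember why and try the next one.
            errors.append(exc)
    raise AllServersFailed(f"{len(errors)} server(s) failed")


# Hypothetical stand-ins for two servers with the same role.
def broken_server(request: str) -> str:
    raise ConnectionError("server A is down")


def healthy_server(request: str) -> str:
    return f"handled: {request}"


result = call_with_failover([broken_server, healthy_server], "GET /items")
print(result)
```

In practice this routing is usually done by a load balancer or a cluster manager rather than by application code, but the idea is the same: one failed server does not stop the service.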

Preventive maintenance is effective in reducing the frequency of failures. Specifically, this includes monitoring memory and processes so that you can act at the first sign of a problem, before it turns into a failure; reorganizing indexes to maintain database performance; and replacing aging servers and other equipment.
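The idea of acting at the first sign of a problem can be sketched as a two-level threshold check. This is a minimal example under assumptions: the Threshold class, the evaluate function, and the 75% and 90% limits are all hypothetical, and the memory readings are hard-coded samples rather than values read from a real monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    warn: float      # act here, at the first sign of trouble
    critical: float  # a failure is likely beyond this point


def evaluate(metric: str, value: float, threshold: Threshold) -> str:
    """Classify a usage percentage against the warning and critical limits."""
    if value >= threshold.critical:
        return f"CRITICAL: {metric} at {value:.0f}%"
    if value >= threshold.warn:
        return f"WARNING: {metric} at {value:.0f}%"
    return f"OK: {metric} at {value:.0f}%"


# Assumed limits and sample readings, for illustration only.
memory_threshold = Threshold(warn=75.0, critical=90.0)
samples = [62.0, 78.0, 93.0]
statuses = [evaluate("memory", s, memory_threshold) for s in samples]
for status in statuses:
    print(status)
```

The warning level is deliberately lower than the critical level so that there is time to respond before the system actually fails.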

To recover from a failure, it is essential to identify its cause from the log files and respond as quickly as possible. Agreeing in advance on the work procedures, from investigation through recovery, helps this go smoothly.
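As one possible starting point for a log investigation, the sketch below finds the first error in a log and counts errors per component. The log format (timestamp, level, component, message) and the names LOG_PATTERN and summarize_errors are assumptions made for this example; real log formats vary by system.

```python
import re
from collections import Counter

# Assumed line format: "<timestamp> <LEVEL> <component>: <message>"
LOG_PATTERN = re.compile(
    r"^(?P<time>\S+) (?P<level>[A-Z]+) (?P<component>\w+): (?P<message>.*)$"
)


def summarize_errors(lines):
    """Return the first ERROR line and a count of ERROR lines per component."""
    counts = Counter()
    first_error = None
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None or match["level"] != "ERROR":
            continue
        counts[match["component"]] += 1
        if first_error is None:
            first_error = line
    return first_error, counts


# Made-up sample log lines, for illustration only.
sample_log = [
    "2024-01-01T10:00:00 INFO web: request served",
    "2024-01-01T10:00:05 ERROR db: connection pool exhausted",
    "2024-01-01T10:00:06 ERROR web: upstream timeout",
    "2024-01-01T10:00:07 ERROR db: connection pool exhausted",
]
first_error, error_counts = summarize_errors(sample_log)
print(first_error)
print(error_counts.most_common())
```

The earliest error is often closer to the root cause than the errors that follow it, which is why the sketch reports it separately from the counts.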

After recovery, thoroughly investigating the cause of the problem and addressing it makes the system more resilient to failures, so services can be provided more stably.

Learning Objectives

If we can build a system that is less prone to service interruptions due to failures, we can deliver stable services to users.

After an application is released, the server load and data volume grow as the number of users increases. Response times become gradually slower, and if this is left unaddressed it may eventually lead to a failure. However, because it is difficult to eliminate every cause of failure in advance, developers need to prepare for failures before they happen.

Build a redundant system so that you can continue to serve your users while responding to failures. It is also important to incorporate preventive maintenance to reduce the frequency of failures as much as possible.

Let's learn about how to build a fault-tolerant system.

Learn from Here

Ideally, failure-related skills are learned through hands-on experience in operation and maintenance. Here, the aim is to get an overview of what those skills are.

  • Understanding the Basics of High Availability

    Failure Prevention

    Quick Recovery

    Rules for Dealing with Failures and Efforts Toward Continuous Improvement

  • Understanding the Basics of Operation and Maintenance

    System Monitoring

    Job Management

    Backup Management

Recommended Materials

  1. Fault-Tolerant Components on AWS
