Fault Tolerance

Overview

To keep providing services with little downtime, a system must be able to continue operating properly even if part of it fails. It is also important to reduce how often failures occur and to recover quickly when they do. A system designed this way is called a fault-tolerant system.

Redundancy is an effective way to keep operating when part of the system fails. In a redundant configuration, multiple servers with the same role are prepared so that if one fails, the others continue to serve requests. When the servers share the workload, redundancy also increases processing capacity.
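
As a rough illustration of the idea, the sketch below tries each redundant server in turn and returns the first successful response. The servers are modeled as plain Python functions, and the names `call_with_failover`, `healthy_server`, and `failed_server` are hypothetical, chosen only for this example; a real deployment would typically rely on a load balancer or similar component instead.

```python
from typing import Callable, Sequence


def call_with_failover(servers: Sequence[Callable[[str], str]], request: str) -> str:
    """Send the request to each redundant server in order.

    Returns the first successful response. Raises RuntimeError only if
    every server fails.
    """
    errors = []
    for server in servers:
        try:
            return server(request)
        except ConnectionError as exc:
            # This server is unavailable, so fall through to the next one.
            errors.append(exc)
    raise RuntimeError(f"all {len(servers)} servers failed: {errors}")


# Hypothetical stand-ins for two servers with the same role.
def healthy_server(request: str) -> str:
    return f"handled {request}"


def failed_server(request: str) -> str:
    raise ConnectionError("server is down")


if __name__ == "__main__":
    # The first server fails, and the second one answers in its place.
    print(call_with_failover([failed_server, healthy_server], "GET /status"))
```

The caller never sees the failure of the first server, which is the point of redundancy: the service as a whole stays available while the failed part is repaired.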

Preventive maintenance reduces the frequency of failures. Specifically, this includes monitoring memory and processes so that problems can be addressed at the first sign of trouble, reorganizing indexes to keep database performance from degrading, and replacing aging servers and other equipment.
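
Acting "at the first sign of a problem" usually means comparing measurements against warning thresholds that sit below the point of failure. The following is a minimal sketch of that check, assuming the readings are already collected as percentages; the `Metric` class, the `early_warnings` function, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Metric:
    name: str
    value: float    # current reading in percent (e.g. memory in use)
    warn_at: float  # hypothetical warning threshold in percent


def early_warnings(metrics: Iterable[Metric]) -> List[str]:
    """Return one message per metric that has reached its warning threshold."""
    return [
        f"{m.name} at {m.value:.0f}% (warning threshold {m.warn_at:.0f}%)"
        for m in metrics
        if m.value >= m.warn_at
    ]


if __name__ == "__main__":
    readings = [
        Metric("memory", value=85, warn_at=80),
        Metric("disk", value=40, warn_at=90),
    ]
    for message in early_warnings(readings):
        print(message)
```

Setting the threshold well below 100% leaves time to respond before the resource is actually exhausted.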

To recover from a failure, it is essential to identify its cause from the log files and respond as quickly as possible. It helps to agree in advance on the work procedures from investigation through recovery so that the response proceeds smoothly.
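
A first step in investigating a log file is often to find which component reported errors and when they started. The sketch below does this for an assumed log format of `<timestamp> <LEVEL> <component>: <message>`; the format, the `LOG_LINE` pattern, and the `summarize_errors` function are assumptions made for illustration and would need to be adapted to the actual logs.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed line format: "<timestamp> <LEVEL> <component>: <message>"
LOG_LINE = re.compile(
    r"^(?P<time>\S+) (?P<level>[A-Z]+) (?P<component>[\w.-]+): (?P<message>.*)$"
)


def summarize_errors(lines: Iterable[str]) -> List[Tuple[str, int, str]]:
    """Return (component, error count, time of first error), most errors first."""
    counts: Counter = Counter()
    first_seen = {}
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match or match["level"] not in {"ERROR", "CRITICAL"}:
            continue  # skip unparseable lines and non-error levels
        component = match["component"]
        counts[component] += 1
        first_seen.setdefault(component, match["time"])
    return [(name, n, first_seen[name]) for name, n in counts.most_common()]


if __name__ == "__main__":
    sample = [
        "2024-01-01T10:00:00 INFO web: started",
        "2024-01-01T10:05:00 ERROR db: connection refused",
        "2024-01-01T10:05:02 ERROR web: upstream timeout",
        "2024-01-01T10:05:03 ERROR db: connection refused",
    ]
    for component, count, since in summarize_errors(sample):
        print(f"{component}: {count} error(s), first at {since}")
```

In this sample the database errors begin before the web errors, which suggests where to look first.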

After recovery, thoroughly investigating the cause of the problem and addressing it makes the system more resilient to future failures, so services can be provided more stably.

Learning Objectives

If we can build a system that is less prone to service interruptions due to failures, we can deliver stable services to users.

After an application is released, server load and data volume grow as the number of users increases, and response times gradually become slower. If this goes unaddressed, it may eventually lead to a failure. Because it is difficult to eliminate every cause of failure in advance, developers need to prepare for failures before they happen.

Build a redundant system so that you can continue to serve your users while responding to failures. It is also important to incorporate preventive maintenance to reduce the frequency of failures as much as possible.

Let's learn how to build a fault-tolerant system.

Learn from Here

Ideally, the skills for handling failures are learned through hands-on experience in operation and maintenance. Here, the aim is to get an overview of what those skills are.

Understanding the Basics of High-Availability
    Failure Prevention
    Quick Recovery
    Establish Rules for Dealing with Failures and Effort for Continuous Improvement
Understanding the Basics of Operation and Maintenance
    System Monitoring
    Job Management
    Backup Management

Recommended Materials

Fault-Tolerant Components on AWS (docs.aws.amazon.com)
This official AWS whitepaper introduces how to build a fault-tolerant system using AWS. Note that the whitepaper has been archived. For the latest technical information, see the AWS Whitepapers & Guides page: https://aws.amazon.com/whitepapers/ .
