💼 Failure management

ID: /frameworks/aws-well-architected/reliability/failure-management

Description

Low-level hardware component failures are something to be dealt with every day in an on-premises data center. In the cloud, however, you should be protected against most of these types of failures. For example, Amazon EBS volumes are placed in a specific Availability Zone, where they are automatically replicated to protect you from the failure of a single component. All EBS volumes are designed for 99.999% availability. Amazon S3 objects are stored across a minimum of three Availability Zones, providing 99.999999999% durability of objects over a given year. Regardless of your cloud provider, failures can still impact your workload. Therefore, you must take steps to implement resiliency if you need your workload to be reliable.
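To put the availability and durability figures above into concrete terms, the following minimal Python sketch converts them into yearly numbers. The function names and the 10,000,000-object count are illustrative assumptions, not part of any AWS API, and the durability calculation assumes the percentage can be read as an independent per-object annual survival probability.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Yearly downtime (in minutes) implied by an availability design target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def expected_annual_object_loss(durability_percent: float, object_count: int) -> float:
    """Expected number of objects lost per year.

    Assumption: durability is treated as an independent per-object
    probability of surviving one year.
    """
    return object_count * (1 - durability_percent / 100)


if __name__ == "__main__":
    # 99.999% availability (the EBS design target quoted above)
    print(f"{allowed_downtime_minutes(99.999):.2f} minutes of downtime per year")

    # 99.999999999% durability (the S3 figure quoted above),
    # applied to a hypothetical 10,000,000 stored objects
    loss = expected_annual_object_loss(99.999999999, 10_000_000)
    print(f"{loss:.6f} objects expected to be lost per year")
```

Under these assumptions, 99.999% availability leaves roughly 5.26 minutes of downtime per year, and 10,000,000 objects at 99.999999999% durability correspond to about one lost object every 10,000 years on average. These are design targets, so the workload still needs its own backup, failover, and recovery measures.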

Sub Sections

Section	Sub Sections	Compliance
💼 Back up data	4	no data
💼 REL09-BP01 Identify and back up all data that needs to be backed up, or reproduce the data from sources		no data
💼 REL09-BP02 Secure and encrypt backups		no data
💼 REL09-BP03 Perform data backup automatically		no data
💼 REL09-BP04 Perform periodic recovery of the data to verify backup integrity and processes		no data
💼 Design your workload to withstand component failures	7	no data
💼 REL11-BP01 Monitor all components of the workload to detect failures		no data
💼 REL11-BP02 Fail over to healthy resources		no data
💼 REL11-BP03 Automate healing on all layers		no data
💼 REL11-BP04 Rely on the data plane and not the control plane during recovery		no data
💼 REL11-BP05 Use static stability to prevent bimodal behavior		no data
💼 REL11-BP06 Send notifications when events impact availability		no data
💼 REL11-BP07 Architect your product to meet availability targets and uptime service level agreements (SLAs)		no data
💼 Plan for Disaster Recovery (DR)	5	no data
💼 REL13-BP01 Define recovery objectives for downtime and data loss		no data
💼 REL13-BP02 Use defined recovery strategies to meet the recovery objectives		no data
💼 REL13-BP03 Test disaster recovery implementation to validate the implementation		no data
💼 REL13-BP04 Manage configuration drift at the DR site or Region		no data
💼 REL13-BP05 Automate recovery		no data
💼 Test reliability	5	no data
💼 REL12-BP01 Use playbooks to investigate failures		no data
💼 REL12-BP02 Perform post-incident analysis		no data
💼 REL12-BP03 Test scalability and performance requirements		no data
💼 REL12-BP04 Test resiliency using chaos engineering		no data
💼 REL12-BP05 Conduct game days regularly		no data
💼 Use fault isolation to protect your workload	3	no data
💼 REL10-BP01 Deploy the workload to multiple locations		no data
💼 REL10-BP02 Automate recovery for components constrained to a single location		no data
💼 REL10-BP03 Use bulkhead architectures to limit scope of impact		no data
