💼 Failure management

  • ID: /frameworks/aws-well-architected/reliability/failure-management

Description

In an on-premises data center, low-level hardware component failures are something you deal with every day. In the cloud, however, you should be protected against most of these types of failures. For example, Amazon EBS volumes are placed in a specific Availability Zone, where they are automatically replicated to protect you from the failure of a single component. All EBS volumes are designed for 99.999% availability. Amazon S3 objects are stored across a minimum of three Availability Zones, providing 99.999999999% durability of objects over a given year. Regardless of your cloud provider, failures can still impact your workload. Therefore, you must take steps to implement resiliency if you need your workload to be reliable.
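To put the availability and durability percentages above in concrete terms, the following minimal sketch converts them into a yearly downtime budget and an expected yearly object loss. It is illustrative arithmetic only: the function names are hypothetical, the 365-day year is an assumption, and the figures are design targets taken from the description, not guarantees.

```python
from fractions import Fraction

# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def max_downtime_minutes_per_year(availability_percent: str) -> Fraction:
    """Yearly downtime budget implied by an availability percentage.

    The percentage is passed as a string so Fraction can represent it exactly.
    """
    unavailable_share = 1 - Fraction(availability_percent) / 100
    return unavailable_share * MINUTES_PER_YEAR


def expected_objects_lost_per_year(object_count: int, durability_percent: str) -> Fraction:
    """Average number of objects lost per year at a given durability percentage."""
    loss_share = 1 - Fraction(durability_percent) / 100
    return object_count * loss_share


if __name__ == "__main__":
    # 99.999% availability (the EBS design target quoted above)
    downtime = max_downtime_minutes_per_year("99.999")
    print(f"Downtime budget: {float(downtime):.3f} minutes per year")

    # 99.999999999% durability (the S3 figure quoted above), 10 million objects
    lost = expected_objects_lost_per_year(10_000_000, "99.999999999")
    print(f"Expected loss: {float(lost)} objects per year")
```

Under these assumptions, 99.999% availability leaves a budget of about 5.26 minutes of downtime per year, and 99.999999999% durability applied to 10,000,000 objects corresponds to an average loss of one object every 10,000 years. Exact fractions are used instead of floats because subtracting values this close to 1 in floating point loses precision.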

Sub Sections

| Section | Sub Sections | Internal Rules | Policies | Flags | Compliance |
|---|---|---|---|---|---|
| 💼 Back up data | 4 | | | | no data |
|  💼 REL09-BP01 Identify and back up all data that needs to be backed up, or reproduce the data from sources | | | | | no data |
|  💼 REL09-BP02 Secure and encrypt backups | | | | | no data |
|  💼 REL09-BP03 Perform data backup automatically | | | | | no data |
|  💼 REL09-BP04 Perform periodic recovery of the data to verify backup integrity and processes | | | | | no data |
| 💼 Design your workload to withstand component failures | 7 | | | | no data |
|  💼 REL11-BP01 Monitor all components of the workload to detect failures | | | | | no data |
|  💼 REL11-BP02 Fail over to healthy resources | | | | | no data |
|  💼 REL11-BP03 Automate healing on all layers | | | | | no data |
|  💼 REL11-BP04 Rely on the data plane and not the control plane during recovery | | | | | no data |
|  💼 REL11-BP05 Use static stability to prevent bimodal behavior | | | | | no data |
|  💼 REL11-BP06 Send notifications when events impact availability | | | | | no data |
|  💼 REL11-BP07 Architect your product to meet availability targets and uptime service level agreements (SLAs) | | | | | no data |
| 💼 Plan for Disaster Recovery (DR) | 5 | | | | no data |
|  💼 REL13-BP01 Define recovery objectives for downtime and data loss | | | | | no data |
|  💼 REL13-BP02 Use defined recovery strategies to meet the recovery objectives | | | | | no data |
|  💼 REL13-BP03 Test disaster recovery implementation to validate the implementation | | | | | no data |
|  💼 REL13-BP04 Manage configuration drift at the DR site or Region | | | | | no data |
|  💼 REL13-BP05 Automate recovery | | | | | no data |
| 💼 Test reliability | 5 | | | | no data |
|  💼 REL12-BP01 Use playbooks to investigate failures | | | | | no data |
|  💼 REL12-BP02 Perform post-incident analysis | | | | | no data |
|  💼 REL12-BP03 Test scalability and performance requirements | | | | | no data |
|  💼 REL12-BP04 Test resiliency using chaos engineering | | | | | no data |
|  💼 REL12-BP05 Conduct game days regularly | | | | | no data |
| 💼 Use fault isolation to protect your workload | 3 | | | | no data |
|  💼 REL10-BP01 Deploy the workload to multiple locations | | | | | no data |
|  💼 REL10-BP02 Automate recovery for components constrained to a single location | | | | | no data |
|  💼 REL10-BP03 Use bulkhead architectures to limit scope of impact | | | | | no data |