🛡️ AWS SageMaker Endpoint has less than 2 instances🟢

  • Contextual name: 🛡️ Endpoint has less than 2 instances🟢
  • ID: /ce/ca/aws/sagemaker/endpoint-instance-count
  • Tags:
  • Policy Type: COMPLIANCE_POLICY
  • Policy Categories: RELIABILITY

Description​


This policy identifies AWS SageMaker endpoints that are not configured with at least two instances for each production variant.

Rationale​

AWS SageMaker endpoints are designed to support high availability and fault tolerance. However, these capabilities are only realized when at least two instances are provisioned for each production variant. With a single instance, an instance failure or an Availability Zone outage leaves the variant with no healthy capacity. With multiple instances, SageMaker can automatically route traffic to the remaining healthy instances.

Additionally, during endpoint updates, SageMaker performs rolling or blue/green deployments. Configuring multiple instances ensures that sufficient capacity remains available to serve requests throughout the update process, minimizing service disruption.

Audit​

This policy marks an AWS SageMaker Endpoint as INCOMPLIANT when the associated AWS SageMaker Endpoint Configuration specifies an initialInstanceCount of 1 for any production variant.

Endpoints that are not in the InService state are marked as INAPPLICABLE.
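
The audit rule above can be sketched as a small Python check. This is a minimal illustration, not the policy's actual implementation: it assumes the inputs are dictionaries shaped like the SageMaker DescribeEndpoint and DescribeEndpointConfig responses, the function names are hypothetical, the COMPLIANT status label is an assumption (the text above names only INCOMPLIANT and INAPPLICABLE), and variants without an InitialInstanceCount (for example serverless variants) are skipped by assumption.

```python
from typing import Any, Dict, List

MIN_INSTANCES = 2  # threshold taken from the policy title ("less than 2 instances")


def noncompliant_variants(endpoint_config: Dict[str, Any]) -> List[str]:
    """Return the names of production variants with fewer than MIN_INSTANCES instances.

    Assumption: variants that carry no InitialInstanceCount are ignored here.
    """
    return [
        variant["VariantName"]
        for variant in endpoint_config.get("ProductionVariants", [])
        if "InitialInstanceCount" in variant
        and variant["InitialInstanceCount"] < MIN_INSTANCES
    ]


def evaluate_endpoint(endpoint: Dict[str, Any], endpoint_config: Dict[str, Any]) -> str:
    """Map an endpoint and its configuration to a status string."""
    # Endpoints that are not InService are out of scope for this policy.
    if endpoint.get("EndpointStatus") != "InService":
        return "INAPPLICABLE"
    # Any single under-provisioned variant is enough to fail the endpoint.
    if noncompliant_variants(endpoint_config):
        return "INCOMPLIANT"
    return "COMPLIANT"  # assumed label for the passing case
```

For example, a configuration with one variant at InitialInstanceCount 1 and another at 2 is reported as INCOMPLIANT, because the rule applies to any production variant rather than to the endpoint total.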

Remediation​


Update SageMaker Endpoint Instance Count​

To remediate this finding, ensure that each production variant associated with an AWS SageMaker endpoint has at least two instances. There are two approaches to achieve this:

Option 1: Scale the Variant's Capacity​

You can increase the number of instances for the endpoint without creating a new endpoint configuration.

From Command Line

aws sagemaker update-endpoint-weights-and-capacities \
  --endpoint-name {{endpoint-name}} \
  --desired-weights-and-capacities '[
    {
      "VariantName": "{{variant-name}}",
      "DesiredInstanceCount": 2
    }
  ]'

Notes:

  • Set DesiredInstanceCount to 2 or more to meet high-availability requirements.
  • SageMaker dynamically adjusts capacity and routes traffic automatically.
  • Monitor endpoint status and CloudWatch metrics to confirm the scaling operation completes successfully.
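
The same scaling operation can be sketched in Python with boto3. This is a hedged illustration rather than a prescribed procedure: the helper names are hypothetical, the client is assumed to be created by the caller with `boto3.client("sagemaker")`, and waiting on the `endpoint_in_service` waiter is one possible way to follow the monitoring note above.

```python
from typing import Any, Dict, List


def build_desired_capacities(variant_names: List[str],
                             instance_count: int = 2) -> List[Dict[str, Any]]:
    """Build the DesiredWeightsAndCapacities payload for the given variants."""
    if instance_count < 2:
        # The policy requires at least two instances per production variant.
        raise ValueError("instance_count must be at least 2")
    return [
        {"VariantName": name, "DesiredInstanceCount": instance_count}
        for name in variant_names
    ]


def scale_endpoint(client: Any, endpoint_name: str,
                   variant_names: List[str], instance_count: int = 2) -> str:
    """Request the new capacity, wait for the endpoint, and return its status.

    `client` is assumed to be a boto3 SageMaker client supplied by the caller.
    """
    client.update_endpoint_weights_and_capacities(
        EndpointName=endpoint_name,
        DesiredWeightsAndCapacities=build_desired_capacities(
            variant_names, instance_count
        ),
    )
    # Block until the endpoint reports InService again (or the waiter gives up).
    client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)
    return client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
```

Passing the client in as an argument keeps the payload logic usable without AWS credentials, for instance in a dry run that only prints the payload.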

Option 2: Update the Endpoint with a New Configuration

You can create a new endpoint configuration that specifies multiple instances per variant and update the endpoint to use it. This method leverages SageMaker's rolling update or blue/green deployment for minimal disruption.
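
One possible shape for this approach, sketched in Python with boto3, is shown below. It is an illustration under stated assumptions, not the documented procedure: the helper names and the new configuration name are hypothetical, the client is assumed to come from `boto3.client("sagemaker")`, and only the production variants are copied from the existing configuration, so settings such as encryption keys or data capture would need to be carried over separately.

```python
import copy
from typing import Any, Dict


def with_min_instances(existing_config: Dict[str, Any], new_name: str,
                       min_instances: int = 2) -> Dict[str, Any]:
    """Derive create_endpoint_config arguments from a DescribeEndpointConfig response.

    Each variant's InitialInstanceCount is raised to at least `min_instances`.
    Assumption: only ProductionVariants are copied; other settings are not.
    """
    variants = copy.deepcopy(existing_config["ProductionVariants"])
    for variant in variants:
        if "InitialInstanceCount" in variant:
            variant["InitialInstanceCount"] = max(
                variant["InitialInstanceCount"], min_instances
            )
    return {"EndpointConfigName": new_name, "ProductionVariants": variants}


def apply_new_config(client: Any, endpoint_name: str, new_config_name: str,
                     min_instances: int = 2) -> None:
    """Create the derived configuration and point the endpoint at it.

    `client` is assumed to be a boto3 SageMaker client supplied by the caller.
    """
    endpoint = client.describe_endpoint(EndpointName=endpoint_name)
    current = client.describe_endpoint_config(
        EndpointConfigName=endpoint["EndpointConfigName"]
    )
    client.create_endpoint_config(
        **with_min_instances(current, new_config_name, min_instances)
    )
    client.update_endpoint(
        EndpointName=endpoint_name, EndpointConfigName=new_config_name
    )
```

Deriving the new configuration from the current one keeps model names and instance types unchanged, so the update alters only the instance counts.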

Linked Framework Sections​

Section | Sub Sections / Internal Rules / Policies / Flags | Compliance
💼 AWS Foundational Security Best Practices v1.0.0 → 💼 [SageMaker.4] SageMaker AI endpoint production variants should have an initial instance count greater than 1 | 1 | no data
💼 Cloudaware Framework → 💼 System Configuration | 62 | no data
💼 FedRAMP High Security Controls → 💼 CP-10 System Recovery and Reconstitution (L)(M)(H) | 216 | no data
💼 FedRAMP High Security Controls → 💼 SC-5 Denial-of-service Protection (L)(M)(H) | 2 | no data
💼 FedRAMP Low Security Controls → 💼 CP-10 System Recovery and Reconstitution (L)(M)(H) | 16 | no data
💼 FedRAMP Low Security Controls → 💼 SC-5 Denial-of-service Protection (L)(M)(H) | 2 | no data
💼 FedRAMP Moderate Security Controls → 💼 CP-10 System Recovery and Reconstitution (L)(M)(H) | 116 | no data
💼 FedRAMP Moderate Security Controls → 💼 SC-5 Denial-of-service Protection (L)(M)(H) | 2 | no data
💼 NIST CSF v2.0 → 💼 DE.CM-01: Networks and network services are monitored to find potentially adverse events | 170 | no data
💼 NIST CSF v2.0 → 💼 PR.IR-01: Networks and environments are protected from unauthorized logical access and usage | 119 | no data
💼 NIST CSF v2.0 → 💼 PR.IR-03: Mechanisms are implemented to achieve resilience requirements in normal and adverse situations | 19 | no data
💼 NIST CSF v2.0 → 💼 RC.RP-01: The recovery portion of the incident response plan is executed once initiated from the incident response process | 16 | no data
💼 NIST CSF v2.0 → 💼 RC.RP-02: Recovery actions are selected, scoped, prioritized, and performed | 16 | no data
💼 NIST CSF v2.0 → 💼 RC.RP-05: The integrity of restored assets is verified, systems and services are restored, and normal operating status is confirmed | 16 | no data
💼 NIST SP 800-53 Revision 5 → 💼 CP-10 System Recovery and Reconstitution | 616 | no data
💼 NIST SP 800-53 Revision 5 → 💼 SC-5 Denial-of-service Protection | 318 | no data
💼 NIST SP 800-53 Revision 5 → 💼 SC-36 Distributed Processing and Storage | 210 | no data