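[HostHum.com] December 8: AWS's US-EAST-1 Region suffered a large-scale outage. The outage began at around 9:37 AM PST and was not largely resolved until the evening. Affected services included EC2, Connect, DynamoDB, Glue, Athena, Timestream, and Chime, as well as other AWS services in US-EAST-1.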

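The outage severely affected many third-party businesses that rely on AWS, including Facebook, Disney Plus, and Alexa. Amazon's own delivery operations were also hit: drivers could not retrieve their routes or the packages they were due to deliver, and the communication system between Amazon and thousands of drivers went down.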

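Amazon began working on a fix shortly after the first outage reports appeared. At 2:04 PM PST, AWS issued a statement: "We have executed a mitigation which is showing significant recovery in the US-EAST-1 Region. We are continuing to closely monitor the health of the network devices and we expect to continue to make progress towards full recovery. We still do not have an ETA for full recovery at this time."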

As of 4:35 PM PST, AWS said that the network device issues had been resolved and that it was working to restore all impaired services.

AWS's official status dashboard also tracked the outage. The full timeline of the incident, as posted there, follows:

[RESOLVED] API Error Rates in US-EAST-1

[9:37 AM PST] We are seeing impact to multiple AWS APIs in the US-EAST-1 Region. This issue is also affecting some of our monitoring and incident response tooling, which is delaying our ability to provide updates. We have identified the root cause and are actively working towards recovery.

[10:12 AM PST] We are seeing impact to multiple AWS APIs in the US-EAST-1 Region. This issue is also affecting some of our monitoring and incident response tooling, which is delaying our ability to provide updates. We have identified root cause of the issue causing service API and console issues in the US-EAST-1 Region, and are starting to see some signs of recovery. We do not have an ETA for full recovery at this time.

[11:26 AM PST] We are seeing impact to multiple AWS APIs in the US-EAST-1 Region. This issue is also affecting some of our monitoring and incident response tooling, which is delaying our ability to provide updates. Services impacted include: EC2, Connect, DynamoDB, Glue, Athena, Timestream, and Chime and other AWS Services in US-EAST-1. The root cause of this issue is an impairment of several network devices in the US-EAST-1 Region. We are pursuing multiple mitigation paths in parallel, and have seen some signs of recovery, but we do not have an ETA for full recovery at this time. Root logins for consoles in all AWS regions are affected by this issue, however customers can login to consoles other than US-EAST-1 by using an IAM role for authentication.

[12:34 PM PST] We continue to experience increased API error rates for multiple AWS Services in the US-EAST-1 Region. The root cause of this issue is an impairment of several network devices. We continue to work toward mitigation, and are actively working on a number of different mitigation and resolution actions. While we have observed some early signs of recovery, we do not have an ETA for full recovery. For customers experiencing issues signing-in to the AWS Management Console in US-EAST-1, we recommend retrying using a separate Management Console endpoint (such as https://us-west-2.console.aws.amazon.com/). Additionally, if you are attempting to login using root login credentials you may be unable to do so, even via console endpoints not in US-EAST-1. If you are impacted by this, we recommend using IAM Users or Roles for authentication. We will continue to provide updates here as we have more information to share.

[2:04 PM PST] We have executed a mitigation which is showing significant recovery in the US-EAST-1 Region. We are continuing to closely monitor the health of the network devices and we expect to continue to make progress towards full recovery. We still do not have an ETA for full recovery at this time.

[2:43 PM PST] We have mitigated the underlying issue that caused some network devices in the US-EAST-1 Region to be impaired. We are seeing improvement in availability across most AWS services. All services are now independently working through service-by-service recovery. We continue to work toward full recovery for all impacted AWS Services and API operations. In order to expedite overall recovery, we have temporarily disabled Event Deliveries for Amazon EventBridge in the US-EAST-1 Region. These events will still be received & accepted, and queued for later delivery.

[3:03 PM PST] Many services have already recovered, however we are working towards full recovery across services. Services like SSO, Connect, API Gateway, ECS/Fargate, and EventBridge are still experiencing impact. Engineers are actively working on resolving impact to these services.

[4:35 PM PST] With the network device issues resolved, we are now working towards recovery of any impaired services. We will provide additional updates for impaired services within the appropriate entry in the Service Health Dashboard.