SAN FRANCISCO — Portions of Amazon Web Services, the nation's largest cloud computing company, went offline Tuesday afternoon, affecting millions of companies across the United States.
"This is a pretty big outage," said Dave Bartoletti, a cloud analyst with Forrester. "AWS had not had a lot of outages and when they happen, they're famous. People still talk about the one in September of 2015 that lasted five hours," he said.
01:13 PM PST S3 object retrieval, listing and deletion are fully recovered now. We are still working to recover normal operations for adding new objects to S3.
The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart.

Source: Summary of the Amazon S3 Service Disruption February 28th [aws.amazon.com]
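For anyone wondering how a single mistyped input can take out subsystems the operator never meant to touch, here is a minimal Python sketch of one common kind of safeguard: check every subsystem that shares the targeted servers before removing anything. This is only an illustration, not AWS's actual tooling. The names `Subsystem`, `remove_servers`, `min_servers` and `max_batch` are all hypothetical, and the server counts are made up.

```python
from dataclasses import dataclass


@dataclass
class Subsystem:
    name: str
    servers: set       # server IDs currently backing this subsystem
    min_servers: int   # hypothetical floor below which it cannot serve requests


class CapacityError(Exception):
    """Raised when a removal request is refused."""


def remove_servers(subsystems, targets, max_batch=2):
    """Remove `targets` from every subsystem, or refuse and change nothing.

    Two checks run before any state is modified:
      1. the request may not exceed `max_batch` servers, which catches
         an input that expands to far more hosts than intended;
      2. no subsystem may drop below its own `min_servers` floor, even
         if the operator only had a different subsystem in mind.
    """
    targets = set(targets)
    if len(targets) > max_batch:
        raise CapacityError(
            f"refusing to remove {len(targets)} servers (limit {max_batch})"
        )
    for sub in subsystems:
        remaining = len(sub.servers - targets)
        if remaining < sub.min_servers:
            raise CapacityError(
                f"{sub.name} would drop to {remaining} servers "
                f"(minimum {sub.min_servers})"
            )
    # All checks passed, so apply the removal everywhere.
    for sub in subsystems:
        sub.servers -= targets
    return targets
```

With an index subsystem that shares a server with billing, a request that is small from billing's point of view can still be refused because of what it would do to the index. The point of the design is that the check looks at every dependent subsystem, not just the one named in the playbook.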