Amazon S3 outage breaks the internet - Website Technology Issues forum at WebmasterWorld - WebmasterWorld

Forum Moderators: phranque

Amazon S3 outage breaks the internet

robzilla

7:42 pm on Feb 28, 2017 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Well, a noticeable part of it anyway. Amazon's S3 service in the us-east-1 region has been down for at least half an hour now, affecting a whole bunch of sites and services, from Quora and Imgur to GitHub and Trello. Other AWS services connected to S3 buckets, such as CloudFront distributions, are also likely to suffer. The AWS status page [status.aws.amazon.com] reports "high error rates", but the effect seems to be more like an outage. Ironically, the icons on the status page weren't showing as a result of the outage, and the "Is It Down Right Now" website is also affected.

Amazon's cloud service has outage, disrupting sites [usatoday.com]

SAN FRANCISCO — Portions of Amazon Web Services, the nation's largest cloud computing company, went offline Tuesday afternoon, affecting millions of companies across the United States.

"This is a pretty big outage," said Dave Bartoletti, a cloud analyst with Forrester. "AWS had not had a lot of outages and when they happen, they're famous. People still talk about the one in September of 2015 that lasted five hours," he said.

engine

8:59 pm on Feb 28, 2017 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

I spotted a couple of sites with problems earlier and had to double check. One of the checking sites indicated no problems, while "Is It Down Right Now" was itself down. I put it down to an error en route. Seems not.

Dimitri

9:03 pm on Feb 28, 2017 (gmt 0)

WebmasterWorld Senior Member

Top Contributors Of The Month

That's the problem with giant hosts like that: the day there is an outage, it immediately impacts a lot of sites.

robzilla

9:22 pm on Feb 28, 2017 (gmt 0)

Seems to be largely resolved now:

01:13 PM PST S3 object retrieval, listing and deletion are fully recovered now. We are still working to recover normal operations for adding new objects to S3.

Edit: spoke too soon. I cannot access my buckets via the management panel just yet, and I'm not even in the us-east-1 region. It's just backups I have there so it's not a problem.

Dimitri

9:36 pm on Feb 28, 2017 (gmt 0)

I just realized that I placed two orders at Amazon.com earlier and did not receive the notification emails. The orders are not showing on my account either, although my card has been charged. I wrote to Amazon customer service and did not receive a copy of my email, as I usually do. If it's impacting Amazon.com too, this would be BIG.

robzilla

10:49 pm on Feb 28, 2017 (gmt 0)

I've read similar reports: missing orders and such. What I gather from the AWS Console right now is that just about every service offered in the us-east-1 region is or at least was having "operational issues", from S3 and EC2 to SES and ElastiCache. Not Route 53, thankfully. If Amazon eats its own dog food then the retail side may very well have been affected. "Not good", as Trump would say. They say most services are now recovered, though, with others still in active recovery. Looking forward to the post-mortem.

Dimitri

11:02 pm on Feb 28, 2017 (gmt 0)

I am sorry for those who are using Cloudflare as frontend, and AWS as backend*

*(bad joke)

NickMNS

12:28 am on Mar 1, 2017 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Any word yet on the cause? DDoS, software bug, hardware breakdown?

robzilla

9:18 am on Mar 1, 2017 (gmt 0)

Nothing official yet. Someone tweeted [twitter.com] about an internal ticket suggesting the root cause "was a script that was run", in which case it's human error and somebody's likely to be in trouble. We'll probably hear more later in the day.

Dimitri

12:22 pm on Mar 1, 2017 (gmt 0)

------------------
Q: "Will this happen again?"
A: "We are confident that this issue will NOT occur again..."
------------------

I can hardly imagine someone answering "yes, it will" :)

engine

12:32 pm on Mar 1, 2017 (gmt 0)

You can get the latest on the AWS service health dashboard, although there's not a lot to expand upon.
[status.aws.amazon.com...]

engine

6:17 pm on Mar 2, 2017 (gmt 0)

Amazon Web Services has published an explanation of the outage, which it seems was the result of a debugging process and human error.

The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart.

Summary of the Amazon S3 Service Disruption February 28th [aws.amazon.com]
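For anyone wondering what a guard against this kind of mistyped input might look like, here is a minimal Python sketch. It is not AWS's tooling. The names Subsystem, plan_removal, UnsafeRemovalError, the max_batch limit and the min_capacity values are all hypothetical, made up purely for illustration. The idea is simply to check a removal request before acting on it, and to refuse it if it is too large or would leave any subsystem short of servers.

```python
from dataclasses import dataclass


class UnsafeRemovalError(Exception):
    """Raised when a removal request would breach a safety limit."""


@dataclass(frozen=True)
class Subsystem:
    name: str            # hypothetical label, e.g. "billing" or "index"
    servers: frozenset   # IDs of the servers currently backing this subsystem
    min_capacity: int    # assumed minimum number of servers it needs to run


def plan_removal(requested, subsystems, max_batch=2):
    """Validate a removal request; return the sorted server IDs if it is safe.

    The request is refused if it names more than max_batch servers, or if
    it would leave any subsystem with fewer servers than its min_capacity.
    """
    requested = frozenset(requested)
    if len(requested) > max_batch:
        raise UnsafeRemovalError(
            f"{len(requested)} servers requested, limit is {max_batch}"
        )
    for sub in subsystems:
        remaining = len(sub.servers - requested)
        if remaining < sub.min_capacity:
            raise UnsafeRemovalError(
                f"{sub.name} would drop to {remaining} servers, "
                f"needs at least {sub.min_capacity}"
            )
    return sorted(requested)


# Illustrative data only: two made-up subsystems with made-up server IDs.
billing = Subsystem("billing", frozenset({"b1", "b2", "b3"}), min_capacity=2)
index = Subsystem("index", frozenset({"i1", "i2", "i3", "i4"}), min_capacity=3)

print(plan_removal({"b1"}, [billing, index]))  # small, safe request

try:
    plan_removal({"b1", "b2", "i1", "i2"}, [billing, index])  # mistyped input
except UnsafeRemovalError as err:
    print("refused:", err)
```

The check runs before anything is removed, so an oversized input fails loudly instead of quietly taking capacity away from subsystems that were never meant to be touched.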

NickMNS

6:41 pm on Mar 2, 2017 (gmt 0)

Ahh "Fat fingers!"

robzilla

9:58 pm on Mar 2, 2017 (gmt 0)

Happy to see they put more emphasis on their lack of safeguards than the person entering the command, and it looks like the human error shouldn't really have caused this much trouble in the first place. Must have been trying times for that engineer.

phranque

12:02 pm on Mar 3, 2017 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

rm -R *