Forum Moderators: phranque

Amazon S3 outage breaks the internet

     
7:42 pm on Feb 28, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 25, 2005
posts:1683
votes: 239


Well, a noticeable part of it anyway. Amazon's S3 service in the us-east-1 region has been down for at least half an hour now, affecting a whole bunch of sites and services, from Quora and Imgur to GitHub and Trello. Other AWS services connected to S3 buckets, such as CloudFront distributions, are also likely to suffer. The AWS status page [status.aws.amazon.com] reports "high error rates", but the effect seems to be more like an outage. Ironically, the icons on the status page weren't showing as a result of the outage, and the "Is It Down Right Now" website is also affected.

Amazon's cloud service has outage, disrupting sites [usatoday.com]
SAN FRANCISCO - Portions of Amazon Web Services, the nation's largest cloud computing company, went offline Tuesday afternoon, affecting millions of companies across the United States.

"This is a pretty big outage," said Dave Bartoletti, a cloud analyst with Forrester. "AWS had not had a lot of outages and when they happen, they're famous. People still talk about the one in September of 2015 that lasted five hours," he said.
8:59 pm on Feb 28, 2017 (gmt 0)

Administrator from GB 

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:May 9, 2000
posts:25055
votes: 661


I spotted a couple of sites with problems earlier and had to double check. One of the checking sites indicated no problems, while Is It Down was itself down. I put it down to an error en route. Seems not.
9:03 pm on Feb 28, 2017 (gmt 0)

Full Member

Top Contributors Of The Month

joined:Nov 13, 2016
posts: 348
votes: 50


That's the problem with giant hosts like that: the day there is an outage, it immediately impacts a lot of sites.
9:22 pm on Feb 28, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 25, 2005
posts:1683
votes: 239


Seems to be largely resolved now:
01:13 PM PST S3 object retrieval, listing and deletion are fully recovered now. We are still working to recover normal operations for adding new objects to S3.

Edit: spoke too soon. I cannot access my buckets via the management panel just yet, and I'm not even in the us-east-1 region. It's just backups I have there so it's not a problem.
9:36 pm on Feb 28, 2017 (gmt 0)

Full Member

Top Contributors Of The Month

joined:Nov 13, 2016
posts: 348
votes: 50


I just realized that I placed two orders at Amazon.com earlier and did not receive the notification emails. The orders are not showing on my account either, although my card has been charged. I wrote to Amazon customer service and did not receive a copy of my email, as I usually do. If it's impacting Amazon.com too, this would be BIG.
10:49 pm on Feb 28, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 25, 2005
posts:1683
votes: 239


I've read similar reports: missing orders and such. What I gather from the AWS Console right now is that just about every service offered in the us-east-1 region is or at least was having "operational issues", from S3 and EC2 to SES and ElastiCache. Not Route 53, thankfully. If Amazon eats its own dog food then the retail side may very well have been affected. "Not good", as Trump would say. They say most services are now recovered, though, with others still in active recovery. Looking forward to the post-mortem.
11:02 pm on Feb 28, 2017 (gmt 0)

Full Member

Top Contributors Of The Month

joined:Nov 13, 2016
posts: 348
votes: 50


I feel sorry for those who are using Cloudflare as a frontend and AWS as a backend*


*(bad joke)
12:28 am on Mar 1, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 1592
votes: 424


Any word yet on the cause? DDoS, software bug, hardware breakdown?
9:18 am on Mar 1, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 25, 2005
posts:1683
votes: 239


Nothing official yet. Someone tweeted [twitter.com] about an internal ticket suggesting the root cause "was a script that was run", in which case it's human error and somebody's likely to be in trouble. We'll probably hear more later in the day.
12:22 pm on Mar 1, 2017 (gmt 0)

Full Member

Top Contributors Of The Month

joined:Nov 13, 2016
posts: 348
votes: 50


------------------
Q: "Will this happen again?"
A: "We are confident that this issue will NOT occur again..."
------------------

I can hardly imagine someone answering "yes, it will" :)
12:32 pm on Mar 1, 2017 (gmt 0)

Administrator from GB 

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:May 9, 2000
posts:25055
votes: 661


You can get the latest on the AWS service health dashboard, although there's not a lot to expand upon.
[status.aws.amazon.com...]
6:17 pm on Mar 2, 2017 (gmt 0)

Administrator from GB 

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:May 9, 2000
posts:25055
votes: 661


Amazon Web Services has published an explanation of the outage, which it seems was the result of a debugging process and human error.

The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart.
Summary of the Amazon S3 Service Disruption February 28th [aws.amazon.com]
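
To illustrate the failure mode described in the quote, here is a minimal sketch of the kind of guard that could reject a mistyped removal count before any servers are touched. It is not AWS's tooling: the `Subsystem` type, the `plan_removal` function, the `min_required` floor and the `max_per_run` limit are all hypothetical names and numbers chosen for this example.

```python
from dataclasses import dataclass


class UnsafeRemovalError(Exception):
    """Raised when a removal request would breach a safety limit."""


@dataclass(frozen=True)
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical capacity floor for this subsystem


def plan_removal(subsystem: Subsystem, requested: int, max_per_run: int = 5) -> int:
    """Validate a removal request and return the server count left afterwards."""
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")
    # Guard 1: cap how many servers a single command may remove.
    if requested > max_per_run:
        raise UnsafeRemovalError(
            f"{requested} servers exceeds the per-run limit of {max_per_run}"
        )
    # Guard 2: never drop the subsystem below its minimum capacity.
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise UnsafeRemovalError(
            f"{subsystem.name} would drop to {remaining}, "
            f"below its minimum of {subsystem.min_required}"
        )
    return remaining


if __name__ == "__main__":
    index = Subsystem(name="index", active_servers=100, min_required=80)
    print(plan_removal(index, 3))  # a small, intended removal passes
    try:
        plan_removal(index, 30)  # a mistyped input is refused
    except UnsafeRemovalError as err:
        print(f"refused: {err}")
```

The two checks are independent on purpose: the per-run cap catches a typo even when plenty of capacity remains, while the floor check catches a series of small removals that would add up to too much.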
6:41 pm on Mar 2, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 1592
votes: 424


Ahh "Fat fingers!"
9:58 pm on Mar 2, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 25, 2005
posts:1683
votes: 239


Happy to see they put more emphasis on their lack of safeguards than on the person entering the command, and it looks like the human error shouldn't really have caused this much trouble in the first place. Must have been trying times for that engineer.
12:02 pm on Mar 3, 2017 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11167
votes: 116


rm -R *