Theplanet's data centre in Houston catches fire

Forum Moderators: phranque

Theplanet's main data centre in Houston catches fire; 1000s of servers down

dusky

2:38 am on Jun 1, 2008 (gmt 0)

Hundreds if not thousands of servers are down due to a fire at one of theplanet's main data centres. The main announcement is at their forums here [forums.theplanet.com...] Apparently, no data was lost or damaged, but I have the feeling that it's big and they are not sure!

They are one of the larger dedicated server providers in the world, so I am sure there are millions of 404s every second from hundreds of thousands of sites. This will undoubtedly be a test of how quickly they can respond to the situation and get servers back online. It has been a few hours so far!

[newsblaze.com...]

[edited by: Brett_Tabke at 11:34 am (utc) on June 1, 2008]
[edit reason] added link [/edit]

httpwebwitch

1:46 pm on Jun 3, 2008 (gmt 0)

This is the best reaction / web service interruption apology I've encountered regarding the big fire:

An explosion ate our servers - how cool is that?
(snip)
An explosion. How cool is that? It's way cooler than having a guy accidentally dig up your wiring with a backhoe, or run it over with a truck. It's an outright privilege compared with the squirrel that ate the fiber cable. Imagine spending nights and weekends rallying your troops around a disaster recovery plan, all to do battle with a squirrel. That would be humiliating. My wall of flame incinerates your squirrel.
(snip)
source:www.assembla.com

I wish I could quote the whole thing, cuz it made me lawl.

juliocrd

5:25 pm on Jun 3, 2008 (gmt 0)

Hi there, and sorry about the hosting provider question. Well I am also surprised because it seems like the fire isn’t such a big deal, but actually it is for those affected. Our websites have been down for almost 3 days now; yesterday they worked just for a couple hours and now they’re down again. Well we’ll just have to wait; at least this has made us work to try out our DRPs.

rollinj

6:21 pm on Jun 3, 2008 (gmt 0)

Anyone know if we're entitled to refunds? I know I didn't start that fire..

Dabrowski

6:31 pm on Jun 3, 2008 (gmt 0)

ok, I run my own servers and this is my setup....

I have 3 sites, in 3 different geographical areas (although in the UK that doesn't really mean too much), using 2 different ISPs.

Apart from my domain name registration - I do everything in house. That way I can control everything.

The sites are connected via SDSL on hardware VPN routers which are all connected to each other in a triangle.

Each server is a DNS server, all servers are mirrored and the DNS uses 'round robin' to load balance. That is, each server address is given out in turn by each server.

Web data is mirrored, and I'm using MySQL with circular replication, so all data is (almost) synchronously replicated.

On top of that, each server backs up electronically to the next, and I have one physical copy which is taken off site.

Should 1 (or 2) sites fail, all I'd need to do is delete one host record from each DNS zone and the change would be immediate. As soon as the link is restored, or the server replaced, replication would kick in and away it goes.

Should all 3 sites fail, I have backup servers (offline) that can be switched over, backup restored and running in a couple of hours.
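
The round-robin setup described above can be sketched in a few lines of Python. This is only a toy model, not the actual configuration: the class name `RoundRobinZone` and the 192.0.2.x documentation addresses are assumptions made for illustration, and a real DNS server does the rotation itself.

```python
class RoundRobinZone:
    """Toy model of a DNS zone that hands out its host records in rotation."""

    def __init__(self, addresses):
        self.addresses = list(addresses)
        self._next = 0  # index of the next address to hand out

    def resolve(self):
        """Return the next address in turn, wrapping around at the end."""
        if not self.addresses:
            raise LookupError("no host records left in the zone")
        address = self.addresses[self._next % len(self.addresses)]
        self._next += 1
        return address

    def remove_host(self, address):
        """Delete the record of a failed site so it is no longer handed out."""
        self.addresses.remove(address)


# Hypothetical addresses, one per site.
zone = RoundRobinZone(["192.0.2.1", "192.0.2.2", "192.0.2.3"])
first_answers = [zone.resolve() for _ in range(3)]  # one answer per site, in order
zone.remove_host("192.0.2.2")  # the second site has failed
later_answers = [zone.resolve() for _ in range(4)]  # only the two live sites
```

Removing the record only changes what the server answers from that point on; anything a client has already cached is a separate matter.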

g1smd

11:00 pm on Jun 3, 2008 (gmt 0)

Have you actually pulled the wire out, and tested what actually happens next?

incrediBILL

11:49 pm on Jun 3, 2008 (gmt 0)

Each server is a DNS server, all servers are mirrored and the DNS uses 'round robin' to load balance. That is, each server address is given out in turn by each server.

Unless I'm mistaken and you have accounted for this problem, doesn't that configuration mean when one server goes down 33% of your visitors will be unable to reconnect for a period of time?

Dabrowski

12:00 am on Jun 4, 2008 (gmt 0)

Have you actually pulled the wire out, and tested what actually happens next?

Unless I'm mistaken and you have accounted for this problem, doesn't that configuration mean when one server goes down 33% of your visitors will be unable to reconnect for a period of time?

To answer both of these questions....

Yes I have. The round robin kicks in and, once the cache time has expired (see below), all is well once again.

Regarding DNS cache times, if a host has performed a DNS query it will cache the result locally. The cache time varies: I believe it's around 10 minutes or less for most applications and clients, but it may be longer on ISPs' DNS servers.

So, yes 33% is correct, but only of visitors that have performed a lookup recently.

Unfortunately, there is no way to get around this. Without DNS caching I'm sure the internet would be a much slower place, but it is quicker than changing name servers and waiting for propagation.
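
A rough sketch of the arithmetic behind the 33% figure, in Python. The function names are hypothetical, and the 600-second cache time is just the "around 10 minutes" guess from this post, not a measured value.

```python
def affected_fraction(total_servers, failed_servers, recently_cached_share):
    """Share of visitors still holding a cached address for a failed server.

    recently_cached_share is the share of visitors who looked the name up
    recently enough for the answer to still be in their cache (0.0 to 1.0).
    """
    return (failed_servers / total_servers) * recently_cached_share


def seconds_until_retry(cache_seconds, seconds_since_lookup):
    """How long a visitor keeps using the stale answer before looking up again."""
    return max(0, cache_seconds - seconds_since_lookup)


# One of three servers fails; worst case, every visitor has a fresh cache entry.
worst_case = affected_fraction(3, 1, 1.0)      # one third of visitors
# Assumed 10-minute cache, lookup made 4 minutes before the failure.
wait = seconds_until_retry(600, 240)           # 360 seconds left on the stale entry
```

So the 33% is an upper bound, and each affected visitor recovers on their own once their cached entry runs out.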

Dabrowski

12:02 am on Jun 4, 2008 (gmt 0)

Unfortunately, there is no way to get around this

Actually, I'll correct myself here, it is possible with a server cluster of remote nodes working with a single virtual IP.

That would be seamless, however that is currently beyond my capability, but is on my 'todo' list.

physics

12:42 am on Jun 4, 2008 (gmt 0)

Remember, the title is: Theplanet's data centre in Houston catches fire.

Let's try to keep this thread on topic with that (and not digress too much into how to create your own 'planet' ;) ).

bsterz

1:50 pm on Jun 4, 2008 (gmt 0)

I wonder if it's appropriate to discuss what they did well and what they did poorly regarding this outage? I have been affected in a big way, so I certainly have some opinions..

I would like to avoid rants or speculations (such as 'the guy didn't even seem to care..' types of comments)

Mod - feel free to nuke if this is not fitting for this thread..

Did well:
1. Kept chat working even under a crushing load
2. When tickets started getting moldy, if I chatted to ask, I never got a default response; the chat tech always checked on the ticket and sometimes nudged it along for me
3. Helped with config changes on my servers where needed
4. Were respectful and patient (to a degree befitting the situation)
5. Kept updates coming out - I was refreshing that page every 15 minutes or so as I had customers waiting to hear from me - so I really appreciated the constant info
6. Seemed to be fast on their feet in acquiring name servers, equipment, etc
7. Did generally try to help rather than pushing me off to the next guy

Could have done better:
1. Could have pushed out config suggestions more quickly. I'm pretty sure they knew about many of the issues I would run into such as the resolvers dying and could have been a bit quicker on pushing out the info
2. Could have been a bit more forthcoming about the DNS servers dying. This was kinda downplayed IMO and was a major issue for me and my customers. 5 days after the fire and I still don't have access to my zone files.
3. Could have pushed a customer chat/discussion area where we could share our ideas. Many customers don't know about the forums and would have benefited from some peer assistance.
4. A phone call about the outage on Saturday would have been great. They have my number in the admin area, and it would have really changed my outlook if they would have told me things were going to get a bit bumpy.
5. Quick & regular acknowledgements of tickets would have been great. Some tickets waited days without any updates. Just a simple - 'we are aware of this ticket and it's in the queue' would have been good, and they could have paid an intern to do it

For my part, I've learned a bit too. I'm now creating a comprehensive phone list of those clients of mine that would be affected by future outages. I plan to call them if this happens again so they don't feel blindsided by me when their sites go down
I'm also formulating a more robust strategy for when this happens. I don't think I'm charging enough to many clients as a 'real' strategy is expensive, so I'm re-evaluating my hosting price schedule. I want to provide better service than the other guy and I know that's not cheap

Overall I'm happy with their response and as I said in a previous post I feel better about them than I did before the outage. I was actually formulating an exit strategy last month, but I'm now considering keeping my major servers with them. I'm moving mail and DNS elsewhere however as this has taught me that you can't keep all your eggs in one basket

Your thoughts? Honestly, I'm looking also for ways I could have better served my customers during this outage, so any perspectives are welcome :)

g1smd

2:10 pm on Jun 4, 2008 (gmt 0)

A lot of the "downside" stuff you list are things that needed resources -- human resources. Calling all of their customers within a few hours would have needed hundreds of people, resources they probably didn't have free at the time.

Even for the people updating status, a team of several people probably couldn't type fast enough to list out every problem that was being fixed, as it was fixed.

Some issues might not have been pertinent to mention in case systems were being hot-wired to get them back on, but with some security aspects not fully implemented. Don't let potential hackers have too much information.

Ticket processing. Even a team of a hundred people could take many hours to read tens of thousands of tickets, without even acting on any of them. Adding on the time spent making calls to technicians who are already trying to rewire and configure things takes the response time through the roof.

I'd guess that they were wishing they had ten times as many staff at present.

coopster

2:17 pm on Jun 4, 2008 (gmt 0)

I don't use Planet myself, but am looking into something for somebody else right now. Has anybody else checked their reverse DNS since this mishap occurred?

bsterz

2:35 pm on Jun 4, 2008 (gmt 0)

g1smd - I agree with everything you said..but with that said..

I'll bet folks over there are saying to their staff something like this:

"I want to hear ideas - ANY ideas of how we could have mitigated this issue".

Given the extreme solutions they went to technically, I'll bet with a little planning and partnering they could have some more human solutions on deck for the next crisis. The folks they have did a great job, but some 'perception' issues could be avoided.

For example - how about an emergency contingency arrangement with a call center? Something like "I'll give you $xx per month to be ready to make 5,000 calls at the drop of a hat". Or how about an automated calling system (* comes to mind)? Heck I have one and I'm TINY.

Again - I'm not unhappy with TP - quite the converse, but we should explore our perceptions of what could be improved from a 'blue sky' perspective - that's what we're doing over here. We can't DO all of the things, but we're surely exploring them.

Bill

bsterz

2:39 pm on Jun 4, 2008 (gmt 0)

Coopster - I'm actually afraid to check - DNS has been the largest issue for me from this outage. In fact - my main server wasn't even in the affected DC, but the DNS failures caused me quite a scramble.

coopster

3:04 pm on Jun 4, 2008 (gmt 0)

Not DNS, the reverse DNS or rDNS (ARPA). They are now fixed. 1 hour ago they were pointing to .ev1servers.net. Using an example IP:

;; QUESTION SECTION: 
;1.0.0.127.in-addr.arpa.   IN   PTR 
;; ANSWER SECTION: 
1.0.0.127.in-addr.arpa. 86400 IN   PTR   ev1s-127-0-0-1.ev1servers.net.

Interesting, to say the least.
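
For anyone wanting to repeat the check, the name queried for a reverse lookup is just the IP address with its octets reversed under in-addr.arpa. A minimal Python sketch using the standard `ipaddress` module follows; the helper name `ptr_query_name` is made up for illustration.

```python
import ipaddress


def ptr_query_name(ip):
    """Return the in-addr.arpa name that a reverse (PTR) lookup asks for."""
    return ipaddress.ip_address(ip).reverse_pointer


# Same example IP as in the output above.
name = ptr_query_name("127.0.0.1")  # "1.0.0.127.in-addr.arpa"

# An actual lookup needs a resolver and network access, for example
# socket.gethostbyaddr(ip) from the standard library; it is left out here
# so the sketch runs offline.
```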

himalayaswater

5:32 pm on Jun 4, 2008 (gmt 0)

Dabrowski,

You need to use LVS for virtual IP management. You also need central storage so that users see the same dynamic or static page. You need to configure MySQL or any other db server in master / slave mode (there are many modes but this one is easy to set up). You need to use a mirroring file system, using DRBD, to keep everything identical on all nodes.

This is the general replication and HA technology used by many. It is built on the commodity computing concept. You see this kind of technology in action when you hit google dot com (the only difference is that google uses modified versions of many open source projects and they have deadly algos for replication).

Dabrowski

9:31 pm on Jun 4, 2008 (gmt 0)

how about an emergency contingency arrangement with a call center? Something like "I'll give you $xx per month to be ready to make 5,000 calls at the drop of a hat".

Because it's easier to post a message to one forum and get 5,000 people to read it. You said yourself you were impressed with the feedback and updates.

Quick & regular acknowledgements of tickets would have been great

How long would it take you to answer hundreds of tickets personally that all pretty much say "I'm having trouble accessing my website"?

himalayaswater, thanks for the info. MS have a certification on cluster services so I'll probably try to do that this year.

bsterz

10:57 pm on Jun 4, 2008 (gmt 0)

Because it's easier to post a message to one forum and get 5,000 people to read it. You said yourself you were impressed with the feedback and updates.

I guess I wasn't thinking of what's easier, but what's better for the client. Hosting is getting pretty tight, and those that go for maximum effort (rather than easier) will win the day.

For example, I was a couple of miles in the air heading to Denver when my sites went down. A phone call would have gotten to me as soon as I landed. I then would have made a call to my partner who could have done some damage control. I wouldn't check network access forums unless I thought there was a problem.

Again - not to dish on TP - I'm just thinking of things that could have made it better. I would want my clients to tell me what I could have done better during the outage.

venti

4:40 am on Jun 5, 2008 (gmt 0)

bsterz, I agree, especially given that TP offers a notification system that can be set up (in orbit) for when your server becomes unresponsive. I didn't get a call. I don't even care if it's an automated message. I don't check the forums on a Saturday night but would pick up the cell.

himalayaswater

8:51 pm on Jun 5, 2008 (gmt 0)

bsterz / venti,

I'm surprised to find that you don't have a 3rd party monitoring system. I have an account with Pingdom for monitoring my web site and other infrastructure such as DNS. It monitors my server from 8 to 10 locations worldwide and instantly lets me know via text and email message when anything goes wrong. I can measure both uptime and downtime using their system.

MLHmptn

9:13 pm on Jun 5, 2008 (gmt 0)

WOW! TP has come through! Now that is service and an SLA like I will most likely never see again! 2 months free service! Sure it cost me some downtime but what other hosting provider would give 2 months service credit let alone 1! I'm not one to sugarcoat things but this is impressive in this day and age.

[edited by: MLHmptn at 9:19 pm (utc) on June 5, 2008]

venti

10:09 pm on Jun 5, 2008 (gmt 0)

himalayaswater,

We use pingdom as well. The problem was we used the http text, https, email, and dns monitors. The servers that went down were dedicated database servers and not all of them went down so the site was still chugging along. We assumed the ping services at TP would cover those.
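
The gap here is that a front-end check can keep passing while a back-end database server is down. One way to cover it is a direct TCP check against each database host. The Python sketch below is only an illustration under assumptions: the host names and the port in the usage comment are hypothetical, and this is not how Pingdom or TP's ping service works.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def down_hosts(targets, timeout=3.0):
    """Return the (host, port) pairs from targets that do not accept connections."""
    return [(host, port) for host, port in targets
            if not port_is_open(host, port, timeout)]


# Hypothetical usage: check two database servers on the default MySQL port.
# failures = down_hosts([("db1.example.com", 3306), ("db2.example.com", 3306)])
```

Run from a machine outside the data centre, a check like this would flag a dead database server even while the web site itself still answers.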

bsterz

1:59 am on Jun 6, 2008 (gmt 0)

WHOAH! Yes I just got the notice. Now THAT'S an apology :)

bsterz

2:46 pm on Jun 6, 2008 (gmt 0)

Am I figuring this right..is this roughly over 2 MILLION bucks to TP? I'm just looking at 9000 servers at around $250 per..
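
The back-of-envelope sum, written out in Python. The 9000 servers and $250 per server are the rough figures from this post, the 2 months is the credit mentioned earlier in the thread, and the sketch assumes every affected server gets the full credit, which may not be the case.

```python
def credit_cost_usd(servers, monthly_fee_usd, months):
    """Total value of a service credit across all affected servers."""
    return servers * monthly_fee_usd * months


one_month = credit_cost_usd(9000, 250, 1)    # 2,250,000
two_months = credit_cost_usd(9000, 250, 2)   # 4,500,000
```

So roughly $2.25M per month of credit, or about $4.5M if the full 2 months applies to all 9000 servers.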

httpwebwitch

4:41 pm on Jun 6, 2008 (gmt 0)

gosh, that's lovely, and generous. Good show, TP!
I hope the (bsterz estimated) $2M is covered by insurance! ;)

DamonHD

5:56 pm on Jun 6, 2008 (gmt 0)

Sometimes, if you didn't skimp on the business interruption policy and premiums, you can come out ahead.

A client of mine lost their building in 9/11 and was not unduly hurt financially and indeed may have been up on the deal IIRC.

Rgds

Damon

rollinj

10:08 pm on Jun 7, 2008 (gmt 0)

I've already switched providers.. they are STILL giving excuses / very slow response times as to why my sites are still timing out..!

Vasili

9:12 am on Jun 10, 2008 (gmt 0)

Problem not totally solved, as they are rotating service standards between clients and peak hours according to client size and importance, it seems....

And, for many US-based eCommerce publishers, being physically hosted in the US is an issue, both for compliance (Patriot Act) and simple management and response commitments.
