This is the largest dedicated server provider in the world, and I am sure there are millions of 404s every second from hundreds of thousands of sites. This will undoubtedly be a test of how quickly they can respond to the situation and get servers back online. It has been a few hours so far!
[newsblaze.com...]
[edited by: Brett_Tabke at 11:34 am (utc) on June 1, 2008]
[edit reason] added link [/edit]
(complete text available at theplanet forums linked in the original post above)
This evening at 4:55pm CDT in our H1 data center, electrical gear shorted, creating an explosion and fire that knocked down three walls surrounding our electrical equipment room.
...
This is a significant outage, impacting approximately 9,000 servers and 7,500 customers. All members of our support team are in, and all vendors who supply us with data center equipment are on site. Our initial assessment, although early, points to being able to have some service restored by mid-afternoon on Sunday.
...
[edited by: MLHmptn at 5:23 am (utc) on June 1, 2008]
[edited by: phranque at 9:57 am (utc) on June 1, 2008]
[edit reason] edited for fair use [/edit]
Oh, and everyone, if you are affected, don't forget to pause your PPC ads such as AdWords.
warra, warra day!
I have a plan in place right now, but it is not adequate as I don't have complete access to the backup, so...
I have set my mind to moving my main domain to another registrar and buying a cheap hosting package to set up an emergency plan where all I will need to do is change the name servers on the domain.
This evening at 4:55pm CDT in our H1 data center, electrical gear shorted, creating an explosion and fire that knocked down three walls surrounding our electrical equipment room.
Luckily I am not a customer any more; however, if I were, I'd want to know why the increased load on the cooling system, which must have preceded this for quite some time, went ignored.
An explosion strong enough to knock down three walls, to my knowledge, only comes from a well established fire providing furnace-like conditions.
I've seen that, and it's impressive!
I wish EV1 and their customers the best of luck with a speedy recovery!
Jim
That being said, I wanted to establish an alternate host to be able to test some programming/db changes we want to make. It's turned out to be a lot more involved than I thought. I'm running into stuff like trying to get Zend Optimizer configured, getting into the nitty gritty of the existing osCommerce site, modding the config files, and a lot of other boring stuff.
Once we get the alternate host working, we should be able to just switch the DNS pointers and reload a backup. However, getting to that point is pretty involved. My guess is that trying to do something like this for dedicated servers might be even more involved.
I'm just glad that I'm not under pressure to get the second site working with the first one being down.
chris
We have initiated our own DR plan and already have our mirror up and running on the other side of the country; we are just waiting on our downtime timeline to expire before we switch DNS over.
Sadly, many people have DNS through them, which means this is not an option, as the core ev5 ev6 DNS system is out as well, affecting not only onsite servers but thousands of offsite servers as DNS expires globally.
An explosion strong enough to knock down three walls, to my knowledge, only comes from a well established fire providing furnace-like conditions.
If the insulation on the cables had been smoldering and outgassing for a while, and then the fumes were lit by a short circuit, you could see a pretty significant "event".
Another good candidate for the explosion: batteries. If the fire was in their UPS room, depending on the chemical makeup of the batteries they were using and the quantity (most likely thousands of pounds worth of batteries), you could get anything from a small "whump" to something just short of Hiroshima. Lead-acid batteries (like the one in your car) can detonate pretty spectacularly under the right conditions, and are very common in large data centers and telco switches (cheap and efficient), and if you've ever seen a Li-ion laptop battery have a thermal runaway reaction [theinquirer.net], you'd prolly never rest a laptop on your crotch again.
Anyway, good luck to those affected. Hopefully they'll have backup servers online quickly, and nobody loses too much data.
edit add: youtube video [youtube.com] of a laptop Li-ion battery in thermal runaway
[edited by: grelmar at 4:54 pm (utc) on June 1, 2008]
To keep you up-to-date, here is the latest information about the outage in our H1 data center. We expect to be able to provide initial power to parts of the H1 data center beginning at 5:00 p.m. CDT. At that time, we will begin testing and validating network and power systems, turning on air-conditioning systems and monitoring environmental conditions. We expect this testing to last approximately four hours.
Following this testing, we will begin to power-on customer servers in phases. These are approximate times, and as we know more, we will keep you apprised of the situation.
We will update you again around 2:30 p.m. this afternoon.
[forums.theplanet.com...]
Both of our E-Commerce websites are down, and have been since 5PM yesterday. Completely unacceptable.
Never blame the host, most big hosts do a really good job at infrastructure but there's always the unexpected, including natural disasters, no matter how well you plan and design your data center.
Now you know you need a contingency plan.
It doesn't cost much to have a lesser powered backup server on standby with a carbon copy of your site ready to go live at a moment's notice. It can even be kept up-to-date using a cron job with rsync (on Linux) so you've only lost an hour's worth of data (orders) worst case. I also keep a daily backup locally, so if all of my best laid plans fail simultaneously, I can still upload to a new server elsewhere from scratch.
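The hourly rsync mirror described above might be sketched as follows. This is a minimal illustration, not the poster's actual setup: the host name `standby.example.com`, the paths, and the function names `build_rsync_command` and `mirror` are all hypothetical, and it assumes rsync and key-based SSH access are already set up on both machines.

```python
import subprocess


def build_rsync_command(src_dir, dest_host, dest_dir, dry_run=False):
    """Build an rsync command line that mirrors src_dir to a standby server.

    Host names and paths passed in are placeholders (assumptions),
    not values taken from the thread.
    """
    cmd = [
        "rsync",
        "-a",         # archive mode: recurse, keep permissions, times, symlinks
        "-z",         # compress data in transit
        "--delete",   # remove files on the standby that no longer exist locally
        "-e", "ssh",  # use SSH as the transport
    ]
    if dry_run:
        cmd.append("--dry-run")  # report what would change without copying
    # A trailing slash on the source copies the directory's contents,
    # not the directory itself.
    cmd.append(src_dir.rstrip("/") + "/")
    cmd.append(f"{dest_host}:{dest_dir}")
    return cmd


def mirror(src_dir, dest_host, dest_dir, dry_run=False):
    """Run the mirror once; raises CalledProcessError if rsync fails."""
    cmd = build_rsync_command(src_dir, dest_host, dest_dir, dry_run)
    subprocess.run(cmd, check=True)
```

A small script that calls something like `mirror("/var/www/site", "standby.example.com", "/var/www/site")` could then be scheduled hourly with a crontab entry such as `0 * * * * /usr/bin/python3 /home/user/mirror_site.py` (the script path is made up here). Copying files this way does not give you a consistent snapshot of a live database, so orders stored in a database would need a dump step of their own.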
Besides, there are many other reasons to do this such as complete hardware failure, getting hacked, having major routers go down outside the data center, etc.
A few years ago a main router at Level 3 blew out in the Los Angeles region and cut off a large chunk of the state. It took 6 hours to restore because there was only 1 backup part, it was a 3 hour drive away (and back), and for some odd reason there was nobody where the part was located who could drive it to LA. Sounds stupid, I know, but that's how big business operates.
So having a backup server in another part of the country is generally a good idea.
While you're at it, put your server on a 3rd party dynamic DNS service so you can move your site from server to server in mere seconds.
In these IntarWeb days that's not so hard to do!
If one server or host or DNS server dies or makes some horrible technical SNAFU or backbone carriers are having another tiff, etc, you only lose a portion of your income.
And of course if you have smart DNS and front-ends and so on like (say) the BBC or Google do, then almost immediately all traffic can be diverted away from the failed parts and (almost) no users at all see any loss of service for more than at most a few minutes as determined by your DNS timeouts.
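The DNS-based diversion described above can be sketched roughly as below. This is an illustration under stated assumptions, not how the BBC or Google actually do it: the addresses come from the reserved documentation ranges, and `healthy_addresses` and `worst_case_outage_seconds` are hypothetical helper names. The second function simply adds up the delays, assuming resolvers honor the record's TTL.

```python
def healthy_addresses(servers):
    """Return the addresses that should stay in the DNS answer.

    servers maps an IP address to True (passing health checks) or False.
    If every server is failing, return them all rather than an empty answer.
    """
    up = [ip for ip, ok in servers.items() if ok]
    return up if up else list(servers)


def worst_case_outage_seconds(ttl, check_interval, failures_before_removal):
    """Upper bound on how long a resolver may keep sending users to a dead server.

    Detection takes at most check_interval * failures_before_removal seconds,
    and a resolver that cached the old answer just before the record was
    removed can hold it for another ttl seconds.
    """
    return check_interval * failures_before_removal + ttl


# Example with made-up values: one of two servers has failed its checks.
servers = {"192.0.2.10": True, "198.51.100.20": False}
print(healthy_addresses(servers))  # ['192.0.2.10']
print(worst_case_outage_seconds(ttl=300, check_interval=30,
                                failures_before_removal=3))  # 390
```

With a 5 minute TTL the bound is dominated by the TTL itself, which is why short DNS timeouts matter if you want traffic to move within a few minutes.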
I have my relatively insignificant sites and DNS spread over 6 sites in 5 countries in the US, UK and AsiaPac, and am usually not too panicked when something fails (like my entire host in Atlanta losing power and UPS for many hours recently, or the disc failing in my main UK co-lo server). I can have cheaper hardware in each location too, since I'm relying on multiple machines to reduce the impact of any one failure. I can be ... ahem ... highly-strung, and this approach has probably been my biggest single investment in putting off cardiac failure! B^>
Rgds
Damon