This is the largest dedicated server provider in the world, and I am sure there are millions of 404s every second from hundreds of thousands of sites. This will undoubtedly be a test of how quickly they can respond to the situation and get servers back online. It has been a few hours so far!
[newsblaze.com...]
[edited by: Brett_Tabke at 11:34 am (utc) on June 1, 2008]
[edit reason] added link [/edit]
An explosion ate our servers - how cool is that?
(snip)
An explosion. How cool is that? It's way cooler than having a guy accidentally dig up your wiring with a backhoe, or run it over with a truck. It's an outright privilege compared with the squirrel that ate the fiber cable. Imagine spending nights and weekends rallying your troops around a disaster recovery plan, all to do battle with a squirrel. That would be humiliating. My wall of flame incinerates your squirrel.
(snip)
source:www.assembla.com
I wish I could quote the whole thing, cuz it made me lawl.
I have 3 sites, in 3 different geographical areas (although in the UK that doesn't really mean too much), using 2 different ISPs.
Apart from my domain name registration - I do everything in house. That way I can control everything.
The sites are connected via SDSL on hardware VPN routers which are all connected to each other in a triangle.
Each server is a DNS server, all servers are mirrored and the DNS uses 'round robin' to load balance. That is, each server address is given out in turn by each server.
Web data is mirrored, and I'm using MySQL with circular replication, so all data is (almost) synchronously replicated.
On top of that, each server backs up electronically to the next, and I have one physical copy which is taken off site.
Should 1 (or 2) sites fail, all I'd need to do is delete one host record from each DNS zone and the change would be immediate. As soon as the link is restored, or the server replaced, replication would kick in and away it goes.
Should all 3 sites fail, I have backup servers (offline) that can be switched over, backup restored and running in a couple of hours.
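For anyone following along, here is a rough sketch of the round-robin behaviour described above, and of what deleting one host record does. It is a toy model only, not a real DNS server: the class name RoundRobinZone and the three addresses (taken from the documentation-only ranges) are made up for illustration.

```python
from collections import deque


class RoundRobinZone:
    """Toy model of one DNS name with several A records (not a real DNS server)."""

    def __init__(self, addresses):
        self.records = deque(addresses)

    def answer(self):
        """Return the current record order, then rotate so the next query starts one record later."""
        current = list(self.records)
        self.records.rotate(-1)
        return current

    def remove(self, address):
        """Delete a host record, e.g. when that site has failed."""
        self.records.remove(address)


# Hypothetical addresses for the three sites.
zone = RoundRobinZone(["192.0.2.10", "198.51.100.10", "203.0.113.10"])

# The first address in each answer is the one most clients will try.
first_choices = [zone.answer()[0] for _ in range(6)]
print(first_choices)

# Site 2 fails: delete its record and the rotation carries on with the rest.
zone.remove("198.51.100.10")
after_failure = [zone.answer()[0] for _ in range(4)]
print(after_failure)
```

New lookups stop seeing the failed site as soon as the record is gone. Clients that already cached the old answer are a separate matter.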
Each server is a DNS server, all servers are mirrored and the DNS uses 'round robin' to load balance. That is, each server address is given out in turn by each server.
Unless I'm mistaken and you have accounted for this problem, doesn't that configuration mean when one server goes down 33% of your visitors will be unable to reconnect for a period of time?
Have you actually pulled the wire out, and tested what actually happens next?
Unless I'm mistaken and you have accounted for this problem, doesn't that configuration mean when one server goes down 33% of your visitors will be unable to reconnect for a period of time?
To answer both of these questions....
Yes I have. The round robin kicks in and, once the cache time has expired (see below), all is well once again.
Regarding DNS cache times, if a host has performed a DNS query it will cache the result locally. The cache time varies; I believe it's around 10 minutes or less for most applications and clients, though it may be longer on ISPs' DNS servers.
So, yes 33% is correct, but only of visitors that have performed a lookup recently.
Unfortunately, there is no way to get around this. Without DNS caching I'm sure the internet would be a much slower place, but this is still quicker than changing name servers and waiting for propagation.
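To put rough numbers on that, here is a small back-of-the-envelope sketch. The 600 second cache time is only the "around 10 minutes" guess from above, and both function names are hypothetical. It assumes each cached answer points at one of the servers with equal probability.

```python
def stale_share(total_servers, failed_servers):
    """Share of recently cached lookups that point at a failed server,
    assuming round robin spread the answers evenly."""
    return failed_servers / total_servers


def remaining_stale_seconds(ttl_seconds, seconds_since_lookup):
    """How much longer one client keeps using its cached (dead) address."""
    return max(0, ttl_seconds - seconds_since_lookup)


# Three servers, one down: about a third of cached clients are affected.
print(round(stale_share(3, 1) * 100), "% of recently cached visitors")

# A client that looked the name up 4 minutes before the failure,
# with an assumed 10 minute cache time, is stuck for 6 more minutes.
print(remaining_stale_seconds(600, 240), "seconds until it looks up again")
```

A client whose cache has already expired does a fresh lookup and is not affected at all, which is why only recent visitors are hit.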
I would like to avoid rants or speculations (such as 'the guy didn't even seem to care..' types of comments)
Mod - feel free to nuke if this is not fitting for this thread..
Did well:
1. Kept chat working even under a crushing load
2. When tickets started getting moldy, if I chatted to ask, I never got a default response, the chat tech always checked on the ticket and sometimes nudged it along for me
3. Helped with config changes on my servers where needed
4. Were respectful and patient (to a degree befitting the situation)
5. Kept updates coming out - I was refreshing that page every 15 minutes or so as I had customers waiting to hear from me - so I really appreciated the constant info
6. Seemed to be fast on their feet in acquiring name servers, equipment, etc
7. Did generally try to help rather than pushing me off to the next guy
Could have done better:
1. Could have pushed out config suggestions more quickly. I'm pretty sure they knew about many of the issues I would run into such as the resolvers dying and could have been a bit quicker on pushing out the info
2. Could have been a bit more forthcoming about the DNS servers dying. This was kinda downplayed IMO and was a major issue for me and my customers. Five days after the fire and I still don't have access to my zone files.
3. Could have pushed a customer chat/discussion area where we could share our ideas. Many customers don't know about the forums and would have benefited from some peer assistance.
4. A phone call about the outage on Saturday would have been great. They have my number in the admin area, and it would have really changed my outlook if they had told me things were going to get a bit bumpy.
5. Quick & regular acknowledgements of tickets would have been great. Some tickets waited days without any updates. Just a simple - 'we are aware of this ticket and it's in the queue' would have been good, and they could have paid an intern to do it
For my part, I've learned a bit too. I'm now creating a comprehensive phone list of those clients of mine that would be affected by future outages. I plan to call them if this happens again so they don't feel blindsided by me when their sites go down
I'm also formulating a more robust strategy for when this happens. I don't think I'm charging enough to many clients as a 'real' strategy is expensive, so I'm re-evaluating my hosting price schedule. I want to provide better service than the other guy and I know that's not cheap
Overall I'm happy with their response and as I said in a previous post I feel better about them than I did before the outage. I was actually formulating an exit strategy last month, but I'm now considering keeping my major servers with them. I'm moving mail and DNS elsewhere however as this has taught me that you can't keep all your eggs in one basket
Your thoughts? Honestly, I'm looking also for ways I could have better served my customers during this outage, so any perspectives are welcome :)
Even for the people updating status, a team of several people probably couldn't type fast enough to list out every problem that was being fixed, as it was fixed.
Some issues might not have been pertinent to mention in case systems were being hot-wired to get them back on, but with some security aspects not fully implemented. Don't let potential hackers have too much information.
Ticket processing. Even a team of a hundred people could take many hours to read tens of thousands of tickets, without even acting on any of them. Adding on the time making calls to technicians that are already trying to rewire and configure things, takes the response time through the roof.
I'd guess that they were wishing they had ten times as many staff at present.
I'll bet folks over there are saying to their staff something like this:
"I want to hear ideas - ANY ideas of how we could have mitigated this issue".
Given the extreme solutions they went to technically, I'll bet with a little planning and partnering they could have some more human solutions on deck for the next crisis. The folks they have did a great job, but some 'perception' issues could be avoided.
For example - how about an emergency contingency arrangement with a call center? Something like "I'll give you $xx per month to be ready to make 5,000 calls at the drop of a hat". Or how about an automated calling system (* comes to mind)? Heck I have one and I'm TINY.
Again - I'm not unhappy with TP - quite the converse, but we should explore our perceptions of what could be improved from a 'blue sky' perspective - that's what we're doing over here. We can't DO all of the things, but we're surely exploring them.
Bill
;; QUESTION SECTION:
;1.0.0.127.in-addr.arpa. IN PTR
;; ANSWER SECTION:
1.0.0.127.in-addr.arpa. 86400 IN PTR ev1s-127-0-0-1.ev1servers.net.
You need to use LVS for virtual IP management. You also need central storage so that users see the same dynamic or static pages. You need to configure MySQL or any other DB server in master/slave mode (there are many modes, but this one is easy to set up). You need to use a mirroring file system to keep everything identical on all nodes, using DRBD.
This is the general replication and HA technology used by many. It is built on the commodity computing concept. You see this kind of technology in action when you hit google.com (the only difference is that Google uses modified versions of many open source projects, and they have deadly algorithms for replication).
how about an emergency contingency arrangement with a call center? Something like "I'll give you $xx per month to be ready to make 5,000 calls at the drop of a hat".
Because it's easier to post a message to one forum and get 5,000 people to read it. You said yourself you were impressed with the feedback and updates.
Quick & regular acknowledgements of tickets would have been great
How long would it take you to personally answer hundreds of tickets that all pretty much say "I'm having trouble accessing my website"?
himalayaswater, thanks for the info. MS have a certification on cluster services so I'll probably try to do that this year.
Because it's easier to post a message to one forum and get 5,000 people to read it. You said yourself you were impressed with the feedback and updates.
I guess I wasn't thinking of what's easier, but what's better for the client. Hosting is getting pretty tight, and those that go for maximum effort (rather than easier) will win the day.
For example, I was a couple of miles in the air heading to Denver when my sites went down. A phone call would have gotten to me as soon as I landed. I then would have made a call to my partner who could have done some damage control. I wouldn't check network access forums unless I thought there was a problem.
Again - not to dish on TP - I'm just thinking of things that could have made it better. I would want my clients to tell me what I could have done better during the outage.
I'm surprised to find that you don't have a 3rd party monitoring system. I have an account with Pingdom to monitor my web sites and other infrastructure such as DNS. It monitors my servers from 8 to 10 locations worldwide and instantly lets me know via text and email when anything goes wrong. I can measure both uptime and downtime using their system.
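If you want a feel for what such a check does before signing up for anything, here is a minimal do-it-yourself sketch. To be clear, this is not Pingdom's API or method. It is just a plain HTTP probe using the Python standard library, and the names probe and should_alert and the "two failures before alerting" rule are my own assumptions.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Run one HTTP check and return (is_up, detail).

    `opener` can be swapped out so the function is testable without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
            return (200 <= status < 400, f"HTTP {status}")
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        return (False, f"HTTP {exc.code}")
    except OSError as exc:
        # Covers URLError, DNS failures, refused connections and timeouts.
        return (False, f"unreachable: {exc}")


def should_alert(results, min_failures=2):
    """Alert only when several check locations agree the site is down.

    `results` is a list of booleans, one per location (True means up).
    """
    failures = sum(1 for is_up in results if not is_up)
    return failures >= min_failures
```

You would call probe() on a schedule from a machine outside your own hosting (ideally more than one), feed the results to should_alert(), and send yourself a text or email when it returns True. Running the check from inside the same data center defeats the purpose.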
[edited by: MLHmptn at 9:19 pm (utc) on June 5, 2008]
And, for many US-based eCommerce publishers, being physically hosted in the US is an issue, both for compliance (Patriot Act) and simple management and response commitments.