Forum Moderators: phranque


DNS Failover - Server Failover to Backup

DNS Failover, automatically failing over primary to a backup

         

lakelife99

2:58 pm on Jan 30, 2019 (gmt 0)

5+ Year Member



Hi all,

Is anyone here using DNS to fail over to a backup server when the primary fails? I'm just looking for thoughts on what folks are using. I work in the industry and would like to see what's important to webmasters and server admins. Thanks!




[edited by: not2easy at 4:08 pm (utc) on Jan 30, 2019]
[edit reason] please see Charter/ToS [/edit]

lammert

6:58 pm on Jan 31, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I use a DNS failover system with 5 servers in different data centers, with different providers, in different geo locations. The architecture is LAMP with a multi-master relational database that supports automatic failover. The database layer uses a quorum to decide which of the databases in the network contain valid data, and automatically resynchronizes out-of-sync peers. The front-end software is relatively static and can be updated on all nodes from a repository.

The DNS TTL is 300 seconds, which causes a significantly larger number of queries to the DNS server than in conventional setups where the TTL is several hours, a day, or even longer. If your site receives significant traffic and you are using a DNS provider with query limits, the increase in DNS queries may become a problem.
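To illustrate the trade-off (the resolver count below is made up; only the relationship matters), the worst-case query volume scales roughly inversely with the TTL:

```python
# Rough upper bound on daily authoritative DNS query volume as a
# function of TTL. Assumes each caching resolver re-queries once per
# TTL window all day long - a worst case, since real resolvers only
# re-query while the name stays popular with their users.

def daily_queries(unique_resolvers: int, ttl_seconds: int) -> int:
    seconds_per_day = 86_400
    return unique_resolvers * (seconds_per_day // ttl_seconds)

# Dropping the TTL from 24 hours to 300 seconds multiplies the
# worst-case query volume by 288x for the same resolver population.
print(daily_queries(10_000, 86_400))  # 10,000 queries/day
print(daily_queries(10_000, 300))     # 2,880,000 queries/day
```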

Detection of a failed node and switching to one of the secondary IP addresses takes about 45 seconds, which is good enough for my application.
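A monitor along these lines typically declares a node down only after several consecutive failed health checks, trading detection speed against flapping; with a 15-second check interval and a threshold of 3, detection takes about 45 seconds, in line with the figure above. The class below is a minimal sketch of just that decision logic; the actual DNS update (pulling the failed A record via your provider's API) is left as a callback, since every provider's API differs:

```python
# Sketch of consecutive-failure detection for DNS failover monitoring.
# A node is only declared failed after `threshold` consecutive bad
# health checks; a single success resets the counter.

class FailoverMonitor:
    def __init__(self, threshold: int = 3, on_failover=None):
        self.threshold = threshold
        self.on_failover = on_failover  # e.g. remove the A record via API
        self.consecutive_failures = 0
        self.failed_over = False

    def record_check(self, healthy: bool) -> None:
        if healthy:
            self.consecutive_failures = 0
            self.failed_over = False
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.failed_over:
            self.failed_over = True      # fire at most once per outage
            if self.on_failover:
                self.on_failover()
```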

robzilla

7:19 pm on Jan 31, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Sweet set-up, lammert. AWS, Azure or something else for geo DNS? Mine is similar, except it's master-slave, unfortunately, and my "quorum" is just a simple scripted replication check. If anything goes wrong (and with replication it frequently does), I have to go in there to fix it. I'm rethinking the set-up now, as I need to move to new servers anyway, and I'm learning Ansible to hopefully automate a bunch of these tasks.
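A scripted replication check of that kind often boils down to comparing the replica's reported lag against a threshold. A minimal sketch of the decision (the lag value comes from whatever your database reports, e.g. MySQL's Seconds_Behind_Master; the function name and threshold here are assumptions, not robzilla's actual script):

```python
from typing import Optional

def replication_ok(seconds_behind: Optional[int], max_lag: int = 30) -> bool:
    # None means the replica is not replicating at all (e.g. the SQL
    # or I/O thread stopped) - always treat that as a failure.
    if seconds_behind is None:
        return False
    return seconds_behind <= max_lag
```

In practice the interesting part is what happens on failure: page a human, or, as robzilla suggests, hand the repair steps to something like Ansible.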

DNS failover isn't perfect, but if a server goes down or I take one out for maintenance it usually only takes a minute or so before the (human) traffic flow stops, so the damage, although probably worse than with something like a heartbeat solution, is usually not too bad.

phranque

12:29 am on Feb 1, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



welcome to WebmasterWorld [webmasterworld.com], lakelife99!

iamlost

5:52 pm on Feb 2, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I love reading of others' efforts beyond the common defaults.

My foundational architecture is Linux (Debian) + Apache <-> Redis + PostgreSQL + C++.

I use GeoDNS with regional traffic direction (5 regions) and a hidden master with ACLs (Access Control Lists) and TSIG (Transaction Signature) authentication over VPN.
Recursion is OFF!
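For reference, in BIND the pieces listed above (hidden master, TSIG-authenticated transfers, recursion off) look roughly like the fragment below; the key name, secret, zone name, and secondary addresses are all placeholders, not iamlost's actual configuration:

```conf
// named.conf fragment on a hidden master (illustrative only)
key "xfer-key" {
    algorithm hmac-sha256;
    secret "base64-secret-goes-here==";
};

options {
    recursion no;               // authoritative-only: refuse recursive queries
    allow-transfer { none; };   // default-deny zone transfers
};

zone "example.com" {
    type master;
    file "zones/db.example.com";
    // only secondaries presenting the TSIG key may transfer the zone
    allow-transfer { key "xfer-key"; };
    also-notify { 192.0.2.10; 192.0.2.11; };
};
```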

With the increasing use of open resolvers, etc., a visitor's actual geolocation may be masked. By measuring RTT (round-trip time) and, if it is above a threshold, requesting the client IP and doing a remapping redirect, one may cut RTT by up to 50%.

robzilla

3:56 pm on Feb 3, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



With the increasing use of open resolvers, etc., a visitor's actual geolocation may be masked. By measuring RTT (round-trip time) and, if it is above a threshold, requesting the client IP and doing a remapping redirect, one may cut RTT by up to 50%.

That's a good point; people sometimes get misdirected. The IPv4 space shifts quite a bit as well, and the underlying GeoIP databases are not infallible. I've never really measured the impact of that; it gets lost in the averages. Do you measure the RTT server-side or client-side? I don't suppose you have each client ping all 5 regions just in case they've been misdirected.

iamlost

6:31 pm on Feb 4, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@robzilla: Server side.
On receipt of the client's initial SYN(chronize) packet, the RTT between sending the server's SYN-ACK(nowledge) and receiving the client's final ACK is calculated and compared against the decision threshold.
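That measurement happens inside the TCP stack (server-side you would typically read the kernel's smoothed RTT, e.g. via TCP_INFO on Linux), but the same quantity can be approximated from user space by timing a plain TCP connect, which completes exactly one SYN / SYN-ACK / ACK handshake. A sketch of that approximation, with an assumed threshold, rather than iamlost's implementation:

```python
import socket
import time

def handshake_rtt_ms(host: str, port: int = 443, timeout: float = 2.0) -> float:
    # Time a full TCP connect: one SYN / SYN-ACK / ACK round trip,
    # plus a little user-space overhead, so it slightly overestimates.
    start = time.monotonic()
    sock = socket.create_connection((host, port), timeout=timeout)
    rtt_ms = (time.monotonic() - start) * 1000.0
    sock.close()
    return rtt_ms

def should_remap(rtt_ms: float, threshold_ms: float = 150.0) -> bool:
    # Remap only when the handshake looks slow enough to justify the
    # extra lookup-and-redirect cost described in this thread.
    return rtt_ms > threshold_ms
```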

iamlost

6:47 pm on Feb 4, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




@robzilla: additional...
For me, within North America the 'savings' is 250 ms round trip on average. However, within east Asia/Oceania it can be more than a second.

Note: monthly average latency numbers are available from major ISPs for various major backbone connections, e.g. Verizon currently shows 45 ms round trip within North America and 250 ms from New Zealand to South Korea. However, these are times between major hubs, not between a typical server and client, and most definitely not between a server and a mobile client.

robzilla

6:48 pm on Feb 4, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Interesting. How do you tell the client to connect to a different IP? Or is it domain- or URL-based remapping?

iamlost

5:17 pm on Feb 5, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@robzilla:
First I have to say that not all client IPs are feasibly discernible if they arrive via proxy (though typically >80% are). So even if the RTT exceeds the threshold, it can sometimes take longer to determine a closer/shorter connection than to simply stay as-is.

That said:
If the handshake RTT exceeds the threshold and a client IP can be readily determined, a client IP lookup is performed; the returned CNAME is forwarded to the server identified as closest to the client, a qualified redirect URL is created and sent to the client, and the client resends its request to the new, closer server.
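That flow can be sketched as a small request-handler decision; the region map, hostnames, and function names below are hypothetical stand-ins for iamlost's actual CNAME-based lookup:

```python
from typing import Optional

# Placeholder region -> regional-server map; in the setup described
# above this comes back from the client-IP lookup as a CNAME.
REGION_SERVERS = {
    "na": "na.example.com",
    "eu": "eu.example.com",
    "apac": "apac.example.com",
}

def remap_redirect(region: Optional[str], path: str) -> Optional[str]:
    """Return a Location header value for a redirect, or None to serve locally."""
    host = REGION_SERVERS.get(region) if region else None
    if host is None:
        return None            # client IP/region not discernible: stay as-is
    return f"https://{host}{path}"
```

The caller would send a 302 with the returned URL, after which the client reconnects to the closer server and all subsequent round trips are shorter.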

robzilla

7:35 pm on Feb 5, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That's pretty cool, and 250 ms is definitely noticeable. You probably incur some latency doing the lookup and forwarding, but everything will be faster from there on.

Keep fighting the good fight :-)