We've recently moved peds.wustl.edu to a new web server.
The IP address for the domain changed from 128.252.238.214 to 128.252.238.55.
The DNS change took place around midnight Thursday August 8th.
The old information should only have been cached by other DNS
servers for 8 hours, and by end machines for up to 12 hours.
Certainly the cached information should have cleared by today, Tuesday Aug 13th. (That midnight time was Central TZ, BTW.)
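In case anyone wants to watch the change propagate: a quick way to check what a resolver is currently handing out for the name is `dig peds.wustl.edu A`, or a few lines of Python like the rough sketch below (this assumes the third-party dnspython package, which is not part of the standard library):

```python
# Show the address and remaining TTL a resolver currently reports for a name.
# Requires the third-party "dnspython" package.
import dns.resolver

def check_record(name):
    answer = dns.resolver.resolve(name, "A")   # dnspython >= 2.0; older versions use .query()
    for record in answer:
        # answer.rrset.ttl is how much longer the responding server is
        # willing to cache this answer, in seconds.
        print("%s -> %s  (TTL %ds)" % (name, record.address, answer.rrset.ttl))

if __name__ == "__main__":
    check_record("peds.wustl.edu")
```

Running that (or dig against a few different nameservers) shows which caches have picked up the new address and which are still serving the old one.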
We have both the old and new servers up and running, and I've
noticed that google is still indexing content on the older
machine rather than the newer one. This is distressing because
we use you guys for our site search (because you're better by
many orders of magnitude than anything I could write myself),
and I don't want to shut off the old server before googlebot
starts looking at the new server. I'm afraid that all of our
pages would fall out of your index if you failed to find them
at the old server.
So, I'm hoping that as a workaround for this sort of issue, you might first arrange for the googlebots' collective DNS cache to be manually expired on a regular schedule, say every day or two. A more correct longer-term solution would be to have the googlebot honor the client TTL for DNS data and refresh expired DNS data properly.
Could you drop me a line with some indication of how long I
should leave the old server up and running in order to avoid
dropping out of the google index?
Thanks :)
-matt
This is an intriguing idea, and the best solution yet if Google are interested. How could a server advertise in an HTTP reply that it no longer serves a domain? I've checked the 4xx replies in RFC 2616 [ietf.org] but nothing seems suitable.
If a particular HTTP response would cause a DNS retry then the world would be a better place for Google and webmasters. How about a response of "404 Not This IP Anymore"?
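Just to make the idea concrete, the old IP could run a throwaway server whose only job is to return that status, so spiders knew to go re-resolve the name. A rough sketch in Python follows; the status line is entirely made up, since no such code exists in HTTP today and no client would understand it:

```python
# Tiny server for the OLD address that answers every request with the
# proposed (hypothetical) "Not This IP Anymore" status, as a signal to
# re-resolve the domain's DNS. No real client understands this today.
import socket

RESPONSE = (
    b"HTTP/1.1 404 Not This IP Anymore\r\n"   # hypothetical status line
    b"Content-Length: 0\r\n"
    b"Connection: close\r\n"
    b"\r\n"
)

def serve(port=80):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("0.0.0.0", port))
    server.listen(16)
    while True:
        conn, _ = server.accept()
        conn.recv(8192)        # read (and ignore) the request
        conn.sendall(RESPONSE)
        conn.close()

if __name__ == "__main__":
    serve()
```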
Matt, can you get port redirection? If the machines are nearby it shouldn't cost any significant bandwidth. It saved me when I switched.
That server handles other sites on port 80, same IP address, using HTTP's Host: header to differentiate. So, we'd need to assemble the entire inbound port 80 message (or at least enough of it to find the Host: header) before knowing where to send the message. I don't have the expertise to do this.
...and if googlebot played nice (ie: followed the rules in DNS land), then that wouldn't be necessary, as it would already be indexing the site at the new URL.
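For reference, the Host:-sniffing forwarder described above isn't a huge amount of code. Here's a rough sketch, assuming Python 3 on the old box and with the backend addresses filled in as placeholders; it reads just enough of each request to find the Host: header, then relays the whole connection to the right server:

```python
# Minimal name-based forwarder for the old IP: peek at the Host: header of
# each inbound HTTP request, then relay the connection to the right backend.
# Backend addresses here are placeholders.
import socket
import threading

BACKENDS = {
    "peds.wustl.edu": ("128.252.238.55", 80),   # the moved site -> new server
}
DEFAULT_BACKEND = ("127.0.0.1", 8080)           # everything else stays local

def pick_backend(request_head):
    for line in request_head.split(b"\r\n"):
        if line.lower().startswith(b"host:"):
            host = line.split(b":", 1)[1].strip().split(b":")[0].decode("latin-1")
            return BACKENDS.get(host.lower(), DEFAULT_BACKEND)
    return DEFAULT_BACKEND          # no Host: header (old HTTP/1.0 clients)

def pipe(src, dst):
    try:
        while True:
            data = src.recv(8192)
            if not data:
                break
            dst.sendall(data)
    finally:
        dst.close()

def handle(client):
    head = b""
    while b"\r\n\r\n" not in head and len(head) < 8192:
        chunk = client.recv(8192)
        if not chunk:
            client.close()
            return
        head += chunk
    backend = socket.create_connection(pick_backend(head))
    backend.sendall(head)           # replay the bytes we already read
    threading.Thread(target=pipe, args=(client, backend), daemon=True).start()
    pipe(backend, client)

def main():
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("0.0.0.0", 80))
    server.listen(64)
    while True:
        conn, _ = server.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()

if __name__ == "__main__":
    main()
```

(An Apache name-based virtual host with mod_proxy on the old box would do roughly the same job with no custom code, if that's an option.)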
While a "Not at this IP address" 4xx response might be neat, it really shouldn't be required. The problem already has a good solution: if you have software that uses DNS information, that software should respect the DNS TTL values. This includes googlebot.
Our TTL was 8 hours for other DNS servers, and 12 hours for end clients, so the worst case for stale data to clear was 20 hours. Checking our logs, I can see that 20 hours after the DNS change, over 99% of our human traffic was using the new IP address. 30 hours after the DNS change, 100% of our human traffic was using the new IP address. However, we still get traffic on the old server for that site.
At this point, 100% of that traffic is web site indexing software. Google, Inktomi, DirectHit, etc. I'm concerned about each of these, but most concerned about Google since we use Google for our web site search. If we turn off the site on the old server, their index will start dropping pages as googlebot tries to index them and fails to find them.
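For anyone curious how I'm separating crawler traffic from human traffic: it just comes down to the User-Agent field in the access logs. Something like this rough tally works (it assumes Apache's "combined" log format, and the crawler signature list is illustrative rather than complete):

```python
# Tally crawler vs. human requests in an Apache "combined" access log by
# matching well-known crawler signatures in the User-Agent field.
import re
import sys
from collections import Counter

# Incomplete, illustrative list of crawler User-Agent substrings.
CRAWLER_SIGNATURES = ("googlebot", "slurp", "directhit", "crawler", "spider")

# combined format ends with: "referer" "user-agent"
UA_PATTERN = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"\s*$')

def classify(logfile):
    counts = Counter()
    with open(logfile, errors="replace") as handle:
        for line in handle:
            match = UA_PATTERN.search(line)
            ua = match.group("ua").lower() if match else ""
            kind = "crawler" if any(sig in ua for sig in CRAWLER_SIGNATURES) else "human"
            counts[kind] += 1
    return counts

if __name__ == "__main__":
    print(classify(sys.argv[1]))
```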
I can understand why search engine spiders might want to store DNS information longer than the advertised shelf-life. They look up so many zillions of domain names that if they had to look up (for example) our domain name's IP address every 12 hours (the end-client TTL) or even every 20 hours, then they'd be doing hundreds of DNS queries per year for our site (every 20 hours works out to roughly 440 lookups a year) that gave the same response back each time. Multiply hundreds of DNS queries by zillions of domains, and you have a very serious bandwidth expense and a lot of wasted time while indexing.
So, I can see why they do it that way, but it causes problems when the DNS data changes and they've cached the old data with no immediate intention of refreshing that data. Change your domain's IP address, and you'll likely see your site gradually vanish from Google's index until they update their DNS info, and then you'll gradually be re-indexed.
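To put "honor the TTL" in concrete terms: cache the answer, stamp it with an expiry taken from the record's TTL, and only go back to DNS once that expiry passes. A rough sketch of such a cache is below (again assuming the dnspython package); a spider built on something like this would have been on our new IP within a day of the change:

```python
# A DNS cache that honors the record's TTL: an answer is reused until its
# TTL runs out, then the name is looked up again. Assumes dnspython.
import time
import dns.resolver

class TTLRespectingCache:
    def __init__(self):
        self._cache = {}    # name -> (list of addresses, expiry timestamp)

    def lookup(self, name):
        now = time.monotonic()
        cached = self._cache.get(name)
        if cached and cached[1] > now:
            return cached[0]                    # still fresh, reuse it
        answer = dns.resolver.resolve(name, "A")
        addresses = [record.address for record in answer]
        self._cache[name] = (addresses, now + answer.rrset.ttl)
        return addresses
```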
Maybe a nice compromise solution would be a form on Google's site that allows webmasters to say "hey, get fresh DNS info for this domain name".
-matt
The form idea seems attractive, but I'd rather make one change to my server than have to fill in a 'please update DNS' form on each search engine. Also, if you're migrating a box from one IP to another you'd have to fill in the form many times.
I was fortunate to be able to get port forwarding earlier this year; the server had several hundred domains each in several search engine databases.
Can you keep the content at the old address until Google stops spidering it? GoogleGuy indicated a while ago that Google had increased the DNS retry frequency (which is presumably a TTL override on their main DNS cache) but in my opinion it's better not to IP hop in any hurry.
I guess I'm trapped in the mindset of "I have this problem now" and looking for a timely solution. The form approach would fix things right away. The HTTP status code would fix things only after servers started supporting it, those servers were rolled out on new and existing sites, search engines recognized the new status code, and webmasters configured their servers to send it.
The form could work quicker: search engines implement it, webmasters use it.
> Also, if you're migrating a box from one IP to another
> you'd have to fill in the form many times.
...and for many search engines.
I'm comfortable automating that task, if only those many search engines had those little forms. I'd happily share perl code to help others do the same if it were an option.
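Something like the sketch below would be all it took per engine; the URLs and form field names are entirely made up, since no engine actually offers such a form (and it's Python here rather than perl, but the idea is the same):

```python
# Sketch of notifying several search engines that a domain's DNS data has
# changed. The endpoints and field names are hypothetical -- no engine
# offers such a form today.
from urllib.parse import urlencode
from urllib.request import urlopen

HYPOTHETICAL_FORMS = [
    ("Google",  "https://www.google.com/refresh-dns",  "domain"),
    ("Inktomi", "https://www.inktomi.com/refresh-dns", "hostname"),
]

def notify_all(domain):
    for engine, url, field in HYPOTHETICAL_FORMS:
        body = urlencode({field: domain}).encode("ascii")
        with urlopen(url, data=body) as response:   # POSTs the form
            print("%s: HTTP %d" % (engine, response.status))

if __name__ == "__main__":
    notify_all("peds.wustl.edu")
```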
Thinking about it a bit more, the HTTP status code does sound like a more tidy solution, but the problem would be around for quite some time before that solution could be implemented.
Maybe when I get back from vacation I'll see if I can find a contact in the HTTP WG or with Apache to see if either of those groups could get the idea off the ground. In the meantime, I've already dropped a note to Google to see if they'd implement the "update DNS" form.
> Can you keep the content at the old address
> until Google stops spidering it?
That's going to be my approach. I'm still hoping that some other solution will become possible... even if it's well after the spiders have moved on to our new IP address. It just seems like a problem begging to be fixed :)
-matt
Google has been indexing a now ancient version at an old IP for months, and I no longer even have access to that computer to delete the index page. If I could, I would.
It has now indexed all pages *inside* my site - but insists upon returning to my tired old index page on a computer I can't even FTP myself!
It is a pretty poor state of affairs when DNS changes propagate across the Net in less than a week - and the GoogleBot takes a good fraction of a year!
What you could try to do is get a slightly modified link to your site crawled. By chance I had one with /?src=me after it for log purposes that was crawled, and my new index showed up in Google2 and 3!
- but after the dance the same old prehistoric index page was cached!
I think that there are a lot of admins out there that have the TTL way, way too low.
There is no reason to have a TTL of 8 hours unless you are beginning a countdown to moving your site to a new IP.
30 to 60 days is a lot more appropriate for a stable site. If you need to move it, you can begin decrementing the TTL record 30 days ahead of time.
At three weeks out from when you plan to move, change it to 3 weeks, etc.
GraniteCanyon and a few of the other free DNS services would not accept the zone transfer if the TTL was too low.
Windows DNS defaults to a TTL of 1 hour. Incredible, huh?
With so many DNS records out there with such ridiculously low TTLs, Google cannot help but ignore them.
On the other hand, a site with a reasonable TTL record should have its decisions respected.
Wow! What were they thinking? or... *Were* they thinking?
> On the other hand, a site with a reasonable TTL record
> should have its decisions respected.
The problem then becomes: is a short TTL reasonable (they're about to move) or unreasonable? No way for google to tell. :(
-matt