|Problems with googlebot after a domain move|
Formerly titled "googlebot - server problem"
I've run into a weirdness involving an Apache server and Google spidering. It may also affect other spidering search engines.
It seems that Google, in the interests of conserving bandwidth, remembers IP addresses so it doesn't have to do millions of DNS lookups as it spiders the web.
Our hosting server has recently had a couple of domains move elsewhere. But Google is still spidering our server for these domains (which proves to me that Google DOES remember the IP addresses, BTW) and indexing the main server pages *as if they belonged to the domain that has left*!
On doing some checking with WebBug, I find that I can interrogate our server's IP with the Host: request header set to any host name not known to the server and, instead of getting the expected 404 Not Found status, I receive a status of 200 OK and receive the server's own home page.
Is this an Apache configuration issue? If so, what configuration parameters need to be tweaked, and how, to prevent this behaviour? Or, *horrors*, is it a hitherto unsuspected weakness in the backbone of the Internet?
I've just done a sort of informal survey of several dozen servers (from my browser bookmarks).
Every Apache server I checked (including www.w3.org, who should be able to "get it right" if anybody can) displays this same behaviour.
Netscape Enterprise server (the couple I stumbled across) also seems to display the same behaviour.
Much as it *PAINS* me to admit it, IIS *sometimes* (about 4 out of 5) seems to return a four-oh-something error under these circumstances.
Can anyone offer any light on this rather arcane subject or point me to a post that has covered this previously, please?
I think Google remembering the IP address is down to the TTL (time to live) set on the domain's DNS servers.
I had the same problem with a customer. He was trying to FTP into his site after he moved it over to us, but the ISP he was using had the old DNS, so it just kept pointing him to the old ISP.
The old ISP had set their TTL at 3 weeks on the DNS, so it took him three weeks before he could update via his URL.
The phenomenon you describe is completely due to Google's caching of DNS, and not to a server config problem, per se.
The best solution is to leave your "old" site active on the old server until it is spidered on the new server, and all Googlebot activity on the old server ceases.
I prefer to leave the site up on the old server until I see evidence that the site on the new server is included in search engine results. In order to do this, I include a subtle difference between the content on the old server and the new server. Other than including this "old server marker tag", I keep the site on the old server updated along with that on the new server.
In most cases, I prefer to wait two complete correct-IP spidering cycles before removing the site from the old server.
Hmm that is food for thought
thanks, I will munch on that!
I'm presently moving a client's site and plan to simply leave a redirect at the old site after DNS propagation settles down, probably in .htaccess. Hopefully this will ensure spidering at the new site and avoid the duplicate content penalties that others have reportedly been awarded by Google.
The duplicate content problem is only an issue when changing domain names, as opposed to changing to a new hosting service (meaning a new IP address). If the site's IP address changes but the domain name remains the same, the only advantage to placing a redirect on the old server is to forward anyone who accesses your site using the old IP address directly, as in "http://192.168.0.1/index.html". Also, this redirect must not be done until all DNS has been updated globally (including search engine DNS caches), or an infinite request loop can be created:
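As a rough illustration of the redirect described above, here is a minimal sketch of what the old server's config might contain. It assumes Apache with mod_alias, and uses www.example.com and 192.0.2.10 as stand-ins for the moved domain and the old server's address (neither comes from this thread). The comments trace how the loop arises for a client that is still working from stale DNS:

```apache
# OLD server (hypothetical address 192.0.2.10), AFTER the site has moved.
<VirtualHost 192.0.2.10>
    ServerName www.example.com
    # Send every request to the same path on the domain name.
    Redirect permanent / http://www.example.com/
</VirtualHost>

# Loop for a client (or spider) whose cached DNS still says 192.0.2.10:
#   1. GET http://www.example.com/ goes to 192.0.2.10 (stale DNS)
#   2. old server answers 301, Location: http://www.example.com/
#   3. client follows the redirect, resolving the name from its cache
#   4. and lands on 192.0.2.10 again: back to step 1
```

Once every resolver hands out the new address, step 3 reaches the new server and the loop cannot form.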
Properly implemented, this redirect is not a bad idea, but not a top-priority either. It should be no cause for worry if you don't want to pay to keep the old host after the move is complete and DNS is updated everywhere.
My thinking is to redirect to the new IP, Jim, and avoid the need for a DNS lookup at all. I'm not clear what triggers a duplicate content penalty and want to err on the cautious side. Another reason to redirect is that Google is reported to be slower than most at DNS updates. My intent is to continue this redirect for about two months depending on what the site logs reveal.
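A minimal sketch of that kind of redirect, assuming mod_alias is available on the old server and using 203.0.113.20 as a made-up address standing in for the new server:

```apache
# OLD server's .htaccess (or vhost container).
# 203.0.113.20 is a hypothetical address for the NEW server.
Redirect permanent / http://203.0.113.20/
```

Because the target is an IP address, the client never consults DNS for it, so stale DNS cannot send it back to the old server. It only works if the new server serves the site when the Host: header is the bare IP, i.e. the site has a dedicated IP there or is the default (first-listed) vhost.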
Thanks for your reply, JDMorgan.
Unfortunately, leaving the old sites up is not an option as we no longer have control over them.
Sharing a single IP address between multiple domains has become common practice these days simply because of the shortage of IP addresses. I'm sure this will continue to be the case until IPv6 finally gets rolled out (and maybe still even then).
What I need to know is... if there is something that can be done at the Apache configuration level to force the server to respond with a 404 Not Found (or other 400 series) status when the host name given in the Host: header field is unknown to that server.
The HTTP/1.1 specification (RFC 2616) seems to say that the server should reject such a request (section 5.2 calls for a 400 Bad Request when the host is not valid on the server), but every Apache server I checked last night seems to return a 200 OK status and serve the root site on that server. Seeing as Apache is supposed to be HTTP/1.1 compliant, this makes me think that it *is* a server configuration issue.
The scary side of this from an SEO viewpoint is that, to the search engines, such a situation *could* look like the deliberate hijacking of a domain name to create a back door to the server's root site and as such, could attract penalties up to and including banning. It is also prejudicial to the findability of the original domain that has moved to a new server.
Hmmm... Interesting... redirect to new IP. I suppose it's possible that you might end up with an IP address as the URL in the Google SERP, though. I hope not, but please let us all know if it happens!
On the other hand, maybe GoogleBot will take the hint and figure it out. It'll be interesting either way - please post!
Sorry, you've got me confused... You control the server where the site used to be hosted, but now no longer control/have ownership/something the old site contents?
If you still have an account with the old hosting company, or if you own the server, then yes, you can configure it to return whatever you want. Is this the case?
Oh, I see that I can read your initial post to mean that you are a hosting provider, but you didn't actually say so. In that case, just re-configure the old virtual host and stick an IP redirect on it, as DaveAtIFG intends to do. Or disable directory indexes (Options -Indexes), leave the directory structure empty, and that should give you a 404 response for any page requested (though possibly a 403 Forbidden for the bare directory itself). The Apache documentation may have more useful info than I can provide!
I wouldn't worry too much about the domain hijacking issue. There are thousands of "sites" out there that return "Congratulations on successfully installing Apache server!" or some such message.
IPv6 reportedly will support far more than one unique IP address for each square foot of the earth (the address space is 2^128). It wouldn't break my heart if that killed the need for shared-IP name-based hosting at all - too many "messy" issues with it.
Okay, I've been digging into the Apache 1.3 documentation and come up with the following...
|Name-based Virtual Host Support |
Now when a request arrives, the server will first check if it is using an IP address that matches the NameVirtualHost. If it is, then it will look at each <VirtualHost> section with a matching IP address and try to find one where the ServerName or ServerAlias matches the requested hostname. If it finds one, then it uses the configuration for that server. If no matching virtual host is found, then the first listed virtual host that matches the IP address will be used.
As a consequence, the first listed virtual host is the default virtual host. The DocumentRoot from the main server will never be used when an IP address matches the NameVirtualHost directive. If you would like to have a special configuration for requests that do not match any particular virtual host, simply put that configuration in a <VirtualHost> container and list it first in the configuration file.
On this particular server (not ours btw, but we cooperate closely with the owner/admin), the "root" domain is also set up as the first vhost entry. So per the above, whenever the server sees an unknown Host: header, the root domain is served in lieu.
I think the answer will be to create a new vhost definition at the TOP of the list in httpd.conf...
ScriptAlias /cgi-bin/ /home/default/cgi-bin/
CustomLog logs/default-access_log combined
and then set up a .htaccess ErrorDocument in the default/public_html directory that points to a custom error handler script.
Assuming the vhost entry works as expected and doesn't generate any unexpected/undesirable side-effects (comments please, all you Apache gurus :)), the error handler script can then contain a list of moved domains and return the appropriate HTTP headers. (Or just simply return a straight 404, or just not even bother to create the default/public_html directory in the first place and let the server return the 404 Not Found status header!)
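For anyone following along, a minimal sketch of how such a catch-all vhost might be laid out, assuming Apache 1.3 name-based hosting. The address 192.0.2.10, the name default.invalid, the handler script moved.cgi and the base site's DocumentRoot are all placeholders, not values from the real server, and the ErrorDocument is shown in the container rather than in .htaccess purely for brevity:

```apache
# Hypothetical address; use the server's real IP here.
NameVirtualHost 192.0.2.10

# FIRST vhost for this address = default for any unrecognised Host: header.
<VirtualHost 192.0.2.10>
    # Placeholder name; it only needs to differ from every real site.
    ServerName default.invalid
    DocumentRoot /home/default/public_html
    ScriptAlias /cgi-bin/ /home/default/cgi-bin/
    CustomLog logs/default-access_log combined
    # Hypothetical handler script that knows the list of moved domains.
    ErrorDocument 404 /cgi-bin/moved.cgi
</VirtualHost>

# The base site and the customer sites follow, matched by ServerName.
<VirtualHost 192.0.2.10>
    ServerName www.base-domain.com.au
    DocumentRoot /home/base/public_html
    CustomLog logs/access_log combined
</VirtualHost>
```

With an empty DocumentRoot, a request for / may produce a directory listing or a 403 rather than a 404, depending on whether Indexes is enabled, so it is worth checking the result with a real request.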
If this approach works, it might be worth suggesting to apache.org for inclusion in their template httpd.conf file together with appropriate comments, thus making all future Apache installations more Google-friendly :).
It would certainly save having to reconfigure the server every time someone up and leaves for another hosting service!
It's been a while since I actually did this, so I may have forgotten the details. However, I think the solution you need is a combination of IP-based virtual hosts and name-based virtual hosts.
When mixing and matching NameVirtualHost and VirtualHost directives, you have to set one IP aside for the NameVirtualHosts. Set that IP as the default for IP VirtualHosts. Then set up a NameVirtualHost on that IP for the name or names that are supposed to be served by the IP, and a default NameVirtualHost with some content letting visitors know that the domain they are looking for moved. Perhaps a custom 404?
Like I said, it's been a while since I did this, so I may have forgotten an important detail or three.
If you decide to use the <VirtualHost *> approach, make sure that you check the Apache docs to confirm whether that should be the first or the last vhost - I could easily imagine that the first listed matching host would serve every request, in which case having <VirtualHost *> at the top could make all other vhosts stop working.
Sounds to me like you're on the right track. I hope someone with more virtual server experience will come along and comment on this.
<edit> Thanks dingman! </edit>
Ok let's see, thinking out loud to get some preliminaries out of the way. When a web server is using name based virtual hosting it uses the name of the host instead of the IP address to determine the document path for the host being requested. The IP address is still required in order to find the server in the first place. So let's consider the following simplified Apache httpd.conf file with a name based virtual host configuration. For this example assume the IP address of the server is 126.96.36.199:
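A minimal sketch of such a config, pieced together from the host names and the DocumentRoot path mentioned further down in this post. The other two DocumentRoot paths are assumptions that simply follow the same pattern, so treat this as an illustration rather than the original listing:

```apache
NameVirtualHost 126.96.36.199

# First container listed: also the fallback when no ServerName matches.
<VirtualHost 126.96.36.199>
    ServerName domain.com
    # Assumed path, following the pattern of example.com below.
    DocumentRoot /home/domain/public_html
</VirtualHost>

<VirtualHost 126.96.36.199>
    ServerName example.com
    DocumentRoot /home/example/public_html
</VirtualHost>

<VirtualHost 126.96.36.199>
    ServerName another-example.com
    # Assumed path, following the same pattern.
    DocumentRoot /home/another-example/public_html
</VirtualHost>
```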
Now, somewhere a DNS server has entries in it that allow those host names to resolve to the correct IP address. In a simplified DNS config those entries might look like this:
domain.com A 126.96.36.199
example.com A 126.96.36.199
another-example.com A 126.96.36.199
So along comes a browser (http/1.1 compliant) and it makes a request for example.com/index.html. A query is made to a DNS server to obtain the IP address of example.com, and the DNS server returns 126.96.36.199, which allows the request to be routed and the web server to be found. Now, because the browser uses the http/1.1 protocol, it carries the host name in the http header of the request to the web server.
Since the web server has been configured to use name based virtual hosting, it knows to examine the http header of the request, so that it can extract the host name. It then compares the host name against ServerName(s) in the virtual host containers.
In this case it finds example.com within one of those containers, so it uses the DocumentRoot specified for example.com to locate the document index.html. It does this by combining DocumentRoot and URI to form the path to the document to be served, i.e. /home/example/public_html/index.html
If it exhausts the list of virtual hosts without making a match, then it uses the first entry in the list of virtual hosts to determine the content to serve. It is important to realize that to arrive at this point, some host name had to have resolved to the IP address of this web server, or the request was made specifically by IP address, i.e. [126.96.36.199...]
The main point is that in a name based virtual hosting setup, the IP address is used to locate and make a request to the correct web server, and the host name carried in the http header is used to determine the path of the document to retrieve on the web server.
Prior to the introduction of the http/1.1 protocol and http/1.1 capable browsers, the host was not carried in the request header, therefore only the IP address could be used to determine the path of the document to be served. Each host container had to have a unique IP address assigned to it in the web server's config file in order for the correct document to be served.
Needless to say, prior to http/1.1 all hosts were set up with unique IP addresses on web servers. Search engines at that time naturally built spiders that requested sites by IP address (because that's what the then-current protocol allowed). When http/1.1 arrived and name based virtual hosting became possible, early adopters found that search engine spiders had not caught up: they were still requesting hosts by IP address, and so would end up indexing the content of the first site listed in the vhost containers instead of the host they actually intended to request.
This is why there were so many warnings to webmasters about having a unique IP address for your site and to stay away from name based virtual hosting. That has changed: today all modern spiders can accommodate name based hosting, although you'll still see the warnings against name based hosting from time to time.
But I digress. To get back to the question at hand: Google comes along and decides to save time. Since DNS lookups can cause delays and can be "expensive" from a performance point of view, they decide to maintain their own list of host/IP address pairings, and update that list periodically from regular DNS to pick up any new entries (the update cycle appears to be ~60 days).
As a result, if a site is moved to another web server, one with a different IP address, googlebot may come along and use its own host/IP address pairing and request the host from the wrong server! If that host has already been removed from the old web server's config file, then the web server dutifully returns the first site listed in the virtual host containers because it can't find the host name in any of the virtual host containers. But this is only true for the index page when the root domain is requested, any other page that is requested and does not exist in the DocumentRoot+URI of the first site will return a 404.
Since your browser (IP client) will use DNS for each host request, it correctly resolves your host name (domain name) to the new web server as soon as the change of IP address propagates throughout the DNS servers out there, and so returns the correct site, making it difficult to know that there's any problem with googlebot. Of course, once googlebot has its new host/IP pairings by updating from DNS, it finds your site again.
The best thing to do then IMO, is to leave the site active on the old web server, while also running it on the new. When you start seeing googlebot in the logs of your new server you'll know it has updated its IP/host file and it is safe to remove (or cancel) the other hosting account.
[edited by: DaveAtIFG at 6:18 pm (utc) on Nov. 2, 2002]
[edit reason] Cleaned up spacing problems [/edit]
Thanks, Air, for your clear explanation of the process.
I'm pretty sure the approach I outlined earlier in the thread will work. I just need to convince the owner/admin of the server to give it a try! :)
A little bit more info about the server may be in order. It is set up for name-based virtual hosting on a single IP which means (I think) that the base site needs to also be defined as a virtual host. The base site actually has two entries in the vhost list, like ...
CustomLog logs/access_log combined
CustomLog logs/access_log combined
... which I think is so the server will respond to Host: headers containing either www.-base-domain-.com.au or just -base-domain-.com.au
I think (please correct me if I'm wrong :)) that a single vhost container with a ServerName of *.base-domain.com.au will achieve the same effect?
From my study of the Apache docs on virtual hosting, I believe this is not an optimal configuration. The <VirtualHost> lines *should* have the actual IP address as given in the NameVirtualHost directive. Using host names there requires Apache to resolve each name to an IP at startup, slowing down the startup sequence somewhat. Plus, if any of the names fail to resolve for some reason then those vhosts will be off the air for the duration. The DNS servers are running on the same machine, one of them on the same IP as the server and the second on a different IP. So the present configuration would also *require* the DNS processes to be running before Apache is started up. Correct?
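To make the difference concrete, a minimal sketch of the two forms, assuming a hypothetical server address of 192.0.2.10 and a made-up DocumentRoot for the base site:

```apache
NameVirtualHost 192.0.2.10

# Host name in the container: Apache has to resolve it via DNS at startup.
# <VirtualHost www.base-domain.com.au>

# IP address in the container: no DNS lookup is needed at startup.
<VirtualHost 192.0.2.10>
    ServerName www.base-domain.com.au
    # Hypothetical path for the base site.
    DocumentRoot /home/base/public_html
    CustomLog logs/access_log combined
</VirtualHost>
```

In the second form the host name appears only in ServerName, where it is compared against the Host: header at request time instead of being looked up when the server starts.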
The only reservation I have about my proposed solution (adding a vhost along these lines to the top of the vhost list) ...
ScriptAlias /cgi-bin/ /home/default/cgi-bin/
CustomLog logs/default-access_log combined
... is that it may somehow bollix up the base site, but I can't figure out any way in which it could.
|But this is only true for the index page when the root domain is requested, any other page that is requested and does not exist in the DocumentRoot+URI of the first site will return a 404. |
This may be true of the listings in Google prior to a monthly update. But if the site gets re-spidered and re-indexed the story is a little different. We actually saw this happen a couple of days ago, before Google had settled down from its last update. Now though, it just returns the base domain index page and 404's for the other indexed links, exactly as you described. So it would appear that Google decided to revert to its previous index of the site concerned.
The other factor with this particular site is that it has not moved to a new server. It has been decommissioned entirely and removed from the DNS. Maybe this means that Google will keep trying a little bit longer with its stored IP, unless and until it starts receiving 404 Not Found responses.
|....a ServerName of *.base-domain.com.au will achieve the same effect? |
Yes, you are correct, but not with the ServerName directive. A ServerAlias directive must be added to the VirtualHost container to specify "alias" or "wildcard" hosts. I agree with you that they have two VirtualHost containers for the purpose of having the host with and without "www".
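A minimal sketch of the single-container version, assuming a hypothetical server address of 192.0.2.10 and a made-up DocumentRoot:

```apache
<VirtualHost 192.0.2.10>
    ServerName www.base-domain.com.au
    # Covers the bare domain plus any other sub-domain.
    ServerAlias base-domain.com.au *.base-domain.com.au
    # Hypothetical path for the base site.
    DocumentRoot /home/base/public_html
    CustomLog logs/access_log combined
</VirtualHost>
```

The bare domain is listed explicitly because the wildcard *.base-domain.com.au needs something in front of the dot, so it would not match base-domain.com.au on its own.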
|The DNS servers are running on the same machine, one of them on the same IP as the server and the second on a different IP. So the present configuration would also *require* the DNS processes to be running before Apache is started up. Correct? |
Yes that's right (..and a huge single point of failure)
|... is that it may somehow bollix up the base site, but I can't figure out any way in which it could. |
It wouldn't mess it up, but if I read that right it means that entering the IP address of the server will then return a 404, which is technically incorrect. Some ISPs return a list of links to other vhosts on that server as the "default" site, while some ISPs prefer to display their own site as the default. If you can find one that is accommodating, or you admin your own server, I don't see any real problem with what you suggest.
You're right about it being "a huge single point of failure". But it's been quite reliable so far, touch wood :).
As far as requests to the IP without a Host: header, I can handle that pretty easily with a custom 404 script, I think, by just returning a 301 Moved Permanently status and a Location: -www.base-domain.com.au-. Then for anything else I can return 404 or 301 as appropriate.
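As an alternative to doing this inside the 404 script, the same split could probably be handled in the default vhost itself with mod_rewrite. This is a sketch of that different technique, not the script approach described above. It assumes mod_rewrite is loaded, and 192.0.2.10 and default.invalid are placeholders:

```apache
<VirtualHost 192.0.2.10>
    ServerName default.invalid
    DocumentRoot /home/default/public_html

    RewriteEngine On
    # Host: header absent or empty (HTTP/1.0 clients) ...
    RewriteCond %{HTTP_HOST} ^$ [OR]
    # ... or the server's bare IP address used as the host.
    RewriteCond %{HTTP_HOST} ^192\.0\.2\.10(:80)?$
    RewriteRule ^/(.*)$ http://www.base-domain.com.au/$1 [R=301,L]

    # Any other unrecognised host falls through to the (empty)
    # DocumentRoot and gets a 404 for any path that does not exist.
</VirtualHost>
```

Requests carrying the name of a departed domain match neither condition, so they are not redirected to the base site.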
The only remaining question in my mind is whether returning a 301 Moved Permanently to GoogleBot that returns the "same" URL that Google already "knows" will prompt GoogleBot to do a DNS lookup to update its stored IP for that URL.