Problems with googlebot after a domain move


Air - 4:54 am on Nov 2, 2002 (gmt 0)


OK, let's see. I'm thinking out loud to get some preliminaries out of the way. When a web server uses name-based virtual hosting, it uses the name of the requested host, rather than the IP address, to determine the document path. The IP address is still required in order to find the server in the first place. So let's consider the following simplified Apache httpd.conf file with a name-based virtual host configuration. For this example, assume the IP address of the server is 216.5.39.117:


NameVirtualHost 216.5.39.117:80

# The address in each <VirtualHost> must match the NameVirtualHost line.
# The first container listed also acts as the default host.
<VirtualHost 216.5.39.117:80>
ServerAdmin webmaster@domain.com
ServerName domain.com
DocumentRoot /home/domain/public_html
</VirtualHost>

<VirtualHost 216.5.39.117:80>
ServerAdmin webmaster@example.com
ServerName example.com
DocumentRoot /home/example/public_html
</VirtualHost>

<VirtualHost 216.5.39.117:80>
ServerAdmin webmaster@another-example.com
ServerName another-example.com
DocumentRoot /home/another-example/public_html
</VirtualHost>

Now, somewhere a DNS server has entries in it that allow those host names to resolve to the correct IP address. In a simplified DNS config those entries might look like this:


domain.com A 216.5.39.117
example.com A 216.5.39.117
another-example.com A 216.5.39.117

So along comes an HTTP/1.1-compliant browser and makes a request for example.com/index.html. First, a query is made to a DNS server to obtain the IP address of example.com. The DNS server returns 216.5.39.117, which allows the request to be routed and the web server to be found. Because the browser speaks HTTP/1.1, it carries the host name in the Host header of its request to the web server.
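
To make the role of the host name concrete, here is a minimal Python sketch of the request text such a browser would send once DNS has handed back the IP address. The helper name build_request is made up for illustration, and a real browser sends many more headers than this.

```python
def build_request(host: str, path: str) -> bytes:
    """Return the raw bytes of a minimal HTTP/1.1 GET request.

    Hypothetical helper for illustration only. The Host header is the
    part a name-based virtual host server uses to pick the site.
    """
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",        # the host name travels with the request
        "Connection: close",
        "",                     # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")
```

Calling build_request("example.com", "/index.html") gives a request whose second line is "Host: example.com". The IP address never appears in the request text; it is only used to open the connection.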

Since the web server has been configured to use name-based virtual hosting, it knows to examine the Host header of the request and extract the host name. It then compares that host name against the ServerName in each virtual host container.

In this case it finds example.com within one of those containers, so it uses the DocumentRoot specified for example.com to locate the document index.html. It does this by combining DocumentRoot and URI to form the path of the document to be served, i.e. /home/example/public_html/index.html

If it exhausts the list of virtual hosts without making a match, it uses the first entry in the list to determine the content to serve. It is important to realize that to arrive at this point, either some host name had to resolve to the IP address of this web server, or the request was made specifically by IP address, i.e. [216.5.39.117...]
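
The matching steps above can be sketched in a few lines of Python. This is a simplification under stated assumptions, not how Apache is actually implemented: it matches ServerName only (no ServerAlias or wildcards), and the names VHOSTS and resolve_path are invented for this example.

```python
# Virtual hosts in the order they appear in the example httpd.conf.
VHOSTS = [
    ("domain.com", "/home/domain/public_html"),
    ("example.com", "/home/example/public_html"),
    ("another-example.com", "/home/another-example/public_html"),
]


def resolve_path(host_header, uri, vhosts=VHOSTS):
    """Map a Host header value and a URI to a filesystem path.

    Falls back to the first listed virtual host when nothing matches,
    or when no Host header was sent at all (host_header is None).
    """
    # Host names are case-insensitive and may carry a ":port" suffix.
    name = (host_header or "").split(":")[0].lower()
    for server_name, document_root in vhosts:
        if server_name == name:
            return document_root + uri
    # No match: the first virtual host acts as the default.
    return vhosts[0][1] + uri
```

With this sketch, resolve_path("example.com", "/index.html") returns /home/example/public_html/index.html, while a request made by IP address, resolve_path("216.5.39.117", "/index.html"), falls through to /home/domain/public_html/index.html.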

The main point is that in a name-based virtual hosting setup, the IP address is used to locate and make a request to the correct web server, and the host name carried in the Host header is used to determine the path of the document to retrieve on that web server.

Prior to the introduction of the HTTP/1.1 protocol and HTTP/1.1-capable browsers, the host name was not carried in the request header, so only the IP address could be used to determine the path of the document to be served. Each host container had to have a unique IP address assigned to it in the web server's config file in order for the correct document to be served.

Needless to say, prior to HTTP/1.1 all hosts were set up with unique IP addresses on web servers. Search engines at that time naturally built spiders that requested sites by IP address (because that's what the then-current protocol allowed). When HTTP/1.1 arrived and name-based virtual hosting became possible, early adopters found that search engine spiders had not caught up. They were still requesting hosts by IP address, and so would end up indexing the content of the first site listed in the virtual host containers instead of the host they actually intended to request.

This is why there were so many warnings to webmasters to get a unique IP address for each site and to stay away from name-based virtual hosting. That has changed. Today all modern spiders can accommodate name-based hosting, though you'll still see warnings against it from time to time.

But I digress. To get back to the question at hand: Google comes along and decides to save time. Since DNS lookups can cause delays and can be "expensive" from a performance point of view, they maintain their own list of host/IP address pairings and periodically update that list from regular DNS to pick up any new entries (the update cycle appears to be ~60 days).

As a result, if a site is moved to another web server, one with a different IP address, googlebot may come along, use its own host/IP address pairing, and request the host from the wrong server! If that host has already been removed from the old web server's config file, the web server can't find the host name in any of the virtual host containers, so it dutifully returns the first site listed. But you only get that site's content for the index page, when the root of the domain is requested. Any other page that is requested and does not exist at DocumentRoot+URI of the first site will return a 404.
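
Here is a toy Python simulation of that failure, purely as a sketch. Everything in it is made up for illustration: the new server's address 192.0.2.10 (a documentation-range IP), the page names, and the fetch helper. It only models the behavior described above, not googlebot itself.

```python
# Current DNS: the site has already moved to the new server.
# 192.0.2.10 is an invented address standing in for the new server.
LIVE_DNS = {"example.com": "192.0.2.10"}

# The crawler's private host/IP list, not yet refreshed from DNS.
STALE_CACHE = {"example.com": "216.5.39.117"}

# What each server hosts: {ServerName: set of URIs that exist}, in
# config order. example.com has been removed from the old server.
SERVERS = {
    "216.5.39.117": {"domain.com": {"/", "/about.html"},
                     "another-example.com": {"/"}},
    "192.0.2.10": {"example.com": {"/", "/products.html"}},
}


def fetch(lookup, host, uri):
    """Return (site_served, status) for a request routed via `lookup`."""
    vhosts = SERVERS[lookup[host]]
    # Unknown host name: fall back to the first virtual host listed.
    site = host if host in vhosts else next(iter(vhosts))
    status = 200 if uri in vhosts[site] else 404
    return site, status
```

A browser using live DNS gets fetch(LIVE_DNS, "example.com", "/products.html") == ("example.com", 200). The crawler with the stale list gets ("domain.com", 200) for "/", which is the wrong site's index page, and ("domain.com", 404) for "/products.html".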

Since your browser (IP client) uses DNS for each host request, it correctly resolves your host name (domain name) to the new web server as soon as the change of IP address propagates throughout the DNS servers out there. It returns the correct site, which makes it difficult to know that there's any problem with googlebot. Of course, once googlebot has updated its host/IP pairings from DNS, it finds your site again.

The best thing to do then, IMO, is to leave the site active on the old web server while also running it on the new one. When you start seeing googlebot in the logs of your new server, you'll know it has updated its IP/host file and it is safe to remove the site from the old server (or cancel the old hosting account).
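
If you'd rather not eyeball the logs, a rough Python sketch like the following can pull out the googlebot requests. It assumes your access log records the User-Agent string (Apache's "combined" format does, the plain "common" format does not), and the function name googlebot_hits is just something I made up. Keep in mind that a user agent can be faked, so treat a match as a hint rather than proof.

```python
import re

# Case-insensitive match on the token googlebot puts in its User-Agent.
GOOGLEBOT = re.compile(r"googlebot", re.IGNORECASE)


def googlebot_hits(log_lines):
    """Return the access-log lines that mention googlebot anywhere."""
    return [line for line in log_lines if GOOGLEBOT.search(line)]
```

To use it, open the new server's access log and pass the file object (or a list of lines) to googlebot_hits. An empty result suggests googlebot is still visiting the old IP address.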

[edited by: DaveAtIFG at 6:18 pm (utc) on Nov. 2, 2002; edit reason: cleaned up spacing problems]


Thread source: http://www.webmasterworld.com/website_technology/1331.htm