Forum Moderators: open
I'm using name-based virtual hosts, with a different IP address for each domain, so it would be possible to connect to the IP of domain1.com and request domain2.com.
Or, if the Host header is not specified, the server would return the default host, which is not the target site.
Should I fix this?
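The failure mode described here can be sketched in a few lines — a hypothetical server choosing a site by the request's "Host" header and falling back to a default when the header is absent or unknown (the domains and paths below are made up for illustration):

```python
# Minimal sketch of name-based virtual host selection (hypothetical domains/paths).
VHOSTS = {
    "domain1.com": "/var/www/domain1",
    "domain2.com": "/var/www/domain2",
}
DEFAULT_DOCROOT = "/var/www/default"  # served when Host is missing or unknown

def select_docroot(host_header):
    """Return the document root the server would serve for this Host header."""
    if host_header is None:
        return DEFAULT_DOCROOT  # e.g. an HTTP/1.0 request with no Host at all
    return VHOSTS.get(host_header.lower(), DEFAULT_DOCROOT)

# Connecting to domain1's IP but sending "Host: domain2.com" yields domain2's content:
print(select_docroot("domain2.com"))  # /var/www/domain2
print(select_docroot(None))           # /var/www/default
```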
It seems pretty ridiculous to me how a bot of a major search engine like Google can be so confused ;)
But is there a possibility that Google would try to request a page without specifying the "Host" field in the request?
Technically it's possible and legal, but it strikes me as unlikely. According to my access logs, Googlebot makes HTTP 1.0 requests; the HTTP 1.0 spec (RFC 1945) doesn't require user agents to send a "Host" header as part of requests. But it doesn't forbid the "Host" header, either, and it would be very odd if Google left the header out.
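For what it's worth, the difference is easy to see at the wire level. Here's a sketch that just builds the raw bytes of the two request shapes (the path and hostname are placeholders, not anything Googlebot actually sends):

```python
def build_http10_request(path, host=None):
    """Build a raw HTTP/1.0 GET. RFC 1945 makes the Host header optional."""
    lines = ["GET %s HTTP/1.0" % path]
    if host is not None:
        lines.append("Host: %s" % host)  # legal in HTTP/1.0, just not required
    lines.append("")  # blank line terminates the header block
    lines.append("")
    return "\r\n".join(lines).encode("ascii")

# Without a Host header, a name-based vhost server can only guess the site:
print(build_http10_request("/page.htm"))
print(build_http10_request("/page.htm", host="domain2.com"))
```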
przero2 said:
As of now, today, the bot is still trying to look for pages on the shared IP and encountering all 404s per my access logs.
Wait a minute, are you saying you moved an existing domain, but you're still seeing 404s on the server occupying the old IP address? If so, that's the wrong error code. RFC 2068 (the HTTP 1.1 spec) says servers that receive a request with a mismatched "Host" header MUST serve a generic "Status 400". If you're seeing 404s, either the server isn't running to spec, or Googlebot isn't including the "Host" header.
Now I think I've confused myself. Somebody needs to do an environment capture on Googlebot to see if it's sending a "Host" header with requests. I also recommend anybody having problems like przero2 double-check the configuration of both the old and new servers. There might be something wrong at the web server end.
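One way to do that capture is a throwaway CGI script that echoes whatever headers a visiting bot actually sent. A sketch (Python, assuming the usual CGI convention of passing request headers as HTTP_* environment variables — so a missing HTTP_HOST means the client sent no "Host" header):

```python
#!/usr/bin/env python
# Throwaway CGI sketch: echo back the request headers the server saw.
import os

def collect_headers(environ):
    """Return {header-name: value} for every HTTP_* CGI variable."""
    headers = {}
    for key, value in environ.items():
        if key.startswith("HTTP_"):
            # HTTP_USER_AGENT -> User-Agent, HTTP_HOST -> Host, etc.
            name = key[5:].replace("_", "-").title()
            headers[name] = value
    return headers

if __name__ == "__main__":
    print("Content-Type: text/plain\r\n")
    for name, value in sorted(collect_headers(os.environ).items()):
        print("%s: %s" % (name, value))
```

Drop it somewhere Googlebot will crawl, then check the output it logged: if "Host" never appears, the bot isn't sending the header.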
www.hostdomain.com 216.239.46.26 - - [02/Jul/2002:14:23:56 -0400] "GET /discount-hotels/doubletree.htm HTTP/1.0" 302 249 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
www.hostdomain.com 216.239.46.26 - - [02/Jul/2002:14:23:56 -0400] "GET /discount-hotels/doubletree.htm HTTP/1.0" 200 645 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
www.hostdomain.com 216.239.46.113 - - [02/Jul/2002:14:24:16 -0400] "GET /discount-hotels/mississauga.htm HTTP/1.0" 302 238 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
www.hostdomain.com 216.239.46.113 - - [02/Jul/2002:14:24:17 -0400] "GET /discount-hotels/mississauga.htm HTTP/1.0" 200 645 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
www.hostdomain.com 216.239.46.222 - - [02/Jul/2002:14:25:41 -0400] "GET /discount-hotels/oxnard.htm HTTP/1.0" 302 233 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
www.hostdomain.com 216.239.46.222 - - [02/Jul/2002:14:25:41 -0400] "GET /discount-hotels/oxnard.htm HTTP/1.0" 200 645 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
.........
Obviously Googlebot cannot find those files under the server IP address. As to how the server is configured, I am not sure, as I use a local web hosting provider!
www.hostdomain.com 216.239.46.26 - - [02/Jul/2002:14:23:56 -0400] "GET /discount-hotels/doubletree.htm HTTP/1.0" 302 249 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
www.hostdomain.com 216.239.46.26 - - [02/Jul/2002:14:23:56 -0400] "GET /discount-hotels/doubletree.htm HTTP/1.0" 200 645 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
That doesn't make any sense. Your log sample shows a page redirecting to itself. That's not just weird, that's the HTTP equivalent of a science-fiction "I'm my own grandpa" time paradox. Did you mess up the search-and-replace? Is that log meant to be alternating between sitedomain and hostdomain?
If so, the status "302" is a problem. 302 is "Found", a.k.a. "temporary redirect". User agents are allowed to keep requesting 302'ed URLs at the original address, so Googlebot isn't completely wrong in revisiting the old URL. If you want to permanently redirect a page, you should use status "301" so that Googlebot knows the old address is defunct. (Although, again, status 400 is what the specs require if there's a "Host" header mismatch. Even I think this is getting too complicated.)
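If the old server happens to be Apache (I'm assuming here; see my question about your server software), a permanent redirect is a one-liner with mod_alias in the site config or an .htaccess file — the target domain below is a placeholder:

```apache
# Send a 301 "Moved Permanently" for everything on the old host,
# so crawlers update their index instead of revisiting the old URLs.
Redirect permanent / http://www.newdomain.com/
```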
Here's a key reference for HTTP status codes: [w3.org...]
By the way, what server software are you running these domains on? Might be relevant.
None were from Googlebot or any other search engine that sends real traffic.
On the other hand, I have come across more than a few sites that have had problems because the server does accept additional host headers. This can lead to duplicate content problems, including lost PageRank and, rarely, hijacks.
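One defensive setup, assuming Apache with name-based vhosts (hostnames below are placeholders): since Apache hands requests with unrecognized or missing "Host" values to the first listed vhost, you can make that first vhost a catch-all that redirects everything to the canonical name instead of serving content under stray hostnames:

```apache
# Sketch only: make the default (first) vhost a canonical-host redirect.
NameVirtualHost *:80

<VirtualHost *:80>
    # First vhost = default; unknown or missing Host headers land here.
    ServerName catchall.example.com
    RedirectPermanent / http://www.example.com/
</VirtualHost>

<VirtualHost *:80>
    # The real site, only ever served under its canonical name.
    ServerName www.example.com
    DocumentRoot /var/www/example
</VirtualHost>
```

That way duplicate-content URLs under the bare IP or a parked hostname collapse onto one canonical address.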