Msg#: 24177 posted 10:12 am on May 29, 2004 (gmt 0)
Google has been seriously lagging behind in indexing the pages of one of my sites. Yahoo currently adds almost three times as many pages to its index each day as Google does. Here is the Server Header for the site in question:
Whereas here is the Server Header for a site that uses a completely different system which Google is indexing at a much faster rate:
"Status: HTTP/1.1 200 OK Date: Sat, 29 May 2004 10:05:09 GMT Server: Apache/1.3.27 (Unix) (Red-Hat/Linux) FrontPage/184.108.40.20623 mod_python/2.7.8 Python/1.5.2 mod_ssl/2.8.12 OpenSSL/0.9.6b DAV/1.0.3 PHP/4.3.4 mod_perl/1.26 Last-Modified: Fri, 21 May 2004 11:18:38 GMT ETag: "3d621d-12ed-40ade58e" Accept-Ranges: bytes Content-Length: 4845 Connection: close Content-Type: text/html"
Is the way my servers are 'talking' to Google putting it off? The site that is being indexed more slowly has many, many more inbound links and has been in the Google index for over two years. It's just that there's a lot of advice floating around just now to 'Check your Server Headers', but I'm not sure what I'm supposed to be looking for.
Msg#: 24177 posted 11:09 am on May 29, 2004 (gmt 0)
I have noticed Googlebot doing something similar. My theory is that if Googlebot sees the "Content-Length:" field in the header, it assumes the content is static and crawls the website more aggressively. If it doesn't see it, it assumes the content is dynamic and slows down its crawling a bit.
This is just a theory of mine that supports what I have seen Googlebot do. I may of course be totally wrong.
Msg#: 24177 posted 11:44 am on May 29, 2004 (gmt 0)
That's funny, racer_x. I was thinking just that the other day. With dynamic content, the server doesn't know in advance how much data it's sending, so it sends the response in chunks. If the content length is known, as with a static page, the server will use the Content-Length header instead. So those headers are a dead giveaway.
If there's any truth that G favors static pages, this is something to be aware of.
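For anyone who wants to check which of the two cases their own server falls into, here is a minimal sketch in Python. It only looks at a pasted block of response headers and reports whether the body was framed with Content-Length or sent chunked. The function name and the two sample header blocks are made up for illustration, and whether Googlebot really treats the two cases differently is still only the theory discussed above. Note too that a script can send Content-Length if its output is buffered, so this is a hint rather than proof that a page is static.

```python
def guess_body_framing(raw_headers: str) -> str:
    """Return 'chunked', 'content-length', or 'unknown' for a raw header block.

    Hypothetical helper: it only inspects header names and values and makes
    no claim about how any crawler interprets them.
    """
    headers = {}
    for line in raw_headers.splitlines():
        # Split on the first colon only, so values such as dates stay intact.
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()

    if "chunked" in headers.get("transfer-encoding", "").lower():
        return "chunked"
    if "content-length" in headers:
        return "content-length"
    return "unknown"


# Illustrative header blocks, not taken from any real site.
static_example = (
    "HTTP/1.1 200 OK\n"
    "Content-Length: 4845\n"
    "Content-Type: text/html"
)
dynamic_example = (
    "HTTP/1.1 200 OK\n"
    "Transfer-Encoding: chunked\n"
    "X-Powered-By: PHP/4.3.4\n"
    "Content-Type: text/html"
)

print(guess_body_framing(static_example))
print(guess_body_framing(dynamic_example))
```

Paste in the output of whatever header checker you use and compare the result for the slow site and the fast site.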
Msg#: 24177 posted 10:25 am on May 31, 2004 (gmt 0)
The "X-Powered-By: PHP/4.3.4" line can be removed and the "Server: Apache/1.3.27 (Unix) [etc.]" line can be changed, but as stevenmusumeche points out, the 'Server' and 'X-Powered-By' HTTP headers are not an issue for Google.
DaveAtIFG makes a good point: more PR helps with crawling.
A quick server response also helps encourage Googlebot to fetch more pages during its time on the site.
The main benefit of the 'Last-Modified' header is that Googlebot can send 'If-Modified-Since' headers. If it asks for a page that hasn't changed then the server can send a 304 (not modified) response, allowing Googlebot to crawl deeper into the site instead of indexing the same pages again and again.
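To make the 'If-Modified-Since' exchange concrete, here is a rough sketch in Python of the decision a server makes. It is not how Apache or any particular server implements it. The function name is invented for this example, and the timestamp is borrowed from the Last-Modified header quoted earlier in the thread.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def conditional_status(last_modified: datetime,
                       if_modified_since: Optional[str]) -> int:
    """Return 304 if the page is unchanged since the client's copy, else 200.

    Hypothetical helper; last_modified must be a timezone-aware datetime.
    """
    if not if_modified_since:
        return 200  # No conditional header, so send the full page.
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # Unparseable date: ignore the header and send the page.
    if since.tzinfo is None:
        # A date given without a usable zone is treated as UTC here.
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= since:
        return 304
    return 200


# Timestamp taken from the Last-Modified header quoted in this thread.
page_changed = datetime(2004, 5, 21, 11, 18, 38, tzinfo=timezone.utc)

print(conditional_status(page_changed, "Fri, 21 May 2004 11:18:38 GMT"))
print(conditional_status(page_changed, "Thu, 20 May 2004 00:00:00 GMT"))
print(conditional_status(page_changed, None))
```

The first call is the case described above: the crawler already has the current version, so the server can answer 304 with no body. In the other two calls the full page is sent with a 200.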
Could this be due to simple PR differences between the sites? It's often reported that sites with higher PR are spidered more frequently.
No, the site that is not being crawled much is PR4 and the site that is being indexed really fast is PR0 (it has not yet been given a rank as it has only been in the index two months and Google hasn't done a PR update for ages).