Since then, the Googlebot has continued to try to download the sitemap despite the explicit deletion in the sitemap console. It ain't there no more, guys - and I've told you it isn't. Back off.
Crawling, however, resumed at a reasonable rate. Until Tuesday - when all of a sudden the Googlebot started issuing HEAD requests instead of GETs. I'm wondering what the purpose of this is, apart from confusing the Soviets. Despite at least four of the pages having been changed, the HEAD was not followed by a GET.
Anyone else recording HEAD requests in their logs where there used to be GETs?
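For anyone who wants to check their own logs for the same pattern, here is a rough Python sketch. It assumes Apache "combined" log format and simply matches "Googlebot" in the user-agent string; the function name and the sample lines are made up for illustration and are not taken from anyone's real logs.

```python
import re

# Pulls the request line, status and user agent out of a "combined" log line.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def heads_without_get(lines, agent_substring="Googlebot"):
    """Return paths the bot requested with HEAD and never fetched with a later GET."""
    pending = []  # paths with an unanswered HEAD, in the order first seen
    for line in lines:
        m = LOG_LINE.search(line)
        if not m or agent_substring not in m.group("agent"):
            continue  # not parseable, or not the bot we care about
        method, path = m.group("method"), m.group("path")
        if method == "HEAD" and path not in pending:
            pending.append(path)
        elif method == "GET" and path in pending:
            pending.remove(path)  # the HEAD was followed up after all
    return pending

# Invented sample lines (192.0.2.x is a documentation-only address range).
sample = [
    '192.0.2.1 - - [08/Jul/2006:05:19:02 +0000] "HEAD /page1.html HTTP/1.1" 200 - "-" "Googlebot/2.1"',
    '192.0.2.1 - - [08/Jul/2006:05:25:40 +0000] "HEAD /page2.html HTTP/1.1" 200 - "-" "Googlebot/2.1"',
    '192.0.2.1 - - [08/Jul/2006:05:31:11 +0000] "GET /page2.html HTTP/1.1" 200 4096 "-" "Googlebot/2.1"',
    '192.0.2.7 - - [08/Jul/2006:05:33:00 +0000] "GET /page1.html HTTP/1.1" 200 4096 "-" "Mozilla/5.0"',
]

print(heads_without_get(sample))
```

On the invented sample this lists only /page1.html, since the later GET for that page came from an ordinary browser rather than the bot.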
You see, it all began when he used to cut lawns for a living and he met a mad scientist who enjoyed tinkering with VR in his garage. Well, after a few months of playing the mad doctors games, the lawn guy started to get awfully clever and eventually had dreams of taking over the world. Using the telephone lines and sat-coms he stalks the worlds data net, wreaking havoc and messing up search results and making webmasters go prematurely grey.
All the Best ... a bit bored today ;-)
p.s. oh yeah I forgot to say IMHO before this.
If you've implemented GZIP compression in a simplistic way using PHP it is likely your server is not supporting the 304 "not changed" response and always returning content with a 200 response. And certainly the HEAD mechanism is a way around this; although slightly less efficient.
Some of my sites are implemented in this way and rarely return a 304: robots.txt returns 304s, but no HTML documents do. Google's guidelines do suggest supporting the 304 "not changed since" capability as an important plus.
I have wondered whether there is a very slight ranking penalty for not supporting 304s. But GZIP compression reduces bandwidth by a factor of 3 to 4 so it's more than worth it. Google has finally committed to requesting GZIP compressed content with the BIG DADDY update. Before that they used GZIP requests sporadically.
There are more complex PHP scripts available that do properly support GZIP compression and the 304 "not changed" response.
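To show the idea of doing both at once, here is a minimal sketch in Python rather than PHP, just to keep it language-neutral. It is not any of the scripts mentioned above. The `respond` function and its arguments are hypothetical, and it assumes the request headers arrive as a plain dict with canonical header names and that `last_modified` is a timezone-aware UTC datetime.

```python
import gzip
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def respond(request_headers, body, last_modified):
    """Return (status, headers, payload) for a GET, honouring If-Modified-Since and gzip."""
    # HTTP dates only have one-second resolution, so drop microseconds.
    last_modified = last_modified.replace(microsecond=0)
    headers = {
        "Last-Modified": format_datetime(last_modified, usegmt=True),
        "Vary": "Accept-Encoding",
    }

    since = request_headers.get("If-Modified-Since")
    if since:
        try:
            since_dt = parsedate_to_datetime(since)
        except (TypeError, ValueError):
            since_dt = None  # unparseable date: fall through to a full 200
        if since_dt is not None:
            if since_dt.tzinfo is None:
                since_dt = since_dt.replace(tzinfo=timezone.utc)
            if since_dt >= last_modified:
                return 304, headers, b""  # nothing changed: headers only

    # Changed (or unconditional) request: send the body, compressed if allowed.
    if "gzip" in request_headers.get("Accept-Encoding", ""):
        headers["Content-Encoding"] = "gzip"
        return 200, headers, gzip.compress(body)
    return 200, headers, body
```

The point is only that the 304 check happens before any compression, so turning GZIP on does not have to cost you the "not changed" response.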
Then if it finds a page changed but doesn't download it, RFC2616 requires it to treat its cache as "stale". The RFC emphasises "MUST" in capitals. But it isn't doing.
And whaddya know - on several test sites here in Sheffield on this sunny Saturday morning - it is. I've just looked at a few pages where I know Google knows their true status but hasn't done a full download - and there's no "cached" option on the SERP. Either I got some of my facts wrong to start with, or it's the fastest RFC compliance in history.
About 304. Google just re-requests the pages when it runs into a 304 to obtain a 200 and download of the page for some reason. 99% of the time it does this for us so I would toss out a guess that it isn't caring for last modifieds just a fresh download.
Well, I think it's exactly wrong. Not the comment about how HEAD works, but why Google is now using it.
A conditional GET would seem more efficient. If the page has changed, a copy is transferred. If not, just the header.
Using separate HEAD and then unconditional GET requests increases rather than decreases bandwidth consumption.
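A back-of-envelope version of that comparison, as a Python sketch. The 35 pages and 4 changed pages come from my own site; the 300-byte header and 4000-byte body sizes are assumptions picked purely for illustration, and only response bytes are counted.

```python
def conditional_get_bytes(pages, changed, header_bytes, body_bytes):
    """One conditional GET per page: headers always, body only if changed."""
    return pages * header_bytes + changed * body_bytes

def head_then_get_bytes(pages, changed, header_bytes, body_bytes):
    """One HEAD per page, then a full GET (headers again plus body) for each changed page."""
    return pages * header_bytes + changed * (header_bytes + body_bytes)

pages, changed = 35, 4                   # figures from the site discussed here
header_bytes, body_bytes = 300, 4000     # assumed sizes, illustration only

cond = conditional_get_bytes(pages, changed, header_bytes, body_bytes)
split = head_then_get_bytes(pages, changed, header_bytes, body_bytes)
print(cond, split, split - cond)
```

With those assumed sizes the totals are 26500 and 27700 bytes. The HEAD-then-GET route costs one extra set of response headers per changed page, plus an extra round trip that this sum doesn't count.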
And, of course, if Google really was using HEAD for the postulated purpose, a HEAD request to a changed page would be followed by a GET - and that's not what I'm seeing.
HEAD has been around since HTTP/1.0. Why has Google started using it within the last week or so when it doesn't otherwise appear in any logs stretching back years?
After I deleted the Google sitemap, crawling (which had pretty much stopped) started again. In the early hours of yesterday the Googlebot crawled the entire site - ca. 35 pages.
Yesterday at 18:05 I changed one page.
This morning at 05:19 to 05:37 I saw four HEAD requests - one of which coincidentally hit the page I changed yesterday. That was around five hours ago - no intervening GET.
With the right keywords (esoteric - no major SEO success) it comes up at #2.
[url] - 4k - 7 Jul 2006 - Cached - Similar pages
But if you click on cached - it serves up a version from 22 June.
This drives a coach and pair through RFC2616.
On one of my sites I've never had a sitemap, but on June 10th and 11th I saw HEAD requests to images (no pages); not followed by a GET request. No HEAD requests since. I have Cache-Control: max-age for images set to just over a year, if that's relevant. (No max-age is set for regular documents.)