Googlebot using HEAD requests

Behaviour changed because sitemap deleted?

Phil_Payne

7:22 pm on Jul 6, 2006 (gmt 0)

I deleted the Google sitemap (XML format) from one site on Sunday 2 July - both the server copy and the sitemaps console entry. The reason was that Google wasn't crawling the site. At all.

Since then, the Googlebot has continued to try to download the sitemap despite the explicit deletion in the sitemap console. It ain't there no more, guys - and I've told you it isn't. Back off.

Crawling, however, resumed at a reasonable rate. Until Tuesday - when all of a sudden the Googlebot started issuing HEAD requests instead of GETs. I'm wondering what the purpose of this is, apart from confusing the Soviets. Despite at least four of the pages having been changed, the HEAD was not followed by a GET.

Anyone else recording HEAD requests in their logs where there used to be GETs?

asher02

10:51 am on Jul 7, 2006 (gmt 0)

Hi Phil,

In the last few days Googlebot has been issuing HEAD requests for my pages too, but not too many (regular crawling is still in progress). The difference is that I still use sitemap.xml.

I have no idea why.

So I join your request for info...

colin_h

11:08 am on Jul 7, 2006 (gmt 0)

I heard that Googlebot is now a free-standing, self-aware, bitter and twisted monster gone crazy!

You see, it all began when he used to cut lawns for a living and he met a mad scientist who enjoyed tinkering with VR in his garage. Well, after a few months of playing the mad doctor's games, the lawn guy started to get awfully clever and eventually had dreams of taking over the world. Using the telephone lines and sat-coms he stalks the world's data net, wreaking havoc and messing up search results and making webmasters go prematurely grey.

All the Best ... a bit bored today ;-)

Col

p.s. oh yeah I forgot to say IMHO before this.

Dayo_UK

11:17 am on Jul 7, 2006 (gmt 0)

Probably checking for last modified or if the file exists without downloading it.

It has been doing this frequently on my homepage for months.
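
For anyone who wants to see what such a check looks like on the wire, here is a minimal Python sketch. It assumes nothing about how Googlebot is actually implemented: the handler, the page body, and the Last-Modified value are all made up for illustration. It starts a throwaway server on localhost and sends it one HEAD and one GET, showing that HEAD returns the same headers (including Last-Modified) with no body.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"<html><body>hello</body></html>"          # illustrative page
LAST_MODIFIED = "Thu, 06 Jul 2006 19:22:00 GMT"    # illustrative timestamp


class DemoHandler(BaseHTTPRequestHandler):
    """Serves one fixed page; HEAD sends the same headers as GET, minus the body."""

    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(BODY)))
        self.send_header("Last-Modified", LAST_MODIFIED)
        self.end_headers()

    def do_HEAD(self):
        self._send_headers()

    def do_GET(self):
        self._send_headers()
        self.wfile.write(BODY)

    def log_message(self, *args):
        pass  # keep the demo quiet


def probe(method, port, path="/"):
    """Send one request to the local server; return (status, Last-Modified, body length)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request(method, path)
    resp = conn.getresponse()
    body = resp.read()  # empty for HEAD
    conn.close()
    return resp.status, resp.getheader("Last-Modified"), len(body)


def run_demo():
    """Start the server on a free port, probe it with HEAD and GET, then stop it."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return probe("HEAD", port), probe("GET", port)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    head_result, get_result = run_demo()
    print("HEAD:", head_result)
    print("GET: ", get_result)
```

The only difference between the two results is the body length, which is why a HEAD is enough to learn whether a file exists and when it last changed.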

asher02

12:10 pm on Jul 7, 2006 (gmt 0)

"Probably checking for last modified or if the file exists without downloading it. "

Maybe it is for servers that do not support "last modified", because with a GET, if the server sends a 304 the page is not downloaded anyway.

bumpski

12:33 pm on Jul 7, 2006 (gmt 0)

FYI

If you've implemented GZIP compression in a simplistic way using PHP, it is likely your server does not support the 304 "Not Modified" response and always returns content with a 200 response. The HEAD mechanism is certainly a way around this, although slightly less efficient.

Some of my sites are implemented in this way and rarely return a 304; robots.txt returns 304s, but HTML documents do not. Google's guidelines do suggest supporting If-Modified-Since and the 304 "Not Modified" response as an important plus.

I have wondered whether there is a very slight ranking penalty for not supporting 304s. But GZIP compression reduces bandwidth by a factor of 3 to 4, so it's more than worth it. Google has finally committed to requesting GZIP compressed content with the BIG DADDY update. Before that they used GZIP requests sporadically.

There are more complex PHP scripts available that do properly support both GZIP compression and the 304 "Not Modified" response.
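
As a rough illustration of the ordering such a script needs, here is a sketch in Python rather than PHP. It is not one of the scripts mentioned above: the function name `respond` and its arguments are hypothetical. The point is only that the If-Modified-Since check happens before any compression, so a 304 can still be returned when GZIP is enabled.

```python
import gzip
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(request_headers, body, last_modified):
    """Return (status, headers, payload) for a GET on a page.

    `last_modified` must be a timezone-aware UTC datetime.
    """
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    # Step 1: answer the conditional request before doing any compression work.
    since = request_headers.get("If-Modified-Since")
    if since:
        try:
            # HTTP dates have one-second resolution, so drop microseconds.
            if last_modified.replace(microsecond=0) <= parsedate_to_datetime(since):
                return 304, headers, b""
        except (TypeError, ValueError):
            pass  # unparseable or naive date: ignore the header and send the page

    # Step 2: compress only if the client advertised gzip support.
    # (Simplistic check: it ignores q-values such as "gzip;q=0".)
    if "gzip" in request_headers.get("Accept-Encoding", "").lower():
        headers["Content-Encoding"] = "gzip"
        headers["Vary"] = "Accept-Encoding"
        body = gzip.compress(body)

    headers["Content-Length"] = str(len(body))
    return 200, headers, body


if __name__ == "__main__":
    changed_at = datetime(2006, 7, 6, 19, 22, tzinfo=timezone.utc)
    page = b"<html>" + b"x" * 1000 + b"</html>"
    request = {
        "If-Modified-Since": "Thu, 06 Jul 2006 19:22:00 GMT",
        "Accept-Encoding": "gzip",
    }
    status, _, payload = respond(request, page, changed_at)
    print(status, len(payload))
```

A simplistic setup that compresses the output buffer first and never looks at If-Modified-Since skips step 1 entirely, which matches the "always 200" behaviour described above.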

Phil_Payne

2:19 pm on Jul 7, 2006 (gmt 0)

> Maybe it is for servers that do not support "last modified", because with a GET, if the server sends a 304 the page is not downloaded anyway.

Nope. I support and return 304 to GETs like that all the time. Not that.

Phil_Payne

2:22 pm on Jul 7, 2006 (gmt 0)

> Probably checking for last modified or if the file exists without downloading it.

Then if it finds a page changed but doesn't download it, RFC2616 requires it to treat its cache as "stale". The RFC emphasises "MUST" in capitals. But it isn't doing.

Phil_Payne

8:30 am on Jul 8, 2006 (gmt 0)

> Then if it finds a page changed but doesn't download it, RFC2616 requires it to treat its cache as "stale". The RFC emphasises "MUST" in capitals. But it isn't doing.

And whaddya know - on several test sites here in Sheffield on this sunny Saturday morning - it is. I've just looked at a few pages where I know Google knows their true status but hasn't done a full download - and there's no "cached" option on the SERP. Either I got some of my facts wrong to start with, or it's the fastest RFC compliance in history.

arubicus

8:43 am on Jul 8, 2006 (gmt 0)

I have noticed this HEAD request for our homepage for a while now. Never thought too much about it. I can't figure out why either, though.

About 304: for some reason, Google just re-requests the page when it runs into a 304, in order to obtain a 200 and a full download. It does this for us 99% of the time, so my guess is that it doesn't care about Last-Modified and just wants a fresh download.

burgeltz

5:10 pm on Jul 8, 2006 (gmt 0)

> Probably checking for last modified or if the file exists without downloading it.

I think that's exactly right. If the server provides a last-modified header and the file hasn't changed since last crawl, the HEAD request would save bandwidth.

Phil_Payne

9:12 am on Jul 9, 2006 (gmt 0)

> I think that's exactly right. If the server provides a last-modified header and the file hasn't changed since last crawl, the HEAD request would save bandwidth.

Well, I think it's exactly wrong. Not the comment about how HEAD works, but why Google is now using it.

A conditional GET would seem more efficient. If the page has changed, a copy is transferred. If not, just the header.

Using separate HEAD and then unconditional GET requests increases rather than decreases bandwidth consumption.

And, of course, if Google really was using HEAD for the postulated purpose, a HEAD request to a changed page would be followed by a GET - and that's not what I'm seeing.

HEAD has been around since HTTP/1.0. Why has Google started using it within the last week or so when it doesn't otherwise appear in any logs stretching back years?
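
The bandwidth argument in this post can be put into rough numbers. The Python sketch below is only a back-of-envelope model: the function name and the default sizes (300 bytes of response headers, a 4000-byte page) are assumptions chosen for illustration, and request bytes and connection overhead are ignored.

```python
def bytes_transferred(strategy, changed, header_bytes=300, body_bytes=4000):
    """Approximate response bytes for one recrawl of one page."""
    if strategy == "conditional_get":
        # A 304 carries headers only; a 200 carries headers plus the page.
        return header_bytes + (body_bytes if changed else 0)
    if strategy == "head_then_get":
        # The HEAD always costs one set of headers;
        # a changed page then needs a full, unconditional GET on top.
        return header_bytes + ((header_bytes + body_bytes) if changed else 0)
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    for changed in (False, True):
        for strategy in ("conditional_get", "head_then_get"):
            print(strategy, "changed" if changed else "unchanged",
                  bytes_transferred(strategy, changed))
```

With these assumed sizes, both strategies cost 300 bytes for an unchanged page, while a changed page costs 4300 bytes with a conditional GET and 4600 bytes with HEAD followed by GET. Under this model the HEAD approach never transfers less, and it also needs a second round trip whenever the page has changed.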

Phil_Payne

9:35 am on Jul 9, 2006 (gmt 0)

Here's an example:

After I deleted the Google sitemap, crawling (which had pretty much stopped) started again. In the early hours of yesterday the Googlebot crawled the entire site - ca. 35 pages.

Yesterday at 18:05 I changed one page.

This morning at 05:19 to 05:37 I saw four HEAD requests - one of which coincidentally hit the page I changed yesterday. That was around five hours ago - no intervening GET.

With the right keywords (esoteric - no major SEO success) it comes up at #2.

[url] - 4k - 7 Jul 2006 - Cached - Similar pages

But if you click on cached - it serves up a version from 22 June.

This drives a coach and pair through RFC2616.

directrix

11:41 pm on Jul 9, 2006 (gmt 0)

Just to add a little more information...

On one of my sites I've never had a sitemap, but on June 10th and 11th I saw HEAD requests to images (no pages); not followed by a GET request. No HEAD requests since. I have Cache-Control: max-age for images set to just over a year, if that's relevant. (No max-age is set for regular documents.)
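
In case the max-age setting does matter, here is a small Python sketch of how a cache could decide whether a stored copy is still fresh from the Cache-Control header alone. The helper names are hypothetical, the 366-day value is only a stand-in for "just over a year", and nothing here is known about how Googlebot treats this header.

```python
import re


def max_age_seconds(cache_control):
    """Return the max-age value from a Cache-Control header, or None if absent."""
    match = re.search(r"\bmax-age\s*=\s*(\d+)", cache_control, re.IGNORECASE)
    return int(match.group(1)) if match else None


def is_fresh(cache_control, age_seconds):
    """True if a copy that is `age_seconds` old is still within max-age."""
    limit = max_age_seconds(cache_control)
    return limit is not None and age_seconds < limit


if __name__ == "__main__":
    one_day = 86400
    header = "public, max-age=31622400"  # 366 days, an assumed example value
    print(is_fresh(header, 30 * one_day))
    print(is_fresh(header, 400 * one_day))
```

By this logic an image fetched a month earlier would still count as fresh, so a client honouring max-age would have no need to re-download it and could at most confirm that it still exists.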