Googlebot using HEAD requests

Behaviour changed because sitemap deleted?

     
7:22 pm on Jul 6, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 11, 2002
posts:140
votes: 0


I deleted the Google sitemap (XML format) from one site on Sunday 2 July - both the server copy and the sitemaps console entry. The reason was that Google wasn't crawling the site. At all.

Since then, the Googlebot has continued to try to download the sitemap despite the explicit deletion in the sitemap console. It ain't there no more, guys - and I've told you it isn't. Back off.

Crawling, however, resumed at a reasonable rate. Until Tuesday - when all of a sudden the Googlebot started issuing HEAD requests instead of GETs. I'm wondering what the purpose of this is, apart from confusing the Soviets. Despite at least four of the pages having been changed, the HEAD was not followed by a GET.

Anyone else recording HEAD requests in their logs where there used to be GETs?
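
For anyone who wants to check their own logs, here is a minimal Python sketch. It assumes the Apache "combined" log format and simply counts request methods for lines whose user agent mentions Googlebot. The function name and the sample lines are made up for illustration and are not taken from any real log.

```python
import re
from collections import Counter

# Matches the request line, status, size, referrer and user agent
# of an Apache "combined" format log entry (assumed format).
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_methods(lines):
    """Count request methods for lines whose user agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            counts[match.group("method")] += 1
    return counts

# Hypothetical sample lines, for illustration only.
sample = [
    '192.0.2.1 - - [09/Jul/2006:05:19:02 +0000] "HEAD /page1.html HTTP/1.1" 200 0 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.1 - - [09/Jul/2006:05:21:40 +0000] "GET /page2.html HTTP/1.1" 200 4096 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.7 - - [09/Jul/2006:05:22:11 +0000] "GET /page1.html HTTP/1.1" 200 4096 "-" "Mozilla/5.0"',
]

print(googlebot_methods(sample))
```

To run it against a real file, pass the open file object instead of `sample`. Matching on the user agent string alone does not prove the request came from Google, so treat the counts as a rough indication.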

10:51 am on July 7, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:Oct 27, 2005
posts:104
votes: 0


Hi Phil,

In the last few days Googlebot has been issuing HEAD requests for my pages too, but not too many (regular crawling is still in progress). The difference is that I still use sitemap.xml.

I have no idea why.

So I join your request for info...

11:08 am on July 7, 2006 (gmt 0)

Full Member

joined:June 11, 2005
posts:305
votes: 0


I heard that Googlebot is now a free-standing, self-aware, bitter and twisted monster gone crazy!

You see, it all began when he used to cut lawns for a living and he met a mad scientist who enjoyed tinkering with VR in his garage. Well, after a few months of playing the mad doctor's games, the lawn guy started to get awfully clever and eventually had dreams of taking over the world. Using the telephone lines and sat-coms he stalks the world's data net, wreaking havoc and messing up search results and making webmasters go prematurely grey.

All the Best ... a bit bored today ;-)

Col

p.s. oh yeah I forgot to say IMHO before this.

Dayo_UK

11:17 am on July 7, 2006 (gmt 0)

Inactive Member
Account Expired

 
 


Probably checking for last modified or if the file exists without downloading it.

It has been doing this frequently on my homepage for months.

12:10 pm on July 7, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:Oct 27, 2005
posts:104
votes: 0


"Probably checking for last modified or if the file exists without downloading it. "

Maybe it is for servers that do not support "last modified", because with a GET, if the server sends a 304, the page is not downloaded anyway.

12:33 pm on July 7, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 13, 2004
posts:798
votes: 1


FYI

If you've implemented GZIP compression in a simplistic way using PHP it is likely your server is not supporting the 304 "not changed" response and always returning content with a 200 response. And certainly the HEAD mechanism is a way around this; although slightly less efficient.

Some of my sites are implemented in this way and rarely return a 304: robots.txt returns 304s, but no HTML documents do. Google's guidelines do suggest supporting the 304 "not changed since" capability as an important plus.

I have wondered whether there is a very slight ranking penalty for not supporting 304s. But GZIP compression reduces bandwidth by a factor of 3 to 4, so it's more than worth it. Google has finally committed to requesting GZIP-compressed content with the BIG DADDY update. Before that they used GZIP requests sporadically.

There are more complex PHP scripts available that do properly support GZIP compression and the 304 "not changed" response.
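
As an illustration of the ordering such a script needs, here is a minimal sketch in Python rather than PHP. The idea is to check If-Modified-Since first and only compress when a full response is actually going out. The `respond` function and the shape of its arguments are assumptions made for this example, not any particular script's API.

```python
import gzip
from email.utils import formatdate, parsedate_to_datetime

def respond(request_headers, body, last_modified_ts):
    """Return (status, headers, payload) for a GET.

    request_headers: dict of request header names to values (assumed shape).
    body: the uncompressed page as bytes.
    last_modified_ts: page modification time as a Unix timestamp.
    """
    headers = {"Last-Modified": formatdate(last_modified_ts, usegmt=True)}

    # Check the conditional header first, before doing any compression work.
    since = request_headers.get("If-Modified-Since")
    if since:
        try:
            since_ts = parsedate_to_datetime(since).timestamp()
        except (TypeError, ValueError):
            since_ts = None  # unparseable date: fall through to a full response
        if since_ts is not None and int(last_modified_ts) <= since_ts:
            return 304, headers, b""

    # Compress only when the client says it accepts gzip.
    if "gzip" in request_headers.get("Accept-Encoding", ""):
        headers["Content-Encoding"] = "gzip"
        body = gzip.compress(body)
    headers["Content-Length"] = str(len(body))
    return 200, headers, body
```

A simplistic setup that compresses every response unconditionally never reaches the 304 branch, which is the behaviour described above.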

2:19 pm on July 7, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 11, 2002
posts:140
votes: 0


> Maybe it is for servers that do not support "last modified" because when using a "Get" if the server send a 304 the page is not downloaded anyway.

Nope. I support and return 304 to GETs like that all the time. Not that.

2:22 pm on July 7, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 11, 2002
posts:140
votes: 0


> Probably checking for last modified or if the file exists without downloading it.

Then if it finds a page changed but doesn't download it, RFC2616 requires it to treat its cache as "stale". The RFC emphasises "MUST" in capitals. But it isn't doing.

8:30 am on July 8, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 11, 2002
posts:140
votes: 0


> Then if it finds a page changed but doesn't download it, RFC2616 requires it to treat its cache as "stale". The RFC emphasises "MUST" in capitals. But it isn't doing.

And whaddya know - on several test sites here in Sheffield on this sunny Saturday morning - it is. I've just looked at a few pages where I know Google knows their true status but hasn't done a full download - and there's no "cached" option on the SERP. Either I got some of my facts wrong to start with, or it's the fastest RFC compliance in history.

8:43 am on July 8, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:Aug 27, 2003
posts:570
votes: 0


I have noticed this HEAD request for our homepage for a while now. Never thought too much about it. I can't figure out why though either.

About 304: for some reason, when Google runs into a 304 it just re-requests the page to obtain a 200 and a full download. It does this 99% of the time for us, so my guess is that it doesn't care about last-modified dates and just wants a fresh download.

5:10 pm on July 8, 2006 (gmt 0)

New User

5+ Year Member

joined:May 19, 2006
posts:14
votes: 0


> Probably checking for last modified or if the file exists without downloading it.

I think that's exactly right. If the server provides a last-modified header and the file hasn't changed since last crawl, the HEAD request would save bandwidth.

9:12 am on July 9, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 11, 2002
posts:140
votes: 0


> I think that's exactly right. If the server provides a last-modified header and the file hasn't changed since last crawl, the HEAD request would save bandwidth.

Well, I think it's exactly wrong. Not the comment about how HEAD works, but why Google is now using it.

A conditional GET would seem more efficient. If the page has changed, a copy is transferred. If not, just the header.

Using separate HEAD and then unconditional GET requests increases rather than decreases bandwidth consumption.
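
The comparison can be sketched with some rough arithmetic. The Python below counts response bytes only and ignores request bytes; the function names and the 300-byte header size are assumptions for illustration, and the 4096-byte page is simply a 4k page.

```python
def bytes_head_then_get(header_bytes, page_bytes, changed):
    """HEAD first, then an unconditional GET only if the page changed."""
    total = header_bytes                      # HEAD response: headers only
    if changed:
        total += header_bytes + page_bytes    # follow-up GET: headers + body
    return total

def bytes_conditional_get(header_bytes, page_bytes, changed):
    """One GET with If-Modified-Since: 304 (headers only) or 200 with body."""
    return header_bytes + (page_bytes if changed else 0)

HEADER_BYTES = 300   # assumed size of the response headers
PAGE_BYTES = 4096    # a 4k page

for changed in (False, True):
    print(
        changed,
        bytes_head_then_get(HEADER_BYTES, PAGE_BYTES, changed),
        bytes_conditional_get(HEADER_BYTES, PAGE_BYTES, changed),
    )
```

Under these assumptions the two approaches cost the same when the page is unchanged, and HEAD followed by GET costs one extra set of headers (plus an extra round trip) when it has changed. It never comes out cheaper.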

And, of course, if Google really was using HEAD for the postulated purpose, a HEAD request to a changed page would be followed by a GET - and that's not what I'm seeing.

HEAD has been around since HTTP. Why has Google started using it within the last week or so when it doesn't otherwise appear in any logs stretching back years?

9:35 am on July 9, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 11, 2002
posts:140
votes: 0


Here's an example:

After I deleted the Google sitemap, crawling (which had pretty much stopped) started again. In the early hours of yesterday the Googlebot crawled the entire site - ca. 35 pages.

Yesterday at 18:05 I changed one page.

This morning at 05:19 to 05:37 I saw four HEAD requests - one of which coincidentally hit the page I changed yesterday. That was around five hours ago - no intervening GET.

With the right keywords (esoteric - no major SEO success) it comes up at #2.

[url] - 4k - 7 Jul 2006 - Cached - Similar pages

But if you click on cached - it serves up a version from 22 June.

This drives a coach and pair through RFC2616.

11:41 pm on July 9, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:June 26, 2004
posts:155
votes: 0


Just to add a little more information...

On one of my sites I've never had a sitemap, but on June 10th and 11th I saw HEAD requests to images (no pages); not followed by a GET request. No HEAD requests since. I have Cache-Control: max-age for images set to just over a year, if that's relevant. (No max-age is set for regular documents.)

 
