Forum Moderators: open
66.249.71.nnn - - [08/Jun/2009:17:13:11 -0400] "HEAD / HTTP/1.1" 405 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
I have lots of records from Googlebot and IPs in that range, but I could not find one with a HEAD request before. Is this normal?
Typically I block HEAD requests, as they can consume resources on the server without serving anything to the client.
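For anyone wanting to do the same, a minimal mod_rewrite sketch (assuming mod_rewrite is available; adapt to your own setup) might look like this:

# Return 403-Forbidden for every HEAD request
RewriteEngine On
RewriteCond %{REQUEST_METHOD} ^HEAD$
RewriteRule .* - [F]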
crawl-66-249-68-nnn.googlebot.com - - [08/Jun/2009:14:14:13 -0700] "HEAD / HTTP/1.1" 200 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Like the OP, I block HEAD requests, usually because they only come from bad bots or from people copy-pasting content into Word and snagging graphics (thus making a hijacked-graphics mess of my logs). I only allow HEADs if it's Googlebot from googlebot.com or an AOL-specific UA from aol.com.
[edited by: Pfui at 4:12 pm (utc) on June 9, 2009]
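A sketch of that kind of selective HEAD handling (assuming HostnameLookups is on so %{REMOTE_HOST} carries the rDNS name, and remembering that rDNS should be forward-confirmed before it is trusted; extend the same pattern for the AOL case):

# Deny HEAD unless the resolved hostname ends in .googlebot.com
RewriteEngine On
RewriteCond %{REQUEST_METHOD} ^HEAD$
RewriteCond %{REMOTE_HOST} !\.googlebot\.com$ [NC]
RewriteRule .* - [F]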
To be clear, the server response to a client HEAD request contains the usual response headers, but the message-headers in the response are not followed by a message-body (i.e. there is no "page" or other content following those headers). A 403-Forbidden response contains essentially the same headers, followed by the message-body containing your custom 403 error page content (or the server's default 403 response text if no custom 403 error page is defined).
The only difference is in the response headers that describe the requested content, e.g. Content-Encoding, Content-Type, and Content-Length: if the resource for which the HEAD was requested is not of the same type as your 403 error response, then those headers will differ. For example, if a HEAD request for a .gif image is received, a 200-OK response would indicate a Content-Type of "image/gif," while a 403 response's headers would describe the error document carried in its message-body, most likely "text/html" rather than the Content-Type of the requested object.
As a result, there's little use in issuing a 403-Forbidden response to a HEAD request. I prefer to save the server from the bother of detecting and handling them, so I simply let HEAD requests pass.
For those interested in the details of HEAD requests and responses, see RFC-2616 - Hypertext Transfer Protocol -- HTTP/1.1 [w3.org], section 9.4.
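To illustrate (all header values hypothetical), a HEAD request for a .gif might draw either of these two responses, differing mainly in the entity headers:

HEAD /logo.gif HTTP/1.1
Host: www.example.com

HTTP/1.1 200 OK
Server: Apache
Last-Modified: Mon, 01 Jun 2009 12:00:00 GMT
Content-Length: 4096
Content-Type: image/gif

HTTP/1.1 403 Forbidden
Server: Apache
Content-Length: 217
Content-Type: text/html; charset=iso-8859-1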
Just an FYI -- YMMV.
Jim
69.65.41.nnn - - [03/Jun/2009:13:03:56 -0700] "HEAD / HTTP/1.1" 403 0 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
69.65.41.nnn - - [05/Jun/2009:11:08:58 -0700] "HEAD /welcome.html HTTP/1.1" 403 0 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"
69.65.41.nnn - - [05/Jun/2009:16:01:33 -0700] "HEAD /welcome.html HTTP/1.1" 403 0 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"
However, those aren't normal -- they're not a real person visiting in real time.
They're something automated, and they return despite repeated denials,* and that means a bot process, and that's against our TOS. So denying access is an appropriate response, imho.
When I see those kinds of accesses, nine times out of 10 they're cloaked bots or harvesters, or browser add-ons running through every link on a Wikipedia or blog page. Blocking the request(s) means I don't have to think twice about who they are or what they might be doing.
Worst thing is, nowadays, the vast majority of HEADs come from all the crap coming from amazonaws.com. I have NO desire whatsoever to serve up anything but denials to any of them. Why allow HEADs when they're all disallowed from the GET-go?
.
*04-03-09: Exact same Host batted for the Mac team last April...
69.65.41.nnn
Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en; rv:1.8.1.11) Gecko/20071128 Camino/1.5.4
[edited by: Pfui at 10:32 pm (utc) on June 9, 2009]
But the difference in "what they get from you" between a 200-OK'ed HEAD response and a 403 or 405 is pretty much only the text of your 403/405 error document. Everything that they could have gotten from a successful HEAD request is also delivered in a 403 or 405 response (check it out with Live HTTP Headers to see the details).
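If you'd rather check from the command line, curl can show the same thing (example.com standing in for your own host):

# -s silences the progress meter; -I sends a HEAD request and prints the response headers
curl -sI http://www.example.com/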
And this is the likely reason that "they come back despite being blocked" -- they are getting what they want: your server type and application versions, the last-modified date of the requested resource, etc.
In fact, many servers return the webmaster contact e-mail address with any error response, so you may actually be giving them *more* information with a 40x response.
Just some information, folks; use it as you will.
Jim
1.) FWIW, I routinely give all HEAD requests a quick review via a script and, as appropriate (smiles), specifically [F,L] all Hosts/IPs/UAs that aren't otherwise denied.
2.) About the "they come back despite being blocked" part: They simply can't always get what they want, imho.
We all know bots that come back no matter what, regardless of GET or HEAD, OPTIONS or TRACE, 200 or 403, 302 or 301 (e.g., to 127.0.0.1), SetEnvIf or RewriteCond. Many sequentially run our IP block's addresses like the WAR GAMES kid's war dialer ran phone numbers.
That TOS-violating botrunners use HEAD to 'appropriately' case my server doesn't mean I have to let them in, any more than strangers using Google Maps to case my house means I have to let them in.
3.) Thanks to .htconfig, contact addresses don't appear automatically in error messages, ditto the server type details. Of course, headers are another thing entirely. Are there ways not to show server-specifics like applications and Keep-Alive info (in Apache 1.3.22)? Does it matter?
Now, browsers don't do that, and neither do the regular spiders. With a 405 you can tell the client the allowed methods for the specific page; that's documented in the RFC. So in my case I do list the allowed methods with the 405 response: I list GET and POST as allowed, and I believe that's better than a 200 header with a fake length.
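For reference, RFC 2616 requires a 405 response to include an Allow header naming the valid methods, so the response headers look roughly like this (values illustrative):

HTTP/1.1 405 Method Not Allowed
Allow: GET, POST
Content-Type: text/html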
Now, as for the particular access I posted at the top, there are various interesting things I have found since.
1. The particular IP is recent in my logs; it started visiting in May '09.
2. The UA of that same IP changes; here are some other entries from the same IP:
66.249.71.nnn - - [01/May/2009:10:52:12 -0400] "GET /sample_file.htm HTTP/1.1" 304 - "-" "Googlebot-Image/1.0"
66.249.71.nnn - - [08/May/2009:20:07:11 -0400] "GET / HTTP/1.1" 200 57515 "-" "Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)"
Here is another from the same range.
66.249.71.nnn - - [16/Feb/2009:07:46:12 -0500] "GET / HTTP/1.1" 200 69950 "-" "SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)"
Now, the image access I don't worry about, as other Googlebot IPs do the same. But those mobile-like UAs I am not so sure about. What's this, gadget ads?
3. In that IP range (66.249.71.nnn) I have seen other strange things from Googlebot, among them:
a) It can be sent chasing URIs that constitute hack attempts; here is another entry:
66.249.71.nnn - - [01/May/2009:15:03:23 -0400] "GET /proxy.php?ter=&zon=eur&url=http://www.example.com/page.htm%7C%7C HTTP/1.1" 301 20 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Now, I don't have scripts like these on my server.
b) It hits the server multiple times within the same second requesting pages, many times ignoring 304 cache headers.
Plus the HEAD request mentioned. I will try to cross-reference this on some other servers and see what else comes up.
Your item #2 is known as the "Mick Jagger theorem." Assuming that what they need is just *any* kind of response, then it is true: they can't always get what they want, but if they try sometimes, they just might find they get what they need (even with a 403 response). The only way to shut them down properly is at the firewall, by dumping their requests into the 'black hole' before they even reach your server.
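On a Linux server, that kind of black-holing might look like this hypothetical iptables rule (using the /24 from the logs above as an example):

# Silently drop every packet from the offending address block
iptables -A INPUT -s 69.65.41.0/24 -j DROP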
Depending on your permissions on the server, you can configure the "Server" header to return minimal info, such as just "Apache," and so 'hide' the names and versions of your major applications, etc. See the Apache core ServerTokens and ServerSignature directives.
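For example, a minimal-disclosure sketch for httpd.conf (both directives exist in the Apache 1.3 line, so this should apply to 1.3.22 as well):

# Send "Server: Apache" with no version or module details
ServerTokens Prod
# Omit the server/contact signature from generated error pages
ServerSignature Off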
enigma,
> But those mobile-like UAs I am not so sure about. What's this, gadget ads?
These are the crawlers for Google's mobile search index. You will also see one spoofing DOCOMO for the Asian mobile search results. All the mobile UAs you posted look legitimate to me (I do have some mobile content on my sites in XHTML+XML-MP format so I see these UAs quite often.)
Jim
> These are the crawlers for Google's mobile search index. You will also see one spoofing DOCOMO for the Asian mobile search results. All the mobile UAs you posted look legitimate to me (I do have some mobile content on my sites in XHTML+XML-MP format so I see these UAs quite often.)
They do this because many mobile sites use "browser sniffing" to determine how to serve content, and Google is looking for mobile content when they send these requests. Mobile devices vary so wildly in their screen size, browser rendering capabilities, and standards compliance that this approach is necessary for many Web sites, and Google is trying to support it.
So for many reasons, this is not only a good idea, but is actually needed for things to work properly. Rather than view this with suspicion, I suggest that you view it in the same way that you view the fact that many robots put "Mozilla/4.0 (compatible;" on the front of their actual UA, which they do for the exact same reason.
As far as I can tell, they are *all* legitimate (as long as they come from IP addresses having crawl-nn-nnn-nn-nn.googlebot.com rDNS), and I get dozens to hundreds of visits per day from these "cell-phone Googlebots" to my mobile content URLs. I also get a few to my non-mobile content URLs, as they figure out what's mobile content and what's not: XHTML+XML-MP pages vs. HTML pages, in my case.
Basically, you can match the "phone make and model" part of the user-agent string if you want to treat the request as a mobile device, or you can match the "googlebot" part of the UA if you want to treat it as a robot, and then use the content-type preference list in the HTTP Accept request header to further decide what to serve. This suits both the fine/complex and coarse/simple mobile device support methods available to mobile webmasters.
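As a rough sketch of the coarse/simple approach (the /mobile/ path and the match patterns here are hypothetical; adjust for your own site):

# Serve the mobile copy if the UA is Googlebot-Mobile, or if the
# Accept header advertises a WAP content-type
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/mobile/
RewriteCond %{HTTP_USER_AGENT} Googlebot-Mobile [NC,OR]
RewriteCond %{HTTP_ACCEPT} vnd\.wap [NC]
RewriteRule .* /mobile%{REQUEST_URI} [L]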
Take a look at the Google Mobile Webmaster Help pages for more info.
Jim
The only use I have seen for the UA is differentiating a bot from a human when someone enters a site, based on a string in it that may signify a spider.
Therefore, many Webmasters have learned that in order to detect mobile devices reliably, it is not sufficient just to examine the Accept header for acceptable content-types. In fact, there is a huge script/database package that is widely used to examine mobile device User-agent strings and return hundreds of parameters describing each mobile device's specific characteristics and capabilities.
You can view Google's user-agent spoofing method with suspicion if you like, but I assure you that it is not only necessary but also a very good idea. Having implemented several mobile sites, I use *all* of the information googlebot-mobile provides in its request headers to serve the proper content to each type of "device" that they claim to be, while also modifying that content slightly for googlebot so that it caches appropriate information for display in mobile search results for each device class (i.e. XHTML-MP, iMode, etc.).
Jim
It's not a matter of suspicion only; it's a matter of usability and of following the spec as closely as possible. In the end it is the HTTP_ACCEPT header from the client that identifies the content needed (subsequently sent by the server). You cannot ignore that and rely on the UA, actually. I would expect a mobile device to send the proper request headers to the server end.
Another thing is that the UA info can be really vague, and harder to manage based on brand signatures.
Now, if you see mobile devices that don't include a media type identifying a mobile client (e.g. if they send just the regular Accept headers of an ordinary browser), there is something wrong with the mobile device. If I see just a */* with the request and nothing else, I will send out the default page format from the server.
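A hypothetical sketch of that fallback, using SetEnvIf to flag requests whose Accept header names a mobile content-type (the variable name is made up; a bare */* never matches, so those requests get the default page):

# Flag the request as mobile only if Accept advertises a WAP type
SetEnvIf Accept "vnd\.wap" CLIENT_IS_MOBILE=1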
Also, if you check the Googlebot request headers, you will see they include a reference to a wireless protocol. If you say the info in the Accept header is not enough to be used to pick a content format, then the RFC specs are inadequate.
What each webmaster does is, I am sure, up to him. Many treat a blank UA as a hidden bot or a spammer, but I don't, because I also surf with the UA blocked. And reading the RFCs, the UA does not signify a format encoding by any means.
Mobile devices are rushed to market due to the competitive nature of that industry, and the vendors apparently have little time for specs. A cursory review of their headers and client behaviour will amply demonstrate this.
There is a lot wrong with many mobile devices.
Many devices do not include "vnd.wap" in their Accept headers, and that is only one problem. By including a valid mobile device in their UA string *and* acknowledging that the request is from googlebot-mobile, Google is simply making it possible to correctly support almost all mobile devices in the most flexible and thorough manner.
You can do as you please, but I prefer to use all of the information provided in the headers, and not to turn away traffic simply because the visitor had the misfortune of buying a mobile device from a non-specifications-compliant vendor. We can point to specs and standards all day long (as we did for many, many years with MSIE's non-compliant CSS support, for example), but in the end it comes down to choosing whether or not to serve the correct content to all devices, spec-compliant or not.
A bit of pragmatism is called for.
Jim