Forum Moderators: open
66.249.71.nnn - - [08/Jun/2009:17:13:11 -0400] "HEAD / HTTP/1.1" 405 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
I have lots of records from Googlebot and IPs in that range, but I could not find one with a HEAD request before. Is this normal?
Typically I block HEAD requests, as they can consume resources on the server without serving anything to the client.
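For anyone wanting to do the same, a minimal mod_rewrite sketch (assuming mod_rewrite is available; adapt to your own setup) might look like this:

# Return 403-Forbidden for every HEAD request
RewriteEngine On
RewriteCond %{REQUEST_METHOD} ^HEAD$
RewriteRule .* - [F]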
crawl-66-249-68-nnn.googlebot.com - - [08/Jun/2009:14:14:13 -0700] "HEAD / HTTP/1.1" 200 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Like the OP, I block HEAD requests, usually because they only come from bad bots or from people copy-pasting content into Word and snagging graphics (thus making a hijacked-graphics mess of my logs). I only allow HEADs if it's Googlebot from googlebot.com or an AOL-specific UA from aol.com.
[edited by: Pfui at 4:12 pm (utc) on June 9, 2009]
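A sketch of that kind of selective HEAD handling (assuming HostnameLookups is on so %{REMOTE_HOST} carries the rDNS name, and remembering that rDNS should be forward-confirmed before it is trusted; extend the same pattern for the AOL case):

# Deny HEAD unless the resolved hostname ends in .googlebot.com
RewriteEngine On
RewriteCond %{REQUEST_METHOD} ^HEAD$
RewriteCond %{REMOTE_HOST} !\.googlebot\.com$ [NC]
RewriteRule .* - [F]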
To be clear, the server response to a client HEAD request contains the usual response headers, but the message-headers in the response are not followed by a message-body (i.e. there is no "page" or other content following those headers). A 403-Forbidden response contains essentially the same headers, followed by the message-body containing your custom 403 error page content (or the server's default 403 response text if no custom 403 error page is defined).
The only difference is in the response headers that describe the requested content, e.g. Content-Encoding, Content-Type, and Content-Length: if the resource for which the HEAD was requested is not of the same type as your 403 error response, then those headers will differ. For example, if a HEAD request for a .gif image is received, a 200-OK response would indicate a Content-Type of "image/gif," while a 403 response's headers would describe the error document carried in its message-body, most likely "text/html" rather than the Content-Type of the requested object.
As a result, there's little use in issuing a 403-Forbidden response to a HEAD request. I prefer to save the server from the bother of detecting and handling them, so I simply let HEAD requests pass.
For those interested in the details of HEAD requests and responses, see RFC-2616 - Hypertext Transfer Protocol -- HTTP/1.1 [w3.org], section 9.4.
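To illustrate (all header values hypothetical), a HEAD request for a .gif might draw either of these two responses, differing mainly in the entity headers:

HEAD /logo.gif HTTP/1.1
Host: www.example.com

HTTP/1.1 200 OK
Server: Apache
Last-Modified: Mon, 01 Jun 2009 12:00:00 GMT
Content-Length: 4096
Content-Type: image/gif

HTTP/1.1 403 Forbidden
Server: Apache
Content-Length: 217
Content-Type: text/html; charset=iso-8859-1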
Just an FYI -- YMMV.
Jim
69.65.41.nnn - - [03/Jun/2009:13:03:56 -0700] "HEAD / HTTP/1.1" 403 0 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
69.65.41.nnn - - [05/Jun/2009:11:08:58 -0700] "HEAD /welcome.html HTTP/1.1" 403 0 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"
69.65.41.nnn - - [05/Jun/2009:16:01:33 -0700] "HEAD /welcome.html HTTP/1.1" 403 0 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"
However, those aren't normal -- they're not a real person visiting in real time.
They're something automated, and they return despite repeated denials,* and that means a bot process, and that's against our TOS. So denying access is an appropriate response, imho.
When I see those kinds of accesses, nine times out of 10 they're cloaked bots or harvesters, or browser add-ons running through every link on a Wikipedia or blog page. Blocking the request(s) means I don't have to think twice about who they are or what they might be doing.
Worst thing is, nowadays, the vast majority of HEADs come from all the crap coming from amazonaws.com. I have NO desire whatsoever to serve up anything but denials to any of them. Why allow HEADs when they're all disallowed from the GET-go?
.
*04-03-09: Exact same Host batted for the Mac team last April...
69.65.41.nnn
Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en; rv:1.8.1.11) Gecko/20071128 Camino/1.5.4
[edited by: Pfui at 10:32 pm (utc) on June 9, 2009]
But the difference in "what they get from you" between a 200-OK'ed HEAD response and a 403 or 405 is pretty much only the text of your 403/405 error document. Everything that they could have gotten from a successful HEAD request is also delivered in a 403 or 405 response (check it out with Live HTTP Headers to see the details).
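If you'd rather check from the command line, curl can show the same thing (example.com standing in for your own host):

# -s silences the progress meter; -I sends a HEAD request and prints the response headers
curl -sI http://www.example.com/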
And this is the likely reason that "they come back despite being blocked" -- they are getting what they want: your server type and application versions, the last-modified date of the requested resource, etc.
In fact, many servers return the webmaster contact e-mail address with any error response, so you may actually be giving them *more* information with a 40x response.
Just some information, folks; use it as you will.
Jim
1.) FWIW, I routinely give all HEAD requests a quick review via a script and, as appropriate (smiles), specifically [F,L] all Hosts/IPs/UAs that aren't otherwise denied.
2.) About the "they come back despite being blocked" part: They simply can't always get what they want, imho.
We all know bots that come back no matter what, regardless of GET or HEAD, OPTIONS or TRACE, 200 or 403, 302 or 301 (e.g., to 127.0.0.1), SetEnvIf or RewriteCond. Many sequentially run our IP block's addresses like the WAR GAMES kid's war dialer ran phone numbers.
That TOS-violating botrunners use HEAD to 'appropriately' case my server doesn't mean I have to let them in, any more than strangers using Google Maps to case my house means I have to let them in.
3.) Thanks to .htconfig, contact addresses don't appear automatically in error messages, ditto the server type details. Of course, headers are another thing entirely. Are there ways not to show server-specifics like applications and Keep-Alive info (in Apache 1.3.22)? Does it matter?
Now, browsers don't do that, and neither do the regular spiders. With a 405 you can tell the client the allowed methods for the specific page; that's documented in the RFC. So in my case I do list the allowed methods with the 405 response: I list GET and POST as allowed, and I believe that's better than a 200 header with a fake length.
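For reference, RFC 2616 requires a 405 response to include an Allow header naming the valid methods, so the response headers look roughly like this (values illustrative):

HTTP/1.1 405 Method Not Allowed
Allow: GET, POST
Content-Type: text/html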
Now, as for the particular access I posted at the top, there are various interesting things I have found since.
1. The particular IP is recent in my logs; it started visiting in May '09.
2. The UA of that same IP changes; here are some other entries from the same IP:
66.249.71.nnn - - [01/May/2009:10:52:12 -0400] "GET /sample_file.htm HTTP/1.1" 304 - "-" "Googlebot-Image/1.0"
66.249.71.nnn - - [08/May/2009:20:07:11 -0400] "GET / HTTP/1.1" 200 57515 "-" "Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)"
Here is another from the same range.
66.249.71.nnn - - [16/Feb/2009:07:46:12 -0500] "GET / HTTP/1.1" 200 69950 "-" "SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)"
Now, the image access I don't worry about, as other Googlebot IPs do the same. But those mobile-like UAs I am not so sure about. What's this, gadget ads?
3. In that IP range (66.249.71.nnn) I have seen other strange things from Googlebot, among them:
a) It can be sent chasing URIs that constitute hack attempts; here is another entry:
66.249.71.nnn - - [01/May/2009:15:03:23 -0400] "GET /proxy.php?ter=&zon=eur&url=http://www.example.com/page.htm%7C%7C HTTP/1.1" 301 20 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Now, I don't have scripts like these on my server.
b) It hits the server multiple times within the same second requesting pages, many times ignoring 304 cache headers.
Plus the HEAD request mentioned. I will try to cross-reference this on some other servers and see what else comes up.
Your item #2 is known as the "Mick Jagger theorem." Assuming that what they need is just *any* kind of response, then it is true: they can't always get what they want, but if they try sometimes, they just might find they get what they need (even with a 403 response). The only way to shut them down properly is at the firewall, by dumping their requests into the 'black hole' before they even reach your server.
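On a Linux server, that kind of black-holing might look like this hypothetical iptables rule (using the /24 from the logs above as an example):

# Silently drop every packet from the offending address block
iptables -A INPUT -s 69.65.41.0/24 -j DROP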
Depending on your permissions on the server, you can configure the "Server" header to return minimal info, such as just "Apache," and so 'hide' the names and versions of your major applications, etc. See the Apache core ServerTokens and ServerSignature directives.
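For example, a minimal-disclosure sketch for httpd.conf (both directives exist in the Apache 1.3 line, so this should apply to 1.3.22 as well):

# Send "Server: Apache" with no version or module details
ServerTokens Prod
# Omit the server/contact signature from generated error pages
ServerSignature Off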
enigma,
> But those mobile-like UAs I am not so sure about. What's this, gadget ads?
These are the crawlers for Google's mobile search index. You will also see one spoofing DOCOMO for the Asian mobile search results. All the mobile UAs you posted look legitimate to me (I do have some mobile content on my sites in XHTML+XML-MP format so I see these UAs quite often.)
Jim
> These are the crawlers for Google's mobile search index. You will also see one spoofing DOCOMO for the Asian mobile search results. All the mobile UAs you posted look legitimate to me (I do have some mobile content on my sites in XHTML+XML-MP format so I see these UAs quite often.)
They do this because many mobile sites use "browser sniffing" to determine how to serve content, and Google is looking for mobile content when they send these requests. Mobile devices vary so wildly in their screen size, browser rendering capabilities, and standards compliance that this approach is necessary for many Web sites, and Google is trying to support it.
So for many reasons, this is not only a good idea, but is actually needed for things to work properly. Rather than view this with suspicion, I suggest that you view it in the same way that you view the fact that many robots put "Mozilla/4.0 (compatible;" on the front of their actual UA, which they do for the exact same reason.
As far as I can tell, they are *all* legitimate (as long as they come from IP addresses having crawl-nn-nnn-nn-nn.googlebot.com rDNS), and I get dozens to hundreds of visits per day from these "cell-phone Googlebots" to my mobile content URLs. I also get a few to my non-mobile content URLs, as they figure out what's mobile content and what's not: XHTML+XML-MP pages vs. HTML pages, in my case.
Basically, you can match the "phone make and model" part of the user-agent string if you want to treat the request as a mobile device, or you can match the "googlebot" part of the UA if you want to treat it as a robot, and then use the content-type preference list in the HTTP Accept request header to further decide what to serve. This suits both the fine/complex and coarse/simple mobile device support methods available to mobile webmasters.
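As a rough sketch of the coarse/simple approach (the /mobile/ path and the match patterns here are hypothetical; adjust for your own site):

# Serve the mobile copy if the UA is Googlebot-Mobile, or if the
# Accept header advertises a WAP content-type
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/mobile/
RewriteCond %{HTTP_USER_AGENT} Googlebot-Mobile [NC,OR]
RewriteCond %{HTTP_ACCEPT} vnd\.wap [NC]
RewriteRule .* /mobile%{REQUEST_URI} [L]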
Take a look at the Google Mobile Webmaster Help pages for more info.
Jim
The only use I have seen for the UA is differentiating a bot from a human when someone enters a site, based on a string in it that may signify a spider.
Therefore, many Webmasters have learned that in order to detect mobile devices reliably, it is not sufficient just to examine the Accept header for acceptable content-types. In fact, there is a huge script/database package that is widely used to examine mobile device User-agent strings and return hundreds of parameters describing each mobile device's specific characteristics and capabilities.
You can view Google's user-agent spoofing method with suspicion if you like, but I assure you that it is not only necessary but also a very good idea. Having implemented several mobile sites, I use *all* of the information googlebot-mobile provides in its request headers to serve the proper content to each type of "device" that they claim to be, while also modifying that content slightly for googlebot so that it caches appropriate information for display in mobile search results for each device class (i.e. XHTML-MP, iMode, etc.).
Jim
It's not a matter of suspicion only; it's a matter of usability and of following the spec as closely as possible. In the end it is the HTTP_ACCEPT header from the client that identifies the content needed (subsequently sent by the server). You cannot ignore that and rely on the UA, actually. I would expect a mobile device to send the proper request headers to the server end.
Another thing is that the UA info can be really vague, and harder to manage based on brand signatures.
Now, if you see mobile devices that don't include a media type identifying a mobile client (e.g. if they send just the regular Accept headers of an ordinary browser), there is something wrong with the mobile device. If I see just a */* with the request and nothing else, I will send out the default page format from the server.
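A hypothetical sketch of that fallback, using SetEnvIf to flag requests whose Accept header names a mobile content-type (the variable name is made up; a bare */* never matches, so those requests get the default page):

# Flag the request as mobile only if Accept advertises a WAP type
SetEnvIf Accept "vnd\.wap" CLIENT_IS_MOBILE=1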
Also, if you check the Googlebot request headers, you will see they include a reference to a wireless protocol. If you say the info in the Accept header is not enough to be used to pick a content format, then the RFC specs are inadequate.
What each webmaster does is, I am sure, up to him. Many treat a blank UA as a hidden bot or a spammer, but I don't, because I also surf with the UA blocked. And reading the RFCs, the UA does not signify a format encoding by any means.
Mobile devices are rushed to market due to the competitive nature of that industry, and the vendors apparently have little time for specs. A cursory review of their headers and client behaviour will amply demonstrate this.
There is a lot wrong with many mobile devices.
Many devices do not include "vnd.wap" in their Accept headers, and that is only one problem. By including a valid mobile device in their UA string *and* acknowledging that the request is from googlebot-mobile, Google is simply making it possible to correctly support almost all mobile devices in the most flexible and thorough manner.
You can do as you please, but I prefer to use all of the information provided in the headers, and not to turn away traffic simply because the visitor had the misfortune of buying a mobile device from a non-specifications-compliant vendor. We can point to specs and standards all day long (as we did for many, many years with MSIE's non-compliant CSS support, for example), but in the end it comes down to choosing whether or not to serve the correct content to all devices, spec-compliant or not.
A bit of pragmatism is called for.
Jim