"gmail.com" bots?

A bunch are showing up

tangor

10:21 am on Nov 3, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've just used .htaccess to deny a batch of UAs that contain "gmail.com" together with the lame "just send us an email if you don't want to be crawled" language, or something similar.

Good thing or bad? All seem to resolve back to Amazon's cloud, and there's not a referral in the bunch -- which is reason enough for me. But I do wonder if I'll be taking out some potential referrals from people who actually use Gmail.
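For anyone wanting to do the same, a minimal .htaccess sketch (Apache 2.2-era syntax, mod_setenvif; the marker name bad_ua is my own, not from the thread):

```apache
# Flag any request whose User-Agent contains "gmail.com" (case-insensitive)
SetEnvIfNoCase User-Agent "gmail\.com" bad_ua

# Deny flagged requests (Apache 2.2 access-control syntax)
Order Allow,Deny
Allow from all
Deny from env=bad_ua
```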

wilderness

4:10 pm on Nov 3, 2008 (gmt 0)

Web-based email "links" are not referenced in a UA, but rather in the referrer field.
EX:
www.example(webBasedEmail).com/datalink/yourPage.html
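To illustrate: a genuine webmail reader clicking a link arrives with an ordinary browser UA, and the webmail host shows up in the Referer header instead. A sketch for spotting (rather than blocking) such visits, assuming mod_setenvif -- the marker name is my own:

```apache
# A Gmail user clicking a link typically carries a Referer such as
# https://mail.google.com/mail/... while the User-Agent is a normal browser.
# Flag these visits so logs can distinguish them from crawler traffic.
SetEnvIfNoCase Referer "mail\.google\.com" webmail_visitor
```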

jdMorgan

6:39 pm on Nov 3, 2008 (gmt 0)

Amazon's compute cloud is a disaster with respect to validating spider requests coming from it. There is apparently no way to do an rDNS lookup that returns any information about the actual "using organization" -- all you get is Amazon's generic compute-cloud information.
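The forward-confirmed rDNS check that fails for EC2 can be sketched like this -- the ALLOWED_SUFFIXES list and helper names are my own illustration, not anyone's production code:

```python
import socket

# Hostname suffixes treated as legitimate crawler operators.
# (Illustrative list -- adjust to the engines you actually whitelist.)
ALLOWED_SUFFIXES = (".googlebot.com", ".search.msn.com", ".crawl.yahoo.net")

def suffix_ok(hostname):
    """Pure check: does the rDNS hostname end in a whitelisted suffix?"""
    host = hostname.rstrip(".").lower()
    return host.endswith(ALLOWED_SUFFIXES)

def verify_crawler(ip):
    """Forward-confirmed rDNS: IP -> hostname -> back to the same IP.

    EC2 instances fail at the suffix step, because their PTR records are
    generic (e.g. ec2-*.compute-1.amazonaws.com), naming Amazon rather
    than the organization leasing the instance."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]           # reverse lookup
    except socket.herror:
        return False
    if not suffix_ok(hostname):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward lookup
    except socket.gaierror:
        return False
    return ip in forward_ips                             # must round-trip
```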

As a result, I block all IP address ranges of the Amazon compute cloud service, and just hope that no important search company decides to use them for extra "spidering power" as-is, without requiring that Amazon configure valid rDNS for the term of the lease.
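Blocking the cloud at the IP level looks like this in .htaccess -- the CIDR ranges below are placeholders only; pull the current Amazon allocations from ARIN rather than trusting any hard-coded list:

```apache
# Apache 2.2 syntax. Example CIDR ranges only -- verify against ARIN's
# current records for Amazon before using.
Order Allow,Deny
Allow from all
Deny from 72.44.32.0/19
Deny from 67.202.0.0/18
```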

[edit] The lack of a referrer is typical for spiders, since they're working from a database that may contain hundreds of "referrers" for any given URL on your site. So I'm not sure how relevant the lack of a referrer was to your decision to block these requests, since (I assume that) the message in the user-agent string was fairly clear about the client being a crawler. [/edit]

Jim

[edited by: jdMorgan at 6:43 pm (utc) on Nov. 3, 2008]

Samizdata

7:54 pm on Nov 3, 2008 (gmt 0)

I assume the user-agent was similar to this one:

AISearchBot (Email: aisearchbot@gmail.com; If your web site doesn't want to be crawled, please send us a email.)

It came to a couple of my sites, didn't request robots.txt and triggered at least four filters.

I didn't have to go to the trouble of emailing the owner.

...

blend27

9:21 pm on Nov 3, 2008 (gmt 0)

The Clouds are here: [ws.arin.net...]

Three magic words added to the deny list in your .htaccess do wonders: bot, crawler, spider -- applied after the whitelist routine has run.
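In mod_rewrite terms, the three-word screen might look like this (a sketch; the whitelist pass for known-good engines such as Googlebot would have to precede it):

```apache
RewriteEngine On
# Forbid any UA containing bot, crawler, or spider (case-insensitive).
# Run this only after whitelisting legitimate engines.
RewriteCond %{HTTP_USER_AGENT} (bot|crawler|spider) [NC]
RewriteRule .* - [F]
```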

jdMorgan

10:09 pm on Nov 3, 2008 (gmt 0)

I use seven magic words to screen user-agent strings: bot, capture, crawl, download, http:, proxy, and spider.

And I'd be happy to add more if I've missed any... :)

Note that you may want to add an exclusion for user-agents ending with "(http://www.avantbrowser.com)" ... Or not.
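Combining the seven words with the Avant Browser exclusion might look like this in mod_rewrite (a sketch under the assumptions above, not Jim's actual rules):

```apache
RewriteEngine On
# Let Avant Browser through first -- its UA ends with its homepage URL,
# which would otherwise trip the "http:" screen below.
RewriteCond %{HTTP_USER_AGENT} !\(http://www\.avantbrowser\.com\)$
RewriteCond %{HTTP_USER_AGENT} (bot|capture|crawl|download|http:|proxy|spider) [NC]
RewriteRule .* - [F]
```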

Jim

Megaclinium

6:16 am on Dec 18, 2008 (gmt 0)

Wow, this AISearchBot hit me and was particularly stupid: it couldn't handle letter case in URLs, so it 404'd before I could even 403 it.