
AntiCrawler

New bot crawler

         

dstiles

5:44 pm on Jan 17, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Seen a couple of times over the past couple of days from different countries, including the rarely-encountered VC (St Vincent & the Grenadines), to a site that is purely UK-centred.

The referer is http[://]www[.]anticrawler[.]com (my brackets). Major header fields present.

Domain creation date 2015-01-15, hosted at WorldStream NL and with no MX in DNS. No presence yet in ixquick but may be in G (which I never use).

The whole text of the home page seems to be...

==========
Put this JS to all pages of your website and you'll never see BAD bots and crawlers
(then some javascript to download a plugin)
Unique technology that pings crawlers NOT to crawl your website.
==========

Which obviously does not work - at least, not in any real sense. I got the above using wget and loading the result into gedit, so it does not even protect its own site. Offhand I can think of several bots that completely ignore javascript and would certainly ignore any pings (if that's what it's actually doing).

I have no idea what the plugin really does - could be benign or virus - but since the web site is being promoted via referer it has to be considered a spammer at best. If it becomes a nuisance it will go into my server's firewall.
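(For anyone who wants to pre-empt the nuisance stage, a minimal .htaccess sketch of that kind of referer block, assuming Apache with mod_rewrite enabled; the pattern covers just this one domain and would need extending for others:)

```apache
# Deny any request whose Referer header mentions the spam domain.
RewriteEngine On
RewriteCond %{HTTP_REFERER} anticrawler\.com [NC]
RewriteRule .* - [F,L]
```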

aristotle

3:08 pm on Jan 18, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I wonder if you can use anti-crawler's plugin to protect your site from any more visits from anti-crawler itself.

dstiles

8:24 pm on Jan 18, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Shouldn't think so. Would you build like that? :)

keyplyr

10:39 pm on Jan 18, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well if it has "crawler" in the UA then it's blocked at my sites :)

trintragula

1:38 pm on Feb 14, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Similar description from a website referer-spamming as buttons-for-website dot com. I think it's probably the same people. I've not looked very closely.
The domain has been around since October.

Well I suppose you've got to admire their targeted marketing strategy...

blend27

2:32 pm on Feb 14, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



next they will be offering SEO services and home made pizza.

note to self: block everything with "pizza" in UA,... starting tomorrow.

add pool-100-36-73-221.washdc.fios.verizon.net(sufog.num) AND static-71-177-184-59.lsanca.fios.verizon.net(gimme60.num) to the list while you are at it.

blend27

2:44 pm on Feb 14, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



reply to self: I know, I know. I will I will.

lucy24

6:59 pm on May 19, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



:: bump ::

static-71-177-184-59.lsanca.fios.verizon.net(gimme60.num)

They must have got too many UA-based lockouts, because they are now calling themselves
Mozilla/5.0 (compatible; GimmeUSAbot/1.0; +http://gimmeusa-update.com/crawler)

with the ever-popular URL That Leads Straight To A 404 Page.

Some spot-checking leads me to 71.189.128.0/17 (Verizon Business, unfortunately including humans even within this Business subrange). The only robots I've personally met live at 71.189.164.218 with various UAs over the years, including a slightly questionable FF 16. I guess I'd better change the UA lockout to "gimme", lower case. (They've used both "Gimme60bot" and "gimme60bot" but the URL is always lower case.)

Notes say "also 71.177.184.59" which is where we came in.

trintragula

8:51 pm on May 19, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



I've seen the same UAs from the same IPs. Both have ignored the instructions in robots.txt on my site and poked their noses where they shouldn't.

The crawler-info URL in the UA is missing .html on the end, but the page does exist with it on their site.

They describe their robots.txt compliance there. If they are even as compliant as they say, that's a change at some point between February and now.

Pfui

12:07 am on May 20, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Same UA on May 18th also from: static-71-189-164-218.lsanca.fios.verizon.net (71.189.164.218). No robots.txt

Eight UA variations, including mixed-cases, associated with the IP here -- since 2013: [projecthoneypot.org...]

Quick gotcha: Gimme [NC]

lucy24

2:29 am on May 20, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Gimme [NC]

Have you met anything other than "Gimme" or "gimme"? That's [Gg]imme. I realized I could go to lower-case alone when I checked back and found that the UA string always includes a URL, which is always lower-case.
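(In mod_rewrite terms that means a plain case-sensitive condition is enough, since the lower-case URL in the UA string always matches; a sketch, assuming Apache with mod_rewrite:)

```apache
# Case-sensitive "gimme" still catches GimmeUSAbot/1.0, because the
# UA string always contains the lower-case URL gimmeusa-update.com.
RewriteCond %{HTTP_USER_AGENT} gimme
RewriteRule .* - [F]
```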

Pfui

12:26 pm on May 20, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've rarely seen the bot but it's been on my banned list for a long time.

FWIW, my 'standard' banned UAs are in alphabetic [NC] arrays akin to this example --

RewriteCond %{HTTP_USER_AGENT} (Geckh|GIGRIBt|Gimme|girafa|grab|GREED|groovier|GUI|Gulp|Gungho|gURL) [NC,OR]

-- so I don't have to keep up with changed-case variations that might come along.

trintragula

1:23 pm on May 20, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



I take the opposite approach: I have a small number of fairly permissive patterns for things to allow through (which matches all known and most unknown browsers, and the few search engines I want) - everything else is blocked. But this is backed up with a battery of other detectors.
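(A stripped-down sketch of that whitelist idea, assuming Apache with mod_rewrite; the allow patterns here are illustrative placeholders, not anyone's real list:)

```apache
# Allow anything matching a browser-ish or wanted-engine pattern;
# everything else is forbidden. Patterns are illustrative only.
RewriteCond %{HTTP_USER_AGENT} !(Mozilla|Opera) [NC]
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot) [NC]
RewriteRule .* - [F]
```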

lucy24

6:05 pm on May 20, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Wherever possible, I put my UA-based lockouts in mod_setenvif, where the form is either BrowserMatch or, if absolutely necessary, BrowserMatchNoCase. This lets me put the list in a preliminary separate htaccess file shared by all sites, reserving mod_rewrite for site-specific htaccess files. Of course this only works on shared hosting if you've got a "userspace"-based setup.

The problem with [NC] or NoCase is that the server doesn't just check for "Gimme" and "gimme" as if the flag meant "Title Case Optional". It has to continue checking for both I and i, M and m and so on.

The problem with mod_rewrite is the shooting-flies-with-an-elephant-rifle aspect. If mod_security could be used in htaccess, I'd use it for the more complicated if/then constructions and I wouldn't need to use mod_rewrite for access control at all.
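(A sketch of that mod_setenvif setup in the shared htaccess file, assuming Apache 2.2-style access control; 2.4 would express the deny as "Require not env bad_bot" instead of Order/Deny:)

```apache
# Case-sensitive BrowserMatch sets an env flag on matching UAs,
# and the access-control directives deny anything flagged.
BrowserMatch "gimme" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```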

Pfui

6:17 pm on May 20, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I take the opposite approach:

To blacklist or whitelist has long been a topic here. I switched to whitelisting years ago after mod IncrediBILL convinced me of its utility. (waves to B) Thing is, the more exploits started mimicking regular browsers, the more difficult it became to keep the bad guys out. So a special 'banned UAs' section acts as a belt-and-suspenders solution for me. Ditto banned URIs, REFs, etc.

At the risk of going too far off post-topic... Generally speaking, what other kinds of detectors do you employ? (And on which platform?)

trintragula

10:29 pm on May 20, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Checking the useragent will only catch the bots that are willing to identify themselves (or are careless) - but that does account for a lot of them.
Because my site is a customised forum, I use some PHP code and several SQL tables that I added to the forum software and database.
Besides the useragent whitelisting, I also check some of the other common headers (another callout to incrediBill :) - the language header works particularly well). I have a table that tracks whether visitors are picking up supporting files and a trap which stops them if they don't. I do speed checks and also some basic repetitive/excessive behaviour checks based on patterns of recent activity.
I have a couple of invisible links, one of which points at a robotted out file, so I can detect crawlers, both good and evil. This will usually stop crawlers after 3-4 requests.
I also have a table of IPs for visitors that have been caught by some of the traps, so they don't get to do anything further.
There's also a botnet detector, which watches for similar characteristics across recent visitors. (Some of the other multi-request checks take a few hits to kick in, so a few of the botnets were slipping past sometimes.) It's pretty simplistic at the moment, but seems to do the job.
I also stop visitors who use a ridiculous number of different user agents. Some of the spammers will do this - presumably to evade blacklisted useragents.
All in all I have about a dozen traps, and have been running them for a couple of years now.
I have more traps in testing, and I'm always on the lookout for new ideas.
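(The invisible-link trap can be sketched at its simplest without the PHP layer: a hypothetical /trap/ path, disallowed in robots.txt, that only non-compliant crawlers should ever request. The stateful part - recording the IP and blocking later requests - would still need server-side code like the SQL tables described above:)

```apache
# robots.txt (compliant crawlers will never request /trap/):
#   User-agent: *
#   Disallow: /trap/

# .htaccess: anything that does request it gets a 403; a real setup
# would also record the offending IP for blocking future requests.
RewriteRule ^trap/ - [F,L]
```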

frankjance

9:06 pm on Jul 31, 2015 (gmt 0)

10+ Year Member



I just got hit with the GimmeUSA bot, which brought me to this thread. Has anyone here tried Anti-Hammer? (http://corz.org/server/tools/anti-hammer/) It seems to be pretty powerful, allowing you to ban by agent, IP, referrer, etc. It's a PHP script that gets called by your ini.php file, but it's pretty quick and I don't notice any slowdown because of it.

Just curious what others thought about it.

Thanks,
Frank at SurfShopCART

keyplyr

4:02 am on Aug 1, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I allow GimmeUSA bot. I sell things, so...

Not only that, but if you publish ads (Adsense, Microsoft Ads, etc) or sell ad space yourself, it's wise to research what marketing companies do biz with whom. It may contribute to higher bidding.

keyplyr

6:02 am on Aug 3, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



BTW - gimmi60bot, gimmie60bot, gimmiusabot & gimmieusabot also use Google proxy. I just verified that my products were included in their SERP and the GET requests came from:

UA: Mozilla/5.0 (compatible) Feedfetcher-Google;(+http://www.google.com/feedfetcher.html)
HOST: rate-limited-proxy-66-249-92-20.google.com

So... IMO this suggests that actors can get access to web page data via Google's free feedfetcher service, a pretty sneaky tactic! (even though I don't object to this particular actor.)