
How to ban a crawler that I can't see

I know I'm being crawled, but I can't find the bot

     

arieng

7:16 pm on Jun 23, 2010 (gmt 0)

5+ Year Member



First of all, I'll admit that I'm a bit overwhelmed by this particular forum. We have a fair-to-middling robots.txt file, and ban the occasional bot that goes way overboard in violating it, but for the most part we're pretty hands-off. It hasn't really become a huge problem for us in terms of bandwidth or content scraping. We're strictly ecommerce and yeah, we get scraped from time to time, but we haven't seen anything that makes us think it affects our bottom line.

However, we're in an industry where the biggest of big ecomm sites has started showing some significant interest. I know that this particular company has a price matching program that crawls a list of competitor sites and automatically matches the lowest price. I also heard through the grapevine that our site is on that list of sites being price matched.

I'd love nothing more than to ban their bot. However, I've looked through our logs and I don't see anything that looks suspicious. They must be masking their user agent and who knows what else, and I just don't know enough about how to proceed.

Does anyone have any tricks for identifying these camouflaged bots? Based on anecdotal information and gut feelings, I think there may be several competitors scraping our prices.

*** SIDE NOTE ***

About a year ago, we built a similar program to scrape pricing from some competitors who always seemed a step ahead of us on the pricing of competitive products. I felt that if they were doing it to us it was fair game. I started a discussion here about the ethics of what we were doing, and learned that it maybe wasn't as white hat as I'd thought. FWIW, we've shelved that project based on the feedback I got here, particularly that of one of this forum's mods. We really are trying to be one of the good guys, but it's getting flippin' hard.

jdMorgan

9:42 pm on Jun 23, 2010 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Look for "browser" user-agents coming from hosting companies -- Web servers don't use browsers. Look up the reverse-DNS of all requests, or perhaps your 'stats' have a setting or report that may be useful in this regard. Otherwise, you could write a small script to do it, and log the results, eliminating duplicates to keep the data manageable in size.
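A minimal sketch of that kind of lookup script (assuming a combined-format access log at a hypothetical path, with the client IP as the first field on each line):

# Reverse-DNS every unique client IP in an access log, so "browser"
# user-agents coming from hosting companies stand out.
import socket

seen = {}
with open("access.log") as log:                  # hypothetical log path
    for line in log:
        ip = line.split(" ", 1)[0]
        if ip in seen:
            continue
        try:
            host = socket.gethostbyaddr(ip)[0]   # PTR lookup
        except (socket.herror, socket.gaierror):
            host = "(no PTR record)"
        seen[ip] = host
        print(ip, host)

Hostnames that resolve back to hosting or colocation providers rather than consumer ISPs are the ones worth a closer look.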

Look for things that are "wrong" with the requests. Compare all aspects of the HTTP requests to those of a real browser.

In most cases, you will find omissions, outright errors, or discrepancies.
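For instance, a hedged sketch of one such comparison (the header names are only common examples and real browsers vary, so treat a match as a signal rather than proof):

# Flag requests whose user-agent claims a browser but whose headers are
# missing things real browsers normally send. "headers" is a dict of the
# request's header names and values.
def looks_suspicious(headers):
    ua = headers.get("User-Agent", "")
    claims_browser = any(t in ua for t in ("Mozilla", "Chrome", "Safari"))
    expected = ("Accept", "Accept-Language", "Accept-Encoding")
    missing = [h for h in expected if h not in headers]
    return claims_browser and bool(missing)

print(looks_suspicious({"User-Agent": "Mozilla/5.0 (compatible)"}))  # True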

Obviously, I can't say any more, since I don't want to tell these bozos how to scrape *my* sites...

Jim

arieng

3:20 pm on Jun 24, 2010 (gmt 0)

5+ Year Member



Pretty much as I expected, doesn't sound like there's any easy, foolproof way to find a crawler that doesn't want to be found. I'll try what you mentioned, but I'm not holding out much hope.

wilderness

4:48 pm on Jun 24, 2010 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Pretty much as I expected, doesn't sound like there's any easy


If you're familiar with the page and directory structure of your website(s), and you understand how to read your raw visitor logs, there's no reason you wouldn't be able to recognize a solitary IP range, or a group of IP ranges working in unison, crawling your site(s).

Unfortunately, seeking guidance from someone who doesn't know your page and directory structure, or how your raw logs read, isn't really an option. There's no way for such an outside person to understand the goals and structure of your website(s).

You seem to be interested in some out-of-the-box package (as are many others), when such a package does not exist.

arieng

5:27 pm on Jun 24, 2010 (gmt 0)

5+ Year Member



Hi Wilderness, as I mentioned initially we're not that technically advanced on this side of things. My background is in marketing and a lot of our technical savvy is outsourced. I'm not sure there's anyone here internally that has the know-how to really crack this nut.

You seem to be interested in some out-of-the-box package


You're right, that's exactly what I was hoping for!

such a package does not exist


And that's the answer that I was expecting, but I thought it worth asking.

I'll dig into this as best I can and see what I can find. Thanks to both of you for your knowledgeable input.

blend27

3:34 am on Jun 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



CSS File
#header .navUtility {display: block; text-indent: -9999px;}

On the first page hit, this bit of markup is presented inside the HTML (I make it an H3/H4):
<div id="header"><a class="navUtility" href="/botTrap.ext">Some short text here</a></div>

Anyone (by IP) who makes it into the /botTrap.ext page is banned automatically until I manually review what and who it was (by IP range), with no disadvantage to my sites. /botTrap.ext is noindex, noarchive. This is straight cloaking based on IP ranges for the "good bots".
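The server side isn't shown above, but a minimal sketch of the banning part (hypothetical file name and path; in practice it would sit behind the good-bot IP-range cloaking described above) could be:

# Any IP that requests the trap URL is appended to a blocklist file,
# and every normal page checks that list before rendering.
BLOCKLIST = "banned_ips.txt"                     # hypothetical path

def record_trap_hit(ip):
    # called from the /botTrap.ext handler
    with open(BLOCKLIST, "a") as f:
        f.write(ip + "\n")

def is_banned(ip):
    # called at the top of every page; serve a 403 if it returns True
    try:
        with open(BLOCKLIST) as f:
            return ip in (line.strip() for line in f)
    except FileNotFoundError:
        return False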

Most scrapers/comment spammers run via proxies, and the same proxy lists are out there for the taking by white hats. Comparing the IP, per session, against Project Honey Pot works wonders; there are other sources as well.
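As a rough illustration of that per-session check: DNS blocklists are usually queried by reversing the IP's octets under the list's zone, and Project Honey Pot's http:BL additionally requires a registered access key and has its own documented response codes, so the zone and key below are placeholders to be checked against the real documentation:

# Look up a visitor's IP against a DNSBL-style blocklist.
# Zone and access key are placeholders, not a real service definition.
import socket

def listed(ip, zone="dnsbl.example.org", access_key=None):
    reversed_ip = ".".join(reversed(ip.split(".")))
    query = ".".join(filter(None, [access_key, reversed_ip, zone]))
    try:
        socket.gethostbyname(query)   # any answer means the IP is listed
        return True
    except socket.gaierror:
        return False                  # no record: not listed

print(listed("203.0.113.9"))          # False against the placeholder zone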

One thing I realized a while back is that being scared to ban a user ends up costing more time in the long run than actually doing it. I dropped "politically correct" a while back and haven't lost any sleep over it for the past several years.

It has worked for me for quite a while.

arieng

3:41 pm on Jun 25, 2010 (gmt 0)

5+ Year Member



Thanks Blend. That's a fine idea. I'll present it to our programming team and see if they can put something together.

dstiles

7:19 pm on Jun 25, 2010 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



Blend27 - I have a similar trap on several sites. I get only two or three catches a month on it. I trap a lot more with another type of link trap. But on the whole, most of the NEW traps are triggered by user-agents and headers. Once trapped, most of them are then IP-trapped.

My own experience of scrapers/spammers is that they come direct from dynamic IPs of countries such as China and India or, far more often, from servers (many, many from USA) pretending to be browsers or even labelling themselves as bots hoping to fool people.

Arieng - If you have the capability, block every server farm you find except, obviously, genuine bots that YOU want to let through (I currently have about 3000 server ranges blocked and still adding more daily). If you have no customers in China, Korea, India etc then block those IP ranges as well. Also Ukraine.
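The mechanics of that kind of range blocking are simple; keeping the list current is the real work. A sketch (the ranges shown are documentation addresses, not real server farms):

# Check a visitor's IP against a list of blocked CIDR ranges.
import ipaddress

BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),        # illustrative only
    ipaddress.ip_network("198.51.100.0/24"),
]

def blocked(ip):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_RANGES)

print(blocked("198.51.100.7"))   # True
print(blocked("203.0.113.5"))    # False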

blend27

11:25 pm on Jun 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Also Ukraine.


Beautiful country! Spent my 14th, 15th and 16th summers there!

I do speak the language and run 2 e-comm sites that are mirrored in the spoken language of that country, since the items we move originate from the region. Most of the "shysters" I see are from the 195.* and 91.* ranges, although those first octets are not allocated to just Ukraine.

As I mentioned before, this is not new, so I know for a fact that people from mostly-banned countries like China, India, Ukraine and such don't really complain much when they are faced with a CAPTCHA.

One suggestion for the ones just getting into it: don't lose your sleep over it. Get your content indexed first, before anyone else sees it. That cuts down 90+% of the consequences; after that it is a full-time job if you choose to…

dstiles

9:54 pm on Jun 26, 2010 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



I don't block Ukraine or, indeed, any of the other countries I mentioned - at least not at the moment (apart from servers and the occasional baddy).

Some of my web sites are of no interest to certain countries - eg UK-only shops - so I have an option in place to block per domain. Time being what it is, this has not yet been implemented, mainly because I need to do a bit more testing.

enigma1

8:53 am on Jun 29, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Anyone(IP) who makes into [1.example.com...] page is banned automatically until I manually review(IP Range) what and who it was without any disadvantage to my sites.


And what happens when your competitors set up on their website:

<img src="http://1.example.com/botTrap.ext" style="display:none" />

You would then ban every single visitor/bot that goes to your competitor's site, wouldn't you, since they will pull the trap file from your server? Possibly you will ban your own customers too.

PS: Added the example.com in the quote text for the example.

blend27

7:24 pm on Jun 29, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hi enigma1,

Very Good point!

Fortunately for me, botTrap.ext is a dynamic file name that only "exists" for that IP for a specific amount of time. Sometimes there is no .ext in the URL either. ;)
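One way such a per-IP, time-limited trap URL could work (a guess at the mechanics, not necessarily the scheme described above): derive the token from the visitor's IP and the current time window with a keyed hash, and only count a request as a trap hit when the token matches.

# A trap URL token tied to the visitor IP and a time window, so the trap
# "exists" only for that IP and only for a while. SECRET and the 10-minute
# window are arbitrary illustrative choices.
import hashlib, hmac, time

SECRET = b"change-me"
WINDOW = 600          # seconds a token stays valid

def trap_token(ip, when=None):
    bucket = int((when if when is not None else time.time()) // WINDOW)
    msg = ("%s|%d" % (ip, bucket)).encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:16]

def is_trap_hit(ip, token):
    # accept the current and previous window so a freshly served link
    # doesn't expire in the visitor's hands
    now = time.time()
    return token in (trap_token(ip, now), trap_token(ip, now - WINDOW))

link = "/" + trap_token("203.0.113.9")   # embed as the hidden href
print(link, is_trap_hit("203.0.113.9", link.lstrip("/")))   # ... True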

enigma1

9:07 pm on Jun 29, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



My point is, if a trap is identified it can be used by others for purposes other than the ones you intend.

Sure, you can specify an identifier based on IP, date, time, millisecond if you like, and encode it in the most secure way that exists today.

But if the trap is known, the other site can send an iframe with some JS to its own visitors, execute that JS from within the iframe, and have it request links and pages from your site without the visitor ever knowing. We can take it further if you like, and the same-origin policy won't apply.

Also think about the PPC encoding of links that spiders do, which "looks" pretty secure (a similar approach to the one you mentioned for avoiding false positives); sponsors have these long link lines..., because the encoding isn't published.

But in the end it makes no difference. The moment the client has insecure scripting running - Flash, JS, etc. - many types of attacks become possible. And there is lots of it; estimates put 90-95% of human visitors as having JS enabled today.

dstiles

10:51 pm on Jun 29, 2010 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



None of my hidden methods has given me false positives and I can't see how they could be used as an exploit or other attack. They are simple links, nothing more, with an instruction not to follow them and a block on the target in robots.txt. If they follow the link they will get canned with a 403, which they have probably already received for a previous offence or simply because they are from a web server.
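The shape of that setup, roughly (path and link text are illustrative): a disallowed path in robots.txt plus a link that polite crawlers are told not to follow, so anything that requests it anyway has identified itself.

robots.txt:
User-agent: *
Disallow: /trap/

HTML:
<a href="/trap/" rel="nofollow" style="display:none">do not follow</a>

Anything hitting /trap/ then gets the 403 treatment described above.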

rowan194

1:54 am on Jun 30, 2010 (gmt 0)

5+ Year Member



Quick tip: most bots don't bother to load images. If you have access to raw server logs, you should be able to see IPs which are only loading a single HTML object, rather than rendering a complete page by loading multiple images and other objects.

For example...

/
/someotherpage.php?variable=abc
/someotherpage.php?variable=123

Rather than...

/
/images/logo.jpg
/images/somethumb.jpg
/someotherpage.php?variable=abc
/images/logo.jpg
/images/content/thispage.jpg
/someotherpage.php?variable=123
/images/blah/blah.jpg

etc.

The advantage of this method is that it's 100% passive; the only false positives will come from your own interpretation of the logs. :)
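A rough sketch of that kind of passive check, assuming a combined-format access log at a hypothetical path (IP first on each line, request line in quotes):

# Find IPs that request pages but never request images/CSS/JS,
# which is typical of bots rendering nothing.
from collections import defaultdict

ASSET_EXTS = (".jpg", ".jpeg", ".png", ".gif", ".css", ".js", ".ico")
pages = defaultdict(int)
assets = defaultdict(int)

with open("access.log") as log:              # hypothetical path
    for line in log:
        try:
            ip = line.split(" ", 1)[0]
            request = line.split('"')[1]     # e.g. GET /page.php HTTP/1.1
            path = request.split(" ")[1]
        except IndexError:
            continue
        if path.split("?")[0].lower().endswith(ASSET_EXTS):
            assets[ip] += 1
        else:
            pages[ip] += 1

for ip in pages:
    if assets[ip] == 0 and pages[ip] > 3:    # threshold is arbitrary
        print(ip, pages[ip], "page hits, no images/CSS/JS")

Like the tip itself, this only flags candidates; a browser behind a caching proxy or with images disabled could look the same.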
 
