
Search Engine Spider and User Agent Identification Forum

    
How to ban a crawler that I can't see
I know I'm being crawled, but I can't find the bot
arieng
msg:4157818 - 7:16 pm on Jun 23, 2010 (gmt 0)

First of all, I'll admit that I'm a bit overwhelmed by this particular forum. We have a fair-to-middling robots.txt file, and ban the occasional bot that goes way overboard in violating it, but for the most part we're pretty hands-off. It hasn't really become a huge problem for us in terms of bandwidth or content scraping. We're strictly ecommerce and yeah, we get scraped from time to time, but we haven't seen anything that makes us think it affects our bottom line.

However, we're in an industry where the biggest of big ecomm sites has started showing some significant interest. I know that this particular company has a price matching program that crawls a list of competitor sites and automatically matches the lowest price. I also heard through the grapevine that our site is on that list of sites being price matched.

I'd love nothing more than to ban their bot. However, I've looked through our logs and I don't see anything that looks suspicious. They must be masking their user agent and who knows what else, and I just don't know enough about how to proceed.

Does anyone have any tricks for identifying these camouflaged bots? Based on anecdotal information and gut feelings, I think there may be several competitors scraping our prices.

*** SIDE NOTE ***

About a year ago, we built a similar program to scrape pricing from some competitors who always seemed a step ahead of us on the pricing of competitive products. I felt that if they were doing it to us it was fair game. I started a discussion here about the ethics of what we were doing, and learned that it maybe wasn't as white hat as I'd thought. FWIW, we've shelved that project based on the feedback I got here, particularly that of one of this forum's mods. We really are trying to be one of the good guys, but it's getting flippin' hard.

 

jdMorgan
msg:4157933 - 9:42 pm on Jun 23, 2010 (gmt 0)

Look for "browser" user-agents coming from hosting companies -- Web servers don't use browsers. Look up the reverse-DNS of all requests, or perhaps your 'stats' have a setting or report that may be useful in this regard. Otherwise, you could write a small script to do it, and log the results, eliminating duplicates to keep the data manageable in size.

Look for things that are "wrong" with the requests. Compare all aspects of the HTTP requests to those of a real browser.

In most cases, you will find omissions, outright errors, or discrepancies.
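As a generic illustration of that kind of check (a sketch only; the header set and the mismatch rule below are assumptions, not Jim's actual criteria):

# Real browsers almost always send these headers; naive bots often omit them.
EXPECTED = ("Accept", "Accept-Language", "Accept-Encoding", "User-Agent")

def header_anomalies(headers):
    """Return a list of things that look 'wrong' with a request."""
    problems = ["missing " + h for h in EXPECTED if h not in headers]
    # A 'Mozilla' user-agent with no Accept-Language is a classic mismatch.
    if headers.get("User-Agent", "").startswith("Mozilla") \
            and "Accept-Language" not in headers:
        problems.append("browser UA but no Accept-Language")
    return problems

print(header_anomalies({"User-Agent": "Mozilla/5.0"}))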

Obviously, I can't say any more, since I don't want to tell these bozos how to scrape *my* sites...

Jim

arieng
msg:4158466 - 3:20 pm on Jun 24, 2010 (gmt 0)

Pretty much as I expected: it doesn't sound like there's any easy, foolproof way to find a crawler that doesn't want to be found. I'll try what you mentioned, but I'm not holding out much hope.

wilderness
msg:4158567 - 4:48 pm on Jun 24, 2010 (gmt 0)

Pretty much as I expected, doesn't sound like there's any easy


Are you familiar with the page and directory structure of your website(s), and do you understand how to interpret your raw visitor logs?

If so, there's no reason you wouldn't be able to recognize a solitary IP range, or a group of IP ranges working in unison, crawling your site(s).
Unfortunately, seeking guidance from someone who lacks that familiarity with your page and directory structure and your raw visitor logs is not an option; there's no way for an outside person to understand the goals and structure of your website(s).

You seem to be interested in some out-of-the-box package (as are many others), when such a package does not exist.

arieng
msg:4158601 - 5:27 pm on Jun 24, 2010 (gmt 0)

Hi Wilderness, as I mentioned initially, we're not that technically advanced on this side of things. My background is in marketing, and a lot of our technical savvy is outsourced. I'm not sure there's anyone here internally who has the know-how to really crack this nut.

You seem to be interested in some out-of-the-box package


You're right, that's exactly what I was hoping for!

such a package does not exist


And that's the answer that I was expecting, but I thought it worth asking.

I'll dig into this as best I can and see what I can find. Thanks to both of you for your knowledgeable input.

blend27
msg:4158945 - 3:34 am on Jun 25, 2010 (gmt 0)

CSS File
#header .navUtility {display: block; text-indent: -9999px;}

On the first page hit, this bit of code is presented inside the HTML (I make it an H3/H4):
<div id="header"><a class="navUtility" href="/botTrap.ext">Some short text here</a></div>

Any IP that makes it into the /botTrap.ext page is banned automatically until I manually review (by IP range) what and who it was, with no disadvantage to my sites. /botTrap.ext is <noindex, noarchive>. This is straight cloaking, based on IP ranges, for the "good bots".
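A minimal server-side sketch of the same idea (Python/Flask purely for illustration; blend27's actual stack isn't stated, and the in-memory ban set stands in for whatever persistence you use):

from flask import Flask, abort, request

app = Flask(__name__)
banned = set()  # placeholder; persist this in a real setup

@app.before_request
def enforce_ban():
    # Refuse everything from an IP that has already hit the trap.
    if request.remote_addr in banned:
        abort(403)

@app.route("/botTrap.ext")
def bot_trap():
    # Only something following the hidden link ends up here.
    banned.add(request.remote_addr)  # queue for manual review in practice
    abort(403)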

Most scrapers/comment spammers run via proxies, and the same proxy lists are out there for the taking by white hats. Comparing the IP, per session, against Project Honey Pot's blacklist works wonders; there are other sources as well.
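Project Honey Pot's http:BL is queried over DNS; a hedged sketch of a per-session lookup (the access key is a placeholder you get free from projecthoneypot.org, and the response format shown is the one http:BL documents):

import socket

HTTPBL_KEY = "yourkey"  # placeholder access key

def httpbl_lookup(ip):
    # Query is <key>.<reversed octets>.dnsbl.httpbl.org
    reversed_ip = ".".join(reversed(ip.split(".")))
    try:
        result = socket.gethostbyname(
            HTTPBL_KEY + "." + reversed_ip + ".dnsbl.httpbl.org")
    except socket.gaierror:
        return None  # NXDOMAIN: IP not listed
    # Response is 127.<days since last seen>.<threat score>.<visitor type>
    _, days, threat, vtype = (int(octet) for octet in result.split("."))
    return {"days": days, "threat": threat, "type": vtype}

print(httpbl_lookup("127.1.1.1"))  # commonly cited http:BL test address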

One thing I realized a while back: being scared to ban a user is, in the long run, more costly time-wise than actually doing it. I dropped being POLITICALLY CORRECT a while back and haven't lost any sleep over it for the past several years.

It's worked for me for quite a while.

arieng
msg:4159288 - 3:41 pm on Jun 25, 2010 (gmt 0)

Thanks Blend. That's a fine idea. I'll present it to our programming team and see if they can put something together.

dstiles
msg:4159444 - 7:19 pm on Jun 25, 2010 (gmt 0)

Blend27 - I have a similar trap on several sites. I get only two or three hits a month on it; I trap a lot more with another type of link trap. But on the whole, most of the NEW traps are based on user-agents and headers. Once trapped, most offenders are then IP-trapped.

My own experience of scrapers/spammers is that they come directly from dynamic IPs in countries such as China and India or, far more often, from servers (many, many in the USA) pretending to be browsers, or even labelling themselves as bots hoping to fool people.

Arieng - If you have the capability, block every server farm you find except, obviously, genuine bots that YOU want to let through (I currently have about 3000 server ranges blocked and am still adding more daily). If you have no customers in China, Korea, India, etc., then block those IP ranges as well. Also Ukraine.
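A sketch of that kind of range blocking with Python's ipaddress module (the two ranges below are RFC 5737 documentation networks, placeholders for a real list of thousands):

from ipaddress import ip_address, ip_network

BANNED_RANGES = [ip_network("192.0.2.0/24"), ip_network("198.51.100.0/24")]

def is_banned(ip):
    # Linear scan for clarity; sort and binary-search a real 3000-range list.
    addr = ip_address(ip)
    return any(addr in net for net in BANNED_RANGES)

print(is_banned("192.0.2.77"))   # True: inside a banned range
print(is_banned("203.0.113.5"))  # False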

blend27
msg:4159600 - 11:25 pm on Jun 25, 2010 (gmt 0)

Also Ukraine.


Beautiful country! Spent my 14th, 15th, and 16th summers there!

I do speak the language, and I run 2 e-com sites that are mirrored in the local language for that country, since the items we move originate from the region. Most of the "shysters" I see are from the 195.* and 91.* ranges, although those first octets are not allocated to just Ukraine.

As I mentioned before, this is not new, so I know for a fact that people from mostly-banned countries like China, India, Ukraine and such don't really complain that much when they are faced with a CAPTCHA.

One suggestion for those just getting in on it: don't lose sleep over it. Get your content indexed first, before anyone sees it. That cuts down on 90+% of the consequences; after that, it's a full-time job if you choose to…

dstiles
msg:4159972 - 9:54 pm on Jun 26, 2010 (gmt 0)

I don't block Ukraine or, indeed, any of the other countries I mentioned - at least not at the moment (apart from servers and the occasional baddy).

Some of my web sites are of no interest to certain countries - eg UK-only shops - so I have an option in place to block per domain. Time being what it is, this has not yet been implemented, mainly because I need to do a bit more testing.

enigma1
msg:4161174 - 8:53 am on Jun 29, 2010 (gmt 0)

Anyone(IP) who makes into [1.example.com...] page is banned automatically until I manually review(IP Range) what and who it was without any disadvantage to my sites.


And what happens when your competitors set up on their website:

<img src="http://1.example.com/botTrap.ext" style="display:none" />

You'd then ban every single visitor/bot that goes to your competitor's site, wouldn't you? They will all pull the trap file from your server. Possibly you will ban your own customers too.

PS: Added the example.com in the quote text for the example.

blend27
msg:4161557 - 7:24 pm on Jun 29, 2010 (gmt 0)

Hi enigma1,

Very Good point!

Fortunately for me, botTrap.ext is a dynamic file name that only "exists" for that IP for a specific amount of time. Sometimes there is no .ext in the URL, either. ;)
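One way such a per-IP, expiring trap name could be built (a sketch under assumptions; blend27's actual scheme isn't published) is an HMAC over the visitor's IP and a coarse time bucket:

import hashlib
import hmac
import time

SECRET = b"change-me"  # placeholder server-side secret
WINDOW = 3600          # trap name rotates hourly

def trap_name(ip, offset=0):
    bucket = int(time.time()) // WINDOW + offset
    msg = (ip + ":" + str(bucket)).encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:16]

def is_valid_trap(ip, name):
    # Accept the current and previous window to tolerate clock edges.
    return any(hmac.compare_digest(trap_name(ip, o), name) for o in (0, -1))

A hot-linked trap URL then fails for every visitor except the one IP it was generated for, and only within its time window.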

enigma1
msg:4161620 - 9:07 pm on Jun 29, 2010 (gmt 0)

My point is, if a trap is identified, it can be used by others for purposes other than those you intend.

Sure, you can specify an identifier based on IP/date, time, millisecond if you like, and encode it in the most secure way that exists today.

But if the trap is known, the other site can send an iframe with some JS to its visitors' browsers; the code executing from within the iframe can request links and pages from your site, and the visitor won't even know it. We can take it further if you like, and SOP (the same-origin policy) won't apply.

Also think about the PPC encoding of links that spiders do, which "looks" pretty secure (a similar approach to the one you mentioned to avoid false positives); sponsors have these long link lines... because the encoding isn't published.

But in the end it makes no difference. The moment the client has insecure scripting running (Flash, JS, etc.), many types of attack become possible. And there is a lot of it: estimates put 90-95% of human visitors as having JS enabled today.

dstiles
msg:4161692 - 10:51 pm on Jun 29, 2010 (gmt 0)

None of my hidden methods has given me false positives, and I can't see how they could be used as an exploit or other attack. They are simple links, nothing more, with an instruction not to follow them and a block on the target in robots.txt. If they follow the link, they get canned with a 403, which they have probably already received for a previous offence or simply because they are from a web server.
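For concreteness, that arrangement amounts to something like this (the /trap/ path is illustrative):

In robots.txt:
User-agent: *
Disallow: /trap/

Hidden in the page:
<a href="/trap/" rel="nofollow" style="display:none">&nbsp;</a>

Anything that ignores both the Disallow and the nofollow and still requests /trap/ earns the 403.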

rowan194
msg:4161797 - 1:54 am on Jun 30, 2010 (gmt 0)

Quick tip: most bots don't bother to load images. If you have access to raw server logs, you should be able to see IPs which are only loading a single HTML object, rather than rendering a complete page by loading multiple images and other objects.

For example...

/
/someotherpage.php?variable=abc
/someotherpage.php?variable=123

Rather than...

/
/images/logo.jpg
/images/somethumb.jpg
/someotherpage.php?variable=abc
/images/logo.jpg
/images/content/thispage.jpg
/someotherpage.php?variable=123
/images/blah/blah.jpg

etc.

The advantage of this method is that it's 100% passive; the only false positives will come from your own interpretation of the logs. :)
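A minimal sketch of that passive check (Python; the log path, the asset-extension list, and the 5-page threshold are assumptions):

import re
from collections import defaultdict

ASSET_EXT = (".jpg", ".jpeg", ".png", ".gif", ".css", ".js", ".ico")
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+)')

pages, assets = defaultdict(int), defaultdict(int)
with open("access.log") as log:  # placeholder path
    for line in log:
        m = LINE_RE.match(line)
        if not m:
            continue
        ip, path = m.groups()
        if path.lower().split("?")[0].endswith(ASSET_EXT):
            assets[ip] += 1
        else:
            pages[ip] += 1

# IPs with several page requests and zero asset requests look bot-like.
for ip, count in sorted(pages.items(), key=lambda kv: -kv[1]):
    if count >= 5 and assets[ip] == 0:
        print(ip, count, "page requests, no assets")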
