Search Engine Spider and User Agent Identification Forum

How to trap bots?
That crawl disallowed pages.
Sunnz
msg:4171971 - 6:04 am on Jul 17, 2010 (gmt 0)

I got something like

User-agent: *
Disallow: /some-random-page.html

in my robots.txt to trap bad bots, and I am just monitoring my server log to see which IPs actually access that page.

I don't actually link to that page anywhere on my web site, to prevent trapping real visitors.

Are there better ways to trap bad bots? Do they actually read robots.txt to find out what you don't want them to crawl, and then proceed to grab that page? Or do I have to provide a link to that page somehow?

lammert
msg:4172000 - 6:44 am on Jul 17, 2010 (gmt 0)

You can replace that /some-random-page.html file with a script file, in PHP for example, which automatically executes a piece of code that adds the visitor's IP to your blacklist whenever the script is called.

You should make sure that no visible links exist which regular users could follow to that page; otherwise you may lose some valuable visitors.
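A minimal sketch of such a trap script, with placeholder paths; it assumes .htaccess or a rewrite maps /some-random-page.html to this PHP file:

<?php
// Bot trap: anyone who requests this URL gets logged and turned away.
// The blacklist path is a placeholder, not from this thread.
$blacklist = '/var/lib/bottrap/blacklist.txt';
$ip = $_SERVER['REMOTE_ADDR'];

if (filter_var($ip, FILTER_VALIDATE_IP)) {
    file_put_contents($blacklist, $ip . "\n", FILE_APPEND | LOCK_EX);
}

header('HTTP/1.1 403 Forbidden');
echo 'Forbidden';
?>

Your other scripts (or a job that rebuilds .htaccess) can then check incoming IPs against that file before serving content.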

youfoundjake
msg:4172020 - 7:51 am on Jul 17, 2010 (gmt 0)

I set up a bot trap almost 4 years ago, and I'm still using it.
Maybe check this out:
[webmasterworld.com...]

dstiles
msg:4172162 - 8:53 pm on Jul 17, 2010 (gmt 0)

My experience is that almost no bots will fall into a trap that is ONLY listed in robots.txt. No doubt a few used to, but in general I think bad bots never read robots.txt in the first place. The only use I've found for adding the trap page to robots.txt is to prevent real SEs from falling into it.

You have to put the trap on a normally (or abnormally) visited page, and that page adds the IP to a list of banned IPs. Or whatever.

jdMorgan
msg:4172177 - 9:18 pm on Jul 17, 2010 (gmt 0)

Another useful technique is to call the trap script from within .htaccess and/or your other scripts whenever you find something amiss with the client request. Basically, the trap script adds the requestor's IP address to a "Deny from" or SetEnvIf directive in .htaccess. This is useful both when a bad bot fetches the trap script itself (a basic robots.txt violation) and when a request is rewritten to the trap script because it is in some way invalid: say, missing, incorrect, or inconsistent HTTP request headers, requests for known-exploit URLs, etc.
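A sketch of what the corresponding .htaccess might look like (Apache 2.2 syntax; the IP, the trap path, and the empty User-Agent test are illustrative assumptions, not jdMorgan's exact setup):

# Ban list; the trap script appends "Deny from" lines here
Order Allow,Deny
Allow from all
Deny from 192.0.2.10

# Example of rewriting an obviously bad request into the trap:
# here, any request that sends no User-Agent header at all.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule .* /trap.php [L]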

Note that there are both PHP and Perl trap scripts published here on WebmasterWorld. They work in different ways, but both implement the "add IP address to .htaccess ban list" function.

Jim

youfoundjake
msg:4172181 - 10:00 pm on Jul 17, 2010 (gmt 0)

The only use I've found for adding the trap page to robots.txt is to prevent real SEs from falling into it.

And therein lies the rub. I don't want to ban the SEs, but if they read robots.txt and still get banned, then good. I've actually had Slurp get banned. During the SearchMonkey conference a couple of years ago, one of the engineers there flat out said Slurp doesn't always obey.

keyplyr
msg:4172200 - 10:34 pm on Jul 17, 2010 (gmt 0)

During the SearchMonkey conference a couple of years ago, one of the engineers there flat out said Slurp doesn't always obey.

Imagine that.

enigma1
msg:4172360 - 10:16 am on Jul 18, 2010 (gmt 0)

I don't actually link to that page anywhere on my web site, to prevent trapping real visitors.

Yes, but if another site posts that link on purpose, or emits a redirect header for a request, sending the spider to your link, you may see the spider (or whoever is behind it) get in and trigger the trap. So I don't use this kind of trap method, as it often generates false positives.

Megaclinium
msg:4173765 - 1:32 am on Jul 21, 2010 (gmt 0)

Maybe you should put the trap in the real pages.
I do, but I put it so it is visible in the HTML source yet not visible to an actual user (i.e. I link a blank space or something like that, which no user would click on).

Anything that hits it is then obviously a bot, even if it has a cloaked UA.

I also see hack attempts against directories that don't exist. While I 403 these IPs and their host ranges, I often set up the directories the attempt was probing and put large nonsense binary files, renamed as the files they were looking for, to slow them down if they return from a different range.

Sunnz
msg:4174039 - 1:58 pm on Jul 21, 2010 (gmt 0)

So I do need to link to that trap page somehow. I guess that makes sense; bad bots most likely won't bother to read robots.txt at all!

So what's the best way to put that link in? You don't want real visitors to click on it, do you? ;)

Hide it with CSS? (display: none; visibility: hidden; opacity: 0; position: absolute; top: -999em; width: 0; height: 0; z-index: -999;)

Do I have to put anything inside the <a> tags? Or will <a href="some-random-page.html"> </a> do?

On top of the above, wrap it inside <noscript></noscript>?

Would that degrade gracefully? What if the visitor is using lynx or a screen reader? Accessibility is important too!
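For what it's worth, here is one hypothetical way to combine those ideas; the off-screen positioning and the tabindex are illustrative assumptions, not a tested recipe from this thread:

<!-- Hypothetical hidden trap link: positioned off-screen rather than
     display:none, and kept out of the keyboard tab order. A screen
     reader may still announce an empty link, so accessibility remains
     a real trade-off. -->
<a href="/some-random-page.html" tabindex="-1"
   style="position: absolute; top: -999em; left: -999em;"> </a>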

How about this: the trap page is an actual HTML page that has something like <script src="whitelist.js">, where whitelist.js is actually a server-side script that whitelists the client IP and THEN outputs some harmless JavaScript to the browser... given that bots generally don't load JavaScript and only real visitors do...
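A minimal PHP sketch of that whitelist idea, assuming the <script> tag actually points at a PHP script (say whitelist.js.php) and using a placeholder file path:

<?php
// whitelist.js.php: record the caller's IP as (probably) human,
// then return a harmless script so the browser receives valid JS.
// Bots that never fetch scripts will never end up in this file.
header('Content-Type: application/javascript');
file_put_contents('/var/lib/bottrap/whitelist.txt',
    $_SERVER['REMOTE_ADDR'] . "\n", FILE_APPEND | LOCK_EX);
echo '/* ok */';
?>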

Ahh this is harder than I thought! ~_~

Megaclinium
msg:4174114 - 3:38 pm on Jul 21, 2010 (gmt 0)

I just put two links at the end of my pages:

<a href="http://www.mywebsite.com/imarobot.html"> </a><br>

<a href="http://www.mywebsite.com/badrobot.html"> </a><br>

You can see that what carries the hyperlink is a blank space, so I don't think it shows up anywhere, and it has no underline (blanks aren't underlined).

I have two for the following reason:
the first simply tells me that they ARE a robot, if the link gets followed.

The second is specifically banned in your robots.txt file, so any bot that followed it blindly would be badly behaved: either dumb (not smart enough to follow the robots.txt rules) or illiterate (never even read robots.txt).

In case a person looks at what the bot is scraping, name them something other than the above examples.

Maybe passwords.txt or topsecret.txt :)

enigma1
msg:4174207 - 5:54 pm on Jul 21, 2010 (gmt 0)

It's not about visibility.

Your HTML page has:
<a href="http://your.example.com/badrobot.html"> </a><br>

and my HTML has:
<img src="http://your.example.com/badrobot.html" style="display:none" />

And then everybody who comes to my site, human or spider, appears as a "bad bot" to you.

Requests can be forced from other sites, not only from inside your domain.

And bots can process some JavaScript, and can be forced to follow links derived from JavaScript.

A trap is good only while it is undetected. Once detected, its benefits can be reversed. And the primary method of detecting traps is via robots.txt, where most people like to put the don't-go links. So reading robots.txt can expose the badrobot.html link, and then something like an img tag lets an outsider get anyone they choose banned from your site.

blend27
msg:4174247 - 7:28 pm on Jul 21, 2010 (gmt 0)

A trap is good only while it is undetected


EXACTLY!

Dynamic bot-trap PAGE NAMES, based on the IP and a server-side cookie: ones that are not shown to the good bots (or that take no action, are set to noindex, and just send yourself an email: "Caught by such-and-such"), and that are NOT shown as disallowed in robots.txt.

Robots.txt says:
User-agent: *
Disallow: /

To everybody who is not on the white list (same as on this forum).

Hit robots.txt when I don't already know who you are (server farms; bad proxies, of which 'Project Honey Pot' has a mesmerizing collection and 'stopforumspam' has a pretty good list; IPs known from previous hack/scrape attempts), or hit a bot trap, and your next request gets n redirects (good luck spoofing your IP at that point), possibly to a dynamic page as well, or not; then it's a CAPTCHA, sorry ;).. But by this time, Dorothy, it's the other way around :)

And then there is DDoS; if that is the scenario, one has bigger problems, which are handled at the hosting account team level.

But we don't know what we don't know, right, or left.....

Blend27

enigma1
msg:4174561 - 8:23 am on Jul 22, 2010 (gmt 0)

Dynamic bot trap...

Has exactly the same false positives as a static one.

You cannot tell in the first place whether a request made to your server from an IP is actually known to the person operating that IP.

An iframe with JS, or Flash, may initiate requests completely transparently to the operator of the IP.

Yes, you can involve blacklists, CAPTCHAs, etc., but they can all generate false positives and make the user experience with your site miserable. Honeypots rely on some hidden HTML links being accessed, which goes back to my previous post: they can be triggered from outside.

IMO if you truly want to see a security improvement in all this, your best bet is to push the browser vendors to improve their s/w: to give users at least the option to completely switch off retrieval of third-party resources (like they already do with third-party cookies).

So if you visit example.com, anything that is pulled in has to come from example.com and its sub-domains, and nothing else.

And unlike what they may say to this, claims about impacting marketing tools, RSS, etc. are all false. If there is a need for third-party content, it can be requested by the server (step 1) and presented to the client (step 2), which is totally different from having the client make 100 connections to various servers left and right just because they visited a page.
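A toy PHP sketch of that two-step idea; the feed URL is a made-up example, and a real version would cache the fetched copy rather than refetch on every hit:

<?php
// Step 1: this server fetches the third-party resource itself
// (requires allow_url_fopen; cURL would work just as well).
$remote = 'http://feeds.example.net/news.rss'; // assumed URL
$body = @file_get_contents($remote);

// Step 2: present the content to the client from our own domain,
// so the client never opens a connection to the third party.
if ($body !== false) {
    header('Content-Type: application/rss+xml');
    echo $body;
} else {
    header('HTTP/1.1 502 Bad Gateway');
}
?>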

Sunnz
msg:4174681 - 1:58 pm on Jul 22, 2010 (gmt 0)

IMO if you truly want to see a security improvement in all this, your best bet is to push the browser vendors to improve their s/w: to give users at least the option to completely switch off retrieval of third-party resources (like they already do with third-party cookies).

The RequestPolicy extension for Firefox does exactly this: it blocks all third-party requests by default and lets you choose which destinations each domain is allowed to make requests to. A bit annoying with the current 'web 2.0' style of web sites; you either get used to it or you don't.

Anyway.

Instead of listing your bot-trap page itself in robots.txt, how about listing only the directory that contains it?

For example, say I am actually well organised and have directories for JavaScript and CSS. They contain no actual content, so I list them in robots.txt:

User-agent: *
Disallow: /javascript
Disallow: /css

and put the bot trap under one of them.

Then at least the bot trap is not listed directly in robots.txt.

The only problem then is that I still have to link to that page from another page, which any malicious person would have access to; if we let the bots see it, then anyone can see it (in the HTML source).

enigma1
msg:4174751 - 3:37 pm on Jul 22, 2010 (gmt 0)

The RequestPolicy extension for Firefox does exactly this

That's excellent, and it looks pretty new. And there shouldn't be a problem with Ajax, because the client should pull content from the server it is visiting, not from anywhere else. And of course webmasters would not have to worry so much about all these spam and scrape attacks, because the IP that comes in is the one that misbehaves. That leaves servers and compromised systems, I think, which are far easier to manage.

The question is how to convince users to use it (as it's not part of the browser - yet).

blend27
msg:4174818 - 5:00 pm on Jul 22, 2010 (gmt 0)

enigma1,

You are talking cyber terrorism here ;). In order for the Flash or iframe to be useful, it has to be present on a site that wants you to ban the user (a competitor, at minimum), thus someone really interested in your traffic...

-OR-

is trying to scrape your content via JS, iframe, or Flash (user-initiated requests), or with a piece of script that can handle redirects properly, hold session cookies, request images via CSS (background-image), render images that are called via JS, etc... Why not wrap that in a GUI and call it a new browser? By the time that is ready, we'd know about it here, on this forum.

And I am not sure most bot masters are that sophisticated, because there are always sites out there that will give you content and be happy that they got a hit (as per their analytics software)! Reminds me of a .gov webmaster who had "CONTENT IS KING" pinned to the wall of his cubicle.

If someone makes it their priority to scrape the sites I manage, I can only wish them lots of luck, a fancy sombrero on the head with a kippah underneath, and a big red or purple flag in the left hand, although it's much easier to Ctrl+A, Ctrl+C, Ctrl+V.

Oh, and I love me some false positives; gives me something to do on a Saturday morning.

IMO if you truly want to see a security improvement in all this, your best bet is to push the browser vendors to improve their s/w.


Agreed 100%

--------------------------------------
Sunnz,

User-agent: *
Disallow: /

is much more elegant; they don't know what they don't know..

enigma1
msg:4175291 - 11:55 am on Jul 23, 2010 (gmt 0)

In order for the Flash or iframe to be useful, it has to be present on a site that wants you to ban the user (a competitor, at minimum), thus someone really interested in your traffic...

Agreed. And it may happen between sites that compete for the same popular phrases, products, etc.; in other words, anything that generates revenue.

But you will have a very hard time proving to anyone that what server A transmits to its clients affects your business on server B like this.

It is logical to point fingers at s/w vendors, PC vendors, search engines, marketing companies, etc., all of whom use various "advertising" techniques to promote their services and products. Did you buy a PC recently? Because after the install procedure, half of the browser window is covered with "toolbars". The way it's going, in the coming years the viewport will be reduced to the size of a status bar and the whole browsing experience will be automatic, or driven by popups perhaps. And the average user doesn't really know which edit box to use for the web.

So right now, when I see a hack, scrape, spam, or any other type of evil attempt, I cannot be sure who is really behind it, or whether banning an IP accomplishes anything, especially for attempts that come from ISPs. Instead I'm concentrating on each request.

Sunnz
msg:4175368 - 2:51 pm on Jul 23, 2010 (gmt 0)

--------------------------------------
Sunnz,

User-agent: *
Disallow: /

is much more elegant; they don't know what they don't know..


Depends on whether you want good bots to traverse your site(s).

Sunnz
msg:4175370 - 2:56 pm on Jul 23, 2010 (gmt 0)

Just another thought. If I put a bot trap on my site at:

mysite.org/css/some-random.css

and on a malicious site, badsites.org, they have:

<img src="http://mysite.org/css/some-random.css" />

wouldn't visitors on their site have a referer of badsites.org?
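That referer difference could, in principle, be used to blunt the forced-request attack. A hypothetical sketch of a trap that only bans on referer-less or on-site hits (a heuristic only, since referers can be spoofed or stripped entirely; the paths are placeholders):

<?php
// Hypothetical trap refinement: only blacklist when the Referer is
// empty or points at our own site. Off-site <img> forcing sends the
// attacker's page as the Referer, so forced humans are spared.
$ref = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
if ($ref === '' || stripos($ref, 'http://mysite.org/') === 0) {
    file_put_contents('/var/lib/bottrap/blacklist.txt',
        $_SERVER['REMOTE_ADDR'] . "\n", FILE_APPEND | LOCK_EX);
}
header('HTTP/1.1 404 Not Found');
?>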

blend27
msg:4175406 - 4:31 pm on Jul 23, 2010 (gmt 0)

The theory behind the dynamic robots.txt is:

IF NOT GOODBOT

User-agent: *
Disallow: /

ENDIF
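A minimal PHP sketch of that theory, assuming /robots.txt is rewritten to this script; the user-agent test is deliberately naive, and a real whitelist should verify the IP as well (e.g. via reverse DNS), as covered in the thread linked below:

<?php
// Serve different robots.txt content depending on who is asking.
header('Content-Type: text/plain');

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$good_bots = array('Googlebot', 'Slurp', 'bingbot'); // example white list

$is_good = false;
foreach ($good_bots as $bot) {
    if (stripos($ua, $bot) !== false) {
        $is_good = true;
        break;
    }
}

if ($is_good) {
    // Normal rules for whitelisted crawlers (paths are placeholders).
    echo "User-agent: *\nDisallow: /cgi-bin/\n";
} else {
    // Everyone else is told the whole site is off limits.
    echo "User-agent: *\nDisallow: /\n";
}
?>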

There is an old thread here on how to figure out which bots are good: [webmasterworld.com...]

You can see it in action here: [webmasterworld.com...]

Hope this helps.

Sunnz
msg:4176057 - 6:49 am on Jul 25, 2010 (gmt 0)

Oh, so that's what it means to have a dynamic robots.txt. Thanks a lot, blend27!

Megaclinium
msg:4176505 - 10:39 am on Jul 26, 2010 (gmt 0)

Hmm, enigma, I'm not sure you understood.

The link to imarobot.html is NOT in robots.txt.
A regular person will not follow this link, as they can't see it; it is just an indicator that a robot, good or bad, has followed the link if it shows up in the raw logs.

Robots aren't checking whether a link is visible, so if they are scraping they may follow links.

The link to badrobot.html IS in robots.txt,
so to follow this link it 1) has to be a robot and 2) must not have read robots.txt, or must have ignored it.

I understand that embedded images are always pulled, and I purposely avoid embedded images for the reason you mention. Plain links are not followed automatically.

The simplicity of this means it can be used on any page. It only does these two simple things, but they are useful, and due to its simplicity it can probably be put in any part of most types of page layout.

enigma1
msg:4176585 - 1:50 pm on Jul 26, 2010 (gmt 0)

The link to badrobot.html IS in robots.txt

Yes, and that disallowed link indicates a possible trap. So the rogue bot reads it and now knows the location of a possible trap. It then updates a database about it on an external server.

so to follow this link it 1) has to be a robot and 2) must not have read robots.txt, or must have ignored it

Then the external server uses it as the target of an image link, a JS file link, a link to some resource, etc. So no, nothing needs to follow anything; the URL can be accessed directly by anyone, human, spider, etc.

Any human who accesses pages on the external server may download this "resource" and will trigger your trap without even knowing what they downloaded. It could even be presented as the CAPTCHA image your comments box displays, for the external server's client to "solve".

As for spiders, the external server emits redirect headers directly, and the spider will follow, regardless of what your robots.txt lists as a restricted link.

Your robots.txt disallow line:
Disallow: /some-random-page.html

The external server emits headers upon a request from a spider (code example in PHP):

<?php
// Redirect the caller into the target site's trap, with a query
// string appended so the URL no longer matches the robots.txt entry.
$trigger_trap_link = 'http://www.example.com/some-random-page.html?slightly_adjusted=to_get_through';

header('HTTP/1.1 301 Moved Permanently');
header('Location: ' . $trigger_trap_link);
exit();
?>

So even if the spider checks robots.txt before it accesses the link, it isn't going to find it, because the server can add one or more parameters to make the link different. The result is that humans and spiders may trigger your trap even though none of them has ever visited your site.

enigma1
msg:4176599 - 2:16 pm on Jul 26, 2010 (gmt 0)

Oh, so that's what it means to have a dynamic robots.txt

Yes, but even that can be exposed in various ways. Here is a phrase from this forum's robots.txt as served to spiders:

...experiments with writing a weblog in a text file usually read only by robots...

So how do I know about it, along with others I guess, if that content is dynamic and displays only to some spiders?
