Spider Trap page

Is this a risk?

         

biggles

1:00 am on Mar 26, 2003 (gmt 0)




I'm trying to understand why a client's website has a hidden link to a page called spider_trap.html. Opening this link generates the following message:

403 Forbidden - Spider Trap
You don't have permission to access /spider_trap.html on this server. Furthermore, you have fallen into our spider trap and all access from your IP address is now blocked.

If you believe this is in error, please email the webmaster at XXXXX.com describing how you might have fallen into a trap designed to capture web spiders and robots.

After visiting the above page I can still access the rest of the site, but not the robots.txt file. When I request robots.txt, this message is generated:

Your attempt to access /robots.txt on this server has been forbidden by our spider trap. This means that at some time your IP address xxx.xx.xx.xxx has broken the rules of the Robot Exclusion Protocol, or accessed the spider trap page.

Accessing the robots.txt file another way shows it to be as follows:

User-agent: *
Disallow: /spider_trap.html

From this I've concluded that the spider_trap page is to nail spiders that ignore the robots.txt instruction to avoid the spider_trap.html page - i.e. bad bots that don't obey the robots.txt protocol get excluded.
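To illustrate the robots.txt side of this, here is a minimal Python sketch of how a well-behaved crawler would read the two-line robots.txt quoted above and decide to skip the trap page. It uses the standard library's urllib.robotparser; the user-agent name "ExampleBot" and the helper may_fetch are made up for illustration, and real search engine crawlers use their own parsers.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content quoted in the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /spider_trap.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(path, user_agent="ExampleBot"):
    """Return True if robots.txt permits this (hypothetical) bot to fetch path."""
    return parser.can_fetch(user_agent, path)


print(may_fetch("/spider_trap.html"))  # False: a compliant bot never requests the trap
print(may_fetch("/index.html"))        # True: the rest of the site is open
```

A bot that honors this answer never touches /spider_trap.html, so only bots that ignore robots.txt (or people who click the hidden link) end up blocked.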

Cloaking does not appear to be in use when I view the page with Brett's spider sim. I'm not sure I'm correct in this, but I understand spider traps are a key part of cloaking, and I'm concerned that a page called spider_trap.html may be a red flag to search engines and lead them to think cloaking is being used.

Any guidance on this would be appreciated. Thanks

Receptional Andy

1:10 am on Mar 26, 2003 (gmt 0)



Two things spring to mind after reading this post:

>>hidden link

Might be enough to cause you trouble with many major search engines.

>>After visiting the above page I can still access the site apart from the robots.txt file

So bad bots can't access robots.txt. I didn't think they bothered anyway?

jdMorgan

1:27 am on Mar 26, 2003 (gmt 0)




biggles,

Yes, it sounds like the trap is - to use one of my favorite made-up words - "mis-implemented".

The spider trap should block access to all pages on the site except for robots.txt.
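As a rough sketch of that rule (not the actual script on the client's site, which isn't shown in this thread), the logic might look like the Python below. The class name SpiderTrap, the method status_for, and the in-memory set of blocked IPs are all assumptions for illustration; a real trap would typically live in server configuration or a CGI script and persist its block list.

```python
TRAP_PATH = "/spider_trap.html"
ROBOTS_PATH = "/robots.txt"


class SpiderTrap:
    """Hypothetical in-memory trap: blocks any IP that requests TRAP_PATH."""

    def __init__(self):
        self.blocked_ips = set()

    def status_for(self, ip, path):
        """Return the HTTP status the server would send for this request."""
        if path == ROBOTS_PATH:
            # robots.txt stays readable for everyone, blocked or not,
            # so any visitor can still see the rules.
            return 200
        if path == TRAP_PATH:
            # Requesting the disallowed trap page gets the IP blocked.
            self.blocked_ips.add(ip)
            return 403
        if ip in self.blocked_ips:
            # Previously trapped visitors are refused every other page.
            return 403
        return 200


trap = SpiderTrap()
```

With this ordering, a trapped IP gets 403 for every page except robots.txt, which is the reverse of the behavior described in the first post.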

I wouldn't worry about one hidden link on a page that leads to a spider trap file which has been disallowed in robots.txt, especially when the file is actually named "spider_trap". You're not going to get banned for this without a human review.

Cloaking is defined (by Google) as an attempt to mislead. This is not an attempt to mislead; it is an enforcement of robots.txt. Search engines will not follow that link because it is disallowed. Therefore, it cannot reasonably be interpreted as an attempt to get them to index something other than what a human would see. I wouldn't (and don't) worry about it... Had two kills in my trap just yesterday... :)

A search here on WebmasterWorld for "bad bot script" will turn up several spider trap threads, and may give you a good idea how to fix your client's site, including fixing the trap.

Jim