Forum Moderators: phranque


How 'bout a Spider Trap Tutorial...

Help us protect our work.


pendanticist

5:44 am on Jan 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think this issue needs to be addressed in a manner that will allow the novice to install and maintain a Spider Trap (ST) starting from the very beginning.

(Another poster has graciously provided me with what I'll call the 'inner workings' of the trap itself, which I suspect is somewhat situation-specific. In other words, I don't think it's my place to share that part unless they choose to post the particulars themselves.)

If I understand the mechanics of STs correctly, they can be very tricky and must be approached with extreme caution when implemented on the server side.

Having said all that, exactly what does a person have to do?

Please start from the very beginning...

Thank You.

Pendanticist.

Dreamquick

1:42 pm on Jan 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



To be honest, what you want to achieve with the spider trap will affect how you design and implement it...

A simple spider tracker might require very little work; somewhere in the middle you have the smart anti-spam scripts, and at the top end you have automated systems designed to feed things like site-cloaking scripts.

Essentially, the more you rely on the spider trap to do important things within the site, the more work goes into it. You also eventually reach a point where static data isn't good enough, so you end up making the script self-maintaining.

A quick Google search turned up this link: www.spiderhunter.com/spidertrap/. After a little browse it does seem to match my thinking on the subject, although it gives you a rough idea rather than being a complete guide.

If you'd like to explain a little more about what you were looking to do with your spider trap, I'll try to give slightly less vague advice :)

Vague route
If you want to start simply then you'll need to put together a set of base data, either by running through your logs or by finding a site which lists search engine user-agents and their IP addresses.

Once you have this data you'll need to get it into some sort of searchable format. My preference is a SQL database, as you can manipulate the data in any way you choose, which makes querying it a lot easier.
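To make the "searchable format" idea concrete, here's a minimal sketch using Python's built-in sqlite3 module. The table layout, column names, and sample rows are my own illustrative assumptions, not a definitive schema; real entries would come from your logs or a published spider list.

```python
import sqlite3

# Illustrative table of known spiders: name, known IP prefix,
# and a user-agent substring to match against.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE spiders (
           name       TEXT,
           ip_prefix  TEXT,
           user_agent TEXT
       )"""
)
conn.executemany(
    "INSERT INTO spiders VALUES (?, ?, ?)",
    [
        # hypothetical entries using reserved documentation IP ranges
        ("ExampleBot", "192.0.2.", "ExampleBot/1.0"),
        ("SampleCrawler", "198.51.100.", "SampleCrawler"),
    ],
)
conn.commit()

# Querying is then straightforward, e.g. look up a visitor by IP prefix:
row = conn.execute(
    "SELECT name FROM spiders WHERE ? LIKE ip_prefix || '%'",
    ("192.0.2.44",),
).fetchone()
print(row[0])  # ExampleBot
```

An in-memory database is used here only so the sketch is self-contained; in practice you'd keep the spider table on disk and refresh it as part of the self-maintenance mentioned above.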

With your data in a searchable format, you'd then need to build an include which contains a simple IsSearchEngine()-type function, or perhaps a WhichSearchEngine()-type function.

All these functions need to do is compare the incoming request to the data you obtained earlier, primarily searching for IP / hostname and, if that fails, falling back to user-agent. If you get a match, you know with a fair degree of certainty that what made this request was spider/crawler "X", and you can set the return value of the function accordingly.
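A sketch of that WhichSearchEngine() idea in Python. The function name, the in-memory spider list, and all its entries are stand-ins I've made up for illustration; the check order follows the logic above (IP prefix first, user-agent only as a fallback, since anyone can forge a user-agent string).

```python
# Illustrative base data: (name, ip_prefix, user_agent_substring).
# Real rows would come from your logs or a published spider list.
KNOWN_SPIDERS = [
    ("ExampleBot", "192.0.2.", "ExampleBot"),
    ("SampleCrawler", "198.51.100.", "SampleCrawler"),
]

def which_search_engine(remote_ip, user_agent):
    """Return the matching spider's name, or None for an ordinary visitor.

    IP is checked first because it is harder to fake; the user-agent
    comparison is only a fallback.
    """
    for name, prefix, ua_substr in KNOWN_SPIDERS:
        if remote_ip.startswith(prefix):
            return name
    for name, prefix, ua_substr in KNOWN_SPIDERS:
        if ua_substr in user_agent:
            return name
    return None

print(which_search_engine("192.0.2.7", "Mozilla/4.0"))        # ExampleBot
print(which_search_engine("203.0.113.9", "SampleCrawler/2"))  # SampleCrawler
print(which_search_engine("203.0.113.9", "Mozilla/4.0"))      # None
```

A simple IsSearchEngine() variant would just be `which_search_engine(ip, ua) is not None`.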

Then you simply include this script in any page which needs to detect spiders/crawlers, and use the functions to build whatever logic you need.
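Wiring the detector into a page might look something like the sketch below. Here `which_search_engine()` is a stub standing in for the shared include, and the page handler and its return strings are hypothetical; the point is just that the page branches on the detector's result.

```python
def which_search_engine(remote_ip, user_agent):
    # Stub: the real logic lives in the shared include.
    return "ExampleBot" if remote_ip.startswith("192.0.2.") else None

def render_page(remote_ip, user_agent):
    """Hypothetical page handler that branches on spider detection."""
    spider = which_search_engine(remote_ip, user_agent)
    if spider:
        # A known spider hit this page: record the visit, serve the
        # plain version, feed a cloaking script, or whatever the trap
        # is meant to do.
        return f"plain page (visit by {spider} recorded)"
    return "full page for human visitors"

print(render_page("192.0.2.5", "ExampleBot/1.0"))
print(render_page("203.0.113.8", "Mozilla/4.0"))
```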

- Tony