Universal algo to identify a spider

         

itamar

11:38 pm on Oct 25, 2000 (gmt 0)



Hi guys

I wrote a small script to cloak my pages based on IP. It works great, BUT I would like to write an automatic script that will identify the majority of crawlers, to make this job almost workless :)

and I need some help

I already have a small script that catches, on the fly, any IP that asks for robots.txt before it crawls my site, but I feel it is not sufficient and it can be worked around by "cheaters".
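
For illustration, here is a minimal sketch of the robots.txt catch run against a log file rather than on the fly. It is not the script described above: it assumes the server writes Common Log Format lines, and the name `robots_requesters` is made up for the example.

```python
import re

# Assumption: Common Log Format, e.g.
# 10.0.0.1 - - [25/Oct/2000:23:38:00 +0000] "GET /robots.txt HTTP/1.0" 200 120
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD) (\S+)')


def robots_requesters(log_lines):
    """Return the set of client IPs that requested /robots.txt."""
    ips = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        ip, path = match.groups()
        # Ignore any query string when comparing the path
        if path.split("?", 1)[0].lower() == "/robots.txt":
            ips.add(ip)
    return ips
```

As the post says, this alone is easy to trigger by hand, so the result is better treated as a list of candidates than as a list of spiders.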

I would love to get any input on what other universal parameters I can get this script to verify.

thanks
Itamar

Air

1:18 am on Oct 26, 2000 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Welcome to WmW itamar.

Well let's see if we can get some of these parameters uncovered. I'll start with one - we know that all of the majors arrive without a referer. Adding a check for referer in addition to robots.txt would be a start.
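
A rough sketch of combining the two checks, assuming the request headers are available as a plain dict and that a set of IPs that fetched robots.txt is already being kept somewhere. The function name and both parameters are hypothetical.

```python
def looks_like_crawler(ip, headers, robots_ips):
    """True when the request has no referer AND the IP has fetched robots.txt."""
    # Header names are case-insensitive, so normalise before looking up
    normalised = {name.lower(): value for name, value in headers.items()}
    has_referer = bool(normalised.get("referer", "").strip())
    return (not has_referer) and ip in robots_ips
```

A browser request typed straight into the address bar also arrives without a referer, which is why the robots.txt condition is kept alongside it here.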

littleman

1:41 am on Oct 26, 2000 (gmt 0)



Referer is the key to cutting out some noise. That will be a pretty good filter to remove most of the surfers. So what is left over is mostly bots - some you will care about and some you won't. You can't really rely on user agent exclusively, but you could use it to help categorize. You could screen the IP against your list and known class Cs of bots. You could also do an NS lookup or even an ARIN lookup when the NS fails.
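
The list and class C screening could look something like the sketch below. It treats a "class C" as the /24 network the address falls in, and the function names and return labels are invented for the example. Anything that comes back as "unknown" would be the candidate for the NS or ARIN lookup, which is not shown.

```python
import ipaddress


def class_c(ip):
    """Return the /24 network that contains the given IPv4 address."""
    return ipaddress.ip_network(f"{ip}/24", strict=False)


def screen_ip(ip, known_ips, known_networks):
    """Cheap checks first: exact IP match, then class C match."""
    if ip in known_ips:
        return "known_ip"
    if class_c(ip) in known_networks:
        return "known_class_c"
    return "unknown"  # worth a slower lookup
```

The addresses below are documentation ranges, not real crawler addresses:

```python
known_ips = {"192.0.2.10"}
known_networks = {ipaddress.ip_network("192.0.2.0/24")}
print(screen_ip("192.0.2.77", known_ips, known_networks))
```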

drbill

2:40 pm on Oct 26, 2000 (gmt 0)

10+ Year Member



Hi Guys,

I have a script that does this, but it breaks down because you get the odd :) SEO who will check pages by cutting and pasting the URL into the browser, and then he has access to the code of the page. I shut that bad boy down as soon as I got burnt :(

User agent is easy to spoof, and the missing referer is not a safe test, for the reason above.

I am still working on something to do this right. I would like to hear any more ideas if you have them...

Brett_Tabke

12:52 pm on Oct 27, 2000 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Robots, then referrers, and agent. If you know how to resolve IPs on the fly, you can do a 'short' check for hostnames of the major SEs.
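
One way the hostname check might be sketched is below. The suffixes are placeholders, not real search engine hostnames, and the forward lookup that confirms the hostname maps back to the same IP is an extra step not mentioned in this thread. The resolver functions are passed in as parameters so the logic can be tried without touching the network.

```python
import socket

# Hypothetical suffixes for illustration only; fill in from your own logs
SE_HOST_SUFFIXES = (".crawler.example.com", ".spider.example.net")


def is_se_hostname(hostname, suffixes=SE_HOST_SUFFIXES):
    """True when the hostname ends with one of the listed suffixes."""
    host = hostname.lower().rstrip(".")
    return any(host.endswith(suffix) for suffix in suffixes)


def verified_se_ip(ip, reverse=socket.gethostbyaddr,
                   forward=socket.gethostbyname_ex,
                   suffixes=SE_HOST_SUFFIXES):
    """Reverse-resolve the IP, check the suffix, then confirm with a forward lookup."""
    try:
        hostname = reverse(ip)[0]
        if not is_se_hostname(hostname, suffixes):
            return False
        # Assumption: a genuine crawler host resolves back to the same IP
        return ip in forward(hostname)[2]
    except (socket.herror, socket.gaierror):
        return False  # no DNS answer, so nothing is confirmed
```

DNS lookups are slow compared with the other checks, so on a live page it probably makes sense to cache the result per IP.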

itamar

4:30 pm on Oct 27, 2000 (gmt 0)



Hi Again

So far I have written 2 scripts: one that runs on the logs and one that runs on the fly when a page is being served.

Both scripts do the same:
1. Check access to robots.txt
2. Get host name from IP
3. Check for known agents
4. Check for known class c
5. Check for known host

The way the scripts work is that the more info I have in the MySQL db, the more accurately they work. The scripts spit out 2 lists: known spiders and suspected spiders.
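
To make the two-list idea concrete, here is a sketch of one possible way to combine the checks. The rule used (a class C or host match means "known"; only a robots.txt hit or an agent match means "suspected") is a guess, not the rule the scripts actually use, and a plain dict stands in for the MySQL tables.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Results of the checks for one IP (field names are hypothetical)."""
    fetched_robots: bool = False
    known_agent: bool = False
    known_class_c: bool = False
    known_host: bool = False


def classify(evidence):
    # Assumed rule: network or hostname matches are harder to fake
    if evidence.known_class_c or evidence.known_host:
        return "known"
    # robots.txt hits and agent strings are easy to fake, so only suspect
    if evidence.fetched_robots or evidence.known_agent:
        return "suspected"
    return "visitor"


def split_lists(evidence_by_ip):
    """Return (known, suspected) lists of IPs."""
    known, suspected = [], []
    for ip, evidence in sorted(evidence_by_ip.items()):
        label = classify(evidence)
        if label == "known":
            known.append(ip)
        elif label == "suspected":
            suspected.append(ip)
    return known, suspected
```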

So far I have gathered, from my logs and on the fly, 2804 IPs that are SEs. A portion of them are just IPs that I suspect are SEs, but I am not sure.

I would like to share the list with someone who has the time and could help me clean it up.

Itamar