I wrote a small script to cloak my pages based on IP. It works great, BUT I would like to write an automatic script that will identify the majority of crawlers, to make this job almost workless :)
and I need some help.
I already have a small script that catches, on the fly, any IP that asks for robots.txt before it crawls my site, but I feel it is not sufficient and it can be gamed by "cheaters".
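For anyone following along, here is a minimal sketch of that robots.txt catch, run over log lines rather than on the fly. It assumes the server writes Common Log Format; the regex, the function name `robots_txt_requesters`, and the sample lines are made up for illustration and are not the script described in this post.

```python
import re

# Start of a Common Log Format line: client IP, two fields, [timestamp], "METHOD path ...
# (assumed format; adjust the pattern if your server logs differently)
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD) (\S+)')

def robots_txt_requesters(log_lines):
    """Return the set of client IPs that requested /robots.txt."""
    ips = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        ip, path = match.groups()
        # Ignore any query string when comparing the path.
        if path.split("?", 1)[0] == "/robots.txt":
            ips.add(ip)
    return ips

# Made-up sample lines using documentation-only IP addresses.
sample = [
    '192.0.2.10 - - [01/Jan/2003:10:00:00 +0000] "GET /robots.txt HTTP/1.0" 200 120',
    '192.0.2.55 - - [01/Jan/2003:10:00:05 +0000] "GET /index.html HTTP/1.0" 200 4096',
]
print(robots_txt_requesters(sample))  # {'192.0.2.10'}
```

As the post says, this alone is weak evidence: anyone can request robots.txt by hand, so it is probably best treated as one signal among several.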
I would love to get any input on what more universal parameters I could have this script verify.
thanks
Itamar
I have a script that does this, but it breaks down because you get the odd :) SEO who checks pages by cutting and pasting the URL into the browser, and then he has access to the code of the page. I shut that bad boy down as soon as I got burnt :(
User Agent is easy to spoof, and the lack of a referrer is not a safe test either, for the reason above: a URL pasted into a browser arrives without one too.
I am still working on a way to do this right. I would like to hear any more ideas if you have them...
So far I have written 2 scripts: one that runs on the logs and one that runs on the fly when a page is being served.
Both scripts do the same:
1. Check access to robots.txt
2. Get host name from IP
3. Check for known agents
4. Check for known class C
5. Check for known host
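The five checks above could be sketched roughly like this. It is only an illustration under stated assumptions: the reference data would come from the database rather than from constants, the names `spider_signals`, `KNOWN_AGENTS`, `KNOWN_CLASS_C` and `KNOWN_HOST_SUFFIXES` are hypothetical, and the class C handling assumes dotted IPv4 addresses. The resolver is passed in as a parameter so the logic can be tried without live DNS.

```python
import socket

# Hypothetical reference data; illustrative values only.
KNOWN_AGENTS = ("googlebot", "slurp")             # lower-case substrings of User-Agent
KNOWN_CLASS_C = {"192.0.2"}                       # first three octets of an IPv4 address
KNOWN_HOST_SUFFIXES = (".crawler.example.com",)   # made-up host name suffix

def reverse_dns(ip):
    """Step 2: host name for an IP, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None

def spider_signals(ip, user_agent, robots_ips, resolve=reverse_dns):
    """Run the five checks for one request and report what matched."""
    host = resolve(ip)                   # 2. get host name from IP
    class_c = ip.rsplit(".", 1)[0]       # drop the last octet (IPv4 only)
    agent = user_agent.lower()
    return {
        "robots_txt": ip in robots_ips,                              # 1
        "host": host,                                                # 2
        "known_agent": any(a in agent for a in KNOWN_AGENTS),        # 3
        "known_class_c": class_c in KNOWN_CLASS_C,                   # 4
        "known_host": bool(host) and host.lower().endswith(KNOWN_HOST_SUFFIXES),  # 5
    }

# Usage with a fake resolver, so no real DNS query is made.
signals = spider_signals(
    "192.0.2.10",
    "ExampleBot/1.0",
    robots_ips={"192.0.2.10"},
    resolve=lambda ip: "c1.crawler.example.com",
)
print(signals)
```

Since the User Agent is easy to spoof, the agent check is probably the weakest of the five, and the class C and host checks are only as good as the data behind them.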
The way the scripts work, the more info I have in the MySQL db, the more accurately they work. The scripts spit out 2 lists: known spiders and suspected spiders.
So far I have gathered, from my logs and on the fly, 2804 IPs that are SEs; a portion of them are just IPs that I suspect are SEs but am not sure about.
I would like to share the list with someone who has the time and could help me clean it.
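One way the two lists might be produced is sketched below. The post does not say how the split is made, so the rule here is purely an assumption: an IP counts as "known" once enough checks match and as "suspected" if at least one does. The function name `split_spiders` and the threshold are hypothetical.

```python
def split_spiders(evidence, known_threshold=3):
    """Split IPs into (known, suspected) lists.

    evidence maps IP -> number of checks that matched for it.
    known_threshold is an assumed cut-off, not taken from the post.
    """
    known, suspected = [], []
    for ip, hits in sorted(evidence.items()):
        if hits >= known_threshold:
            known.append(ip)
        elif hits >= 1:
            suspected.append(ip)
        # IPs with no matching checks go on neither list
    return known, suspected

# Made-up counts for three documentation-only IP addresses.
evidence = {"192.0.2.10": 4, "192.0.2.77": 1, "198.51.100.5": 0}
known, suspected = split_spiders(evidence)
print(known)      # ['192.0.2.10']
print(suspected)  # ['192.0.2.77']
```

The suspected list is then the part that needs cleaning by hand before its entries are promoted to known.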
Itamar