-- Yahoo Search Engine and Directory
---- Strange 404s from Yahoo Slurp
caribguy - 5:34 pm on Jun 25, 2010 (gmt 0)
In a perfect world:
"It should crawl only the pages it knows about and nothing else."
But in reality, anything (Yahoo probably being among the lesser evils) can and will "sniff around." Unless you can personally monitor and vet every single access in real time (impossible), I would suggest a more pragmatic and scalable approach. I'd also make it more fine-grained than "block all of Yahoo" and include multiple lines of defense.
Here are some examples:
Like dstiles, I have a spider trap on my homepages - but the (nofollow) linked file is also disallowed in robots.txt. Well-behaved bots avoid the link. The system logs anything that hits the trap and notifies me, and I follow up on notifications manually. Sometimes a trap visit is indicative of a larger problem.
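The trap idea above can be sketched roughly as follows. The trap path, the robots.txt line, and the handler names are all hypothetical placeholders, not the poster's actual setup:

```python
# Hypothetical spider trap: /trap/ is linked from the homepage with
# rel="nofollow" AND disallowed in robots.txt ("Disallow: /trap/"),
# so no well-behaved bot should ever request anything under it.
TRAP_PREFIX = "/trap/"  # assumed trap location

def is_trap_hit(path: str) -> bool:
    """Return True when a request lands on the disallowed trap URL."""
    return path.startswith(TRAP_PREFIX)

def handle_request(path: str, remote_ip: str, log: list) -> None:
    # Log the offender for manual follow-up (notification, not auto-ban),
    # matching the "I manually follow up" approach described above.
    if is_trap_hit(path):
        log.append(f"TRAP {remote_ip} {path}")
```

In practice the notification step would email or page the admin; a plain log list keeps the sketch self-contained.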
Generally speaking, there are no files with .php, .exe, .dll, .htm, or .txt extensions on my sites. Since I consider this type of probe a more direct form of aggression, the follow-up is more direct as well.
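A minimal sketch of that extension check, assuming the same extension list as above plus an exception for robots.txt (which must stay reachable even on a site with no other .txt files):

```python
# File extensions that (by assumption) never exist on these sites, so any
# request for them is a deliberate probe rather than a lost visitor.
PROBE_EXTENSIONS = {".php", ".exe", ".dll", ".htm", ".txt"}
ALLOWED_EXACT = {"/robots.txt"}  # assumed exception: keep robots.txt reachable

def is_probe(path: str) -> bool:
    """Flag requests for file types the site is known not to serve."""
    if path in ALLOWED_EXACT:
        return False
    dot = path.rfind(".")
    return dot != -1 and path[dot:].lower() in PROBE_EXTENSIONS
```

Note that .htm is flagged while .html passes, mirroring the list in the post.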
Certain sites are meant for specific markets. A variety of IP ranges (sometimes whole countries and backbones) have no legitimate use for the content; these are blocked by default.
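Default-blocking whole ranges can be sketched with the standard library's ipaddress module. The CIDR ranges here are documentation/TEST-NET blocks used purely for illustration, not actual ranges the poster blocks:

```python
import ipaddress

# Illustrative block list of CIDR ranges with no legitimate audience for
# a market-specific site (these are TEST-NET example ranges).
BLOCKED_NETS = [ipaddress.ip_network(c)
                for c in ("198.51.100.0/24", "203.0.113.0/24")]

def is_blocked(ip: str) -> bool:
    """True when the client address falls inside a default-blocked range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETS)
```

A real deployment would do this at the firewall or web-server level rather than in application code, but the membership test is the same idea.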
Malformed user agents indicate a problem on the visitor's end or worse (either a misguided enthusiast or a trojaned/zombie PC). Such clients are allowed to visit only a subset of 'known good' pages and perform only specific actions.
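One way to sketch that restriction: a loose sanity check on the UA string, and a safe-page subset for anything that fails it. The regex, the page list, and the function name are assumptions for illustration, not the poster's rules:

```python
import re

# Very loose sanity check: a plausible UA starts with a token/version
# shape like "Mozilla/5.0". Anything that fails this is treated as suspect.
UA_RE = re.compile(r"^[A-Za-z][\w.\-]*/[\w.]+")

# Hypothetical 'known good' subset a suspect client may still fetch.
SAFE_PAGES = {"/", "/index.html", "/contact.html"}

def allowed(path: str, user_agent: str) -> bool:
    """Well-formed UAs get normal access; malformed ones get the safe subset."""
    if UA_RE.match(user_agent or ""):
        return True            # plausible UA: normal rules apply
    return path in SAFE_PAGES  # malformed UA: restricted to safe pages
```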
Implementing a ruleset based on whitelisted usage patterns and visitor behavior makes it quite easy to sleep at night :) Obviously, the rules are dynamic and have to be re-evaluated from time to time, but the result is that I can spend my time creating content rather than chasing after every single incident...
Thread source: http://www.webmasterworld.com/yahoo_search/4152420.htm
Brought to you by WebmasterWorld: http://www.webmasterworld.com