I've been running across a few sites and should have been keeping a list of them. I checked each one for a robots.txt file and couldn't find anything blocking a search engine spider. Or I could be doing it wrong. I notice huge link-exchange directories with no PR while the home page has a PR of 5 or greater. That doesn't make sense unless you're blocking a search engine crawl. I'd like to know if there is another way of catching this type of activity. Maybe the directories are new, but most likely not.
An example, without listing the direct URL, would be <example.com/links2.html>. I'd check for a robots.txt file by just doing the basics: <example.com/robots.txt>
Any other ideas?
The below are blocking a large number of crawlers, but I can't see how they're blocking a crawl to their directory pages at
[edited by: agerhart at 6:24 pm (utc) on Feb. 17, 2004] [edit reason] please stop dropping URLs [/edit]
They could be using a meta tag in the page's head section such as '<meta name="robots" content="noindex">'
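If you've saved the source of one of those directory pages, a quick way to check for that tag is to scan the HTML for robots meta directives. This is only a rough sketch in Python using the standard library; the class and function names (RobotsMetaParser, robots_meta_directives) are made up for illustration, and it only looks at name="robots", not at engine-specific names.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None for bare attributes, so normalise them.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() == "robots":
            parts = attr.get("content", "").split(",")
            self.directives.extend(p.strip().lower() for p in parts if p.strip())


def robots_meta_directives(html):
    """Return a list of robots meta directives found in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

If the returned list contains "noindex" or "nofollow" on the directory pages but not on the home page, that would fit the pattern described above.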
Also, they could be serving a robots.txt specific to the user-agent. You could use a tool like wget (wget -U useragent http*//domain.tld/robots.txt) and try user-agent strings from your logfiles for various robots.
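The same idea can be scripted if you want to try several user-agent strings in a row. The following is a minimal sketch, assuming Python's standard urllib modules; fetch_robots_txt and is_blocked are hypothetical helper names, and the base URL and user-agent strings you pass in are whatever you pull from your own logs.

```python
import urllib.request
from urllib.robotparser import RobotFileParser


def fetch_robots_txt(base_url, user_agent, timeout=10):
    """Download robots.txt while presenting the given User-Agent header."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/robots.txt",
        headers={"User-Agent": user_agent},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def is_blocked(robots_txt, user_agent, path):
    """Return True if the rules in robots_txt disallow path for user_agent."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return not rules.can_fetch(user_agent, path)
```

Fetching the file once with a browser user-agent and once with a spider's user-agent, then comparing the two texts, should show whether the server hands out different rules. is_blocked can then be run on each copy against a directory path to see which robots are shut out.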
[edited by: pageoneresults at 12:51 pm (utc) on Feb. 18, 2004] [edit reason] Delinked Example [/edit]