Don't have a lot of time today to expound, but I just saw this interesting article about a spider ignoring the robots.txt file. This month's Webtechniques (Feb. 2001, page 18) talks about "House of Blues" filing suit against Streambox over "Streamlinking" issues. [webtechniques.com...]
House of Blues said that Streambox's spider ignored the robots.txt file. Streambox has ignored what House of Blues characterizes as a type of "No Trespassing" sign on its web server and used a piece of personal property without the owner's permission.
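For anyone wondering what "obeying robots.txt" looks like from the spider's side, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt contents, the "ExampleSpider" name, and the example.com URLs are all made up for illustration; the actual House of Blues file is not quoted in the article as far as this thread shows.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (an assumption, not the real House of Blues file).
ROBOTS_TXT = """\
User-agent: *
Disallow: /streams/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to request url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A polite spider would run this check before every request.
    for url in ("http://example.com/streams/live.ram", "http://example.com/index.html"):
        print(url, may_fetch(ROBOTS_TXT, "ExampleSpider", url))
```

Nothing in the protocol enforces this check. The file is only a request, which is why a spider can skip it and why the dispute ends up in court instead of being settled on the server.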
At the heart of these cases, someone has simply linked to a file without permission.
Very interesting, and I wasn't aware of the story. Like the eBay vs. Bidder's Edge suit before it, I can't help but think House of Blues is going to win walking away (with a lot of cash). The article is located at: [webtechniques.com]
Well, I am new at this and have been investigating robots, and I found it interesting that "some" robots are looking at disallowed areas anyway. I have heard some are looking, not indexing, but using the information for other, perhaps insidious, evaluations of your site.
In the last year, Alta, Ink, Google, and Fast have all crawled the entire web. They certainly aren't putting all that data online, and they for sure are not obeying robots.txt all the time. They send their spiders out in hunter-gatherer mode just to raid links and scarf up data. It is amazing what a wandering spider can run into sometimes. Mostly, they use the data to create link/web maps (e.g., data mining operations).
No, they didn't list all they crawled - that's part of the problem. Ink crawled the whole web from June to the end of July. Google has been doing it over the last three months (including dynamic pages). Fast did it in Sept., and Alta did it from Dec. 1999 to March 2000.
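If you want to check whether spiders are ignoring robots.txt on your own site, one rough approach is to compare your access log against your robots.txt. The sketch below assumes the log is in Combined Log Format; the robots.txt contents, the log lines, and the bot names are invented for illustration.

```python
import re
from collections import Counter
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and log lines (assumptions for illustration only).
ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"
LOG_LINES = [
    '10.0.0.1 - - [01/Feb/2001:10:00:00 +0000] "GET /private/a.html HTTP/1.0" 200 512 "-" "HypotheticalBot/1.0"',
    '10.0.0.2 - - [01/Feb/2001:10:00:05 +0000] "GET /index.html HTTP/1.0" 200 1024 "-" "PoliteBot/2.0"',
    '10.0.0.1 - - [01/Feb/2001:10:00:09 +0000] "GET /private/b.html HTTP/1.0" 200 300 "-" "HypotheticalBot/1.0"',
]

# Pulls the request path and the user-agent string out of a Combined Log Format line.
LOG_PATTERN = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def find_violators(robots_txt, log_lines):
    """Count requests per user agent that robots_txt would have disallowed."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    hits = Counter()
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        if not parser.can_fetch(match.group("agent"), match.group("path")):
            hits[match.group("agent")] += 1
    return hits


if __name__ == "__main__":
    for agent, count in find_violators(ROBOTS_TXT, LOG_LINES).items():
        print(agent, count)
```

Treat the output as a starting point only. User-agent strings are easy to fake, and ordinary browsers that follow a link into a disallowed directory will show up in the count too.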