|Spiders Ignoring Robots.txt|
Content Vs. Legal Link or Site Trespassing.
| 3:21 pm on Jan 17, 2001 (gmt 0)|
Don't have a lot of time today to expound, but I just saw this interesting article about a spider ignoring the robots.txt file. This month's Web Techniques (Feb. 2001, page 18) talks about "House of Blues" filing a suit against Streambox over "streamlinking" issues. [webtechniques.com...]
House of Blues said that Streambox's spider ignored the robots.txt file. Streambox has ignored what House of Blues characterizes as a type of "No Trespassing" sign on its web server and used a piece of personal property without the owner's permission.
At the heart of these cases, someone has simply linked to a file without permission.
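For anyone who has not looked at how a well-behaved spider is supposed to treat that "No Trespassing" sign, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt contents, the "ExampleSpider" user agent, and the example.com URLs are all made up for illustration; nothing here is taken from the House of Blues site. Note that the check is entirely voluntary: the file only states the owner's wishes, and a spider that skips this step is not technically blocked from anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (an assumption for illustration only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /streams/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A polite spider would run this check before requesting each URL.
    for url in (
        "http://example.com/streams/show.ram",
        "http://example.com/index.html",
    ):
        print(url, allowed(ROBOTS_TXT, "ExampleSpider", url))
```

A real crawler would normally fetch the live file with set_url() and read() instead of parsing a string; the string version is used here only so the sketch runs offline.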
| 12:43 pm on Jan 21, 2001 (gmt 0)|
Very interesting; I wasn't aware of the story. Like the eBay vs. Bidder's Edge suit before it, I can't help but think House of Blues is going to win walking away (with a lot of cash).
The article is located at: [webtechniques.com]
| 5:20 pm on Jan 21, 2001 (gmt 0)|
Well, I am new at this and have been investigating robots and found it interesting that "some" robots are looking anyway. I have heard some are looking, not indexing, but using the information for other, perhaps insidious, evaluations of your site.
| 5:31 pm on Jan 21, 2001 (gmt 0)|
In the last year, AltaVista (Alta), Inktomi (Ink), Google, and Fast have all crawled the entire web. They certainly aren't putting all that data online, and they for sure are not obeying robots.txt all the time. They send their spiders out in hunter-gatherer mode just to raid links and scarf up data. It is amazing what a wandering spider can run into sometimes. Mostly, they use the data to create link/web maps (e.g., data mining operations).
| 3:53 pm on Jan 22, 2001 (gmt 0)|
The big four have crawled all the web in the last year?
I doubt it. I don't know where I heard it, but what I heard was that the best of them indexed a mere 25-30% of all pages out there.
Of course, if there's any commercial interest in your site, you'll see to getting it listed yourself.
Just background info.
| 4:12 pm on Jan 22, 2001 (gmt 0)|
No, they didn't list all they crawled, and that's part of the problem. Ink crawled the whole web from June to the end of July. Google has been doing it over the last three months (including dynamic pages). Fast did it in September, and Alta did it from Dec. 1999 to March 2000.