
Sitemaps, Meta Data, and robots.txt Forum

    
Spiders Ignoring Robots.txt
Content Vs. Legal Link or Site Trespassing.
Garyh



 
Msg#: 131 posted 3:21 pm on Jan 17, 2001 (gmt 0)


Don't have a lot of time today to expound, but I just saw this interesting article about a spider ignoring the robots.txt file. This month's Webtechniques (Feb. 2001, page 18) talks about "House of Blues" filing a suit against Streambox over "streamlinking" issues. [webtechniques.com...]

House of Blues said that Streambox's spider ignored the robots.txt file. In its view, Streambox ignored what House of Blues characterizes as a type of "No Trespassing" sign on its web server and used a piece of personal property without the owner's permission.

At the heart of these cases, someone has simply linked to a file without permission.
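
For anyone wondering what "obeying" the robots.txt file actually involves on the spider's side, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the spider name `ExampleSpider`, and the example.com URLs are all made up for illustration; the real House of Blues file is not quoted in the article, and this is not a claim about how Streambox's spider works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /media/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A polite spider runs this check before every request and skips
    # anything the site owner has disallowed.
    for url in ("http://example.com/index.html",
                "http://example.com/media/clip.ram"):
        print(url, may_fetch(ROBOTS_TXT, "ExampleSpider", url))
```

Nothing in the protocol enforces this check. A spider that never runs it can still request the disallowed URL, which is why the "No Trespassing" sign comparison comes up.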

 

Brett_Tabke

WebmasterWorld Administrator, Top Contributor of All Time, 10+ Year Member



 
Msg#: 131 posted 12:43 pm on Jan 21, 2001 (gmt 0)

Very interesting, and I wasn't aware of the story. Like the eBay vs. Bidder's Edge suit before it, I can't help but think House of Blues is going to win walking away (with a lot of cash).
The article is located at: [webtechniques.com]

Garyh



 
Msg#: 131 posted 5:20 pm on Jan 21, 2001 (gmt 0)

Well, I am new at this and have been investigating robots and found it interesting that "some" robots are looking anyway. I have heard some are looking, not indexing, but using the information for other, perhaps insidious, evaluations of your site.

Brett_Tabke




 
Msg#: 131 posted 5:31 pm on Jan 21, 2001 (gmt 0)

In the last year, Alta, Ink, Google, and Fast have all four crawled the entire web. They certainly aren't putting all that data online, and they for sure are not obeying robots.txt all the time. They send their spiders out in hunter-gatherer mode just to raid links and scarf up data. It is amazing what a wandering spider can run into sometimes. Mostly, they use the data to create link/web maps (e.g., data mining operations).

skirril

10+ Year Member



 
Msg#: 131 posted 3:53 pm on Jan 22, 2001 (gmt 0)

The big four have crawled all the web in the last year?

I doubt it. I don't know where I heard it, but what I heard was that the best of them indexed a mere 25-30% of all pages out there.

Of course, if there's any commercial interest in your site, you'll see to getting it listed yourself.

Just background info.

Brett_Tabke




 
Msg#: 131 posted 4:12 pm on Jan 22, 2001 (gmt 0)

No, they didn't list all they crawled - that's part of the problem. Ink crawled the whole web from June to the end of July. Google has been doing it over the last three months (including dynamic pages). Fast did it in Sept, and Alta did it in Dec 99-March 00.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved