
Sitemaps, Meta Data, and robots.txt Forum

    
Robots.txt question
which engines IGNORE it?
charon




msg:1529236
 10:42 am on Aug 16, 2002 (gmt 0)

Hi All

This is probably an extremely simple question so forgive me if it sounds simplistic.

We're going to be using a robots.txt file for a new client, and in the course of my research I've seen that some engines ignore it. The simple question is: which ones, so we can advise the client properly?

Many thanks from sunny Scotland,
Charon

 

Dreamquick




msg:1529237
 11:45 am on Aug 16, 2002 (gmt 0)

I have trouble believing any reputable/large search engine would ignore robots.txt. If one did, its spider would sooner or later find itself physically blocked from the majority of managed websites.

It is not unreasonable to suggest that a search engine whose spider can't gain access to lots of websites becomes a less useful search engine. Once they are in this situation there are only so many options:

1) Fix your search engine to work with robots.txt
2) Leave the search engine business
3) Carry on and pretend that everything is fine

Obviously it's a lot easier to build a working spider (or at least learn from webmaster comments that describe where your spider is failing) than to fix your spider only after lots of sites have blocked it and your business is failing as a result.

People are generally very tolerant of most things SE spiders do, but not of ignoring robots.txt. It is a very clear-cut thing, as it protects both the website and the spider.

- Tony

idiotgirl




msg:1529238
 7:19 pm on Aug 17, 2002 (gmt 0)

There are several spiders that do not heed robots.txt. They either don't pull it at all or totally ignore it. However, most of these are email harvesters, downloading agents, spam bots, and leechware. They aren't legitimate bots/spiders that will help you in the real world, so adding a confirmed abuser to robots.txt is generally a waste.

As a rule, most bots with variations of "rip", "siphon", "harvest", "download", etc. in their names are going to disregard robots.txt and do as they please. It's probably more constructive to concentrate on the spiders you want to visit your site, and use robots.txt to instruct them which pages and directories to parse, than to try to ban rude bots through it.
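To make that concrete, here is a minimal sketch of a robots.txt aimed at well-behaved spiders, checked with Python's standard urllib.robotparser. The bot name "SomeOtherBot" and all of the paths are made up for illustration; they are assumptions, not rules from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the bot names and paths are illustrative only.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network fetch)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# Ask what a spider that honours the file would be allowed to fetch.
print(parser.can_fetch("Googlebot", "/index.html"))
print(parser.can_fetch("Googlebot", "/private/report.html"))
print(parser.can_fetch("SomeOtherBot", "/cgi-bin/search"))
```

Bear in mind this only tells you what a polite spider should do. A rude bot never asks the question, which is why listing it in robots.txt achieves little.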

If you look through the posts at WebmasterWorld you'll see lots of people reporting whether a bot disregarded robots.txt, and you can form your own conclusions about who/why/what to include or exclude in your robots.txt file. Also, many of the current legitimate spiders can be found at searchengineworld.
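If you would rather check your own traffic than rely on reports, something along these lines might help. It is only a rough sketch: the (user agent, path) pairs stand in for whatever you extract from your access log, and the bot names and paths are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules and requests, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

REQUESTS = [
    ("FriendlyBot/1.0", "/robots.txt"),
    ("FriendlyBot/1.0", "/index.html"),
    ("EmailSiphon", "/private/staff.html"),
    ("EmailSiphon", "/index.html"),
]

def find_violations(robots_txt, requests):
    """Return the requests that robots.txt disallows for that user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [(agent, path) for agent, path in requests
            if not parser.can_fetch(agent, path)]

def never_fetched_robots(requests):
    """Return user agents that never requested /robots.txt at all."""
    agents = {agent for agent, _ in requests}
    polite = {agent for agent, path in requests if path == "/robots.txt"}
    return sorted(agents - polite)

violations = find_violations(ROBOTS_TXT, REQUESTS)
skipped = never_fetched_robots(REQUESTS)
print(violations)
print(skipped)
```

A bot that shows up in either list is a candidate for blocking at the server level rather than through robots.txt.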

charon




msg:1529239
 9:14 am on Aug 19, 2002 (gmt 0)

Thanks all, this is useful stuff. I guess what I really need to know is: do any of the major engines ignore it? But I'll search around the other forums as recommended.

cheers
Charon

YoungstownWebMan




msg:1529240
 1:41 pm on Aug 19, 2002 (gmt 0)

As far as the major search engines go: at the latest Search Engine Strategies 2002 conference in San Jose, CA, all of them said they read robots.txt when that question was brought up.

Scott Emick
WebMaster, Programmer, Analyst
Youngstown, OH

charon




msg:1529241
 1:53 pm on Aug 19, 2002 (gmt 0)

That's music to my ears, thanks Scott, much appreciated!

Cheers
Charon

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved