Forum Moderators: goodroi


Robots exclusion revisited

How far have we come with controlling robots?

         

pixel_juice

11:50 pm on Jul 5, 2006 (gmt 0)

10+ Year Member



An oldie but well worth a read:

[kollar.com...]

10 years on, spiders of every kind are visiting as many sites as they can, as often as possible. Of course, site owners should be able to turn away unwanted spiders using the robots exclusion protocol (either a robots.txt file or a robots meta tag).
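For reference, the basic mechanisms are simple enough. A minimal robots.txt (purely illustrative) asking all robots to stay out of one directory:

  User-agent: *
  Disallow: /private/

Or, per page, a robots meta tag in the HTML head:

  <meta name="robots" content="noindex, nofollow">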

How do webmasters feel about the robots exclusion protocol? Has an adequate standard been established, and is the documentation sufficient?

To quote from the article, "[webmasters] will usually let you know if they think that [robots.txt] is being accessed too often :)".

Do today's webmasters feel that they have a way to respond to the spiders visiting their sites? How can the internet community ensure that undesired or uncontrolled spidering doesn't occur?

From a personal perspective, there seem to be shortcomings with robots exclusion, e.g. no support for more complex rules such as regular expressions. There are also potentially useful robot controls that are not supported by enough spiders, or are not well documented, e.g. crawl-delay.
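For example (as far as I know, support for both of these varies from engine to engine, so treat this as a sketch rather than gospel):

  # Crawl-delay: honoured by some crawlers (e.g. Slurp, msnbot), ignored by others
  User-agent: Slurp
  Crawl-delay: 10

  # Wildcard matching: a vendor extension (Googlebot understands * and $), not part of the original protocol
  User-agent: Googlebot
  Disallow: /*?sessionid=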

Most additions to the protocol seem to come from search engines themselves, rather than from the webmasters who provide the reason for robots to visit.

Is there anything else that would help webmasters control robot activity?

jbinbpt

12:27 am on Jul 6, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



As long as it's up to the spiders to obey the rule sets, we will be chasing them forever.
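The whole protocol only works if the spider bothers to check. Just as a sketch, a polite crawler written in Python with the standard urllib.robotparser module would do something like the following, and nothing forces a rogue bot to do any of it (the example.com URLs and the "MyBot/1.0" name are only placeholders):

  import urllib.robotparser

  # Fetch and parse the site's robots.txt before crawling anything
  rp = urllib.robotparser.RobotFileParser()
  rp.set_url("http://www.example.com/robots.txt")
  rp.read()

  # A well-behaved bot asks permission for each URL; a rogue one simply skips this step
  if rp.can_fetch("MyBot/1.0", "http://www.example.com/private/page.html"):
      print("allowed - crawl it")
  else:
      print("disallowed - stop here")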

I would like to see some controls available to us, but I currently don't have much faith they will be effective. There are no consequences.

pixel_juice

11:12 pm on Jul 7, 2006 (gmt 0)

10+ Year Member




I would like to see some controls available to us, but I currently don't have much faith they will be effective. There are no consequences.

I think that depends. Rogue spiders are a different type of problem, but major search engines are (to at least some extent) accountable.

It appears to me that search engines currently have carte blanche to download and store any file that is published to the public internet. The only option given to webmasters (OK, there are a few variations) is to say 'no'. I just think that by now we should have better control over (well-behaved) robots.

The issue is only likely to become more relevant now that publishing web pages is a widespread (and mainstream) activity.