It's time for a robots INCLUSION standard.

         

Brett_Tabke

9:14 am on Aug 2, 2001 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



With the proliferation of rogue abusive spiders on the internet, I think it is time to modify the Robots Exclusion recommendation. You can't call it a standard, since it was never adopted by the W3C.

What is needed is a standard that says, "unless file x is present, robots are not allowed to access the site."

The only question is how to do it.
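
A minimal sketch of what the crawler side of such an inclusion rule might look like, in Python. The file name robots-include.txt and the function names are hypothetical; no such standard exists. The fetch function is passed in as a parameter so the logic does not depend on any particular HTTP library. Note that this only helps with robots whose authors choose to run a check like this.

```python
from typing import Callable

# Hypothetical file name, chosen for illustration only.
INCLUSION_FILE = "/robots-include.txt"


def may_crawl(site: str, fetch_status: Callable[[str], int]) -> bool:
    """Default-deny: crawl only if the site explicitly opts in.

    fetch_status is any callable that takes a URL and returns the
    HTTP status code of the response.
    """
    url = site.rstrip("/") + INCLUSION_FILE
    try:
        status = fetch_status(url)
    except OSError:
        # Network failure: no proof of opt-in, so stay out.
        return False
    # Only a successful response counts as "file x is present".
    return status == 200
```

A well-behaved robot would call may_crawl() once per site before requesting any other page, and skip the site entirely on False.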

IanTurner

10:25 am on Aug 2, 2001 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Nice idea but!!!!

There is no way to enforce such a protocol, except by identifying the rogue spiders and blocking them by IP.

It would however be useful in putting a lot of junk sites into the backwaters of the web.
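
For the blocking-by-IP part, a rough sketch in Python of the check a server-side script could run on each request. The networks listed are reserved documentation ranges used as placeholders, not real spider addresses, and the names BLOCKED_NETWORKS and is_blocked are made up for this example.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), not real offenders.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.17/32"),
]


def is_blocked(remote_addr: str) -> bool:
    """Return True if the visitor's address falls in a blocked range."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in network for network in BLOCKED_NETWORKS)
```

The list still has to be maintained by hand as each rogue spider is identified, which is the enforcement cost described above.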

agerhart

12:32 pm on Aug 2, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Brett,

Do you think that this is possible?

MaliciousDan

5:16 pm on Aug 2, 2001 (gmt 0)

10+ Year Member



This could only be implemented by people who write spidering software, which means you'd be asking the same people who already ignore the rules of robots.txt and various other courtesy standards to include support for a new rule when they either intentionally ignored the original rules, or were incapable of following them. Your effort would be better spent trying to move all the water in the Pacific Ocean into the Atlantic Ocean.

agerhart

5:20 pm on Aug 2, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You have a point.....unfortunately

mivox

5:21 pm on Aug 2, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The problem would be the same as it is with the current robots exclusion protocol... "rogue" spiders would ignore it, wouldn't they?

Unless server software could somehow be configured to accurately recognize spider vs. human visitors, and block spiders by default. But that would fall more into the .htaccess area than the robots.txt area.
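
A rough sketch of that server-side idea in Python: block anything that looks like a spider unless it is on an allow list. The token lists and the function name should_serve are assumptions made for illustration. The User-Agent header is supplied by the client and can be forged, so this is a heuristic and does not "accurately recognize" a spider that pretends to be a browser.

```python
# Illustrative lists only; a real deployment would need its own.
ALLOWED_BOT_TOKENS = ("googlebot", "slurp")
BOT_HINTS = ("bot", "spider", "crawl")


def should_serve(user_agent: str) -> bool:
    """Serve humans and allow-listed spiders; block other spiders."""
    ua = user_agent.lower()
    if not ua:
        # Browsers normally send a User-Agent; treat a blank one as a robot.
        return False
    if any(token in ua for token in ALLOWED_BOT_TOKENS):
        return True
    # Anything else that announces itself as a robot is blocked by default.
    return not any(hint in ua for hint in BOT_HINTS)
```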

alexjc

6:26 pm on Aug 2, 2001 (gmt 0)

10+ Year Member



A robots inclusion standard within "robots.txt" would be convenient. But surely a bit of scripting would work just as well... although this would increase the server load, and may get you into trouble for cloaking!

But... however this is enforced, it will have important repercussions on the concept of search engines in general: I imagine only a small proportion of webmasters (if you can call them that) are aware of the 'robots.txt' file, so as a consequence, only a very small proportion of web pages would be indexable... which defeats the point of a "www" search engine, since only the enlightened people would get in the databases.

As for rogue spiders, they are 'rogue' simply because we're enforcing our exclusion / inclusion 'standards' loosely. 'Robots.txt' is a bit like a polite notice "Please don't let your dog pee on the lawn." And just like dogs can't really control their bladder, web-hackers / S.E. staff don't especially want to control their spiders.

What we need is an electrified fence with barbed wire around the lawn, and a single steel door with secure footprint identification ;) How about hard-coding the robots inclusion/exclusion standard within the web server? A few extra lines in the .htaccess would do the trick nicely. Come to think of it, isn't this already possible with Apache??

Alex

mivox

6:38 pm on Aug 2, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> dogs can't really control their bladder

Dogs control their bladders very well, if their owners take the time to properly housebreak them... and train them to stay off your lawn. ;) If dog=bot (only does what it's trained/programmed to do) then owner=programmer (often too irresponsible to train/program dogbot properly).

That's the problem... we're talking about putting a sign on the lawn saying "Only Fido and Rover are allowed to pee here" and then expecting all the other neighborhood dogbots to obey. Unless their owner/programmers see the sign, and decide to program their dogbots accordingly, it ain't gonna work...

So using your electric barbed wire .htaccess file fence is the only thing that will keep badly trained dogbots off your lawn. Which leaves us back exactly where we are now.

alexjc

7:38 pm on Aug 3, 2001 (gmt 0)

10+ Year Member



> Dogs control their bladders very well

"Hey, is that a lamppost? Ohhhh... Aha, nice tree!"

This dog metaphor can be taken even further ;) I think the inclusion standard won't work because dogs like to mark their territory, just like SEs like to index vast quantities of pages and brag about them (Wisenut, Google :) Putting a polite sign on the lawn will reduce their territory, so why respect it?

Besides, an inclusion standard is just a lazy alternative to what we have already (listing all exclusions :)... might as well stick to that! If you need anything more, dive into Apache.

Fair?
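
To make the "listing all exclusions" point concrete, here is a small sketch using Python's standard urllib.robotparser module. The rules below allow one named spider and exclude everyone else, which is about as close to an inclusion list as the existing convention gets. The choice of Googlebot and the example.com URL are just for illustration, and of course this only affects robots that read the file at all.

```python
from urllib.robotparser import RobotFileParser

# An "inclusion list" written with the existing exclusion syntax:
# an empty Disallow means nothing is off limits for that robot.
RULES = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

url = "http://example.com/page.html"
named_ok = parser.can_fetch("Googlebot", url)      # the listed robot
other_ok = parser.can_fetch("SomeOtherBot", url)   # everyone else

print("Googlebot allowed:", named_ok)
print("SomeOtherBot allowed:", other_ok)
```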