Forum Moderators: phranque

How to Block Automated Queries, Robots and Bots Except Selected Ones.

RandomDot

3:09 pm on Sep 14, 2007 (gmt 0)

10+ Year Member



Is it possible to block all automated queries, robots and bots from spidering a website except from those which I specify are allowed?

I am currently working on a very large project, and looking at statistics from older and much smaller projects, the bots took up 50-70% of the bandwidth and queries. Now, that wasn't an issue with those projects since I didn't pay for bandwidth. But with this project it will be a major issue in a year or two.

The only major robots I want to be able to spider my website are:
Google
Yahoo
MSN
Internet Archive

and then of course some selected smaller ones, but that's not so important. Is it possible to do this on a server-wide basis with .htaccess, robots.txt, or a combination of the two, or is there another way to do it?

Thank you in advance for all replies and suggestions...
Sincerely, and have fun,

Marshall

7:59 am on Sep 15, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Basically, you want a general disallow robots.txt file with some exceptions. It would read:

# Robot Exclusion File
User-agent: Mediapartners-Google*
Disallow:

User-agent: Googlebot
Disallow:

User-agent: Googlebot-Image
Disallow:

User-agent: MSNBot
Disallow:

User-agent: ia_archiver
Disallow:

User-agent: yahoo-mmcrawler
Disallow:

User-agent: *
Disallow: /

The last record disallows all bots, while the first six are allowed because nothing follows "Disallow:" (an empty value means nothing is off limits). IMHO, I do not understand why you would want Internet Archive. Personally, I would not. Remember, though, that in the end bots can simply ignore this file.
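One way to sanity-check a file like this before uploading it is to feed it to a parser. The following is only a minimal sketch using Python's standard-library urllib.robotparser. The ROBOTS_TXT string is a shortened copy of the records above, and the helper name allowed and the bot name SomeOtherBot are made up for illustration. Python's parser is just one reading of the format, so individual crawlers may interpret the file differently.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the robots.txt above (assumption: three of the
# allowed bots are enough to demonstrate the pattern).
ROBOTS_TXT = """\
# Robot Exclusion File
User-agent: Googlebot
Disallow:

User-agent: MSNBot
Disallow:

User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
"""


def allowed(user_agent, path="/"):
    """Return True if the sample rules let user_agent fetch path."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "SomeOtherBot" is a hypothetical crawler that is not on the list.
    for bot in ("Googlebot", "MSNBot", "ia_archiver", "SomeOtherBot"):
        print(bot, "allowed" if allowed(bot) else "blocked")
```

This only shows how a well-behaved crawler would read the rules; it does nothing to stop a bot that never requests robots.txt.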

Marshall

RandomDot

3:28 pm on Sep 15, 2007 (gmt 0)

10+ Year Member



Just wondering if this can also be used in the .htaccess file - to further disallow most bots from spidering my website?

I wish there were a program or a standard that could exclude all user-agent bots unless they were specifically allowed. It would make the fight against unauthorized ripoffs so much easier. I'll go suggest it to the open source community.

Thanks for the list anyway; it will be in the next update of my website.

Sincerely, and have fun,

jtara

7:24 pm on Sep 15, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This would not be easy to do. The problem is, bots can lie. Even if they don't lie, you would still need a constantly changing list of all the acceptable user-agent strings (or matching patterns) for regular browsers. Some legitimate users, unfortunately, are going to be caught in your net.
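To make both problems concrete, here is a rough sketch in Python of what a user-agent allowlist might look like. Everything in it is an assumption for illustration: the function name is_allowed, the pattern lists, and the sample user-agent strings are not taken from any real server setup, and a real list would need to be far longer.

```python
import re

# Hypothetical patterns for the crawlers the site owner wants to admit.
ALLOWED_BOT_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"googlebot", r"msnbot", r"ia_archiver", r"yahoo")
]

# Hypothetical patterns for "regular browsers". In practice this list
# would need constant maintenance.
BROWSER_PATTERNS = [
    re.compile(p)
    for p in (r"^Mozilla/\d", r"^Opera/\d")
]


def is_allowed(user_agent: str) -> bool:
    """Admit a request if its user-agent matches either allowlist."""
    if any(p.search(user_agent) for p in ALLOWED_BOT_PATTERNS):
        return True
    return any(p.search(user_agent) for p in BROWSER_PATTERNS)


if __name__ == "__main__":
    samples = [
        # A crawler that identifies itself honestly and is not on the list.
        "EvilScraper/1.0",
        # The same scraper after it starts lying about who it is.
        "Googlebot/2.1",
        # A text-mode browser whose string the browser patterns miss.
        "Lynx/2.8.5rel.1 libwww-FM/2.14",
    ]
    for ua in samples:
        print(ua, "->", "allowed" if is_allowed(ua) else "blocked")
```

The user-agent header is whatever the client chooses to send, so a check like this only filters out bots that identify themselves honestly, while a less common browser can be blocked by mistake.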