I am currently working on a very large project. Statistics from my older, much smaller projects show that bots accounted for 50-70% of the bandwidth and queries. That wasn't an issue for those projects because I didn't pay for bandwidth, but for this project it will become a major issue within a year or two.
The only major robots I want to be able to spider my website are:
Google
Yahoo
MSN
Internet Archive
plus a few selected smaller ones, though that's less important. Is it possible to do this on a server-wide basis with .htaccess, robots.txt, or a combination of the two, or is there some other way to do it?
Thank you in advance for all replies and suggestions...
Sincerely, and have fun,
# Robot Exclusion File

User-agent: Mediapartners-Google*
Disallow:

User-agent: Googlebot
Disallow:

User-agent: Googlebot-Image
Disallow:

User-agent: MSNBot
Disallow:

User-agent: ia_archiver
Disallow:

User-agent: yahoo-mmcrawler
Disallow:

User-agent: *
Disallow: /
The last record disallows all bots, while the first six are allowed because nothing (no forward slash) follows "Disallow:". IMHO, I do not understand why you would want Internet Archive; personally, I would not. Remember, though, that robots.txt is only a request: in the end, bots can ignore it.
Marshall
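One way to sanity-check rules like the ones above before deploying them is to run them through a parser. What follows is a minimal sketch using Python's standard-library urllib.robotparser. The ROBOTS_TXT string is a shortened copy of the rules, the is_allowed helper and the "SomeScraper" agent name are made up for illustration, and the check only shows how a well-behaved parser reads the file. It says nothing about bots that ignore robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the rules above (assumption: records separated by blank lines).
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: MSNBot
Disallow:

User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str = "/") -> bool:
    """Hypothetical helper: True if the rules let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Pass the crawler's product token, not a full browser-style header.
    for agent in ("Googlebot", "MSNBot", "ia_archiver", "SomeScraper"):
        print(agent, is_allowed(agent, "/index.html"))
```

An empty "Disallow:" is read as "nothing is disallowed", so the named crawlers are let through and everything else falls to the catch-all record.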
I wish there were a program or a standard that could exclude all user-agent bots unless they were specifically allowed... It would make the fight against unauthorized ripoffs so much easier. I'll go suggest it to the open source community.
Thanks for the list anyway; it will be in the next update of my website.
Sincerely, and have fun,
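Because robots.txt is only a request, an allow-list like the one wished for above would have to be enforced by the server itself. Here is a rough sketch of the decision logic only, assuming it is called with the User-Agent header from whatever request handler the site uses. ALLOWED_BOTS, BOT_HINT and should_block are hypothetical names, the pattern for spotting bots is a crude guess rather than any standard, and User-Agent headers can be forged, so this will not stop a determined scraper.

```python
import re

# Hypothetical allow-list: lowercase substrings of crawler User-Agent headers.
ALLOWED_BOTS = ("googlebot", "mediapartners-google", "msnbot", "ia_archiver", "yahoo")

# Crude heuristic for "this looks like a robot" (an assumption, not a standard).
BOT_HINT = re.compile(r"bot|crawl|spider|slurp|archiver|fetch", re.IGNORECASE)


def should_block(user_agent: str) -> bool:
    """Return True when the request looks like a bot that is not on the allow-list."""
    ua = user_agent.lower()
    if any(name in ua for name in ALLOWED_BOTS):
        return False
    # A missing or empty User-Agent header is treated as a bot here.
    return ua.strip() == "" or BOT_HINT.search(ua) is not None


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/120.0",
        "EvilScraperBot/1.0",
    ]
    for ua in samples:
        print(should_block(ua), ua)
```

A handler would answer with a 403 when should_block returns True. Ordinary browsers pass through because they match neither list.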