Banning spiders except for a few I want, via .htaccess

.htaccess search engine spider control


revrob

7:43 pm on Mar 16, 2010 (gmt 0)

10+ Year Member



With regard to spiders and rogue bots... and controlling their access to my charity web sites

Given that rogue spiders often ignore robots.txt, what methods can be used (e.g. .htaccess statements) to restrict access to the whole site to a very small number of recognised, large, legitimate search engines?

I'm not worried about commercial SEO requirements. My "search" needs are very limited: people searching on the obvious terms relating to the name of the site/charity, or on specific popular page titles, should find us, and that already works fine with our current keyword, title, and content strategy. Getting a listing in the Open Directory (dmoz.org) and sorting out keywords, content, and page title meta tags seems to satisfy most of my ranking requirements.

But I would like to save a bit of bandwidth and improve security with a robust, blunderbuss ban on all spiders except the ones I choose - probably Google, Yahoo Slurp, Bing, and one or two others.

I already run bot traps via .htaccess which result in some rogue bots collecting automatic IP address bans for themselves - but that does make for a very big .htaccess file as the list grows.
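For illustration, the trap just stacks up per-IP deny lines in .htaccess, something like this (the addresses are placeholders, not real offenders):

# automatically collected bot-trap bans - placeholder addresses only
Order Allow,Deny
Allow from all
Deny from 192.0.2.10
Deny from 198.51.100.0/24
Deny from 203.0.113.77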

Anyone got any suggestions for appropriate global banning .htaccess statements and also specific allow statements?

And a list of mainstream web search engines which it would be sensible to allow?

Many thanks to any who can help.

tangor

10:29 pm on Mar 16, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Run an expression match against a whitelist of ALLOWED UAs. If the user-agent matches, allow it; else reject.

revrob

7:41 am on Mar 17, 2010 (gmt 0)

10+ Year Member



Thank you, tangor. What would that look like, please? (Real beginner here, sorry.)
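A minimal sketch of what tangor describes, assuming Apache 2.2 with mod_setenvif available and a host that allows Order/Allow/Deny directives in .htaccess. The user-agent substrings below are illustrative only - check each engine's published crawler documentation for the current strings:

# Sketch only: matching is on UA substrings, which are illustrative
# here and need verifying against each engine's current bot docs.
SetEnvIfNoCase User-Agent "Googlebot" allowed_ua
SetEnvIfNoCase User-Agent "Slurp" allowed_ua
SetEnvIfNoCase User-Agent "msnbot" allowed_ua
SetEnvIfNoCase User-Agent "bingbot" allowed_ua
# Ordinary browsers must match too, or human visitors get locked out:
SetEnvIfNoCase User-Agent "Mozilla" allowed_ua
SetEnvIfNoCase User-Agent "Opera" allowed_ua

Order Deny,Allow
Deny from all
Allow from env=allowed_ua

One caveat: most browsers, and most bots rogue or not, send "Mozilla" in their user-agent, so a whitelist that admits browsers admits anything that claims to be one. A user-agent whitelist saves bandwidth from the honest crawlers; it is not a security boundary, so keep the bot traps running alongside it.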