Forum Moderators: phranque


Whitelisting good robots and agents via htaccess?


b8caster

4:46 pm on Nov 13, 2010 (gmt 0)

10+ Year Member



Hello,
I have just returned from PubCom and in one of the sessions it was mentioned the best way to block scrapers and bad bots was to whitelist the legit bots and user agents via htaccess. The speaker mentioned the code could be found here, but after numerous searches, I didn't find any recent posts that fit the bill.

So can anyone lend a hand here? Either point me to a current thread, or if you know of the htaccess code to use, please post it here.

Thanks in advance for your help!

wilderness

7:09 pm on Nov 13, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Some partial "examples" and/or "suggested methods" may be found; most have been presented in the SSID forum [webmasterworld.com].

However, in recent years the trend in that forum has been to avoid exposing whitelisting methods, because the harvesters monitor these same forums and simply modify their requests to get around the whitelisting.

Here are some very OLD and very BASIC examples:
2006 [webmasterworld.com]

march 2006 [webmasterworld.com]

These examples were only intended, at the time, to motivate people to learn the required skill set: understanding browsers and UAs, performing multi-faceted analysis of visitors, and implementing that analysis in htaccess.

They were NEVER intended to provide a copy and paste solution to webmasters.
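For orientation only, here is the general shape such a whitelist takes in .htaccess, sketched for Apache 2.2-style syntax with mod_setenvif enabled. The user-agent tokens below are illustrative assumptions, not a vetted list, and per the point above this is a teaching example rather than a copy-and-paste solution:

```apache
# Mark requests whose User-Agent contains a known-good token (mod_setenvif).
BrowserMatchNoCase Googlebot good_agent
BrowserMatchNoCase Slurp good_agent
BrowserMatchNoCase msnbot good_agent
# Ordinary browsers: anything claiming Mozilla (crude; scrapers forge this too).
BrowserMatchNoCase ^Mozilla good_agent

# Deny everything not marked, allow the rest (Apache 2.2 access control).
Order Deny,Allow
Deny from all
Allow from env=good_agent
```

As the posts in this thread stress, UA strings are trivially forged, so in practice a whitelist like this has to be combined with IP-range checks and ongoing log analysis.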

whitelisting [google.com]

"white-listing" [google.com]

b8caster

2:03 am on Nov 14, 2010 (gmt 0)

10+ Year Member



Thanks for the info, but yes, I need a copy-n-paste solution. It's not that I'm lazy. I simply don't have the time, knowledge, or skillset to figure it out myself.

tangor

2:17 am on Nov 14, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Shortcut: whitelist your allowed bots in robots.txt. Then examine raw logs to see which bots fail that simple test, and deny those in .htaccess. Read up in the Search Engine ID forum [webmasterworld.com...] to ID bot farms, bad players, etc.
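The robots.txt side of that shortcut might look like the following sketch, where only named crawlers are allowed and everyone else is disallowed. The bot names are examples, not a recommendation:

```
# robots.txt: allow only named crawlers, disallow everything else.
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

# All other agents: keep out. Bots that crawl anyway have failed the test.
User-agent: *
Disallow: /
```

Since robots.txt is purely advisory, its real value here is diagnostic: any agent that requests disallowed pages has identified itself as a candidate for the .htaccess deny list.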

This is a never-ending battle, and there will be no "cut-n-paste" solutions, for the very reasons wilderness provided. Time might be a consideration, but the knowledge is basic, the skill set is not that high, and all that's required is access to both robots.txt and .htaccess... and if possible httpd.conf (the next step up), or even better a hardware firewall.

Small starts (suggested above in order) produce positive (large) results very quickly. It is the fine-tuning of deny that becomes more difficult.
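A first-pass deny in .htaccess, once the raw logs show which agents ignored robots.txt, might look like this sketch (Apache 2.2-style syntax, mod_setenvif). The UA tokens are placeholders; substitute whatever your own logs actually reveal:

```apache
# Deny agents observed in the logs requesting pages robots.txt forbids.
# Tokens below are common examples only; your logs are the real source.
BrowserMatchNoCase libwww-perl bad_agent
BrowserMatchNoCase HTTrack bad_agent
BrowserMatchNoCase "^Java/" bad_agent

Order Allow,Deny
Allow from all
Deny from env=bad_agent
```

This is the "small start" form: a blacklist of proven offenders is easy to maintain, and the fine-tuning comes later as log analysis turns up forged UAs that need IP-range denies instead.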