
Whitelisting just the good bots

     
Dexie · 7:49 am on Jan 19, 2011 (gmt 0)


In the .htaccess file, how would you allow just the 'good' bots and disallow any others, please?

In this context, I would class the following as 'good' bots: Googlebot, Yahoo Slurp, MSNBot, Ask77, MSNBot-media, the Web Archive (IA Archiver?), Alexa.

Any help appreciated.

Dexie

g1smd · 6:52 pm on Jan 19, 2011 (gmt 0)


Some people would class at least one of those as a "bad" bot.

topr8 · 7:19 pm on Jan 19, 2011 (gmt 0)


sadly, what you want to do is not as simple as adding a few lines to your .htaccess file.

for one thing, the worst bots masquerade as ordinary users/browsers.

most of the enthusiasts of this particular forum have developed more or less complicated applications (depending on how hardcore they are) involving several layers of bot blocking, of which .htaccess is but a small part of the process.

... as far as i know there is no off-the-shelf solution.

reading the forum library [webmasterworld.com...] is a good way to get an insight into the task ahead.

my own view is that every bit helps, so starting simple is ok; you can build on your blocking as you go along.

Dexie · 10:25 pm on Jan 19, 2011 (gmt 0)


Thanks for the info, it's appreciated. What do you put in your .htaccess for this?

tangor · 12:02 am on Jan 20, 2011 (gmt 0)


Specifically, you'll have to use the Deny/Allow directives, then define which requests pass and let all others fail. I suggest starting with robots.txt: disallow all bots and then, also in robots.txt, allow only the ones you want, as sketched below. Then, after sufficient time to gather data on which bots do NOT honor robots.txt, add those to your deny (fail) entries in .htaccess.
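
A minimal robots.txt sketch of that starting point; the user-agent tokens here (Googlebot, Slurp, msnbot) are assumptions you should verify against each crawler's own documentation:

User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: *
Disallow: /

An empty Disallow means "allow everything" for that crawler; the final wildcard record asks every other compliant bot to stay out.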

Each webmaster will have a different set of criteria as to which "pass" and which "fail".

Some use rewrites, some use SetEnvIf ... The forum library, and the forum itself, have many great examples to get you started. In general we don't give out examples; we'd like to see your best-effort code first and then address any errors. But, having said that, there ARE great examples of how to get started in the library.
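
For illustration, one common shape such rewrite rules take is a "whitelist within the bots": anything whose UA contains a generic crawler marker gets a 403 unless it is on your invited list. The markers and names below are placeholders to tune against your own logs, not anyone's production ruleset:

RewriteEngine On
# UA contains a generic crawler marker...
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|spider|nutch|libwww|curl|wget) [NC]
# ...and is NOT one of the bots you invited
RewriteCond %{HTTP_USER_AGENT} !(googlebot|slurp|msnbot|ia_archiver) [NC]
# send it a 403 Forbidden
RewriteRule .* - [F]

Note this layer only catches bots that admit to being bots; as topr8 said, the worst ones masquerade as browsers and need other layers.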

Dexie · 7:55 am on Jan 20, 2011 (gmt 0)


Have searched through here, but have you any particular threads in mind?

How do you stop a bot when you don't know its IP address?

tangor · 9:35 am on Jan 20, 2011 (gmt 0)


Stop by UA, or stop by speed of access (behavior), or stop by IP address range (see the sketch below), or by country/geo... The Magnificent Obsession of blocking bots and unwanted traffic is more than a hobby; for some it is a "Way of Life". :)
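
For the IP-range case, a one-line sketch in the same Apache 2.2-era syntax used below; 192.0.2.0/24 is a documentation placeholder range, not a real bot network:

Order Deny,Allow
Deny from 192.0.2.0/24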

My stuff is pretty simple. Example from .htaccess:

SetEnvIfNoCase User-Agent "nutch" ban

My Order Deny,Allow looks for "nutch" in any UA string, regardless of case, and sends it a 403 via the env variable "ban". The full pattern is sketched below.
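
Pieced together, the full pattern being described would look something like this; the Order/Deny lines are an assumption based on the description above, not tangor's verbatim configuration:

# Flag any request whose UA contains "nutch", in any case
SetEnvIfNoCase User-Agent "nutch" ban
# Default-allow; 403 anything flagged above
Order Deny,Allow
Deny from env=ban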

Dexie · 9:35 pm on Jan 22, 2011 (gmt 0)


Thanks Tangor, it's appreciated. Is Nutch in quite a lot of the UAs?

topr8 · 9:45 pm on Jan 22, 2011 (gmt 0)


>> Have searched through here, but have you any particular threads in mind?

if you had checked out the link in my first reply to you, you would have seen all the relevant threads, including the up-to-date .htaccess thread.

Dexie · 9:58 pm on Jan 22, 2011 (gmt 0)


My apologies, I will have another look. I didn't see any updated .htaccess thread, but will keep looking through.

wilderness · 10:31 pm on Jan 22, 2011 (gmt 0)


topr8 wrote:

my own view is that every bit helps, so starting simple is ok; you can build on your blocking as you go along.

I'm in total agreement, and wish it were possible to require that webmasters proceed slowly, building their .htaccess skills and expanding their effectiveness as time goes on.

Expecting a newbie to .htaccess to comprehend at the same level as another webmaster who has been using these skills for 5-10 years is too high a goal for that newbie.

tangor · 3:11 am on Jan 23, 2011 (gmt 0)


>> Is Nutch in quite a lot of the UAs?


Whether "nutch" is in UA strings or not is not the real question... Some "nutch" is okay, most is not, but each webmaster has to make a personal determination as to which is which, and that means studying access logs, looking at bandwidth, determining traffic benefits... a whole list of things.

Again, the LESS STRESSFUL way is to start with "who do I let in" instead of "who do I kick out".

Think about a party in your living room. It's easier to control the party by only INVITING people than by attempting to kick out all the UNINVITED, RUDE, ROWDY, or just SPAMMY attendees. That's what whitelisting accomplishes. Once whitelisting is established, the only ones kicked out after that are gate crashers... and that's a much smaller drain on time and energy.
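
In .htaccess terms, that "invite list" idea can be sketched with the SetEnvIf style shown earlier; every pattern here is a placeholder to adapt to your own logs, not a recommended production list:

# Flag anything that self-identifies as a crawler
SetEnvIfNoCase User-Agent (bot|crawl|spider|nutch) ban
# Un-flag the invited guests
SetEnvIfNoCase User-Agent (googlebot|slurp|msnbot|ia_archiver) !ban
# Default-allow; 403 whatever is still flagged
Order Deny,Allow
Deny from env=ban

Gate crashers that fake a browser UA still get through this layer, which is why the regulars here add behavioral and IP-based checks on top.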