
Whitelisting just the good bots


Dexie

7:49 am on Jan 19, 2011 (gmt 0)

In the .htaccess file, how would you allow just the 'good' bots and disallow all others, please?

In this context, I would class the following as 'good' bots: Googlebot, Yahoo Slurp, MSNBot, Ask, MSNBot-media, the web archive (ia_archiver?), and Alexa.

Any help appreciated.

Dexie

g1smd

6:52 pm on Jan 19, 2011 (gmt 0)

Some people would class at least one of those as a "bad" bot.

topr8

7:19 pm on Jan 19, 2011 (gmt 0)

Sadly, what you want to do is not as simple as adding a few lines to your .htaccess file.

For one thing, the worst bots masquerade as ordinary users/browsers.

Most of the enthusiasts of this particular forum have developed more or less complicated applications (depending on how hardcore they are) which involve several layers of bot blocking, of which .htaccess is but a small part of the process.

... as far as I know, there is no off-the-shelf solution.

Reading the forum library [webmasterworld.com...] is a good way to get an insight into the task ahead.

My own view is that every bit helps, so starting simple is OK; you can build on your blocking as you go along.

Dexie

10:25 pm on Jan 19, 2011 (gmt 0)

Thanks for the info, it's appreciated. What do you put in your .htaccess for this?

tangor

12:02 am on Jan 20, 2011 (gmt 0)

Specifically, you'll have to use the Deny/Allow directives, then define which requests pass and let all others fail. I suggest starting with robots.txt: disallow all bots, then, also in robots.txt, allow only the ones you want (a sketch follows below). Then, after sufficient time to gather data on which bots do NOT honor robots.txt, add those to your deny (fail) entries in .htaccess.
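
For illustration, a minimal robots.txt along those lines might look like the following. The tokens shown are the usual user-agent names for some of the bots mentioned above; which ones you keep is your own call, and an empty Disallow line means "allow everything" for that agent:

User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /

The final catch-all record disallows every crawler not named above, at least the ones polite enough to read robots.txt; the rude ones are exactly what the .htaccess layer is for.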

Each webmaster will have a different set of criteria as to which "pass" and which "fail".

Some use rewrites, some use SetEnvIf... The forum library, and the forum itself, have many great examples to get you started. In general we don't give out examples; we'd like to see your best-effort code, then address any errors. But, having said that, there ARE great examples of how to get started in the library.

Dexie

7:55 am on Jan 20, 2011 (gmt 0)

I've searched through here, but do you have any particular threads in mind?

And how do you stop a bot when you don't know its IP address?

tangor

9:35 am on Jan 20, 2011 (gmt 0)

Stop by UA, or by speed of access (behavior), or by IP address range, or by country/geo... The Magnificent Obsession of blocking bots and unwanted traffic is more than a hobby; for some it is a "Way of Life". :)
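
As an aside, the IP-range option is a one-liner per range in Apache 2.2 syntax. The range below is the reserved documentation block (192.0.2.0/24), standing in for whatever range your own logs implicate:

Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24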

My stuff is pretty simple; an example from my .htaccess:

SetEnvIfNoCase User-Agent "nutch" ban

My Order Deny,Allow block looks for "nutch" anywhere in the UA string, regardless of case, and sends the request a 403 via the env variable "ban".
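
Spelled out in full, the pairing tangor describes would look something like this in Apache 2.2 syntax (a sketch; "ban" is just the variable name from his example):

SetEnvIfNoCase User-Agent "nutch" ban
Order Deny,Allow
Deny from env=ban

With Order Deny,Allow and no Allow lines, anything tagged "ban" is refused with a 403 and every other request falls through to the default allow.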

Dexie

9:35 pm on Jan 22, 2011 (gmt 0)

Thanks tangor, it's appreciated. Is Nutch in quite a lot of the UAs?

topr8

9:45 pm on Jan 22, 2011 (gmt 0)

>> I've searched through here, but do you have any particular threads in mind?

If you check out the link in my first reply to you, you will see all the relevant threads, including the up-to-date .htaccess thread.

Dexie

9:58 pm on Jan 22, 2011 (gmt 0)

My apologies, I'll have another look. I didn't see any updated .htaccess thread, but I'll keep looking through.

wilderness

10:31 pm on Jan 22, 2011 (gmt 0)

topr8 wrote

My own view is that every bit helps, so starting simple is OK; you can build on your blocking as you go along.


I'm in total agreement, and wish it were possible to require that webmasters proceed slowly, building their .htaccess skills and expanding their effectiveness as time goes on.

Expecting a newbie to .htaccess to comprehend at the same level as another webmaster who has been using these skills for 5-10 years is too high a goal for that newbie.

tangor

3:11 am on Jan 23, 2011 (gmt 0)

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



Is Nutch in quite a lot of the UAs?


Whether "nutch" is in UA strings or not is not the real question... Some "nutch" is okay, most is not, but each webmaster has to make a personal determination as to which is which, and that means studying access logs, looking at bandwidth, determining traffic benefits... a whole list of things.

Again, the LESS STRESSFUL way is to start with "who do I let in" instead of "who do I kick out".

Think about a party in your living room. It's easier to control the party by only INVITING people than by attempting to kick out all the UNINVITED or RUDE, ROWDY, or just SPAMMY attendees. That's what whitelisting accomplishes. Once whitelisting is established, the only ones kicked out after that are gate crashers... and that's a much smaller drain on time and energy. A sketch of what that looks like in .htaccess follows.
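
A minimal whitelist sketch in Apache 2.2 syntax, assuming that ordinary browsers announce a UA beginning with "Mozilla" and that the tokens listed are the guests you want in ("invited" is just an illustrative variable name; adjust both lists to taste):

SetEnvIfNoCase User-Agent "Googlebot" invited
SetEnvIfNoCase User-Agent "Slurp" invited
SetEnvIfNoCase User-Agent "msnbot" invited
SetEnvIfNoCase User-Agent "^Mozilla" invited
Order Deny,Allow
Deny from all
Allow from env=invited

Here Deny from all turns everyone away unless one of the SetEnvIfNoCase lines has tagged the request "invited". The caveat from earlier in the thread still applies: the worst bots fake exactly these UA strings, so this only stops the honest gate crashers; verifying IPs is the next layer.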
 
