Search Engine Spider and User Agent Identification Forum

Whitelisting just the good bots
Dexie

10+ Year Member

Msg#: 4255036 posted 7:49 am on Jan 19, 2011 (gmt 0)

In the .htaccess file, how would you allow just the 'good' bots and then disallow any others, please?

In this context, I would class the following as 'good' bots: Googlebot, Yahoo Slurp, MSNBot, Ask77, MSNBot-media, the Web Archive (ia_archiver?), Alexa.

Any help appreciated.

Dexie


g1smd

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member

Msg#: 4255036 posted 6:52 pm on Jan 19, 2011 (gmt 0)

Some people would class at least one of those as a "bad" bot.

topr8

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member

Msg#: 4255036 posted 7:19 pm on Jan 19, 2011 (gmt 0)

Sadly, what you want to do is not as simple as adding a few lines to your .htaccess file.

For one thing, the worst bots masquerade as ordinary users/browsers.

Most of the enthusiasts of this particular forum have developed more or less complicated applications (depending on how hardcore they are) that involve several layers of bot blocking, of which .htaccess is but a small part.

... as far as I know, there is no off-the-shelf solution.

Reading the forum library [webmasterworld.com...] is a good way to get an insight into the task ahead.

My own view is that every bit helps, so starting simple is OK; you can build on your blocking as you go along.

Dexie

10+ Year Member

Msg#: 4255036 posted 10:25 pm on Jan 19, 2011 (gmt 0)

Thanks for the info, it's appreciated. What do you put in your .htaccess for this?

tangor

WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member, Top Contributor of the Month

Msg#: 4255036 posted 12:02 am on Jan 20, 2011 (gmt 0)

Specifically, you'll have to use the Deny,Allow directives: define which bots pass and let all others fail. I suggest starting with robots.txt: deny all bots, then, also in robots.txt, allow only the ones you want. Then, after sufficient time to gather data on which bots do NOT honor robots.txt, add those to your deny (fail) entries in .htaccess.
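
A robots.txt along those lines might look like this (a sketch only; "Slurp" and "msnbot" are the usual UA tokens for the Yahoo and MSN crawlers mentioned above, and the list should be adjusted to taste):

# robots.txt -- whitelist style: name the bots you want in, shut out the rest
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

# everyone else: keep out (honored only by well-behaved bots)
User-agent: *
Disallow: /

Anything still hitting your pages after that is, by definition, ignoring robots.txt and is a candidate for the .htaccess deny entries.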

Each webmaster will have a different set of criteria as to which "pass" and which "fail".

Some use rewrites, some use SetEnvIf ... The forum library, and the forum itself, have many great examples to get you started. In general we don't give out examples; we'd rather see your best-effort code and then address any errors. But, having said that, there ARE great examples of how to get started in the library.
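
For illustration only (this is not code from the thread, and "badbot" is a made-up UA token), the rewrite flavour of a single-bot block looks like:

# mod_rewrite version: return 403 to any UA containing "badbot", any case
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} badbot [NC]
RewriteRule .* - [F]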

Dexie

10+ Year Member



 
Msg#: 4255036 posted 7:55 am on Jan 20, 2011 (gmt 0)

Have searched through here, but have you any particular threads in mind?

How do you stop a bot when you don't know its IP address?

tangor

WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member, Top Contributor of the Month

Msg#: 4255036 posted 9:35 am on Jan 20, 2011 (gmt 0)

Stop by UA, or stop by speed of access (behavior), or stop by IP address range, or by country/geo... The Magnificent Obsession of blocking bots and unwanted traffic is more than a hobby; for some it is a "Way of Life". :)

My stuff is pretty simple. An example from .htaccess:

SetEnvIfNoCase User-Agent "nutch" ban

My Order Deny,Allow looks for "nutch" in any UA string, regardless of case, and sends it a 403 via the env variable "ban".
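
Filled out, the 2.2-style block being described would look something like this (the Order/Deny lines are implied by the post rather than quoted from it):

# set env var "ban" if "nutch" appears anywhere in the UA, any case
SetEnvIfNoCase User-Agent "nutch" ban
# Deny rules are checked first; anything not denied is allowed through
Order Deny,Allow
Deny from env=ban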

Dexie

10+ Year Member

Msg#: 4255036 posted 9:35 pm on Jan 22, 2011 (gmt 0)

Thanks tangor, it's appreciated. Is Nutch in quite a lot of the UAs?

topr8

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member

Msg#: 4255036 posted 9:45 pm on Jan 22, 2011 (gmt 0)

>> Have searched through here, but have you any particular threads in mind?

If you had checked out the link in my first reply to you, you would have seen all the relevant threads, including the up-to-date .htaccess thread.

Dexie

10+ Year Member

Msg#: 4255036 posted 9:58 pm on Jan 22, 2011 (gmt 0)

My apologies, I'll have another look. I didn't see any updated .htaccess thread, but I will keep looking through.

wilderness

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month

Msg#: 4255036 posted 10:31 pm on Jan 22, 2011 (gmt 0)

topr8 wrote:

>> My own view is that every bit helps, so starting simple is OK; you can build on your blocking as you go along.

I'm in total agreement, and wish it were possible to require that webmasters proceed slowly, building their .htaccess skills and expanding their effectiveness as time goes on.

Expecting a newbie to .htaccess to comprehend at the same level as another webmaster who has been using these skills for 5-10 years is too high a goal for that newbie.

tangor

WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member, Top Contributor of the Month

Msg#: 4255036 posted 3:11 am on Jan 23, 2011 (gmt 0)

>> Is Nutch in quite a lot of the UAs?

Whether "nutch" is in UA strings or not is not the real question... Some "nutch" is okay, most is not, but each webmaster has to make a personal determination as to which is which. That means studying access logs, looking at bandwidth, determining traffic benefits... a whole list of things.

Again, the LESS STRESSFUL way is to start with "who do I let in" instead of "who do I kick out".

Think about a party in your living room. It's easier to control the party by only INVITING people than by attempting to kick out all the UNINVITED or RUDE, ROWDY, or just SPAMMY attendees. That's what whitelisting accomplishes. Once whitelisting is established, the only ones kicked out after that are gate crashers... and that's a much smaller drain on time and energy.
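
A bare-bones .htaccess sketch of that whitelist idea (not code from the thread, and the UA tokens are only examples; UA strings are trivially forged, so a serious setup also verifies IP ranges):

# flag anything that self-identifies as a robot...
SetEnvIfNoCase User-Agent "(bot|crawl|spider)" bad_bot
# ...then un-flag the invited guests
SetEnvIfNoCase User-Agent "Googlebot" !bad_bot
SetEnvIfNoCase User-Agent "Slurp" !bad_bot
SetEnvIfNoCase User-Agent "msnbot" !bad_bot

# ordinary browsers carry no bot-like UA token and pass untouched
Order Deny,Allow
Deny from env=bad_bot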
