
Search Engine Spider and User Agent Identification Forum

Whitelisting just the good bots
Dexie
msg:4255038
7:49 am on Jan 19, 2011 (gmt 0)

In the .htaccess file, how would you allow just the 'good' bots and disallow all others, please?

In this context, I would class the following as 'good' bots: Googlebot, Yahoo Slurp, MSNBot, Ask, MSNBot-media, the Web Archive (ia_archiver), and Alexa.

Any help appreciated.

Dexie


g1smd
msg:4255305
6:52 pm on Jan 19, 2011 (gmt 0)

Some people would class at least one of those as a "bad" bot.

topr8
msg:4255328
7:19 pm on Jan 19, 2011 (gmt 0)

Sadly, what you want to do is not as simple as adding a few lines to your .htaccess file.

For one thing, the worst bots masquerade as ordinary users/browsers.

Most of the enthusiasts in this particular forum have developed more or less complicated applications (depending on how hardcore they are) that involve several layers of bot blocking, of which .htaccess is only a small part.

As far as I know, there is no off-the-shelf solution.

Reading the forum library [webmasterworld.com...] is a good way to get an insight into the task ahead.

My own view is that every bit helps, so starting simple is fine; you can build on your blocking as you go along.

Dexie
msg:4255431
10:25 pm on Jan 19, 2011 (gmt 0)

Thanks for the info, it's appreciated. What do you put in your .htaccess for this?

tangor
msg:4255470
12:02 am on Jan 20, 2011 (gmt 0)

Specifically, you'll have to use Apache's Order/Deny/Allow directives: define which visitors pass, and let all others fail. I suggest starting with robots.txt: deny all bots, and then, also in robots.txt, allow only the ones you want. After enough time has passed to gather data on which bots do NOT honor robots.txt, add those to your deny (fail) entries in .htaccess.
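
A minimal robots.txt along those lines might look like this (the bot names here are only examples; substitute your own whitelist):

# invited bots may crawl everything
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

# everyone else is asked to stay out
User-agent: *
Disallow: /

Remember that robots.txt only filters the polite bots; the ones that ignore it are the ones you later move into .htaccess denies.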

Each webmaster will have a different set of criteria as to which "pass" and which "fail".

Some use rewrites, some use SetEnvIf... The forum library, and the forum itself, have many great examples to get you started. In general we don't give out examples; we'd rather see your best-effort code and then address any errors. But, having said that, there ARE great examples of how to get started in the library.
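
For illustration only (the whitelisted names and the catch-all pattern are assumptions, not a definitive list), a rewrite-based version might 403 anything that announces itself as a crawler but is not on your whitelist:

RewriteEngine On
# let the named bots through untouched
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp|msnbot|ia_archiver) [NC]
# ...and refuse anything else that self-identifies as a bot
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|spider) [NC]
RewriteRule .* - [F]

This only catches honest bots; as noted above, the worst ones masquerade as ordinary browsers and need other layers.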

Dexie
msg:4255584
7:55 am on Jan 20, 2011 (gmt 0)

Have searched through here, but have you any particular threads in mind?

How do you stop a bot when you don't know its IP address?

tangor
msg:4255605
9:35 am on Jan 20, 2011 (gmt 0)

Stop by UA, by speed of access (behavior), by IP address range, or by country/geo... The Magnificent Obsession of blocking bots and unwanted traffic is more than a hobby; for some it is a "Way of Life". :)

My stuff is pretty simple. An example from .htaccess:

SetEnvIfNoCase User-Agent "nutch" ban

My Order Deny,Allow looks for "nutch" anywhere in the UA string, regardless of case, and sends such requests a 403 via the env variable "ban".
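
For context, a minimal sketch of the complete block that one-liner belongs to (Apache 2.2-style access control assumed):

# tag any UA containing "nutch", in any case
SetEnvIfNoCase User-Agent "nutch" ban
# deny tagged requests with a 403; everyone else gets through
Order Deny,Allow
Deny from env=ban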

Dexie
msg:4256776
9:35 pm on Jan 22, 2011 (gmt 0)

Thanks tangor, it's appreciated. Is "nutch" in quite a lot of the UAs?

topr8
msg:4256781
9:45 pm on Jan 22, 2011 (gmt 0)

>>Have searched through here, but have you any particular threads in mind?

If you check out the link in my first reply to you, you will see all the relevant threads, including the up-to-date .htaccess thread.

Dexie
msg:4256784
9:58 pm on Jan 22, 2011 (gmt 0)

My apologies, I will have another look. I didn't see any updated .htaccess thread, but I will keep looking through.

wilderness
msg:4256795
10:31 pm on Jan 22, 2011 (gmt 0)

topr8 wrote:
My own view is that every bit helps, so starting simple is fine; you can build on your blocking as you go along.

I'm in total agreement, and I wish it were possible to require that webmasters proceed slowly, building their .htaccess skills and expanding their effectiveness as time goes on.

Expecting a newbie to .htaccess to comprehend at the same level as a webmaster who has been using these skills for 5-10 years is too high a goal for that newbie.

tangor
msg:4256857
3:11 am on Jan 23, 2011 (gmt 0)

Is "nutch" in quite a lot of the UAs?

Whether "nutch" appears in UA strings or not is not the real question... Some "nutch" is okay, most is not, but each webmaster has to make a personal determination as to which is which, and that means studying access logs, looking at bandwidth, determining traffic benefits... a whole list of things.

Again, the LESS STRESSFUL way is to start with "who do I let in?" instead of "who do I kick out?".

Think about a party in your living room. It's easier to control the party by only INVITING people than by attempting to kick out all the UNINVITED, RUDE, ROWDY, or just plain SPAMMY attendees. That's what whitelisting accomplishes. Once whitelisting is established, the only ones kicked out after that are gate-crashers... and that's a much smaller drain on time and energy.
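
As a sketch only (the invited names are assumptions, and Apache 2.2 syntax is assumed), the invite-only approach might look like:

# tag the invited guests
SetEnvIfNoCase User-Agent "Googlebot" invited
SetEnvIfNoCase User-Agent "Slurp" invited
SetEnvIfNoCase User-Agent "msnbot" invited
# ordinary browsers identify as Mozilla; treat them as invited too
SetEnvIfNoCase User-Agent "^Mozilla" invited
# default is deny; only tagged visitors get in
Order Allow,Deny
Allow from env=invited

Bad bots fake browser UAs, so this filters only the honest ones; the gate-crashers still have to be caught by behavior, IP range, and the other layers discussed above.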
