How to enable every spider from Google, Yahoo and Bing?

How to ban every other spider/bot/robot!

Future

6:16 pm on Jun 12, 2010 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hello,
Is there any way to allow only the crawlers/robots/spiders/bots belonging to the following domains?

google.com
yahoo.com
bing.com
live.com

And is there any way to block each and every other spider?

dstiles

9:20 pm on Jun 12, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You can allow all of those, with specific disallows on pages that should not be scanned, using robots.txt, which can be arranged to block all other bots THAT OBEY THE RULES! (robots.txt does not actually allow or block, it just tells bots what they SHOULD do.)

Note that bing and live both use msnbot.
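As a rough sketch of the robots.txt approach described above: the rules below allow three crawler tokens and ask everyone else to stay out. The tokens Googlebot, Slurp and msnbot are assumptions for illustration (check each engine's own documentation for its current tokens), and the helper names are made up. Python's standard urllib.robotparser is used only to show how a bot that obeys the rules would read the file; it does nothing to stop a bot that ignores them.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: an empty Disallow means "nothing is disallowed".
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: *
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text the way a well-behaved crawler would."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def may_fetch(parser, user_agent, path="/"):
    """Return True if the rules permit this user agent to fetch the path."""
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for agent in ("Googlebot", "Slurp", "msnbot", "SomeScraper"):
        print(agent, may_fetch(rules, agent, "/page.html"))
```

Specific Disallow lines for pages that should not be scanned can be added under each named crawler in place of the empty Disallow.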

If you want to get rid of bots that do not obey robots.txt (about 99% of them) then you need something stronger, such as .htaccess rules that monitor the traffic. There's some info on this posted recently in the Google SEO forum of this site as well as in this forum.

keyplyr

10:24 pm on Jun 12, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you're on an Apache (unix) server you can whitelist the ranges allowed to use known UAs belonging to those companies (and block others that try to spoof as them) by using mod_rewrite in an .htaccess file.


RewriteEngine On
# Google: forbid requests whose UA claims Google but whose address
# matches none of the ranges below (consecutive conditions are ANDed).
RewriteCond %{HTTP_USER_AGENT} Google [NC]
RewriteCond %{REMOTE_ADDR} !^64\.68\.[89][0-9]\.
RewriteCond %{REMOTE_ADDR} !^64\.233\.1[6-9][0-9]\.
RewriteCond %{REMOTE_ADDR} !^66\.249\.[6-9][0-9]\.
RewriteCond %{REMOTE_ADDR} !^72\.14\.[12][0-9][0-9]\.
RewriteCond %{REMOTE_ADDR} !^74\.125\.
RewriteCond %{REMOTE_ADDR} !^209\.85\.[12][0-9][0-9]\.
RewriteCond %{REMOTE_ADDR} !^216\.239\.[3-6][0-9]\.
RewriteRule .* - [F]
# Microsoft (Bing/Live): same check for the msnbot family of UAs.
RewriteCond %{HTTP_USER_AGENT} (msnbot|MSN\ Soci|MSR-ISR|MSRBOT) [NC]
RewriteCond %{REMOTE_ADDR} !^65\.5[2-5]\.
RewriteCond %{REMOTE_ADDR} !^131\.10[67]\.
RewriteCond %{REMOTE_ADDR} !^207\.46\.
RewriteCond %{REMOTE_ADDR} !^207\.[67][0-9]\.
RewriteRule .* - [F]
# Yahoo: same check for the Slurp UA.
RewriteCond %{HTTP_USER_AGENT} Slurp [NC]
RewriteCond %{REMOTE_ADDR} !^67\.195\.
RewriteCond %{REMOTE_ADDR} !^72\.30\.
RewriteCond %{REMOTE_ADDR} !^74\.6\.
RewriteRule .* - [F]
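To make the logic of those rules easier to follow, here is a minimal sketch of the same decision. The address patterns for Google and Slurp are copied from the rules above (the msnbot block is left out for brevity); the names RULES and is_forbidden are hypothetical, and this is an illustration of the matching logic, not something to run in place of mod_rewrite.

```python
import re

# (claimed User-Agent pattern, address patterns) copied from the rules above.
RULES = [
    (re.compile(r"Google", re.I), [
        r"^64\.68\.[89][0-9]\.",
        r"^64\.233\.1[6-9][0-9]\.",
        r"^66\.249\.[6-9][0-9]\.",
        r"^72\.14\.[12][0-9][0-9]\.",
        r"^74\.125\.",
        r"^209\.85\.[12][0-9][0-9]\.",
        r"^216\.239\.[3-6][0-9]\.",
    ]),
    (re.compile(r"Slurp", re.I), [
        r"^67\.195\.",
        r"^72\.30\.",
        r"^74\.6\.",
    ]),
]


def is_forbidden(user_agent, remote_addr):
    """Return True when the UA claims a listed crawler but the
    address matches none of that crawler's whitelisted patterns."""
    for ua_pattern, addr_patterns in RULES:
        claims_crawler = ua_pattern.search(user_agent) is not None
        from_known_range = any(re.search(p, remote_addr) for p in addr_patterns)
        if claims_crawler and not from_known_range:
            return True
    return False


if __name__ == "__main__":
    print(is_forbidden("Mozilla/5.0 (compatible; Googlebot/2.1)", "66.249.65.1"))
    print(is_forbidden("Mozilla/5.0 (compatible; Googlebot/2.1)", "203.0.113.9"))
    print(is_forbidden("SomeScraper/1.0", "203.0.113.9"))
```

Note what the last call shows: a bot that does not pretend to be one of the listed crawlers is not matched by any condition and passes straight through. These rules stop spoofers; blocking every other bot needs additional rules.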

Future

10:54 am on Jun 13, 2010 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hello dstiles and keyplyr
Thank you very much for your valuable feedback.

How do you keep away bad bots that do not obey robots.txt, and/or content harvesters that scrape via RSS feeds?
Do you recommend banning all bots, except google, yahoo and bing?

Recently, we have found a huge increase in these spam bots on our site(s).
We have banned them all in robots.txt, but they still keep reappearing.

jdMorgan

3:15 pm on Jun 13, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Future,

The above posts have some very important details in them. Make sure you understand these details, because they will answer both your original questions and your follow-on questions.

Jim

keyplyr

9:33 pm on Jun 13, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do you recommend banning all bots, except google, yahoo and bing?

That depends on your purposes. Many webmasters block almost all bots; many don't block any at all. I use a combination of generic UA blocking, whitelisting and IP blocking.

Some (new) bad bots get through. Sometimes I add them to the allow/disallow lists, sometimes I don't bother. Generally, if they request and respect robots.txt *and* if their purpose is clearly defined in a web page linked from the UA string *and* it fits within criteria I've established for my site, then I allow it.

The only real *recommendation* I have is to read the volumes of threads here at WW. The information archived here is invaluable.

dstiles

9:40 pm on Jun 13, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Future: which bots you ban is up to you. A lot are "me-too" or simply scraping content to sell to other businesses. Those are all worth banning.

I allow two or three dozen "real" bots from Abrave to Zeusbot (Ulyseek engine) and am slowly increasing the list as I discover new useful ones. They may not give me traffic now but may contribute later. (Warning: there is also a VERY bad Zeus scraper around!)

blend27

12:05 am on Jun 16, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



keyplyr or anyone else that knows IIS,

Would those .htaccess rules work with ISAPIRewrite3 on an IIS server?

I have whitelisting set up on all of the sites I monitor as well, but mine is at the software level because I need some more info and I whitelist some other regional bots. But some sites are really just there for show; I don't need any additional info for those.

Thanks,

Blend27

keyplyr

1:49 am on Jun 16, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




Would those .htaccess rules work with ISAPIRewrite3 on an IIS server?

Not as written; they need to be adapted to IIS. I'm really the wrong person to ask.

Maybe post at: [webmasterworld.com...]

wilderness

3:31 am on Jun 16, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Microsoft IIS Web Server [webmasterworld.com]

keyplyr

3:45 am on Jun 16, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Duh - Why didn't I see that?

wilderness

12:26 pm on Jun 16, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



keyplyr,
I would not have been aware of it myself; however, I've had some IIS bookmarks saved for some time. I was rechecking them and the URL came up in a search.

Don