Search Engine Spider and User Agent Identification Forum

    
Is this spider a spam bot?
Mario155




msg:4426573
 5:30 pm on Mar 8, 2012 (gmt 0)

Mozilla/4.0 (compatible; Crawler; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)

That keeps showing up on my site, often giving thousands of hits a day. I am trying to block it but have had no success yet. I did the Disallow: / thing in my robots.txt, but this one is still getting through.

Can someone tell me the code to block this bot in my htaccess file? I am wondering what I should put in to block it, because it is not a simple name like "bingbot". Do I only need to put in Mozilla/4.0, or maybe Mozilla/4.0*?

Or do I have to put the entire thing in my htaccess, including what is in the parentheses?

Also, what is the exact code structure you use to block bots in htaccess? I have seen many examples; can you paste exactly how I should code this with my Mozilla bot code?

RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4.0 [OR]
RewriteCond %{HTTP_USER_AGENT} ^bingbot
RewriteRule ^.* - [F]
RewriteRule !^http://[^/.]\.your-site.com.* - [F]

Is this the correct code to use? Or am I missing anything?

thank you

 

wilderness




msg:4426655
 8:26 pm on Mar 8, 2012 (gmt 0)

It's an error to focus upon Mozilla/4.0, since genuine MSIE browsers send that same prefix and would be blocked as well.

Rather, you should focus upon the word "crawler", which will catch more pests than this one.

Not sure why you want Bing denied?

One way:
RewriteCond %{HTTP_USER_AGENT} ^Crawler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Bingbot [NC]
RewriteRule .* - [F]

another way:
RewriteCond %{HTTP_USER_AGENT} ^(Bing|Crawler) [NC]
RewriteRule .* - [F]

In the latter example, you may use multiple lines, keeping 6-8 words organized on each line.
Join the lines with [OR], with the exception of the last condition, which will NOT have an [OR].
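
A rough sketch of that multi-line layout, for illustration only. ExampleBot, SampleSpider, DemoFetch and TestHarvester are made-up placeholder names, not known pests. The patterns are left unanchored on the assumption that the word may sit anywhere in the UA, as "Crawler" does inside the parentheses of the UA in the first post:

```apache
# Placeholder agent names for illustration; substitute your own list.
RewriteEngine on
# Every condition except the last carries [OR]; NC makes the match case-insensitive.
RewriteCond %{HTTP_USER_AGENT} (Crawler|ExampleBot|SampleSpider) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (DemoFetch|TestHarvester) [NC]
# Any request whose UA matched one of the conditions gets 403 Forbidden.
RewriteRule .* - [F]
```

With this layout a request is refused as soon as any one of the listed words appears in its UA.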

[edited by: wilderness at 8:32 pm (utc) on Mar 8, 2012]

wilderness




msg:4426657
 8:28 pm on Mar 8, 2012 (gmt 0)

FWIW, and to benefit others in this forum, you should include the IP range as well, while obscuring the Class D number (the last octet).

For some of these pests it isn't even worth bothering with the UA; they may be mass denied via IP.
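
A minimal sketch of an IP-only denial, written with mod_rewrite to match the rest of this thread. The range 192.0.2.0/24 is a reserved documentation block used purely as a placeholder; the real range would have to come from your own logs:

```apache
# Placeholder range: 192.0.2.x is reserved for documentation, not a real offender.
RewriteEngine on
# Match every address beginning with 192.0.2. regardless of the UA sent.
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
RewriteRule .* - [F]
```

The escaped dots and the trailing \. keep the pattern from also matching addresses such as 192.0.20.x.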

tangor




msg:4426674
 10:03 pm on Mar 8, 2012 (gmt 0)

Other info: there are no "spam bots" ... spam is a different kind of ugly on the web, but all bots are scrapers, even the ones we want to come get our stuff.

Mario155




msg:4426749
 1:30 am on Mar 9, 2012 (gmt 0)

Well, I have noticed thousands of hits coming from Bing Robot lately. I wasn't sure if I should let them continue to do that, but I will let it go for now. So, should I make my code exactly like I have it below? Where it says "RewriteRule", I should just leave it as .* then? I saw other examples of this code with site URLs in there, but if I don't need a URL I can leave it this way.

RewriteCond %{HTTP_USER_AGENT} ^Crawler [NC]
RewriteRule .* - [F]

thanks

wilderness




msg:4426752
 1:43 am on Mar 9, 2012 (gmt 0)

My apologies.

You should omit the leading anchor (^, which means "begins with").
By doing so, you use "contains" and the word may be located anywhere within the UA.

#turn on Rewrite, if NOT done previously
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} Crawler [NC]
RewriteRule .* - [F]

keyplyr




msg:4426758
 2:04 am on Mar 9, 2012 (gmt 0)

Mario155, you should consider allowing the Big 4 search engines but filtering them by IP range. There are many impostors out there. Making sure they are who they say they are is a must nowadays. Example:

RewriteEngine On
# Googlebot: forbid the UA unless the request comes from the listed range (robots.txt stays reachable)
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{REMOTE_ADDR} !^66\.249\.[6-9][0-9]\.
RewriteRule !^robots\.txt$ - [F]
# Bing/MSN: forbid the UA unless the request comes from one of the listed ranges
RewriteCond %{HTTP_USER_AGENT} (Bingbot|Bing\ Mobile\ |msnbot)
RewriteCond %{REMOTE_ADDR} !^65\.5[2-5]\.
RewriteCond %{REMOTE_ADDR} !^70\.37\.
RewriteCond %{REMOTE_ADDR} !^131\.10[67]\.
RewriteCond %{REMOTE_ADDR} !^157\.[45][0-9]\.
RewriteCond %{REMOTE_ADDR} !^207\.46\.
RewriteCond %{REMOTE_ADDR} !^207\.[67][0-9]\.
RewriteRule !^robots\.txt$ - [F]
# Yahoo Slurp: forbid the UA unless the request comes from one of the listed ranges
RewriteCond %{HTTP_USER_AGENT} Slurp [NC]
RewriteCond %{REMOTE_ADDR} !^67\.195\.
RewriteCond %{REMOTE_ADDR} !^72\.30\.
RewriteCond %{REMOTE_ADDR} !^74\.6\.
RewriteCond %{REMOTE_ADDR} !^98\.13[6-9]\.
RewriteCond %{REMOTE_ADDR} !^202\.160\.1[7-9][0-9]\.
RewriteCond %{REMOTE_ADDR} !^203\.209\.2[2-5][0-9]\.
RewriteRule !^robots\.txt$ - [F]
# Yandex: forbid the UA unless the request comes from one of the listed ranges
RewriteCond %{HTTP_USER_AGENT} Yandex(Antivirus|Bot|Images|Media) [NC]
RewriteCond %{REMOTE_ADDR} !^77\.88\.[45][0-3]\.
RewriteCond %{REMOTE_ADDR} !^93\.158\.14[67]\.
RewriteCond %{REMOTE_ADDR} !^93\.158\.153\.
RewriteCond %{REMOTE_ADDR} !^95\.108\.[12][1-5][0-9]\.
RewriteCond %{REMOTE_ADDR} !^178\.154\.[12][0-9][0-9]\.
RewriteCond %{REMOTE_ADDR} !^199\.21\.9[6-9]\.
RewriteRule !^robots\.txt$ - [F]

wilderness




msg:4426785
 3:59 am on Mar 9, 2012 (gmt 0)

I'm disappointed, keyplyr.

Here I thought you had a tight rein, and you're letting in 131.10-, the Slurp APNIC ranges, and a host of Yandex IPs, all from Europe.

What's this world coming to ;)

keyplyr




msg:4426800
 5:02 am on Mar 9, 2012 (gmt 0)



wilderness - as I said, this was an example for the OP. My code is more restrictive but I'll let the OP tweak the IP ranges to suit his/her needs. Not all sites are the same ya know :)

MSN's 131.10* used to be used for legit purposes. Although it has fallen into the stealth hit-and-run category nowadays, I still allow the range because I block the nefarious requests by other means. But you may be right, it probably should be omitted.

And FYI, Europe represents 30% of my sales and Yandex sends triple-digit daily traffic my way.

keyplyr




msg:4426815
 6:09 am on Mar 9, 2012 (gmt 0)

Mario155, as for the generic "crawler": there are many agents using "crawler" and "spider" that you may find beneficial to your site, so you'll need to allow some of these. I have something like this (if wilderness approves, that is):

RewriteCond %{HTTP_USER_AGENT} (crawl|spider) [NC]
RewriteCond %{HTTP_USER_AGENT} !^add allowed UAs that start with the name here
RewriteCond %{HTTP_USER_AGENT} !add allowed UAs that have name in the middle here
RewriteCond %{REMOTE_ADDR} !^add allowed IP range here
RewriteCond %{REMOTE_ADDR} !^add allowed IP range here
RewriteRule !^robots\.txt$ - [F]
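
To make the template above concrete, here is one possible filled-in version. Everything specific in it is an assumption for illustration: GoodCrawler and FriendlySpider are invented agent names, and 198.51.100.x and 203.0.113.x are reserved documentation ranges standing in for ranges you actually trust:

```apache
# All agent names and IP ranges below are placeholders, not recommendations.
RewriteEngine on
# Candidate for blocking: UA contains "crawl" or "spider" anywhere.
RewriteCond %{HTTP_USER_AGENT} (crawl|spider) [NC]
# Exempt an allowed UA that starts with its name.
RewriteCond %{HTTP_USER_AGENT} !^GoodCrawler [NC]
# Exempt an allowed UA that carries its name in the middle.
RewriteCond %{HTTP_USER_AGENT} !FriendlySpider [NC]
# Exempt requests arriving from allowed IP ranges.
RewriteCond %{REMOTE_ADDR} !^198\.51\.100\.
RewriteCond %{REMOTE_ADDR} !^203\.0\.113\.
# Everything still matching is forbidden, except requests for robots.txt.
RewriteRule !^robots\.txt$ - [F]
```

Because the conditions are ANDed, a request escapes the block if it satisfies any single exemption: an allowed UA passes from any address, and an allowed address passes with any UA.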

Mario155




msg:4426986
 2:41 pm on Mar 9, 2012 (gmt 0)

Yesterday I received 13,000 spider hits on my main site. That's got to be too much right? 4500 of them were from Bing Robot, and 2300 from that Mozilla/4.0 thing.

I know I want to allow Google, Yahoo, and a few others access, but Mozilla/4.0 has got to be a bad spider right?

wilderness




msg:4426998
 2:59 pm on Mar 9, 2012 (gmt 0)

Mario155 wrote: "Mozilla/4.0 has got to be a bad spider right?"

As I explained previously, this term is the wrong thing to focus upon.

Rather, focus upon the IP and other portions of the UA.

Assuredly NOT Mozilla/4.0 alone!
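
As a hedged sketch of combining the two signals: the "Crawler" token comes from the UA in the first post, while the 192.0.2.x range is only a placeholder (a reserved documentation block), since the real IP range was never posted in this thread:

```apache
# Placeholder range; the real one would come from the access log.
RewriteEngine on
# Key on the distinctive token anywhere in the UA, not the Mozilla/4.0 prefix.
RewriteCond %{HTTP_USER_AGENT} Crawler [NC]
# Consecutive conditions are ANDed, so this narrows the block to one range.
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
RewriteRule .* - [F]
```

Adding [OR] to the first condition's flags would instead block on either signal alone.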

Mario155 wrote: "Yesterday I received 13,000 spider hits on my main site. That's got to be too much right?"

Nobody here has any notion of the size of your site (number of pages) and/or what is normal traffic on your site.

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved