Is this spider a spam bot?

     
5:30 pm on Mar 8, 2012 (gmt 0)

New User

joined:Mar 8, 2012
posts:26
votes: 0


Mozilla/4.0 (compatible; Crawler; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)

This user agent keeps showing up on my site, often with thousands of hits a day. I am trying to block it but have had no success yet. I added Disallow: / to my robots.txt, but it is still getting through.

Can someone tell me the code to block this bot in my htaccess file? I am wondering what I should put in to block it, because it is not a simple name like "bingbot". Is all I need to put in Mozilla/4.0, or maybe Mozilla/4.0*?

Or do I have to put the entire string in my htaccess, including what is in the parentheses?

Also, what is the exact code structure you use to block bots in htaccess? I have seen many examples. Can you paste exactly how I should code this with my Mozilla bot string?

RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4.0 [OR]
RewriteCond %{HTTP_USER_AGENT} ^bingbot
RewriteRule ^.* - [F]
RewriteRule !^http://[^/.]\.your-site.com.* - [F]

Is this the correct code to use? Or am I missing anything?

thank you
8:26 pm on Mar 8, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


It's an error to focus upon Mozilla/4.0

Rather, you should focus upon the word "crawler", which will catch more pests than this one.

Not sure why you want Bing denied?

One way:
RewriteCond %{HTTP_USER_AGENT} ^Crawler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Bingbot [NC]
RewriteRule .* - [F]

another way:
RewriteCond %{HTTP_USER_AGENT} ^(Bing|Crawler) [NC]
RewriteRule .* - [F]

In the latter example, you may use multiple lines, keeping 6-8 words organized on each line.
Then separate the subsequent lines with [OR], with the exception of the last condition, which will NOT have an [OR].
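
For anyone who wants to see how the [OR] chaining behaves before touching a live .htaccess, here is a minimal Python sketch. It assumes Python's re module is close enough to Apache's regex engine for patterns this simple. The first pattern is copied as posted above; the second line and its bot names (ExampleBotA, ExampleBotB) are hypothetical and only show where a further line would go.

```python
import re

# Each entry stands for one RewriteCond line. In .htaccess every line
# except the last would carry [OR]; [NC] is modelled with IGNORECASE.
CONDITION_LINES = [
    r"^(Bing|Crawler)",              # as posted in this thread
    r"^(ExampleBotA|ExampleBotB)",   # hypothetical names, illustration only
]


def is_denied(user_agent: str) -> bool:
    """Return True if any one condition line matches the user agent."""
    return any(
        re.search(pattern, user_agent, re.IGNORECASE)
        for pattern in CONDITION_LINES
    )


if __name__ == "__main__":
    for ua in ("bingbot/2.0", "ExampleBotB 1.0", "Mozilla/5.0 (Windows NT 6.1)"):
        print(ua, "->", "denied" if is_denied(ua) else "allowed")
```

Without [OR], consecutive conditions are ANDed, so every line would have to match the same user agent and the rule would almost never fire.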

[edited by: wilderness at 8:32 pm (utc) on Mar 8, 2012]

8:28 pm on Mar 8, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


FWIW, and in order to benefit others in this forum, you should include the IP range as well, while obscuring the Class D number (the last octet).

Some of these pests aren't even worth bothering with the UA and may be mass denied via IP.
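
A rough Python sketch of both ideas, obscuring the last octet before posting an address and checking an address against a denied range. The range 203.0.113.0/24 is a reserved documentation block used here as a stand-in, not a range reported in this thread, and the function names are made up for illustration.

```python
import ipaddress

# Hypothetical denied range (a reserved documentation block).
DENIED_RANGE = ipaddress.ip_network("203.0.113.0/24")


def obscure_last_octet(ip: str) -> str:
    """Replace the fourth octet of an IPv4 address with 'xxx' for posting."""
    first_three = ip.split(".")[:3]
    return ".".join(first_three) + ".xxx"


def is_mass_denied(ip: str) -> bool:
    """Return True if the address falls inside the denied range."""
    return ipaddress.ip_address(ip) in DENIED_RANGE


if __name__ == "__main__":
    print(obscure_last_octet("203.0.113.57"))
    print(is_mass_denied("203.0.113.57"), is_mass_denied("198.51.100.1"))
```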
10:03 pm on Mar 8, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:7055
votes: 424


Other info: there are no "spam bots" ... spam is a different kind of ugly on the web, but all bots are scrapers, even the ones we want to come get our stuff.
1:30 am on Mar 9, 2012 (gmt 0)

New User

joined:Mar 8, 2012
posts:26
votes: 0


Well, I have noticed thousands of hits coming from Bing Robot lately. I wasn't sure if I should let them continue to do that, but I will let it go for now. So, should I make my code exactly like I have it below? Where it says "RewriteRule", should I just leave it as .* then? I saw other examples of this code with site URLs in there, but if I don't need a URL I can leave it this way.

RewriteCond %{HTTP_USER_AGENT} ^Crawler [NC]
RewriteRule .* - [F]

thanks
1:43 am on Mar 9, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


My apologies.

You should omit the leading anchor (begins with).
By doing so, you use "contains" and the word may be located anywhere within the UA.

#turn on Rewrite, if NOT done previously
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} Crawler [NC]
RewriteRule .* - [F]
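
To see the difference between "begins with" and "contains" without testing on a live server, here is a small Python sketch run against the user agent from the first post. It assumes Python's re module behaves like Apache's regex engine for patterns this simple; the matches helper is a made-up name.

```python
import re

# The user agent string from the first post in this thread.
UA = ("Mozilla/4.0 (compatible; Crawler; MSIE 7.0; Windows NT 5.1; "
      "SV1; .NET CLR 2.0.50727)")

# A typical MSIE 7 browser string, to show why Mozilla/4.0 is too broad.
ORDINARY_BROWSER = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"


def matches(pattern: str, user_agent: str) -> bool:
    """Mimic a RewriteCond pattern test with the [NC] flag."""
    return re.search(pattern, user_agent, re.IGNORECASE) is not None


anchored = matches(r"^Crawler", UA)    # "begins with": the UA starts with Mozilla/4.0
unanchored = matches(r"Crawler", UA)   # "contains": found inside the parentheses
too_broad = matches(r"^Mozilla/4\.0", ORDINARY_BROWSER)  # hits a real browser

if __name__ == "__main__":
    print("anchored:", anchored)
    print("unanchored:", unanchored)
    print("Mozilla/4.0 catches ordinary browser:", too_broad)
```

The anchored pattern misses this bot because the word sits in the middle of the string, while a pattern on Mozilla/4.0 alone would also catch ordinary browsers.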
2:04 am on Mar 9, 2012 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:7009
votes: 175


Mario155, you should consider allowing the Big 4 SEs but filtering them by IP range. There are many impostors out there, so making sure they are who they say they are is a must nowadays. Example:

RewriteEngine On

# Googlebot: forbid if the UA claims Googlebot but the IP is outside the listed range
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{REMOTE_ADDR} !^66\.249\.[6-9][0-9]\.
RewriteRule !^robots\.txt$ - [F]

# Bing/MSN: forbid if the UA claims Bing but the IP matches none of the listed ranges
RewriteCond %{HTTP_USER_AGENT} (Bingbot|Bing\ Mobile\ |msnbot)
RewriteCond %{REMOTE_ADDR} !^65\.5[2-5]\.
RewriteCond %{REMOTE_ADDR} !^70\.37\.
RewriteCond %{REMOTE_ADDR} !^131\.10[67]\.
RewriteCond %{REMOTE_ADDR} !^157\.[45][0-9]\.
RewriteCond %{REMOTE_ADDR} !^207\.46\.
RewriteCond %{REMOTE_ADDR} !^207\.[67][0-9]\.
RewriteRule !^robots\.txt$ - [F]

# Yahoo Slurp: same idea
RewriteCond %{HTTP_USER_AGENT} Slurp [NC]
RewriteCond %{REMOTE_ADDR} !^67\.195\.
RewriteCond %{REMOTE_ADDR} !^72\.30\.
RewriteCond %{REMOTE_ADDR} !^74\.6\.
RewriteCond %{REMOTE_ADDR} !^98\.13[6-9]\.
RewriteCond %{REMOTE_ADDR} !^202\.160\.1[7-9][0-9]\.
RewriteCond %{REMOTE_ADDR} !^203\.209\.2[2-5][0-9]\.
RewriteRule !^robots\.txt$ - [F]

# Yandex: same idea
RewriteCond %{HTTP_USER_AGENT} Yandex(Antivirus|Bot|Images|Media) [NC]
RewriteCond %{REMOTE_ADDR} !^77\.88\.[45][0-3]\.
RewriteCond %{REMOTE_ADDR} !^93\.158\.14[67]\.
RewriteCond %{REMOTE_ADDR} !^93\.158\.153\.
RewriteCond %{REMOTE_ADDR} !^95\.108\.[12][1-5][0-9]\.
RewriteCond %{REMOTE_ADDR} !^178\.154\.[12][0-9][0-9]\.
RewriteCond %{REMOTE_ADDR} !^199\.21\.9[6-9]\.
RewriteRule !^robots\.txt$ - [F]
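
The logic of one of these blocks can be sketched in Python as follows, using the Googlebot lines as the example. This is only a model of how the conditions combine, assuming Python's re module is a fair stand-in for Apache's regex engine; the function name is made up, and the IP range is the one from this 2012 example, so it should be checked against current ranges before anyone relies on it.

```python
import re

# Patterns copied from the Googlebot block above.
GOOGLEBOT_UA = re.compile(r"Googlebot", re.IGNORECASE)
GOOGLEBOT_IP = re.compile(r"^66\.249\.[6-9][0-9]\.")


def should_forbid(user_agent: str, remote_addr: str, path: str) -> bool:
    """Return True when the request would receive a 403 from this block."""
    # RewriteRule !^robots\.txt$ exempts robots.txt from the block.
    if path == "robots.txt":
        return False
    claims_google = GOOGLEBOT_UA.search(user_agent) is not None
    from_listed_range = GOOGLEBOT_IP.match(remote_addr) is not None
    # Both conditions are ANDed: claims to be Googlebot AND not from the range.
    return claims_google and not from_listed_range


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print(should_forbid(ua, "66.249.66.1", "index.html"))
    print(should_forbid(ua, "198.51.100.7", "index.html"))
```

Visitors that never claim to be Googlebot pass through this block untouched, whatever their IP.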
3:59 am on Mar 9, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


I'm disappointed, keyplyr.

Here I thought you kept a tight rein, and you're letting in 131.10-, the Slurp APNIC ranges, and a host of Yandex IPs, all from Europe.

What's this world coming to ;)
5:02 am on Mar 9, 2012 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:7009
votes: 175




wilderness - as I said, this was an example for the OP. My code is more restrictive but I'll let the OP tweak the IP ranges to suit his/her needs. Not all sites are the same ya know :)

MSN's 131.10* used to be used for legit purposes, although it has fallen into the stealth hit-and-run category nowadays. I have still allowed the range because I block the nefarious requests by other means. But you may be right, it probably should be omitted.

And FYI Europe represents 30% of my sales and Yandex sends triple digit daily traffic my way.
6:09 am on Mar 9, 2012 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:7009
votes: 175


Mario155, as for the generic "crawler", there are many agents using "crawler" and "spider" that you may find beneficial to your site, so you'll need to allow some of these. I have something like this (if wilderness approves, that is):

RewriteCond %{HTTP_USER_AGENT} (crawl|spider) [NC]
RewriteCond %{HTTP_USER_AGENT} !^add allowed UAs that start with the name here
RewriteCond %{HTTP_USER_AGENT} !add allowed UAs that have name in the middle here
RewriteCond %{REMOTE_ADDR} !^add allowed IP range here
RewriteCond %{REMOTE_ADDR} !^add allowed IP range here
RewriteRule !^robots\.txt$ - [F]
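
Here is how that template behaves once the placeholders are filled in, sketched in Python. Every allowed name and IP prefix below (FriendlyCrawler, GoodSpider, 192.0.2., 198.51.100.) is hypothetical and stands in for the "add allowed ..." lines; the function name is made up, and Python's re module is assumed to be close enough to Apache's regex engine for these patterns.

```python
import re

GENERIC = re.compile(r"(crawl|spider)", re.IGNORECASE)

# Hypothetical allow-lists standing in for the placeholder lines above.
ALLOWED_UA = [
    re.compile(r"^FriendlyCrawler"),   # allowed UA that starts with the name
    re.compile(r"GoodSpider"),         # allowed UA with the name in the middle
]
ALLOWED_IP = [
    re.compile(r"^192\.0\.2\."),
    re.compile(r"^198\.51\.100\."),
]


def forbids_generic_crawler(user_agent: str, remote_addr: str, path: str) -> bool:
    """Return True when the request would receive a 403 from this block."""
    if path == "robots.txt":               # exempted by the RewriteRule pattern
        return False
    if not GENERIC.search(user_agent):     # first condition must match
        return False
    # The negated conditions are ANDed, so a hit on any allow-list
    # entry is enough to cancel the block.
    if any(p.search(user_agent) for p in ALLOWED_UA):
        return False
    if any(p.search(remote_addr) for p in ALLOWED_IP):
        return False
    return True


if __name__ == "__main__":
    print(forbids_generic_crawler("SomeRandomSpider/1.0", "203.0.113.9", "page.html"))
    print(forbids_generic_crawler("FriendlyCrawler/2.0", "203.0.113.9", "page.html"))
```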
2:41 pm on Mar 9, 2012 (gmt 0)

New User

joined:Mar 8, 2012
posts:26
votes: 0


Yesterday I received 13,000 spider hits on my main site. That's got to be too much, right? 4,500 of them were from Bing Robot, and 2,300 from that Mozilla/4.0 thing.

I know I want to allow Google, Yahoo, and a few others access, but Mozilla/4.0 has got to be a bad spider, right?
2:59 pm on Mar 9, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


"Mozilla/4.0 has got to be a bad spider right?"

As I explained previously, this term is the wrong thing to focus upon.

Rather, you should focus upon the IP and other portions of the UA.

Assuredly NOT Mozilla/4.0 alone!

"Yesterday I received 13,000 spider hits on my main site. That's got to be too much right?"

Nobody here has any notion of the size of your site (number of pages) and/or what is normal traffic on your site.
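
One way to get a baseline is to tally hits per user agent from the access log. The Python sketch below assumes the log is in Apache combined format, where the user agent is the last quoted field; the three sample lines are invented for illustration and are not taken from anyone's real log.

```python
import re
from collections import Counter

# Invented sample lines in Apache combined log format.
SAMPLE_LOG = [
    '203.0.113.5 - - [08/Mar/2012:17:30:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/4.0 (compatible; Crawler; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)"',
    '203.0.113.5 - - [08/Mar/2012:17:30:02 +0000] "GET /a.html HTTP/1.1" 200 900 "-" '
    '"Mozilla/4.0 (compatible; Crawler; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)"',
    '198.51.100.8 - - [08/Mar/2012:17:31:10 +0000] "GET /b.html HTTP/1.1" 200 700 "-" '
    '"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
]

# The user agent is the final double-quoted field on each line.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def hits_per_agent(lines):
    """Count requests per user agent string."""
    counts = Counter()
    for line in lines:
        match = UA_FIELD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for agent, hits in hits_per_agent(SAMPLE_LOG).most_common():
        print(hits, agent)
```

Comparing these counts with the number of pages on the site gives a rough sense of whether a given bot is requesting far more than it reasonably needs.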