Forum Moderators: Ocean10000 & incrediBILL

Is this spider a spam bot?

     
5:30 pm on Mar 8, 2012 (gmt 0)

New User

joined:Mar 8, 2012
posts:26
votes: 0


Mozilla/4.0 (compatible; Crawler; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)

That keeps showing up on my site, often giving thousands of hits a day. I am trying to block it but have had no success yet. I put Disallow: / in my robots.txt, but this one is still getting through.

Can someone tell me the code to block this bot in my htaccess file? I am wondering what I should put in to block it, because it is not a simple name like "bingbot". Is all I need to put in Mozilla/4.0, or maybe Mozilla/4.0*?

Or do I have to put the entire thing on my htaccess including what is in the parenthesis?

Also what is the exact code structure you use to block bots on htaccess? I have seen many examples, can you paste exactly how I should code this with my Mozilla bot code?

RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4.0 [OR]
RewriteCond %{HTTP_USER_AGENT} ^bingbot
RewriteRule ^.* - [F]
RewriteRule !^http://[^/.]\.your-site.com.* - [F]

Is this the correct code to use? Or am I missing anything?

thank you
8:26 pm on Mar 8, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


It's an error to focus upon Mozilla/4.0

Rather, you should focus upon the word "crawler", which will catch more pests than this one.

Not sure why you want Bing denied?

One way:
RewriteCond %{HTTP_USER_AGENT} ^Crawler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Bingbot [NC]
RewriteRule .* - [F]

another way:
RewriteCond %{HTTP_USER_AGENT} ^(Bing|Crawler) [NC]
RewriteRule .* - [F]

In the latter example, you may use multiple lines, keeping 6-8 words organized on each line.
Separate the subsequent lines with [OR], with the exception of the last Condition, which must NOT have an [OR].
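
As a rough sketch of that multi-line layout (WordOne through WordSix are made-up placeholders, not real bot names, and the patterns are left unanchored so each word may appear anywhere in the UA):

```apache
RewriteEngine On
# WordOne ... WordSix are hypothetical placeholders, not real bot names
RewriteCond %{HTTP_USER_AGENT} (WordOne|WordTwo|WordThree) [NC,OR]
# The last condition carries no OR flag
RewriteCond %{HTTP_USER_AGENT} (WordFour|WordFive|WordSix) [NC]
RewriteRule .* - [F]
```

If the final condition also carried [OR], the OR chain would have nothing after it to close it, which is why the flag is dropped on the last line.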

[edited by: wilderness at 8:32 pm (utc) on Mar 8, 2012]

8:28 pm on Mar 8, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


FWIW, and in order to benefit others in this forum, you should include the IP range as well, while obscuring the Class D number (the last octet).

Some of these pests aren't even worth bothering with the UA and may be mass denied via IP.
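
A minimal sketch of denying by IP alone, assuming mod_rewrite is available. The 192.0.2.x range is a reserved documentation range standing in for the real one, which has not been posted in this thread:

```apache
RewriteEngine On
# 192.0.2.x is a placeholder (documentation range), not the bot's actual range
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
RewriteRule .* - [F]
```

Here the leading anchor is appropriate, because an IP range is matched from the first octet onward.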
10:03 pm on Mar 8, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:6130
votes: 277


Other info: there are no "spam bots" ... spam is a different kind of ugly on the web, but all bots are scrapers, even the ones we want to come get our stuff.
1:30 am on Mar 9, 2012 (gmt 0)

New User

joined:Mar 8, 2012
posts:26
votes: 0


Well, I have noticed thousands of hits coming from Bing Robot lately. I wasn't sure if I should let them continue to do that, but I will let it go for now. So, should I make my code exactly like I have it below? Where it says "RewriteRule", should I just leave it as .* then? I saw other examples of this code with site URLs in there, but if I don't need a URL I can leave it this way.

RewriteCond %{HTTP_USER_AGENT} ^Crawler [NC]
RewriteRule .* - [F]

thanks
1:43 am on Mar 9, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


My apologies.

You should omit the leading anchor (begins with).
By doing so, you use "contains" and the word may be located anywhere within the UA.

#turn on Rewrite, if NOT done previously
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} Crawler [NC]
RewriteRule .* - [F]
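
For the UA in the first post, the difference can be annotated as below. The condition shown is an optional, narrower variant (an assumption for illustration, not something tested against the OP's logs) that matches only the "compatible; Crawler;" fragment. Spaces inside a pattern have to be escaped with a backslash.

```apache
# UA: Mozilla/4.0 (compatible; Crawler; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)
#   ^Crawler  no match: the UA starts with "Mozilla", not "Crawler"
#   Crawler   match: the word appears later in the string
RewriteEngine On
# Narrower variant: only UAs containing "compatible; Crawler;"
RewriteCond %{HTTP_USER_AGENT} compatible;\ Crawler; [NC]
RewriteRule .* - [F]
```

The narrower form catches fewer other pests, so it is a trade-off between precision and coverage.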
2:04 am on Mar 9, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:5797
votes: 64


Mario155 you should consider allowing the Big 4 SEs but filter them by IP range. There are many imposters out there. Making sure they are who they say they are is a must nowadays. Example:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{REMOTE_ADDR} !^66\.249\.[6-9][0-9]\.
RewriteRule !^robots\.txt$ - [F]
RewriteCond %{HTTP_USER_AGENT} (Bingbot|Bing\ Mobile\ |msnbot)
RewriteCond %{REMOTE_ADDR} !^65\.5[2-5]\.
RewriteCond %{REMOTE_ADDR} !^70\.37\.
RewriteCond %{REMOTE_ADDR} !^131\.10[67]\.
RewriteCond %{REMOTE_ADDR} !^157\.[45][0-9]\.
RewriteCond %{REMOTE_ADDR} !^207\.46\.
RewriteCond %{REMOTE_ADDR} !^207\.[67][0-9]\.
RewriteRule !^robots\.txt$ - [F]
RewriteCond %{HTTP_USER_AGENT} Slurp [NC]
RewriteCond %{REMOTE_ADDR} !^67\.195\.
RewriteCond %{REMOTE_ADDR} !^72\.30\.
RewriteCond %{REMOTE_ADDR} !^74\.6\.
RewriteCond %{REMOTE_ADDR} !^98\.13[6-9]\.
RewriteCond %{REMOTE_ADDR} !^202\.160\.1[7-9][0-9]\.
RewriteCond %{REMOTE_ADDR} !^203\.209\.2[2-5][0-9]\.
RewriteRule !^robots\.txt$ - [F]
RewriteCond %{HTTP_USER_AGENT} Yandex(Antivirus|Bot|Images|Media) [NC]
RewriteCond %{REMOTE_ADDR} !^77\.88\.[45][0-3]\.
RewriteCond %{REMOTE_ADDR} !^93\.158\.14[67]\.
RewriteCond %{REMOTE_ADDR} !^93\.158\.153\.
RewriteCond %{REMOTE_ADDR} !^95\.108\.[12][1-5][0-9]\.
RewriteCond %{REMOTE_ADDR} !^178\.154\.[12][0-9][0-9]\.
RewriteCond %{REMOTE_ADDR} !^199\.21\.9[6-9]\.
RewriteRule !^robots\.txt$ - [F]
3:59 am on Mar 9, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


I'm disappointed, keyplyr.

Here I thought you had a tight rein, and you're letting in 131.10-, the Slurp APNICs, and a host of Yandex IPs, all from Euro.

What's this world coming to ;)
5:02 am on Mar 9, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:5797
votes: 64




wilderness - as I said, this was an example for the OP. My code is more restrictive but I'll let the OP tweak the IP ranges to suit his/her needs. Not all sites are the same ya know :)

MSN's 131.10* used to be used for legit purposes. Although it has fallen into the stealth hit-and-run category nowadays, I have still allowed the range because I block the nefarious requests by other means. But you may be right, it probably should be omitted.

And FYI Europe represents 30% of my sales and Yandex sends triple digit daily traffic my way.
6:09 am on Mar 9, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:5797
votes: 64


Mario155, as far as the generic "crawler", there are many agents using "crawler" and "spider" that you may find beneficial to your site, so you'll need to allow some of these - I have something like this (if wilderness approves that is):

RewriteCond %{HTTP_USER_AGENT} (crawl|spider) [NC]
RewriteCond %{HTTP_USER_AGENT} !^add allowed UAs that start with the name here
RewriteCond %{HTTP_USER_AGENT} !add allowed UAs that have name in the middle here
RewriteCond %{REMOTE_ADDR} !^add allowed IP range here
RewriteCond %{REMOTE_ADDR} !^add allowed IP range here
RewriteRule !^robots\.txt$ - [F]
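
Filled in, the template might look like the sketch below. Every name and range in it is hypothetical: ExampleCrawler and FriendlySpider are invented UA names, and 192.0.2.x and 198.51.100.x are reserved documentation ranges, so all of them would need to be replaced with values from your own logs.

```apache
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (crawl|spider) [NC]
# Hypothetical allowed UA that starts with the name
RewriteCond %{HTTP_USER_AGENT} !^ExampleCrawler
# Hypothetical allowed UA that has the name in the middle
RewriteCond %{HTTP_USER_AGENT} !FriendlySpider
# Placeholder allowed IP ranges (documentation ranges)
RewriteCond %{REMOTE_ADDR} !^192\.0\.2\.
RewriteCond %{REMOTE_ADDR} !^198\.51\.100\.
RewriteRule !^robots\.txt$ - [F]
```

Conditions without [OR] are ANDed, so a request is forbidden only when the UA contains "crawl" or "spider" and matches none of the allowed names or ranges. Everyone can still fetch robots.txt.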
2:41 pm on Mar 9, 2012 (gmt 0)

New User

joined:Mar 8, 2012
posts:26
votes: 0


Yesterday I received 13,000 spider hits on my main site. That's got to be too much right? 4500 of them were from Bing Robot, and 2300 from that Mozilla/4.0 thing.

I know I want to allow Google, Yahoo, and a few others access, but Mozilla/4.0 has got to be a bad spider right?
2:59 pm on Mar 9, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


"Mozilla/4.0 has got to be a bad spider right?"


As I explained previously, this term is the wrong thing to focus upon.

Rather, you should focus upon the IP and other portions of the UA.

Assuredly NOT Mozilla/4.0 alone!

"Yesterday I received 13,000 spider hits on my main site. That's got to be too much right?"


Nobody here has any notion of the size of your site (number of pages) and/or what is normal traffic on your site.
 
