Apache Web Server Forum

    
.htaccess block for MJ12bot
classifieds

10+ Year Member



 
Msg#: 3603824 posted 10:16 am on Mar 18, 2008 (gmt 0)

I keep getting hit by multiple scrapers using "MJ12bot." I've added the UA string to my .htaccess filters but for some reason it's not matching.

For example, the httpd log file contains:
220.246.54.157 - - [18/Mar/2008:05:11:32 -0400] "GET /chantilly_va-rs4359/ HTTP/1.1" 200 17577 "http://www.domain.com/chantilly_va-rs4359" "Mozilla/5.0 (compatible; MJ12bot/v1.2.1; hxxp://www.majestic12.co.uk/bot.php?+)"

and my .htaccess contains:
RewriteCond %{HTTP_USER_AGENT} ^MJ12bo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MJ12bot [NC,OR]
RewriteRule .* - [F]

But it's not matching and I'm forced to use a "deny from" on the IP address range.

Any suggestions or recommendations would be greatly appreciated. These things keep overloading my server and it's taking lots of time away from other projects.

 

Receptional Andy



 
Msg#: 3603824 posted 10:27 am on Mar 18, 2008 (gmt 0)

I think the problem is that you've start-anchored the pattern, so it only matches when the UA starts with MJ12bot. The UA in your log starts with "Mozilla/5.0".

I'm no expert, but try

RewriteCond %{HTTP_USER_AGENT} ^.*MJ12bot

Incidentally, I'm not sure if the bot you're seeing is actually from MJ12 (they've posted about fake versions [majestic12.co.uk]). Still, that bot has been around for years without seemingly offering anything useful to site owners.

[edited by: Receptional_Andy at 10:31 am (utc) on Mar. 18, 2008]

classifieds

10+ Year Member



 
Msg#: 3603824 posted 10:58 am on Mar 18, 2008 (gmt 0)

Thanks for the suggestion.

I added it so I should know in the next few hours if it works or not.

btw, IMHO the only good bot is a dead bot (minus msn, googlebot and slurp of course).

Receptional Andy



 
Msg#: 3603824 posted 11:19 am on Mar 18, 2008 (gmt 0)

Just change the user agent yourself to test. If you're a Firefox user, try the "User Agent Switcher" add-on. Otherwise there are online tools to achieve the same thing.

jdMorgan

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 3603824 posted 4:52 pm on Mar 18, 2008 (gmt 0)

Rather than using the "^.*" subpattern, you can just remove the start-anchor:

RewriteCond %{HTTP_USER_AGENT} MJ12bot

This is also true for end-anchors: Instead of matching "something.*$" just use "something" as the pattern.
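To see the effect of the start-anchor outside Apache, here is a minimal sketch that runs the three patterns from this thread against the logged user-agent string. It assumes Python's `re.search` is a fair stand-in for the unanchored matching RewriteCond does, with `re.IGNORECASE` playing the role of [NC]; the helper name `cond_matches` is made up for the illustration.

```python
import re

# User-agent string copied from the log line quoted in the first post.
UA = "Mozilla/5.0 (compatible; MJ12bot/v1.2.1; hxxp://www.majestic12.co.uk/bot.php?+)"


def cond_matches(pattern: str, user_agent: str = UA) -> bool:
    """Hypothetical helper: True if the pattern is found anywhere in the UA.

    re.search scans the whole string, so a pattern only has to match at the
    start if it says so itself with "^". re.IGNORECASE mimics [NC].
    """
    return re.search(pattern, user_agent, re.IGNORECASE) is not None


anchored = cond_matches(r"^MJ12bot")    # UA starts with "Mozilla", so no match
prefixed = cond_matches(r"^.*MJ12bot")  # matches, but "^.*" adds nothing
unanchored = cond_matches(r"MJ12bot")   # matches anywhere in the string

print(anchored, prefixed, unanchored)   # False True True
```

The second and third patterns accept exactly the same strings, which is why dropping "^.*" is the simpler choice.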

Note that MJ12bot is a legitimate robot which reads and obeys robots.txt. However, it is currently being spoofed by others. There's a post by the owner in our robots.txt forum describing how to distinguish the legitimate MJ12bot from the spoofs.

Jim

Receptional Andy



 
Msg#: 3603824 posted 7:42 pm on Mar 18, 2008 (gmt 0)

Rather than using the "^.*" subpattern, you can just remove the start-anchor

Thanks for the correction, Jim. When I posted mine I had a nagging thought telling me there was a much better way, but I couldn't figure it out at the time ;)

Lord Majestic

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3603824 posted 8:02 pm on Mar 18, 2008 (gmt 0)

Hi,

That looks like one of ours, though I can't say for sure as the IP does not seem to match exactly. I would need to see the bot's request for robots.txt to be sure.

If you want to stop it crawling your site, all you need to do is add it to robots.txt:

User-agent: MJ12bot
Disallow: /

If you want extra protection, I can add your site to a special no-crawl list. We can discuss it via sticky if you want.

Blocking by IP is not a smart move, and generally using .htaccess to block bots that support robots.txt is not very smart either. At the very least, allow robots.txt to be fetched so that the block can be implemented efficiently.
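The "allow robots.txt to be fetched" advice amounts to one exception in front of the user-agent test. In mod_rewrite that would probably be an extra condition along the lines of `RewriteCond %{REQUEST_URI} !^/robots\.txt$` ahead of the rule, though that line is untested here. The sketch below only models the decision in Python; `should_forbid` and the sample browser UA are made up for the illustration.

```python
import re

# Unanchored, case-insensitive pattern, as suggested earlier in the thread.
BOT_PATTERN = re.compile(r"MJ12bot", re.IGNORECASE)


def should_forbid(user_agent: str, path: str) -> bool:
    """Hypothetical model of the block: True means answer 403 Forbidden."""
    if path == "/robots.txt":
        # Always let robots.txt through so a compliant bot can read
        # its Disallow rule and stop requesting pages at all.
        return False
    return BOT_PATTERN.search(user_agent) is not None


UA = "Mozilla/5.0 (compatible; MJ12bot/v1.2.1; hxxp://www.majestic12.co.uk/bot.php?+)"

page_forbidden = should_forbid(UA, "/chantilly_va-rs4359/")
robots_forbidden = should_forbid(UA, "/robots.txt")
browser_forbidden = should_forbid("Mozilla/5.0 (Windows NT 6.1)", "/")

print(page_forbidden, robots_forbidden, browser_forbidden)  # True False False
```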

We did have a problem with fake bots pretending to be us. However, this seems to have stopped in late January 2008, and I have not had any fake bot reports since then.

So, to reiterate: we obey robots.txt, including the Crawl-Delay parameter, which can help slow down crawling. The new bot now in beta testing also supports GZIP to reduce bandwidth usage on sites. If you want to block us, please please please use robots.txt. This will be best for you and us.
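One way to sanity-check robots.txt rules like these before deploying them is to feed them to a parser. The sketch below uses Python's standard `urllib.robotparser`, so it shows how that parser reads the rules, not how MJ12bot itself does. The 10-second delay is an arbitrary example value and the `parse` helper is made up for the illustration.

```python
from urllib.robotparser import RobotFileParser


def parse(text: str) -> RobotFileParser:
    """Hypothetical helper: build a parser from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# The full block from the post above.
block_all = parse("User-agent: MJ12bot\nDisallow: /\n")
bot_allowed = block_all.can_fetch("MJ12bot", "/chantilly_va-rs4359/")
others_allowed = block_all.can_fetch("Googlebot", "/chantilly_va-rs4359/")

# Slowing the bot down instead of blocking it (delay value is an assumption).
slow_down = parse("User-agent: MJ12bot\nCrawl-delay: 10\n")
delay = slow_down.crawl_delay("MJ12bot")
still_allowed = slow_down.can_fetch("MJ12bot", "/")

print(bot_allowed, others_allowed, delay, still_allowed)  # False True 10 True
```

Note that the rules name only MJ12bot, so other crawlers are unaffected either way.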

Edit - Receptional Andy: we released a big index last month that I think is beneficial to site owners. I can't post much about it here, but those who seek shall find.

[edited by: Lord_Majestic at 8:04 pm (utc) on Mar. 18, 2008]
