
Search Engine Spider and User Agent Identification Forum

This 38 message thread spans 2 pages: page 2 of 2.
MJ12bot
When you're a hammer, everything looks like a nail
Pfui




msg:3937582
 2:47 am on Jun 21, 2009 (gmt 0)

This distributed bot has been discussed (& kicked; & its creator argued-with) for years [google.com].* As the following sampling shows, the MJ12bot's misbehaviors remain unchanged:

UA: Mozilla/5.0 (compatible; MJ12bot/v1.2.5; [majestic12.co.uk...]

06-12: From: Server Farm
robots.txt? NO

UA: Mozilla/5.0 (compatible; MJ12bot/v1.2.4; [majestic12.co.uk...]

05-18: From: Server Farm
robots.txt? NO

05-07: From: . (that's a dot; misconfigured server)
robots.txt? NO

05-07: From: Server Farm
robots.txt? NO

05-02: From: 205.209.158.num
robots.txt? YES
Fake referer? YES

04-11: From: . (misconfigured server)
robots.txt? NO

Seems to me the long-standing, oft-proffered claim that the bot obeys robots.txt is specious/misleading when the MJ12bot doesn't ask for it in the first place.

[edited by: incrediBILL at 3:22 am (utc) on June 21, 2009]
[edit reason] Removed URL [/edit]
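
For anyone who wants to repeat the "robots.txt? YES/NO" tally above against their own logs, here is a minimal sketch. It assumes the Apache/Nginx "combined" log format, and the function name `robots_requests_by_host` is made up for illustration. It only checks whether each client claiming to be MJ12bot ever asked for /robots.txt; it cannot tell a genuine bot from a fake one.

```python
import re
from collections import defaultdict

# Assumption: "combined" access log format. Adjust the pattern for other formats.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def robots_requests_by_host(lines, bot_token="MJ12bot"):
    """Map each client host whose UA contains bot_token to True/False:
    True if that host requested /robots.txt at least once in the given lines."""
    asked = defaultdict(bool)
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or bot_token not in m.group("agent"):
            continue  # unparsable line, or some other user agent
        host = m.group("host")
        path = m.group("path").split("?")[0]  # ignore any query string
        # Touching asked[host] records the host even if it never asks.
        asked[host] = asked[host] or path == "/robots.txt"
    return dict(asked)
```

Feed it an open log file, e.g. `robots_requests_by_host(open("access.log"))`, where the file name is a placeholder. Hosts mapped to False are the "robots.txt? NO" cases.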

 

GaryK




msg:3940506
 9:00 pm on Jun 25, 2009 (gmt 0)

Let me know when it's ready too.

[edited by: incrediBILL at 9:28 pm (utc) on June 25, 2009]
[edit reason] fixed formatting [/edit]

Lord Majestic




msg:3940530
 9:40 pm on Jun 25, 2009 (gmt 0)

I'll post in this thread, guys. Thanks for your feedback; I hope that once this is done you will have some extra confidence that we are the good guys in this...

incrediBILL




msg:3959652
 1:24 am on Jul 26, 2009 (gmt 0)

Any news?

Lord Majestic




msg:3959812
 3:33 pm on Jul 26, 2009 (gmt 0)

Sent PM to Bill, but will post here too - I have not forgotten about this stuff; it is a commitment that I will deliver on. I just had lots of other things to finish this month - very busy :(

I am going to send Bill info on the proposed scheme; it can be discussed here as well (if anyone is interested). Once you are happy that the proposed way is acceptable, I'll get it implemented. I think there is a good chance that we'll have this working (in test phase) by mid-August.

System
redhat



msg:3983456
 7:13 pm on Aug 31, 2009 (gmt 0)

The following 6 messages were cut out to new thread by incredibill. New thread at: search_engine_spiders/3983454.htm [webmasterworld.com]
8:52 am on Sep. 3, 2009 (PST -8)

System
redhat



msg:3985386
 7:13 pm on Aug 31, 2009 (gmt 0)

The following 41 messages were cut out to new thread by incredibill. New thread at: search_engine_spiders/3985384.htm [webmasterworld.com]
12:12 am on Sep. 7, 2009 (PST -8)

Pfui




msg:3985375
 7:26 am on Sep 7, 2009 (gmt 0)

Okay, I'm confused, sorry. I've yet to register anywhere, to either opt-in or opt-out of anything. A few hours ago, a UA named "MJ12bot" from two disparate servers hit two sites 90 minutes apart, each time asking for robots.txt. Faked referers contained different 27-character strings and each site's name --

SITE #1:

UA: Mozilla/5.0 (compatible; MJ12bot/v1.2.4; [majestic12.co.uk...]
Host: [yada-yada].craw.blueyonder.co.uk

Referers:

[majestic12.co.uk...]
[majestic12.co.uk...]

SITE #2:

UA: Mozilla/5.0 (compatible; MJ12bot/v1.2.5; [majestic12.co.uk...]
IP: 205.209.170.nn

Referer:

[majestic12.co.uk...]

QUESTIONS:

If those two different 27-character strings (10 nos hyphen 16 nos) were/are each site's 'key' -- how'd two, erm, somebody elses get them? Or were those just more fake bots and fake referers? Or--?

TIA

[edited by: incrediBILL at 8:15 am (utc) on Sep. 7, 2009]
[edit reason] clean up [/edit]

Lord Majestic




msg:3985673
 7:18 pm on Sep 7, 2009 (gmt 0)

If those two different 27-character strings (10 nos hyphen 16 nos) were/are each site's 'key'

These numbers are not site keys - they are debug params that we only send when getting robots.txt. They allow us to identify the "bucket" those urls were in, as well as the exact crawler that crawled it. A bucket can contain up to 10k urls from all sorts of sites, and the same domain can be present in multiple "buckets", hence these numbers can't be used for crawler/site identification, even though I've never seen fakes show such a referer (they usually don't bother getting robots.txt in the first place).
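
A rough sketch of how one might spot that debug param in a logged referer, going only by the shape described in this thread (10 digits, a hyphen, 16 digits). Where the string sits inside the referer URL isn't stated here, so the sketch searches the whole referer. `find_debug_token` is a hypothetical helper name. As explained above, a match is not a site key and doesn't identify a crawler or a site, so treat it as a weak hint at best.

```python
import re

# Shape taken from the thread: 10 digits, hyphen, 16 digits (27 characters).
# The lookarounds stop longer digit runs from matching by accident.
DEBUG_TOKEN = re.compile(r"(?<!\d)\d{10}-\d{16}(?!\d)")


def find_debug_token(referer):
    """Return the first 10-16 digit token found in the referer, or None."""
    m = DEBUG_TOKEN.search(referer)
    return m.group(0) if m else None
```

A referer of "-" or one without such a string gives None. Since fakes could copy a real referer, absence or presence alone shouldn't drive a block decision.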

The new "ident" feature is user-configurable, and it will be sent consistently by the new version of the bot: v1.3.0 (or higher).
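
To check which hits claim to come from a version that should carry the new "ident", one could pull the version number out of the UA string. This is a minimal sketch based only on the UA shape quoted in this thread ("MJ12bot/v1.2.5"); the function names are made up, and how the ident itself is sent isn't described here, so the code doesn't look for it. The UA is trivially forged, so this says what a client claims, not what it is.

```python
import re

# Matches the version part of UAs like "... MJ12bot/v1.2.5; ..."
MJ12_VERSION = re.compile(r"MJ12bot/v(\d+)\.(\d+)\.(\d+)")


def mj12bot_version(user_agent):
    """Return the claimed version as a tuple of ints, or None if not present."""
    m = MJ12_VERSION.search(user_agent)
    return tuple(int(part) for part in m.groups()) if m else None


def claims_ident_capable(user_agent, minimum=(1, 3, 0)):
    """True if the UA claims MJ12bot v1.3.0 or higher (per the post above)."""
    version = mj12bot_version(user_agent)
    # Tuple comparison is numeric, so v1.10.0 correctly ranks above v1.3.0.
    return version is not None and version >= minimum
```

Comparing tuples of ints rather than the raw strings avoids the usual trap where "1.10.0" sorts below "1.3.0".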
