
Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 152 message thread spans 6 pages: < < 152 ( 1 2 3 4 [5] 6 > >     
MSN's many cloaked bots. Again.

 11:44 pm on Aug 5, 2010 (gmt 0)

Previously... [webmasterworld.com]

Currently, straight out of my logs...

- - [05/Aug/2010:15:45:09 -0700] "GET /dir/filename.html HTTP/1.1" 403 1468 "-" "-"

No UA, no robots.txt, no REF, no nothing. Not once. Not twice. Not even three times. Try eleven.
08/05 15:45:09 /dir/filename.html
08/05 15:45:20 /dir/filename.html
08/05 15:45:31 /dir/filename.html
08/05 15:45:42 /dir/filename.html
08/05 15:45:53 /dir/filename.html
08/05 15:46:03 /dir/filename.html
08/05 15:46:14 /dir/filename.html
08/05 15:46:25 /dir/filename.html
08/05 15:46:35 /dir/filename.html
08/05 15:46:46 /dir/filename.html
08/05 15:46:57 /dir/filename.html

Same poor file. All hits 403'd because no UA; also because bare MSN IP and not a bona fide MSN bot.
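For anyone who wants to pick this kind of hit out of a raw log, here is a minimal Python sketch. It assumes the Apache "combined" log layout that the sample line above appears to follow. The posted line has its IP stripped, so the address 192.0.2.1 in the sample is a placeholder from the documentation range, and the names `parse_line` and `is_blank_ua` are made up for illustration.

```python
import re
from typing import Optional

# Assumed layout: Apache "combined" format (host ident user [time] "request" status size "referer" "agent").
LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[dict]:
    """Return the named fields of one log line, or None if it does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None


def is_blank_ua(entry: dict) -> bool:
    """True when the User-Agent field is empty or logged as "-"."""
    return entry["agent"].strip() in ("", "-")


# Placeholder IP; the rest of the line is as posted.
sample = ('192.0.2.1 - - [05/Aug/2010:15:45:09 -0700] '
          '"GET /dir/filename.html HTTP/1.1" 403 1468 "-" "-"')
entry = parse_line(sample)
```

With that in place, `is_blank_ua(entry)` is the test that would earn the automatic 403 described above.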



 6:14 am on Jul 5, 2011 (gmt 0)

Color me confused... what SIZE are these 3pgs or 4pgs pages? HOW OFTEN? (weekly)
I like to be indexed weekly... but most of my pages go a bit slower (bigger than most).

MSNbot/Bing come from many IPs, but most only ask for a dozen pages at a time... not the whole site (my experience)

And on pages that have not been updated a nice 304 is given, not the page...

Have you investigated throttling rapid access?


 10:58 am on Jul 5, 2011 (gmt 0)

Every unchanged Status 200 page is given a 304.

The blocked pages in these reports are *not* given a 200. They are given a 403 (fast scrape) or a 503 (slow scrape), plus a tiny text notice. Blocked bots are refused for a week (bit bigger text notice). If they attempt another fast scrape during this time, the timer is reset to zero.

Periodicity varies according to the bot. G has been daily for 9 years (and well-behaved until yesterday). Others vary.

I've investigated throttling rapid access, but have not activated (apart from the bot-blocker, which is a rapid-access blocker, of course). Fast bots run out of pages to request pretty quick, anyway.

I would suggest that the main difference between my & (most?) other sites is that mine tests for, and then blocks-records-reports, abusive activity. If you do not test for it, how can you know whether it is happening or not?
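The blocking rules described in this post can be sketched roughly as below. This is not the poster's actual code. The thresholds `FAST_LIMIT` and `SLOW_LIMIT`, the 60-second window, and the choice to answer 403 for the whole week of the ban are all assumptions made for illustration; only the 403/503 split, the one-week refusal and the timer reset come from the post.

```python
from dataclasses import dataclass, field

BAN_SECONDS = 7 * 24 * 3600  # "refused for a week"
FAST_LIMIT = 5               # assumed: more than 5 pages inside 1 second is a fast scrape
SLOW_LIMIT = 60              # assumed: more than 60 pages inside 60 seconds is a slow scrape


@dataclass
class BotBlocker:
    hits: dict = field(default_factory=dict)    # ip -> timestamps seen in the last 60 s
    banned: dict = field(default_factory=dict)  # ip -> time the current ban started

    def check(self, ip: str, now: float) -> int:
        """Record one page request and return the HTTP status to send."""
        recent = [t for t in self.hits.get(ip, []) if now - t < 60]
        recent.append(now)
        self.hits[ip] = recent
        fast = sum(1 for t in recent if now - t < 1) > FAST_LIMIT
        slow = len(recent) > SLOW_LIMIT

        if ip in self.banned:
            if now - self.banned[ip] < BAN_SECONDS:
                if fast:
                    self.banned[ip] = now  # another fast scrape resets the timer
                return 403                 # assumed status while banned
            del self.banned[ip]            # ban has run its course

        if fast:
            self.banned[ip] = now
            return 403
        if slow:
            self.banned[ip] = now
            return 503
        return 200
```

Timestamps are passed in as plain seconds, so the same class can be fed from a live request handler or replayed over old logs.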


 8:15 pm on Jul 5, 2011 (gmt 0)

I record all legit bot hits in a specific composite-site log (ie across the whole server in one log). I view this log several times a day. I would notice if the rate were more than two or three pages per second.

The same applies to the major bot companies that are using non-bot rDNS - I log all "bad" site hits including scrapers and server farms.

My experience is that the major bots tend not to hit the same IP at the same time: ie they scan one site, wait a while, then scan another site. The msnbot specifically scans a whole site at one sitting (I have no delay factors in most of my robots.txt files) then comes back a few minutes or hours later for another site.

NOTE: This only applies to web PAGES. It does not include images, css, js etc.


 10:43 pm on Jul 6, 2011 (gmt 0)

@AlexK: Bing are on the case: [webmasterworld.com...]

 5:56 am on Jul 7, 2011 (gmt 0)

Thanks for that, g1smd.

6 July abusive activity:
msnbot-65-52-110-43.search.msn.com [forums.modem-help.co.uk] : max 8 pages / sec; 3,143 total pages
msnbot-207-46-13-47.search.msn.com [forums.modem-help.co.uk] : max 7 pages / sec; 25 total pages

I'll believe it when I see it.
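Figures like "max 8 pages / sec; 3,143 total pages" can be produced from per-host timestamps along these lines. This is a sketch only: it assumes the peak is counted per clock second, whereas the reporting system quoted above may well use a sliding window, and the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime


def max_pages_per_second(timestamps):
    """Return (peak hits in any single clock second, total hits) for one host."""
    stamps = list(timestamps)
    per_second = Counter(ts.replace(microsecond=0) for ts in stamps)
    peak = max(per_second.values(), default=0)
    return peak, len(stamps)


# Made-up sample: three page hits in one second, then one in the next.
hits = [datetime(2011, 7, 6, 16, 45, 56)] * 3 + [datetime(2011, 7, 6, 16, 45, 57)]
peak, total = max_pages_per_second(hits)
```

Only page requests should be fed in if the numbers are to be comparable with the ones in this thread, since images, css and js are excluded there.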


 10:22 am on Jul 7, 2011 (gmt 0)

how can you know whether it is happening or not?

Log files reveal all :)


 11:26 pm on Jul 8, 2011 (gmt 0)

Bingdude is out of the building for a couple of weeks, according to another thread here at WebmasterWorld.

I think that time should be used to gather and present a comprehensive list of the detailed issues here in this thread.


 12:48 pm on Jul 9, 2011 (gmt 0)

I'll believe it when I see it.

There have been zero reports from my abuse system on any MSN bot since my last report (see previous post) on 2011-07-06 16:45:56 until the latest at 2011-07-09 05:00:01. It looks like M$ may have pulled the plug on this abusive behaviour by their bots.


 10:44 pm on Jul 9, 2011 (gmt 0)

If that's still true after ~7 days, I'd call it a result.



 9:38 am on Jul 10, 2011 (gmt 0)

After seeing Pfui's postings across all these months since last August, I'd call it a temporary result (but a result none the less).


 12:07 pm on Jul 17, 2011 (gmt 0)

No more problems from msn/bingbot. Nice.

Here's an interesting one:

crawl-31-192-104-174.googlebot.com [forums.modem-help.co.uk] : max: 8 pages / sec; 98 pages

Seasoned webmasters will immediately spot that this has nothing to do with Google (unlike my earlier report). The IP resolves to Navitel Rusconnect Ltd [cidr-report.org] (ASN=AS49335). Just another forgery.
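A forged PTR record like this one is what forward-confirmed reverse DNS is meant to catch: look up the host name for the IP, check the domain, then resolve that name forward and confirm it leads back to the same IP. A minimal Python sketch follows. The function name is hypothetical, the lookups are injectable so it can be tried without touching DNS, and `socket.gethostbyname_ex` covers IPv4 only.

```python
import socket


def verify_crawler(ip, allowed_suffixes,
                   reverse=socket.gethostbyaddr,
                   forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a claimed search engine IP.

    allowed_suffixes might be (".googlebot.com", ".google.com") or
    (".search.msn.com",), depending on which bot is being claimed.
    """
    try:
        host = reverse(ip)[0]           # PTR lookup: the IP owner controls this answer
    except OSError:
        return False
    host = host.rstrip(".").lower()
    if not any(host.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        addresses = forward(host)[2]    # A lookup: the domain owner controls this answer
    except OSError:
        return False
    return ip in addresses
```

The forward step is the one a forger cannot pass, because the A records for googlebot.com names are set by Google, not by whoever owns the IP.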


 1:02 pm on Jul 17, 2011 (gmt 0)

About MSN: Great news!

About the G forgery: That one's getting around... [stopforumspam.com...]

(FWIW: DT's WHOIS shows the Russian Fed IP location as Mir Telematiki Ltd.)

I'm amazed anyone anywhere was able to resolve to crawl-31-192-104-174.googlebot.com for more than a few minutes before G ripped the wheels off its wagon. Another reason to limit the majors' UAs to the majors' IPs and vice-versa.


 6:26 pm on Jul 17, 2011 (gmt 0)

Another reason to limit the majors' UAs to the majors' IPs and vice-versa

Yes. One of the reasons for me reporting it. Folks often need to see the evidence to get the point.

Personally, I do not place any trust in the `major's' IPs. My site is coded to act on behaviour, not rep. Hence reporting MSN in the first place. That was underlined by your OP in this thread all those years ago.


 7:08 pm on Jul 17, 2011 (gmt 0)

If something turns up using any well-known searchengine User Agent but the IP address is wrong, it gets no chance to do anything.

Everything else gets judged on its behaviour when accessing the site, including accesses from searchengine IPs but with non-SE User Agents.
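The two rules in this post boil down to something like the following sketch. The token list and the function name are assumptions for illustration, and `ip_is_verified_se` stands for the result of whatever IP or rDNS check the site already runs.

```python
# Assumed list of User-Agent tokens that claim to be a major search engine.
SE_TOKENS = ("googlebot", "bingbot", "msnbot")


def decide(user_agent: str, ip_is_verified_se: bool) -> str:
    """Apply the rule: SE User Agent from a non-SE IP is refused outright."""
    claims_se = any(token in user_agent.lower() for token in SE_TOKENS)
    if claims_se and not ip_is_verified_se:
        return "deny"
    # Everything else, including SE IPs sending non-SE User Agents,
    # is left to the behaviour checks.
    return "judge_on_behaviour"
```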


 8:47 pm on Jul 17, 2011 (gmt 0)

Ditto. And speaking of 403s for cloaking's-not-okay behavior...

MSN's still in the Twitter-swarming, IP-only, fake UA game. These just in, nine seconds apart:

UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

robots.txt? NO


 12:26 am on Aug 4, 2011 (gmt 0)

This went a bit nuts, today pulling many hundreds of pages, with a gap of 60 to 150 seconds between each one, from

msnbot/2.0b (+http://search.msn.com/msnbot.htm)._

What's with the trailing junk?


 2:06 am on Aug 4, 2011 (gmt 0)

The underscore variant's been around for too long. I block it, with no deleterious effect.
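Spotting the variant is a simple string comparison. A small Python sketch, with a hypothetical function name, that reports whatever follows the documented msnbot string:

```python
KNOWN_MSNBOT = "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"


def trailing_junk(user_agent: str, known: str = KNOWN_MSNBOT) -> str:
    """Return any extra characters after the known UA string, else ''."""
    if user_agent.startswith(known) and user_agent != known:
        return user_agent[len(known):]
    return ""
```

A non-empty return value is the cue to block, if that is the policy.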


 9:02 pm on Aug 4, 2011 (gmt 0)

Same here. I have also mentioned this UA problem and non-DNS bot IPs to bingdude: still waiting for an answer. :(


 1:43 pm on Aug 9, 2011 (gmt 0)

Talk about déjà vu all over again...

Aug. 5, 2010 (OP)
Sep. 16, 2010
Nov. 10, 2010
Those posts detail hits from bare MSN IPs. No UAs, no robots.txt, no REFs, no nothing. Not once. Not twice. Not even three times. Try eleven. To a single file each time.

Eerily, almost exactly a year to the day after my OP:

Aug. 9, 2011
This post details hits from a bare MSN IP. No UA, no robots.txt, no REF, no nothing. Not once. Not twice. Not even three times. Try eleven. To a single file.
08/09 0n:00:53 /dir/filename.html
08/09 0n:01:03 /dir/filename.html
08/09 0n:01:14 /dir/filename.html
08/09 0n:01:25 /dir/filename.html
08/09 0n:01:35 /dir/filename.html
08/09 0n:01:46 /dir/filename.html
08/09 0n:01:57 /dir/filename.html
08/09 0n:02:07 /dir/filename.html
08/09 0n:02:18 /dir/filename.html
08/09 0n:02:29 /dir/filename.html
08/09 0n:02:40 /dir/filename.html

All hits 403'd because no UA; also because bare MSN IP and not a bona fide MSN bot.
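The eleven-hit pattern is regular enough to look for automatically. Below is a minimal Python sketch that groups hits by path and reports runs of closely spaced requests. The 15-second gap, the default of eleven hits and the function name are assumptions chosen to fit the listings in this thread, and the hour in the sample is a placeholder because it is obscured in the post.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_repeat_bursts(hits, min_hits=11, max_gap=timedelta(seconds=15)):
    """hits: iterable of (datetime, path). Return a list of (path, start, count)."""
    by_path = defaultdict(list)
    for ts, path in hits:
        by_path[path].append(ts)

    bursts = []
    for path, times in by_path.items():
        times.sort()
        start, count = times[0], 1
        for prev, cur in zip(times, times[1:]):
            if cur - prev <= max_gap:
                count += 1              # still inside the same run
            else:
                if count >= min_hits:
                    bursts.append((path, start, count))
                start, count = cur, 1   # gap too long: start a new run
        if count >= min_hits:
            bursts.append((path, start, count))
    return bursts


# Minute and second offsets as listed above; the hour (00) is a placeholder.
base = datetime(2011, 8, 9, 0, 0, 53)
offsets = [0, 10, 21, 32, 42, 53, 64, 74, 85, 96, 107]
sample = [(base + timedelta(seconds=s), "/dir/filename.html") for s in offsets]
bursts = find_repeat_bursts(sample)
```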



 10:57 pm on Aug 17, 2011 (gmt 0)

Ironically, while MSN remains mum on running its 'trailing underscore' version of msnbot, the same so-called version was recently faked by an SEO domain (shame on them).

From Project Honey Pot for [projecthoneypot.org]:

msnbot/2.0b (+http://search.msn.com/msnbot.htm)._

I found that because the same IP recently hit me from its Host name with yet another MSN fakery:

Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

robots.txt? NO

Just more examples of why limiting major SE UAs to their respective Hosts/IPs is a Good Thing. Ditto blocking Hosts with SEO in their names...


 7:24 pm on Aug 18, 2011 (gmt 0)

I've had iPower blocked for yonks.

Last from bingdude a week-ish ago: MS are still looking into the IP and UA problems. At least we're getting feedback from MSN, which is more than we get from google: couple of visits to promote G's viewpoint and disappear again! :(


 10:58 pm on Aug 21, 2011 (gmt 0)

Holy ###. I was all set to start a thread on the Bingbot's new clothes, and here it's been going on for a year and I'm just late to the party. Here's my version anyway (cut&paste from draft because I had to pull in and edit a bunch of logs):

I just noticed this today; checked back and found a few more over the last couple of days. The first occurrence I found was on the 10th, and then nothing else back to mid-July when I got tired of looking. (It's too complicated for Spotlight, so you have to open the individual log files. Yawn.)

I have to put this first attempt in full:

- - [10/Aug/2011:16:49:10] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2303 "-" "-"
- - [10/Aug/2011:16:49:20] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
- - [10/Aug/2011:16:49:30] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
- - [10/Aug/2011:16:49:40] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
- - [10/Aug/2011:16:49:51] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
- - [10/Aug/2011:16:50:01] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
- - [10/Aug/2011:16:50:11] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
- - [10/Aug/2011:16:50:21] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"

A bit slow on the uptake, are we, bingbot? (So was I, because at the time I must have just glanced at the series of 403s and at the requested title and assumed it was my Ukrainians, forgetting that they now get a 301.) Blank UAs get slapped with an automatic 403. No use saying "But I'm from Bing! Honest! I just didn't make it to the laundromat in time!"

After that, they must have got the message, because when they tried it again there was a pattern:

- - [19/Aug/2011:21:50:55] "GET /robots.txt HTTP/1.1" 200 806 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
- - [19/Aug/2011:21:57:53] {robots.txt} "-" {bingbot}
- - [19/Aug/2011:22:09:09] {robots.txt} "-" {bingbot}
- - [19/Aug/2011:22:09:46] {one html file} "-" "-"
- - [19/Aug/2011:23:26:46] {another html file} "-" "-"
- - [20/Aug/2011:18:31:57] {robots.txt} "-" {bingbot}
- - [20/Aug/2011:18:32:29] {one html file} "-" "-"

Then, to show that they still know how to do it right:

- - [20/Aug/2011:21:22:07] {robots.txt} 301 "-" {bingbot}
- - [20/Aug/2011:21:22:08] {robots.txt} "-" {bingbot}
- - [20/Aug/2011:21:22:46] {one html file} 301 "-" {bingbot}

Logs don't say, but the 301 here is because they were aiming for a without-www URL. Note that they didn't follow the 301 and pick up the actual file. This is normal for the bingbot on my site. (I got curious. They picked up the correctly named file back in early July, so presumably said "Naah, not worth the trouble" this time around.)

- - [21/Aug/2011:02:15:02] {robots.txt} "-" {bingbot}
- - [21/Aug/2011:02:15:59] {one html file} "-" "-"
- - [21/Aug/2011:06:11:14] {robots.txt} "-" {bingbot}
- - [21/Aug/2011:06:12:11] {one html file} "-" "-"
- - [21/Aug/2011:12:30:36] {robots.txt} "-" {bingbot}
- - [21/Aug/2011:12:31:29] {one html file} "-" "-"

Does this make any sense? At all? Whatsoever?


 11:19 am on Aug 22, 2011 (gmt 0)

Actually, I've yet to have a problem with the official bingbot run from .search.msn.com. What I dislike, and deny, are MSN's cloaked -- bare-IP, no-UA -- no-robots.txt, rapid-fire, multi-hits to single files.

As re-re-reported, the 11-hit sets just keep on comin':
08/2n 01:42:53 /dir/filename.html
08/2n 01:43:04 /dir/filename.html
08/2n 01:43:14 /dir/filename.html
08/2n 01:43:25 /dir/filename.html
08/2n 01:43:36 /dir/filename.html
08/2n 01:43:46 /dir/filename.html
08/2n 01:43:57 /dir/filename.html
08/2n 01:44:08 /dir/filename.html
08/2n 01:44:19 /dir/filename.html
08/2n 01:44:29 /dir/filename.html
08/2n 01:44:40 /dir/filename.html

Log format (FWIW). Always the same:

- - [2n/Aug/2011:01:42:53 -0n00] "GET /dir/filename.html HTTP/1.1" 403 1486 "-" "-"


 4:53 pm on Aug 27, 2011 (gmt 0)

The following two --

bare-IP (no rDNS)

-- hits by MSN were part of a post-tweet swarm to the exact same file:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
08/27 09:34:25 /dir2/filename.html
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
08/27 09:34:25 /dir2/filename.html

Why any major SE needs to repeatedly abuse the very rules they expect us to follow is beyond me. I guess it's 'because they can.'


 9:09 pm on Aug 28, 2011 (gmt 0)

More confirmation today; another tweeted link, additional cloaked hits:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

Four minutes apart, this time.


 6:59 pm on Aug 29, 2011 (gmt 0)

I don't think that UA could be valid. Most IE UAs have loads of .NET and stuff in them. Has to be some kind of bot, whether MS or not I don't know.


 11:18 am on Sep 19, 2011 (gmt 0)

Same ol', same ol'. Eleven cloaked hits to one file -- no rDNS, no UA, no robots.txt

09/1n 03:49:57 /dir/filename.html
09/1n 03:50:07 /dir/filename.html
09/1n 03:50:18 /dir/filename.html
09/1n 03:50:29 /dir/filename.html
09/1n 03:50:39 /dir/filename.html
09/1n 03:50:50 /dir/filename.html
09/1n 03:51:01 /dir/filename.html
09/1n 03:51:12 /dir/filename.html
09/1n 03:51:22 /dir/filename.html
09/1n 03:51:33 /dir/filename.html
09/1n 03:51:44 /dir/filename.html


 3:33 am on Oct 1, 2011 (gmt 0)

Somehow someone somewhere at Microsoft thinks this is efficient. Surrrrrrrrrrre.

09/30 1n:39:33 /dir/filename.html
09/30 1n:39:44 /dir/filename.html
09/30 1n:39:55 /dir/filename.html
09/30 1n:40:06 /dir/filename.html
09/30 1n:40:17 /dir/filename.html
09/30 1n:40:27 /dir/filename.html
09/30 1n:40:38 /dir/filename.html
09/30 1n:40:49 /dir/filename.html
09/30 1n:41:00 /dir/filename.html
09/30 1n:41:10 /dir/filename.html
09/30 1n:41:21 /dir/filename.html


 7:00 pm on Oct 1, 2011 (gmt 0)

Aside from its name, I've still no idea what it does or what it will do or why it's now no-rDNS:
msnbot-NewsBlogs/2.0b (+http://search.msn.com/msnbot.htm)
10/01 1n:08:18 /robots.txt
10/01 1n:08:19 /
msnbot-NewsBlogs/2.0b (+http://search.msn.com/msnbot.htm)
10/01 1n:08:20 /robots.txt
10/01 1n:08:22 /filename.html


 7:40 pm on Oct 1, 2011 (gmt 0)

I think it's time bingdude popped in here. Haven't seen him in the bing forum lately, either. Hope he hasn't done a google-man and hopped it! :(


 12:16 am on Oct 2, 2011 (gmt 0)

This just in --

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; SV1; .NET CLR 1.1.4325; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)

robots.txt? NO

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved