Forum Moderators: Ocean10000 & incrediBILL

MSN's many cloaked bots. Again.

     
11:44 pm on Aug 5, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Previously... [webmasterworld.com]

Currently, straight out of my logs...

65.52.33.73 - - [05/Aug/2010:15:45:09 -0700] "GET /dir/filename.html HTTP/1.1" 403 1468 "-" "-"

No UA, no robots.txt, no REF, no nothing. Not once. Not twice. Not even three times. Try eleven.

65.52.33.73
-
08/05 15:45:09 /dir/filename.html
08/05 15:45:20 /dir/filename.html
08/05 15:45:31 /dir/filename.html
08/05 15:45:42 /dir/filename.html
08/05 15:45:53 /dir/filename.html
08/05 15:46:03 /dir/filename.html
08/05 15:46:14 /dir/filename.html
08/05 15:46:25 /dir/filename.html
08/05 15:46:35 /dir/filename.html
08/05 15:46:46 /dir/filename.html
08/05 15:46:57 /dir/filename.html

Same poor file. All hits 403'd because no UA; also because bare MSN IP and not a bona fide MSN bot.
6:14 am on Jul 5, 2011 (gmt 0)

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



Color me confused... what SIZE are these 3pgs or 4pgs pages? HOW OFTEN? (weekly)
I like to be indexed weekly... but most of my pages go a bit slower (bigger than most).

MSNbot/Bing come from many IPs, but most only ask for a dozen pages at a time... not the whole site (my experience)

And on pages that have not been updated a nice 304 is given, not the page...

Have you investigated throttling rapid access?
10:58 am on Jul 5, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Every unchanged Status 200 page is given a 304.

The blocked pages in these reports are *not* given a 200. They are given a 403 (fast scrape) or a 503 (slow scrape), plus a tiny text notice. Blocked bots are refused for a week (bit bigger text notice). If they attempt another fast-scrape again during this time, the timer is reset to zero.
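
For readers who want to see the shape of such a blocker, here is a minimal Python sketch of the week-long refusal with a timer reset on a repeat fast scrape. It is not the code running on the site described above. The class name, the `fast_limit` and `window` thresholds, and the choice to answer banned requests with a 403 are all assumptions made for illustration, and the 503 slow-scrape branch is left out.

```python
import time

WEEK = 7 * 24 * 3600  # ban length in seconds, per the post above


class ScrapeBlocker:
    """Tracks per-IP request times and week-long bans (illustrative only)."""

    def __init__(self, fast_limit=5, window=1.0, clock=time.time):
        self.fast_limit = fast_limit  # assumed: hits tolerated per window
        self.window = window          # assumed: window length in seconds
        self.clock = clock            # injectable so the logic can be tested
        self.hits = {}                # ip -> recent request timestamps
        self.banned_at = {}           # ip -> time the current ban started

    def status_for(self, ip):
        """Return the HTTP status to send for one request from ip."""
        now = self.clock()
        recent = [t for t in self.hits.get(ip, []) if now - t < self.window]
        recent.append(now)
        self.hits[ip] = recent
        too_fast = len(recent) > self.fast_limit

        started = self.banned_at.get(ip)
        if started is not None and now - started < WEEK:
            if too_fast:
                self.banned_at[ip] = now  # repeat fast scrape: timer back to zero
            return 403
        if too_fast:
            self.banned_at[ip] = now      # first offence starts the ban
            return 403
        return 200
```

The clock is passed in so that the ban expiry and the reset can be exercised without waiting a week.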

Periodicity varies according to the bot. G has been daily for 9 years (and well-behaved until yesterday). Others vary.

I've investigated throttling rapid access, but have not activated (apart from the bot-blocker, which is a rapid-access blocker, of course). Fast bots run out of pages to request pretty quick, anyway.

I would suggest that the main difference between my & (most?) other sites is that mine tests for, and then blocks-records-reports, abusive activity. If you do not test for it, how can you know whether it is happening or not?
8:15 pm on Jul 5, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



I record all legit bot hits in a specific composite-site log (ie across the whole server in one log). I view this log several times a day. I would notice if the rate were more than two or three pages per second.

The same applies to the major bot companies that are using non-bot rDNS - I log all "bad" site hits including scrapers and server farms.

My experience is that the major bots tend not to hit the same IP at the same time: ie they scan one site, wait a while, then scan another site. The msnbot specifically scans a whole site at one sitting (I have no delay factors in most of my robots.txt files) then comes back a few minutes or hours later for another site.

NOTE: This only applies to web PAGES. It does not include images, css, js etc.
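
A rough Python sketch of that kind of pages-per-second check, counting pages only. The function name, the input tuple layout and the list of asset suffixes are assumptions for illustration, not anyone's production logger. It buckets hits by calendar second, which is simpler than a true sliding window.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Assumed list of non-page suffixes; adjust to the site in question.
ASSET_SUFFIXES = (".css", ".js", ".png", ".gif", ".jpg", ".jpeg", ".ico")


def page_rates(hits):
    """hits: iterable of (host, timestamp_string, path) tuples.

    Timestamps use the access-log form, e.g. "05/Aug/2010:15:45:09".
    Returns {host: (max_pages_in_any_one_second, total_pages)}.
    """
    per_second = defaultdict(Counter)
    for host, stamp, path in hits:
        # Ignore images, css, js and the like: only web pages are counted.
        if path.lower().split("?", 1)[0].endswith(ASSET_SUFFIXES):
            continue
        second = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S")
        per_second[host][second] += 1
    return {
        host: (max(counts.values()), sum(counts.values()))
        for host, counts in per_second.items()
    }
```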
10:43 pm on Jul 6, 2011 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



@AlexK: Bing are on the case: [webmasterworld.com...]
5:56 am on Jul 7, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for that, g1smd.

6 July abusive activity:
msnbot-65-52-110-43.search.msn.com [forums.modem-help.co.uk] : max 8 pages / sec; 3,143 total pages
msnbot-207-46-13-47.search.msn.com [forums.modem-help.co.uk] : max 7 pages / sec; 25 total pages

I'll believe it when I see it.
10:22 am on Jul 7, 2011 (gmt 0)

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



how can you know whether it is happening or not?

Log files reveal all :)
11:26 pm on Jul 8, 2011 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Bingdude is out of the building for a couple of weeks, according to another thread here at WebmasterWorld.

I think that time should be used to gather and present a comprehensive list of the detailed issues here in this thread.
12:48 pm on Jul 9, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



AlexK:
I'll believe it when I see it.

There have been zero reports from my abuse system on any MSN bot since my last report (see previous post) on 2011-07-06 16:45:56 until the latest at 2011-07-09 05:00:01. It looks like M$ may have pulled the plug on this abusive behaviour by their bots.
10:44 pm on Jul 9, 2011 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



If that's still true after ~7 days, I'd call it a result.

Next.
9:38 am on Jul 10, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



After seeing Pfui's postings across all these months since last August, I'd call it a temporary result (but a result none the less).
12:07 pm on Jul 17, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No more problems from msn/bingbot. Nice.

Here's an interesting one:

crawl-31-192-104-174.googlebot.com [forums.modem-help.co.uk] : max: 8 pages / sec; 98 pages

Seasoned webmasters will immediately spot that this has nothing to do with Google (unlike my earlier report). The IP resolves to Navitel Rusconnect Ltd [cidr-report.org] (ASN=AS49335). Just another forgery.
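
A forgery like this is what a forward-confirmed reverse DNS check is meant to catch: whoever controls an IP block can set its reverse (PTR) record to any name they like, but they cannot make the real googlebot.com zone point that name back at their IP. A minimal Python sketch follows. The function name and the injectable `reverse`/`forward` parameters are assumptions for illustration, the suffix list contains only the two host patterns quoted in this thread, and it handles IPv4 only.

```python
import socket


def _reverse(ip):
    """PTR lookup: IP address -> host name."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host):
    """A lookup: host name -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_bot(ip, suffixes=(".googlebot.com", ".search.msn.com"),
                    reverse=_reverse, forward=_forward):
    """Forward-confirmed reverse DNS check for a claimed crawler IP."""
    try:
        host = reverse(ip).lower().rstrip(".")
        if not host.endswith(tuple(suffixes)):
            return False
        # The name must resolve back to the very IP that made the request.
        return ip in forward(host)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

DNS lookups are slow, so a real deployment would presumably cache the verdict per IP rather than resolve on every hit.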
1:02 pm on Jul 17, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



About MSN: Great news!

About the G forgery: That one's getting around... [stopforumspam.com...]

(FWIW: DT's WHOIS shows the Russian Fed IP location as Mir Telematiki Ltd.)

I'm amazed anyone anywhere was able to resolve 31.192.104.174 to crawl-31-192-104-174.googlebot.com for more than a few minutes before G ripped the wheels off its wagon. Another reason to limit the majors' UAs to the majors' IPs and vice-versa.
6:26 pm on Jul 17, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



@Pfui:
Another reason to limit the majors' UAs to the majors' IPs and vice-versa

Yes. One of the reasons for me reporting it. Folks often need to see the evidence to get the point.

Personally, I do not place any trust in the `major's' IPs. My site is coded to act on behaviour, not rep. Hence reporting MSN in the first place. That was underlined by your OP in this thread all those years ago.
7:08 pm on Jul 17, 2011 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



If something turns up using any well-known searchengine User Agent but the IP address is wrong, it gets no chance to do anything.

Everything else gets judged on its behaviour when accessing the site, including accesses from searchengine IPs but with non-SE User Agents.
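
Expressed as code, that two-step rule might look like the Python sketch below. The function name, the return labels and the short list of user-agent tokens are assumptions for illustration; how `ip_is_search_engine` gets decided (IP ranges, rDNS, or both) is left to the site.

```python
import re

# Assumed, deliberately short list of well-known search engine UA tokens.
SE_UA = re.compile(r"googlebot|msnbot|bingbot", re.IGNORECASE)


def first_pass(user_agent, ip_is_search_engine):
    """Return 'deny' or 'judge' for one request.

    ip_is_search_engine: result of whatever IP check the site trusts.
    """
    claims_se = bool(SE_UA.search(user_agent or ""))
    if claims_se and not ip_is_search_engine:
        return "deny"   # search engine UA from the wrong address: no chance
    # Everything else, including search engine IPs sending non-SE user
    # agents, goes on to be judged on its behaviour.
    return "judge"
```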
8:47 pm on Jul 17, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Ditto. And speaking of 403s for cloaking's-not-okay behavior...

MSN's still in the Twitter-swarming, IP-only, fake UA game. These just in, nine seconds apart:

65.52.22.174
65.52.6.105

UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

robots.txt? NO
12:26 am on Aug 4, 2011 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



This went a bit nuts today, pulling many hundreds of pages, with a gap of 60 to 150 seconds between each one, from 157.55.116.nnn:

msnbot/2.0b (+http://search.msn.com/msnbot.htm)._


What's with the trailing junk?
2:06 am on Aug 4, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



The underscore variant's been around for too long. I block it, with no deleterious effect.
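
One way to pick out that variant, sketched in Python: treat anything after the closing parenthesis of the msnbot string as junk. The pattern and function name are assumptions for illustration, based only on the one user agent quoted above.

```python
import re

# Matches a plain msnbot user agent that carries extra characters
# after the closing parenthesis, such as the trailing "._" seen here.
TRAILING_JUNK = re.compile(
    r"^msnbot/\S+ \(\+http://search\.msn\.com/msnbot\.htm\).+$"
)


def has_trailing_junk(user_agent):
    """True if the msnbot user agent has anything after its final ')'."""
    return bool(TRAILING_JUNK.match(user_agent))
```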
9:02 pm on Aug 4, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



Same here. I have also mentioned this UA problem and non-DNS bot IPs to bingdude: still waiting for an answer. :(
1:43 pm on Aug 9, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Talk about déjà vu all over again...

-----
Aug. 5, 2010 (OP)
Sep. 16, 2010
Nov. 10, 2010
-----
Those posts detail hits from bare MSN IPs. No UAs, no robots.txt, no REFs, no nothing. Not once. Not twice. Not even three times. Try eleven. To a single file each time.

Eerily, almost exactly a year to the day after my OP:

-----
Aug. 9, 2011
-----
This post details hits from a bare MSN IP. No UA, no robots.txt, no REF, no nothing. Not once. Not twice. Not even three times. Try eleven. To a single file.

65.52.39.40
-
08/09 0n:00:53 /dir/filename.html
08/09 0n:01:03 /dir/filename.html
08/09 0n:01:14 /dir/filename.html
08/09 0n:01:25 /dir/filename.html
08/09 0n:01:35 /dir/filename.html
08/09 0n:01:46 /dir/filename.html
08/09 0n:01:57 /dir/filename.html
08/09 0n:02:07 /dir/filename.html
08/09 0n:02:18 /dir/filename.html
08/09 0n:02:29 /dir/filename.html
08/09 0n:02:40 /dir/filename.html

All hits 403'd because no UA; also because bare MSN IP and not a bona fide MSN bot.

AGAIN.
10:57 pm on Aug 17, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Ironically, while MSN remains mum on running its 'trailing underscore' version of msnbot, the same so-called version was recently faked by an SEO domain (shame on them).

From Project Honey Pot for 72.22.68.30 [projecthoneypot.org]:

msnbot/2.0b (+http://search.msn.com/msnbot.htm)._

I found that because the same IP recently hit me from its Host name with yet another MSN fakery:

cpanel2.seor*nk.com
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

robots.txt? NO

Just more examples of why limiting major SE UAs to their respective Hosts/IPs is a Good Thing. Ditto blocking Hosts with SEO in their names...
7:24 pm on Aug 18, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



I've had iPower blocked for yonks.

Last from bingdude a week-ish ago: MS are still looking into the IP and UA problems. At least we're getting feedback from MSN, which is more than we get from google: couple of visits to promote G's viewpoint and disappear again! :(
10:58 pm on Aug 21, 2011 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



Holy ###. I was all set to start a thread on the Bingbot's new clothes, and here it's been going on for a year and I'm just late to the party. Here's my version anyway (cut&paste from draft because I had to pull in and edit a bunch of logs):
____

I just noticed this today; checked back and found a few more over the last couple of days. The first occurrence I found was on the 10th, and then nothing else back to mid-July when I got tired of looking. (It's too complicated for Spotlight, so you have to open the individual log files. Yawn.)

I have to put this first attempt in full:

65.52.32.124 - - [10/Aug/2011:16:49:10] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2303 "-" "-"
65.52.32.124 - - [10/Aug/2011:16:49:20] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
65.52.32.124 - - [10/Aug/2011:16:49:30] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
65.52.32.124 - - [10/Aug/2011:16:49:40] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
65.52.32.124 - - [10/Aug/2011:16:49:51] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
65.52.32.124 - - [10/Aug/2011:16:50:01] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
65.52.32.124 - - [10/Aug/2011:16:50:11] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
65.52.32.124 - - [10/Aug/2011:16:50:21] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"


A bit slow on the uptake, are we, bingbot? (So was I, because at the time I must have just glanced at the series of 403s and at the requested title and assumed it was my Ukrainians, forgetting that they now get a 301.) Blank UAs get slapped with an automatic 403. No use saying "But I'm from Bing! Honest! I just didn't make it to the laundromat in time!"

After that, they must have got the message, because when they tried it again there was a pattern:

207.46.199.193 - - [19/Aug/2011:21:50:55] "GET /robots.txt HTTP/1.1" 200 806 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 
157.55.50.11 - - [19/Aug/2011:21:57:53] {robots.txt} "-" {bingbot}
65.52.108.24 - - [19/Aug/2011:22:09:09] {robots.txt} "-" {bingbot}
65.52.108.24 - - [19/Aug/2011:22:09:46] {one html file} "-" "-"
65.52.108.24 - - [19/Aug/2011:23:26:46] {another html file} "-" "-"


157.55.16.87 - - [20/Aug/2011:18:31:57] {robots.txt} "-" {bingbot}
157.55.16.87 - - [20/Aug/2011:18:32:29] {one html file} "-" "-"


Then, to show that they still know how to do it right:

65.52.104.21 - - [20/Aug/2011:21:22:07] {robots.txt} 301 "-" {bingbot}
65.52.104.21 - - [20/Aug/2011:21:22:08] {robots.txt} "-" {bingbot}
65.52.104.21 - - [20/Aug/2011:21:22:46] {one html file} 301 "-" {bingbot}


Logs don't say, but the 301 here is because they were aiming for a without-www URL. Note that they didn't follow the 301 and pick up the actual file. This is normal for the bingbot on my site. (I got curious. They picked up the correctly named file back in early July, so presumably said "Naah, not worth the trouble" this time around.)

157.55.16.87 - - [21/Aug/2011:02:15:02] {robots.txt} "-" {bingbot}
157.55.16.87 - - [21/Aug/2011:02:15:59] {one html file} "-" "-"

157.55.16.87 - - [21/Aug/2011:06:11:14] {robots.txt} "-" {bingbot}
157.55.16.87 - - [21/Aug/2011:06:12:11] {one html file} "-" "-"

157.55.16.87 - - [21/Aug/2011:12:30:36] {robots.txt} "-" {bingbot}
157.55.16.87 - - [21/Aug/2011:12:31:29] {one html file} "-" "-"


Does this make any sense? At all? Whatsoever?
11:19 am on Aug 22, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Actually, I've yet to have a problem with the official bingbot run from .search.msn.com. What I dislike, and deny, are MSN's cloaked -- bare-IP, no-UA -- no-robots.txt, rapid-fire, multi-hits to single files.

As re-re-reported, the 11-hit sets just keep on comin':

65.52.32.71
-
08/2n 01:42:53 /dir/filename.html
08/2n 01:43:04 /dir/filename.html
08/2n 01:43:14 /dir/filename.html
08/2n 01:43:25 /dir/filename.html
08/2n 01:43:36 /dir/filename.html
08/2n 01:43:46 /dir/filename.html
08/2n 01:43:57 /dir/filename.html
08/2n 01:44:08 /dir/filename.html
08/2n 01:44:19 /dir/filename.html
08/2n 01:44:29 /dir/filename.html
08/2n 01:44:40 /dir/filename.html

Log format (FWIW). Always the same:

65.52.32.71 - - [2n/Aug/2011:01:42:53 -0n00] "GET /dir/filename.html HTTP/1.1" 403 1486 "-" "-"
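
For anyone who wants to hunt for the same pattern in their own logs, here is a small Python sketch that parses lines in the format shown above and counts repeat hits to one file from one IP where both referer and user agent are blank. The function name and the `min_hits` threshold are assumptions for illustration; the regular expression assumes the usual combined log layout and skips any line it cannot parse.

```python
import re
from collections import Counter

# One access-log line: ip, two ignored fields, [time], "request",
# status, size, "referer", "user agent".
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"$'
)


def blank_ua_repeats(lines, min_hits=3):
    """Return {(ip, path): hits} for blank-referer, blank-UA requests.

    Only pairs seen at least min_hits times are reported.
    """
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m and m["ua"] == "-" and m["referer"] == "-":
            counts[(m["ip"], m["path"])] += 1
    return {key: n for key, n in counts.items() if n >= min_hits}
```

Fed a day's log, this would list each bare-IP visitor alongside the single file it kept asking for.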
4:53 pm on Aug 27, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



The following two --

bare-IP (no rDNS)
cloaked-UA
same-second
no-robots.txt

-- hits by MSN were part of a post-tweet swarm to the exact same file:

65.52.6.105
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
08/27 09:34:25 /dir2/filename.html

65.52.17.111
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
08/27 09:34:25 /dir2/filename.html

Why any major SE needs to repeatedly abuse the very rules they expect us to follow is beyond me. I guess it's 'because they can.'
9:09 pm on Aug 28, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



More confirmation today; another tweeted link, additional cloaked hits:

65.52.21.72
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

65.52.6.105
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

Four minutes apart, this time.
6:59 pm on Aug 29, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



I don't think that UA could be valid. Most IE UAs have loads of .NET and stuff in them. Has to be some kind of bot, whether MS or not I don't know.
11:18 am on Sep 19, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Same ol', same ol'. Eleven cloaked hits to one file -- no rDNS, no UA, no robots.txt

65.52.33.119

09/1n 03:49:57 /dir/filename.html
09/1n 03:50:07 /dir/filename.html
09/1n 03:50:18 /dir/filename.html
09/1n 03:50:29 /dir/filename.html
09/1n 03:50:39 /dir/filename.html
09/1n 03:50:50 /dir/filename.html
09/1n 03:51:01 /dir/filename.html
09/1n 03:51:12 /dir/filename.html
09/1n 03:51:22 /dir/filename.html
09/1n 03:51:33 /dir/filename.html
09/1n 03:51:44 /dir/filename.html
3:33 am on Oct 1, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Somehow someone somewhere at Microsoft thinks this is efficient. Surrrrrrrrrrre.

70.37.161.15

09/30 1n:39:33 /dir/filename.html
09/30 1n:39:44 /dir/filename.html
09/30 1n:39:55 /dir/filename.html
09/30 1n:40:06 /dir/filename.html
09/30 1n:40:17 /dir/filename.html
09/30 1n:40:27 /dir/filename.html
09/30 1n:40:38 /dir/filename.html
09/30 1n:40:49 /dir/filename.html
09/30 1n:41:00 /dir/filename.html
09/30 1n:41:10 /dir/filename.html
09/30 1n:41:21 /dir/filename.html
7:00 pm on Oct 1, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Aside from its name, I've still no idea what it does or what it will do or why it's now no-rDNS:

157.55.16.46
msnbot-NewsBlogs/2.0b (+http://search.msn.com/msnbot.htm)
10/01 1n:08:18 /robots.txt
10/01 1n:08:19 /

157.55.17.98
msnbot-NewsBlogs/2.0b (+http://search.msn.com/msnbot.htm)
10/01 1n:08:20 /robots.txt
10/01 1n:08:22 /filename.html
7:40 pm on Oct 1, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



I think it's time bingdude popped in here. Haven't seen him in the bing forum lately, either. Hope he hasn't done a google-man and hopped it! :(
This 152 message thread spans 6 pages.
 
