Forum Moderators: Ocean10000 & incrediBILL & keyplyr

MSN's many cloaked bots. Again.

     
11:44 pm on Aug 5, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


Previously... [webmasterworld.com]

Currently, straight out of my logs...

65.52.33.73 - - [05/Aug/2010:15:45:09 -0700] "GET /dir/filename.html HTTP/1.1" 403 1468 "-" "-"

No UA, no robots.txt, no REF, no nothing. Not once. Not twice. Not even three times. Try eleven.

65.52.33.73
-
08/05 15:45:09 /dir/filename.html
08/05 15:45:20 /dir/filename.html
08/05 15:45:31 /dir/filename.html
08/05 15:45:42 /dir/filename.html
08/05 15:45:53 /dir/filename.html
08/05 15:46:03 /dir/filename.html
08/05 15:46:14 /dir/filename.html
08/05 15:46:25 /dir/filename.html
08/05 15:46:35 /dir/filename.html
08/05 15:46:46 /dir/filename.html
08/05 15:46:57 /dir/filename.html

Same poor file. All hits 403'd because no UA; also because bare MSN IP and not a bona fide MSN bot.
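
For anyone who wants to spot the same pattern in their own logs, here is a minimal sketch. It assumes an Apache "combined" style line like the sample above; the names `LOG_RE` and `blank_ua_repeats` and the `min_hits` threshold are illustrative, not anything from the site's actual setup.

```python
import re
from collections import Counter

# Matches the "combined" log layout shown in the sample line above.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def blank_ua_repeats(lines, min_hits=3):
    """Return {(ip, path): count} for blank-UA requests seen at least min_hits times."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        # A missing User-Agent is logged as "-" (or an empty string).
        if m and m.group("agent") in ("", "-"):
            counts[(m.group("ip"), m.group("path"))] += 1
    return {key: n for key, n in counts.items() if n >= min_hits}
```

Feed it the lines of an access log (for example `blank_ua_repeats(open("access.log"))`, where the file name is a placeholder) and any IP that hammers one file with no UA shows up with its hit count.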
6:14 am on July 5, 2011 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:6964
votes: 385


Color me confused... what SIZE are these 3pgs or 4pgs pages? HOW OFTEN? (weekly)
I like to be indexed weekly... but most of my pages go a bit slower (bigger than most).

MSNbot/Bing come from many IPs, but most only ask for a dozen pages at a time... not the whole site (my experience)

And on pages that have not been updated a nice 304 is given, not the page...

Have you investigated throttling rapid access?
10:58 am on July 5, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 7, 2004
posts:660
votes: 0


Every unchanged Status 200 page is given a 304.

The blocked pages in these reports are *not* given a 200. They are given a 403 (fast scrape) or a 503 (slow scrape), plus a tiny text notice. Blocked bots are refused for a week (bit bigger text notice). If they attempt another fast-scrape again during this time, the timer is reset to zero.
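
A rough sketch of that policy, for anyone wanting to try something similar. Only the week-long refusal, the 403/503 split and the timer reset come from the description above; the thresholds (`FAST_LIMIT`, `SLOW_LIMIT`), the one-hour window and the class name `ScrapeBlocker` are assumptions made up for illustration, not the real code behind the site.

```python
import time
from collections import defaultdict, deque

BLOCK_SECONDS = 7 * 24 * 3600  # "refused for a week"
FAST_LIMIT = 5                 # assumed: max requests allowed inside one second
SLOW_LIMIT = 300               # assumed: max requests allowed inside one hour

class ScrapeBlocker:
    def __init__(self):
        self.hits = defaultdict(deque)  # ip -> timestamps from the last hour
        self.blocked_until = {}         # ip -> (expiry time, status to send)

    def check(self, ip, now=None):
        """Return the HTTP status for this request: 200, 403 or 503."""
        now = time.time() if now is None else now
        q = self.hits[ip]
        q.append(now)
        while q and now - q[0] > 3600:
            q.popleft()
        fast = sum(1 for t in q if now - t < 1.0) > FAST_LIMIT
        slow = len(q) > SLOW_LIMIT
        if ip in self.blocked_until:
            expiry, status = self.blocked_until[ip]
            if now < expiry:
                if fast:
                    # Another fast scrape while blocked restarts the week.
                    self.blocked_until[ip] = (now + BLOCK_SECONDS, status)
                return status
            del self.blocked_until[ip]
        if fast or slow:
            status = 403 if fast else 503
            self.blocked_until[ip] = (now + BLOCK_SECONDS, status)
            return status
        return 200
```

The `now` parameter exists only so the logic can be exercised without waiting; in live use `check(ip)` would be called once per page request.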

Periodicity varies according to the bot. G has been daily for 9 years (and well-behaved until yesterday). Others vary.

I've investigated throttling rapid access, but have not activated (apart from the bot-blocker, which is a rapid-access blocker, of course). Fast bots run out of pages to request pretty quick, anyway.

I would suggest that the main difference between my & (most?) other sites is that mine tests for, and then blocks-records-reports, abusive activity. If you do not test for it, how can you know whether it is happening or not?
8:15 pm on July 5, 2011 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3121
votes: 3


I record all legit bot hits in a specific composite-site log (ie across the whole server in one log). I view this log several times a day. I would notice if the rate were more than two or three pages per second.

The same applies to the major bot companies that are using non-bot rDNS - I log all "bad" site hits including scrapers and server farms.

My experience is that the major bots tend not to hit the same IP at the same time: ie they scan one site, wait a while, then scan another site. The msnbot specifically scans a whole site at one sitting (I have no delay factors in most of my robots.txt files) then comes back a few minutes or hours later for another site.

NOTE: This only applies to web PAGES. It does not include images, css, js etc.
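
If you would rather compute the rate than eyeball it, something along these lines would do. It is only a sketch: it assumes you have already pulled (host, timestamp, path) out of each log line, that timestamps have one-second resolution as in normal access logs, and that the asset suffix list (my own guess) matches what you consider "not a page".

```python
from collections import Counter, defaultdict

# Assumed list of non-page suffixes; adjust to taste.
ASSET_SUFFIXES = (".css", ".js", ".png", ".gif", ".jpg", ".jpeg", ".ico")

def page_rates(hits):
    """hits: iterable of (host, timestamp_string, path), one per request.

    Returns {host: (max_pages_in_any_one_second, total_pages)}.
    """
    per_second = defaultdict(Counter)
    for host, stamp, path in hits:
        if path.lower().endswith(ASSET_SUFFIXES):
            continue  # pages only: skip images, css, js etc.
        per_second[host][stamp] += 1
    return {
        host: (max(counts.values()), sum(counts.values()))
        for host, counts in per_second.items()
    }
```

Anything that comes back with a first number above your comfort level (two or three per second, in my case) is worth a closer look.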
10:43 pm on July 6, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


@AlexK: Bing are on the case: [webmasterworld.com...]
5:56 am on July 7, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 7, 2004
posts:660
votes: 0


Thanks for that, g1smd.

6 July abusive activity:
msnbot-65-52-110-43.search.msn.com [forums.modem-help.co.uk] : max 8 pages / sec; 3,143 total pages
msnbot-207-46-13-47.search.msn.com [forums.modem-help.co.uk] : max 7 pages / sec; 25 total pages

I'll believe it when I see it.
10:22 am on July 7, 2011 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:6964
votes: 385


how can you know whether it is happening or not?

Log files reveal all :)
11:26 pm on July 8, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Bingdude is out of the building for a couple of weeks, according to another thread here at WebmasterWorld.

I think that time should be used to gather and present a comprehensive list of the detailed issues here in this thread.
12:48 pm on July 9, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 7, 2004
posts:660
votes: 0


AlexK:
I'll believe it when I see it.

There have been zero reports from my abuse system on any MSN bot since my last report (see previous post) on 2011-07-06 16:45:56 until the latest at 2011-07-09 05:00:01. It looks like M$ may have pulled the plug on this abusive behaviour by their bots.
10:44 pm on July 9, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


If that's still true after ~7 days, I'd call it a result.

Next.
9:38 am on July 10, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 7, 2004
posts:660
votes: 0


After seeing Pfui's postings across all these months since last August, I'd call it a temporary result (but a result nonetheless).
12:07 pm on July 17, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 7, 2004
posts:660
votes: 0


No more problems from msn/bingbot. Nice.

Here's an interesting one:

crawl-31-192-104-174.googlebot.com [forums.modem-help.co.uk] : max: 8 pages / sec; 98 pages

Seasoned webmasters will immediately spot that this has nothing to do with Google (unlike my earlier report). The IP resolves to Navitel Rusconnect Ltd [cidr-report.org] (ASN=AS49335). Just another forgery.
1:02 pm on July 17, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


About MSN: Great news!

About the G forgery: That one's getting around... [stopforumspam.com...]

(FWIW: DT's WHOIS shows the Russian Fed IP location as Mir Telematiki Ltd.)

I'm amazed anyone anywhere was able to resolve 31.192.104.174 to crawl-31-192-104-174.googlebot.com for more than a few minutes before G ripped the wheels off its wagon. Another reason to limit the majors' UAs to the majors' IPs and vice-versa.
6:26 pm on July 17, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 7, 2004
posts:660
votes: 0


@Pfui:
Another reason to limit the majors' UAs to the majors' IPs and vice-versa

Yes. One of the reasons for me reporting it. Folks often need to see the evidence to get the point.

Personally, I do not place any trust in the majors' IPs. My site is coded to act on behaviour, not rep. Hence reporting MSN in the first place. That was underlined by your OP in this thread all those years ago.
7:08 pm on July 17, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


If something turns up using any well-known searchengine User Agent but the IP address is wrong, it gets no chance to do anything.

Everything else gets judged on its behaviour when accessing the site, including accesses from searchengine IPs but with non-SE User Agents.
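
One common way to do the first check is forward-confirmed reverse DNS: look up the PTR for the IP, make sure it ends in the crawler's own domain, then resolve that hostname and confirm it maps back to the same IP. The sketch below is an illustration under assumptions, not my actual setup: the `CRAWLER_DOMAINS` table is a guess based on the hostnames seen in this thread, and the function names are made up. A forged PTR like the googlebot.com one reported above fails at the forward step.

```python
import socket

# Assumed mapping of UA tokens to the rDNS suffixes their crawlers use.
CRAWLER_DOMAINS = {
    "bingbot": (".search.msn.com",),
    "msnbot": (".search.msn.com",),
    "googlebot": (".googlebot.com", ".google.com"),
}

def claimed_crawler(user_agent):
    """Return the crawler token the UA claims to be, or None."""
    ua = user_agent.lower()
    for token in CRAWLER_DOMAINS:
        if token in ua:
            return token
    return None

def verify_crawler(ip, user_agent, reverse=None, forward=None):
    """True = verified, False = SE User Agent from the wrong place,
    None = UA claims no known crawler (judge it on behaviour instead)."""
    token = claimed_crawler(user_agent)
    if token is None:
        return None
    # Resolvers are injectable so the logic can be tested without DNS.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip).lower().rstrip(".")
        if not host.endswith(CRAWLER_DOMAINS[token]):
            return False
        # The PTR alone can be forged; the forward lookup must agree.
        return ip in forward(host)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

A `False` result gets the door shut immediately; a `None` result falls through to the behaviour-based checks.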
8:47 pm on July 17, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


Ditto. And speaking of 403s for cloaking's-not-okay behavior...

MSN's still in the Twitter-swarming, IP-only, fake UA game. These just in, nine seconds apart:

65.52.22.174
65.52.6.105

UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

robots.txt? NO
12:26 am on Aug 4, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


This went a bit nuts today, pulling many hundreds of pages, with a gap of 60 to 150 seconds between each one, from 157.55.116.nnn:

msnbot/2.0b (+http://search.msn.com/msnbot.htm)._


What's with the trailing junk?
2:06 am on Aug 4, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


The underscore variant's been around for too long. I block it, with no deleterious effect.
9:02 pm on Aug 4, 2011 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3121
votes: 3


Same here. I have also mentioned this UA problem and non-DNS bot IPs to bingdude: still waiting for an answer. :(
1:43 pm on Aug 9, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


Talk about déjà vu all over again...

-----
Aug. 5, 2010 (OP)
Sep. 16, 2010
Nov. 10, 2010
-----
Those posts detail hits from bare MSN IPs. No UAs, no robots.txt, no REFs, no nothing. Not once. Not twice. Not even three times. Try eleven. To a single file each time.

Eerily, almost exactly a year to the day after my OP:

-----
Aug. 9, 2011
-----
This post details hits from a bare MSN IP. No UA, no robots.txt, no REF, no nothing. Not once. Not twice. Not even three times. Try eleven. To a single file.

65.52.39.40
-
08/09 0n:00:53 /dir/filename.html
08/09 0n:01:03 /dir/filename.html
08/09 0n:01:14 /dir/filename.html
08/09 0n:01:25 /dir/filename.html
08/09 0n:01:35 /dir/filename.html
08/09 0n:01:46 /dir/filename.html
08/09 0n:01:57 /dir/filename.html
08/09 0n:02:07 /dir/filename.html
08/09 0n:02:18 /dir/filename.html
08/09 0n:02:29 /dir/filename.html
08/09 0n:02:40 /dir/filename.html

All hits 403'd because no UA; also because bare MSN IP and not a bona fide MSN bot.

AGAIN.
10:57 pm on Aug 17, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


Ironically, while MSN remains mum on running its 'trailing underscore' version of msnbot, the same so-called version was recently faked by an SEO domain (shame on them).

From Project Honey Pot for 72.22.68.30 [projecthoneypot.org]:

msnbot/2.0b (+http://search.msn.com/msnbot.htm)._

I found that because the same IP recently hit me from its Host name with yet another MSN fakery:

cpanel2.seor*nk.com
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

robots.txt? NO

Just more examples of why limiting major SE UAs to their respective Hosts/IPs is a Good Thing. Ditto blocking Hosts with SEO in their names...
7:24 pm on Aug 18, 2011 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3121
votes: 3


I've had iPower blocked for yonks.

Last from bingdude a week-ish ago: MS are still looking into the IP and UA problems. At least we're getting feedback from MSN, which is more than we get from google: couple of visits to promote G's viewpoint and disappear again! :(
10:58 pm on Aug 21, 2011 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13210
votes: 347


Holy ###. I was all set to start a thread on the Bingbot's new clothes, and here it's been going on for a year and I'm just late to the party. Here's my version anyway (cut&paste from draft because I had to pull in and edit a bunch of logs):
____

I just noticed this today; checked back and found a few more over the last couple of days. The first occurrence I found was on the 10th, and then nothing else back to mid-July when I got tired of looking. (It's too complicated for Spotlight, so you have to open the individual log files. Yawn.)

I have to put this first attempt in full:

65.52.32.124 - - [10/Aug/2011:16:49:10] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2303 "-" "-"
65.52.32.124 - - [10/Aug/2011:16:49:20] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
65.52.32.124 - - [10/Aug/2011:16:49:30] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
65.52.32.124 - - [10/Aug/2011:16:49:40] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
65.52.32.124 - - [10/Aug/2011:16:49:51] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
65.52.32.124 - - [10/Aug/2011:16:50:01] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
65.52.32.124 - - [10/Aug/2011:16:50:11] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
65.52.32.124 - - [10/Aug/2011:16:50:21] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"


A bit slow on the uptake, are we, bingbot? (So was I, because at the time I must have just glanced at the series of 403s and at the requested title and assumed it was my Ukrainians, forgetting that they now get a 301.) Blank UAs get slapped with an automatic 403. No use saying "But I'm from Bing! Honest! I just didn't make it to the laundromat in time!"

After that, they must have got the message, because when they tried it again there was a pattern:

207.46.199.193 - - [19/Aug/2011:21:50:55] "GET /robots.txt HTTP/1.1" 200 806 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 
157.55.50.11 - - [19/Aug/2011:21:57:53] {robots.txt} "-" {bingbot}
65.52.108.24 - - [19/Aug/2011:22:09:09] {robots.txt} "-" {bingbot}
65.52.108.24 - - [19/Aug/2011:22:09:46] {one html file} "-" "-"
65.52.108.24 - - [19/Aug/2011:23:26:46] {another html file} "-" "-"


157.55.16.87 - - [20/Aug/2011:18:31:57] {robots.txt} "-" {bingbot}
157.55.16.87 - - [20/Aug/2011:18:32:29] {one html file} "-" "-"


Then, to show that they still know how to do it right:

65.52.104.21 - - [20/Aug/2011:21:22:07] {robots.txt} 301 "-" {bingbot}
65.52.104.21 - - [20/Aug/2011:21:22:08] {robots.txt} "-" {bingbot}
65.52.104.21 - - [20/Aug/2011:21:22:46] {one html file} 301 "-" {bingbot}


Logs don't say, but the 301 here is because they were aiming for a without-www URL. Note that they didn't follow the 301 and pick up the actual file. This is normal for the bingbot on my site. (I got curious. They picked up the correctly named file back in early July, so presumably said "Naah, not worth the trouble" this time around.)

157.55.16.87 - - [21/Aug/2011:02:15:02] {robots.txt} "-" {bingbot}
157.55.16.87 - - [21/Aug/2011:02:15:59] {one html file} "-" "-"

157.55.16.87 - - [21/Aug/2011:06:11:14] {robots.txt} "-" {bingbot}
157.55.16.87 - - [21/Aug/2011:06:12:11] {one html file} "-" "-"

157.55.16.87 - - [21/Aug/2011:12:30:36] {robots.txt} "-" {bingbot}
157.55.16.87 - - [21/Aug/2011:12:31:29] {one html file} "-" "-"


Does this make any sense? At all? Whatsoever?
11:19 am on Aug 22, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


Actually, I've yet to have a problem with the official bingbot run from .search.msn.com. What I dislike, and deny, are MSN's cloaked -- bare-IP, no-UA -- no-robots.txt, rapid-fire, multi-hits to single files.

As re-re-reported, the 11-hit sets just keep on comin':

65.52.32.71
-
08/2n 01:42:53 /dir/filename.html
08/2n 01:43:04 /dir/filename.html
08/2n 01:43:14 /dir/filename.html
08/2n 01:43:25 /dir/filename.html
08/2n 01:43:36 /dir/filename.html
08/2n 01:43:46 /dir/filename.html
08/2n 01:43:57 /dir/filename.html
08/2n 01:44:08 /dir/filename.html
08/2n 01:44:19 /dir/filename.html
08/2n 01:44:29 /dir/filename.html
08/2n 01:44:40 /dir/filename.html

Log format (FWIW). Always the same:

65.52.32.71 - - [2n/Aug/2011:01:42:53 -0n00] "GET /dir/filename.html HTTP/1.1" 403 1486 "-" "-"
4:53 pm on Aug 27, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


The following two --

bare-IP (no rDNS)
cloaked-UA
same-second
no-robots.txt

-- hits by MSN were part of a post-tweet swarm to the exact same file:

65.52.6.105
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
08/27 09:34:25 /dir2/filename.html

65.52.17.111
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
08/27 09:34:25 /dir2/filename.html

Why any major SE needs to repeatedly abuse the very rules they expect us to follow is beyond me. I guess it's 'because they can.'
9:09 pm on Aug 28, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


More confirmation today; another tweeted link, additional cloaked hits:

65.52.21.72
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

65.52.6.105
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

Four minutes apart, this time.
6:59 pm on Aug 29, 2011 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3121
votes: 3


I don't think that UA could be valid. Most IE UAs have loads of .NET and stuff in them. Has to be some kind of bot, whether MS or not I don't know.
11:18 am on Sept 19, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


Same ol', same ol'. Eleven cloaked hits to one file -- no rDNS, no UA, no robots.txt

65.52.33.119

09/1n 03:49:57 /dir/filename.html
09/1n 03:50:07 /dir/filename.html
09/1n 03:50:18 /dir/filename.html
09/1n 03:50:29 /dir/filename.html
09/1n 03:50:39 /dir/filename.html
09/1n 03:50:50 /dir/filename.html
09/1n 03:51:01 /dir/filename.html
09/1n 03:51:12 /dir/filename.html
09/1n 03:51:22 /dir/filename.html
09/1n 03:51:33 /dir/filename.html
09/1n 03:51:44 /dir/filename.html
3:33 am on Oct 1, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


Somehow someone somewhere at Microsoft thinks this is efficient. Surrrrrrrrrrre.

70.37.161.15

09/30 1n:39:33 /dir/filename.html
09/30 1n:39:44 /dir/filename.html
09/30 1n:39:55 /dir/filename.html
09/30 1n:40:06 /dir/filename.html
09/30 1n:40:17 /dir/filename.html
09/30 1n:40:27 /dir/filename.html
09/30 1n:40:38 /dir/filename.html
09/30 1n:40:49 /dir/filename.html
09/30 1n:41:00 /dir/filename.html
09/30 1n:41:10 /dir/filename.html
09/30 1n:41:21 /dir/filename.html
7:00 pm on Oct 1, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


Aside from its name, I've still no idea what it does or what it will do or why it's now no-rDNS:

157.55.16.46
msnbot-NewsBlogs/2.0b (+http://search.msn.com/msnbot.htm)
10/01 1n:08:18 /robots.txt
10/01 1n:08:19 /

157.55.17.98
msnbot-NewsBlogs/2.0b (+http://search.msn.com/msnbot.htm)
10/01 1n:08:20 /robots.txt
10/01 1n:08:22 /filename.html
7:40 pm on Oct 1, 2011 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3121
votes: 3


I think it's time bingdude popped in here. Haven't seen him in the bing forum lately, either. Hope he hasn't done a google-man and hopped it! :(