Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 152 message thread spans 6 pages (this is page 5 of 6).
MSN's many cloaked bots. Again.
Pfui




msg:4182832
 11:44 pm on Aug 5, 2010 (gmt 0)

Previously... [webmasterworld.com]

Currently, straight out of my logs...

65.52.33.73 - - [05/Aug/2010:15:45:09 -0700] "GET /dir/filename.html HTTP/1.1" 403 1468 "-" "-"

No UA, no robots.txt, no REF, no nothing. Not once. Not twice. Not even three times. Try eleven.

65.52.33.73
-
08/05 15:45:09 /dir/filename.html
08/05 15:45:20 /dir/filename.html
08/05 15:45:31 /dir/filename.html
08/05 15:45:42 /dir/filename.html
08/05 15:45:53 /dir/filename.html
08/05 15:46:03 /dir/filename.html
08/05 15:46:14 /dir/filename.html
08/05 15:46:25 /dir/filename.html
08/05 15:46:35 /dir/filename.html
08/05 15:46:46 /dir/filename.html
08/05 15:46:57 /dir/filename.html

Same poor file. All hits 403'd because no UA; also because bare MSN IP and not a bona fide MSN bot.
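
For anyone wanting to pull these out of their own logs, here is a minimal sketch that parses a combined-format line like the one above and flags a blank UA. The names `LOG_RE`, `parse_line` and `is_blank_ua` are made up for illustration, and the regex assumes the standard Apache "combined" layout; adjust it if your server logs differently.

```python
import re

# Assumes the Apache/NCSA "combined" layout seen in the sample line:
# ip ident user [time] "request" status size "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it doesn't match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

def is_blank_ua(entry):
    """True when the User-Agent was logged as empty or as a bare '-'."""
    return entry["ua"].strip() in ("", "-")

sample = ('65.52.33.73 - - [05/Aug/2010:15:45:09 -0700] '
          '"GET /dir/filename.html HTTP/1.1" 403 1468 "-" "-"')
entry = parse_line(sample)
print(entry["ip"], entry["status"], is_blank_ua(entry))
```

Fields come back as strings, so convert `status` or `size` yourself if you need numbers.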

 

tangor




msg:4334980
 6:14 am on Jul 5, 2011 (gmt 0)

Color me confused... what SIZE are these 3pgs or 4pgs pages? HOW OFTEN? (weekly)
I like to be indexed weekly... but most of my pages go a bit slower (bigger than most).

MSNbot/Bing come from many IPs, but most only ask for a dozen pages at a time... not the whole site (my experience)

And on pages that have not been updated a nice 304 is given, not the page...

Have you investigated throttling rapid access?

AlexK




msg:4335066
 10:58 am on Jul 5, 2011 (gmt 0)

Every unchanged Status 200 page is given a 304.

The blocked pages in these reports are *not* given a 200. They are given a 403 (fast scrape) or a 503 (slow scrape), plus a tiny text notice. Blocked bots are refused for a week (bit bigger text notice). If they attempt another fast scrape during this time, the timer is reset to zero.

Periodicity varies according to the bot. G has been daily for 9 years (and well-behaved until yesterday). Others vary.

I've investigated throttling rapid access, but have not activated (apart from the bot-blocker, which is a rapid-access blocker, of course). Fast bots run out of pages to request pretty quick, anyway.

I would suggest that the main difference between my & (most?) other sites is that mine tests for, and then blocks-records-reports, abusive activity. If you do not test for it, how can you know whether it is happening or not?
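
A rough sketch of the kind of blocker described above, for anyone who wants to experiment. This is not the actual code running on the site: the thresholds `FAST_LIMIT` and `SLOW_LIMIT`, the 60-second window, the class name `ScrapeBlocker`, and the choice of 403 for already-banned bots are all assumptions. Only the week-long ban and the timer reset on a repeat fast scrape come from the description.

```python
import time
from collections import defaultdict, deque

BAN_SECONDS = 7 * 24 * 3600  # "refused for a week", per the description
FAST_LIMIT = 5               # assumption: more than 5 pages within 1 second
SLOW_LIMIT = 60              # assumption: more than 60 pages within the window
WINDOW = 60                  # assumption: sliding window length in seconds

class ScrapeBlocker:
    def __init__(self, clock=time.time):
        self.clock = clock                # injectable so it can be tested
        self.hits = defaultdict(deque)    # ip -> recent request times
        self.banned_at = {}               # ip -> when the ban timer (re)started

    def _rate_status(self, ip, now):
        q = self.hits[ip]
        q.append(now)
        while q and now - q[0] > WINDOW:
            q.popleft()
        if sum(1 for t in q if now - t < 1) > FAST_LIMIT:
            return 403                    # fast scrape
        if len(q) > SLOW_LIMIT:
            return 503                    # slow scrape
        return 200

    def check(self, ip):
        """Return the HTTP status this request should receive."""
        now = self.clock()
        status = self._rate_status(ip, now)
        started = self.banned_at.get(ip)
        if started is not None and now - started < BAN_SECONDS:
            if status == 403:
                self.banned_at[ip] = now  # fast scrape while banned: reset timer
            return 403                    # assumption: banned bots get a 403
        if status == 200:
            self.banned_at.pop(ip, None)  # ban expired or never existed
        else:
            self.banned_at[ip] = now      # start a new week-long ban
        return status
```

Call `ScrapeBlocker().check(ip)` once per page request and send back whatever status it returns. State lives in memory here, so a real deployment would need persistent storage shared across server processes.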

dstiles




msg:4335374
 8:15 pm on Jul 5, 2011 (gmt 0)

I record all legit bot hits in a specific composite-site log (ie across the whole server in one log). I view this log several times a day. I would notice if the rate were more than two or three pages per second.

The same applies to the major bot companies that are using non-bot rDNS - I log all "bad" site hits including scrapers and server farms.

My experience is that the major bots tend not to hit the same IP at the same time: ie they scan one site, wait a while, then scan another site. The msnbot specifically scans a whole site at one sitting (I have no delay factors in most of my robots.txt files) then comes back a few minutes or hours later for another site.

NOTE: This only applies to web PAGES. It does not include images, css, js etc.
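
If you'd rather not eyeball the log, the pages-per-second figure can be computed. A minimal sketch, assuming you have already extracted (host, timestamp, path) tuples from your log: the function names and the list of asset suffixes are made up for illustration, and the page-only filter mirrors the NOTE above.

```python
from collections import Counter

# Assumption: anything ending in one of these is an asset, not a page.
ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".ico")

def is_page(path):
    """True unless the path (minus any query string) looks like an asset."""
    return not path.split("?", 1)[0].lower().endswith(ASSET_SUFFIXES)

def rate_report(hits):
    """hits: iterable of (host, timestamp, path), where timestamp is a log
    time such as '06/Jul/2011:10:00:00 +0000' (one-second resolution).
    Returns {host: (max_pages_in_any_one_second, total_pages)}."""
    per_second = Counter()
    for host, timestamp, path in hits:
        if is_page(path):
            # The timestamp without its timezone offset is the per-second bucket.
            per_second[(host, timestamp.split()[0])] += 1
    report = {}
    for (host, _second), count in per_second.items():
        peak, total = report.get(host, (0, 0))
        report[host] = (max(peak, count), total + count)
    return report
```

Feed it every hit for a day and compare each host's peak against whatever rate you consider abusive.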

g1smd




msg:4336062
 10:43 pm on Jul 6, 2011 (gmt 0)

@AlexK: Bing are on the case: [webmasterworld.com...]

AlexK




msg:4336190
 5:56 am on Jul 7, 2011 (gmt 0)

Thanks for that, g1smd.

6 July abusive activity:
msnbot-65-52-110-43.search.msn.com [forums.modem-help.co.uk] : max 8 pages / sec; 3,143 total pages
msnbot-207-46-13-47.search.msn.com [forums.modem-help.co.uk] : max 7 pages / sec; 25 total pages

I'll believe it when I see it.

tangor




msg:4336262
 10:22 am on Jul 7, 2011 (gmt 0)

how can you know whether it is happening or not?

Log files reveal all :)

g1smd




msg:4337236
 11:26 pm on Jul 8, 2011 (gmt 0)

Bingdude is out of the building for a couple of weeks, according to another thread here at WebmasterWorld.

I think that time should be used to gather and present a comprehensive list of the detailed issues here in this thread.

AlexK




msg:4337372
 12:48 pm on Jul 9, 2011 (gmt 0)

AlexK:
I'll believe it when I see it.

There have been zero reports from my abuse system on any MSN bot since my last report (see previous post) on 2011-07-06 16:45:56 until the latest at 2011-07-09 05:00:01. It looks like M$ may have pulled the plug on this abusive behaviour by their bots.

g1smd




msg:4337526
 10:44 pm on Jul 9, 2011 (gmt 0)

If that's still true after ~7 days, I'd call it a result.

Next.

AlexK




msg:4337593
 9:38 am on Jul 10, 2011 (gmt 0)

After seeing Pfui's postings across all these months since last August, I'd call it a temporary result (but a result none the less).

AlexK




msg:4340475
 12:07 pm on Jul 17, 2011 (gmt 0)

No more problems from msn/bingbot. Nice.

Here's an interesting one:

crawl-31-192-104-174.googlebot.com [forums.modem-help.co.uk] : max: 8 pages / sec; 98 pages

Seasoned webmasters will immediately spot that this has nothing to do with Google (unlike my earlier report). The IP resolves to Navitel Rusconnect Ltd [cidr-report.org] (ASN=AS49335). Just another forgery.

Pfui




msg:4340490
 1:02 pm on Jul 17, 2011 (gmt 0)

About MSN: Great news!

About the G forgery: That one's getting around... [stopforumspam.com...]

(FWIW: DT's WHOIS shows the Russian Fed IP location as Mir Telematiki Ltd.)

I'm amazed anyone anywhere was able to resolve 31.192.104.174 to crawl-31-192-104-174.googlebot.com for more than a few minutes before G ripped the wheels off its wagon. Another reason to limit the majors' UAs to the majors' IPs and vice-versa.
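
For reference, the usual defence against a forged PTR record like that one is a forward-confirmed reverse DNS check: look up the host name for the IP, then look up the IPs for that host name and make sure the original IP is among them. A minimal sketch, IPv4 only; the function name and the resolver-injection parameters are made up for illustration, and the suffix list contains only the two crawler domains mentioned in this thread.

```python
import socket

# Only the crawler domains named in this thread; extend as needed.
TRUSTED_SUFFIXES = (".googlebot.com", ".search.msn.com")

def verify_crawler_ip(ip, reverse=None, forward=None):
    """Forward-confirmed reverse DNS check (IPv4 only).

    reverse(ip) -> host name and forward(host) -> list of IPs can be
    replaced for testing; by default they use the system resolver."""
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip)
    except OSError:              # no PTR record, or the lookup failed
        return False
    if not host.lower().rstrip(".").endswith(TRUSTED_SUFFIXES):
        return False
    try:
        return ip in forward(host)   # a forged PTR fails at this step
    except OSError:
        return False
```

Whoever controls the reverse zone for an IP can put any name they like in the PTR record, so the reverse lookup alone proves nothing. DNS lookups are slow, so cache the verdict per IP rather than checking on every request.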

AlexK




msg:4340554
 6:26 pm on Jul 17, 2011 (gmt 0)

@Pfui:
Another reason to limit the majors' UAs to the majors' IPs and vice-versa

Yes. One of the reasons for me reporting it. Folks often need to see the evidence to get the point.

Personally, I do not place any trust in the `major's' IPs. My site is coded to act on behaviour, not rep. Hence reporting MSN in the first place. That was underlined by your OP in this thread all those years ago.

g1smd




msg:4340562
 7:08 pm on Jul 17, 2011 (gmt 0)

If something turns up using any well-known searchengine User Agent but the IP address is wrong, it gets no chance to do anything.

Everything else gets judged on its behaviour when accessing the site, including accesses from searchengine IPs but with non-SE User Agents.
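
Put as code, that two-step rule might look something like the sketch below. The token list and function names are made up for illustration (the tokens are simply the crawler names seen in UAs quoted in this thread), and `ip_is_verified_se` is assumed to come from whatever IP or rDNS check you already run.

```python
# Assumption: these substrings mark a UA as claiming to be a major crawler.
SE_UA_TOKENS = ("googlebot", "bingbot", "msnbot")

def claims_search_engine(ua):
    """True if the User-Agent string names one of the listed crawlers."""
    ua = (ua or "").lower()
    return any(token in ua for token in SE_UA_TOKENS)

def triage(ua, ip_is_verified_se):
    """Search-engine UA from the wrong IP is refused outright;
    everything else goes on to the behaviour checks."""
    if claims_search_engine(ua) and not ip_is_verified_se:
        return "block"
    return "judge-on-behaviour"
```

A search-engine IP sending a browser-style UA falls through to the behaviour checks, as described above.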

Pfui




msg:4340587
 8:47 pm on Jul 17, 2011 (gmt 0)

Ditto. And speaking of 403s for cloaking's-not-okay behavior...

MSN's still in the Twitter-swarming, IP-only, fake UA game. These just in, nine seconds apart:

65.52.22.174
65.52.6.105

UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

robots.txt? NO

g1smd




msg:4347468
 12:26 am on Aug 4, 2011 (gmt 0)

This went a bit nuts, today pulling many hundreds of pages, with a gap of 60 to 150 seconds between each one, from
157.55.116.nnn:

msnbot/2.0b (+http://search.msn.com/msnbot.htm)._

What's with the trailing junk?

Pfui




msg:4347500
 2:06 am on Aug 4, 2011 (gmt 0)

The underscore variant's been around for too long. I block it, with no deleterious effect.
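
One way to catch it, as a sketch: match the msnbot UA up to its closing parenthesis and see what, if anything, is left over. The regex and function name are made up for illustration and only cover the msnbot-style strings quoted in this thread.

```python
import re

# Matches "msnbot/2.0b (+http://search.msn.com/msnbot.htm)" style strings,
# including named variants like msnbot-NewsBlogs, and captures what follows.
MSNBOT_RE = re.compile(
    r"^(msnbot[\w-]*/[\w.]+ \(\+http://search\.msn\.com/msnbot\.htm\))(.*)$"
)

def trailing_junk(ua):
    """Return whatever follows the closing parenthesis ('' if nothing),
    or None if the UA is not an msnbot-style string at all."""
    m = MSNBOT_RE.match(ua)
    return m.group(2) if m else None

print(repr(trailing_junk("msnbot/2.0b (+http://search.msn.com/msnbot.htm)._")))
```

A non-empty result is the variant to block; an empty string is the clean form.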

dstiles




msg:4347947
 9:02 pm on Aug 4, 2011 (gmt 0)

Same here. I have also mentioned this UA problem and non-DNS bot IPs to bingdude: still waiting for an answer. :(

Pfui




msg:4349481
 1:43 pm on Aug 9, 2011 (gmt 0)

Talk about déjà vu all over again...

-----
Aug. 5, 2010 (OP)
Sep. 16, 2010
Nov. 10, 2010
-----
Those posts detail hits from bare MSN IPs. No UAs, no robots.txt, no REFs, no nothing. Not once. Not twice. Not even three times. Try eleven. To a single file each time.

Eerily, almost exactly a year to the day after my OP:

-----
Aug. 9, 2011
-----
This post details hits from a bare MSN IP. No UA, no robots.txt, no REF, no nothing. Not once. Not twice. Not even three times. Try eleven. To a single file.

65.52.39.40
-
08/09 0n:00:53 /dir/filename.html
08/09 0n:01:03 /dir/filename.html
08/09 0n:01:14 /dir/filename.html
08/09 0n:01:25 /dir/filename.html
08/09 0n:01:35 /dir/filename.html
08/09 0n:01:46 /dir/filename.html
08/09 0n:01:57 /dir/filename.html
08/09 0n:02:07 /dir/filename.html
08/09 0n:02:18 /dir/filename.html
08/09 0n:02:29 /dir/filename.html
08/09 0n:02:40 /dir/filename.html

All hits 403'd because no UA; also because bare MSN IP and not a bona fide MSN bot.

AGAIN.

Pfui




msg:4352873
 10:57 pm on Aug 17, 2011 (gmt 0)

Ironically, while MSN remains mum on running its 'trailing underscore' version of msnbot, the same so-called version was recently faked by a SEO domain (shame on them).

From Project Honey Pot for 72.22.68.30 [projecthoneypot.org]:

msnbot/2.0b (+http://search.msn.com/msnbot.htm)._

I found that because the same IP recently hit me from its Host name with yet another MSN fakery:

cpanel2.seor*nk.com
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

robots.txt? NO

Just more examples of why limiting major SE UAs to their respective Hosts/IPs is a Good Thing. Ditto blocking Hosts with SEO in their names...

dstiles




msg:4353208
 7:24 pm on Aug 18, 2011 (gmt 0)

I've had iPower blocked for yonks.

Last from bingdude a week-ish ago: MS are still looking into the IP and UA problems. At least we're getting feedback from MSN, which is more than we get from google: couple of visits to promote G's viewpoint and disappear again! :(

lucy24




msg:4354064
 10:58 pm on Aug 21, 2011 (gmt 0)

Holy ###. I was all set to start a thread on the Bingbot's new clothes, and here it's been going on for a year and I'm just late to the party. Here's my version anyway (cut&paste from draft because I had to pull in and edit a bunch of logs):
____

I just noticed this today; checked back and found a few more over the last couple of days. The first occurrence I found was on the 10th, and then nothing else back to mid-July when I got tired of looking. (It's too complicated for Spotlight, so you have to open the individual log files. Yawn.)

I have to put this first attempt in full:

65.52.32.124 - - [10/Aug/2011:16:49:10] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2303 "-" "-"
65.52.32.124 - - [10/Aug/2011:16:49:20] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
65.52.32.124 - - [10/Aug/2011:16:49:30] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
65.52.32.124 - - [10/Aug/2011:16:49:40] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
65.52.32.124 - - [10/Aug/2011:16:49:51] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
65.52.32.124 - - [10/Aug/2011:16:50:01] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
65.52.32.124 - - [10/Aug/2011:16:50:11] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"
65.52.32.124 - - [10/Aug/2011:16:50:21] "GET /fun/AlonzoMelissa.html HTTP/1.1" 403 2247 "-" "-"


A bit slow on the uptake, are we, bingbot? (So was I, because at the time I must have just glanced at the series of 403s and at the requested title and assumed it was my Ukrainians, forgetting that they now get a 301.) Blank UAs get slapped with an automatic 403. No use saying "But I'm from Bing! Honest! I just didn't make it to the laundromat in time!"

After that, they must have got the message, because when they tried it again there was a pattern:

207.46.199.193 - - [19/Aug/2011:21:50:55] "GET /robots.txt HTTP/1.1" 200 806 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
157.55.50.11 - - [19/Aug/2011:21:57:53] {robots.txt} "-" {bingbot}
65.52.108.24 - - [19/Aug/2011:22:09:09] {robots.txt} "-" {bingbot}
65.52.108.24 - - [19/Aug/2011:22:09:46] {one html file} "-" "-"
65.52.108.24 - - [19/Aug/2011:23:26:46] {another html file} "-" "-"


157.55.16.87 - - [20/Aug/2011:18:31:57] {robots.txt} "-" {bingbot}
157.55.16.87 - - [20/Aug/2011:18:32:29] {one html file} "-" "-"


Then, to show that they still know how to do it right:

65.52.104.21 - - [20/Aug/2011:21:22:07] {robots.txt} 301 "-" {bingbot}
65.52.104.21 - - [20/Aug/2011:21:22:08] {robots.txt} "-" {bingbot}
65.52.104.21 - - [20/Aug/2011:21:22:46] {one html file} 301 "-" {bingbot}


Logs don't say, but the 301 here is because they were aiming for a without-www URL. Note that they didn't follow the 301 and pick up the actual file. This is normal for the bingbot on my site. (I got curious. They picked up the correctly named file back in early July, so presumably said "Naah, not worth the trouble" this time around.)

157.55.16.87 - - [21/Aug/2011:02:15:02] {robots.txt} "-" {bingbot}
157.55.16.87 - - [21/Aug/2011:02:15:59] {one html file} "-" "-"

157.55.16.87 - - [21/Aug/2011:06:11:14] {robots.txt} "-" {bingbot}
157.55.16.87 - - [21/Aug/2011:06:12:11] {one html file} "-" "-"

157.55.16.87 - - [21/Aug/2011:12:30:36] {robots.txt} "-" {bingbot}
157.55.16.87 - - [21/Aug/2011:12:31:29] {one html file} "-" "-"


Does this make any sense? At all? Whatsoever?

Pfui




msg:4354169
 11:19 am on Aug 22, 2011 (gmt 0)

Actually, I've yet to have a problem with the official bingbot run from .search.msn.com. What I dislike, and deny, are MSN's cloaked -- bare-IP, no-UA -- no-robots.txt, rapid-fire, multi-hits to single files.

As re-re-reported, the 11-hit sets just keep on comin':

65.52.32.71
-
08/2n 01:42:53 /dir/filename.html
08/2n 01:43:04 /dir/filename.html
08/2n 01:43:14 /dir/filename.html
08/2n 01:43:25 /dir/filename.html
08/2n 01:43:36 /dir/filename.html
08/2n 01:43:46 /dir/filename.html
08/2n 01:43:57 /dir/filename.html
08/2n 01:44:08 /dir/filename.html
08/2n 01:44:19 /dir/filename.html
08/2n 01:44:29 /dir/filename.html
08/2n 01:44:40 /dir/filename.html

Log format (FWIW). Always the same:

65.52.32.71 - - [2n/Aug/2011:01:42:53 -0n00] "GET /dir/filename.html HTTP/1.1" 403 1486 "-" "-"
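
These sets have a regular enough shape (same IP, same file, roughly 10 to 11 seconds apart) that they can be picked out mechanically. A minimal sketch, assuming hits have already been reduced to (ip, seconds, path) tuples sorted by time: the function name and the `max_gap` and `min_hits` defaults are assumptions, and the sample offsets are the timestamps listed above counted in seconds from the first hit.

```python
def find_bursts(hits, max_gap=15, min_hits=5):
    """hits: list of (ip, time_in_seconds, path), sorted by time.
    Returns (ip, path, count) for every run of at least min_hits requests
    from one IP to one path with no gap longer than max_gap seconds."""
    runs = {}     # (ip, path) -> [time of last hit, hits in current run]
    bursts = []
    for ip, t, path in hits:
        key = (ip, path)
        run = runs.get(key)
        if run is not None and t - run[0] <= max_gap:
            run[0], run[1] = t, run[1] + 1
        else:
            if run is not None and run[1] >= min_hits:
                bursts.append((ip, path, run[1]))   # previous run ended
            runs[key] = [t, 1]
    for (ip, path), (_last, count) in runs.items():
        if count >= min_hits:
            bursts.append((ip, path, count))        # runs still open at the end
    return bursts

# Offsets in seconds of the eleven hits listed above (01:42:53 = 0).
offsets = [0, 11, 21, 32, 43, 53, 64, 75, 86, 96, 107]
hits = [("65.52.32.71", t, "/dir/filename.html") for t in offsets]
print(find_bursts(hits))
```

Tighten `max_gap` and `min_hits` to taste; the defaults here are loose so that ordinary page reloads by a human are unlikely to trip it.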

Pfui




msg:4355887
 4:53 pm on Aug 27, 2011 (gmt 0)

The following two --

bare-IP (no rDNS)
cloaked-UA
same-second
no-robots.txt

-- hits by MSN were part of a post-tweet swarm to the exact same file:

65.52.6.105
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
08/27 09:34:25 /dir2/filename.html

65.52.17.111
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
08/27 09:34:25 /dir2/filename.html

Why any major SE needs to repeatedly abuse the very rules they expect us to follow is beyond me. I guess it's 'because they can.'

Pfui




msg:4356099
 9:09 pm on Aug 28, 2011 (gmt 0)

More confirmation today; another tweeted link, additional cloaked hits:

65.52.21.72
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

65.52.6.105
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

Four minutes apart, this time.

dstiles




msg:4356380
 6:59 pm on Aug 29, 2011 (gmt 0)

I don't think that UA could be valid. Most IE UAs have loads of .NET and stuff in them. Has to be some kind of bot, whether MS or not I don't know.
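
If you want to flag that exact shape of UA, a regex will do. This is only a sketch of the heuristic above, with a made-up name, and by itself it proves nothing about who sent the request: it just marks an MSIE string that carries no tokens beyond the browser version and the Windows version.

```python
import re

# Matches only the stripped-down form: compatible; MSIE x.y; Windows NT x.y
BARE_MSIE_RE = re.compile(
    r"^Mozilla/4\.0 \(compatible; MSIE \d+\.\d+; Windows NT \d+\.\d+\)$"
)

def is_bare_msie(ua):
    """True for an MSIE User-Agent with no extra tokens (.NET CLR etc.)."""
    return bool(BARE_MSIE_RE.match(ua))

print(is_bare_msie("Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"))
```

Treat a match as one signal to combine with IP and behaviour, not as grounds for a block by itself.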

Pfui




msg:4364213
 11:18 am on Sep 19, 2011 (gmt 0)

Same ol', same ol'. Eleven cloaked hits to one file -- no rDNS, no UA, no robots.txt

65.52.33.119

09/1n 03:49:57 /dir/filename.html
09/1n 03:50:07 /dir/filename.html
09/1n 03:50:18 /dir/filename.html
09/1n 03:50:29 /dir/filename.html
09/1n 03:50:39 /dir/filename.html
09/1n 03:50:50 /dir/filename.html
09/1n 03:51:01 /dir/filename.html
09/1n 03:51:12 /dir/filename.html
09/1n 03:51:22 /dir/filename.html
09/1n 03:51:33 /dir/filename.html
09/1n 03:51:44 /dir/filename.html

Pfui




msg:4369406
 3:33 am on Oct 1, 2011 (gmt 0)

Somehow someone somewhere at Microsoft thinks this is efficient. Surrrrrrrrrrre.

70.37.161.15

09/30 1n:39:33 /dir/filename.html
09/30 1n:39:44 /dir/filename.html
09/30 1n:39:55 /dir/filename.html
09/30 1n:40:06 /dir/filename.html
09/30 1n:40:17 /dir/filename.html
09/30 1n:40:27 /dir/filename.html
09/30 1n:40:38 /dir/filename.html
09/30 1n:40:49 /dir/filename.html
09/30 1n:41:00 /dir/filename.html
09/30 1n:41:10 /dir/filename.html
09/30 1n:41:21 /dir/filename.html

Pfui




msg:4369581
 7:00 pm on Oct 1, 2011 (gmt 0)

Aside from its name, I've still no idea what it does or what it will do or why it's now no-rDNS:

157.55.16.46
msnbot-NewsBlogs/2.0b (+http://search.msn.com/msnbot.htm)
10/01 1n:08:18 /robots.txt
10/01 1n:08:19 /

157.55.17.98
msnbot-NewsBlogs/2.0b (+http://search.msn.com/msnbot.htm)
10/01 1n:08:20 /robots.txt
10/01 1n:08:22 /filename.html

dstiles




msg:4369586
 7:40 pm on Oct 1, 2011 (gmt 0)

I think it's time bingdude popped in here. Haven't seen him in the bing forum lately, either. Hope he hasn't done a google-man and hopped it! :(

Pfui




msg:4369622
 12:16 am on Oct 2, 2011 (gmt 0)

This just in --

msnbot-65-52-104-116.search.msn.com
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; SV1; .NET CLR 1.1.4325; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)

robots.txt? NO

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved