MSN's many cloaked bots. Again.

     

Pfui

11:44 pm on Aug 5, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Previously... [webmasterworld.com]

Currently, straight out of my logs...

65.52.33.73 - - [05/Aug/2010:15:45:09 -0700] "GET /dir/filename.html HTTP/1.1" 403 1468 "-" "-"

No UA, no robots.txt, no REF, no nothing. Not once. Not twice. Not even three times. Try eleven.

65.52.33.73
-
08/05 15:45:09 /dir/filename.html
08/05 15:45:20 /dir/filename.html
08/05 15:45:31 /dir/filename.html
08/05 15:45:42 /dir/filename.html
08/05 15:45:53 /dir/filename.html
08/05 15:46:03 /dir/filename.html
08/05 15:46:14 /dir/filename.html
08/05 15:46:25 /dir/filename.html
08/05 15:46:35 /dir/filename.html
08/05 15:46:46 /dir/filename.html
08/05 15:46:57 /dir/filename.html

Same poor file. All hits 403'd because no UA; also because bare MSN IP and not a bona fide MSN bot.
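
For anyone who wants to pick hits like these out of a log automatically, here is a minimal Python sketch. It assumes the log is in the Apache combined format shown in the line quoted above; the names LOG_RE, parse_line and has_blank_ua are made up for illustration and are not from any tool mentioned in this thread.

```python
import re

# Apache "combined" log format, matching the log line quoted in the post.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def parse_line(line):
    """Return the log fields as a dict, or None if the line does not match."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None


def has_blank_ua(entry):
    """True when the User-Agent field is empty or logged as a bare dash."""
    return entry["ua"] in ("", "-")


line = ('65.52.33.73 - - [05/Aug/2010:15:45:09 -0700] '
        '"GET /dir/filename.html HTTP/1.1" 403 1468 "-" "-"')
entry = parse_line(line)
print(entry["host"], entry["status"], has_blank_ua(entry))
```

The same check could be extended to the referer field, since these hits carried neither.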

Dijkgraaf

1:27 am on Oct 27, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You don't need to disallow all non-existent URLs, only the ones you are getting hits on and don't want to get hits on anymore.

jdMorgan

1:52 am on Oct 27, 2010 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I for one am not yet sure that this is a real msnbot actually controlled by MSN. The "._" at the end of the UA string and the "different" value in the "From" header push me toward "no."

Everyone posting to this thread, please be explicit as to the full UA-string, the "From" header (if you log it), the IP address, and the rDNS if you check it.

Thanks,
Jim

Dijkgraaf

2:28 am on Oct 27, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In fact two of the topics should be in other threads, as this one started about MSN cloaked bots.
To discuss the MSN bot with the "._" at the end use [webmasterworld.com...]
And maybe start a new thread on msnbot requesting /logs or /access_logs

Pfui

5:38 am on Oct 27, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



@Dijkgraaf,

1.) This thread's OP is my observing/complaining about hits from 'bare MSN IPs and not bona fide MSN bots'. Sure, threads overlap but I think subsequent posts are on-point with my OP because they're describing hits from, well, 'bare MSN IPs and not bona fide MSN bots' -- and that includes the iffy bot with the "._"

2.) You said: "add /logs and /access_log to robots.txt and you won't get any more hits on those" and similarly: "You don't need to disallow all non-existent URL's, only the ones you are getting hits on and don't want to get hits on anymore."

I wish. But as thousands of posts in this forum attest, simply including dirs and files in robots.txt is definitely not a sure-fire way to eliminate hits to same, neither by bad bots nor, increasingly, 'good' ones.

Bottom line for me --

It's not okay for MSN IPs to do anything using anything other than what I say is okay via robots.txt and .htaccess (via mod_rewrite). The former file states what's allowed (or not). The latter makes sure what's disallowed stays that way. :)

Staffa

5:47 pm on Oct 27, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's not okay for MSN IPs to do anything using anything other than what I say is okay via robots.txt and .htaccess (via mod_rewrite). The former file states what's allowed (or not). The latter makes sure what's disallowed stays that way.

Amen

and that goes for other Bots as well ;o)

Dijkgraaf

10:46 pm on Oct 28, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



@Pfui

But that's the point. You have not told MSN or other bots that they are not allowed to ask for those URLs. With the current standard, anything that isn't explicitly disallowed is allowed. I've seen some very odd requests sometimes, even from GoogleBot, including asking for some .exe files. I've put those down to either someone creating fake inbound links to get Google to scan for vulnerabilities or GoogleBot checking for compromised sites.

Yes, there are bots that won't obey those rules, but those ones you will want to ban outright anyway.

I monitor the 404s occurring on my site, and if a bot starts re-visiting one I disallow that URL in my robots.txt file or give the bot a 301 to the resource it should be requesting. Either way it solves the problem, which isn't a big one to start off with, as all that would be happening is that they are getting a 404.
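
The "anything not explicitly disallowed is allowed" rule can be checked with Python's standard urllib.robotparser. This is only a sketch: the robots.txt content below is an assumed example built from the /logs and /access_log paths mentioned earlier in the thread, and the helper name allowed is hypothetical. It shows what a compliant bot should conclude, not what any particular bot actually does.

```python
from urllib.robotparser import RobotFileParser

# Assumed example robots.txt disallowing the two paths discussed in the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /logs
Disallow: /access_log
"""


def allowed(robots_txt, user_agent, path):
    """Return True if a compliant bot with this user agent may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


for path in ("/logs", "/access_log", "/dir/filename.html"):
    print(path, allowed(ROBOTS_TXT, "msnbot", path))
```

Paths that are not listed come back as allowed, which is the point being made above: a bot that obeys the standard has no way of knowing a URL is unwelcome until it is listed.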

Samizdata

12:08 am on Oct 29, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



all that would be happening is that they are getting a 404

No, the bot also takes the robots.txt file.

If it is a genuine Microsoft bot - and I take Jim's point that the jury is still out on that - then phishing for access logs is disgraceful behaviour for a reputable company.

If it is a fake Microsoft bot using a genuine Microsoft IP then I certainly don't want it reading the robots.txt file or getting any information whatsoever.

Either way, I'd say a 403 is what it deserves.

...

Pfui

5:21 am on Oct 29, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Jim, about the iffy "._" bot -- see also my observations in this thread's predecessor, "MSN's many cloaked bots." (2009): [webmasterworld.com...] and yours in "Wanted: Crawler Quality Assurance Engineer" (2010): [webmasterworld.com...]

Pfui

2:18 am on Nov 1, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Doggoneit. MSN just ran bingbot from a bare IP:

157.55.16.229
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

robots.txt? NO

keyplyr

3:01 am on Nov 1, 2010 (gmt 0)

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month





MSN just ran bingbot from a bare IP


Thanks for the heads-up Pfui. Guess I was asleep at the wheel and blocked this one.

157.55.16.231 - - [30/Oct/2010:01:50:32 -0700] "GET www.example.com HTTP/1.1" 403 479 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

Mokita

11:04 pm on Nov 5, 2010 (gmt 0)

5+ Year Member



Just seen coming from rdns msnbot-65-52-49-143.search.msn.com (65.52.49.143), requests for one page plus css, but no images.

UA was exactly: Mozilla/4.0 (compatible

(no closing bracket)

Pfui

9:27 am on Nov 10, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Akin to the OP, and also mssg #:4203105, yet another hit-and-run with no UA, no robots.txt, no REF, no nothing. To yet another file. Eleven times. For the third time!

65.52.32.17
-
11/10 00:15:45 /dir/filename.html
11/10 00:15:56 /dir/filename.html
11/10 00:16:07 /dir/filename.html
11/10 00:16:17 /dir/filename.html
11/10 00:16:28 /dir/filename.html
11/10 00:16:39 /dir/filename.html
11/10 00:16:50 /dir/filename.html
11/10 00:17:00 /dir/filename.html
11/10 00:17:11 /dir/filename.html
11/10 00:17:22 /dir/filename.html
11/10 00:17:32 /dir/filename.html

Too weird.

Pfui

3:32 am on Nov 11, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Same ol', same ol', the dreaded 65.52., albeit a different Host and UA than previously mentioned in this thread:

msnbot-65-52-50-54.search.msn.com
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; SV1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30707)

robots.txt? NO

Just went for root.

Staffa

11:19 pm on Nov 12, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I noticed a weird event today.
A regular visitor (via G.fr search) comes and views a few pages.

Next comes 207.46.204.nn (msnbot IP) with exactly the same UA as the previous visitor which was what caught my attention.

UA for both :
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0; GTB6.6; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; HPDTDF; Tablet PC 2.0; .NET4.0C; Creative AutoUpdate v1.40.01)

My log files are not public and it is either a mighty coincidence or ... a case where msn is following the visitor from France ?

Pfui

6:36 am on Nov 19, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Fresh from one of my logs, emphasis mine:

msnbot-65-52-49-143.search.msn.com - - [18/Nov/2010:22:24:15 -0800] "GET / HTTP/1.1" 403 1468 "-" "Mozilla/4.0 (compatible"

Crawlus interruptus?

Mokita

10:22 am on Nov 19, 2010 (gmt 0)

5+ Year Member



Pfui:

Exactly the same details as my post above on Nov 6, (including the rDNS).

Pfui

7:46 pm on Nov 19, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



(slaps head) Thank you for pointing that out :) Curious how it's even the same IP. And BvsB shows the exact same oddity, along with six out of seven UAs not identified as msnbot- or bing-related from 65.52.49.143: [botsvsbrowsers.com...]

Unfortunately that IP's not the only one using the truncated UA. Here are 22 more: [botsvsbrowsers.com...]

Clearly they know what they're doing. But until I know, non-search UAs from .search.msn.com Hosts/IPs will just keep getting 403s.

Transparency. What a concept. (Note the date...) [bing.com:80...]

Pfui

9:54 am on Nov 20, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



More bare IPs from 65.52. Hit in a post-tweet swarm, 20 minutes apart. All using:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

65.52.4.249
65.52.2.10
65.52.17.79

robots.txt? NO

Pfui

2:24 am on Nov 28, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Ten same-second hits to the same site. So much for MSN stating msnbot supports "Crawl-delay" in robots.txt: Crawl Delay And The Bing Crawler, MSNBot [bing.com]

msnbot-65-55-190-9.search.msn.com
msnbot-NewsBlogs/2.0b (+http://search.msn.com/msnbot.htm)
11/27 03:45:34 /robots.txt
11/27 03:45:34 /file1

65.55.190.19
msnbot-NewsBlogs/2.0b (+http://search.msn.com/msnbot.htm)
11/27 03:45:34 /robots.txt
11/27 03:45:34 /robots.txt
11/27 03:45:34 /file2
11/27 03:45:34 /file3
11/27 03:45:34 /robots.txt
11/27 03:45:34 /robots.txt
11/27 03:45:34 /file4
11/27 03:45:34 /file5
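
A quick way to count same-second hits like these is to tally (host, timestamp) pairs. The Python sketch below uses the hits listed in this post as sample data; the function name peak_hits_per_second is made up for illustration. Any host whose peak is above 1 cannot be honouring a Crawl-delay of one second or more.

```python
from collections import Counter

# (host, timestamp, path) tuples copied from the listing in the post.
HITS = [
    ("msnbot-65-55-190-9.search.msn.com", "11/27 03:45:34", "/robots.txt"),
    ("msnbot-65-55-190-9.search.msn.com", "11/27 03:45:34", "/file1"),
    ("65.55.190.19", "11/27 03:45:34", "/robots.txt"),
    ("65.55.190.19", "11/27 03:45:34", "/robots.txt"),
    ("65.55.190.19", "11/27 03:45:34", "/file2"),
    ("65.55.190.19", "11/27 03:45:34", "/file3"),
    ("65.55.190.19", "11/27 03:45:34", "/robots.txt"),
    ("65.55.190.19", "11/27 03:45:34", "/robots.txt"),
    ("65.55.190.19", "11/27 03:45:34", "/file4"),
    ("65.55.190.19", "11/27 03:45:34", "/file5"),
]


def peak_hits_per_second(hits):
    """Map each host to the largest number of hits it made in any one second."""
    per_second = Counter((host, stamp) for host, stamp, _path in hits)
    peaks = {}
    for (host, _stamp), count in per_second.items():
        peaks[host] = max(peaks.get(host, 0), count)
    return peaks


peaks = peak_hits_per_second(HITS)
for host, peak in peaks.items():
    print(host, peak)
```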

Pfui

5:40 am on Dec 16, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Same Hostname caribguy reported above. At least asked for robots.txt and heeded same. But crawled 20 documents whereas bing/msnbot usually crawls 1 or 2 per server session. Also, the same weird UA was in play:

gig4-2.tuk2f-gsr-a.us.msn.net
msnbot/2.0b (+http://search.msn.com/msnbot.htm)._

dstiles

9:21 pm on Dec 16, 2010 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



I block all those UAs. Apart from anything else they seem to come from non-rDNS IPs.

Pfui

9:09 am on Jan 17, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



The odd one came around on the 16th --

msnbot-65-55-55-205.search.msn.com
msnbot/2.0b (+http://search.msn.com/msnbot.htm)._

robots.txt? Yes but... "Crawl-delay" directive ignored.

-- and alternated with the usual suspects in a 15-minute period with hits from Hosts and bare IPs. Here's a partial listing:

msnbot-65-52-110-69.search.msn.com
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

msnbot-207-46-194-144.search.msn.com
msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)

157.55.16.230
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

FWIW

incrediBILL

10:19 pm on Jan 17, 2011 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I'm guessing that bingbot from bare IPs is either:

- running from a test lab
- run via some MS cloud services
- being spoofed via some MS proxy

Doesn't matter, if the IP doesn't resolve to a crawler rDNS, I block it.

FYI - these bare IPs are showing up on Project Honeypot: [projecthoneypot.org...]
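
The "does it resolve to a crawler rDNS" test is usually done as a forward-confirmed reverse DNS lookup: reverse-resolve the IP, check the hostname, then resolve that hostname forward and confirm it maps back to the same IP. Below is a minimal Python sketch of that idea, not the method anyone in this thread actually runs. The function name is_msn_crawler is hypothetical, and treating the ".search.msn.com" suffix seen in this thread's hostnames as the mark of a real crawler is an assumption. The resolver functions are passed in as parameters so the logic can be tried without touching live DNS.

```python
import socket


def is_msn_crawler(ip,
                   reverse=socket.gethostbyaddr,
                   forward=socket.gethostbyname_ex,
                   suffix=".search.msn.com"):
    """Forward-confirmed reverse DNS check for a single IPv4 address."""
    try:
        hostname = reverse(ip)[0]          # PTR lookup
    except OSError:
        return False                       # no rDNS at all: a "bare" IP
    if not hostname.lower().endswith(suffix):
        return False                       # resolves, but not to a crawler host
    try:
        addresses = forward(hostname)[2]   # A lookup on the claimed hostname
    except OSError:
        return False
    return ip in addresses                 # must map back to the same IP


# Stand-in resolvers for illustration only; real use would rely on the defaults.
def fake_reverse(ip):
    table = {"65.52.49.143": "msnbot-65-52-49-143.search.msn.com"}
    if ip not in table:
        raise OSError("no PTR record")
    return (table[ip], [], [ip])


def fake_forward(hostname):
    return (hostname, [], ["65.52.49.143"])


print(is_msn_crawler("65.52.49.143", fake_reverse, fake_forward))
print(is_msn_crawler("157.55.16.229", fake_reverse, fake_forward))
```

Note that this only says whether the host is named like a crawler. As posts above show, a .search.msn.com host can still send a non-crawler UA, so the UA needs checking separately.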

dstiles

9:45 pm on Feb 7, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



Microsoft seems to have legitimized a few more bot IPs by setting up proper rDNS for them. The range below may not be ALL of the new ones but seems to be correct for 157.55.116 (although the bot is still coming round with the "invalid" underscore UA and ignoring robots.txt on this and other ranges).

157.55.116.7 - 157.55.116.97

Bots still hitting hard but with no valid rDNS are in the range:

157.55.16.0 - 157.55.18.255
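
To test whether an address falls inside a start-end range like the one above, or to turn the range into CIDR blocks for a firewall or .htaccess rule, Python's standard ipaddress module is enough. A minimal sketch follows; the names NO_RDNS_RANGE and in_range are made up for illustration.

```python
import ipaddress

# The no-rDNS range quoted in the post, as (first, last) addresses.
NO_RDNS_RANGE = (
    ipaddress.ip_address("157.55.16.0"),
    ipaddress.ip_address("157.55.18.255"),
)


def in_range(ip, bounds=NO_RDNS_RANGE):
    """True if the address lies between the two bounds, inclusive."""
    low, high = bounds
    return low <= ipaddress.ip_address(ip) <= high


# Express the same range as CIDR blocks.
cidr_blocks = [str(net) for net in
               ipaddress.summarize_address_range(*NO_RDNS_RANGE)]

print(in_range("157.55.16.229"))
print(in_range("157.55.116.7"))
print(cidr_blocks)
```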

AlexK

4:36 am on Jun 24, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



4 days ago the msnbot started hitting my site at up to 12 hits / second from multiple IPs; thousands & thousands of attempted scrapes. I've had routines in place for years to auto-stop bad-bots. Those routines have been augmented to record & report them (both to the tornvall RBL & also on a public webpage as a permanent record). Here are the reports; so far, the same behaviour for 4 days on the trot:

24 June:
65.52.110.75 [forums.modem-help.co.uk] : msnbot-65.52.110.75.search.msn.com : max: 9 / sec : total: 26 pages
207.46.204.238 [forums.modem-help.co.uk] : msnbot-207.46.204.238.search.msn.com : max: 11 / sec : total: 2,214 pages
207.46.199.43 [forums.modem-help.co.uk] : msnbot-207.46.199.43.search.msn.com : max: 8 / sec : total: 548 pages

23 June:
207.46.13.98 [forums.modem-help.co.uk] : msnbot-207.46.13.98.search.msn.com : max: 12 / sec : total: 3,272 pages

22 June:
207.46.13.98 [forums.modem-help.co.uk] : msnbot-207.46.13.98.search.msn.com : max: 12 / sec : total: 1,395 pages
65.52.110.64 [forums.modem-help.co.uk] : msnbot-65.52.110.64.search.msn.com : max: 7 / sec : total: 25 pages
65.52.110.72 [forums.modem-help.co.uk] : msnbot-65.52.110.72.search.msn.com : max: 12 / sec : total: 1,888 pages
207.46.195.242 [forums.modem-help.co.uk] : msnbot-207.46.195.242.search.msn.com : max: 12 / sec : total: 1,201 pages

21 June:
65-52-110-72 [forums.modem-help.co.uk] : msnbot-65-52-110-72.search.msn.com : max: 12 / sec : total: 1,424 pages
207-46-195-242 [forums.modem-help.co.uk] : msnbot-207-46-195-242.search.msn.com : max: 12 / sec : total: 926 pages
207-46-199-38 [forums.modem-help.co.uk] : msnbot-207-46-199-38.search.msn.com : max: 8 / sec : total: 26 pages

A full report has also been auto-emailed to the MS abuse address each day for each IP. Frankly, I do not expect any action.

dstiles

9:02 pm on Jun 24, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



Had a lot of hits from msnbot recently - coming round on some sites once a day - but behaviour seems reasonable.

That's specific known msnbot UA plus IP, of course. There are always a few non-bot UAs or badly formatted ones, plus the odd bad rDNS IP (your IP examples are good).

AlexK

11:16 pm on Jun 24, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



@dstiles

It's a Microsoft ASN for each IP.

I've seen reports in the past for such behaviour from the msnbot, back in the day before the bingbot, but this is the first time that I've picked it up personally. The stop-block-report routines do not pick up the UA, only info (such as port number) that the hostmaster requires to confirm the abuse. Not that I expect--or have received--anything more than an acknowledgement of receipt from MS (abuse@hotmail.com).

dstiles

9:27 pm on Jun 25, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



Not sure what you mean by stop-block-report. Is this something on your machine that reports problems?

I get my info from the site logs (I also record abuses in a separate log): that shows UA and IP. If I suspect a problem I check the rDNS using the linux Network Tools or online robtex.

I've never bothered to report msnbot/bingbot abuse.

AlexK

6:36 am on Jun 26, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



dstiles:
Not sure what you mean by stop-block-report. Is this something on your machine that reports problems?

The original Bad-Bot-Blocker [forums.modem-help.co.uk], with added vitamins! Originated on WebmasterWorld [webmasterworld.com] & developed by myself. The bot-blocker is the `stop-block' part. I've added the capability to record the abuse, auto-report (both to abuse email + to RBL site) & provide permanent evidence. I'm a bit proud of it.

The constant scrapes have now almost entirely gone, although that could be because each IP now gets an auto-block. It still begs the question: what on earth are MS doing?

tangor

6:59 am on Jun 26, 2011 (gmt 0)

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



4 days ago the msnbot started hitting my site at up to 12 hits / second from multiple IPs; thousands & thousands of attempted scrapes. I've had routines in place for years to auto-stop bad-bots.


Is Bing a "bad bot"? Just asking. Google is proving short on supply and Bing on the rise... which is the "bad bot"?

Just curious.
This 152 message thread spans 6 pages.
 
