Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 152 message thread spans 6 pages: < < 152 ( 1 2 [3] 4 5 6 > >     
MSN's many cloaked bots. Again.
Pfui




msg:4182832
 11:44 pm on Aug 5, 2010 (gmt 0)

Previously... [webmasterworld.com]

Currently, straight out of my logs...

65.52.33.73 - - [05/Aug/2010:15:45:09 -0700] "GET /dir/filename.html HTTP/1.1" 403 1468 "-" "-"

No UA, no robots.txt, no REF, no nothing. Not once. Not twice. Not even three times. Try eleven.

65.52.33.73
-
08/05 15:45:09 /dir/filename.html
08/05 15:45:20 /dir/filename.html
08/05 15:45:31 /dir/filename.html
08/05 15:45:42 /dir/filename.html
08/05 15:45:53 /dir/filename.html
08/05 15:46:03 /dir/filename.html
08/05 15:46:14 /dir/filename.html
08/05 15:46:25 /dir/filename.html
08/05 15:46:35 /dir/filename.html
08/05 15:46:46 /dir/filename.html
08/05 15:46:57 /dir/filename.html

Same poor file. All hits 403'd because no UA; also because bare MSN IP and not a bona fide MSN bot.
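
For anyone who wants to pull such hits out of their own logs, here is a minimal Python sketch. It assumes the Apache "combined" log format shown in the sample line above; the names LOG_RE, parse_line and has_no_ua are made up for illustration and are not from any tool mentioned in this thread.

```python
import re

# Apache "combined" log format, matching the sample line quoted in the post.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def parse_line(line):
    """Return the log fields as a dict, or None if the line does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

def has_no_ua(fields):
    """True when the User-Agent field is empty or just the '-' placeholder."""
    return fields["ua"].strip() in ("", "-")

sample = ('65.52.33.73 - - [05/Aug/2010:15:45:09 -0700] '
          '"GET /dir/filename.html HTTP/1.1" 403 1468 "-" "-"')
fields = parse_line(sample)
print(fields["host"], fields["status"], has_no_ua(fields))
```

Whether a blank UA alone should earn a 403 is a policy choice; the sketch only flags it.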

 

Dijkgraaf




msg:4222537
 1:27 am on Oct 27, 2010 (gmt 0)

You don't need to disallow all non-existent URL's, only the ones you are getting hits on and don't want to get hits on anymore.

jdMorgan




msg:4222545
 1:52 am on Oct 27, 2010 (gmt 0)

I for one am not yet sure that this is a real msnbot actually controlled by MSN. The "._" at the end of the UA string and the "different" value in the "From" header push me toward "no."

Everyone posting to this thread, please be explicit as to the full UA-string, the "From" header (if you log it), the IP address, and the rDNS if you check it.

Thanks,
Jim

Dijkgraaf




msg:4222552
 2:28 am on Oct 27, 2010 (gmt 0)

In fact two of the topics should be in other threads, as this one started about MSN cloaked bots.
To discuss the MSN bot with the "._" at the end use [webmasterworld.com...]
And maybe start a new thread on msnbot requesting /logs or /access_logs

Pfui




msg:4222604
 5:38 am on Oct 27, 2010 (gmt 0)

@Dijkgraaf,

1.) This thread's OP is my observing/complaining about hits from 'bare MSN IPs and not bona fide MSN bots'. Sure, threads overlap but I think subsequent posts are on-point with my OP because they're describing hits from, well, 'bare MSN IPs and not bona fide MSN bots' -- and that includes the iffy bot with the "._"

2.) You said: "add /logs and /access_log to robots.txt and you won't get any more hits on those" and similarly: "You don't need to disallow all non-existent URL's, only the ones you are getting hits on and don't want to get hits on anymore."

I wish. But as thousands of posts in this forum attest, simply including dirs and files in robots.txt is definitely not a sure-fire way to eliminate hits to same, neither by bad bots nor, increasingly, 'good' ones.

Bottom line for me --

It's not okay for MSN IPs to do anything using anything other than what I say is okay to via robots.txt and .htaccess (via mod_rewrite). The former file states what's allowed (or not). The latter makes sure what's disallowed stays that way. :)

Staffa




msg:4222895
 5:47 pm on Oct 27, 2010 (gmt 0)

It's not okay for MSN IPs to do anything using anything other than what I say is okay to via robots.txt and .htaccess (via mod_rewrite). The former file states what's allowed (or not). The latter makes sure what's disallowed stays that way.

Amen

and that goes for other Bots as well ;o)

Dijkgraaf




msg:4223623
 10:46 pm on Oct 28, 2010 (gmt 0)

@Pfui

But that's the point. You have not told MSN or other bots that they are not allowed to ask for those URL's. With the current standard, anything that isn't explicitly disallowed is allowed. I've seen some very odd requests sometimes, even from GoogleBot, including asking for some .exe files. I've put those down to either someone creating fake inbound links to get Google to scan for vulnerabilities or GoogleBot checking for compromised sites.

Yes, there are bots that won't obey those rules, but those ones you will want to ban outright anyway.

I monitor the 404's occurring on my site, and if a bot starts re-visiting them I disallow it in my robots.txt file or give it a 301 to the resource they should be requesting. Either way it solves the problem, which isn't a big one to start off with, as all that would be happening is that they are getting a 404.
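
The default-allow rule described in this post can be checked with Python's standard urllib.robotparser. This is only a sketch: the two Disallow paths are the ones mentioned earlier in the thread, the third path is a made-up example, and it shows how a compliant parser reads the file, not how msnbot actually behaves.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that disallows only the two paths discussed in the thread.
rules = """\
User-agent: *
Disallow: /logs
Disallow: /access_log
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Explicitly disallowed paths are refused to a compliant bot...
print(rp.can_fetch("msnbot", "/logs"))
print(rp.can_fetch("msnbot", "/access_log"))
# ...but anything not listed is allowed by default.
print(rp.can_fetch("msnbot", "/some/odd/file.exe"))
```

A bot that ignores robots.txt is unaffected by any of this, which is why the file is usually paired with server-side blocking.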

Samizdata




msg:4223648
 12:08 am on Oct 29, 2010 (gmt 0)

all that would be happening is that they are getting a 404

No, the bot also takes the robots.txt file.

If it is a genuine Microsoft bot - and I take Jim's point that the jury is still out on that - then phishing for access logs is disgraceful behaviour for a reputable company.

If it is a fake Microsoft bot using a genuine Microsoft IP then I certainly don't want it reading the robots.txt file or getting any information whatsoever.

Either way, I'd say a 403 is what it deserves.

...

Pfui




msg:4223723
 5:21 am on Oct 29, 2010 (gmt 0)

Jim, about the iffy "._" bot -- see also my observations in this thread's predecessor, "MSN's many cloaked bots." (2009): [webmasterworld.com...] and yours in "Wanted: Crawler Quality Assurance Engineer" (2010): [webmasterworld.com...]

Pfui




msg:4224716
 2:18 am on Nov 1, 2010 (gmt 0)

Doggoneit. MSN just ran bingbot from a bare IP:

157.55.16.229
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

robots.txt? NO

keyplyr




msg:4224734
 3:01 am on Nov 1, 2010 (gmt 0)



MSN just ran bingbot from a bare IP


Thanks for the heads-up Pfui. Guess I was asleep at the wheel and blocked this one.

157.55.16.231 - - [30/Oct/2010:01:50:32 -0700] "GET www.example.com HTTP/1.1" 403 479 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

Mokita




msg:4227351
 11:04 pm on Nov 5, 2010 (gmt 0)

Just seen coming from rdns msnbot-65-52-49-143.search.msn.com (65.52.49.143), requests for one page plus css, but no images.

UA was exactly: Mozilla/4.0 (compatible

(no closing bracket)

Pfui




msg:4228756
 9:27 am on Nov 10, 2010 (gmt 0)

Akin to the OP, and also msg:4203105, yet another hit-and-run with no UA, no robots.txt, no REF, no nothing. To yet another file. Eleven times. For the third time!

65.52.32.17
-
11/10 00:15:45 /dir/filename.html
11/10 00:15:56 /dir/filename.html
11/10 00:16:07 /dir/filename.html
11/10 00:16:17 /dir/filename.html
11/10 00:16:28 /dir/filename.html
11/10 00:16:39 /dir/filename.html
11/10 00:16:50 /dir/filename.html
11/10 00:17:00 /dir/filename.html
11/10 00:17:11 /dir/filename.html
11/10 00:17:22 /dir/filename.html
11/10 00:17:32 /dir/filename.html

Too weird.

Pfui




msg:4229050
 3:32 am on Nov 11, 2010 (gmt 0)

Same ol', same ol', the dreaded 65.52., albeit a different Host and UA than previously mentioned in this thread:

msnbot-65-52-50-54.search.msn.com
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; SV1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30707)

robots.txt? NO

Just went for root.

Staffa




msg:4229767
 11:19 pm on Nov 12, 2010 (gmt 0)

I noticed a weird event today.
A regular visitor (via G.fr search) comes and views a few pages.

Next comes 207.46.204.nn (msnbot IP) with exactly the same UA as the previous visitor which was what caught my attention.

UA for both :
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0; GTB6.6; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; HPDTDF; Tablet PC 2.0; .NET4.0C; Creative AutoUpdate v1.40.01)

My log files are not public and it is either a mighty coincidence or ... a case where msn is following the visitor from France ?

Pfui




msg:4232259
 6:36 am on Nov 19, 2010 (gmt 0)

Fresh from one of my logs, emphasis mine:

msnbot-65-52-49-143.search.msn.com - - [18/Nov/2010:22:24:15 -0800] "GET / HTTP/1.1" 403 1468 "-" "Mozilla/4.0 (compatible"

Crawlus interruptus?

Mokita




msg:4232311
 10:22 am on Nov 19, 2010 (gmt 0)

Pfui:

Exactly the same details as my post above on Nov 6, (including the rDNS).

Pfui




msg:4232499
 7:46 pm on Nov 19, 2010 (gmt 0)

(slaps head) Thank you for pointing that out:) Curious how it's even the same IP. And BvsB shows the exact same oddity, along with six out of seven UAs not identified as msnbot- or bing-related from 65.52.49.143: [botsvsbrowsers.com...]

Unfortunately that IP's not the only one using the truncated UA. Here are 22 more: [botsvsbrowsers.com...]

Clearly they know what they're doing. But until I know, non-search UAs from .search.msn.com Hosts/IPs will just keep getting 403s.
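
As a rough illustration of that rule, here is a Python sketch. It assumes the server has already resolved the hostname, and the token list and the status_for helper are hypothetical, built only from the UAs quoted in this thread. The actual blocking discussed here is done in .htaccess via mod_rewrite, which this does not reproduce.

```python
# Assumption: tokens taken from the crawler UAs quoted in this thread.
KNOWN_CRAWLER_TOKENS = ("msnbot", "bingbot")

def status_for(host, user_agent):
    """Return 403 for a non-search UA arriving from a .search.msn.com host."""
    from_msn_host = host.lower().endswith(".search.msn.com")
    names_crawler = any(tok in user_agent.lower() for tok in KNOWN_CRAWLER_TOKENS)
    if from_msn_host and not names_crawler:
        return 403
    return 200

print(status_for("msnbot-65-52-49-143.search.msn.com", "Mozilla/4.0 (compatible"))
```

Note that the odd "._" UA still contains "msnbot", so a substring test like this one would let it through; an exact-match list would be stricter.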

Transparency. What a concept. (Note the date...) [bing.com:80...]

Pfui




msg:4232679
 9:54 am on Nov 20, 2010 (gmt 0)

More bare IPs from 65.52. Hit in a post-tweet swarm, 20 minutes apart. All using:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

65.52.4.249
65.52.2.10
65.52.17.79

robots.txt? NO

Pfui




msg:4235821
 2:24 am on Nov 28, 2010 (gmt 0)

Ten same-second hits to the same site. So much for MSN stating msnbot supports "Crawl-delay" in robots.txt: Crawl Delay And The Bing Crawler, MSNBot [bing.com]

msnbot-65-55-190-9.search.msn.com
msnbot-NewsBlogs/2.0b (+http://search.msn.com/msnbot.htm)
11/27 03:45:34 /robots.txt
11/27 03:45:34 /file1

65.55.190.19
msnbot-NewsBlogs/2.0b (+http://search.msn.com/msnbot.htm)
11/27 03:45:34 /robots.txt
11/27 03:45:34 /robots.txt
11/27 03:45:34 /file2
11/27 03:45:34 /file3
11/27 03:45:34 /robots.txt
11/27 03:45:34 /robots.txt
11/27 03:45:34 /file4
11/27 03:45:34 /file5
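
A small Python sketch for spotting this pattern: it counts requests per host per second, using the ten hits listed above as input. The function names are made up for illustration. Any count above 1 in a single second is incompatible with a Crawl-delay of one second or more.

```python
from collections import Counter

# (host, timestamp, path) tuples copied from the listing in this post.
hits = [
    ("msnbot-65-55-190-9.search.msn.com", "11/27 03:45:34", "/robots.txt"),
    ("msnbot-65-55-190-9.search.msn.com", "11/27 03:45:34", "/file1"),
    ("65.55.190.19", "11/27 03:45:34", "/robots.txt"),
    ("65.55.190.19", "11/27 03:45:34", "/robots.txt"),
    ("65.55.190.19", "11/27 03:45:34", "/file2"),
    ("65.55.190.19", "11/27 03:45:34", "/file3"),
    ("65.55.190.19", "11/27 03:45:34", "/robots.txt"),
    ("65.55.190.19", "11/27 03:45:34", "/robots.txt"),
    ("65.55.190.19", "11/27 03:45:34", "/file4"),
    ("65.55.190.19", "11/27 03:45:34", "/file5"),
]

def same_second_counts(hits):
    """Count requests per (host, timestamp) pair."""
    return Counter((host, ts) for host, ts, _path in hits)

def violations(hits, limit=1):
    """Return the (host, timestamp) pairs with more than `limit` requests."""
    return {key: n for key, n in same_second_counts(hits).items() if n > limit}

for (host, ts), n in violations(hits).items():
    print(host, ts, n)
```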

Pfui




msg:4242980
 5:40 am on Dec 16, 2010 (gmt 0)

Same Hostname caribguy reported above. At least asked for robots.txt and heeded same. But crawled 20 documents whereas bing/msnbot usually crawls 1 or 2 per server session. Also, the same weird UA was in play:

gig4-2.tuk2f-gsr-a.us.msn.net
msnbot/2.0b (+http://search.msn.com/msnbot.htm)._

dstiles




msg:4243326
 9:21 pm on Dec 16, 2010 (gmt 0)

I block all those UAs. Apart from anything else they seem to come from non-rDNS IPs.

Pfui




msg:4254124
 9:09 am on Jan 17, 2011 (gmt 0)

The odd one came around on the 16th --

msnbot-65-55-55-205.search.msn.com
msnbot/2.0b (+http://search.msn.com/msnbot.htm)._

robots.txt? Yes but... "Crawl-delay" directive ignored.

-- and alternated with the usual suspects in a 15-minute period with hits from Hosts and bare IPs. Here's a partial listing:

msnbot-65-52-110-69.search.msn.com
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

msnbot-207-46-194-144.search.msn.com
msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)

157.55.16.230
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

FWIW

incrediBILL




msg:4254400
 10:19 pm on Jan 17, 2011 (gmt 0)

I'm guessing that bingbot from bare IPs is either:

- running from a test lab
- run via some MS cloud services
- being spoofed via some MS proxy

Doesn't matter, if the IP doesn't resolve to a crawler rDNS, I block it.
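
One common way to implement that check is forward-confirmed reverse DNS: look up the hostname for the IP, confirm it ends in the crawler domain, then resolve that hostname and confirm it maps back to the same IP. The Python sketch below assumes the ".search.msn.com" suffix seen in hostnames quoted in this thread. The resolvers are injectable so the example runs offline, and the fake lookup table is illustrative, not real DNS data.

```python
import socket

CRAWLER_SUFFIX = ".search.msn.com"  # suffix seen in hostnames in this thread

def is_verified_crawler(ip, reverse=None, forward=None, suffix=CRAWLER_SUFFIX):
    """Forward-confirmed reverse DNS: IP -> hostname -> back to the same IP."""
    # Default to real DNS lookups when no resolver is supplied.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda name: socket.gethostbyname_ex(name)[2])
    try:
        host = reverse(ip)
        if not host.lower().endswith(suffix):
            return False
        return ip in forward(host)
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False

# Offline demonstration with stand-in resolvers (illustrative data only).
FAKE_PTR = {"65.52.110.69": "msnbot-65-52-110-69.search.msn.com"}
FAKE_A = {"msnbot-65-52-110-69.search.msn.com": ["65.52.110.69"]}

def fake_reverse(ip):
    if ip not in FAKE_PTR:
        raise OSError("no PTR record")  # what a bare IP looks like
    return FAKE_PTR[ip]

def fake_forward(name):
    return FAKE_A.get(name, [])

print(is_verified_crawler("65.52.110.69", fake_reverse, fake_forward))
print(is_verified_crawler("157.55.16.229", fake_reverse, fake_forward))
```

Called without the two resolver arguments, the function performs live DNS lookups, so results should be cached in any real deployment.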

FYI - these bare IPs are showing up on Project Honeypot: [projecthoneypot.org...]

dstiles




msg:4263929
 9:45 pm on Feb 7, 2011 (gmt 0)

Microsoft seems to have legitimized a few more bot IPs by setting up proper rDNS for them. The range below may not be ALL of the new ones but seems to be correct for 157.55.116 (although the bot is still coming round with the "invalid" underscore UA and ignoring robots.txt on this and other ranges).

157.55.116.7 - 157.55.116.97

Bots still hitting hard but with no valid rDNS are in the range:

157.55.16.0 - 157.55.18.255
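
For anyone wanting to test an address against these two ranges, a short Python sketch using the standard ipaddress module follows. The constant and function names are made up for illustration; the ranges are the ones given in this post.

```python
import ipaddress

# Ranges as reported in this post.
HAS_RDNS = ("157.55.116.7", "157.55.116.97")
NO_RDNS = ("157.55.16.0", "157.55.18.255")

def in_range(ip, first, last):
    """True if ip lies between first and last, inclusive."""
    addr = ipaddress.ip_address(ip)
    return ipaddress.ip_address(first) <= addr <= ipaddress.ip_address(last)

def as_cidrs(first, last):
    """Express an inclusive range as a list of CIDR blocks."""
    return [str(net) for net in ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last))]

print(in_range("157.55.16.229", *NO_RDNS))
print(as_cidrs(*NO_RDNS))
```

The CIDR form is handy for firewall or .htaccess rules that do not accept start-end ranges.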

AlexK




msg:4330348
 4:36 am on Jun 24, 2011 (gmt 0)

4 days ago the msnbot started hitting my site at up to 12 hits / second from multiple IPs; thousands & thousands of attempted scrapes. I've had routines in place for years to auto-stop bad-bots. Those routines have been augmented to record & report them (both to the tornvall RBL & also on a public webpage as a permanent record). Here are the reports; so far, the same behaviour for 4 days on the trot:

24 June:
65.52.110.75 [forums.modem-help.co.uk] : msnbot-65.52.110.75.search.msn.com : max: 9 / sec : total: 26 pages
207.46.204.238 [forums.modem-help.co.uk] : msnbot-207.46.204.238.search.msn.com : max: 11 / sec : total: 2,214 pages
207.46.199.43 [forums.modem-help.co.uk] : msnbot-207.46.199.43.search.msn.com : max: 8 / sec : total: 548 pages

23 June:
207.46.13.98 [forums.modem-help.co.uk] : msnbot-207.46.13.98.search.msn.com : max: 12 / sec : total: 3,272 pages

22 June:
207.46.13.98 [forums.modem-help.co.uk] : msnbot-207.46.13.98.search.msn.com : max: 12 / sec : total: 1,395 pages
65.52.110.64 [forums.modem-help.co.uk] : msnbot-65.52.110.64.search.msn.com : max: 7 / sec : total: 25 pages
65.52.110.72 [forums.modem-help.co.uk] : msnbot-65.52.110.72.search.msn.com : max: 12 / sec : total: 1,888 pages
207.46.195.242 [forums.modem-help.co.uk] : msnbot-207.46.195.242.search.msn.com : max: 12 / sec : total: 1,201 pages

21 June:
65-52-110-72 [forums.modem-help.co.uk] : msnbot-65-52-110-72.search.msn.com : max: 12 / sec : total: 1,424 pages
207-46-195-242 [forums.modem-help.co.uk] : msnbot-207-46-195-242.search.msn.com : max: 12 / sec : total: 926 pages
207-46-199-38 [forums.modem-help.co.uk] : msnbot-207-46-199-38.search.msn.com : max: 8 / sec : total: 26 pages

A full report has also been auto-emailed to the MS abuse address each day for each IP. Frankly, I do not expect any action.
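
The "max / sec" and "total" figures in a report like this can be derived from a list of (IP, timestamp) events. The Python sketch below is not the bot-blocker referred to in this post, only an illustration of the counting step; the function name and the sample events are invented.

```python
from collections import Counter, defaultdict

def max_per_second(events):
    """events: iterable of (ip, unix_second).

    Returns {ip: (max_hits_in_one_second, total_hits)}.
    """
    per_ip = defaultdict(Counter)
    for ip, second in events:
        per_ip[ip][int(second)] += 1
    return {ip: (max(c.values()), sum(c.values())) for ip, c in per_ip.items()}

# Invented sample: three hits in one second, then one more a second later.
events = [
    ("207.46.13.98", 1308800000),
    ("207.46.13.98", 1308800000),
    ("207.46.13.98", 1308800000),
    ("207.46.13.98", 1308800001),
    ("65.52.110.64", 1308800000),
]
for ip, (peak, total) in max_per_second(events).items():
    print(ip, "max:", peak, "/ sec", "total:", total)
```

A blocker would compare the peak against a threshold and act once it is exceeded; the threshold itself is a site-specific choice.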

dstiles




msg:4330766
 9:02 pm on Jun 24, 2011 (gmt 0)

Had a lot of hits from msnbot recently - coming round on some sites once a day - but behaviour seems reasonable.

That's specific known msnbot UA plus IP, of course. There are always a few non-bot UAs or badly formatted ones, plus the odd bad rDNS IP (your IP examples are good).

AlexK




msg:4330817
 11:16 pm on Jun 24, 2011 (gmt 0)

@dstiles

It's a Microsoft ASN for each IP.

I've seen reports in the past for such behaviour from the msnbot, back in the day before the bingbot, but this is the first time that I've picked it up personally. The stop-block-report routines do not pick up the UA, only info (such as port number) that the hostmaster requires to confirm the abuse. Not that I expect--or have received--anything more than an acknowledgement of receipt from MS (abuse@hotmail.com).

dstiles




msg:4331082
 9:27 pm on Jun 25, 2011 (gmt 0)

Not sure what you mean by stop-block-report. Is this something on your machine that reports problems?

I get my info from the site logs (I also record abuses in a separate log): that shows UA and IP. If I suspect a problem I check the rDNS using the linux Network Tools or online robtex.

I've never bothered to report msnbot/bingbot abuse.

AlexK




msg:4331153
 6:36 am on Jun 26, 2011 (gmt 0)

dstiles:
Not sure what you mean by stop-block-report. Is this something on your machine that reports problems?

The original Bad-Bot-Blocker [forums.modem-help.co.uk], with added vitamins! Originated on WebmasterWorld [webmasterworld.com] & developed by myself. The bot-blocker is the `stop-block' part. I've added the capability to record the abuse, auto-report (both to abuse email + to RBL site) & provide permanent evidence. I'm a bit proud of it.

The constant scrapes have now almost entirely gone, although that could be because each IP now gets an auto-block. It still begs the question: what on earth are MS doing?

tangor




msg:4331157
 6:59 am on Jun 26, 2011 (gmt 0)

4 days ago the msnbot started hitting my site at up to 12 hits / second from multiple IPs; thousands & thousands of attempted scrapes. I've had routines in place for years to auto-stop bad-bots.


Is Bing a "bad bot"? Just asking. Google is proving short on supply and Bing on the rise... which is the "bad bot"?

Just curious.

Mokita




msg:4331164
 8:40 am on Jun 26, 2011 (gmt 0)

@AlexK

Bingbot/MsnBot claim to honour a "Crawl-delay" directive in robots.txt :

[bing.com...]

Did you try that method before/instead of using the sledge-hammer approach?

If you had, your complaints to MS Abuse would probably carry more weight.

... Just a suggestion (as I haven't experienced the abuse you mention but I do have a "Crawl-delay" setting)

<edit> BingBot is crawling our sites heavily - but honouring the Crawl-delay </edit>

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved