Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 58 message thread spans 2 pages.
msnbot-media
What's a nice robot like you doing in a place like this?
lucy24




msg:4470275
 9:53 pm on Jun 27, 2012 (gmt 0)

Stop me if you've heard this one. While experimenting with an alternative log-wrangling script I ran smack dab into:

131.253.41.45 - - [26/Jun/2012:06:20:22 -0700] "GET /robots.txt HTTP/1.1" 200 533 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
131.253.41.45 - - [26/Jun/2012:06:20:22 -0700] "GET /hovercraft/images/kabloona.jpg HTTP/1.1" 200 44328 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
131.253.41.45 - - [26/Jun/2012:06:20:22 -0700] "GET /hovercraft/caribou.html HTTP/1.1" 200 10970 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"


and

131.253.41.223 - - [26/Jun/2012:07:53:18 -0700] "GET /robots.txt HTTP/1.1" 200 533 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
131.253.41.223 - - [26/Jun/2012:07:53:18 -0700] "GET /hovercraft/images/yesno.jpg HTTP/1.1" 200 38878 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
131.253.41.223 - - [26/Jun/2012:07:53:19 -0700] "GET /hovercraft/caribou.html HTTP/1.1" 200 10970 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"


That is obviously The Real Thing; I'd recognize that pattern anywhere. robots.txt, one image, page the image lives on. For comparison purposes, the same day's logs include

207.46.199.163 - - [26/Jun/2012:08:50:38 -0700] "GET /robots.txt HTTP/1.1" 200 533 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
207.46.199.163 - - [26/Jun/2012:08:50:38 -0700] "GET /images/perez.jpg HTTP/1.1" 200 5781 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
207.46.199.163 - - [26/Jun/2012:08:50:38 -0700] "GET / HTTP/1.1" 200 2180 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
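
For anyone who wants to hunt for the same "robots.txt, one image, page" pattern in their own logs, here is a rough Python sketch. It assumes Apache "combined" log format as in the excerpts above; the function names, the image-extension list and the SAMPLE lines (copied from the 207.46 excerpt) are illustrative only, not anyone's actual log-wrangling script.

```python
import re

# Apache "combined" log format, matching the excerpts quoted in this post.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

# Assumption: these extensions are what counts as "an image" on the site.
IMAGE_EXT = (".jpg", ".jpeg", ".gif", ".png")


def parse(line):
    """Return the named fields of one log line, or None if it does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None


def is_image(path):
    return path.lower().endswith(IMAGE_EXT)


def find_triples(lines):
    """Yield (ip, image, page) for each robots.txt / image / page run from one IP."""
    hits = [h for h in map(parse, lines)
            if h and h["ua"].startswith("msnbot-media")]
    for i in range(len(hits) - 2):
        a, b, c = hits[i:i + 3]
        if (a["ip"] == b["ip"] == c["ip"]
                and a["path"] == "/robots.txt"
                and is_image(b["path"])
                and not is_image(c["path"])
                and c["path"] != "/robots.txt"):
            yield a["ip"], b["path"], c["path"]


SAMPLE = [
    '207.46.199.163 - - [26/Jun/2012:08:50:38 -0700] "GET /robots.txt HTTP/1.1" 200 533 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"',
    '207.46.199.163 - - [26/Jun/2012:08:50:38 -0700] "GET /images/perez.jpg HTTP/1.1" 200 5781 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"',
    '207.46.199.163 - - [26/Jun/2012:08:50:38 -0700] "GET / HTTP/1.1" 200 2180 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"',
]

for ip, image, page in find_triples(SAMPLE):
    print(ip, image, page)
```

It only looks at consecutive msnbot-media hits, so interleaved requests from two IPs in the same second would be missed; good enough for a first pass.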


But what the bleep bleep is 131.253? We've met 131.107.0; there have been occasional threads about it, most recently in March 2012 [webmasterworld.com].

Turns out 131.253.21-47 (really: I checked the adjacent numbers on both sides) belongs to Microsoft. Somewhere along the line they must have subleased it from the company that owns the rest of the 131.253 block. Further cursory research tells me I have never* met this address before.

What gives? Anyone else seen recent visits from this neighborhood?


* I didn't bother to unzip & check older logs, so "never" = within the past year.
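
A small Python sketch for testing whether an address falls in the 131.253.21-47 stretch described above. It assumes the range runs from 131.253.21.0 through 131.253.47.255 (my reading of "21-47"; check a whois lookup before relying on it), and the names FIRST, LAST, NETS and in_range are made up for illustration.

```python
import ipaddress

# Assumption: "131.253.21-47" means every address from .21.0 through .47.255.
FIRST = ipaddress.IPv4Address("131.253.21.0")
LAST = ipaddress.IPv4Address("131.253.47.255")

# Collapse the inclusive range into the smallest list of CIDR blocks.
NETS = list(ipaddress.summarize_address_range(FIRST, LAST))


def in_range(ip):
    """True if the dotted-quad string ip lies inside the range above."""
    addr = ipaddress.IPv4Address(ip)
    return any(addr in net for net in NETS)


print([str(net) for net in NETS])
print(in_range("131.253.41.45"))
```

The CIDR list is handy if your access rules want blocks rather than a first-last pair.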

 

keyplyr




msg:4470295
 11:43 pm on Jun 27, 2012 (gmt 0)

Turns out 131.253.21-47 (really: I checked the adjacent numbers on both sides) belongs to Microsoft.


Yes, for years now. My notes say I added that range in 2007.

wilderness




msg:4470324
 3:00 am on Jun 28, 2012 (gmt 0)

lucy,
FWIW, this bot has never been compliant and past-practices were/are denial.

lucy24




msg:4470340
 5:19 am on Jun 28, 2012 (gmt 0)

:: detour here for closer look at past month's logs ::

Don, I think you must just have prettier pictures than me ;) No activity in any roboted-out directory-- and there are a lot more of them than there were the last time I looked closely at this robot. In particular, I've Disallowed about half the e-books-- and it didn't visit any that it wasn't supposed to.

Odd corollary discovery: Within the /paintings/ directory, all msnbot-media activity involved blowups linked in the form <a href = "filename.jpg" target = "_blank">. (Logs don't say so, of course, but I can tell from the page names. This kind of linking doesn't occur in any other directory.) Overall, about half the full-size jpgs are linked like that while the other half are embedded in individual pages. And then for every big picture there's a thumbnail, 1/4 linear size; the robot wasn't interested in any of those.

That made me curious enough to pull up another month's raw logs, looking only at /paintings/. Same pattern. Hm. Wonder what they're up to?

dstiles




msg:4470670
 8:59 pm on Jun 28, 2012 (gmt 0)

I found 131.253.41.nnn today for the first time (my records on this only go back to March 2010). It hit with the media bot. I left the range "open" for now to see what else it throws up.

grandma genie




msg:4473511
 5:59 pm on Jul 7, 2012 (gmt 0)

Came in today for the first time. I have disallowed it in robots.txt. Don't want my media indexed.

131.253.41.nn - - [07/Jul/2012:04:54:47 -0400] "GET /robots.txt HTTP/1.1" 200 1500 "-" "msnbot-media/1.1 (+h**p://search.msn.com/msnbot.htm)"

wilderness




msg:4522166
 11:57 am on Nov 24, 2012 (gmt 0)

I've had a glitch which prevented this IP range, and msnbot-media/1.1 coming from it, from seeing robots.txt, and the requests (and 403s) escalated.

The other MSN IPs that were making requests for msnbot-media/1.1 could see robots.txt and complied with omission requests.

Why on earth the other msnbot-media/1.1 IP ranges could not relay the same compliance to the 131.253 range is beyond comprehension. (The head doesn't know what the arms and legs are doing at MSN?)

In any event I located the glitch and the 131.253 requests have ceased, however it's still puzzling that msnbot-media/1.1 is coming from a quantity of ranges and/or departments:

131.253.41.60 - - [24/Nov/2012:04:45:49 +0000] "GET /robots.txt HTTP/1.1" 200 2719 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
65.52.109.114 - - [24/Nov/2012:08:18:34 +0000] "GET /robots.txt HTTP/1.1" 200 2719 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
207.46.194.70 - - [24/Nov/2012:09:33:29 +0000] "GET /robots.txt HTTP/1.1" 200 2719 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"

still more:
207.46.194.114 - - [24/Nov/2012:11:09:38 +0000] "GET /robots.txt HTTP/1.1" 200 2719 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
207.46.194.127 - - [24/Nov/2012:11:10:18 +0000] "GET /robots.txt HTTP/1.1" 200 2719 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
65.52.109.126 - - [24/Nov/2012:11:19:25 +0000] "GET /robots.txt HTTP/1.1" 200 2719 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
65.52.109.39 - - [24/Nov/2012:11:37:36 +0000] "GET /robots.txt HTTP/1.1" 200 2719 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"

wilderness




msg:4523363
 9:37 pm on Nov 28, 2012 (gmt 0)

Forty-seven (47) compliant requests for robots.txt in one 24-hour period.
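
A tally like this can be pulled from raw logs with a few lines of Python. This is only a sketch assuming combined log format; it counts per calendar day as written in the log rather than per rolling 24 hours, and the names ROBOTS_RE and robots_per_day are illustrative.

```python
import re
from collections import Counter

# Matches a robots.txt GET in a combined-format line and captures the
# log date (day/month/year, without the clock time) and the user-agent.
ROBOTS_RE = re.compile(
    r'\[(?P<day>[^:]+):[^\]]+\] "GET /robots\.txt [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def robots_per_day(lines, ua_prefix="msnbot-media"):
    """Count robots.txt requests per log date for UAs starting with ua_prefix."""
    counts = Counter()
    for line in lines:
        m = ROBOTS_RE.search(line)
        if m and m.group("ua").startswith(ua_prefix):
            counts[m.group("day")] += 1
    return counts
```

Feed it an open log file (or any iterable of lines) and read off the busiest dates with counts.most_common().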

wilderness




msg:4524159
 3:27 pm on Dec 1, 2012 (gmt 0)

131.253. has been acting weird.

131.253.39. gets 403's, while the remainder of the assignments are allowed.

The only thing I see is that temporary and incomplete header check that I installed. I've remarked out the header check and will report back what transpires.

lucy24




msg:4526055
 5:04 am on Dec 8, 2012 (gmt 0)

:: further bump ::

Stop the bleepin presses. This just in. Verbatim from logs with a minimum of snips. The Forums will probably eat the distinctive double-spaces in each UA.

131.253.36.202 - - [07/Dec/2012:11:31:28 -0800] "GET /fonts/naamajut.html HTTP/1.1" 200 4774 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.50727)"
131.253.36.202 - - [07/Dec/2012:11:31:29 -0800] "GET /piwik/piwik.js HTTP/1.1" 200 21927 "http://www.example.com/fonts/naamajut.html" {same}
131.253.36.202 - - [07/Dec/2012:11:31:30 -0800] "GET /sharedstyles.css HTTP/1.1" 200 2984 {et cetera}
131.253.36.206 - - [07/Dec/2012:11:31:31 -0800] "GET /fonts/fontstyles.css HTTP/1.1" 200 3191 {et cetera}
131.253.36.205 - - [07/Dec/2012:11:31:33 -0800] "GET /piwik/piwik.php?action_name=Naamajut& {et cetera}


131.253.26.244 - - [07/Dec/2012:13:08:06 -0800] "GET /fonts/legacy.html HTTP/1.1" 200 10247 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.40607)"
131.253.26.244 - - [07/Dec/2012:13:08:06 -0800] "GET /piwik/piwik.js HTTP/1.1" 200 21927 "http://www.example.com/fonts/legacy.html" {same}
131.253.26.244 - - [07/Dec/2012:13:08:07 -0800] "GET /sharedstyles.css HTTP/1.1" 200 2984 {et cetera}
131.253.26.244 - - [07/Dec/2012:13:08:07 -0800] "GET /fonts/fontstyles.css HTTP/1.1" 200 3190 {et cetera}
65.55.212.65 - - [07/Dec/2012:13:08:13 -0800] "GET /piwik/piwik.php?action_name=Legacy%20Fonts& {et cetera}


Note the dutiful collection of all associated files-- except images (7 on one page, 31 on the other).
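
The "everything except images" signature can be checked mechanically once the requests from one visit have been collected. A rough Python sketch follows; the extension-to-kind table and the function names are assumptions for illustration, and grouping the log lines into visits is left to the reader.

```python
from collections import Counter
from posixpath import splitext

# Assumption: a simple extension table is enough to classify requests.
KINDS = {
    ".html": "page",
    ".css": "style",
    ".js": "script",
    ".jpg": "image",
    ".gif": "image",
    ".png": "image",
}


def kinds_fetched(paths):
    """Count requested paths by kind, ignoring any query string."""
    counts = Counter()
    for path in paths:
        ext = splitext(path.split("?", 1)[0])[1].lower()
        counts[KINDS.get(ext, "other")] += 1
    return counts


def skips_images(paths):
    """True if a visit took a page plus CSS/JS but not a single image."""
    c = kinds_fetched(paths)
    return c["page"] > 0 and (c["style"] + c["script"]) > 0 and c["image"] == 0
```

On a page with several images, a True result is a strong hint that the visitor is not an ordinary browser, whatever the UA claims.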

I checked previous days: this is brand-new. If it hadn't been for that sudden IP swap at the end I wouldn't even have noticed-- all that css and js activity was enough to make it pass for human. I'd never got around to blocking the plainclothes bingbot from this address. Never thought of it, in fact.

Notice the piwik queries? Normally with robots it's simply "idsite=1" identifying the domain. This was a full-blown query string in every detail, meaning that the visitor executed the preceding javascript and sent humanoid information. Most of it is so much Hungarian to me, but the very last item is

&res=800x600

Is it now. Fancy that. And what are we to make of that six-second pause between the last two requests? Is that Robot A (131.253.) signing off and handing its unfinished jobs to Robot B (65.55)?
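
For picking apart tracker requests like these, the Python standard library does the job. A sketch under stated assumptions: the example path stitches together only the parameters mentioned above (action_name, idsite, res) and is not the real logged query string, which is elided here; the function names are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def tracker_params(request_path):
    """Return the query parameters of a logged request path as a flat dict."""
    query = urlsplit(request_path).query
    return {key: values[0] for key, values in parse_qs(query).items()}


def has_full_tracker_query(request_path):
    """True if the query carries more than the bare idsite parameter."""
    return bool(set(tracker_params(request_path)) - {"idsite"})


# Illustrative path only, assembled from parameters named in this post.
example = "/piwik/piwik.php?action_name=Naamajut&idsite=1&res=800x600"
print(tracker_params(example)["res"])
```

A bare "idsite=1" request comes back False from has_full_tracker_query, which matches the usual robot behavior described above.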



Tangential discovery made while investigating this one: around the end of August, msnbot-media abruptly dropped its old tripartite system-- the one I described at the beginning of this thread that goes "robots.txt, one image, page". It went on a couple of robots.txt binges-- my record seems to be 34 in a 24-hour period, still no match for the ordinary bingbot-- and has now settled into one or two robots.txt alternating with one image. Generally a very minor gif.

File under: wtf?

wilderness




msg:4526057
 5:13 am on Dec 8, 2012 (gmt 0)

Hey lucy,
I've a lingering domain (for lack of a better word).
There's not very much on the site.

131.253.36.135 - - [04/Dec/2012:19:20:07 +0000] "GET /robots.txt HTTP/1.1" 200 1696 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"

This was the solitary request, and the robots.txt does not include msnbot-media

I haven't had any subsequent requests from the 39 Class C, and thus I reactivated the header checks.

Maybe it's just some kind of bing glitch.

lucy24




msg:4526397
 10:10 pm on Dec 9, 2012 (gmt 0)

Make it a multiple glitch, if so. They came back later in the night-- before I'd got around to uploading the revised .htaccess-- for another page. And tried once more still later, coming away with only the custom 403 page and /boilerplate/errorstyles.css. By then it was 1:30 AM Pacific time, which is not a likely hour for humans at Microsoft to be working late ;)

lucy24




msg:4527887
 7:57 pm on Dec 13, 2012 (gmt 0)

:: bump ::

Do we have an ongoing thread about the plainclothes bingbot? Things are getting a bit tangled.

In today's "D'oh!" moment I went over to the Bing WMT discussion boards in search of enlightenment. (Link will probably only work if you are signed on to bing.)

From March 2010 [bing.com], final post in a long thread, from a non-bing-affiliated human:

It turns out that the traffic you're seeing isn't really the MSNBot search indexer - it's Bing Translator (AKA Microsoft Translator / Windows Live Translator).

If a user crawls your site and then translates the page into their local language through this tool then you will see the request coming from a 65.55 IP address which MAY (not always) reverse DNS to say "msnbot". However it's a real human requesting this page, and you should not really attempt to block it unless "msnbot" is in the user-agent string.

The translate server is proxying the request and you will therefore see the user's user-agent string - not the MSNBOT one.

It seems microsoft are repurposing IP addresses and not updating the reverse DNS names for them, so many translate server IP addresses reverse lookup to a MSN bot address.

There are several different ways of using their service - you can use Page > Translate with Live Search in IE8, or from the Windows Live Toolbar, or you can click "translate this page" from a bing.com search result screen.

For people whose approach to translators is Shoot To Kill, this makes things easy.

Problem is, I am not sure I believe it. I pored over logs from the visit I fortuitously quoted above, and there were no human requests for images during the relevant time period-- and both of those pages come with lots of keyboard diagrams. In fact the only related image requests from around that time came from the YandexBot, whose attention seems to have been caught by a Yandex search. And if Yandex is in collusion with Bing Translate it is definitely Stop The Presses time.

Next I got Bing to dig up one of my non-English pages-- which was not easy, in spite of the <lang> tags that they claim to recognize-- and asked for a translation. Logs say:

131.253.36.194 - - [13/Dec/2012:11:48:18 -0800] "GET /ebooks/perez/PerezEsp.html HTTP/1.1" 200 12818 "-" "{my browser here}"

All associated files-- CSS, images etc-- are logged as:

{my IP} - - [13/Dec/2012:11:48:19 -0800] "GET /ebooks/ebookstyles.css HTTP/1.1" 200 2999 "http://131.253.14.66/proxy.ashx?h=KJ1kesesgM4REwrijiOEFN1Hnv3Pe7dI&a={percent-encoded filename}" "{my browser}"

-- a format guaranteed to yield a page rich with No Hotlinks images.

So... Nice try, but I don't think it's the answer. At least not this month, this year. There may be more recent Bing threads that I couldn't find. If not, I guess the next step is to try asking again. Noteworthy that in the earlier thread, nobody from Bing/MSN stepped in to explain.

dstiles




msg:4527897
 8:24 pm on Dec 13, 2012 (gmt 0)

If any SE (eg G or B or Y) offers translation then the correct method is to ask for a page via a suitably-identified proxy and include the requester's UA and IP, as is normal for a proxy. If they try to ride on the back of a defined bot IP they are asking to be rejected.

It's (mainly) because G, for example, uses any old IP with no reasonable rDNS that I block G translates. If an SE comes in on a bot IP with a non-bot UA they will also get rejected. It's not as if these companies are lacking in IPs - they have thousands of the things!

Of course, it's odds-on that the SE already has the page in its SE database, so why do they need to visit with a translator in the first place? Why not just regurgitate their own scrape, which would probably be quicker and take fewer resources.
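
For reference, a forward-confirmed reverse DNS check can be sketched in Python as below. The ".search.msn.com" suffix is an assumption based on the usual advice for verifying Bing crawlers, and the lookup functions are passed in as parameters purely so the logic can be tried offline. Per the translator discussion above, a pass only says the address is Microsoft's crawler space, not what the request is for.

```python
import socket


def verify_msn_host(ip,
                    reverse=socket.gethostbyaddr,
                    forward=socket.gethostbyname_ex,
                    suffix=".search.msn.com"):
    """Reverse-resolve ip, check the hostname suffix, then forward-confirm it."""
    try:
        host = reverse(ip)[0]
    except OSError:  # no PTR record or lookup failure
        return False
    if not host.lower().rstrip(".").endswith(suffix):
        return False
    try:
        # The third element is the list of addresses for the hostname.
        return ip in forward(host)[2]
    except OSError:
        return False
```

Called as verify_msn_host("131.253.41.45") it performs two live DNS lookups, so cache the answers rather than running it on every request.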

keyplyr




msg:4527901
 8:28 pm on Dec 13, 2012 (gmt 0)

Yet another reason to block all translators! If your biz needs alternative language support, host those translated pages on your own server.

lucy24




msg:4527946
 10:14 pm on Dec 13, 2012 (gmt 0)

Yet another reason to block all translators!

Note however that the word "translate" or "translator" never appears anywhere in the logs so you can't block it by name. And that's assuming for the sake of discussion that the plainclothes MSNbot really is a translator, which I'm not at all sure of. I'm keeping it blocked pending solid proof. (I don't object to translators.)

This in turn reminds me that That Other Search Engine has taken to omitting the word "translate" from the part added to the UA string. I've been getting visits with only the "gzip" component added. Referer info confirms that it's still translation, not some suspicious new activity.

I've made non-English versions of my two most commonly translated pages. (A specific language for each.) But frankly it felt suicidal, since it basically means cutting each page's search numbers in half :(

keyplyr




msg:4527962
 11:46 pm on Dec 13, 2012 (gmt 0)

I've made non-English versions of my two most commonly translated pages... it basically means cutting each page's search numbers in half

Shouldn't if done correctly. Maybe you're just noticing the absence of those translation crawls for the two pages you made?

Give each language its own unique address or put it in a sub directory, linked from the main page, then disallow indexing. Shouldn't change the SERP that way.

Or the other static way is to allow indexing and let those pages display on their own in those language specific SERP.

Best scenario would obviously be to create those alternative language pages dynamically using a DB and an in-house translation utility.

Either way, amounts to the same thing. Mine have not affected my ranking at all.

lucy24




msg:4527982
 1:27 am on Dec 14, 2012 (gmt 0)

Maybe you're just noticing the absence of those translation crawls for the two pages you made?

Give each language its own unique address or put it in a sub directory, linked from the main page, then disallow indexing. Shouldn't change the SERP that way.

The former two pages are now four pages. Everything is separately indexed. So half of the people who used to arrive at page X accompanied by some kind of translation request now go directly to page X-Esp. And half of the people who used to go to page Y now go instead to page Y-It. This batch usually didn't ask for a translation, but so many came from Italy that I gave them their own page. (I suspect that some of them don't have full faith in the translation, because they look at-- or at least open-- both :))

As an added quirk: The Spanish page is actually the original text, of which the English was a translation. But the translation had much prettier pictures than the Spanish original, so I made a self-confessed Frankenbook.* The Italian page started with G### translate, got cleaned up a little bit by me and then properly by an Italian. ("When you say 'chiave' can I assume you mean 'tasto'?" or "The construction with 'soli' [which I could swear I learned in school] would have been fine in 1920 and is still grammatically correct, but nobody says it that way." Eccetera.)


* Technical term.

thetrasher




msg:4528089
 2:24 pm on Dec 14, 2012 (gmt 0)

This in turn reminds me that That Other Search Engine has taken to omitting the word "translate" from the part added to the UA string.

That Other Search Engine sends Via ("Via: 1.0 translate.google.com TWSFE/0.9") and X-Forwarded-For. Bing does not.
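
If your server-side script can see request headers, that difference is easy to test for. A minimal Python sketch, assuming the headers arrive as a plain dict; the function name and the returned keys are made up for illustration, and the client address in X-Forwarded-For is whatever the proxy chose to report.

```python
def translator_proxy_info(headers):
    """Report whether a request came via Google Translate and for which client."""
    # Header names are case-insensitive, so normalize the keys first.
    lowered = {name.lower(): value for name, value in headers.items()}
    via = lowered.get("via", "")
    forwarded = lowered.get("x-forwarded-for", "")
    # The first entry in X-Forwarded-For is the original client, if present.
    client = forwarded.split(",")[0].strip() or None
    return {
        "via_google_translate": "translate.google.com" in via.lower(),
        "client_ip": client,
    }
```

A request with neither header comes back as False and None, which is all a Bing-side visit will give you.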

blend27




msg:4528127
 4:39 pm on Dec 14, 2012 (gmt 0)

Well, for one of my sites it seems like it started on 01-Nov-2012. This "compacted" dataset shows different flavors of what seems to be IE7 coming from:

131.253.24.nnn
131.253.26.nnn
131.253.36.nnn
131.253.38.nnn

As mentioned before, the CSS and JS files are downloaded as well, and the referrer is supplied appropriately, but the images are not fetched. robots.txt was not requested by any of the IPs listed below. Ecom site. The requests target pages that have product descriptions on them.

-----------------------------------------------------------------------------------------
2012-12-14 08:37:19.447 - 131.253.26.255 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.40607; .NET CLR 3.0.30729; .NET CLR 3.5.30707)
2012-12-14 01:54:37.393 - 131.253.24.128 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SV1; .NET CLR 1.1.4325; .NET CLR 2.0.40607; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.2)
2012-12-13 14:44:25.340 - 131.253.24.131 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)
2012-12-13 13:52:29.877 - 131.253.36.202 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.30729; .NET CLR 3.5.30729)
2012-12-13 13:45:14.843 - 131.253.36.197 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.30729; .NET CLR 3.5.30707)
2012-12-13 12:43:27.140 - 131.253.26.226 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.40607; .NET CLR 3.0.30729; .NET CLR 3.5.30707; MS-RTC LM 8)
2012-12-13 12:05:53.787 - 131.253.26.233 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SV1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)
2012-12-13 11:25:56.943 - 131.253.24.150 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SV1; .NET CLR 1.1.4325; .NET CLR 2.0.40607; .NET CLR 3.0.30729; .NET CLR 3.5.30707; MS-RTC LM 8)
2012-12-13 09:33:19.997 - 131.253.36.197 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)
2012-12-13 08:26:15.720 - 131.253.24.141 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)
2012-12-13 03:54:56.883 - 131.253.24.144 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SV1; .NET CLR 1.1.4325; .NET CLR 2.0.50727)
2012-12-13 03:10:12.420 - 131.253.36.200 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.50727)
2012-12-13 02:26:15.313 - 131.253.24.130 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30707; MS-RTC LM 8)
2012-12-12 22:27:36.210 - 131.253.24.151 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)
2012-12-12 19:47:47.253 - 131.253.36.204 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)
2012-12-12 15:11:55.880 - 131.253.24.133 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)
2012-12-12 14:14:49.460 - 131.253.26.231 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SV1)
2012-12-12 09:11:51.217 - 131.253.26.232 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.30729; .NET CLR 3.5.30729)
2012-12-12 04:56:55.743 - 131.253.24.136 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729)
2012-12-12 04:30:23.843 - 131.253.26.249 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)
2012-12-12 01:32:24.007 - 131.253.26.230 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)
2012-12-12 00:55:01.580 - 131.253.26.255 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)
2012-12-11 17:21:25.393 - 131.253.26.238 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SLCC1)
2012-12-11 10:02:23.043 - 131.253.24.158 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30707; InfoPath.2)
2012-12-11 02:28:32.473 - 131.253.26.225 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.2)
2012-12-10 17:37:50.323 - 131.253.38.67 - Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)
2012-12-10 14:28:57.077 - 131.253.24.153 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)
2012-12-10 03:29:43.270 - 131.253.36.197 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.40607)
2012-12-09 18:04:13.223 - 131.253.24.154 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
2012-12-09 17:37:24.997 - 131.253.26.232 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.30729; .NET CLR 3.5.30707; MS-RTC LM 8)
2012-12-09 17:16:24.897 - 131.253.24.148 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.30729; .NET CLR 3.5.30707; InfoPath.2)
2012-12-09 11:46:05.007 - 131.253.24.151 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)
2012-12-09 03:57:49.447 - 131.253.24.150 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)
2012-12-09 01:18:50.337 - 131.253.24.156 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SV1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.30729)
2012-12-08 23:33:02.073 - 131.253.26.231 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)
2012-12-08 21:59:09.773 - 131.253.24.141 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)
2012-12-08 21:42:57.883 - 131.253.24.158 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729)
2012-12-08 21:40:38.590 - 131.253.26.224 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.30729)
2012-12-08 19:44:28.330 - 131.253.26.253 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)
2012-12-08 16:16:15.100 - 131.253.24.142 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)
2012-12-08 13:52:11.617 - 131.253.24.149 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 1.1.4325; .NET CLR 2.0.40607; .NET CLR 3.0.30729; .NET CLR 3.5.30707)
2012-12-08 13:33:18.543 - 131.253.36.200 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)
2012-12-08 12:40:17.157 - 131.253.26.230 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)
2012-12-08 08:48:16.660 - 131.253.26.222 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)
2012-12-08 07:45:26.550 - 131.253.24.159 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.30729)
2012-12-08 04:07:54.713 - 131.253.26.229 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)
2012-12-08 03:48:08.217 - 131.253.26.239 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)
2012-12-08 02:40:16.970 - 131.253.24.154 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; MS-RTC LM 8)
2012-12-08 01:09:03.220 - 131.253.24.153 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.40607; .NET CLR 3.0.30729)
2012-12-07 23:57:09.923 - 131.253.24.154 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.30729)
2012-12-07 20:14:53.127 - 131.253.24.145 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SV1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.30729)
2012-12-07 17:22:34.877 - 131.253.26.254 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; MS-RTC LM 8)
2012-12-02 21:28:52.697 - 131.253.38.67 - Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)
2012-11-25 13:54:42.900 - 131.253.38.67 - Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)
2012-11-17 08:47:04.720 - 131.253.38.67 - Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)
2012-11-08 15:03:47.863 - 131.253.38.67 - Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)
2012-11-01 11:53:26.340 - 131.253.38.67 - Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)

I'll dig into Headers sent a bit later.

blend27




msg:4530650
 7:29 pm on Dec 23, 2012 (gmt 0)

Ooooops,

131.253.24.28 with UA: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b

Just got caught trying to fake a referrer, specifying it as the non-WWW version of the root document.

Another question is why would anything Microsoft have anything remotely to do with Apple, even as simple as in the UA String?

keyplyr




msg:4530652
 8:22 pm on Dec 23, 2012 (gmt 0)

Another question is why would anything Microsoft have anything remotely to do with Apple, even as simple as in the UA String?

For unchallenged support on Apple devices. Google's Chrome and the native Android browser do the same thing.

wilderness




msg:4530659
 9:16 pm on Dec 23, 2012 (gmt 0)

fails header check

131.253.38.21 - - [23/Dec/2012:17:42:01 +0000] "GET /robots.txt HTTP/1.1" 403 559 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"

lucy24




msg:4530695
 1:42 am on Dec 24, 2012 (gmt 0)

Well, no wonder msnbot-media isn't robots.txt compliant. You won't let it see robots.txt in the first place :)

<Files "robots.txt">
Order Allow,Deny
Allow from all
</Files>

wilderness




msg:4530701
 3:09 am on Dec 24, 2012 (gmt 0)

Hey lucy,
I'm not counting the msnbot-media requests for a day again (see earlier reply this thread).

The only msnbot-media robots.txt requests that fail are the failed header checks, which only come from these three Class C ranges. All the others get through.

not2easy




msg:4530712
 6:21 am on Dec 24, 2012 (gmt 0)

I am also seeing MS bots disobeying robots.txt and landing in my trap this week: 168.61.12.45
Lookup shows Microsoft NetRange: 168.61.0.0 - 168.63.255.255
168.61.0.0/16, 168.62.0.0/15

wilderness




msg:4530725
 8:41 am on Dec 24, 2012 (gmt 0)

easy,
This is likely a rogue using a FAKE UA.

I recall your doing some headers checks, what happened on these IP's?

lucy24




msg:4530726
 9:00 am on Dec 24, 2012 (gmt 0)

### !

Are there any more MS ranges that we've never heard of in our entire lives?

The only msnbot-media robots.txt requests that fail are the failed header checks

So that's where we differ. I let everyone get robots.txt, no matter how evil. But then, I don't have one of those fancy rewrites that shows a different file to every robot so they can't switch UAs and come back dressed up as Good Robot.

wilderness




msg:4530729
 9:08 am on Dec 24, 2012 (gmt 0)

So that's where we differ.


guess I just never found a cordial way to say *-off ;)

lucy24




msg:4530781
 1:29 pm on Dec 24, 2012 (gmt 0)

By the time the Word Censors get through with it, it will be cordial whether you like it or not :-P



All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved