Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 58 message thread spans 2 pages.
msnbot-media
What's a nice robot like you doing in a place like this?
lucy24




msg:4470275
 9:53 pm on Jun 27, 2012 (gmt 0)

Stop me if you've heard this one. While experimenting with an alternative log-wrangling script I ran smack dab into:

131.253.41.45 - - [26/Jun/2012:06:20:22 -0700] "GET /robots.txt HTTP/1.1" 200 533 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
131.253.41.45 - - [26/Jun/2012:06:20:22 -0700] "GET /hovercraft/images/kabloona.jpg HTTP/1.1" 200 44328 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
131.253.41.45 - - [26/Jun/2012:06:20:22 -0700] "GET /hovercraft/caribou.html HTTP/1.1" 200 10970 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"


and

131.253.41.223 - - [26/Jun/2012:07:53:18 -0700] "GET /robots.txt HTTP/1.1" 200 533 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
131.253.41.223 - - [26/Jun/2012:07:53:18 -0700] "GET /hovercraft/images/yesno.jpg HTTP/1.1" 200 38878 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
131.253.41.223 - - [26/Jun/2012:07:53:19 -0700] "GET /hovercraft/caribou.html HTTP/1.1" 200 10970 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"


That is obviously The Real Thing; I'd recognize that pattern anywhere. robots.txt, one image, page the image lives on. For comparison purposes, the same day's logs include

207.46.199.163 - - [26/Jun/2012:08:50:38 -0700] "GET /robots.txt HTTP/1.1" 200 533 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
207.46.199.163 - - [26/Jun/2012:08:50:38 -0700] "GET /images/perez.jpg HTTP/1.1" 200 5781 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
207.46.199.163 - - [26/Jun/2012:08:50:38 -0700] "GET / HTTP/1.1" 200 2180 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
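For anyone who wants to spot the "robots.txt, one image, page the image lives on" pattern without eyeballing raw logs, here is a minimal Python sketch. It assumes the Apache combined log format shown in the entries above; the names `LOG_RE`, `parse_line` and `paths_by_ip` are made up for illustration and are not the script mentioned in the post.

```python
import re
from collections import defaultdict

# Apache "combined" log format, matching the entries quoted above.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

def paths_by_ip(lines, ua_substring="msnbot-media"):
    """Group requested paths by IP, keeping only one user-agent family."""
    grouped = defaultdict(list)
    for line in lines:
        rec = parse_line(line)
        if rec and ua_substring in rec["ua"]:
            grouped[rec["ip"]].append(rec["path"])
    return dict(grouped)

# Three entries copied from the log excerpt above.
SAMPLE = [
    '131.253.41.45 - - [26/Jun/2012:06:20:22 -0700] "GET /robots.txt HTTP/1.1" 200 533 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"',
    '131.253.41.45 - - [26/Jun/2012:06:20:22 -0700] "GET /hovercraft/images/kabloona.jpg HTTP/1.1" 200 44328 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"',
    '131.253.41.45 - - [26/Jun/2012:06:20:22 -0700] "GET /hovercraft/caribou.html HTTP/1.1" 200 10970 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"',
]

if __name__ == "__main__":
    for ip, paths in paths_by_ip(SAMPLE).items():
        print(ip, paths)
```

Grouping by IP keeps each three-request visit together, so an unfamiliar address range stands out next to the ones already known.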


But what the bleep bleep is 131.253? We've met 131.107.0; there have been occasional threads about it, most recently in March 2012 [webmasterworld.com].

Turns out 131.253.21-47 (really: I checked the adjacent numbers on both sides) belongs to Microsoft. Somewhere along the line they must have subleased it from the company that owns the rest of the 131.253 block. Further cursory research tells me I have never* met this address before.

What gives? Anyone else seen recent visits from this neighborhood?


* I didn't bother to unzip & check older logs, so "never" = within the past year.

 

not2easy




msg:4530794
 2:46 pm on Dec 24, 2012 (gmt 0)

@ wilderness - I don't check headers on traffic. I did not go back to download the logs so I did not see the UA, not even sure which domain that came in from without further digging. When I get a notification of an IP in a trap I do a whois to get the CIDR to add to the master list and that one came from Microsoft. I don't do logs every day unless something startling needs attention. It will get matched up with a UA when I check logs.

wilderness




msg:4531107
 11:16 am on Dec 26, 2012 (gmt 0)

I'm not counting the msnbot-media requests for a day again


Slow activity from visitors in the past twelve hours.

Sixty-eight (68) requests for robots.txt from the msnbot-media/1.1 in a ten-hour period, and on one site.

My sites have not allowed crawling of images for more than a decade, and it appears MSN/Bing believes I'm going to change my methods every 10-15 minutes.

not2easy




msg:4534532
 5:03 am on Jan 9, 2013 (gmt 0)

Bing's verify page says some of these are NOT verified Bingbot IP addresses. If you follow the URL in the UA you end up with Bing's verify tool. It does not have any equivalent tool for the msnbot but by the same token they do admit to 65.52.109.114 while disavowing 131.253.26.233, 131.253.36.194, 131.253.26.242 and 131.253.24.128 with:
Verdict for IP address 131.253.26.233:
No - this IP address is NOT a verified Bingbot IP address.

I did not check all the IPs listed, their captcha stinks and I can't sign in on FF due to their stupid geo-language fixation. I know it shows up as Microsoft's range, but they are telling people that it is not them. ?.

wilderness




msg:4534591
 11:09 am on Jan 9, 2013 (gmt 0)

Checked logs on three domains going back through August 2012.

NOT a solitary request from 131.253.2X

thus, I'm changing my ranges for 131.253.

dstiles




msg:4534796
 10:15 pm on Jan 9, 2013 (gmt 0)

I have the IP ranges below enabled for msnbot. I ONLY allow the official bingbot UA on these ranges so if (eg) a beta bot, browser, mediabot etc hits on the ranges they are rejected.

These ranges were checked last summer using Microsoft's own DNS servers to download the rDNS values but are probably not complete (I didn't check all MS IP ranges - takes far too long). Not all IPs in the range carry bot identification but enough to make it efficient; I currently have about 200 disabled sub-ranges which were too many to manage.

If you run linux or a system that can run dig then it's easy enough to hunt the IPs down given a starting point or three.

64.4.13.193 - 64.4.13.243
64.4.50.138 - 64.4.50.227
64.4.54.13 - 64.4.54.89
64.4.54.138 - 64.4.54.180
64.4.54.200 - 64.4.54.227
65.52.0.1 - 65.52.255.254
65.54.164.36 - 65.54.164.135
65.54.165.7 - 65.54.165.7
65.54.165.32 - 65.54.165.74
65.54.165.97 - 65.54.165.111
65.55.0.1 - 65.55.255.254
65.55.25.0 - 65.55.25.255
131.253.24.0 - 131.253.27.255
131.253.36.128 - 131.253.36.255
131.253.38.0 - 131.253.38.255
131.253.46.0 - 131.253.47.255
157.55.0.1 - 157.55.255.254
157.56.0.1 - 157.56.255.254
207.46.0.0 - 207.46.255.255
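
The "only the official bingbot UA on these ranges" rule could be sketched roughly as follows in Python. This is an illustration under assumptions, not the poster's actual setup: `BOT_RANGES` holds only a few of the start/end pairs from the list, the three-way result of `decide` is hypothetical, and matching on the substring "bingbot" is an assumed stand-in for "the official bingbot UA".

```python
import ipaddress

# A few of the start/end pairs from the list above (not the full list).
BOT_RANGES = [
    ("65.52.0.1", "65.52.255.254"),
    ("131.253.24.0", "131.253.27.255"),
    ("157.55.0.1", "157.55.255.254"),
    ("207.46.0.0", "207.46.255.255"),
]
_RANGES = [
    (ipaddress.IPv4Address(lo), ipaddress.IPv4Address(hi))
    for lo, hi in BOT_RANGES
]

def in_bot_range(ip):
    """True if ip falls inside any listed start/end pair (inclusive)."""
    addr = ipaddress.IPv4Address(ip)
    return any(lo <= addr <= hi for lo, hi in _RANGES)

def decide(ip, user_agent):
    """Hypothetical policy: return 'allow', 'reject' or 'other'."""
    if not in_bot_range(ip):
        return "other"  # not a listed bot range; left to other rules
    # Assumption: the official UA is recognised by the substring "bingbot".
    return "allow" if "bingbot" in user_agent.lower() else "reject"
```

With this, a plain MSIE user-agent arriving from 131.253.26.225 comes back as "reject", while the same address range with a bingbot UA is allowed.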

lucy24




msg:4534798
 10:25 pm on Jan 9, 2013 (gmt 0)

Have you ever got a request from 131.253.anything other than the bingbot range? If not, it's probably simpler to block the whole 131.253 package. The other pieces are some corporation-- we may even have talked about them earlier in this thread-- but probably not the kind whose employees go shopping for your particular widgets on their lunch break ;)

:: detour for spot checking ::

131.253.26.225 - - [02/Jan/2013:07:12:29 -0800] "GET /paintings/refrats/nochilles.html HTTP/1.1" 403 928 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.2)"

131.253.26.251 - - [03/Jan/2013:09:16:26 -0800] "GET /ebooks/ninelives/NineLives.html HTTP/1.1" 403 928 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SV1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729)"


That's the plainclothes bingbot. You probably can't tell in the Forums, but each semicolon is followed by two spaces.

Oh, and .24 is also used by Bing Preview, under the name

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b

(I'm sure someone somewhere has pointed out the hilarity of a MSN function using an AppleWebKit product :)) I found that one on the 3rd.


Stop the presses: This Just In. Landed on it while looking for-- duh-- something else.

main file:
131.253.36.194 - - [13/Dec/2012:11:48:18 -0800] "GET /ebooks/perez/PerezEsp.html HTTP/1.1" 200 12818 "-" "{my browser}"

all subsidiary files:
{my IP} - - [13/Dec/2012:11:48:19 -0800] "GET /ebooks/perez/perezstyles.css HTTP/1.1" 200 2668 "http://131.253.14.66/proxy.ashx?h=KJ1kesesgM4REwrijiOEFN1Hnv3Pe7dI&a={full URL of main page}" "{my browser}"

I draw your attention to the proxy IP. That's completely outside the range we'd flagged as MSN/Bing. DomainTools says it's 131.253.12.0 - 131.253.18.255 (131.253.12.0/22 131.253.16.0/23 131.253.18.0/24 in CIDR-speak), belonging to Bing Translate.

:: sigh ::
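
The range-to-CIDR conversion quoted from DomainTools can be reproduced with Python's standard `ipaddress` module. A minimal sketch; the helper name `range_to_cidrs` is made up for illustration.

```python
import ipaddress

def range_to_cidrs(first, last):
    """Return the smallest list of CIDR blocks covering first..last inclusive."""
    nets = ipaddress.summarize_address_range(
        ipaddress.IPv4Address(first), ipaddress.IPv4Address(last)
    )
    return [str(net) for net in nets]

if __name__ == "__main__":
    # The Bing Translate range mentioned above.
    print(range_to_cidrs("131.253.12.0", "131.253.18.255"))
```

For 131.253.12.0 - 131.253.18.255 this yields the same three blocks given in the post: a /22, a /23 and a /24.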

blend27




msg:4534913
 4:03 am on Jan 10, 2013 (gmt 0)

(I'm sure someone somewhere has pointed out the hilarity of a MSN function using an AppleWebKit product :)

Wasn't me: [youtube.com...]

---------------------
compatible; MSIE 7.0; Windows NT 5.1 - XP 32 Bit
compatible; MSIE 7.0; Windows NT 5.2; SV1; - 64 bit

IE 7 & XP when MS is promoting Win8/IE10 like there is no tomorrow.

I can confirm the double space in the UA from my logs.

I am about to nuke 131.253.; I don't see a reason why not...

dstiles




msg:4535128
 9:21 pm on Jan 10, 2013 (gmt 0)

The way my system works it's not a big difference between checking for a valid bot UA (on a valid bot IP) to checking a total ban. Under the circumstances I'll go with the DNS lookups and leave those ranges open. :)
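
For reference, the DNS-lookup approach usually means a forward-confirmed reverse lookup: reverse-resolve the IP, check the hostname, then resolve that hostname forward and confirm it maps back to the same IP. A minimal Python sketch follows. The function name and the injectable `reverse`/`forward` parameters are illustrative, and the `.search.msn.com` suffix is an assumption taken from how Bing documents bingbot hostnames; nothing in this thread establishes that every msnbot-media address follows it.

```python
import socket

def verify_bot_ip(ip, suffix=".search.msn.com",
                  reverse=socket.gethostbyaddr,
                  forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a single IPv4 address."""
    try:
        host = reverse(ip)[0]          # PTR lookup: IP -> hostname
    except OSError:
        return False                   # no reverse record
    if not host.lower().endswith(suffix):
        return False                   # hostname is not in the expected domain
    try:
        addrs = forward(host)[2]       # A lookup: hostname -> list of IPs
    except OSError:
        return False
    return ip in addrs                 # must map back to the same IP
```

The resolvers are passed in as parameters so the check can be exercised offline with stand-in functions; in normal use the defaults perform real lookups.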

lucy24




msg:4535162
 1:01 am on Jan 11, 2013 (gmt 0)

My post was actually a response to wilderness, right above yours. You just slipped in and typed faster :)

As I understand it, wilderness's system is:

Me user. You mod_rewrite.
Me make rules. You obey.

I think he's even got mod_rewrite taking out the garbage and cleaning up the widget droppings.

My own "system"* is pretty much the same except that it also uses mod_setenvif and mod_auth-whatever-it-is. And I get along better with CIDR ranges, on account of having learned binary at an early age.


* You can apply sarcastic quotes to yourself. Who's going to be offended? ;)

wilderness




msg:4535165
 1:44 am on Jan 11, 2013 (gmt 0)

cleaning up the widget droppings.


the mushroom farms pay good money for that stuff.

wilderness




msg:4535167
 1:49 am on Jan 11, 2013 (gmt 0)

My own "system"* is pretty much the same except that it also uses mod_setenvif and mod_auth-whatever-it-is. And I get along better with CIDR ranges, on account of having learned binary at an early age.


FWIW, I converted some of my mod_rewrite IPs to mod_auth (deny from) and CIDRs, despite the CIDR looking like pure gibberish and requiring six calculators and four websites to convert to a comprehensible ten-finger-ten-toe format.

The reduction in file size thus far was 25k+ or an approximate 20%.

keyplyr




msg:4535177
 3:27 am on Jan 11, 2013 (gmt 0)

Don, a handy free tool to convert IP ranges to CIDR specification is IPRange2CIDR, found here: [kgsoft.com...] Just download and install it on your machine.

wilderness




msg:4535310
 4:08 pm on Jan 11, 2013 (gmt 0)

keyplyr,
Many thanks. I've downloaded the tool and will install and test it.

Unfortunately my whining is not related to the actual conversion of the IPs to CIDR, but rather to my own inability to turn the CIDR ranges (i.e., gibberish) into something that I readily and visually understand.

I'm able to create and understand afterwards (visually) entire lines of IP's in mod-rewrite-regex faster than I'm able to use a tool to convert a CIDR into recognizable range.

That inability is what lucy was referring to in binary comprehension.
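
Going the other way, from a CIDR block back to a plain first/last range, is also a short job for Python's `ipaddress` module. A minimal sketch; `cidr_to_range` is a made-up helper name.

```python
import ipaddress

def cidr_to_range(cidr):
    """Return (first address, last address, address count) for a CIDR block."""
    net = ipaddress.ip_network(cidr, strict=True)
    return str(net.network_address), str(net.broadcast_address), net.num_addresses

if __name__ == "__main__":
    # CIDR blocks quoted earlier in the thread.
    for block in ("131.253.12.0/22", "131.253.16.0/23", "131.253.18.0/24"):
        print(block, "->", cidr_to_range(block))
```

With `strict=True`, a block whose start address is not on the right boundary (for example 131.253.13.0/22) raises a `ValueError` instead of being silently adjusted.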

lucy24




msg:4535402
 3:39 am on Jan 12, 2013 (gmt 0)

:: drifting o/t ::

I was trying to think of an analogy for the 123.123.0.0/16 * notation and the closest I could come is this:

Suppose your personnel department's computer refuses to accept blank spaces, so there's no such thing as a five-figure salary. Any gaps have to be filled with leading zeros, like $052,986 or $098,250. Or $000,060 for that guy who came in one day to change a few lightbulbs when the janitor was off. (Let's, uhm, assume for the sake of discussion that if you take home more than $999,999 a year, the extra is not on your paycheck but somewhere else.)

Now you've got:

050,000/2 = first two digits have to be the same, so anyone whose paycheck is in the range $50,000 through $59,999.
050,000/3 = anyone in the range $50,000 through 50,999.
050,000/4 = now you're in the range where you can compare paychecks without making anyone mad, since the difference is at most $99.

050,000/1 = anyone who makes less than $100,000, whether it's the $50 lightbulb guy or the $098,000 management trainee. In fact the 5 in second place is meaningless; the correct form becomes 000,000/1.

The dots in your CIDR ranges aren't decimal points. They're equivalent to the commas separating thousands in big numbers. Come to think of it, this is where non-English speaking people say "And your point is...?" or possibly "Huh what?" depending on how much contact they've had with big numbers in English-language texts. In binary-speak:

123
= 0 + 64 + 32 + 16 + 8 + 0 + 2 + 1
= 01111011
and so
123.123.0.0
= 01111011.01111011.00000000.00000000
= 01111011011110110000000000000000

... and you never need to stop and calculate what that final monster (123 x 2^24 + 123 x 2^16 + 0 etc.) would be in base ten. (I get two billion and something-- but I didn't pay much attention to the calculator so this may be entirely wrong.)
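
The binary arithmetic above can be checked in a few lines of Python. The helper names are made up for illustration; `same_prefix` shows what the "/16" means in practice: two addresses are in the same block when their first 16 bits agree.

```python
import ipaddress

def to_binary(ip):
    """Dotted-quad address as four 8-bit binary groups."""
    return ".".join(f"{int(octet):08b}" for octet in ip.split("."))

def to_int(ip):
    """The whole address as a single 32-bit integer."""
    return int(ipaddress.IPv4Address(ip))

def same_prefix(a, b, bits):
    """True if addresses a and b share their first `bits` bits."""
    shift = 32 - bits
    return to_int(a) >> shift == to_int(b) >> shift

if __name__ == "__main__":
    print(to_binary("123.123.0.0"))
    print(to_int("123.123.0.0"))
    print(same_prefix("123.123.0.0", "123.123.200.7", 16))
```

The integer comes out as 2,071,658,496, so "two billion and something" was right.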

I could go on, but I suspect we are both getting tired.


* I first typed "123.456.0.0/78" but couldn't do it, even as a meaningless example.

keyplyr




msg:4535405
 4:13 am on Jan 12, 2013 (gmt 0)



There are 10 types of people in the world; those that understand binary & those that don't.

blend27




msg:4535785
 12:27 am on Jan 14, 2013 (gmt 0)

168.62.212.190 is on a roll; why can't MSFT secure their own servers? This IP is all over the map.

OrgAbuseHandle: HOTMA-ARIN

Hot Who?

lucy24




msg:4535807
 3:23 am on Jan 14, 2013 (gmt 0)

I get the impression "abuse at hotmail dot com" is the default contact address for anything belonging to MSN, because if you're Microsoft you obviously don't have time to fill in the form to match the specific IP it applies to.

They don't have a whole /8 block to themselves do they? I looked this up just the other day for some nearby thread. Must've missed the handout by five minutes; they all date from the early '90's.

There are 10 types of people in the world; those that understand binary & those that don't.

Heh. I will file that alongside "There are two kinds of people in the world: the ones who divide the world into two kinds, and the ones who don't." Or, for that matter, "There are two secrets to success in life. The first is: Never tell everything you know."

wilderness




msg:4544410
 1:49 am on Feb 11, 2013 (gmt 0)

479 requests (approximately 300 the previous day) for robots.txt on one website in a twenty-four-hour period.

Crawl delay did not help, although that is not the intention of crawl delay.

Since redirects won't work!
Anybody have any suggestions on how to stop this nonsense?

lucy24




msg:4544417
 2:14 am on Feb 11, 2013 (gmt 0)

!
So that's why bing/msn's rate of robots.txt requests has dropped so dramatically on my site since last year. They've decided they like your version better.

If you had your own server you could implement some nasty business with firewalls. But on htaccess your only real options are to let them in or not let them in.

:: quick detour to any old log entry ::

ymmv, but in my setup, handing over robots.txt uses fewer bytes than a 403. I do not understand why the numbers are not exactly the same each time --that is, a consistent number for robots.txt and a consistent number for 403-- and I'm fairly certain I would not understand the explanation.

wilderness




msg:4544419
 2:28 am on Feb 11, 2013 (gmt 0)

Any idea what a "500" might do?

lucy24




msg:4544461
 8:42 am on Feb 11, 2013 (gmt 0)

Can you live without bing traffic? ;)

If a legitimate search engine gets anything other than a 200 or 404 when they ask for robots.txt, they're supposed to hold off on crawling the site-- the, ahem, rest of it-- until the robots.txt problem is sorted.

Might be an interesting question to ask bing themselves. Wouldn't they have more time to crawl if they didn't spend so much time reading robots.txt? It's not that riveting is it?

It's all the same robot from the same place, right? They're not showing up from 479 different bing/msn IPs in succession. (I was going to add: And they don't have 479 different user-agents. I guess they do if you count all those MSIE non-robots, but those don't ask for robots.txt in the first place.) They can hardly pretend they haven't seen it. Is there some long-standing bug in the software that's supposed to keep track of robots.txt pickups?

Hm. Wonder if they'd keep asking just as often if you started serving up mendacious 404s. (Not 410: bing doesn't treat those differently from 404s-- and in this case you wouldn't want them to!) Slight problem of course if there are areas of your site/s that you actually don't want bing/msn to crawl...

keyplyr




msg:4544507
 11:47 am on Feb 11, 2013 (gmt 0)


For over a year Bingbot has been doing a full crawl of my 260 page site each and every day, usually twice, sometimes more. It requests robots.txt at least 100x per day.

Bingbot also sometimes injects a nonexistent directory into valid file paths, rendering them non-valid, so out of the roughly 500 or 600 page requests per day, it generates about 200 404s. I have had numerous conversations with their techs via email. More than once they have promised this behavior would eventually correct itself. It has not.

However I'm reluctant to push the matter since I have been steadily gaining SERP placement over the last year, so much so that I get a very nice amount of traffic from Bing now. That's a good thing since I lost a tad in Google with the second wave of Panda.

lucy24




msg:4544681
 8:10 pm on Feb 11, 2013 (gmt 0)

Bingbot also sometimes injects a nonexistent directory into valid file paths rendering them non-valid

Isn't this one of the things search engines do on purpose to make sure you haven't got any lurking Soft 404s? 200 out of 500 does seem a bit over the top, though.

keyplyr




msg:4544701
 8:41 pm on Feb 11, 2013 (gmt 0)

Isn't this one of the things search engines do on purpose to make sure you haven't got any lurking Soft 404s?

For that purpose, I have seen things like:

http://example.com/NO-EXIST.html

What I'm referring to is this:

http://example.com/random-numbers-here/valid-page.html

Bing says it's not them, that they may be following links from some directory that may have a toxic database, and that these bad links will not affect my indexing and will eventually drop off their crawl. They have not.

dstiles




msg:4544719
 9:39 pm on Feb 11, 2013 (gmt 0)

Wilderness - 500 says the server has a fault. Since this may be a temporary off-line thing (eg reboot of MS IIS server after update) the bot should re-visit within a reasonable time to ensure the 500 is not permanent, which could occur if the server were taken off-line permanently (you've moved your site elsewhere but forgotten to tell DNS).

Often the 500 is merely "I'm a bit busy at the moment" - too much traffic to handle.

lucy24




msg:4544732
 10:15 pm on Feb 11, 2013 (gmt 0)

What I'm referring to is this:

http://example.com/random-numbers-here/valid-page.html

Gotcha. I see that pattern sometimes* from inept robots. Yandex sometimes, but not bing yet. Except that generally it isn't a wholly imaginary directory. It's a real directory-- just not the one that contains the file they're asking for. And it isn't due to linking errors from my end.

Hm. I guess you can't downgrade a robot for Low Technical Quality ;)


* Shortly after posting, I went over to chew on logs. And what should pop up but
/paintings/myrats/vwegfmnwngg.html
That's the "real location, bogus filename" variant.

:: wandering off to feed 'vwegfmnwngg' into transcoder to see if it yields a real word in any major legacy font ::

wilderness




msg:4544761
 12:22 am on Feb 12, 2013 (gmt 0)

It's all the same robot from the same place, right? They're not showing up from 479 different bing/msn IPs in succession.


They're all using the UA msnbot-media/1.1.

As to IP, they're all MSN IPs; however, they are certainly NOT the same.

199.30.
65.55.
157.55.
131.253.

No request for pages with msnbot-media/1.1 UA, all requests are for robots.txt alone.

BTW the requests in the latest 24-hour period slowed down from the previous high, to 250.
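
A tally like the one above (robots.txt requests from one UA, broken down by the first two octets of the IP) can be pulled from a combined-format access log with a short Python sketch. The function name is made up, and the sample lines are copied from the first post in this thread rather than from the logs being discussed here.

```python
from collections import Counter

def robots_hits_by_prefix(lines, ua_substring="msnbot-media/1.1"):
    """Count robots.txt requests per 'a.b.' IP prefix for one user-agent."""
    counts = Counter()
    for line in lines:
        if '"GET /robots.txt ' not in line or ua_substring not in line:
            continue
        ip = line.split(" ", 1)[0]                    # first field is the client IP
        prefix = ".".join(ip.split(".")[:2]) + "."    # e.g. "131.253."
        counts[prefix] += 1
    return counts

# Entries copied from the first post in this thread.
SAMPLE = [
    '131.253.41.45 - - [26/Jun/2012:06:20:22 -0700] "GET /robots.txt HTTP/1.1" 200 533 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"',
    '131.253.41.45 - - [26/Jun/2012:06:20:22 -0700] "GET /hovercraft/images/kabloona.jpg HTTP/1.1" 200 44328 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"',
    '131.253.41.223 - - [26/Jun/2012:07:53:18 -0700] "GET /robots.txt HTTP/1.1" 200 533 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"',
    '207.46.199.163 - - [26/Jun/2012:08:50:38 -0700] "GET /robots.txt HTTP/1.1" 200 533 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"',
]

if __name__ == "__main__":
    for prefix, hits in robots_hits_by_prefix(SAMPLE).most_common():
        print(prefix, hits)
```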

lucy24




msg:4544806
 4:25 am on Feb 12, 2013 (gmt 0)

No request for pages with msnbot-media/1.1 UA, all requests are for robots.txt alone.

On my site, msnbot-media stopped asking for pages in late August. (I mention it in the current incarnation of At Home With The Robots.) It's now only images and robots.txt in rock-steady alternation.

At the transition point, it picked up nothing but robots.txt for about a week and a half :)
