Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
the plainclothes bingbot
lucy24




 
Msg#: 4621945 posted 11:06 pm on Nov 7, 2013 (gmt 0)

Moderators-- I consider this an SSID topic. But you may decide it's more suited to a bing/msn forum. This paragraph will self-destruct in 60 seconds.

I am not the only person who has been alternately puzzled and vexed by this humanoid. Can we take it as read that the Microsoft corporation does not employ an army of humans with elderly computers and too much time on their hands? The thing is a robot. But, unlike all other bing-affiliated robots, it doesn't ask for robots.txt every five minutes. Instead it asks for it ... never. Admittedly it could be reading over the bingbot's shoulder, and it has never yet asked for a roboted-out page. But it's the principle of the thing.

Starting in early October I've been tracking it. This involves a two-pronged approach: first unblocking the plainclothes bingbot globally-- including letting it run wild in piwik-- and then flagging it in logs. Disclaimer: I suspect its behavior changed during the time I've been tracking it. I mean by coincidence, not for Schroedinger's-cat reasons. We Shall See.

First discovery: there are two of them. Possibly two and a half.

User Agents

#1: MSIE 7. The exact configuration varies from one visit to the next. It's always Windows NT 5.0 or 5.1 (i.e. Windows 2000 or XP, dating back to before 2006) with assorted .NET CLR add-ons seemingly at random. According to piwik, it always has a resolution of 800x600. (!)

#2: MSIE 9. This one is always exactly
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;  WOW64;  Trident/5.0)
(extra spaces to either side of "WOW64;") with resolution of 1024x768

#3: plausible exception. In October, one visited page included a midi file. For this, the robot switched to the "contype" UA. (I looked this up. It's what earlier MSIE versions used for some types of media files.)
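For anyone who wants to flag these in their own logs, here is a minimal sketch of a classifier for the three UAs described above. The function name and labels are made up for illustration, and the MSIE 7 pattern assumes the usual "MSIE 7.0; Windows NT 5.x" token order, which the thread does not quote verbatim. A match on the UA alone proves nothing, since real humans send the same strings; it only becomes interesting in combination with the IP.

```python
import re

# Exact MSIE 9 string as quoted in the post, including the doubled
# spaces on either side of "WOW64;".
MSIE9_EXACT = ("Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; "
               "Trident/5.0;  WOW64;  Trident/5.0)")

# MSIE 7 on Windows NT 5.0 or 5.1. Whatever follows (the .NET CLR add-ons)
# varies between visits, so it is deliberately not matched.
# Assumption: the tokens appear in this order in the UA string.
MSIE7_NT5 = re.compile(r"MSIE 7\.0; Windows NT 5\.[01]\b")


def classify_ua(ua: str) -> str:
    """Return a hypothetical label: 'msie9', 'msie7', 'contype' or 'other'."""
    if ua == MSIE9_EXACT:
        return "msie9"
    if MSIE7_NT5.search(ua):
        return "msie7"
    if ua.strip().lower() == "contype":
        return "contype"
    return "other"
```

The MSIE 9 test is an exact string comparison on purpose: a UA with single spaces around "WOW64;" falls through to "other".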

IP

#1: "older ranges": 65.55.211-213, 65.55.215, 65.55.217-218 and 131.253.23-26, 131.253.36 (I checked back: they really do seem to skip .214 and .216 at all times, and they're selective in the 131.253 area). The first is the identical range used by msnbot-media. Usually MSIE 7, sometimes MSIE 9.

#2: "newer range": 199.30.24-25, always MSIE 9. I never saw this range before April, and then only for Preview until early October when it started doing plainclothes duty as well.
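The ranges above can be turned into a rough lookup, sketched here with Python's standard ipaddress module. One assumption to flag loudly: each listed prefix is treated as a full /24, which is only my reading of the shorthand, not a published allocation. The names OLDER, NEWER and classify_ip are illustrative.

```python
import ipaddress


def _nets(prefix, third_octets):
    # Assumption: every listed prefix covers the whole /24.
    return [ipaddress.ip_network(f"{prefix}.{octet}.0/24") for octet in third_octets]


# "Older ranges": note that 65.55.214 and 65.55.216 are skipped on purpose.
OLDER = (_nets("65.55", [211, 212, 213, 215, 217, 218])
         + _nets("131.253", [23, 24, 25, 26, 36]))

# "Newer range", seen only with the MSIE 9 UA.
NEWER = _nets("199.30", [24, 25])


def classify_ip(ip: str) -> str:
    """Return 'older', 'newer' or 'other' for a dotted-quad IPv4 address."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in OLDER):
        return "older"
    if any(addr in net for net in NEWER):
        return "newer"
    return "other"
```

For example, classify_ip("65.55.214.10") comes back as "other", matching the observation that .214 is never used.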

Behavior

Both UAs from both IPs pick up non-page files: css, js, midi, pdf. Only MSIE9, and then only from 199.30, picks up images. All non-page files give the page as referer. They never ask for the favicon.

Older ranges: These are more leisurely visits, ranging from a second or two to (rarely) up to 20 seconds from beginning to end. On rare occasions this IP doesn't pick up, or doesn't act on, javascript (in my case generally piwik). IPs can vary within a visit, even hopping between 65.55 and 131.253.

199.30 range: Fully humanoid apart from favicon. Typically in and out within a second or two, exactly like a human; piwik.js is typically requested after all images, reflecting its physical location in the html. Identical IP for the duration of each visit.

Around the middle of October there was a cluster of 199.30 visits (humanoid with images) coming pretty exactly 30 seconds after a visit to the same page by Bing Preview, in each case using the identical IP for both. Each preview seems to have been triggered by text search, not image. (This is assuming that image search is always accompanied by an image fetch.)
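The Preview-then-plainclothes pattern is easy to look for mechanically. Below is a rough sketch, under assumptions: hits are already parsed into (time, ip, path, ua) tuples sorted by time, Bing Preview is recognised by the token "BingPreview" in its UA (not quoted anywhere in this thread), and "pretty exactly 30 seconds" is loosened to a 20-40 second window. The function name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

Hit = Tuple[datetime, str, str, str]  # (time, ip, path, user-agent)


def find_followups(hits: Iterable[Hit],
                   min_gap: timedelta = timedelta(seconds=20),
                   max_gap: timedelta = timedelta(seconds=40)) -> List[Tuple[str, str, float]]:
    """Pair each Preview hit with a later non-Preview hit from the same IP
    to the same path, if the gap falls inside the window.

    Returns (ip, path, gap_in_seconds) for every pair found.
    """
    last_preview = {}  # (ip, path) -> time of the most recent Preview hit
    pairs = []
    for ts, ip, path, ua in hits:
        key = (ip, path)
        if "BingPreview" in ua:  # assumption: Preview's UA carries this token
            last_preview[key] = ts
        elif key in last_preview:
            gap = ts - last_preview[key]
            if min_gap <= gap <= max_gap:
                pairs.append((ip, path, gap.total_seconds()))
                del last_preview[key]  # count each Preview at most once
    return pairs
```

Only the page request itself is paired; the css, js and image requests that follow have different paths and are ignored.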

Aside: I honestly don't know what the scoop is with Bing Preview. There's never any information about a search-- or indeed any referer at all. And, as noted elsewhere, I can't even figure out how to get a Preview when searching in my own persona. And, finally, am I the only one who thinks it's funny that Bing Preview uses webkit rather than some form of MSIE?

Javascript

Both UAs and both IPs act on javascript, whether or not they get images. In my case, this generally means piwik. They send a full information packet, not the administrative pixel sent out to visitors with scripting turned off.

Especially interesting detail: One of the mid-month Preview-plus-plainclothes visits-- and also an earlier Preview alone-- was to one of the rare pages that uses javascript for something other than analytics. Thanks to this page, I know that both Preview and the MSIE 9 robot claim to have the Euphemia font, but not a third-party font that the page also tests for. (Is there any way to fake this? I guess theoretically yes, but surely more trouble than it's worth.) Exactly what I'd expect of a human with the same UA.

Conclusion

... none, actually. I still have no idea what the thing(s) is/are for. But so far it hasn't done anything really egregious.

 

wilderness




 
Msg#: 4621945 posted 12:46 am on Nov 14, 2013 (gmt 0)

"Both UAs from both IPs pick up non-page files: css, js, midi, pdf. Only MSIE9, and then only from 199.30, picks up images. All non-page files give the page as referer. They never ask for the favicon."


Hey lucy,
A simple UA block on "Preview" (with NC) will stop this.

lucy24




 
Msg#: 4621945 posted 1:54 am on Nov 14, 2013 (gmt 0)

You missed the part where I explained I was blocking both (Preview and plainclothes) for ages. I intentionally unblocked them out of curiosity to see what they'd do if I let them run wild.

lucy24




 
Msg#: 4621945 posted 5:01 am on Nov 14, 2013 (gmt 0)

I forgot to mention that they all add to the verisimilitude by sending an
Accept-Language: en-us
header. The MSIE 7 version also says
Ua-Cpu: x86
The 199.30 version-- the one that gets images-- says
Accept: image/jpeg, image/gif, image/pjpeg, */*
while the others say
Accept: */*
only. Interesting detail, because it means I could theoretically know ahead of time if it's going to request images or not, even without looking at the IP.
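In code, that prediction is a one-liner. This is only a sketch of the idea: the function name is made up, and it rests entirely on the pattern seen in these logs (the image-fetching version lists image types explicitly, the others send a bare */*), not on anything documented.

```python
def expects_images(accept_header: str) -> bool:
    """True if the Accept header names at least one image/* type explicitly.

    A bare "*/*" does not count, since that is what the non-image
    versions send.
    """
    # Split on commas, drop any ";q=..." parameters, normalise case.
    media_types = [part.split(";")[0].strip().lower()
                   for part in accept_header.split(",")]
    return any(t.startswith("image/") for t in media_types)
```

With the two headers quoted above, the first gives True and the second gives False.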

It has been quiet for a while but in the past 24 hours there was an absolute blizzard of interest, including several repeat visits. (No cookies.) One of them was to the same midi-containing page as before, but this time it was with the MSIE 9 UA, making the "contype" request (HEAD only this time, not GET) a little less plausible. (Uh... MSIE 9 in real life can read midi files on its own, can't it?)

I suspect they may be phasing out MSIE 7. Right now, that's probably the oldest MSIE that has a fair chance of getting in to almost all sites.

keyplyr




 
Msg#: 4621945 posted 12:14 pm on Nov 14, 2013 (gmt 0)



As far back as I can remember, there have always been several covert bots from M$. One basically just grabbed images with a HEAD check first. One of the others nosed around on back pages that AFAIK never had incoming links. Could never figure them out. They were always around, so eventually I ignored them.

Lately I've seen one as you've described. I just figured it was collecting parental rating or browser security threat info for M$IE.

dstiles




 
Msg#: 4621945 posted 8:34 pm on Nov 14, 2013 (gmt 0)

Lucy - I have 199.30/16 fully blocked. No idea if any type of bot comes from there but have never seen one.

See today's [webmasterworld.com...] for my comment on MS UAs. I wonder if they are using old UAs as a way of determining site reactions to them?

I suspect they are trying to be honest about using webkit - it is a versatile client, after all.

blend27




 
Msg#: 4621945 posted 1:45 pm on Nov 26, 2013 (gmt 0)

I haven't seen a lot from this range for a while, and here it comes:

ip: 131.107.192.167
rdns: 131.107.192.167
time: {ts '2013-11-25 22:45:28'}
method: GET
protocol: HTTP/1.1
host: www.example.com
user-agent: asynchttp
content-length: 0
Cache-Control: no-cache

lucy24




 
Msg#: 4621945 posted 8:03 pm on Dec 14, 2013 (gmt 0)

:: further bump ::

Can't figure out if this is another version of the plainclothes bingbot or something else entirely. I've definitely never seen it before.

The first IP is apparently some human in Belgium. All others are ... well, you know what. But it's not one of the ranges the plainclothes bingbot has been using.

81.241.227.nnn - - [13/Dec/2013:07:39:10 -0800] "GET /paintings/critters/blowups/largebunny7.jpg HTTP/1.1" 200 30768 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"
157.55.43.25 - - [13/Dec/2013:07:39:39 -0800] "GET /paintings/critters/blowups/largebunny7.jpg HTTP/1.1" 200 30731 {same}

So far it looks like an image search with suppressed referer. The next one caught my attention because it's the same 25-30-second lag that I see when Bing Preview is followed by the plainclothes bingbot:

157.55.0.155 - - [13/Dec/2013:07:40:06 -0800] "GET /paintings/critters/blowups/largebunny7.jpg HTTP/1.1" 200 30731 "-" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)"

I am not used to meeting MSIE UAs without a long string of .NET CLR blahblahs at the end, but this seems to be normal for MSIE 10.

Watch, it's not done yet. (It may still not be done: I noticed it yesterday and found a couple more this morning.)

157.55.0.155 - - [13/Dec/2013:09:52:31 -0800] "GET /paintings/critters/blowups/largebunny7.jpg HTTP/1.1" 200 30731 "-" "Mozilla/5.0 (compatible; MSIE 10.0; Windows Phone 8.0; Trident/6.0; IEMobile/10.0; ARM; Touch; NOKIA; Lumia 820)"
157.55.0.155 - - [14/Dec/2013:01:22:46 -0800] "GET /paintings/critters/blowups/largebunny7.jpg HTTP/1.1" 200 30731 {same}
157.55.0.155 - - [14/Dec/2013:05:14:26 -0800] "GET /paintings/critters/blowups/largebunny7.jpg HTTP/1.1" 200 30731 {same}

This isn't some new mobile bingbot, is it? Why would it be picking up just one image, over and over again?
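To measure lags like the ones in the log lines above without eyeballing timestamps, here is a small sketch that parses lines in the combined log format and reports the gaps between hits. It assumes full lines: the "{same}" shorthand used above is my abbreviation, not something a server writes, so abbreviated lines are simply skipped. Function and field names are illustrative.

```python
import re
from datetime import datetime

# Combined log format: ip, ident, user, [time], "request", status, size,
# "referer", "user-agent".
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line is not a full log line."""
    m = LOG_LINE.match(line)
    if not m:
        return None
    hit = m.groupdict()
    # Example timestamp: 13/Dec/2013:07:40:06 -0800
    hit["time"] = datetime.strptime(hit["time"], "%d/%b/%Y:%H:%M:%S %z")
    return hit


def gaps_in_seconds(hits):
    """Seconds between consecutive hits, after sorting by time."""
    times = sorted(h["time"] for h in hits)
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]
```

Filtering the parsed hits by path before calling gaps_in_seconds gives the per-file lags; the month abbreviation parsing assumes an English locale.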

I detoured to check one more thing. Far as I can make out, the bingbot --by any name-- simply doesn't "do" 304s. It's routine with other search engines, even for time spans much longer than a few hours. But 157.55 isn't a major source of image requests at any time.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved