Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
MS Crawler Hiding as a Browser
incrediBILL

WebmasterWorld Administrator | Top Contributor of All Time | 5+ Year Member | Top Contributor of the Month



 
Msg#: 4402061 posted 2:52 am on Dec 29, 2011 (gmt 0)

Here's something that quacks like a browser from the wonderful world of Microsoft IPs.

Once anything asks for robots.txt on the site, the site bags and tags it as a robot and will give it 403 Forbidden forever until it answers the Turing test. This one is obviously a bot: it doesn't seem to mind the 403s, never asks for images or CSS, just bangs its head on some 403s and leaves.

65.52.0.229 - - [08/Dec/2011:07:04:28 +0000] "GET / HTTP/1.1" 200 4047 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
65.52.0.229 - - [08/Dec/2011:07:06:49 +0000] "GET /robots.txt HTTP/1.1" 200 731 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
65.52.0.229 - - [08/Dec/2011:09:20:38 +0000] "GET /some_page.php HTTP/1.1" 403 804 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
65.52.0.229 - - [08/Dec/2011:16:00:06 +0000] "GET / HTTP/1.1" 403 804 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
65.52.0.229 - - [09/Dec/2011:05:23:59 +0000] "GET / HTTP/1.1" 403 804 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
65.52.0.229 - - [24/Dec/2011:11:56:39 +0000] "GET /faq.htm HTTP/1.1" 403 1625 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
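For anyone who wants to experiment with the "asked for robots.txt, therefore a robot" idea, here is a minimal Python sketch. It is not the code running on the site: the `flag_robots` helper and the log-line regular expression are assumptions made for illustration, and the Turing-test escape hatch is not modeled.

```python
import re

# Assumed pattern for Apache combined-log lines like the ones quoted above.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def flag_robots(lines):
    """Return the set of IPs that requested /robots.txt (hypothetical helper)."""
    flagged = set()
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group("path") == "/robots.txt":
            flagged.add(m.group("ip"))
    return flagged

# First two log lines from the post above.
sample = [
    '65.52.0.229 - - [08/Dec/2011:07:04:28 +0000] "GET / HTTP/1.1" 200 4047 "-" '
    '"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"',
    '65.52.0.229 - - [08/Dec/2011:07:06:49 +0000] "GET /robots.txt HTTP/1.1" 200 731 "-" '
    '"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"',
]
flagged = flag_robots(sample)
```

Any IP that ends up in `flagged` would then be served 403s on later requests, which matches the status codes in the remaining log lines.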

 

keyplyr

WebmasterWorld Senior Member | Top Contributor of All Time | 10+ Year Member | Top Contributor of the Month



 
Msg#: 4402061 posted 4:41 am on Dec 29, 2011 (gmt 0)

I've been seeing that and other covert activity from M$ ranges forever it seems, but lately even more.

DeeCee



 
Msg#: 4402061 posted 6:43 am on Dec 29, 2011 (gmt 0)

Microsoft do this constantly but are not intelligent enough to add an MS identifier to their "pretend" agents.

Google does the same thing, to test your output to various types of "agents", such as different mobile phones and others, but they at least almost always add some kind of 'Google' tag at the end, to show it is them.

What it means overall is that each bot owner does not come by just once to scrape content; they come by two, three, or more times to scrape the same content as different things, adding more server and network load. Testing iPhone, Blackberry, ..., ...

Add to that 'Google Preview', which I expect will have a Microsoft clone soon. Preview lets them keep the average potential ad-clicking user on their own site longer, away from actual Internet content.

I get more and more annoyed with all these content and information scrapers every day.
Try watching when a blog entry is posted, connected to Twitter.
It is a whole swarm of bots that arrive for many hours after each posting, feeding off Twitter. Most of them are anonymous scrapers. Just a short while ago I added Vocus, Inc (also owner of PrWeb) to my block lists. They are still hacking away, retrying over and over every few minutes, despite nothing but 403s in response. All that work, and with NO agent-string at all to show their identity to the average site owner.

I guess we should be happy Microsoft even bother to have an agent-string.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4402061 posted 1:11 pm on Dec 29, 2011 (gmt 0)

I guess we should be happy Microsoft even bother to have an agent-string.

Ah, but MSN's bots routinely do not. Just yesterday:

IP: 65.52.32.151
UA: -

17:58:44 /
17:58:55 /
17:59:09 /
17:59:20 /
17:59:32 /
17:59:43 /
17:59:54 /
18:00:06 /
18:00:17 /
18:00:29 /
18:00:40 /

robots.txt? NO

The exact same 'no UA, no robots.txt, no referrer, 11-hits-to-same-file' pattern has been going on for years. [webmasterworld.com...]
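A rough way to spot this pattern in a log is sketched below in Python. The hit times are copied from the listing above; the `looks_like_stealth_burst` heuristic and its thresholds (`min_hits`, `max_gap`) are assumptions picked to fit this one sample, not a tested rule.

```python
from datetime import datetime

# Hit times for 65.52.32.151, copied from the listing above.
hits = [
    "17:58:44", "17:58:55", "17:59:09", "17:59:20", "17:59:32", "17:59:43",
    "17:59:54", "18:00:06", "18:00:17", "18:00:29", "18:00:40",
]

def gaps_in_seconds(times):
    """Seconds between consecutive HH:MM:SS timestamps on the same day."""
    parsed = [datetime.strptime(t, "%H:%M:%S") for t in times]
    return [int((b - a).total_seconds()) for a, b in zip(parsed, parsed[1:])]

def looks_like_stealth_burst(times, user_agent, min_hits=11, max_gap=15):
    """Hypothetical heuristic: blank UA plus a tight burst of hits on one file."""
    blank_ua = user_agent in ("", "-")
    tight = all(gap <= max_gap for gap in gaps_in_seconds(times))
    return blank_ua and len(times) >= min_hits and tight

gaps = gaps_in_seconds(hits)
```

For the sample above the gaps all fall between 11 and 14 seconds, so the eleven hits span just under two minutes.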

DeeCee



 
Msg#: 4402061 posted 4:32 pm on Dec 29, 2011 (gmt 0)

You are correct.
I think web-site owners have been getting too complacent about protecting their content against scrapers.
Whether those content scrapers claim to be search engines of various kinds or not.

Everyone and their mother (and sisters and 6-month old baby-brothers) today think they should create another data-collector site, a site selling competitor spying, link tracker site, "monitor what is said about you" site, or other. Whatever they "claim" they are thinking of being at this particular time (usually a lie on their web-site, if it can be found). The problem is that we never really know what they will do now or in the future with all that content and/or information.

Anonymous agents DO NOT get in on my servers.

A simple

SetEnvIf User-Agent "^$" bad_bot="NoAgentString~impersonator"


with various types of blocks for the bad_bot environment variable kills off the blanks.

If you do not even want to tell me who you are, you do not get in.
Similarly, if you are too stupid to change the Agent-String from the default string in the public code library you used to create yet another scraper, you do not get in either.

All the "Jakarta Commons-HttpClient", "Zend_Http_Client", "RomeClient", and other anonymous bots get killed too. Anyone wanting to scrape without an accurate Agent String gets killed.
So do all the human impersonators (crawlers with only "human" type strings), when caught.
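The same two checks (blank agent string, or an unchanged library default) can be sketched outside Apache as well, for example when post-processing logs. The Python below is only an illustration: the `classify_user_agent` function and its return labels are made up here, and the token list contains just the three library names mentioned above.

```python
# Default library UA fragments named in the post; extend as needed.
DEFAULT_LIBRARY_TOKENS = (
    "Jakarta Commons-HttpClient",
    "Zend_Http_Client",
    "RomeClient",
)

def classify_user_agent(ua):
    """Return a block reason for a User-Agent string, or None if it passes.

    Hypothetical helper: the labels are illustrative, not an Apache feature.
    """
    ua = (ua or "").strip()
    if ua in ("", "-"):
        return "NoAgentString"
    lowered = ua.lower()
    for token in DEFAULT_LIBRARY_TOKENS:
        if token.lower() in lowered:
            return "DefaultLibraryAgent"
    return None
```

Catching the human impersonators is a different problem, since their strings look legitimate by design; a string match like this one does nothing for them.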

If you want to steal other people's content, the least you can do is say who you are, and how letting you steal it will help the site owner.

The higher the percentage of site owners that block them, the less value their databases will have.
Missing information and missing sites dilute the value of those datasets and will cost them customers.

dstiles

WebmasterWorld Senior Member | Top Contributor of All Time | 5+ Year Member



 
Msg#: 4402061 posted 7:37 pm on Dec 29, 2011 (gmt 0)

Bill - none of those UAs looks genuine to me. There are usually extra "fields" in them proclaiming (usually ad nauseam) what .NET etc version they are.
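One way to test for that "no extra fields" shape is sketched below in Python. The `is_bare_msie` function is a hypothetical helper, and treating exactly three fields (compatible; MSIE x; Windows NT y) as "bare" is an assumption drawn from the UA quoted in the opening post, not an established rule.

```python
import re

# Matches a Mozilla/4.0 UA with a single, un-nested parenthesised field list.
UA_RE = re.compile(r"^Mozilla/4\.0 \((?P<fields>[^()]*)\)$")

def is_bare_msie(ua):
    """True if the UA is only 'compatible; MSIE x; Windows NT y' with no extras."""
    m = UA_RE.match(ua.strip())
    if not m:
        return False
    fields = [f.strip() for f in m.group("fields").split(";")]
    return (
        len(fields) == 3
        and fields[0] == "compatible"
        and fields[1].startswith("MSIE ")
        and fields[2].startswith("Windows NT ")
    )

# UA from the opening post, and an illustrative variant with one extra field.
bare = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
with_extras = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; .NET CLR 2.0.50727)"
```

A bare string is only a hint, since a clean install could legitimately send one, so this would be better used for scoring than for outright blocking.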

incrediBILL

WebmasterWorld Administrator | Top Contributor of All Time | 5+ Year Member | Top Contributor of the Month



 
Msg#: 4402061 posted 6:55 am on Jan 9, 2012 (gmt 0)

Technically it's a plain vanilla MSIE 7 install on Vista, highly unlikely, but possible.

My guess based on behavior is it's a screen shot tool but it seems unlikely they'd use that MSIE version or Vista for that matter.

Don't forget MS has some cloud computer service they resell if I'm not mistaken, like AWS, so I'm completely suspicious about whatever this stuff is at this point.

keyplyr

WebmasterWorld Senior Member | Top Contributor of All Time | 10+ Year Member | Top Contributor of the Month



 
Msg#: 4402061 posted 8:21 am on Jan 9, 2012 (gmt 0)

Don't forget MS has some cloud computer service they resell if I'm not mistaken, like AWS

It's called Azure

70.37.0.0 - 70.37.191.255
70.37.0.0/17
70.37.128.0/18
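To check an address against these blocks, Python's standard `ipaddress` module is enough. The sketch below uses only the range and CIDR blocks listed above, which have not been verified against any registry here; `in_listed_blocks` is a hypothetical helper name.

```python
import ipaddress

# CIDR blocks as listed in the post above.
AZURE_BLOCKS = [
    ipaddress.ip_network("70.37.0.0/17"),
    ipaddress.ip_network("70.37.128.0/18"),
]

def in_listed_blocks(ip):
    """True if the address falls inside one of the listed blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in AZURE_BLOCKS)

# Cross-check: the dotted range should collapse to exactly the two CIDR blocks.
summarized = list(
    ipaddress.summarize_address_range(
        ipaddress.ip_address("70.37.0.0"),
        ipaddress.ip_address("70.37.191.255"),
    )
)
```

The cross-check confirms that the dotted range and the two CIDR lines describe the same set of addresses.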

lucy24

WebmasterWorld Senior Member | Top Contributor of All Time | Top Contributor of the Month



 
Msg#: 4402061 posted 9:00 am on Jan 9, 2012 (gmt 0)

Technically it's a plain vanilla MSIE 7 install on Vista, highly unlikely, but possible.


As long as we're talking about vanilla:

Is there the smallest iota of possibility that the related form

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

could ever be a human? I've only just become aware of it, because it hangs out in places that are already blocked by IP. But I'm perfectly happy to throw in the UA itself as a backup ;)

Staffa

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 4402061 posted 1:41 pm on Jan 9, 2012 (gmt 0)

Lucy, I see this UA mainly used by non-humans from China (blocked, of course).

enigma1

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4402061 posted 2:12 pm on Jan 9, 2012 (gmt 0)

What I found interesting is the MSIE 7.0 tag mentioned. I was thinking it may have something to do with the recent news, IE6 dropped below 1%? Surely they can influence stats to some degree and make sensational news.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4402061 posted 3:13 pm on Jan 9, 2012 (gmt 0)

1.) Spotted the OP's bot-as-browser UA post-tweet:

65.52.0.229
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

robots.txt? NO

The tweeted link was to a plain.txt file denied to all SEs via robots.txt and is not included on MSN's sitemap.xml. The above hit 'browsed' straight to it.

2.) Here's yet another bot-as-browser from .search.msn.com:

msnbot-207-46-12-61.search.msn.com
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30707)

robots.txt? NO

That's just one more in an ever-lengthening list of bots-as-browsers from .search.msn.com Hosts and IPs.

3.) Microsoft's Dynamic Hosting IPs now also spawn the long-familiar (to me, at least) 'No UA, no robots.txt, no referrer, 11-hits-to-same-file' visits: "MSN's Stealth Missions" [webmasterworld.com...]
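The hostname in item 2 appears to embed the visitor's IP address. Assuming that naming pattern holds, which is a guess from this single example, a small Python sketch can pull the address back out so it can be compared with the connecting IP. `ip_from_msnbot_host` is a hypothetical helper; a proper check would also do a forward DNS lookup on the hostname, which is not shown here.

```python
import re

# Assumed naming pattern: msnbot-A-B-C-D.search.msn.com
HOST_RE = re.compile(
    r"^msnbot-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.search\.msn\.com$"
)

def ip_from_msnbot_host(hostname):
    """Return the dotted IP embedded in the hostname, or None if it doesn't fit."""
    m = HOST_RE.match(hostname.strip().lower())
    if not m:
        return None
    octets = m.groups()
    if any(int(octet) > 255 for octet in octets):
        return None
    return ".".join(octets)

embedded = ip_from_msnbot_host("msnbot-207-46-12-61.search.msn.com")
```

If the embedded address differs from the IP that actually connected, the hostname deserves more suspicion, not less.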

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4402061 posted 2:53 am on Jan 11, 2012 (gmt 0)

This just came in again, post-Tweet:

65.52.0.229
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

robots.txt? NO

That's seven sightings of the same IP+UA between Bill's OP and my replies. Bill, were your hits post-tweet? Might this be a Twitter-specific thing?

lucy24

WebmasterWorld Senior Member | Top Contributor of All Time | Top Contributor of the Month



 
Msg#: 4402061 posted 4:19 am on Jan 11, 2012 (gmt 0)

There's an addendum to the robots.txt standard, but it's written in invisible ink-- or set to {display: none} or equivalent. Robots wearing street clothes don't count as robots and therefore don't need to read, let alone obey, robots.txt. My plainclothes MSNbots always pick up two things: the page itself, and its associated piwik.js file in a roboted-out directory. I mean of course "try to pick up the js file".
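For reference, what a well-behaved client is supposed to do with a roboted-out directory can be sketched with Python's standard `urllib.robotparser`. The robots.txt content and the `/piwik/` directory name below are assumptions for illustration, not the actual file from the site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the directory name is an assumption for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /piwik/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(path, agent="*"):
    """True if the parsed rules permit this agent to fetch the path."""
    return parser.can_fetch(agent, path)

# A browser-style UA falls under the catch-all rule like any other agent.
disguised_ua = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
```

The rules apply the same way whatever the agent calls itself; the catch is that a client that never reads robots.txt never sees them.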

Food for thought: Since MS is openly waiting for the day MSIE 6 disappears from the face of the planet* does that mean they're forced to dress their robots in MSIE 7 for verisimilitude?


* Or at least that part of it south of the 60th parallel. What the bleepity bleepity do locutions like "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; GTB7.1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C)" mean anyway?


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved