Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
MSR-ISRCCrawler repurposed
Now with analysis "for Microsoft's Search and Ads services"
caribguy




msg:4003516
 5:33 am on Oct 8, 2009 (gmt 0)

This spider was discussed here [webmasterworld.com] some time ago.

It's still crawling from 131.107.65.nn
It's still not honoring robots.txt

But now its stated purpose has changed from (helping) Live Search understand the rate of change of web pages and understand non-404 error pages, to (analyzing) the web for Microsoft's Search and Ads services.

Somewhat nebulous details found here:

[research.microsoft.com...]
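
One way to back up the "not honoring robots.txt" observation is to compare the paths a crawler actually requested against what robots.txt permits for its user agent. The following is a minimal sketch using Python's standard urllib.robotparser; the helper name disallowed_hits, the robots.txt content, and the path list are hypothetical and only illustrate the idea.

```python
from urllib.robotparser import RobotFileParser


def disallowed_hits(robots_txt, user_agent, requested_paths):
    """Return the requested paths that robots.txt forbids for user_agent."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network fetch is needed here.
    parser.parse(robots_txt.splitlines())
    return [p for p in requested_paths if not parser.can_fetch(user_agent, p)]


if __name__ == "__main__":
    # Hypothetical robots.txt that shuts this one crawler out entirely.
    robots = "User-agent: MSR-ISRCCrawler\nDisallow: /\n"
    # Hypothetical paths taken from an access log for that user agent.
    paths = ["/myfolder/mypage.html", "/styles.css"]
    for path in disallowed_hits(robots, "MSR-ISRCCrawler", paths):
        print("fetched despite Disallow:", path)
```

Any path this prints was requested even though the rules excluded it, which is the behavior described above.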

 

Umbra




msg:4003729
 2:11 pm on Oct 8, 2009 (gmt 0)

They write that MSR-ISRCCrawler is "typically from 131.107.65.41" but they don't disclose the range. That IP still doesn't have a reverse DNS entry. And they don't explain the other crap coming from 131.107.*
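
The missing reverse DNS entry matters because a common way to verify a crawler is a forward-confirmed reverse DNS check: look up the PTR name for the IP, then resolve that name and confirm it maps back to the same IP. Below is a minimal sketch using Python's socket module. The function name and the injectable lookup parameters are assumptions made for illustration and testing, not anything Microsoft documents.

```python
import socket


def forward_confirmed_rdns(ip,
                           reverse_lookup=socket.gethostbyaddr,
                           forward_lookup=socket.gethostbyname_ex):
    """Return the verified hostname for ip, or None if verification fails."""
    try:
        # gethostbyaddr returns (hostname, aliases, addresses).
        hostname = reverse_lookup(ip)[0]
    except OSError:
        # No PTR record, as reported above for 131.107.65.41.
        return None
    try:
        # gethostbyname_ex returns (hostname, aliases, addresses).
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return None
    # Only trust the name if it resolves back to the original IP.
    return hostname if ip in addresses else None


if __name__ == "__main__":
    print(forward_confirmed_rdns("131.107.65.41"))
```

An IP with no reverse entry returns None here, so it cannot be tied to any hostname its operator controls.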

wilderness




msg:4003843
 5:15 pm on Oct 8, 2009 (gmt 0)

MS will never garner any credibility from this IP range and should abandon the range.

Umbra




msg:4003915
 6:57 pm on Oct 8, 2009 (gmt 0)

"MS will never garner any credibility from this IP range and should abandon the range."

I wish I had the impunity to dare suggest the same for some of Google's IPs.

wilderness




msg:4003919
 7:05 pm on Oct 8, 2009 (gmt 0)

"I wish I had the impunity to dare suggest the same for some of Google's IPs."

It's not impossible.

It's certainly possible to deny access to many of Google's IPs and/or "tools" without affecting the crawls by their primary bot.

Umbra




msg:4003927
 7:16 pm on Oct 8, 2009 (gmt 0)

"It's not impossible.

It's certainly possible to deny access to many of Google's IPs and/or 'tools' without affecting the crawls by their primary bot."

Sure, but with Google, there are multiple tools and user agents on the exact same IP. Some IP addresses rotate among Google Wireless Transcoder, translate.google.com, Google Keyword Tool, Google Site Verification, AppEngine-Google, a blank user agent, and a regular browser user agent. So if I block the IP, I don't know exactly what I'm blocking. Plus these IPs are scattered all over the place, like guerrilla warfare.

At least Microsoft has the courtesy to quarantine all its rogue agents under one IP range.
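
To see which IPs present more than one user agent, as described in this post, an access log can be grouped by client address. The sketch below assumes a log layout of virtual host, client IP, quoted request, status, size, quoted referer, and quoted user agent; the regular expression and the function name agents_by_ip are illustrative and would need adjusting for a different log format.

```python
import re
from collections import defaultdict

# Assumed layout: vhost ip "request" status size "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<vhost>\S+) (?P<ip>\S+) "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def agents_by_ip(lines):
    """Map each client IP to the set of user agents seen from it."""
    seen = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:  # silently skip lines that do not fit the assumed layout
            seen[match.group("ip")].add(match.group("agent"))
    return dict(seen)


def shared_ips(lines):
    """Return only the IPs that presented more than one user agent."""
    return {ip: uas for ip, uas in agents_by_ip(lines).items() if len(uas) > 1}
```

Running shared_ips over a day's log gives a list of addresses where blocking by IP alone would also block other agents.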

caribguy




msg:4004047
 12:16 am on Oct 9, 2009 (gmt 0)

I have an equal-opportunity policy when it comes to blocking. A bot's privilege of accessing my server is granted based on its perceived merits. That privilege has just been revoked for 131.107.0.0/16.


www.example.com 131.107.0.ab "GET /webpage.html HTTP/1.1" 200 10418 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; WOW64; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022)"
www.example.com 131.107.0.aa "GET /somefolder HTTP/1.1" 200 11242 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; WOW64; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022)"
www.example.com 131.107.0.ab "GET /thispage.html HTTP/1.1" 200 10458 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; WOW64; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022)"
www.example.com 131.107.0.ab "GET /oldfolder/redirect HTTP/1.1" 301 - "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; WOW64; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022)"
www.example.com 131.107.0.ab "GET /newfolder/redirect/target HTTP/1.1" 200 4352 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; WOW64; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022)"
www.example.com 131.107.0.ab "GET /nicetry HTTP/1.1" 403 281 "-" "-"
www.example.com 131.107.0.abc "GET /some/hotlinked/image.jpg HTTP/1.1" 302 28372 "http://forums.example.net/index.php?topic=123456.0" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727; MS-RTC LM 8; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; MS-RTC EA 2)"
www.example.com 131.107.0.abc "GET / HTTP/1.1" 200 8573 "http://forums.example.net/index.php?topic=123456.0" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727; MS-RTC LM 8; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; MS-RTC EA 2)"
www.example.com 131.107.0.aa "GET /robots.txt HTTP/1.1" 200 411 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
www.example.com 131.107.0.aa "GET / HTTP/1.1" 200 9899 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
www.example.com 131.107.0.aa "GET /robots.txt HTTP/1.1" 200 411 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
www.example.com 131.107.0.aa "GET / HTTP/1.1" 200 9875 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
www.example.com 131.107.65.nnn "GET /robots.txt HTTP/1.1" 200 204 "-" "MSR-ISRCCrawler"
www.example.com 131.107.65.nnn "GET /myfolder/mypage.html HTTP/1.1" 200 3297 "-" "MSR-ISRCCrawler"
www.example.com 131.107.65.nnn "GET /styles.css HTTP/1.1" 200 1709 "-" "MSR-ISRCCrawler"
www.example.com 131.107.65.nnn "GET /robots.txt HTTP/1.1" 200 204 "-" "MSR-ISRCCrawler"
www.example.com 131.107.65.nnn "GET /myfolder/mypage.html HTTP/1.1" 200 3319 "-" "MSR-ISRCCrawler"

And the "human looking" traffic that comes in without a referer from 65.55.n.n is now subject to closer scrutiny.

As always, YMMV...
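
The policy described in this post (deny everything from 131.107.0.0/16, and look more closely at browser-like requests without a referer from 65.55.n.n) could be sketched with Python's ipaddress module as follows. Treating 65.55.n.n as 65.55.0.0/16, the "Mozilla/" prefix test, and the three labels are assumptions made for illustration, not the poster's actual configuration.

```python
import ipaddress

# Range the poster says is now blocked outright.
BLOCKED = ipaddress.ip_network("131.107.0.0/16")
# Assumption: "65.55.n.n" is read as the whole 65.55.0.0/16 range.
WATCHED = ipaddress.ip_network("65.55.0.0/16")


def classify(ip, referer, user_agent):
    """Return 'deny', 'review', or 'allow' for a single request."""
    addr = ipaddress.ip_address(ip)
    if addr in BLOCKED:
        return "deny"
    no_referer = referer in ("", "-")
    looks_like_browser = user_agent.startswith("Mozilla/")
    if addr in WATCHED and no_referer and looks_like_browser:
        return "review"
    return "allow"


if __name__ == "__main__":
    print(classify("131.107.65.41", "-", "MSR-ISRCCrawler"))
```

In practice the same two ranges would more likely go into the web server or firewall configuration; the function only makes the decision logic explicit.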
