
Search Engine Spider and User Agent Identification Forum

    
msnbot/2 snapshots?
keyplyr

Msg#: 3992111 posted 10:16 am on Sep 19, 2009 (gmt 0)


Verified rDNS, normal crawl range.

It hit each page on a 150-page site twice in succession, response code 200 for all. The first time, the file size is normal. The second time, it looks like the file size of the HTML plus the images. So I'm assuming msnbot/2 is taking snapshots. Can anyone else verify this?
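If anyone wants to check their own logs, something like this should split the bot's hits into their own file so the back-to-back fetches and byte counts are easy to compare (an untested sketch, assuming Apache with mod_setenvif; CustomLog has to live in the server config rather than .htaccess, and the log path is just an example):

# Tag any request whose User-Agent contains "msnbot/2"
SetEnvIfNoCase User-Agent "msnbot/2" is_msnbot2
# Write only those requests to a separate log in combined format
CustomLog logs/msnbot2.log combined env=is_msnbot2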

 

incrediBILL

Msg#: 3992111 posted 2:54 pm on Sep 19, 2009 (gmt 0)

I see it crawling but didn't see any images being downloaded.

Probably because I block all images from being downloaded ;)

Actually, they could be combining the actions of the normal crawler and the image crawler, but if you're seeing it download a page and then immediately download the page again with all the images, that sounds like possible screen shots.

keyplyr

Msg#: 3992111 posted 7:27 pm on Sep 19, 2009 (gmt 0)

I see it crawling but didn't see any images being downloaded.

I didn't see images downloaded in this crawl either. But the total size of the second response roughly equals the page plus its images, so I am assuming these are snapshots.

...I block all images from being downloaded

LOL, really? Then what's the point of having images?

I block image downloads from remote servers and other off-site referrers (with some exceptions).
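For anyone curious, the usual pattern for that looks roughly like this (a sketch, assuming mod_rewrite, with example.com standing in for your own domain):

RewriteEngine On
# Let empty referrers through (direct requests, privacy proxies)
RewriteCond %{HTTP_REFERER} !^$
# Let requests referred from your own pages through
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
# Everything else asking for an image gets a 403
RewriteRule \.(gif|jpe?g|png)$ - [F,NC]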

caribguy

Msg#: 3992111 posted 7:34 pm on Sep 19, 2009 (gmt 0)

Spot the difference :)

www.example.com 65.55.106.nnn - - [19/Sep/2009:14:14:37 -0500] "GET /widget HTTP/1.1" 200 16511 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
www.example.com 65.55.106.nnn - - [19/Sep/2009:14:14:46 -0500] "GET /widget HTTP/1.0" 200 85960 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"

HTTP/1.1 vs HTTP/1.0: the former uses .gz.
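So the 16511-byte response is the gzipped page and the 85960-byte one is the same page uncompressed - roughly a 5:1 ratio, which is typical for HTML. With mod_deflate, the setup behind that is something like this (a sketch, assuming Apache 2.x):

# Compress HTML responses for clients that send Accept-Encoding: gzip
AddOutputFilterByType DEFLATE text/html

The HTTP/1.0 fetch apparently didn't advertise gzip support, so it got the full-size page. That alone could account for the "page plus images" sizes without any snapshots.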

keyplyr

Msg#: 3992111 posted 11:24 pm on Sep 19, 2009 (gmt 0)

caribguy - that's it. I do gzip HTML files. Care to suggest a reason why msnbot would do this?

incrediBILL

Msg#: 3992111 posted 12:48 am on Sep 20, 2009 (gmt 0)

LOL, really? Then what's the point of having images?

I let users download them, but not the SEs.

The SE image index is one huge image theft ring that people use without asking.

Worst case, I found a bunch of unscrupulous sites using Google Images to locate my thousands of images and hotlink them into their pages.

We mounted a massive assault on that nonsense: all hotlinks blocked, all images blocked from SE downloading.
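The SE half of that takes only a few lines of mod_rewrite (a sketch; the bot names are examples, not my full list):

RewriteEngine On
# Refuse image files to known SE image crawlers
RewriteCond %{HTTP_USER_AGENT} (Googlebot-Image|msnbot-media|Yahoo-MMCrawler) [NC]
RewriteRule \.(gif|jpe?g|png)$ - [F]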

FYI, the images I'm talking about in this case were my library of 40K+ site screen shots, so you can see why some wise guys thought I'd be a good source for a free ride.

HTTP/1.1 vs HTTP/1.0 the former uses .gz

Isn't that backwards? Shouldn't it be the HTTP/1.1 using .gz?

caribguy

Msg#: 3992111 posted 1:04 am on Sep 20, 2009 (gmt 0)

Yep, 1.1 uses .gz - "former" as in mentioned first...

I wouldn't even dare to wager a guess as to why M$ is doing this. To me it falls into the same category as the referrer spam discussed here before, or their attempts to grab images and truncated URLs with the WinHTTP user agent...

Very tempting to add yet another directive to my rewrite rules...
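If I did, it would probably look something like this (an untested sketch, keyed to the msnbot/2.0b UA and the HTTP/1.0 re-fetch in the logs above):

# Refuse msnbot/2.0b's uncompressed HTTP/1.0 re-fetches
RewriteCond %{HTTP_USER_AGENT} msnbot/2\.0b [NC]
RewriteCond %{THE_REQUEST} HTTP/1\.0$
RewriteRule .* - [F]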

tangor

Msg#: 3992111 posted 3:28 am on Sep 20, 2009 (gmt 0)

Made me look back through my database of this year's logs... The critter is there, but not in great numbers. Thanks for the heads-up... I will watch this for a month or so and see if any changes need to be made in .htaccess.

keyplyr

Msg#: 3992111 posted 6:44 am on Sep 20, 2009 (gmt 0)

I let users download them, but not the SEs.

I know what you meant, Bill; I was only joking.

I take a slightly different tack. I block all image requests from off-site origins, but I do allow the Big 3 (4?) SEs to put most image files in their image search libraries.

When SE users click on these thumbnailed images, my scripting displays the page of origin (my site) instead of the SE's page that hot-links to my image file. Thus my 10K images serve another function: increasing traffic.
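The rewrite half of that looks roughly like this (a sketch; /image-to-page.php is a hypothetical stand-in for my actual lookup script, and the referrer list is abbreviated):

RewriteEngine On
# Visitor arrived from an image-search results page...
RewriteCond %{HTTP_REFERER} (images\.google\.|images\.search\.yahoo\.|bing\.com/images) [NC]
# ...so show the page hosting the image instead of the raw file
# (/image-to-page.php is hypothetical; it maps an image back to its page)
RewriteRule ^images/(.+)$ /image-to-page.php?img=$1 [R=302,L]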
