
Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
msnbot/2 snapshots?
keyplyr




msg:3992113
 10:16 am on Sep 19, 2009 (gmt 0)


Verified rDNS, normal crawl range.

It hit each page on a 150-page site twice in succession, with response code 200 for every request. The first time, the file size is normal. The second time, the size looks like that of the HTML plus the images. So I'm assuming msnbot/2 is taking snapshots. Can anyone else verify this?
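
For anyone who wants to repeat the rDNS check mentioned above, here is a minimal sketch of a forward-confirmed reverse DNS lookup. The function name `verify_crawler_ip`, the `.search.msn.com` suffix, and the injectable lookup parameters are assumptions made for illustration, not something stated in this thread.

```python
import socket

def verify_crawler_ip(ip, allowed_suffixes=(".search.msn.com",),
                      reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS: IP -> hostname -> IPs, which must include IP.

    allowed_suffixes is an assumption; adjust it to the crawler being checked.
    The lookup callables can be swapped out for testing without network access.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])

    try:
        host = reverse_lookup(ip)
    except OSError:  # covers socket.herror and socket.gaierror
        return False

    # The PTR name must belong to the crawler's domain.
    if not host.lower().rstrip(".").endswith(tuple(allowed_suffixes)):
        return False

    # The hostname must resolve back to the same IP, otherwise the PTR is spoofable.
    try:
        return ip in forward_lookup(host)
    except OSError:
        return False
```

Calling `verify_crawler_ip("65.55.106.1")` with the defaults performs live DNS queries, so the result depends on the resolver at the time it is run.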

 

incrediBILL




msg:3992164
 2:54 pm on Sep 19, 2009 (gmt 0)

I see it crawling but didn't see any images being downloaded.

Probably because I block all images from being downloaded ;)

Actually, they could be combining the actions of the normal crawler and the image crawler, but if you're seeing it download a page and then immediately download the same page again with all the images, it sounds like possible screen shots.

keyplyr




msg:3992233
 7:27 pm on Sep 19, 2009 (gmt 0)

incrediBILL wrote: "I see it crawling but didn't see any images being downloaded."

I didn't see images downloaded in this crawl. But the total size of the second request would equal page plus images, so I am assuming these are snapshots.

incrediBILL wrote: "...I block all images from being downloaded"

LOL, really? Then what's the point of having images?

I block image downloads from remote servers and other off-site referrers (w/ some exceptions)

caribguy




msg:3992234
 7:34 pm on Sep 19, 2009 (gmt 0)

Spot the difference :)

www.example.com 65.55.106.nnn - - [19/Sep/2009:14:14:37 -0500] "GET /widget HTTP/1.1" 200 16511 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
www.example.com 65.55.106.nnn - - [19/Sep/2009:14:14:46 -0500] "GET /widget HTTP/1.0" 200 85960 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"

HTTP/1.1 vs HTTP/1.0 the former uses .gz
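
To spot the same pattern across a whole log, a rough sketch like the one below groups response sizes by path and protocol version. The regular expression assumes the exact layout of the two lines above (virtual host first, then a combined-format entry), and `sizes_by_protocol` is a made-up helper name.

```python
import re
from collections import defaultdict

# Matches the layout of the two sample lines: vhost, IP, two ident fields,
# timestamp, request line, status, size, referrer, user agent.
LOG_LINE = re.compile(
    r'(?P<vhost>\S+) (?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>HTTP/\d\.\d)" '
    r'(?P<status>\d{3}) (?P<size>\d+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def sizes_by_protocol(lines, agent_substring="msnbot/2"):
    """Return {path: {protocol: bytes_sent}} for requests from the given agent."""
    result = defaultdict(dict)
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or agent_substring not in m["agent"]:
            continue
        result[m["path"]][m["proto"]] = int(m["size"])
    return dict(result)

sample = [
    'www.example.com 65.55.106.nnn - - [19/Sep/2009:14:14:37 -0500] '
    '"GET /widget HTTP/1.1" 200 16511 "-" '
    '"msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
    'www.example.com 65.55.106.nnn - - [19/Sep/2009:14:14:46 -0500] '
    '"GET /widget HTTP/1.0" 200 85960 "-" '
    '"msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
]

for path, sizes in sizes_by_protocol(sample).items():
    print(path, sizes)
```

Any path that shows two protocol versions with very different sizes is a candidate for the compressed versus uncompressed pair seen here.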

keyplyr




msg:3992280
 11:24 pm on Sep 19, 2009 (gmt 0)

caribguy - that's it. I do gzip html files. Care to suggest a reason why msnbot would do this?
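
In the log lines quoted earlier, the HTTP/1.0 response (85960 bytes) is about 5.2 times the size of the HTTP/1.1 response (16511 bytes), which is a plausible ratio for gzipped HTML. The sketch below shows how one page can produce two sizes. It assumes the server decides from the `Accept-Encoding` request header; the thread does not say whether the header or a server setting tied to the HTTP version caused the difference. `body_for` and the sample page are hypothetical.

```python
import gzip

def body_for(accept_encoding, html):
    """Return (body, content_encoding) roughly as a gzip-enabled server might.

    Simplification: q-values such as "gzip;q=0" are ignored.
    """
    encodings = [token.split(";")[0].strip().lower()
                 for token in accept_encoding.split(",")]
    if "gzip" in encodings:
        return gzip.compress(html), "gzip"
    return html, "identity"

# Hypothetical page with repetitive markup, which compresses well.
html = b"<html><body>" + b"<p>widget widget widget</p>\n" * 2000 + b"</body></html>"

compressed, _ = body_for("gzip, deflate", html)
plain, _ = body_for("", html)
print(len(compressed), len(plain))
```

The first number printed is the size a log would record for the compressed response, the second the size for the uncompressed one.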

incrediBILL




msg:3992303
 12:48 am on Sep 20, 2009 (gmt 0)

keyplyr wrote: "LOL, really? Then what's the point of having images?"

I let users download them, but not the SEs.

The SE image index is one huge image theft ring that people use without asking.

Worst case, I found a bunch of unscrupulous sites using Google images to locate my thousands of images and hotlink them into their pages.

We had a massive assault on that nonsense, all hotlinks blocked, all images blocked from SEs downloading.

FYI, the images I'm talking about in this case were my library of 40K+ site screen shots, so you can see why some wise guys thought I'd be a good source for a free ride.

caribguy wrote: "HTTP/1.1 vs HTTP/1.0 the former uses .gz"

Isn't that backwards? Shouldn't it be the HTTP/1.1 using .gz?

caribguy




msg:3992304
 1:04 am on Sep 20, 2009 (gmt 0)

Yep 1.1 uses .gz - former, as in mentioned first...

I wouldn't dare to even wager a guess on why M$ is doing this. To me it falls in the same category as the referrer spam discussed here before, or their attempts to grab images and truncated urls with the WinHTTP user agent...

Very tempting to add yet another directive to my rewrite rules...

tangor




msg:3992334
 3:28 am on Sep 20, 2009 (gmt 0)

Made me look back through my database of this year's logs... The critter is there, but not in great numbers. But thanks for the heads up... I will watch this for a month or so and see if any changes need to be made in .htaccess

keyplyr




msg:3992366
 6:44 am on Sep 20, 2009 (gmt 0)

incrediBILL wrote: "I let users download them, but not the SEs."

I know what you meant Bill, I was only joking.

I take a slightly different tactic. I block all image requests from off-site origins, but I do allow the Big 3 (4?) SEs to put most image files in their image search libraries.

When the SE users click on these thumbnailed images, instead of the SE's page that hot-links to my image file, my scripting displays the page of origin (my site). Thus my 10k images serve another function: increasing traffic.
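
The referrer-based approach described above could be sketched roughly as follows. This is not the actual scripting used on the site, which is not shown in the thread. The host names, the action labels, and the choice to serve requests with an empty referrer are all assumptions for illustration.

```python
from urllib.parse import urlparse

OWN_HOSTS = {"www.example.com", "example.com"}   # hypothetical site hosts
SEARCH_HOSTS = {"images.search.example"}         # hypothetical image-search host

def image_request_action(referer, own_hosts=OWN_HOSTS, search_hosts=SEARCH_HOSTS):
    """Decide what to do with an image request based on its Referer header."""
    if not referer or referer == "-":
        # Assumption: blank referrers (direct requests, privacy tools) are served.
        return "serve"
    host = (urlparse(referer).hostname or "").lower()
    if host in own_hosts:
        return "serve"
    if host in search_hosts:
        # Send the visitor to the page the image belongs to instead of the file.
        return "redirect_to_origin_page"
    return "block"
```

For example, `image_request_action("http://www.example.com/gallery")` returns `"serve"`, while a referrer on an unrelated host returns `"block"`.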


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved