I see it crawling but didn't see any images being downloaded.
Probably because I block all images from being downloaded ;)
Actually, they could be combining the actions of the normal crawler and the image crawler, but if you're seeing it download a page and then, back-to-back, download the same page again with all the images, it sounds like possible screen shots.
|I see it crawling but didn't see any images being downloaded. |
I didn't see images downloaded in this crawl. But the total size of the second request would equal page plus images, so I am assuming these are snapshots.
|...I block all images from being downloaded |
LOL, really? Then what's the point of having images?
I block image downloads from remote servers and other off-site referrers (w/ some exceptions)
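For anyone wanting to try the same idea, here is a minimal sketch of a referrer check for image requests. It is not the poster's actual setup: the host names, the exception list, and the choice to let empty referrers through are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical values: substitute your own site hosts and allowed exceptions.
OWN_HOSTS = {"www.example.com", "example.com"}
ALLOWED_REMOTE_HOSTS = {"partner.example.net"}


def allow_image_request(referer, allow_empty=True):
    """Decide whether an image request should be served, based on its Referer.

    An empty or "-" referrer is allowed by default (an assumption), since
    many browsers and proxies strip the header on legitimate requests.
    """
    if not referer or referer == "-":
        return allow_empty
    # hostname is None for values that are not parseable as absolute URLs.
    host = (urlparse(referer).hostname or "").lower()
    return host in OWN_HOSTS or host in ALLOWED_REMOTE_HOSTS


print(allow_image_request("http://www.example.com/widget"))
print(allow_image_request("http://hotlinker.example.org/page"))
```

In practice this kind of rule usually lives in the server's rewrite configuration rather than in application code; the function only shows the decision being made.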
Spot the difference :)
www.example.com 65.55.106.nnn - - [19/Sep/2009:14:14:37 -0500] "GET /widget HTTP/1.1" 200 16511 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
www.example.com 65.55.106.nnn - - [19/Sep/2009:14:14:46 -0500] "GET /widget HTTP/1.0" 200 85960 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
HTTP/1.1 vs HTTP/1.0 the former uses .gz
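To spot this pattern in your own logs, something like the rough sketch below might help. It assumes the vhost-prefixed combined log format of the two lines above; the function names are made up for illustration, and it only flags the same client re-requesting the same path under a different protocol version.

```python
import re

# Assumed format: vhost-prefixed combined log, as in the two sample lines.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


def find_repeat_fetches(lines):
    """Collect (earlier, later) pairs where the same client and user agent
    requested the same path again with a different protocol version."""
    last_seen = {}
    pairs = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        key = (entry["ip"], entry["path"], entry["agent"])
        previous = last_seen.get(key)
        if previous is not None and previous["protocol"] != entry["protocol"]:
            pairs.append((previous, entry))
        last_seen[key] = entry
    return pairs


SAMPLE = [
    'www.example.com 65.55.106.nnn - - [19/Sep/2009:14:14:37 -0500] "GET /widget HTTP/1.1" 200 16511 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
    'www.example.com 65.55.106.nnn - - [19/Sep/2009:14:14:46 -0500] "GET /widget HTTP/1.0" 200 85960 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
]

pairs = find_repeat_fetches(SAMPLE)
for first, second in pairs:
    ratio = first["size"] / second["size"]
    print(first["path"], first["protocol"], first["size"],
          second["protocol"], second["size"], f"ratio={ratio:.2f}")
```

For the two sample lines, the HTTP/1.1 response is about 19% of the size of the HTTP/1.0 one, which is in the range you would expect if the only difference is gzip compression of the same HTML.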
caribguy - that's it. I do gzip html files. Care to suggest a reason why msnbot would do this?
|LOL, really? Then what's the point of having images? |
I let users download them, but not the SEs.
The SE image index is one huge image theft ring that people use without asking.
Worst case, I found a bunch of unscrupulous sites using Google images to locate my thousands of images and hotlink them into their pages.
We had a massive assault on that nonsense, all hotlinks blocked, all images blocked from SEs downloading.
FYI, the images I'm talking about in this case were my library of 40K+ site screen shots, so you can see why some wise guys thought I'd be a good source for a free ride.
|HTTP/1.1 vs HTTP/1.0 the former uses .gz |
Isn't that backwards? Shouldn't it be the HTTP/1.1 using .gz?
Yep 1.1 uses .gz - former, as in mentioned first...
I wouldn't dare to even wager a guess on why M$ is doing this. To me it falls in the same category as the referrer spam discussed here before, or their attempts to grab images and truncated urls with the WinHTTP user agent...
Very tempting to add yet another directive to my rewrite rules...
Made me look back through my database of this year's logs... The critter is there, but not in great numbers. But thanks for the heads up... I will watch this for a month or so and see if any changes need to be made in .htaccess.
|I let users download them, but not the SEs. |
I know what you meant Bill, I was only joking.
I take a slightly different tactic. I block all image requests from off-site origins, but I do allow the Big 3 (4?) SEs to put most image files in their image search libraries.
When SE users click on these thumbnailed images, instead of the SE's page that hot-links to my image file, my scripting displays the page of origin (my site). Thus my 10k images serve another function: increasing traffic.
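One way this kind of redirect could work is sketched below. It is a guess at the general approach, not the actual scripting described above: the search engine host list, the image-to-page mapping, and the function name are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical values for illustration only.
SEARCH_ENGINE_DOMAINS = ("google.com", "bing.com", "yahoo.com")
IMAGE_TO_PAGE = {"/images/widget.jpg": "/widget"}


def handle_image_request(path, referer):
    """Return ("redirect", page) when an image is requested from a search
    engine results page and its page of origin is known; otherwise return
    ("serve", path)."""
    host = (urlparse(referer or "").hostname or "").lower()
    from_search_engine = any(
        host == domain or host.endswith("." + domain)
        for domain in SEARCH_ENGINE_DOMAINS
    )
    if from_search_engine and path in IMAGE_TO_PAGE:
        return ("redirect", IMAGE_TO_PAGE[path])
    return ("serve", path)


print(handle_image_request("/images/widget.jpg",
                           "http://images.google.com/imgres?q=widget"))
print(handle_image_request("/images/widget.jpg",
                           "http://www.example.com/widget"))
```

The sketch only covers the search engine case; blocking other off-site origins would be a separate check on the same referrer.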