Forum Moderators: open
I have only seen it come from this single IP, 193.47.80.77 (I think we're allowed to post the IP of a well-known public search engine; if not, bill can fix it).
I have always seen this one come through a squid proxy server as well.
First sighting was on 8/25/2007; the most recent was 1/5/2008.
[edited by: Ocean10000 at 6:43 pm (utc) on April 13, 2008]
I have only seen it try pages that have a lot of incoming links, usually from ODP.
See, that's where the rub is: assuming Exabot has imported or referenced the ODP, they may just be making screen shots for those directory entries, which technically didn't violate your crawl restrictions in robots.txt since it wasn't a crawl.
Don't think I'm condoning this practice, because these screen shot tools are making me insane with the amount of bandwidth they waste. Snap, Ask, and a whole bunch more are doing this, and doing it poorly:
A. They crawl and cache the page, yet pull the same page down again to make the screen shot.
B. They don't cache the images for your site, as they appear to clear the screen shot tool's entire cache with every access.
C. They don't just make a screen shot of the pages other sites link to, such as the home page; they try to screen shot every page of your site.
D. The new screen shot apps like SearchMe (which is cool) make such big screen shots that they are totally legible right in the search results, so people don't even need to visit the site, and your site income suddenly vaporizes.
Let's do some math here:
Assume a site like mine has 100K pages minimum and each page is 10K minimum with at least 1 CSS page (@ 5k), 1 javascript page (@ 5k) and 10 images per page averaging 3.5K each.
You normally assume the CSS, javascript and images, amounting to 45K total, will be cached on the first access.
OK, so now they crawl the HTML page once and then download the same page yet again to make a screen shot (including all support files), for a total of 65K for a single page.
Now do that to 100K pages and your bandwidth is 6.5GB just for one search engine to fully crawl and screen shot your entire site.
Now imagine if 10 search engines all decided to make screen shots... 65GB
Now imagine your site has more than 100K pages...
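The math above can be sketched as a quick back-of-the-envelope script; all of the sizes and counts below are just the assumptions from the post (10K HTML, 45K of support files, 100K pages, 10 hypothetical engines), not measurements:

```python
# Rough bandwidth estimate for a crawl + uncached screenshot pass.
HTML_KB = 10            # one HTML page (assumed average)
CSS_KB = 5              # one stylesheet
JS_KB = 5               # one script
IMAGES_KB = 10 * 3.5    # ten images at 3.5K each
SUPPORT_KB = CSS_KB + JS_KB + IMAGES_KB   # 45K of support files

PAGES = 100_000         # assumed site size
ENGINES = 10            # hypothetical number of screenshot engines

# The crawl fetches the HTML once; the screenshot pass fetches the
# HTML again plus every support file (nothing cached between passes).
per_page_kb = HTML_KB + (HTML_KB + SUPPORT_KB)   # 65K per page
total_gb = per_page_kb * PAGES / 1_000_000       # KB -> GB

print(per_page_kb)          # 65.0 K per page
print(total_gb)             # 6.5 GB for one engine
print(total_gb * ENGINES)   # 65.0 GB if ten engines do it
```

If the screenshot tools simply honored caching, the second fetch would drop from 55K to near zero and the whole exercise would cost a fraction of this.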
I'm quickly starting to see a potential future where the servers talking to each other use more bandwidth than all the humans using the internet combined!
Probably already a fact.