dstiles - 10:13 pm on Nov 11, 2010 (gmt 0)
The following is adapted from some of my postings in the WebmasterWorld forum "Search Engine Spider and User Agent Identification" topic "Google Web Preview"...
Reading the GWMT Help page:
"Google updates the Instant Preview snapshot as part of our web crawling process. Google also uses the user-agent Google Web Preview (Mozilla/5.0..."
Does this mean that the original preview is created by googlebot or by preview bot?
I block the preview bot with a 403 so it must be googlebot (but see below).
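For anyone wanting to try the same thing, here is a minimal sketch of a user-agent block that answers with a 403. It assumes a Python WSGI application and matches only on the "Google Web Preview" substring quoted from the Help page above. The names `is_preview_bot`, `block_preview_bot` and `demo_app` are made up for illustration; this is not necessarily how my own block is set up.

```python
# Hypothetical sketch: refuse requests whose User-Agent contains the
# "Google Web Preview" token. Only the substring from the Help page is used.
PREVIEW_TOKEN = "Google Web Preview"


def is_preview_bot(user_agent):
    """Return True if the User-Agent header contains the preview token."""
    return PREVIEW_TOKEN.lower() in (user_agent or "").lower()


def block_preview_bot(app):
    """Wrap a WSGI app so that preview-bot requests get a 403 response."""
    def middleware(environ, start_response):
        if is_preview_bot(environ.get("HTTP_USER_AGENT", "")):
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Anything else is passed through to the real application.
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Stand-in application used only to exercise the wrapper."""
    body = b"ok"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Note that a block like this only catches requests that announce themselves with that token; it does nothing about fetches made under any other identifier.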
One of our client sites (at least) looks terrible: we've blocked furniture images from google in robots.txt (for most of our sites, in fact) so only the text is shown. I would not click on the page so doubtless the site will suffer.
On the other hand... one of our own sites blocks both img (furniture) and pics (topic photos) and the whole site is displayed IN FULL! So someone is disobeying robots.txt (and yes, it is correct!). (Just seen a second of our sites with full images including disallowed furniture.)
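To double-check the "yes, it is correct" part, a robots.txt can be tested offline with Python's standard `urllib.robotparser`. The sketch below uses a made-up robots.txt; the `/img/` and `/pics/` paths, the example.com URLs and the helper name `is_allowed` are assumptions for illustration, not our real files.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: furniture in /img/, topic photos in /pics/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /img/
Disallow: /pics/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/img/logo.png",
                "http://example.com/pics/chair.jpg",
                "http://example.com/index.html"):
        print(url, is_allowed(ROBOTS_TXT, "Googlebot", url))
```

This only confirms what the file asks for. It says nothing about whether a given crawler actually honours it, which is the point in dispute here.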
Since I have the preview bot blocked with a 403, they are either coming at the site through the punter's IP, sucking the content via googlebot, OR coming through an unrecognised IP using a "real" browser identifier.
The latter may be true. The punter option seems more likely EXCEPT I can't see any proxying of the punter's IP so if that is true they are also falsifying the source IP, which I can't see happening unless they have become really devious!
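One way to narrow down which of those options is happening is to list every IP and user agent that requested the disallowed directories. The sketch below assumes an Apache-style combined access log and hypothetical `/img/` and `/pics/` prefixes. The helper name `disallowed_hits` and the sample lines (documentation-range IPs, invented agents) are made up for illustration and are not real log data.

```python
import re
from collections import Counter

# Assumed format: Apache "combined" access log.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def disallowed_hits(lines, prefixes=("/img/", "/pics/")):
    """Count requests for disallowed paths, keyed by (ip, user agent)."""
    hits = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if match and match.group("path").startswith(prefixes):
            hits[(match.group("ip"), match.group("agent"))] += 1
    return hits


# Invented sample lines, for demonstration only.
SAMPLE = [
    '192.0.2.10 - - [11/Nov/2010:22:13:00 +0000] "GET /img/logo.png HTTP/1.1" 200 512 "-" "ExampleBrowser/1.0"',
    '192.0.2.10 - - [11/Nov/2010:22:13:01 +0000] "GET /index.html HTTP/1.1" 200 2048 "-" "ExampleBrowser/1.0"',
    '198.51.100.7 - - [11/Nov/2010:22:13:02 +0000] "GET /pics/chair.jpg HTTP/1.1" 200 4096 "-" "ExampleBot/2.0"',
]

if __name__ == "__main__":
    for (ip, agent), count in disallowed_hits(SAMPLE).items():
        print(ip, agent, count)
```

Checking the resulting IPs against known crawler ranges would then show whether the image fetches arrive under a bot identifier or a browser one.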
A client's site has pics in cache view but not in preview, EXCEPT that this site did not block pics until quite late (probably May 2009) and the images from before then ARE shown, even though this now breaks the recommendation in robots.txt. This is difficult to determine absolutely since pics on some pages are old and some new.
Another client site shows pics even though robots.txt says not to BUT only on some pages. These pics (AND furniture) have always been disallowed but again are in cache view (so google has been breaking robots.txt protocol for some time... never thought of that before regarding cache).
I'm guessing here that the missing pics are probably due to google not having scraped them yet.
Another client site we run has several iframes per page (not on all pages). This seems to have caused only minor problems for preview, which shows the full iframed page WITH contents for specific keywords (always the furniture, but not always the pics?). On the pages we've seen so far the furniture is shown but not the product pics, but again they may not have been scraped yet for preview.
I think the NOSNIPPET tag should be replaced by or supplemented with a NOTHEFT tag.