@enigma1: Well, let me reply by reworking the old Who-What-When-Where-Why, etc., adage. If you opt to slog through this, you might want to grab a beverage first:)
HOW MUCH 1.) Stats-wise, I'd like to know whether whatever Google Web Preview delivers is worth the extra bandwidth of GWP hitting my sites multiple times a day and downloading darn near ALL of my files.
WHO-WHAT 2.) Even more importantly, I'd rather keep G's access -- and the access of the people/bots piggybacking on it through Translate, App Engine, etc. -- tighter rather than looser.
I do rDNS on the server, and when it's just Googlebot coming from .googlebot.com, I know what's happening. All of a sudden, it's Google Web Preview coming from .googlebot.com AND from a plethora of other IPs, including --
74.125.156.82
Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13
robots.txt? NO
64.233.172.18
Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13
robots.txt? NO
Thus my Google-related .htaccess lines went from a brief bit, e.g. --
RewriteCond %{REMOTE_HOST} !\.googlebot\.com$
-- to the following mass, with a corresponding increase in server resource usage generally, and rewrite_log sizes in particular:
RewriteCond %{REMOTE_ADDR} !^64\.233\.
RewriteCond %{REMOTE_ADDR} !^66\.239\.
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteCond %{REMOTE_ADDR} !^72\.14\.
RewriteCond %{REMOTE_ADDR} !^74\.125\.
RewriteCond %{REMOTE_ADDR} !^209\.85\.
RewriteCond %{REMOTE_ADDR} !^216\.239\.
(...and who knows how many more to come)
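If anyone else is stuck maintaining that list, a consolidated sketch -- same prefixes, one alternation -- at least trims the rewrite_log churn (assuming those first two octets are still current; adjust as G's ranges drift):
RewriteCond %{REMOTE_ADDR} !^(64\.233|66\.239|66\.249|72\.14|74\.125|209\.85|216\.239)\.
Same holes, but one condition to evaluate and log per request instead of seven.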
So the G-Game changes: Want Google to show a snippet? Gotta let GWP in, whole hog. Wanna reverse-verify that GWP -- ditto Googlebot -- is really from Google? Not gonna happen:
Google doesn't post a public list of IP addresses for webmasters to whitelist. This is because these IP address ranges can change, causing problems for any webmasters who have hard coded them. The best way to identify accesses by Googlebot is to use the user-agent (Googlebot). Source:
Verifying Googlebot [google.com]
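FWIW, Apache can run that reverse-then-forward check itself: give mod_authz_host a domain-based Allow and it does a double DNS lookup on the client -- reverse on the IP, then forward on the resulting name, and both have to agree. A sketch for a protected directory's .htaccess (Apache 2.2 syntax, assuming AllowOverride Limit):
Order Deny,Allow
Deny from all
Allow from .googlebot.com
Which is tidy right up until GWP arrives from an IP whose rDNS isn't *.googlebot.com -- then the forward-confirmed check correctly says "not Googlebot" and the preview fetch bounces.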
WHY 3.) I don't like giving Googlebot (read: Google, Inc.) access to image and other no-bots-allowed directories just because GWP has to see everything. I'm with Samizdata: GWP... "exists to bypass the robots.txt restrictions placed on Googlebot." [
webmasterworld.com...]
Looks to me like removing Disallows in robots.txt to 'open' things up satisfactorily to GWP/Googlebot means we now exclude only via GWT's URL removal -- assuming the URLs meet Google's approval, that is. [
google.com...] In short: Chase your horses after Google errantly lets 'em out of the barn, folks.
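To make 3.) concrete: these are the kind of Disallows that have to come out of robots.txt before previews render fully -- hypothetical paths, but you get the idea:
User-agent: Googlebot
Disallow: /images/
Disallow: /css/
Delete them and it isn't just GWP you've invited into those directories; Googlebot proper walks in too, and the only after-the-fact remedy is that GWT URL-removal tool.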
So ANYway (sorry you asked, enigma1? :) --
For all of those reasons and more, I'd like a real way to decide whether giving GWP (& thus Googlebot) the keys to the kingdom is worth it -- say, Google adding a GWP-specific referer or header tag, something. Yep, I know: Dream on.
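Until then, the only GWP-specific handle we do have is the user-agent string -- trivially spoofable, so no substitute for real verification, but enough for carving out GWP-only rules. A sketch, assuming you'd rather 403 GWP out of a couple of directories than open them up (and accepting that the preview image will suffer for it):
RewriteCond %{HTTP_USER_AGENT} "Google Web Preview"
RewriteRule ^(images|css)/ - [F]
(Hypothetical directory names; swap in whatever you're actually trying to keep bots out of.)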
WHEN 4.) In the last 10 days or so, I've spent hours and hours of unrecompensed time handing Google the keys in .htaccess and robots.txt, because if I don't, my sites appear below scads of sites ranging from alexa and robtex to Chinese-based spam sites -- pretty much any site that includes my URLs in its Title.
I confess.
I've sold out to G for the price of snippets written to their specs, preview images (poorly) made and altered by their programs, urchin stats collected by their code, and rankings determined by their algos. I let G assimilate me. For free.