
Search Engine Spider and User Agent Identification Forum

This 66 message thread spans 3 pages (page 3 of 3).
Google Web Preview
Mokita




msg:4223020
 10:02 pm on Oct 27, 2010 (gmt 0)

Has this been mentioned here previously? I couldn't find anything in a search.

Found it crawling one of our sites last night - thought it odd, as it was coming from the 66.249.64.0/19 range normally used by googlebot.

The full UA is:

Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13
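
For anyone who wants to flag this agent in their own log scripts, here is a minimal Python sketch. It assumes the only marker worth matching is the "Google Web Preview" token inside the parenthesised comment of the UA quoted above. The function name is made up for illustration, and since a UA string is trivially spoofable, this only reports what the request claims to be, not who sent it.

```python
import re

# Matches the parenthesised comment sections of a User-Agent string.
_COMMENT_RE = re.compile(r"\(([^()]*)\)")

def claims_google_web_preview(user_agent: str) -> bool:
    """Return True if any UA comment carries the 'Google Web Preview' token.

    Hypothetical helper: it only reports what the UA string claims.
    """
    for comment in _COMMENT_RE.findall(user_agent):
        # Comment fields are separated by ';',
        # e.g. "KHTML, like Gecko; Google Web Preview".
        tokens = [t.strip().lower() for t in comment.split(";")]
        if "google web preview" in tokens:
            return True
    return False

# The UA string exactly as quoted in this post.
UA = ("Mozilla/5.0 (en-us) AppleWebKit/525.13 "
      "(KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13")
```

Calling claims_google_web_preview(UA) on the string above matches on the second comment section; an ordinary Googlebot UA would not match.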

Looking for information, I found this:

Google has been caught testing a major new layout to their search results: full page previews of the target site and pale blue backgrounds behind the search results when you hover over them.
...
One of the fascinating things about this is that they are highlighting certain sections of the page in orange and expanding the text to provide a snippet of information. This shows that they have the technology to know exactly where a piece of text is on every single web page. The snippets highlighted are not always the same as the snippet in the search results.


The obvious question raised by this is the effect it will have on click-through rates.

 

Samizdata




msg:4233836
 10:04 pm on Nov 22, 2010 (gmt 0)

A Google employee quoted in The Register:

We're working on a solution for this, to prevent Google Instant Preview on-demand fetches from executing Analytics JavaScript

So much for testing the beast before unleashing it on the world.

It would also have been sensible - not to mention polite - to warn webmasters in advance that they are expected to remove their robots.txt restrictions so that their images can be used.

As for offering a "nopreview" tag, they wouldn't want to copy Bing, would they?

...

Pfui




msg:4236133
 6:40 am on Nov 29, 2010 (gmt 0)

Related? Jury's still out on who-what-why --

What is this: url(data:image/png;base64
Strange coding in server logs.
[webmasterworld.com...]

enigma1




msg:4236884
 3:55 pm on Nov 30, 2010 (gmt 0)

I'd like to see referers from people actually clicking on/through the Preview

Why? I browse with the referrer hidden constantly, as I value my privacy, and I would recommend that others do the same.

As for the preview, the requests I see in my logs seemed fake and got kicked. Every single one of them. So far it looks like a scraping tool, at least from the headers, so it's not getting in.

Pfui




msg:4237046
 8:29 pm on Nov 30, 2010 (gmt 0)

@enigma1: Well, let me reply by reworking the old Who-What-When-Where-Why, etc., adage. If you opt to slog through this, you might want to grab a beverage first:)

HOW MUCH

1.) Stats-wise, I'd like to know if the extra bandwidth Google Web Preview requires is worth GWP hitting my sites multiple times a day, downloading darn near ALL of my files.

WHO-WHAT

2.) Even more importantly, I'd rather keep G's access -- and the accesses of those people/bots piggybacking on that access through translate, appEngine, etc. -- tighter rather than looser.

I do rDNS on the server and when it's just Googlebot coming from .googlebot.com, I know what's happening. All of a sudden, it's Google Web Preview coming from .googlebot.com AND a plethora of IPs, including --

74.125.156.82
Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13
robots.txt? NO

64.233.172.18
Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13
robots.txt? NO

Thus my Google-related .htaccess lines went from a brief bit, e.g. --

RewriteCond %{REMOTE_HOST} !\.googlebot\.com$

-- to the following mass, with a corresponding increase in server resource usage generally, and rewrite_log sizes in particular:

RewriteCond %{REMOTE_ADDR} !^64\.233\.
RewriteCond %{REMOTE_ADDR} !^66\.239\.
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteCond %{REMOTE_ADDR} !^72\.14\.
RewriteCond %{REMOTE_ADDR} !^74\.125\.
RewriteCond %{REMOTE_ADDR} !^209\.85\.
RewriteCond %{REMOTE_ADDR} !^216\.239\.
(...and who knows how many more to come)
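
The same whitelist logic can be checked offline against log data. A rough Python sketch, assuming each two-octet pattern above is equivalent to a /16 network. The prefixes are simply the ones listed in this post, not an official Google list, and outside_listed_ranges is a name invented here.

```python
import ipaddress

# The /16 networks implied by the RewriteCond patterns in the post.
# Assumption: these are one poster's observations, not an official list.
LISTED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "64.233.0.0/16", "66.239.0.0/16", "66.249.0.0/16", "72.14.0.0/16",
        "74.125.0.0/16", "209.85.0.0/16", "216.239.0.0/16",
    )
]

def outside_listed_ranges(remote_addr: str) -> bool:
    """True when the address matches none of the listed prefixes,
    i.e. when every negated RewriteCond in the chain would pass."""
    addr = ipaddress.ip_address(remote_addr)
    return not any(addr in net for net in LISTED_NETWORKS)
```

Running log IPs through this before touching .htaccess shows which hits the rule chain would catch, without adding to rewrite_log on the live server.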

So the G-Game changes: Want Google to show a snippet? Gotta let GWP in, whole hog. Wanna reverse-verify GWP -- ditto Googlebot -- is from Google? Not gonna happen:

Google doesn't post a public list of IP addresses for webmasters to whitelist. This is because these IP address ranges can change, causing problems for any webmasters who have hard coded them. The best way to identify accesses by Googlebot is to use the user-agent (Googlebot).
Source: Verifying Googlebot [google.com]
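
For hits that do resolve to a hostname, the verification that page describes, as I understand it, is a reverse DNS lookup followed by a forward lookup that has to return the original IP. A minimal Python sketch of that forward-confirmed check follows. The function name, the accepted suffixes, and the injectable resolver arguments (there so it can be tried without live DNS) are all assumptions for illustration.

```python
import socket

# Assumption for this sketch: hostnames under these domains are acceptable.
ACCEPTED_SUFFIXES = (".googlebot.com", ".google.com")

def forward_confirmed(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Return the verified hostname for ip, or None if verification fails."""
    try:
        hostname = reverse(ip)[0]          # PTR lookup: IP -> hostname
    except OSError:
        return None
    if not hostname.lower().endswith(ACCEPTED_SUFFIXES):
        return None
    try:
        addresses = forward(hostname)[2]   # forward lookup: hostname -> IPs
    except OSError:
        return None
    # Only trust the hostname if it maps back to the connecting address.
    return hostname if ip in addresses else None
```

An IP with no PTR record at all simply comes back as None, which is the case the bare-IP hits above fall into.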

WHY

3.) I don't like giving Googlebot (read: Google, Inc.) access to image and other no-bots-allowed directories just because GWP must see everything. I'm with Samizdata: GWP... "exists to bypass the robots.txt restrictions placed on Googlebot." [webmasterworld.com...]

Looks to me like removing Disallows in robots.txt to satisfactorily 'open' things up to GWP/Googlebot means we now exclude only via GWT's URL removal -- assuming the URLs meet Google's approval, that is. [google.com...] In short: Chase your horses after Google errantly lets 'em out of the barn, folks.

So ANYway (sorry you asked enigma1?:) --

For all of those reasons and more, I'd like to know if giving GWP (& thus Googlebot) the keys to the kingdom is worth it, via Google adding a GWP-specific referer tag, something. Yep, I know: Dream on.

WHEN

4.) In the last 10 days or so, I've spent hours and hours of unrecompensed time giving the keys to Google in .htaccess and robots.txt because if I don't, my sites appear below scads of sites ranging from alexa and robtex to Chinese-based spam sites, pretty much any sites that include my URLs in their Titles.

I confess.

I've sold out to G for the price of snippets written to their specs, preview images (poorly) made and altered by their programs, urchin stats collected by their code, and rankings determined by their algos. I let G assimilate me. For free.

Samizdata




msg:4237108
 10:56 pm on Nov 30, 2010 (gmt 0)

I've sold out

I'd say you made a pragmatic decision in the face of an ugly reality.

Google made you an offer you couldn't refuse.

...

Pfui




msg:4241465
 11:55 pm on Dec 11, 2010 (gmt 0)

After more pragmatic (heh:) head-banging, two startling updates about:

1.) Google Web Preview (GWP)

To date, this UA continues to hit pages/files/etc., and without referers. Then today, exactly 45 minutes -- to the second -- after hitting one page and its files, GWP had a ref, a FAKE one:

72.14.194.17
Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13
12/11 /favicon.ico

robots.txt? NO
Fake Ref? YES: http://www.freewebsitereport.org/www.mydomainhere.com

What the--? Anybody see refs of any kind? Or the same kind?
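
If you want to check your own logs for the same thing, here is a rough Python sketch. It assumes the Apache "combined" log format (adjust the pattern if your LogFormat differs), and the function name is invented for this example. It just pulls out Google Web Preview hits that arrived with any referer at all.

```python
import re

# Apache "combined" log format (an assumption; adjust to your LogFormat).
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def preview_hits_with_referer(lines):
    """Yield (ip, request, referer) for Google Web Preview hits
    that sent a non-empty referer."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or "Google Web Preview" not in m["agent"]:
            continue
        # A referer of "-" (or empty) means none was sent.
        if m["referer"] not in ("", "-"):
            yield m["ip"], m["request"], m["referer"]
```

Feed it an open log file, e.g. preview_hits_with_referer(open(path)), where path is wherever your access log lives.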

Also, on what I presume must be a related front --

2.) Google Webmaster Tools (GWT)

If it's been a while since you checked your site(s) via Google's "site:yourdomainhere.com" feature (alt: "site:yourdomainhere.com yourdomainhere"), DO. IT. NOW.

This morning I discovered literally thousands of thou-shalt-never-crawl files in a score of thou-shalt-never-crawl directories wide-open in the SERPs despite being in robots.txt for years, and despite blocking Googlebot and numerous Google UAs, and googlebot.com, and google.com via dir-level .htaccess for years, just in case. That leaves only one UA from bare Google IPs...

The first domain I checked was a big one. Then I checked a much smaller one and again found loads of robots.txt-disallowed files and dirs. (I'll tackle other domains when my eyes aren't crossed from cross-checking to and from too many windows and tabs.)

It took me over an hour to systematically use GWT's URL Removal Tool for the www. form of the domains. Here's hoping I don't have to duplicate my efforts for the non-www version of sites. Here's hoping you don't find the mess I did!

