
Search Engine Spider and User Agent Identification Forum

This 66-message thread spans 3 pages.
Google Web Preview
Mokita




msg:4223020
 10:02 pm on Oct 27, 2010 (gmt 0)

Has this been mentioned here previously? I couldn't find anything in a search.

Found it crawling one of our sites last night - thought it odd, as it was coming from the 66.249.64.0/19 range normally used by googlebot.

The full UA is:

Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13
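For anyone who wants to flag these hits in their own logs, here is a minimal Python sketch of the two checks implied above: does the UA carry the "Google Web Preview" token, and does the IP fall inside 66.249.64.0/19? The function names are made up for illustration, and the /19 is only the range mentioned in this post, not a complete list of Google addresses.

```python
import ipaddress

# Range quoted in the post above; an assumption, not an official or complete list.
GOOGLEBOT_RANGE = ipaddress.ip_network("66.249.64.0/19")
PREVIEW_TOKEN = "Google Web Preview"


def is_preview_ua(user_agent: str) -> bool:
    """True if the UA string contains the preview token (case-insensitive)."""
    return PREVIEW_TOKEN.lower() in user_agent.lower()


def in_googlebot_range(ip: str) -> bool:
    """True if the address sits inside the range quoted above."""
    return ipaddress.ip_address(ip) in GOOGLEBOT_RANGE


ua = ("Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; "
      "Google Web Preview) Version/3.1 Safari/525.13")
print(is_preview_ua(ua), in_googlebot_range("66.249.82.66"))
```

A UA match alone proves little, since anyone can send that string, so the IP check is only a first filter.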

Looking for information, I found this:

Google has been caught testing a major new layout to their search results: full page previews of the target site and pale blue backgrounds behind the search results when you hover over them.
...
One of the fascinating things about this is that they are highlighting certain sections of the page in orange and expanding the text to provide a snippet of information. This shows that they have the technology to know exactly where a piece of text is on every single web page. The snippets highlighted are not always the same as the snippet in the search results.


The obvious question raised by this is the effect it will have on click-through rates.

 

Pfui




msg:4223131
 3:43 am on Oct 28, 2010 (gmt 0)

Another issue has to do with copyrights. And, of course, NOINDEX tags, in addition to regular access/indexing controls. (sighs) My jury's still out on the benefits.

The screenshot on the site you cite -- [webfreelancer.net.au...] -- shows the preview is a heckuva lot larger than the average thumbnail. See the original example larger here: [blogstorm.co.uk...]

Mokita




msg:4223140
 4:08 am on Oct 28, 2010 (gmt 0)

Pfui wrote:
Another issue has to do with copyrights. And, of course, NOINDEX tags, in addition to regular access/indexing controls. (sighs)


No evidence here (as yet) that it violates NOINDEX tags. Copyright seems to be a non-issue where major search engines are concerned <grr>.

My jury's still out on the benefits.


Benefits or not - looks like we will have to wear it, coming from the all-powerful Google machine :-/

The screenshot on the site you cite -- [webfreelancer.net.au...] -- shows the preview is a heckuva lot larger than the average thumbnail. See the original example larger here: [blogstorm.co.uk...]


Hmmn, I thought it was against the Webmasterworld TOS to post links like those, otherwise I would have posted them myself.

Pfui




msg:4223266
 8:54 am on Oct 28, 2010 (gmt 0)

I'm not sure how strictly other forums here handle the link guidelines. In this one, I've found that links to very specific observations, like those two sites with details and a screenshot, tend to pass muster. (As opposed to more personal diary-like blog musings.) And particularly when there's precious little info elsewhere.

So about Google's new, cloaked bot running from bare Googlebot IPs -- how did it behave, please?

- Did it read/heed robots.txt? Or did it appear to 'share' prior Googlebot robots.txt hits?
- Did it hit all kinds of files, launch JS, etc.? (If you have numerous filetypes allowed in robots.txt. I don't.)
- Did it crawl in a typical Googlebot pattern/rate?
Etc.

TIA for more Tales from the Server-Side:)

dstiles




msg:4223587
 8:53 pm on Oct 28, 2010 (gmt 0)

The only hits I've seen so far occur with IPs that have been banned for running feedfetcher & translate on empty-DNS IPs.

First occurrence: 26th October

Mokita




msg:4223617
 10:33 pm on Oct 28, 2010 (gmt 0)

Did it read/heed robots.txt? Or did it appear to 'share' prior Googlebot robots.txt hits?


Didn't ask for robots.txt. Can't tell if it heeds it, as it only took files that are normally allowed to human visitors. I have very few that are disallowed. But it certainly wasn't heeding the Disallows for search engine bots, as I don't allow them to index images, scripts or CSS.

Did it hit all kinds of files, launch JS, etc.?


Yes, took all supporting files - images, CSS and JS. Can't tell if it launches JS - I only use it for tabbed content.

Did it crawl in a typical Googlebot pattern/rate?


No.

It has visited three of the sites I control (a minority). In all cases the pages were "deep". Didn't take home page or first level (category) pages.

First visit was 22 Oct, followed by 26 Oct (twice, 17 hours apart) for that site. IP was 64.233.172.n but oddly it fetched a few images using 74.125.75.n.

For the site mentioned in my OP, it has visited twice, on 24 and 28 Oct from 66.249.82.nn only.

On the last site, the behaviour was different again. It requested three closely related pages and their images in the same second on 26 Oct from 74.125.74.nnn. It returned on 28 Oct and crawled only one page from 74.125.152.nn.

misterjinx




msg:4228784
 12:36 pm on Nov 10, 2010 (gmt 0)

This is the Google Instant Previews web crawler

Pfui




msg:4228900
 6:00 pm on Nov 10, 2010 (gmt 0)

About the UA Mokita spotted --

Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13

-- the Help page was updated yesterday (editors must have been sleepy):
>>
Instant Previews are page snapshots that are activated by clicking on the magnifying glass icon [img]; they allow users to get a glimpse of the layout of the web pages behind each search result, in order to help them decide whether or not to click a link.

Instant Preview[sic] are extremely useful to users and can help them decide whether or not to click on your site in the search results. You can, however, specify that Google should not display Instant Preview for your page, in which case neither the text snippet nor the preview will appear. ...
<<
Source: Google's Webmaster Tools Help [google.com...]

Related (also updated yesterday): Removing snippets and Instant Preview [google.com...]

So if you want a snippet, you're stuck with allowing yet another Google UA, and a screen shot 'borrowed' full-time for on-demand.
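As far as I can tell, the opt-out the Help page describes is the nosnippet robots meta directive. Below is a rough Python sketch for checking whether a page already carries it. The class and function names are invented for illustration, and it only looks at meta tags named "robots" or "googlebot".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when the attribute has no value.
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for item in attr.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def has_nosnippet(html: str) -> bool:
    """True if any robots/googlebot meta tag on the page lists nosnippet."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "nosnippet" in parser.directives


page = ('<html><head><meta name="googlebot" content="noarchive, nosnippet">'
        '</head><body></body></html>')
print(has_nosnippet(page))
```

Running this over a site's templates is a quick way to see which pages would lose both the snippet and the preview.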

dstiles




msg:4228943
 9:03 pm on Nov 10, 2010 (gmt 0)

Reading the GWMT Help page:

"Google updates the Instant Preview snapshot as part of our web crawling process. Google also uses the user-agent Google Web Preview (Mozilla/5.0..."

Does this mean that the original preview is created by googlebot or by preview bot?

If the latter is blocked using 403, will snippets still be shown?

This looks like over-a-barrel time!

londrum




msg:4228946
 9:08 pm on Nov 10, 2010 (gmt 0)

it's a bit worrying for anyone with a bot trap.

if you're currently blocking spiders from crawling stuff like images and javascript in robots.txt then they are obviously ignoring it. because i am doing that too... yet the picture shows all the images and javascript-created text intact.

which means the spider could also crawl the bot trap and get sprung.

dstiles




msg:4228968
 9:49 pm on Nov 10, 2010 (gmt 0)

I block the preview bot (just checked and it gets a 403).

My wife just checked google UK and the preview is working in the UK now. Ghastly thing! After the test she turned off javascript again.

So, since I'm blocking the preview bot the info is coming from googlebot.

One of our client sites (at least) looks terrible: we've blocked furniture images from google in robots.txt (for most of our sites, in fact) so only the text is shown. I would not click on the page so doubtless the site will suffer.

On the other hand... one of our own sites blocks both img (furniture) and pics (topic photos) and the whole site is displayed IN FULL! So someone is disobeying robots.txt (and yes, it is correct!).

keyplyr




msg:4229007
 12:39 am on Nov 11, 2010 (gmt 0)

if you're currently blocking spiders from crawling stuff like images and javascript in robots.txt then they are obviously ignoring it...

I don't think it's necessary to request the individual files. I think Google Web Preview bot is just taking a snapshot of the page.

Mokita




msg:4229016
 12:56 am on Nov 11, 2010 (gmt 0)

@keyplyr

Google Web Preview is requesting all supporting files every time someone clicks on the preview icon.

keyplyr




msg:4229019
 12:59 am on Nov 11, 2010 (gmt 0)

@Mokita My logs don't show that at all. It shows the bot crawling and that's it. If the Preview function requested these files each and every time a user accessed it then I'd see these file requests all through my logs and I don't.
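One way to settle this from the logs is to count the preview UA's requests per file type. The sketch below assumes the Apache "combined" log format, and the sample lines are invented purely for illustration. The function name is made up too.

```python
import re
from collections import Counter
from pathlib import PurePosixPath

# Assumes Apache "combined" log format; adjust the pattern for other formats.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def preview_hits_by_extension(lines):
    """Count Google Web Preview requests per file extension."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match or "Google Web Preview" not in match.group("ua"):
            continue
        path = match.group("path").split("?", 1)[0]  # drop any query string
        ext = PurePosixPath(path).suffix.lower() or "(none)"
        counts[ext] += 1
    return counts


# Invented sample lines, for illustration only.
GWP = ("Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; "
       "Google Web Preview) Version/3.1 Safari/525.13")
sample = [
    f'66.249.82.66 - - [13/Nov/2010:21:05:01 +0000] "GET /widgets/blue.html HTTP/1.1" 200 5120 "-" "{GWP}"',
    f'66.249.82.66 - - [13/Nov/2010:21:05:02 +0000] "GET /css/site.css HTTP/1.1" 200 2048 "-" "{GWP}"',
    f'66.249.82.66 - - [13/Nov/2010:21:05:02 +0000] "GET /img/logo.png?v=2 HTTP/1.1" 200 9000 "-" "{GWP}"',
    '192.0.2.10 - - [13/Nov/2010:21:06:00 +0000] "GET /index.html HTTP/1.1" 200 4000 "-" "Mozilla/5.0 (Windows; U)"',
]
print(preview_hits_by_extension(sample))
```

If the counts for .css, .js and image types stay at zero over a few days of real logs, the bot is not fetching supporting files on that site.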

dstiles




msg:4229272
 5:52 pm on Nov 11, 2010 (gmt 0)

As I said, I have preview bot blocked with 403. They are either coming at the site through the punter's IP or are sucking via googlebot OR through an unrecognised IP using a "real" browser identifier.

The latter may be true. The punter option seems more likely EXCEPT I can't see any proxying of the punter's IP so if that is true they are also falsifying the source IP, which I can't see happening unless they have become really devious!

As noted above, at least one of my sites (make that at least 2 now!) shows pics and furniture when it shouldn't.

A client's site has pics in cache view but not in preview EXCEPT this site did not block pics until quite late (probably May 2009) and those images ARE shown, even though this breaks the recommendation in robots.txt. This is difficult to determine absolutely since pics on some pages are old and some new.

Another client site shows pics even though robots says not to BUT only on some pages. These pics (AND furniture) have always been disallowed but again are in cache view (so google has been breaking robots.txt protocol for some time... never thought of that before regarding cache).

I'm guessing here that the missing pics are probably due to google not having scraped them yet.

Another client site we run has several iframes per page (not on all pages). This seems to have caused only minor problems to preview, which shows the full iframed page WITH contents for specific keywords. Furniture is always shown, but product pics are not (always?) shown on the pages we've seen so far; again, they may not have been scraped yet for preview.

Pfui




msg:4229281
 6:12 pm on Nov 11, 2010 (gmt 0)

@dstiles: Do the sites showing pics w/o permission also have NOSNIPPET meta tags?

Ironically, I want the snippet descriptions (not the snap shots), but even without NOSNIPPET tags, I can't figure out why G doesn't show any at all. Hmm. I guess it's just as well, seeing as how the snippet and snap shot apparently go hand in hand now.

Hey, I got it! In addition to NOSNIPPET, how 'bout NOSNAPIT? ;)

dstiles




msg:4229356
 9:46 pm on Nov 11, 2010 (gmt 0)

Nosnippet is too new - only heard about it yesterday. And it's useless anyway since it also blocks display of snippets as well as pics.

I think NOTHEFT would be a better one. :)

Samizdata




msg:4229389
 11:48 pm on Nov 11, 2010 (gmt 0)

Perhaps give the Google Web Preview UA a 301 redirect to a generic advertisement for the site.

One that prominently features the words "Copyright Reserved" might be suitable.
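A rough sketch of that idea as a small piece of WSGI middleware in Python. PROMO_URL is a hypothetical landing page, the names are invented, and matching on the UA string alone is an assumption that only holds while the bot keeps announcing itself.

```python
from wsgiref.util import setup_testing_defaults

PROMO_URL = "/promo.html"  # hypothetical landing page, not from the thread


def redirect_preview(app):
    """WSGI middleware: send the Google Web Preview UA to PROMO_URL with a 301."""
    def wrapper(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        path = environ.get("PATH_INFO", "")
        # Skip the promo page itself so the redirect cannot loop.
        if "Google Web Preview" in ua and path != PROMO_URL:
            start_response("301 Moved Permanently",
                           [("Location", PROMO_URL), ("Content-Length", "0")])
            return [b""]
        return app(environ, start_response)
    return wrapper


def site(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"normal page"]


def call(app, ua, path="/"):
    """Tiny offline harness: run one request through the app and capture the reply."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = ua
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body


GWP_UA = ("Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; "
          "Google Web Preview) Version/3.1 Safari/525.13")
print(call(redirect_preview(site), GWP_UA)[0])
```

Returning "403 Forbidden" instead of the 301 gives the plain block discussed earlier in the thread.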

...

dstiles




msg:4229509
 10:18 am on Nov 12, 2010 (gmt 0)

But how to determine the UA/IP? If I knew the IP I could block it. If it's a "browser" UA on a true bot IP I block it anyway so it's not that.

So, is it a browser on a google non-bot IP or an IP from another farm? Since I block most farms I should be blocking that in any case, although I'm still finding new ones every day (viruses do have their uses!).

Samizdata




msg:4229520
 10:48 am on Nov 12, 2010 (gmt 0)

But how to determine the UA/IP?

As mentioned in the opening post, the user-agent is:

Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13

Seen here coming from 74.125.66.nn (possibly what you refer to as an "empty-DNS IP").

...

londrum




msg:4229535
 11:48 am on Nov 12, 2010 (gmt 0)

is there a way to allow the google preview bot to spider your images in robots.txt?

i know that adsense/adwords has got its own bot name that we can block. presumably we can do the same with this new preview thing too.

dstiles




msg:4229759
 10:33 pm on Nov 12, 2010 (gmt 0)

@Samizdata - I was going to say "No, it's not, the preview bot is blocked" but I notice this is only true for real bot IPs. I'll try blocking the UA. Thanks.

Pfui




msg:4230046
 9:05 pm on Nov 13, 2010 (gmt 0)

Another bare IP from G, and on a site where all Things G are wide open (client preference), the new GWP UA took everything on two pages in seven seconds: HTML, CSS, JS, JPG, GIF, PNG:

66.249.82.66
Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13

robots.txt? NO

But wait. There's more.

G's results for the company name and the domain now ONLY show the two pages hit by the GWP UA, both with the preview.

Plus the preview shows a highlighted/magnified section of its own choosing, looking like it's in/on the previewed page, but it's not. So not only do they 'take' a snap shot of your pages, they alter them, too.

Sheesh. When G picks, prunes, and re-presents other sites' info for you, why ever go elsewhe-- oh, forgot. That's the point.

So what about the site's other pages, like the Welcome page? Gone. How about all the other pages in the sitemap GWT says are included in results, none with NOSNIPPET tags? Gone. All GONE. Dammit.

dstiles




msg:4231222
 11:20 pm on Nov 16, 2010 (gmt 0)

For reference, the following IPs accessed my server as "Google Web Preview" between the dates indicated. I doubt they will be the only ones.

I originally thought some of them had legitimate googlebot rDNS. I was mistaken: they were allotted "bot" status by me on the grounds they were used by google for some other purpose (see notes below).

The IPs had no rDNS at the time of writing.

IPs trapped between 01-Nov-2010 00:00 GMT and 16-Nov-2010 22:30 GMT

66.249.82.1
66.249.82.2
66.249.82.66
66.249.82.129
66.249.82.199

66.249.85.65

72.14.194.33

72.14.202.80
72.14.202.85
72.14.202.86

74.125.152.80
74.125.152.81
74.125.152.82

74.125.158.81

(feedfetcher, translate etc also use these...)

64.233.172.1
64.233.172.6
64.233.172.17
64.233.172.18
64.233.172.20

74.125.16.1
74.125.16.2
74.125.16.3
74.125.16.65
74.125.16.66
74.125.16.68

74.125.74.129
74.125.74.130
74.125.74.131
74.125.74.132
74.125.74.194
74.125.74.195
74.125.74.196

74.125.75.1
74.125.75.3

(google verification bot uses this...)

72.14.194.17

Given the general re-use culture prevalent at google for some years now, what chance do we stand of identifying the preview bot accurately? We have NO guarantee it's what it claims to be.
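For comparison, the usual test for regular Googlebot is a forward-confirmed reverse DNS lookup: the PTR record should end in googlebot.com or google.com, and that hostname should resolve back to the same IP. A minimal Python sketch follows, with the resolver functions passed in so it can be tried offline. The function name and the stub resolvers are invented for illustration. IPs with no rDNS, like the ones listed above, simply fail the check.

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def verify_google_ip(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS: PTR must be a Google host that resolves back to ip."""
    try:
        host = reverse(ip)[0]
    except OSError:
        return False  # no rDNS at all
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward(host)[2]
    except OSError:
        return False
    return ip in addresses


# Offline demo with stub resolvers (made-up answers, no network needed).
def stub_reverse(ip):
    return ("crawl-66-249-66-1.googlebot.com", [], [ip])


def stub_forward(host):
    return (host, [], ["66.249.66.1"])


def stub_no_ptr(ip):
    raise socket.herror(1, "Unknown host")


print(verify_google_ip("66.249.66.1", stub_reverse, stub_forward))
print(verify_google_ip("66.249.82.66", stub_no_ptr, stub_forward))
```

Calling verify_google_ip(ip) with the default arguments does real DNS lookups, so cache the results rather than checking on every request.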

incrediBILL




msg:4232110
 7:13 pm on Nov 18, 2010 (gmt 0)

if you're currently blocking spiders from crawling stuff like images and javascript in robots.txt then they are obviously ignoring it. because i am doing that too... yet the picture shows all the images and javascript-created text intact.


The previews didn't contain any images from an image directory that I block from all Google spiders, so it's smarter than just a regular screen shot tool.

Anything coming from Google other than the regular spiders is simply treated like any other browser, which is why Google has many thousands of pages of my CAPTCHA displayed as a preview when it requested too many pages as a browser :)
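The "too many pages as a browser" rule can be sketched as a sliding-window counter per IP. This is only a guess at the general technique, not the actual setup described above. The class name, the limit of 10 hits and the 60-second window are all made-up values.

```python
import time
from collections import defaultdict, deque


class BrowserRateLimiter:
    """Sliding window: more than max_hits requests inside window seconds means challenge."""

    def __init__(self, max_hits=10, window=60.0, clock=time.monotonic):
        self.max_hits = max_hits    # hypothetical threshold
        self.window = window        # hypothetical window, in seconds
        self.clock = clock          # injectable so the logic can be tested
        self.hits = defaultdict(deque)

    def should_challenge(self, ip):
        """Record one request from ip and say whether it is now over the limit."""
        now = self.clock()
        queue = self.hits[ip]
        queue.append(now)
        # Drop timestamps that have fallen out of the window.
        while queue and now - queue[0] > self.window:
            queue.popleft()
        return len(queue) > self.max_hits


limiter = BrowserRateLimiter(max_hits=3, window=10.0)
print([limiter.should_challenge("66.249.82.66") for _ in range(4)])
```

When should_challenge returns True the server would show the CAPTCHA page instead of the content, which is what then ends up in the preview.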

dstiles




msg:4232134
 8:34 pm on Nov 18, 2010 (gmt 0)

I'm not sure it IS smarter, Bill. SOME of our sites have previews with a full set of pics and others don't. All have the images and pics folders disallowed in robots.txt. It MAY be possible that the preview bot scraped them before I got around to completely blocking it but that's a lot of pages - and yes, I know your take on whitelisting! :)

On one of their pages google says it's possible to allow pics in robots.txt and then disable them per-page using the X-Robots-Tag header, but I can't see that working and there are no examples. And, of course, that would mean feeding google a different robots.txt than all other SEs.


I really wish I could block google now! :(

Lovejoy




msg:4232144
 8:59 pm on Nov 18, 2010 (gmt 0)

I'm noticing a big uptick in sales since the preview went online. On the preview my site shows up perfectly. As I've set up my main page with bullet points, it looks very easy to navigate even in the preview. This might make it more appealing for searchers to actually decide which site looks like it might fill the bill over another that looks like a billboard for affiliates......

incrediBILL




msg:4232158
 10:06 pm on Nov 18, 2010 (gmt 0)

All have the images and pics folders disallowed in robots.txt


Once upon a time I also went into G's WMTs and told them expressly to remove all content from the image folders as well.

Maybe that's the difference.

However, screen graphics for the page layout are allowed and stored in a different folder.

dstiles




msg:4232182
 11:33 pm on Nov 18, 2010 (gmt 0)

I will probably end up moving the furniture, though I don't see it's any business of google. I'll keep an eye on our "got everything displayed" sites and see if they lose it.

Anyone know if I can charge all this work to google? No, thought not. :(

Pfui




msg:4232215
 2:31 am on Nov 19, 2010 (gmt 0)

FWIW, GWP just hit one site's root one time, but using three IPs in 2 secs:

74.125.112.81
66.249.85.1
74.125.112.84

UA for all: Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13

robots.txt? NO

Interestingly GWP only hit root this time, not the two other pages it hit before. And this time, only html and page-unique jpg, not css, gif and png as before.

That pattern sure suggests it's caching site-wide css and js.

Last but not least...

Googlebot is disallowed from anything but .html -- and GWP has taken everything but .pdf (thus far). So it appears GWP doesn't regard itself as Googlebot. Or as ANY bot for that matter because it doesn't request robots.txt on its own. If it had, it ideally should've heeded:

User-agent: *
Disallow: /
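To see what a well-behaved client would do with rules like these, Python's standard urllib.robotparser can be fed the text directly. This is only a sketch: the helper name is made up, and the second rule set (blocking /img/ and /pics/) is an assumed example modelled on the folders mentioned earlier in the thread.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Would a robots.txt-compliant client with this UA token fetch the URL?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Assumed example: furniture images and topic photos kept away from all bots.
IMAGES_BLOCKED = """\
User-agent: *
Disallow: /img/
Disallow: /pics/
"""

print(allowed(BLOCK_ALL, "Google Web Preview", "/index.html"))
print(allowed(IMAGES_BLOCKED, "Google Web Preview", "/img/logo.gif"))
```

Both calls come back False for a compliant client, so any GWP fetch of those paths means the rules were not consulted.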

[edited by: Pfui at 2:43 am (utc) on Nov 19, 2010]

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved