
Search Engine Spider and User Agent Identification Forum

And Now Google's Doing It. JS Stats Show GoogleBot
TheMadScientist
msg:4312060 - 7:08 pm on May 13, 2011 (gmt 0)

Hmmm ... Well, a couple weeks ago I noticed M$ showing up in my JavaScript stats, and today I've got GoogleBot doing the same thing ... They're not quite as far along as M$ (or choose not to be) because they're missing the jQuery grab of the variable from the source code of the page, but they're really F'ing up my stats and they're COMPLETELY disregarding the robots.txt ... I think G's about to get banned in the .htaccess from the files I don't want them to call.

This is the first time in nearly 9 years I've seen G blatantly disregard robots.txt, and they're doing it with a GoogleBot UA. I've given these guys the benefit of the doubt on honoring robots.txt so many times it's not even funny, but this JS file is disallowed, so there's no way a GBot user-agent should be in my JS stats.
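
For anyone wanting to do the same kind of ban, here's a minimal .htaccess sketch, assuming Apache 2.2-era syntax; the file name "stats.js" and the UA patterns are illustrative placeholders, not from the thread:

    # Illustrative sketch only: deny bot/preview UAs access to a
    # robots.txt-disallowed JS file. "stats.js" is a placeholder;
    # adjust the pattern to whatever UA shows up in your JS stats.
    SetEnvIfNoCase User-Agent "Googlebot|Google Web Preview" blocked_bot
    <Files "stats.js">
        Order Allow,Deny
        Allow from all
        Deny from env=blocked_bot
    </Files>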

 

TheMadScientist
msg:4313774 - 7:14 pm on May 17, 2011 (gmt 0)

Given one page to examine isn't really a crawl...

Okay, but even going by the information posted in this thread, that's not what they're doing...

They may be claiming that's not what they're doing, but lucy24 noted in this thread that previews were activated and 100 requests were made, including requests for linked hypertext documents and the files included on those pages, which makes it a 'crawler' and subject to 'protocol', imo.

And, what they really do is present users with a 'list' of pages for them to crawl. (They're requesting all 10 results from a single click as far as I can tell.)

They do not, as far as I can tell through use, or from what they say, or even from the data we have posted here, simply request the preview for the page a user clicks the magnifying glass for; they request all 10 (or however many results are shown on a page, plus whatever subsequent hypertext documents they request) as soon as a user turns on previews.

'At the request of the user', imo, would be this: when a user clicks preview next to a page (where they put the magnifying glasses on the results), they request that page. But they're requesting (from what's been reported here and based on usage) more than the single page the user clicks the glass next to.

ADDED: They can 'spin it' and present it in any way they like, and if 'breaching protocol' were 'not a big deal', imo, then they wouldn't bother to 'spin it' or 'play it off' or 'candy coat it' at all, because there would be no need, but they do.

[edited by: TheMadScientist at 7:24 pm (utc) on May 17, 2011]

incrediBILL
msg:4313777 - 7:22 pm on May 17, 2011 (gmt 0)

They may be claiming that's not what they're doing, but lucy24 noted in this thread that previews were activated and 100 requests were made, including requests for linked hypertext documents and the files included on those pages, which makes it a 'crawler' and subject to 'protocol', imo.


That could simply be prefetch enabled by default in the browser they're using, a simple but painfully bandwidth-abusive oversight most would catch.

Maybe I don't see that behavior because I have prefetch blocked at the server level.

Prefetch is still not a crawler; it's a browser function, something not deemed beholden to robots.txt either.
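
For reference, Firefox's link prefetching announces itself with an "X-moz: prefetch" request header, so blocking prefetch at the server level can look like the following sketch (Apache 2.2-era syntax with mod_setenvif assumed):

    # Firefox link prefetch requests carry an "X-moz: prefetch" header;
    # tag those requests and refuse them, so pages are only served on
    # an explicit user request.
    SetEnvIf X-moz prefetch is_prefetch
    Order Allow,Deny
    Allow from all
    Deny from env=is_prefetch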

TheMadScientist
msg:4313781 - 7:26 pm on May 17, 2011 (gmt 0)

Prefetch is still not a crawler; it's a browser function

That's interesting ... Will you explain why it's considered that?

Oh, never mind! I misread.

I was thinking we were still on the 'Web Preview Bot', which I don't think could be considered a browser function, but you said 'prefetch'. They might try to 'classify' the 'Preview Bot' as a pre-fetch mechanism, but if that's the case, then even though it may not be an 'officially adopted standard' I would think they should send an X-Forwarded-For and not cache the data.

I know someone posted earlier about the data not seeming to be cached when served a 403, but that's likely different behavior than when they actually get the contents of the URLs ... If I wanted what was contained, I would not cache a 403 either, because then, as soon as the 403 is removed, I would have the data.

I'm sure they're 'nice enough' to cache and store the data internally for bandwidth reasons, but ... They're still 'crawling' by doing it, imo.

[edited by: TheMadScientist at 7:57 pm (utc) on May 17, 2011]

incrediBILL
msg:4313784 - 7:30 pm on May 17, 2011 (gmt 0)

I was thinking we were still on the 'Web Preview Bot', which I don't think could be considered a browser function, but you said 'prefetch'.


Because the web preview bot would require a browser. And I didn't make the decision; blame Firefox and Google for nasty old prefetch.

It was designed to speed up browsers by pulling the next logical set of pages into cache ahead of the user request, based on links on the page; hardly a "crawl", since it's limited to a few links from the current page only.

If the web preview bot's browser started making a series of screen shots following that path, its behavior would in fact appear to be crawling to the outside observer. Seen it happen before, figured it out, dismissed it as a non-crawl.

I could be wrong but it's my best guess based on past observations of similar behavior.

YMMV

TheMadScientist
msg:4313789 - 7:37 pm on May 17, 2011 (gmt 0)

Because the web preview bot would require a browser. And I didn't make the decision; blame Firefox and Google for nasty old prefetch.

LMAO!

Yeah, there are a bunch of 'hair splitting' arguments, but for a company which states 'caution is important' (Amit Singhal), they don't seem to be erring on the side of caution or of 'generally accepted protocol' with this one ... They seem to be erring on the side of 'throw protocol out the window, do as we please, and word it just right, so we can still claim to "do the right thing."' My opinion only, of course...

[edited by: TheMadScientist at 7:44 pm (utc) on May 17, 2011]

incrediBILL
msg:4313794 - 7:43 pm on May 17, 2011 (gmt 0)

That's because G has many dichotomies which don't always mesh well; even various components of the SE itself, including googlebot and this web preview, seem to be at odds with each other.

For instance, my entire site is NOARCHIVE, wouldn't a screen shot qualify as an archive?

HELLO!

TheMadScientist
msg:4313795 - 7:46 pm on May 17, 2011 (gmt 0)

For instance, my entire site is NOARCHIVE, wouldn't a screen shot qualify as an archive?

A reasonable person would think so, imo, but I'm not so sure about the Rocket Scientists and/or PhDs they employ ... My guess is they have a 'different interpretation' of the directive ... Slightly more 'nuanced', so to speak. lol

TheMadScientist
msg:4313814 - 8:25 pm on May 17, 2011 (gmt 0)

I think I'll leave this one with the following.
My opinion only, of course...

If the Robots Exclusion Protocol and adhering to it were 'not a big deal', whether it's an officially adopted standard or not, then there would be no need to 'explain away' requests for disallowed files, why they may happen, or 'blame them on the user', but Google does:

As on-the-fly rendering is only done based on a user request (when a user activates previews), it's possible that it will include embedded content which may be blocked from Googlebot using a robots.txt file.

If they didn't want or use the information internally, then why would they cite their cloaking page as the reference for showing different versions of the content to Google's Web Preview Bot and the real GoogleBot*, and why would they not give a simple, easy way to block the previews without also blocking a 'description' (snippet)?

* And if they don't cache, keep or use the information, because it's 'at the request of the user', then how would they know if it was cloaked or not? (You can't tell unless you compare two versions of it.)

No. You must show Googlebot and the Google Web Preview the same content that users from that region would see (see our Help Center article on cloaking).

They can somehow show the 'snippet' when previews are not activated, and they could show the 'snippet' before the preview system was in existence, but now they can't figure out how to show a snippet if you don't want a preview available? I don't buy it.

You can block previews using the "nosnippet" robots meta tag or x-robots-tag HTTP header. Keep in mind that blocking previews also blocks normal snippets. There is currently no way to block preview images while allowing normal snippets.

I know, I know, that Robots Exclusion Protocol is merely a guideline and not an 'official web standard', and it's really not even breaking it the way they do it, because they 'don't crawl the web', and when they seem to maybe crawl the web and maybe stretch or break the 'unimportant protocol' a little bit, it's 'at the request of the user', so it's not really a big deal or even really them doing it ... IMO: That's a Great Spin!

https://sites.google.com/site/webmasterhelpforum/en/faq-instant-previews
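
For reference, the "nosnippet" block that FAQ describes comes in two forms; here's a minimal sketch of the HTTP header form, assuming Apache with mod_headers (the meta tag form is shown in a comment):

    # Sends "X-Robots-Tag: nosnippet" on every response (mod_headers required).
    # The in-page equivalent is: <meta name="robots" content="nosnippet">
    # Per the FAQ quoted above, this blocks normal snippets as well as previews.
    Header set X-Robots-Tag "nosnippet"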

lucy24
msg:4313821 - 8:56 pm on May 17, 2011 (gmt 0)

For the insatiably curious, I've dumped the relevant 100-fetch chunk here [lucysworlds.com]. It's a 27k text file, unedited except that I changed the line endings from LF to CRLF. Of course you'll have to take my unsupported word for it that I didn't fake it up in a text editor, since I'm not about to let y'all into my actual log area ;)

I should add that this visit was anomalous; for other days that I checked, Web Preview visits were strictly limited to files associated with a single page. That's why I initially assumed they were generated in response to an individual user request.

dstiles
msg:4313864 - 10:27 pm on May 17, 2011 (gmt 0)

I could cause 100 web preview hits easily enough: set number of results/page to 100 and do a site: or whatever search, which should return 100 pages from the one web site. Plus, of course, css, js, images etc. Or even set it to 10 and page down 10 pages (or more). Why? Competitor, perhaps? In my own experience I do not think that web preview recurses pages.

My own experience of web preview is that it is not technically a bot. It's a proxy. Google's IP is the "proxy server" interfacing with the user's own IP (usually broadband; the user's IP shows up in the Forwarded-For header). This means that a user does a search and google - on the user's behalf - fetches the previews that are not already in its cache (ie its serps database, not its webcache, as prohibited by the nocache directive). All the hits I see for web preview are of this "forwarded for" proxy format, usually only a few at a time, as one would expect for a normal search for a word or phrase.
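
To check for that "forwarded for" format in your own logs, a hedged sketch for the Apache server config (not .htaccess); the log path and format name are placeholders:

    # Combined log format plus the X-Forwarded-For header, so proxy-style
    # web preview hits (Google IP in %h, user IP in xff=) are easy to spot.
    LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" xff=\"%{X-Forwarded-For}i\"" combined_xff
    CustomLog /var/log/apache2/access.log combined_xff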

TheMadScientist
msg:4313884 - 11:26 pm on May 17, 2011 (gmt 0)

This one is the 'real issue' I have with the 'at the request of the user' statement ... X-Forwarded-For or not ... Crawler or not ... Anything else or not:

No. You must show Googlebot and the Google Web Preview the same content that users from that region would see (see our Help Center article on cloaking).

If it's 'at the request of the user', and they're simply a proxy, then how would they know if a page is cloaked or not? (You can't tell unless you compare two versions of a page.)

Oh, while we have the information here for the visitor, let's take a lil peek, and maybe some other user will want to look, so let's keep it for a bit ... That's not a 'simple' and 'innocent' proxy request.

Where do you get two versions to compare in the 1st place when the files are disallowed - which is the reason they give for requesting disallowed files with the Preview Bot UA?

Something about their whole sales pitch around this bot stinks, imo.
(Yeah, I came back for one more. lol)

[edited by: TheMadScientist at 11:46 pm (utc) on May 17, 2011]

lucy24
msg:4313891 - 11:46 pm on May 17, 2011 (gmt 0)

I could cause 100 web preview hits easily enough: set number of results/page to 100 and do a site: or whatever search.

Your relationship with google must be very different from mine. I tried it two ways. If I simply pull up 100 results, nothing gets loaded into Web Preview. If I pull up the 100 results and then click on Preview from the first listed page-- which happens to be my never-visited front page-- I get a quick flurry of 31 hits in, amusingly, the same six seconds noted elsewhere. All from the identical UA, but the IP starts at 64.233.172.17 and then starts drifting around to others, mainly 74.125.75.20, when it moves to other pages.

To be exact ("page" always means "page and all associated files"):

A: main page-- the one I asked to Preview

B, C: two of the six pages linked from the main page. Why those two? I have absolutely no idea.

D: one page located at least four links away (five by an alternative route) from page C. This page is in a no-index directory-- a fact the googlebot assimilated several days ago. Since users will never see it in search results, why does Web Preview need it?

For comparison purposes, the 100-hit search consisted of:

E: main page

F: page linked from main page (not the same as B or C above)

G,H,I,J: pages located two links away from main page, none of them via page F

blend27
msg:4313900 - 12:16 am on May 18, 2011 (gmt 0)

So it's not OK to use automated tools to query Google, but it's OK for the WebPreview Bot to make automated requests to files that are disallowed in robots.txt?

it is not technically a bot. It's a proxy.


And my script says that if the requests are made via proxy and the rate is higher than 5 in 2 seconds, the user gets banned for an hour, busta.

Just for this:

I could cause 100 web preview hits easily enough.


And I don't lose any sleep over it.
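
blend27 doesn't post the script itself, but for anyone after roughly the same policy off the shelf, mod_evasive can express the thresholds. A sketch for the Apache server config; note this is not his script, it keys on plain request rate per IP, and it can't see his extra "via proxy" condition:

    # Roughly the stated policy: more than 5 requests for the same URI
    # within 2 seconds from one IP draws a 403 and stays blocked for
    # 3600 seconds (an hour).
    <IfModule mod_evasive20.c>
        DOSHashTableSize  3097
        DOSPageCount      5
        DOSPageInterval   2
        DOSBlockingPeriod 3600
    </IfModule>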

dstiles
msg:4314369 - 10:04 pm on May 18, 2011 (gmt 0)

I'm not saying it's ok - I am very much anti-google nowadays (from being a strong supporter in the early days). I'm just saying what I see and how I understand it to work. :)

TMS - Ever heard of a bluff? Like in poker? Google says you must always show them the same as you show a customer. As long as you do not incur their wrath in any other way they are, I suspect, unlikely to check. And in any case there are legitimate cases for presenting different content - SEs already have more than we want them to have.

And I agree - the whole of google stinks right now and has for at least the past two years.

Lucy - I think you are seeing google's reaction to your request. If no one hovers a preview then there is no need for a site request (with JS turned off I see neither a preview nor a request). If google already has your page (including images etc), as scraped earlier by a true googlebot OR a successful web preview, then it shouldn't (according to google) go to your site for more information. If it does not have images because they are disallowed in robots.txt then google will try to fetch them WITHOUT checking robots.txt.

In my case they are mostly out of luck, except for one customer who wanted complete previews; in his case I've removed the images ban from robots.txt.
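
For context, that robots.txt arrangement looks like this (the path is a placeholder); dropping the Disallow line is what lets previews include the images:

    # Placeholder path: images blocked for all robots, which also keeps
    # them out of Instant Previews. Remove the Disallow to allow full previews.
    User-agent: *
    Disallow: /images/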

I think google fetches all previews (that it hasn't got) as soon as one has been hovered. Pretty dumb from a site owner's viewpoint (analogous to the ghastly AVG mess a year or so back) but obviously quicker for the user. Which pages it tries to fetch depends on what is in the SERPS, which quite likely varies from search to search.

An aside: I've been using startpage by ixquick for some time now as my primary SE since dumping google as an SE. Startpage has now dropped all of its meta search results in favour of google results. Ixquick, on the other hand, still returns a meta search excluding google, so guess which version I'm now using. Until that too goes google. :(

(When is a vacuum cleaner a washing machine? When google says it is! That happened to my wife today when researching a customer's site!)
