
Google Fetch & Render vs Google Cache

1:18 pm on Aug 4, 2015 (gmt 0)

Full Member

5+ Year Member Top Contributors Of The Month

joined:June 18, 2012
posts: 341
votes: 1


Hello All,

I was looking through Fetch as Google in GSC and found that some of the images, including the logo, top nav images and banner images, are blocked by the robots.txt file. In other words, Google shows an almost text-only version of my pages under 'This is how Googlebot saw the page' in GSC.

But when I check the cached version of those pages in Google search, the pages appear normal. It even reads: This is Google's cache of 'the page'. It is a snapshot of the page as it appeared on 2 Aug 2015 21:18:22 GMT. The current page could have changed in the meantime.

I'm not really sure how blocked resources can still be visible in Google's cached pages. Can you please explain?

Thank you all!
2:46 pm on Aug 4, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member themadscientist is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 14, 2008
posts:2910
votes: 62


They use the HTML source for the cache, so they're not actually requesting/displaying the page, the images, or anything else. They're giving the browser the source code they have for the page and letting the browser "do what browsers do" with source code, which is display it and request/display the dependent files.

Short Version: The visitor is requesting the files when they view the cache. Google is simply providing the HTML to the browser.

For Fetch & Render, Google is trying to get the page and the dependent files itself, so if there's a block in robots.txt, they honor it and don't fetch those files.
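
If it helps to see that distinction in code form, here's a minimal sketch using Python's standard urllib.robotparser module -- the robots.txt rules, the domain and the image path are all made up for illustration:

import urllib.robotparser

# Hypothetical robots.txt for example.com -- these rules are invented.
robots_txt = """User-agent: *
Disallow: /images/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Fetch & Render: Googlebot checks robots.txt before requesting a resource,
# so a blocked image is never fetched and the render shows a gap.
print(rp.can_fetch("Googlebot", "http://example.com/images/logo.png"))  # False

# Cached view: the browser never consults robots.txt at all -- it simply
# issues a normal GET for whatever the cached HTML references.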
9:14 pm on Aug 4, 2015 (gmt 0)

Full Member

5+ Year Member

joined:Apr 26, 2012
posts:328
votes: 8


Google now renders the site exactly as you'd see it in a browser for indexing and ranking purposes, and it can't do that if the JS and CSS files are blocked.
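
If blocked CSS/JS is the problem, one way to open them up is an explicit allowance for Googlebot in robots.txt. A hypothetical example -- the paths are invented, so adjust them to wherever your CSS, JS and images actually live:

User-agent: Googlebot
Allow: /css/
Allow: /js/
Allow: /images/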
10:02 pm on Aug 4, 2015 (gmt 0)

Senior Member

WebmasterWorld Senior Member jimbeetle is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Oct 26, 2002
posts:3295
votes: 9


Be sure to see this thread [webmasterworld.com] about Google issuing warnings in GSC about resources blocked by robots.txt.
9:31 am on Aug 5, 2015 (gmt 0)

Full Member

5+ Year Member Top Contributors Of The Month

joined:June 18, 2012
posts: 341
votes: 1


Thanks @TheMadScientist,

They use the HTML source for the cache, so they're not actually requesting/displaying the page, the images, or anything else. They're giving the browser the source code they have for the page and letting the browser "do what browsers do" with source code, which is display it and request/display the dependent files.
If I understand you correctly, are you saying that the browser requests the dependent files when I click on 'Cached' beneath a page in the search results? If that's the case, how come the cached version obviously shows an older version of the page if the browser is actually making a live request to the server?

I am confused, since Google says "The current page could have changed in the meantime", which means everything on the cached version is not live. What am I missing here?

Thanks @Dymero & Jimbeetle :-)
10:32 am on Aug 5, 2015 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2000
posts:11863
votes: 291


...Google says "The current page could have changed in the meantime", which means everything on the cached version is not live. What am I missing here?

The cached version in fact is not live. It is a partial copy of your page, stored on the search engine's server as a kind of reference copy and backup. The backup can be useful if your live site is down.

Search engine caches can persist unchanged for several weeks, and they're usually served from a different data center than the serps are. By their nature, caches must be slightly behind live results, since there's some time involved in indexing and saving them. If a cache is kept online unchanged for several weeks, then it's prudent of Google to let you know that the current page, or parts of it, could have changed during that interval.

It's been a while since I've noticed a cache with changed images, out of sync with those on the web server, but I assume that's possible. If the code for the image changes is not yet in the cache, the spots for the images would simply show as empty, or whatever the mismatch might be.

The current site, by contrast, is not on Google's server... it's accessed by a link from Google to the site on your web server... so the content it displays would always be the most current.
11:48 am on Aug 5, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member themadscientist is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 14, 2008
posts:2910
votes: 62


What Robert_Charlton said, and here's a way to see what he's saying using a thread we know should change:

Go to this thread [the Aug. Update Thread]: [webmasterworld.com...]

View Source > Copy > Create a New File in Your Favorite Text Editor > Paste > Save As google-4760634.htm -- That's a Cache of the Aug. Update Thread page.

Open the file you saved [your cache of the page] in your favorite browser and you'll see the same thing as you do on the live site, including the images, even though you didn't download those by copying, pasting, and saving the source code -- Your saved page [cache] is "in sync" with the live version.

Wait a week -- Open the file you saved as google-4760634.htm [your cache] in a browser again and then view the live thread in another tab. They'll be different, because your saved page [cache] won't contain any posts made in the thread between the time you saved it originally and the time you viewed it again a week later -- Your saved version [cache] will be "out of sync" with the live version.

If, in a week, you copied/pasted/saved the source from the thread again and overwrote your saved version [cache], you would have "refreshed your saved version" [refreshed your cache] and it would be the same as you see on the live site again. Your saved copy [cache] would be "synced", and it would stay "in sync" with the live version until another post is made in the thread; then your saved version [cache] would be "out of sync" and need to "be refreshed" [re-copied/pasted/saved] to be "in sync" again.
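
If you'd rather script that copy/paste/save step than do it by hand, here's a rough Python equivalent -- the URL is a placeholder, so substitute the real thread address:

import urllib.request

# Grab only the HTML source, which is all Google keeps in its cache --
# no images, CSS or JS are downloaded by this request.
url = "http://www.webmasterworld.com/some-thread.htm"  # placeholder URL
with urllib.request.urlopen(url) as response:
    html = response.read()

# Save it as your own "cache" of the page.
with open("google-4760634.htm", "wb") as f:
    f.write(html)

# Opening google-4760634.htm in a browser makes the *browser* request the
# dependent files from wherever the HTML points, just like Google's cache.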
12:00 pm on Aug 5, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member themadscientist is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 14, 2008
posts:2910
votes: 62


ADDED: Just viewed the source here and you might not see the images, since they don't have a <base href=> on the page and the links for the images and dependent files are relative. If you want to see them, you may need to add a <base href="http://www.webmasterworld.com/"> to the <head></head> of your "cached version" of the page.
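
If you don't feel like editing the saved file by hand, here's a quick-and-dirty Python sketch that patches it -- this assumes the page has a plain <head> tag with no attributes, which may not hold for every page:

# Insert a <base href> right after <head> so the relative links for images
# and dependent files resolve against the live site instead of your disk.
with open("google-4760634.htm", encoding="utf-8") as f:
    html = f.read()

html = html.replace(
    "<head>",
    '<head><base href="http://www.webmasterworld.com/">',
    1,  # only patch the first occurrence
)

with open("google-4760634.htm", "w", encoding="utf-8") as f:
    f.write(html)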
6:22 am on Aug 6, 2015 (gmt 0)

Full Member

5+ Year Member Top Contributors Of The Month

joined:June 18, 2012
posts: 341
votes: 1


Thanks @Robert Charlton & TheMadScientist for the detailed explanation. I understand the concept pretty well, but what I'm struggling to understand is where the browser gets the resources from.

Google only provides the browser with the HTML source of the page that it saved sometime in the past (which might have changed since), and the browser needs to render the page by requesting all the CSS, JS and images that might be on the page. If it's not making a live request to the server for the dependent files, where does it get the files from? From its local disk? What if I've disabled caching of files in the browser?

Thanks again!
6:42 am on Aug 6, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13829
votes: 484


It's been a while since I've noticed a cache with changed images

I remember noticing image mismatches when I was experimenting with image search, which involves a different type of caching.
:: detour to Google ::
By pure coincidence, I made a visible change on almost all pages just a day or two ago, though that isn't what I wanted to check. The page I picked first claims to have been cached on 28 July.
:: further detour to logs ::
{ my IP } - - [05/Aug/2015:23:36:18 -0700] "GET /hovercraft/images/glassplate.png HTTP/1.1" 200 2601 "http://webcache.googleusercontent.com/search?q=cache:ghh0k0m6Cu8J:example.com/hovercraft/+&cd=8&hl=en&ct=clnk&gl=us" "{ my browser }" 
{ my IP } - - [05/Aug/2015:23:36:18 -0700] "GET /hovercraft/images/eel.png HTTP/1.1" 200 1445 "http://webcache.googleusercontent.com/search?q=cache:ghh0k0m6Cu8J:example.com/hovercraft/+&cd=8&hl=en&ct=clnk&gl=us" "{ my browser }"
{ my IP } - - [05/Aug/2015:23:36:18 -0700] "GET /hovercraft/images/ducttape.png HTTP/1.1" 200 1912 "http://webcache.googleusercontent.com/search?q=cache:ghh0k0m6Cu8J:example.com/hovercraft/+&cd=8&hl=en&ct=clnk&gl=us" "{ my browser }"
{ my IP } - - [05/Aug/2015:23:36:18 -0700] "GET /hovercraft/images/thumbs/smallkabloona.jpg HTTP/1.1" 200 5095 "http://webcache.googleusercontent.com/search?q=cache:ghh0k0m6Cu8J:example.com/hovercraft/+&cd=8&hl=en&ct=clnk&gl=us" "{ my browser }"
{ my IP } - - [05/Aug/2015:23:36:18 -0700] "GET /hovercraft/images/thumbs/smallbandplayed.jpg HTTP/1.1" 200 3781 "http://webcache.googleusercontent.com/search?q=cache:ghh0k0m6Cu8J:example.com/hovercraft/+&cd=8&hl=en&ct=clnk&gl=us" "{ my browser }"
et cetera. No CSS requests, but that's because it was already in my browser's cache. (I tried another, more obscure page, from a different directory, to cross-check. How obscure? Sooo obscure, the datestamp on that cache was mid-May. Ouch.)

Huh. Dang, TMS, do you know, I'd never realized that Google's cache is only HTML. I'd always assumed it was a full archive. Interesting.
12:45 pm on Aug 6, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member themadscientist is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 14, 2008
posts:2910
votes: 62


If it's not making a live request to the server for the dependent files, where does it get the files from? From its local disk? What if I've disabled caching of files in the browser?

Q1.) From wherever the HTML source tells it to, which generally is your site -- if you're using a CDN or files from another site or something along those lines, then that's where it'll try to get them. But the "bottom line" is the *browser* will request them from wherever the HTML tells it to, which is not Google, unless you have src=http://google.com/some-file.ext in the source of your HTML for some reason. [* The browser making the request instead of Google is an important distinction -- see below]

Q2.) Yes, if the *browser* has a copy that's not "expired" or if it's not told to "not cache" the resource(s) [images, css, js, etc.] it will make a request to your site for the dependent files.

Q3.) If caching is disabled, the *browser* will request them from your site if it's doing what it's told [sometimes getting some browsers, especially IE, to "do what they're told" can be challenging]

The distinction between the browser making the request and Google making the request is the main reason why the cached version will show images, CSS and JS even when Google is blocked in robots.txt, but Fetch & Render won't -- it comes down to which is making the request for the files: the browser or Google.

-- For the cached version of a page, the browser, not Google, makes the request for the dependent files, so robots.txt doesn't apply. A browser doesn't check for robots.txt or even know it exists, unless you try to open http://example.com/robots.txt in a browser window.

-- For Fetch & Render, your browser requests a page from Google much like viewing the cached version, *but* when Google gets the request, instead of just sending the HTML of the page and letting your browser request the dependent files, *Google* tries to get a fresh copy of the HTML *and* all the resources, to put the page together itself and show you what *Google* got as a result of those requests, not what you would get if your browser made the requests itself. Google uses a bot [Googlebot], not a browser, to get the resources, so Googlebot checks robots.txt, and if robots.txt says Googlebot is blocked from certain resources, it doesn't get those resources, because it's been told not to.

TL;DR

Cached Version:
The browser makes the request(s) for the resources of an HTML page.
No robots.txt blocks checked or applied.

Fetch & Render:
Googlebot makes the request(s) for the resources of an HTML page.
Robots.txt checked and obeyed.
6:20 pm on Aug 6, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13829
votes: 484


if the *browser* has a copy that's not "expired" ... it will make a request to your site for the dependent files.

Wouldn't that be if it has a copy that is "expired" (so it has to re-request the resource)?
6:29 pm on Aug 6, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member themadscientist is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 14, 2008
posts:2910
votes: 62


Well, yeah, but go with the point, not what my crazy fingers typed -- LOL!

Thanks for catching it.
10:58 pm on Aug 6, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13829
votes: 484


I got a bit sidetracked trying to figure out whether it's even physically possible for a browser's cache-- a human browser, I mean, not Google-- to have expired content. Modern caches can be vast, but they're not infinite. And you'd think the #1 candidate for dumping would be any material that's past its expiration date, whether set by response headers (Cache-Control, Expires ... there's a long list) or the browser's internal rules (I once looked it up and found a recommendation for 10% of the stated age of each item).
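
For the curious: you can see what expiry a server actually hands out for a given resource with a quick HEAD request. A Python sketch -- the URL is a placeholder, so point it at one of your own files:

import urllib.request

# HEAD request: fetch only the response headers, which tell a browser
# how long it's allowed to keep its local copy of the resource.
req = urllib.request.Request("http://example.com/style.css", method="HEAD")
with urllib.request.urlopen(req) as response:
    for header in ("Cache-Control", "Expires", "Last-Modified", "ETag"):
        print(header, ":", response.headers.get(header))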

And now the fun part. In yesterday's logs, about an hour after the stuff I quoted above, I found a fresh package of requests for all supporting files (not just images but scripts, css, favicon) belonging to that same page:
{ IP in India } - - [06/Aug/2015:00:32:11 -0700] "GET /sharedstyles.css HTTP/1.1" 200 2201 "http://webcache.googleusercontent.com/search?q=cache:ghh0k0m6Cu8J:example.com/hovercraft/+&cd=8&hl=en&ct=clnk&gl=us" "{ someone else's browser}"
et cetera.

And your point is ...?

See where it says "example.com"? In my original post, I replaced my real domain name with "example.com". But here it's the actual text. So this isn't a real search leading to a "view cached page" click; it's a cut-and-paste from this very thread :) Further experimentation suggests that you can't simply delete everything after the "cache:blahblah" element-- though you can delete the other parameters-- but it doesn't seem to matter if the part right after the colon is the actual page URL or something else.
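
For illustration, the minimal form that still seemed to work in my tests looked something like this (same cache ID as in the log lines above):

http://webcache.googleusercontent.com/search?q=cache:ghh0k0m6Cu8J:example.com/hovercraft/

with the &cd=, &hl=, &ct= and &gl= parameters dropped, and the part after the second colon apparently interchangeable.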
9:07 am on Aug 7, 2015 (gmt 0)

Full Member

5+ Year Member Top Contributors Of The Month

joined:June 18, 2012
posts: 341
votes: 1


Thanks @TheMadScientist & lucy24 :-)

The distinction between the browser making the request and Google making the request is the main reason why the cached version will show images, CSS and JS even when Google is blocked in robots.txt


This is where I find myself repeatedly getting stuck. I think I can explain with an example.

Page A - Version I goes live on 08/01/2015
Page A - Version II goes live on 08/07/2015

Both these versions have different images, CSS and JS, and when Version II goes live, everything on Version I becomes non-existent. So any requests for the old files, such as images, CSS and JS, will result in a 404 on the server. Version I's files don't necessarily 301 redirect to Version II's either.

Meanwhile, between the changes, Google crawls Page A - Version I on 08/05/2015 and has a snapshot of it in its cache (I'm sure crawling and caching are two different things, but the cache can only reflect the last crawl, I believe).

I search for Page A - Version I directly on Google, or through a content search, since I know it would still be in Google's index, as Google tends to crawl my site about once a week.

I click to see the cached version of Page A - Version I from Google search. Here Google is supposed to deliver the HTML source of the page to whichever browser I'm on. Now, the browser is able to render Page A - Version I absolutely fine, while the live version on the server is Page A - Version II with all the new files.

Q. How is that possible, if the browser is requesting the dependent files from my server based on a cached (HTML) page whose resources no longer exist on my server?
10:36 am on Aug 7, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member themadscientist is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 14, 2008
posts:2910
votes: 62


I haven't personally seen that happen -- I've seen cached pages with broken style sheets because I changed the name of the style sheet on a site and deleted the old version from the server, but haven't seen Google somehow "fill in the blanks" when resources aren't present.

If you're seeing it, have you emptied your browser's cache prior to viewing the page?

Where does your browser say it's getting the resource from? If you use Firefox or Safari (I don't use Chrome or IE much, so IDK about those), you can check where the resources are coming from fairly easily:

Firefox > Tools > Page Info > Media > Shows the Full URL of the source.
Safari > Develop > Show Page Resources > Look to the Right of the Resource.

I've checked quite a few cached versions of pages since this thread started and have yet to see either Firefox or Safari indicating even a single resource came from google.com. So my guess is that if you're seeing information/resources not on the host site's server, it's probably coming from your browser *or* an upstream [from you, downstream from the site] server's cache that's either caching when it's not supposed to, or whose cached copy hasn't expired yet, so it's serving you its local copy rather than re-requesting it from the site itself -- it's not Google. If you find something different, please share, because I'd be interested in knowing more about it.
8:15 pm on Aug 7, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13829
votes: 484


How is that possible, if the browser is requesting the dependent files from my server based on a cached (HTML) page whose resources no longer exist on my server?

What's not possible? A request is only a request; it doesn't mean the files will turn out to be present. ("Folks in hell want ice cubes, but that don't mean they'll get 'em.")

It isn't clear whether you're reporting on what you saw with your eyeballs in your everyday browser, or what was present in your logs. Remember up above where I posted my first cached-page experiment? My server logs showed requests for all images but not scripts and stylesheets. Those were already in the browser's cache, because I'd viewed a different page in the same directory within the last few days. In order for logs to show requests for stylesheets, I had to view a cached page from a different directory, which I hadn't happened to visit in that browser recently.

View your cached page in a different browser, one you don't normally use, preferably an unrelated one. (In particular there are a lot of WebKit browsers, and some of them may have the option of using each other's caches. I seem to remember a preference setting. Don't quote me on this, though.) Since this new browser has no cached material pertaining to your page, it will have to send a fresh request to the server for each item. And that's where you'll get the 404s (in logs) and missing images (on screen); depending on the page, some styles may also be wrong.
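
If you don't have a spare browser handy, you can fake the empty-cache situation with a script. A rough Python sketch, with made-up resource URLs -- substitute your own old Version I files:

import urllib.request
import urllib.error

# Request each old resource directly, the way a browser with an empty cache
# would, and report the status code the server actually returns.
resources = [
    "http://example.com/old-style.css",          # hypothetical Version I file
    "http://example.com/images/old-banner.jpg",  # hypothetical Version I file
]

for url in resources:
    try:
        with urllib.request.urlopen(url) as response:
            print(url, "->", response.status)
    except urllib.error.HTTPError as e:
        print(url, "->", e.code)  # 404 here means the file really is gone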