
Forum Moderators: Robert Charlton & goodroi


Google Updates Webmaster Guidelines: Crawling Page Assets May Help SEO

     
6:19 pm on Oct 27, 2014 (gmt 0)

Administrator from GB 

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month Best Post Of The Month

joined:May 9, 2000
posts:26361
votes: 1034


Google has just updated its Webmaster Guidelines, and it makes it abundantly clear that you should allow Googlebot access to page asset files.

It seems your SEO and ranking efforts may be harmed if you fail to allow Googlebot to crawl, and correctly index, your CSS and JavaScript files. Therefore, your SEO may benefit from permitting the crawling.

As our crawling and indexing systems improved to render pages as part of normal indexing, today we're updating our webmaster guidelines to reflect that. Our new guidelines explain you should allow Googlebot to crawl the page assets (CSS, JavaScript, etc) so that we can index your content properly.

Let me be super clear about what this means: By blocking crawling of CSS and JS, you're actively harming the indexing of your pages. It's the easiest SEO you can do today. And don't forget your mobile site either!

Google Updates Webmaster Guidelines: Crawling Page Assets [plus.google.com]


Here's more on the announcement.

We recently announced that our indexing system has been rendering web pages more like a typical modern browser, with CSS and JavaScript turned on. Today, we're updating one of our technical Webmaster Guidelines in light of this announcement.

For optimal rendering and indexing, our new guideline specifies that you should allow Googlebot access to the JavaScript, CSS, and image files that your pages use. This provides you optimal rendering and indexing for your site. Disallowing crawling of Javascript or CSS files in your site’s robots.txt directly harms how well our algorithms render and index your content and can result in suboptimal rankings.

Updating Google Technical Webmaster Guidelines [googlewebmastercentral.blogspot.com]
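
In practice that just means not disallowing your asset directories in robots.txt - or adding explicit Allow lines for them. A rough sketch (the directory names are only examples, nothing Google prescribes):

User-agent: Googlebot
Allow: /css/
Allow: /js/
Allow: /images/
Disallow: /admin/

For Googlebot the more specific (longer) rule wins, so an Allow like these overrides a broader Disallow.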
6:27 pm on Oct 27, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


Disallowing crawling of Javascript or CSS files in your site’s robots.txt directly harms how well our algorithms render and index your content and can result in suboptimal rankings


If this is the only consequence - i.e. that Googlebot could render your page better if it could read and execute JavaScript - then fine.

But let's hope Google will not actively penalize sites that block some JavaScript or CSS.

For example, I am blocking some JavaScript files because that keeps Googlebot away from all these URLs with dates in them, since I create such URLs via JavaScript.

I know I can block them in robots.txt or noindex them, but why let Google go there at all and spend valuable crawl budget on URLs that have no value, or index blocked URLs that again have no value? Better to keep Googlebot oblivious to these, since the only thing that differs is the date in the URL, and the page without the date is allowed to be indexed.
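
If I did unblock the scripts, the URL-level block would look something like this (the /archive/ path and date parameter here are made-up examples, just to show the pattern):

User-agent: Googlebot
Allow: /js/
Disallow: /archive/*?date=

That leaves the script fetchable for rendering while the dated URLs stay out of the crawl.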
6:36 pm on Oct 27, 2014 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2000
posts:12388
votes: 409


Member levo noticed the change back in September, well before the official announcement, so we've already had some observation and brief discussion of this in this earlier thread....

Google Changes How it Handles Pages
Sept 1, 2014
http://www.webmasterworld.com/google/4699271.htm [webmasterworld.com]

Let's continue discussion on this thread, as the guidelines provide a perspective we didn't have before.
7:07 pm on Oct 27, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member editorialguy is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:June 28, 2013
posts:3476
votes: 780


Before worrying too much about this, I'd look at the intent behind the change: to let Google "render pages as part of normal indexing."

Are you letting Googlebot see your pages as users see them? Then you're probably fine, unless there's something on your pages that you don't want Google to see.
7:13 pm on Oct 27, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member ogletree is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 14, 2003
posts: 4319
votes: 42


I wonder if this means that they are now following JS redirects that were used to avoid passing a penalty.
7:26 pm on Oct 27, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member aristotle is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 4, 2008
posts:3660
votes: 373


In the other thread referenced by Robert Charlton:

Planet13 wrote:
Hopefully that means those sites with the annoying CSS-based pop-over ads that take up nearly the full size of the screen will start to lose rankings.

I started a thread about this same problem several months ago: Should popups be a negative ranking factor? [webmasterworld.com]
My main complaint then was that some of these popup ads can't be closed, thereby making it difficult if not impossible to see most of the content on the page. But whether they can be closed or not, they are always a big annoyance. So along with Planet13 and others, I also hope that this new capability will enable Google to finally take action.
8:23 pm on Oct 27, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 7, 2006
posts: 1129
votes: 140


I don't think there is a valid argument against allowing googlebot access, as - if you don't - Google has no legitimate way of knowing whether what you are restricting access to is good or bad.

However, it looks to me that I will have to rethink my js navigation menu: it was kept from the bots for a "good" reason (I don't want them to see an "all pages can access all pages" menu, as it obfuscates site structure). It doesn't "hide" anything, in that all pages are accessible through the breadcrumb menu (which preserves site structure), but I recognise that a closed door is a closed door, however innocuous the contents of the room might be.

All the same, the effect is clear enough: damned if I do, and damned if I don't.
8:44 pm on Oct 27, 2014 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4504
votes: 347


One reason they want to have access to the supporting files is to be able to determine whether your site should be included in their mobile SERPs. They make it clear in their push for responsive design that if they can't read your CSS, it may limit your site's visibility in mobile search.

Also, in AdSense, if you want their ads to use mobile-friendly asynchronous loading, they need to determine the size of the area you have made available for the ads.

They have been pushing to get us to remove "render-blocking" .js files from the <head> and use minified off-page resources placed at the closing </body> tag instead wherever possible. They want everything coded for the fringe 3G connection as much as possible. A sharp turn for most sites.
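
For what it's worth, the change they are pushing for looks roughly like this - the minified script loaded just before </body> (or marked defer) instead of as a blocking tag in the <head>; the file names are placeholders:

<head>
  <link rel="stylesheet" href="/css/site.min.css">
</head>
<body>
  ... page content ...
  <script src="/js/site.min.js" defer></script>
</body>

None of it changes what Googlebot is allowed to fetch - it only changes when the browser fetches it.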
9:56 pm on Oct 27, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15869
votes: 869


:: detour to look up assorted things ::

Why do they want to see the scripts associated with a noindexed page? (I have several on my personal site.) Are they looking for links?

Does anyone else use a third-party analytics package such as piwik? My first thought every time I hear about search engines and js is "fine, but they're not pawing around in my piwik files!" Now I've looked pretty exhaustively and I'm ### if I can find any case of a major search engine asking for /piwik/ in any form. This can't have been the case since Day 1, or I wouldn't have needed to block them. (Weird but true: the only exception I can find in logs is the bingbot from only two days ago, requesting the noscript version of piwik.php. They may just be cleaning house, though.)
10:45 pm on Oct 27, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 7, 2006
posts: 1129
votes: 140


They want everything coded for the fringe 3G connection as much as possible.


Wanting that is fair enough. However, I hope the fact that non-compliance "can result in suboptimal rankings" does not mean that de-lousing will follow resettlement.
11:16 pm on Oct 27, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Feb 25, 2004
posts:1003
votes: 47


I guess I've answered my own question but I would appreciate someone confirming my findings for me.

I'm using a Drupal module called "Collapse Text."
(There's a Wordpress plugin called "Collapse-O-Matic" that is similar.)

The idea is you use this module to create "read more" accordion-type boxes that open up and reveal their text when you click on them.
This action is javascript based.

In the past, I was under the impression that Google could index the text in the accordions.
(The rationale was that Google didn't use javascript when spidering and thus it saw all of the boxes as being expanded.)

I'm assuming this new announcement changes that?

I've gone to WebmasterTools and "fetched and rendered" a page.

When I look at the "Rendering" results I see a graphic of my page with the accordion boxes closed (their text is not visible).
However, when I look at the "Fetching" results I see the page's html and the text in question is present.

When I do a Google search for my page's URL plus some of the text contained in accordion boxes, I can't find any evidence that the text has been indexed.
(I see a cache date of 10/19/2014.)

I'm assuming that this recent Google change regarding javascript means that accordion-style pages (about us, our mission, contact us, etc...) are no longer appropriate for getting content indexed?

Help understanding this appreciated.
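
To make the setup concrete, the pattern is roughly this - the text is right there in the HTML source, but a script hides it on load (a generic sketch, not the actual markup the Collapse Text module produces):

<div class="collapse-trigger">Read more</div>
<div class="collapse-body">The accordion text that may or may not get indexed...</div>
<script>
  // hide every accordion body on load; reveal it again on click
  document.querySelectorAll('.collapse-body').forEach(function (el) {
    el.style.display = 'none';
  });
  document.querySelectorAll('.collapse-trigger').forEach(function (el) {
    el.addEventListener('click', function () {
      var body = el.nextElementSibling;
      body.style.display = (body.style.display === 'none') ? '' : 'none';
    });
  });
</script>

A text-only crawler sees the full text; a rendering crawler sees it hidden - which is exactly what I'm asking about.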
12:51 am on Oct 28, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2004
posts:1995
votes: 75


/blog/month_blogs.cfm?d=1&m=5&y=1328

Can anyone guess what the data in the query string means?

Yes, G Bot went all the way to May 1st of 1328, hungry to index everything it can.

That /blog/ has a calendar display, an old-school one, with a link for every day of that year too.

I'd be careful about allowing JS to be crawled.
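
If scripts like that do get opened up, at least the crawl trap itself can still be fenced off at the URL level with a query-string disallow - roughly:

User-agent: *
Disallow: /blog/month_blogs.cfm?

That way the asset files can be fetched for rendering while the bottomless calendar stays out of the crawl.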
12:58 am on Oct 28, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member billys is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:June 1, 2004
posts:3181
votes: 0


When I run this test (Fetch and Render in Webmaster Tools), I get "Partial" and the offenders are JavaScript and CSS files from Google and AddThis. The reason shown is "Blocked."

I didn't think we could control this through robots.txt - is anyone else seeing the same?
1:21 am on Oct 28, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2004
posts:1995
votes: 75


I don't think there is a valid argument against allowing googlebot access, as - if you don't - Google has no legitimate way of knowing whether what you are restricting access to is good or bad.

Yes there is, and here is one.

One of my older sites was hacked as part of a wider server hack. A directory called Assets was disallowed in robots.txt from the get-go.

Here is the kicker: according to some online sources that track backlinks to your site, over 80,000 backlinks were generated within 3 months from other hacked sites, pointing to /assets/css(js)/junk/kjhsd/jh/uiywer.html pages that were created. 80,000+.

This was a one-page site before, just some basic info. It had /assets/img, /assets/css, /assets/js, /assets/fonts folder references.

Each of those folders, as part of the hack, was compromised, with hundreds of spam folders/pages generated.

mods:danger: Try this on GOOG: "moncler2014 hack"

Imagine the obvious?
3:49 am on Oct 28, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


However, it looks to me that I will have to rethink my js navigation menu

@Wilburforce
Please do share if you do. I will have the same problem. So far the only thing I can see is to actually leave it "on click".

A question - if Google is not indexing closed carousels / closed tabs (which can all be easily revealed on click, but do not generate a separate URL), doesn't this mean that they will suddenly lose quite a bit of web content they currently have indexed?

Unless the "hidden" content being dropped out of index is just temporary, to force webmasters to unblock javascripts.

Or are we saying:

- you have a slider - only the first image / whatever is in slider is indexed
- you have a tabbed interface - only the content of the first (visible) tab is indexed
- you have a carousel - only the content of the first (visible) carousel element is indexed

If so, something is wrong here.
6:12 am on Oct 28, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 16, 2009
posts:1087
votes: 83


Google has no legitimate way of knowing whether what you are restricting access to is good or bad

Ah, the old 'if you've nothing to hide, you've nothing to fear' argument.

Those who don't play by Google's rules will no doubt be completely stymied by this move and will certainly not start cloaking to hide Javascript use.

And shouldn't sites with excessive popups and ads have poor usability and engagement metrics in the first place? If they don't, perhaps that means that the users are happy with the trade-off and it's not Google's decision to make?

I see potential for a lot of collateral damage from this move depending on the severity of the punishment.

if Google is not indexing closed carousel / closed tabs (which can all be easily revealed on click, but do not generate a separate URL) doesn't this mean that they will suddenly lose quite bit of web content they currently have indexed?

This seems to be the case from reading levo's earlier thread, but it also seems likely that granting access to the script file should fix this.
6:28 am on Oct 28, 2014 (gmt 0)

Junior Member

5+ Year Member

joined:May 16, 2014
posts:141
votes: 0


So good rankings are pretty much guaranteed if you are using https and allow access to asset files?

Oh wait, https is just a minor thing that really won't have much impact in the foreseeable future regardless of all the hype surrounding it. But unless Google has access to asset files tomorrow your sites are toast.

Or at least until they announce it will be a very minor ranking factor. I block things because I don't want them crawled; trying to force anything different shows no respect for how I want my content shown.

Lately it seems they are writing the manual on how to make adversarial information retrieval really adversarial. Not a good long term plan.
8:02 am on Oct 28, 2014 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2000
posts:12388
votes: 409


Regarding the use of javascript to encode links, I'm reminded of this discussion from a few years back where Matt Cutts and Eric Enge were talking about PageRank sculpting using iFrames and javascript, and Matt made a comment, whose time perhaps has finally arrived, about bots and search engine users "traveling the same direction"....

Iframe Links: Do They Pass Page Rank?
March, 2011
http://www.webmasterworld.com/google/4282550.htm [webmasterworld.com]

I had quoted the following in the thread. Note that Matt says (in two different ways) that Google wants "the links and the pages that search engines find" to correspond...

Eric Enge: If someone did choose to do that (JavaScript encoded links or use an iFrame), would that be viewed as a spammy activity or just potentially a waste of their time?

Matt Cutts: ...In my experience, we typically want our bots to be seen on the same pages and basically traveling in the same direction as search engine users. I could imagine down the road if iFrames or weird JavaScript got to be so pervasive that it would affect the search quality experience, we might make changes on how PageRank would flow through those types of links.

It's not that we think of them as spammy necessarily, so much as we want the links and the pages that search engines find to be in the same neighborhood and of the same quality as the links and pages that users will find when they visit the site.
8:26 am on Oct 28, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 7, 2006
posts: 1129
votes: 140


Ah, the old 'if you've nothing to hide, you've nothing to fear' argument.


Yes, and hiding it "can result in suboptimal rankings".

Personally:

1. I don't like it and
2. I find their euphemisms chilling.

My earlier point, however, was that requiring access to all page assets isn't altogether unreasonable in the context of their overall policy that you shouldn't show users one thing and search engines another. Even my own - in my view reasonable and innocent - blocking of access to js files could be construed as gaming the system: I want to hide potential massive dilution of PR, and direct more internal link juice to top-level pages than to footnotes.

I think Blend27's point about hacking is the best objection I have seen so far to the lack of counter-argument, but I'm not sure it works: hackers are not going to obey robots.txt anyway. It is a "Do Not Disturb" sign, not an impenetrable lock (although by that analogy it is my house, not a room in Google's hotel).

@aakk9999

I don't think putting it "on click" will work for my menu as it is, although it might work - and might also be a better menu - if I change it from drop-down to block. If you come up with anything sooner than I do, please let me know.

Robert Charlton has referred the earlier menu discussion back here, but for anyone else with a similar problem, see [webmasterworld.com].
8:38 am on Oct 28, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 7, 2006
posts: 1129
votes: 140


it is my house, not a room in Google's hotel


And, thinking about it, that is the nub of this and every other problem that Google has beleaguered us with in recent years: their attitude to the internet is proprietorial.
8:28 pm on Oct 28, 2014 (gmt 0)

Preferred Member

10+ Year Member

joined:Apr 24, 2004
posts:391
votes: 0


Is anyone aware of any online tools that can easily check whether Google Bot does have access to the "page asset files" that are discussed here?
8:48 pm on Oct 28, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member ogletree is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 14, 2003
posts: 4319
votes: 42


Unless you went out of your way to block them they would not be blocked. That is not something you do by accident.
10:18 pm on Oct 28, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


Is anyone aware of any online tools that can easily check whether Google Bot does have access to the "page asset files" that are discussed here?


As ogletree said above.

Or if you want to verify it, you can use the WMT Fetch as Googlebot feature on the particular asset file you want to check Googlebot's access to.
11:00 pm on Oct 28, 2014 (gmt 0)

Preferred Member

10+ Year Member

joined:Apr 24, 2004
posts:391
votes: 0


Well I do block certain directories in my CMS including a templates directory and a plugins directory. I assume something may be blocked that Google is interested in seeing, though it would be nice if there was something that easily showed that.
12:17 am on Oct 29, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2004
posts:1995
votes: 75


I think Blend27's point about hacking is the best objection I have seen so far to the lack of counter-argument, but I'm not sure it works: hackers are not going to obey robots.txt anyway.

Let me reiterate here:

Access to Assets was blocked in robots.txt. Same setup as on WebmasterWorld. The only folder on the site was the Assets folder (with sub-folders). When they (the hackers) penetrated the site, that folder was blocked in robots.txt. So Goog and other white-listed bots obeyed robots.txt and did not crawl the URLs, despite the fact that there ARE thousands of URLs pointing to the newly created spam pages within Assets.

Another site, on the same server, same hack: 19,000 URLs indexed, all cached (I know iBill, I know), 209,000 back-links. grrrrrrr.

I would never expect myself to pay attention to the CSS, JS, IMG folders, because those don't get updated much after I am done with the layout.

My Bad, I know.
12:31 am on Oct 29, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Sept 14, 2011
posts:1045
votes: 132


Google needs to crawl these files, I suppose, to properly value a site. No big deal - I have opened them up in robots.txt.

Regarding hacks, @blend27, perhaps there should be a change to robots.txt where you can say these files/directories are allowed to be crawled but should not be indexed.

Something like...


User-agent:*
Allow-noindex: /scripts/
Allow-noindex: /js/

...etc
12:40 am on Oct 29, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2004
posts:1995
votes: 75


+n, valid point, in any event, too late in case...
1:18 am on Oct 29, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15869
votes: 869


Allow-noindex: /scripts/

If it isn't practical to edit all pages in a particular directory, you can set a noindex header (X-Robots-Tag) on the entire directory, or on specific filetypes within the directory. I've got a sitewide noindex on all \.(css|js) * files.


* Yah, sure, \.(j|cs)s but it looks too weird.
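
In Apache terms that's nothing more than a FilesMatch block in the directory's .htaccess (or the server config) - a minimal sketch, assuming mod_headers is enabled:

<FilesMatch "\.(css|js)$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>

The files stay crawlable for rendering; they just never show up as indexed URLs in their own right.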
7:41 am on Oct 29, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 16, 2009
posts:1087
votes: 83


Here's John Mueller back in June on a way to allow Google to crawl but not index script files: [webmasters.stackexchange.com...]
11:58 am on Nov 26, 2014 (gmt 0)

Full Member from ES 

10+ Year Member Top Contributors Of The Month

joined:Jan 20, 2004
posts: 347
votes: 25


Is anyone aware of any online tools that can easily check whether Google Bot does have access to the "page asset files" that are discussed here?


First, as mentioned above, is Google Webmaster Tools' "Fetch and Render" tool.

You'll also get another, surprising, perspective with Google's PageSpeed Insights tool.

I have all assets - JS, CSS and images - blocked to Google on a couple of my sites. And yet the PageSpeed Insights tool renders them perfectly - mobile and desktop.