
Google SEO News and Discussion Forum

Matt Cutts asks webmasters: let googlebot crawl js and css
tedster




msg:4433747
 12:55 am on Mar 27, 2012 (gmt 0)

In a new video "public service announcement" Matt Cutts asks webmasters to remove robots.txt disallow rules for js and css files. He says that Google will understand your pages better and be able to rank you more appropriately.

[youtube.com...]
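
For reference, the kind of robots.txt disallow rules being discussed - the ones Google is asking webmasters to drop - look something like this (directory names are only illustrative):

User-agent: *
Disallow: /js/
Disallow: /css/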

 

PCInk




msg:4434190
 11:36 pm on Mar 27, 2012 (gmt 0)

Would you like to look at my piwik.js file and see what black-hat evils I'm hiding? Be my guest. For obvious reasons it isn't blocked to humans


So I can "be your guest" to check if your JavaScript contains no evil? I have to look at the file to check. Correct?

But if GoogleBot can't look at this file to check it contains "no evil", why should they trust your page/site? If they can't trust it, why should they rank it?

realmaverick




msg:4434194
 11:46 pm on Mar 27, 2012 (gmt 0)

I don't disallow my js or css. But what if Google mistakes something in the file for something malicious?

As Matt has already stated, this can help Google better place websites - does it mean they have already been penalising websites based on their JS content?

I use a CMS that has lots of ajax and js, and I have no idea exactly what is contained in those files.

Google are just digging a great big hole.

Samizdata




msg:4434231
 2:09 am on Mar 28, 2012 (gmt 0)

If you are hiding CSS/JS from search engines, you probably have something to hide

Nothing is being hidden - Googlebot is merely being told not to crawl or index the files.

I disallow Googlebot access to images, video and numerous HTML files as well.

Google's other stealth bots - and even their humans - still have unfettered access.

And compliance with robots.txt is entirely voluntary anyway.

...

lucy24




msg:4434242
 2:47 am on Mar 28, 2012 (gmt 0)

But if GoogleBot can't look at this file to check it contains "no evil", why should they trust your page/site? If they can't trust it, why should they rank it?

Because the contents of the file are none of their ### business, that's why. You might just as well say that if an /images directory is roboted-out, then google has no choice but to disregard any and all html pages that use those images. After all, you don't know what it might be a picture of-- or what else might be lurking in the directory.

Anyway, javascript files do not exist in a vacuum. They contain functions that are called by the originating document.

So here's the googlebot, puttering along through /fonts/font_name.html. (I pulled one at random.) Header says there are two associated javascript files: one in the /fonts/ directory, one in the /piwik/ directory. The first is fair game, the second is off-limits.

Further puttering reveals a call to function lookForFont(name). Is it in the javascript file in the /fonts/ directory? The one the robot is perfectly welcome to inspect? Why, yes it is. And so are all the subsidiary functions called by that first function. (Actually there aren't any, because this is the "root" function. But if there were others, they would be in the same file.)

Putter, putter.

  var piwikTracker = Piwik.getTracker("http://www.example.com/piwik/piwik.php", 1);
  piwikTracker.trackPageView();
  piwikTracker.enableLinkTracking();


I can't get into this folder! There must be something bad going on! Obviously all that "getTracker" business is just a smokescreen for some nefarious hanky-panky, cleverly concealed behind the tracker's default wording.

What to do, what to do?

I know! (Still speaking as the googlebot here.) I'll ask my pal Preview; he goes everywhere.

What the ###? Preview reports getting a consistent 403 slammed in his face. Is nothing sacred?

I wonder if they follow their own suggestions

I am struck by the number of "Allow" lines. What proportion of robots know what this means?

Sgt_Kickaxe




msg:4434250
 4:19 am on Mar 28, 2012 (gmt 0)

Nothing personal G, I don't let the oil change guy take a look at my valves either.

Curious though: with css blocked, how are you generating such pretty website previews without ignoring the block?

gethan




msg:4434253
 4:36 am on Mar 28, 2012 (gmt 0)

Hey Matt - I'll unblock my js when google stops hotlinking my images on image search. Deal?


The reason I disallow robots from js directories is that the spiders execute various scripts which are intended only for real visitors - and start running off following links to do with social interactions... causing ridiculous load.

The reason other sites do it might be for black hat seo purposes - but then wouldn't they just serve google one js and visitors another...

CainIV




msg:4434254
 4:36 am on Mar 28, 2012 (gmt 0)

But seriously, remind me again why Google needs to crawl these files? What is there to process in those files?

lucy24




msg:4434265
 4:54 am on Mar 28, 2012 (gmt 0)

with css blocked how are you generating such pretty website previews without ignoring the block?

Pay attention, willya? ;) Preview is not a robot and therefore is not bound by robots.txt. It doesn't even look at it. The only way to keep Preview out of css-- or anything else-- is by physically 403'ing it.
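
A minimal sketch of that kind of 403 rule, assuming Apache 2.2 with mod_setenvif, and assuming the preview fetcher identifies itself with a "Google Web Preview" user-agent string (that UA substring is an assumption):

# deny the preview fetcher by user-agent; the UA substring is an assumption
SetEnvIfNoCase User-Agent "Google Web Preview" block_preview
Order Allow,Deny
Allow from all
Deny from env=block_preview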

Andy Langton




msg:4434302
 9:03 am on Mar 28, 2012 (gmt 0)

Google have been grabbing robots-excluded CSS and JS for years. They have a bot that pretends to be a browser to do this, if I remember my log files correctly.

So, not really sure why this announcement. I suppose it's related to the "headless browser" crawling strategy, and getting a visual idea of textual prominence, as opposed to the previous matching of potentially dubious techniques.

But it does seem a bit of an overstretch. Bandwidth isn't free!

blend27




msg:4434303
 9:10 am on Mar 28, 2012 (gmt 0)

" a short public service announcement", what is it?, Cutts works for a Utility company now? what is crawling CSS and JS by a computer has to do with public service?

I generate JS and CSS on-the-fly based on several factors of the human visiting my sites. Just much for the visitor & display the page.

Tracking, display, honeypots(yes for scrapers) are in one directory called "assets".

User-agent: *
Disallow: /assets/

No referrer when accessing the assets folder? Banned on the spot after 3 retries on most of the sites.
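
For what it's worth, a rough sketch of that referrer check (Apache with mod_rewrite assumed, in the site's .htaccess; the "3 retries" counting isn't shown, and the paths are just examples):

RewriteEngine On
# forbid requests for the assets folder that arrive with no referrer
RewriteCond %{REQUEST_URI} ^/assets/
RewriteCond %{HTTP_REFERER} ^$
RewriteRule .* - [F]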

So you want my "assets"? Then the PSA is this: PAY ME. Plus, give me back my keywords.

What's with the Barney & Friends color tee shirt anyway?

triggerfinger




msg:4434424
 2:51 pm on Mar 28, 2012 (gmt 0)

Why don't they just disregard robots.txt? If they care so much about it...
"Rank appropriately"? Just rank the dang content and be done with it. Yikes.

tedster




msg:4434430
 3:11 pm on Mar 28, 2012 (gmt 0)

There's an answer for that - Google wants to rank the page, and not just "content" in the abstract. After all, visitors are interacting with the entire page in all its qualities. If you do any A/B testing, you know that CSS strongly affects time on page, conversions, etc.

rlange




msg:4434431
 3:13 pm on Mar 28, 2012 (gmt 0)

triggerfinger wrote:
Just rank to dang content and be done with it.

Scrapers would love that. Collect all the good content in one place and dominate the SERPs for any topic.

--
Ryan

Samizdata




msg:4434441
 3:31 pm on Mar 28, 2012 (gmt 0)

Google wants to rank the page, and not just "content" in the abstract.

I thought Google was a search engine, not a beauty pageant.

I want the page that best answers my search query.

I don't care what it looks like.

...

tedster




msg:4434442
 3:35 pm on Mar 28, 2012 (gmt 0)

Well, as we've observed here many times - you and I are not the average search user. And Google is quite focused on what the average user responds to, not the power user.

johnmoose




msg:4434479
 5:30 pm on Mar 28, 2012 (gmt 0)

In Feb 2009 Matt answered a question from someone on the same subject, so it's not new.

[youtube.com ]

potentialgeek




msg:4434481
 5:30 pm on Mar 28, 2012 (gmt 0)

Visual analysis seems like a natural progression of the Google algo.

Maybe they discovered this after in-house A/B testing with Google Website Optimizer!?

enigma1




msg:4434494
 6:14 pm on Mar 28, 2012 (gmt 0)

In my view the main problem with this is server resources.

Perhaps google wants to do additional validation on the site scripts, and that has some benefits for webmasters too - like getting a report of non-existent css or js files which we can fix, etc.

However, the bandwidth wasted doing that will be tremendous. Plus other spiders will surely follow this approach, which may outweigh the benefits. Of course we could utilize compression, caching etc. for these scripts on the server, but I don't know if the number of extra connections alone will interfere with the site's functionality.
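
A rough sketch of that kind of caching and compression for static scripts and stylesheets (Apache with mod_expires and mod_deflate assumed; the lifetimes are just examples):

<IfModule mod_expires.c>
  ExpiresActive On
  ExpiresByType text/css "access plus 1 week"
  ExpiresByType application/javascript "access plus 1 week"
  ExpiresByType text/javascript "access plus 1 week"
</IfModule>
<IfModule mod_deflate.c>
  AddOutputFilterByType DEFLATE text/css application/javascript text/javascript
</IfModule>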

Personally I don't block the site scripts, because there is no need yet as googlebot won't crawl them, but we will see how this evolves and what resources it will consume. I think at this point Matt owes us a better explanation of the topic and what the extra accesses will be used for. Is it for some file validation, in-depth cross-referencing of code, indexing, etc.?

As a side note, yahoo does something like that and it is terrible at the moment, from what I see in my logs. Then again, they don't follow the cache headers.

agent_x




msg:4434562
 9:30 pm on Mar 28, 2012 (gmt 0)

I don't really see any issue with bandwidth; in my experience so far Google only seems to fetch CSS files once every couple of weeks or so. It's not something you tend to change that often, after all.

lucy24




msg:4434616
 10:59 pm on Mar 28, 2012 (gmt 0)

Any way to make The People Up Top grasp that css and js are not the same thing? Style sheets will always concern the appearance of your page. If it says

{background-color: #00F; color: #0F0;}

then that might reasonably affect the user's perception of the site.

Javascript may or may not have anything to do with the appearance of the page. If it serves some other purpose, they don't need to crawl it. "Look but don't index" is not an option. You can't just slap a "noindex" at the top the way you do in html.

Same for things like robots.txt.* Maybe you can do some jiggery-pokery if you've got your own server. But for most people, if it can be crawled it will be indexed.

Query for those who use Google Analytics: Does the googlebot follow those links? Does it index what it finds at the far end? (Or can you even tell? piwik lives on your own server, so you can see when a robot comes snuffling around.)


* Oi! 81.104.117.7! Did you think I wouldn't see you poking around in there? ;)

phranque




msg:4434621
 12:05 am on Mar 29, 2012 (gmt 0)

You can't just slap a "noindex" at the top the way you do in html.

Using the X-Robots-Tag HTTP header:
http://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag [developers.google.com]
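
For example, a minimal sketch (Apache with mod_headers assumed) that lets js and css be crawled but sends a noindex directive with every response:

<IfModule mod_headers.c>
  <FilesMatch "\.(js|css)$">
    Header set X-Robots-Tag "noindex"
  </FilesMatch>
</IfModule>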

lucy24




msg:4434669
 4:30 am on Mar 29, 2012 (gmt 0)

Cool. Thanks.

Robots meta tags and X-Robots-Tag HTTP headers are discovered when a URL is crawled. If a page is disallowed from crawling through the robots.txt file, then any information about indexing or serving directives will not be found and will therefore be ignored. If indexing or serving directives must be followed, the URLs containing those directives cannot be disallowed from crawling.


Know what's scary? To the folks at g### who wrote that paragraph, it is perfectly reasonable and logical. They seem to be saying:

You're not allowed to shut your door. You have to allow everyone into your house so you can ask them individually not to tell anyone what they've seen.

Or am I reading it backward? Wouldn't be the first time.

tedster




msg:4434670
 4:39 am on Mar 29, 2012 (gmt 0)

It seems pretty straightforward to me. Regular googlebot respects robots.txt directives. If robots.txt says "don't crawl this URL" then they'll never see any robots meta tag or X-Robots directive associated with that URL.

That's a straightforward technical reality. If you want a robots meta tag or X-robots to be seen, then you've got to let it be crawled. The essence is this:

robots.txt is about crawling
robots meta tag is about indexing

phranque




msg:4434675
 5:48 am on Mar 29, 2012 (gmt 0)

the problem with the robots exclusion protocol is that there is no way to specify "don't crawl this url AND don't index this url."
it has been suggested before and would be very simple to have a Noindex directive for robots.txt that uses the same syntax as the Disallow.
that crabby guy set up a test for this and in 2008 through mid-2009 googlebot was respecting this experimental/undocumented directive.
so we know they have it in them to do it if they want...
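
for illustration, the experimental directive would have looked something like this - same syntax as Disallow, with the path just an example; google never officially documented or guaranteed support for it:

User-agent: Googlebot
Noindex: /piwik/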

i think the answer is to allow crawling and respond to bot requests with the "X-Robots-Tag: noindex" header and a "[Not Provided]" payload in the js/css file.

Seb7




msg:4434684
 7:27 am on Mar 29, 2012 (gmt 0)

I'm guessing they want to crawl them so as to see how the pages render, then analyse the DOM. Good for getting more metrics, like how many ads are appearing above the fold?

lucy24




msg:4434695
 8:20 am on Mar 29, 2012 (gmt 0)

If you can't crawl it, you don't have to be told not to index it, because you don't know anything about its content. You can only note the fact that it exists. How well does a completely blank, untitled page do in searches?

phranque




msg:4434729
 11:41 am on Mar 29, 2012 (gmt 0)

if someone links to your content in an unflattering manner you might not like how google shows that url & snippet in the index.
the snippet might be url-only but it might also be constructed from the anchor text and/or context of inbound links.
robots.txt exclusion might also prevent canonicalization efforts since an excluded url will not be requested and therefore will not see a redirect response.
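
a rough illustration (urls are hypothetical): with the exclusion below in place, googlebot never requests the old url, so it never sees the 301 that was meant to canonicalize it.

# robots.txt
User-agent: *
Disallow: /old-page.html

# apache (mod_alias assumed) - the redirect an excluded url never gets to show
Redirect 301 /old-page.html http://www.example.com/new-page.html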

Andy Langton




msg:4434731
 11:49 am on Mar 29, 2012 (gmt 0)

Further to phranque's comments, it's perfectly possible to rank a robots-excluded URL, and if needed Google will take the word of third parties as to what content it contains - primarily anchor text.

Links alone can be more than sufficient for many queries - particularly longer tail ones. But those can also include smaller brand names and other searches which are important in their own right - even if they are not particularly popular or competitive.

There's also a difference between a URL that has always been robots-excluded, and one that Google has already captured but is excluded at a later date. Google keeps history on URLs, and that history can still have an impact after they are excluded via robots.txt directives.

rlange




msg:4434778
 1:08 pm on Mar 29, 2012 (gmt 0)

lucy24 wrote:
You can only note the fact that it exists.

Not even that, really. You could assume that it exists because you found a link pointing to it, but it still might result in a 404.

--
Ryan

enigma1




msg:4434823
 2:33 pm on Mar 29, 2012 (gmt 0)

in my experience so far Google only seems to fetch CSS files once every couple of weeks or so. It's not something you tend to change that often after all.

So far I haven't seen the bot access css or js in my logs. Also, a site may have lots of different js or css files loaded depending on the page accessed. It's not as if every page uses the same files.

Samizdata




msg:4434944
 7:32 pm on Mar 29, 2012 (gmt 0)

So far I haven't seen the bot to access css or js in my logs

The user-agent used for checking restricted files is never Googlebot.

But the IP addresses of the non-indexing stealth bots that do it resolve to Mountain View.

Google has done this for years, but I can't remember Matt Cutts ever talking about it.

If he is now asking webmasters to make changes to help his company combat spammy results, as I believe he is, then saying so directly would be a much more honest and sensible way of going about it.

And put it in Webmaster Guidelines, not on YouTube.

...
