
Matt Cutts asks webmasters: let googlebot crawl js and css

     
12:55 am on Mar 27, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


In a new "public service announcement" video, Matt Cutts asks webmasters to remove robots.txt disallow rules for js and css files. He says that Google will understand your pages better and be able to rank you more appropriately.

[youtube.com...]
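For anyone wondering what to change: the kind of disallow rules he is talking about typically look something like this (hypothetical paths; yours will differ):

User-agent: *
Disallow: /js/
Disallow: /css/

Removing those lines, or adding explicit Allow rules for Googlebot (which supports Allow and wildcards), is all the video is asking for:

User-agent: Googlebot
Allow: /*.js$
Allow: /*.css$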
7:32 pm on Mar 29, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Aug 29, 2006
posts:1312
votes: 0


So far I haven't seen the bot access css or js in my logs

The user-agent used for checking restricted files is never Googlebot.

But the IP addresses of the non-indexing stealth bots that do it trace back to Mountain View.

Google has done this for years, but I can't remember Matt Cutts ever talking about it.

If he is now asking webmasters to make changes to help his company combat spammy results, as I believe he is, then saying so directly would be a much more honest and sensible way of going about it.

And put it in Webmaster Guidelines, not on YouTube.

...
8:09 pm on Mar 29, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Apr 30, 2007
posts:1394
votes: 0


Are these accesses you're talking about from the bot, or are they due to a manual review or some other validation mechanism?

I was talking about the ones where Googlebot indexes pages normally and has the UA set. And yes, I don't know what they have in mind; they haven't said much about it yet.
8:17 pm on Mar 29, 2012 (gmt 0)

Moderator This Forum from GB 

WebmasterWorld Administrator andy_langton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 27, 2003
posts:3326
votes: 133


The bots I've seen grabbing JS/CSS from Google use a faked "normal" browser, and it's a fairly routine activity if you disallow those things.
10:25 pm on Mar 29, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13210
votes: 347


Are these accesses you're talking about from the bot, or are they due to a manual review or some other validation mechanism?

Can we stipulate that g### is not manually reviewing my site?

:: detour to random chunk of logs, containing the following requests (look at the proportions, not at the absolute numbers) from 66.249.nn.nn ::

I'll be ###. Google is evolving before our very eyes. Even two months ago when I was watching robot behavior closely, this is not the pattern I would have seen.

1 request for sitemap from Googlebot
6 requests for robots.txt from Googlebot (if it had been bingbot, there would have been 60 :))

62 pages:
--47 from the regular Googlebot
--14 from Googlebot-Mobile
--watch this space

90 images:
--28 Googlebot-Image with no referer
--54 Googlebot with human-style referer
--watch this space

5 stylesheets:
--3 Googlebot with human-style referer
--watch this space

Has everyone figured out what goes in the missing spaces?

66.249.17.123 - - [02/Mar/2012:08:49:35 -0800] "GET / HTTP/1.1" 200 1944 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0)" 
66.249.17.123 - - [02/Mar/2012:08:49:35 -0800] "GET /sharedstyles.css HTTP/1.1" 200 0 "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0)"

(I kinda think this was a mechanical glitch. Filesize 0?!)
66.249.17.123 - - [02/Mar/2012:08:49:35 -0800] "GET /sharedstyles.css HTTP/1.1" 200 2589 "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0)" 
66.249.17.123 - - [02/Mar/2012:08:49:35 -0800] "GET /images/WorldsHeadline.png HTTP/1.1" 200 2589 "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0)"
66.249.17.123 - - [02/Mar/2012:08:49:35 -0800] "GET /images/FunnyFace.jpg HTTP/1.1" 200 5042 "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0)"

(et cetera for the remaining 6 images that live on this page)

Doesn't that MSIE 7 business look just like bing? As long as they keep swiping ideas from google, it's only fair for google to turn around and swipe an idea from them.

If the front page had happened to use any .js files, they would have been picked up too. But not by 66.249; js goes to 74.125. Which, incidentally, seems to be turning into g###'s poor relation. When it isn't going around with no clothes at all as the faviconbot, you might find it dressed like this:

74.125.19.35 - - <snip> "Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0"


or like this, in its mysterious Preview-less Preview costume:

74.125.63.33 - - <snip> "GET /piwik/piwik.js HTTP/1.1" "http://www.example.com/filename.html" 200 20113 "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11"


If it had not been in plain clothes, it would have got a 403 slammed in its face. Time to fine-tune the htaccess.
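For anyone fine-tuning the same thing, a minimal sketch of the kind of rule that catches the plain-clothes fetches by IP range instead of user-agent (assumes mod_rewrite; the range and file extensions here are illustrative, adjust to taste):

RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^74\.125\.
RewriteCond %{REQUEST_URI} \.(js|css)$ [NC]
RewriteRule .* - [F]

The [F] flag sends the 403 regardless of what costume the user-agent string is wearing.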
5:36 am on Mar 30, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 19, 2004
posts:1939
votes: 0


Lacking any reasonable explanation for why they would need to access those files, my only conclusion is that they are trying to determine more about the links here.
9:17 am on Mar 30, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Apr 30, 2007
posts:1394
votes: 0


Has everyone figured out what goes in the missing spaces?

These IPs belong to Google, but are you certain they are used for the robot? It's very strange.

For instance, I see 66.249.72.nnn trying to log in to my FTP server. So I take it either Googlebot has gone way too far, scanning servers, ports, etc., or some of these IPs are open proxies and it's not Googlebot.

PS: Meaning some systems behind the IPs could be compromised.
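One way to settle whether such an IP is really Googlebot is Google's documented two-step DNS check: a reverse lookup on the IP should give a hostname ending in googlebot.com, and a forward lookup on that hostname should return the original IP. A minimal Python sketch of the check:

import socket

def is_googlebot(ip):
    # Reverse (PTR) lookup: genuine crawler IPs resolve to
    # *.googlebot.com (or *.google.com)
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not host.endswith(('.googlebot.com', '.google.com')):
        return False
    # Forward-confirm: the hostname must resolve back to the same IP
    return ip in socket.gethostbyname_ex(host)[2]

print(is_googlebot('66.249.72.102'))

An open proxy or a spoofer will fail one of the two lookups.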
10:08 am on Mar 30, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13210
votes: 347


Urk. If google's IP has been compromised, nobody is safe. They go everywhere. That's the whole point of being google.

Another off-the-top-of-my-head conjecture: G### is sneaking around in plain clothes, emulating a human, to see whether humans really get what the googlebot thinks they're getting.
10:21 am on Mar 30, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:Apr 27, 2011
posts:97
votes: 0


G### is sneaking around in plain clothes, emulating a human, to see whether humans really get what the googlebot thinks they're getting.


Which is exactly what any reasonable webmaster would expect them to be doing.
10:36 am on Mar 30, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Apr 30, 2007
posts:1394
votes: 0


After reading more about it: the FTP crawling is documented here:
[developers.google.com...]

Weaknesses aside, that's not a good approach to the FTP issue: as far as I can see, the virtual-hosts workaround for FTP is still a draft.
[tools.ietf.org...]
So even if someone wanted to, they could not set up FTP to serve different domains from the same physical resources, serving content to spiders. Not reliably, at least. Google should know that something like this may cause security problems at this time, and should limit it to sites that submit specific port details via their accounts.

[edited by: tedster at 11:43 am (utc) on Mar 30, 2012]
[edit reason] make first link active [/edit]

2:49 pm on Mar 30, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:May 17, 2011
posts: 193
votes: 0


Why does it say it was uploaded on August 18th 2011?
3:21 pm on Mar 30, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Aug 29, 2006
posts:1312
votes: 0


Are these accesses you're talking about from the bot, or are they due to a manual review or some other validation mechanism?

The accesses are never from Googlebot, but they are not human either.

Mountain View pulls, for example, a single JavaScript file, or a couple of CSS files, or whatever else they feel they need to check that they are not being gamed.

exactly what any reasonable webmaster would expect them to be doing

Indeed, the difference being that stealth bots do not add the files to the search index.

The question is, why make this "public service announcement", and why now?

Matt Cutts says "please... unblock... if you can... you don't need to do that now".

If he means that Google will no longer index CSS or JavaScript files, he doesn't say so.

If he means that Google will no longer use unidentified stealth bots, he doesn't say so.

If he means that Google will penalise sites that restrict access, he doesn't say so.

If he means that Google's ranking system is not as smart as they like people to think, and that it is being gamed far too easily, and that he wants webmasters to help his company out by making changes to their sites, he doesn't say that either.

But anything to do with robots.txt compliance is a charade anyway.

...
1:45 pm on Apr 2, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 30, 2002
posts: 1377
votes: 0


Sounds like blocking .js and .css is going to get you your first points towards an over-optimization penalty?
2:31 pm on Apr 2, 2012 (gmt 0)

Full Member

5+ Year Member

joined:Mar 22, 2011
posts:339
votes: 0


HitProf wrote:
Sounds like blocking .js and .css is going to get you your first points towards an over-optimization penalty?

What makes you think that?

--
Ryan
2:51 pm on Apr 2, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 30, 2002
posts: 1377
votes: 0


Google wants to know what your page looks like to the user. If you hide your css and/or js, Googlebot can't see that. They might assume that you are messing around with visibility or content flow. They might not like that, especially if you have done a lot of other things to please Googlebot. Just assuming. This might take a while, but don't be surprised to see Google evolve in this direction over time.
4:16 pm on Apr 2, 2012 (gmt 0)

Full Member

5+ Year Member

joined:Mar 22, 2011
posts:339
votes: 0


HitProf wrote:
Google wants to know what your page looks like to the user. If you hide your css and/or js, Googlebot can't see that. They might assume that you are messing around with visibility or content flow. They might not like that, especially if you have done a lot of other things to please Googlebot. Just assuming.

Is it reasonable to assume that Google's algorithm would apply a penalty because of a lack of positive evidence for any wrongdoing? Even the government in George Orwell's 1984 didn't go that far.

This might take a while, but don't be surprised to see Google evolve in this direction over time.

This isn't an assumption; it's a prediction. I'm curious why you think "disallowing the crawling of external JavaScript and CSS files will result in an over-optimization penalty" is a logical outcome. On what past observations and/or information are you basing this prediction?

Sure, cloaking can result in a penalty, but even that requires some sort of confirmation, like a manual review (if I understand correctly). Google didn't just slap a penalty on your site because your site used JavaScript that Googlebot couldn't understand and therefore might mean you were doing something naughty.

--
Ryan
4:50 pm on Apr 2, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13210
votes: 347


Do you think it's possible to teach google that crawl and index are different things? Not everyone has the capacity to stick a "noindex" tag on something other than an html page.
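For what it's worth, there is a way to put a noindex on something other than an HTML page: the X-Robots-Tag HTTP response header, which Google documents for exactly this purpose. A minimal htaccess sketch (assumes Apache with mod_headers; the file extensions are illustrative):

<FilesMatch "\.(js|css)$">
    Header set X-Robots-Tag "noindex"
</FilesMatch>

Note that this expresses precisely the crawl-versus-index distinction: the file must stay crawlable, because the header is only seen when the bot actually fetches the file.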
5:25 pm on Apr 2, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 30, 2002
posts: 1377
votes: 0


It's just a hunch, I'm curious to hear other opinions. (Somehow I can't make those nice little quote boxes).
9:41 pm on Apr 2, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13210
votes: 347


Somehow I can't make those nice little quote boxes

You can type them in manually as
[quote]blahblah[/quote]
or hit Preview and the buttons will magically appear.

Now, if there's a secret trick to making them say "So-and-so said..." when you're quoting a string of different people and you don't want to put tangor's words into g1smd's mouth...
2:38 am on Apr 3, 2012 (gmt 0)

New User

joined:Apr 3, 2012
posts: 2
votes: 0


I've read the majority of the posts on here, and I think there are plenty of takes on what is right and what is wrong. If Google has decided the bots are smart enough to crawl your .js and .css, then let them do it. For one thing, I just do not think Google would send a crawler through your scripts and style sheets without being capable of interpreting what is in those files. Secondly, if you're hiding something, then it would be a problem. The only open question is the outcome, and that seems to be what everyone is afraid of nowadays. My recommendation: fix what you're doing wrong first, then let them in. But who's to say they aren't doing it already and you just don't know it yet? They could be giving you a warning before the negative effect hits you for running black hat SEO.
10:23 pm on Apr 3, 2012 (gmt 0)

Full Member

5+ Year Member

joined:Mar 22, 2011
posts:339
votes: 0


lucy24 wrote:
Now, if there's a secret trick to making them say "So-and-so said..." when you're quoting a string of different people and you don't want to put tangor's words into g1smd's mouth...

Not that I'm aware of. I add that myself.

--
Ryan
9:53 am on Apr 6, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 24, 2002
posts:894
votes: 0


Today :
/forum.css - 80 - 66.249.72.102 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)

and this is not the first time, though css has been disallowed in robots.txt since day 1, several years ago.

It's pretty hypocritical of Cutts to "ask" to allow their bot to crawl css if they de facto ignore robots.txt in the first place.
1:33 pm on Apr 6, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 13, 2004
posts:825
votes: 10


Google wants to know what your page looks like to the user. If you hide your css and/or js, Googlebot can't see that.


Google IP addresses must be accessing almost all css and js files now. Google's "preview" is likely a foundation for Panda. It would explain the slow update rate of Panda; it must take a lot of horsepower to generate all those previews. With Preview they can parse the Document Object Model (the structure) of all your pages, which means Google can fully understand your page's content structure, js, and css now.

The only thing left for Google to find out is how a site might be serving pages differently via js and css to the many browser types that are out there now, probably especially mobile. They probably can't afford the horsepower to render "previews" for all possible browsers.

So with preview they've already reviewed your css and js files; now they just want to see the variations.

Given that context, it seems less likely Google has many evil intentions in asking webmasters to unblock the few remaining js and css files Google hasn't already reviewed, and fully understood, by parsing the Document Object Model using the "preview" results.

In the past I made one comment on a post about CSS, that being: "How could Google possibly understand CSS?" CSS must be a lot easier to understand than English, Spanish, Japanese, etc., and understanding those is Google Search's goal.
10:11 pm on Apr 6, 2012 (gmt 0)

New User

joined:Apr 3, 2012
posts:2
votes: 0


Should these files become unblocked by someone and then indexed by Google, what about Bing and Yahoo!? How will this affect those results, and what is the likelihood of them requesting the same if they have not already done so?
7:55 pm on Apr 9, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 13, 2004
posts:825
votes: 10


Server side, I embed my CSS in the html source file. JavaScript is directly embedded. Other than photos and ads, my visitors' browsers only have to open one file to see my web page's content (other than a background image). I've figured since the introduction of CSS, Google, and all GOOD search engines, should see the exact content of my website(s).

With GZIP compression, server side, ONE file is the way to go for maximum performance and openness.

For years Google was not looking at CSS. I thought: why hide CSS in files Google does not read? If Google sees every detail of a website, they will have more confidence in their search results, and thereby a site might have a higher ranking. Now, of course, they do want to see CSS.

Unfortunately, at least for one of my sites, after Oct. 13th this openness is not good enough for Google!
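For reference, the compression half of the one-file setup described above is a single line of Apache config (assumes mod_deflate; a sketch, not a complete tuning):

AddOutputFilterByType DEFLATE text/html

With the CSS and JS inlined into the HTML, that one directive gzips the whole page into a single compressed response.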
10:19 pm on Apr 9, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Aug 29, 2006
posts:1312
votes: 0


why hide

Nothing is being hidden.

Disallowing files in robots.txt is a request not to index them.

There is nothing to stop Google accessing and inspecting the files if they want.

And they do so.

...
10:27 pm on Apr 9, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Disallowing files in robots.txt is a request not to index them.

As I understand things, Disallow is a request for the named bot not to CRAWL the file.
11:24 pm on Apr 9, 2012 (gmt 0)

Moderator This Forum from GB 

WebmasterWorld Administrator andy_langton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 27, 2003
posts:3326
votes: 133


Yeah - hence "URL only" listings. The bot knows of the existence of the URL, but (theoretically) not the contents.
12:45 am on Apr 12, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 30, 2002
posts: 2504
votes: 41


The way I look at it now, the Borg (Google) wants to grab CSS and JS so that it can keep users on its advertising portal. The snippets thing is just an attempt to make their advertising portal sticky. Good search engines give people what they want and should have a high bounce rate. But then I'm cynical and I'm not a Google fanboy.

Regards...jmcc
10:32 am on Apr 12, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 24, 2002
posts:894
votes: 0


Today :
/site.css - 80 - 66.249.72.102 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)

That IP number is now put in quarantine, i.e. it gets redirected to robots.txt for a number of visits until it knows what's in it ;o)
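Illustratively, that kind of quarantine can be sketched in htaccess like this (assumes mod_rewrite; the "for a number of visits" counting would need something stateful on top, this just does the redirect):

RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^66\.249\.72\.102$
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* /robots.txt [R=302,L]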
2:15 pm on Apr 12, 2012 (gmt 0)

Full Member

5+ Year Member

joined:Mar 22, 2011
posts:339
votes: 0


jmccormac wrote:
The way I look at it now, the Borg (Google) wants to grab CSS and JS so that it can keep users on its advertising portal.

How is that supposed to work?

--
Ryan
This 103 message thread spans 4 pages.