
Matt Cutts asks webmasters: let googlebot crawl js and css

     
12:55 am on Mar 27, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


In a new "public service announcement" video, Matt Cutts asks webmasters to remove robots.txt disallow rules for js and css files. He says that Google will understand your pages better and be able to rank you more appropriately.

[youtube.com...]
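For anyone wondering what to change: the kind of disallow rules he is talking about typically look something like this (hypothetical paths; yours will differ):

User-agent: *
Disallow: /js/
Disallow: /css/

Removing those lines, or adding explicit Allow rules for Googlebot (which supports Allow and wildcards), is all the video is asking for:

User-agent: Googlebot
Allow: /*.js$
Allow: /*.css$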
7:32 pm on Mar 29, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Aug 29, 2006
posts:1312
votes: 0


So far I haven't seen the bot access css or js in my logs

The user-agent used for checking restricted files is never Googlebot.

But the IP addresses of the non-indexing stealth bots that do it trace back to Mountain View.

Google has done this for years, but I can't remember Matt Cutts ever talking about it.

If he is now asking webmasters to make changes to help his company combat spammy results, as I believe he is, then saying so directly would be a much more honest and sensible way of going about it.

And put it in Webmaster Guidelines, not on YouTube.

...
8:09 pm on Mar 29, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Apr 30, 2007
posts:1394
votes: 0


Are these accesses you're talking about from the bot, or are they due to a manual review or some other validation mechanism?

I was talking about the ones where Googlebot indexes pages normally and has the UA set. And yes, I don't know what they have in mind; they haven't said much about it yet.
8:17 pm on Mar 29, 2012 (gmt 0)

Moderator This Forum from GB 

WebmasterWorld Administrator andy_langton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 27, 2003
posts:3326
votes: 133


The bots I've seen grabbing JS/CSS from Google use a faked "normal" browser, and it's a fairly routine activity if you disallow those things.
10:25 pm on Mar 29, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13210
votes: 347


Are these accesses you're talking about from the bot, or are they due to a manual review or some other validation mechanism?

Can we stipulate that g### is not manually reviewing my site?

:: detour to random chunk of logs, containing the following requests (look at the proportions, not at the absolute numbers) from 66.249.nn.nn ::

I'll be ###. Google is evolving before our very eyes. Even two months ago when I was watching robot behavior closely, this is not the pattern I would have seen.

1 request for sitemap from Googlebot
6 requests for robots.txt from Googlebot (if it had been bingbot, there would have been 60 :))

62 pages:
--47 from the regular Googlebot
--14 from Googlebot-Mobile
--watch this space

90 images:
--28 Googlebot-Image with no referer
--54 Googlebot with human-style referer
--watch this space

5 stylesheets:
--3 Googlebot with human-style referer
--watch this space

Has everyone figured out what goes in the missing spaces?

66.249.17.123 - - [02/Mar/2012:08:49:35 -0800] "GET / HTTP/1.1" 200 1944 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0)" 
66.249.17.123 - - [02/Mar/2012:08:49:35 -0800] "GET /sharedstyles.css HTTP/1.1" 200 0 "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0)"

(I kinda think this was a mechanical glitch. Filesize 0?!)
66.249.17.123 - - [02/Mar/2012:08:49:35 -0800] "GET /sharedstyles.css HTTP/1.1" 200 2589 "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0)" 
66.249.17.123 - - [02/Mar/2012:08:49:35 -0800] "GET /images/WorldsHeadline.png HTTP/1.1" 200 2589 "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0)"
66.249.17.123 - - [02/Mar/2012:08:49:35 -0800] "GET /images/FunnyFace.jpg HTTP/1.1" 200 5042 "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0)"

(et cetera for the remaining 6 images that live on this page)

Doesn't that MSIE 7 business look just like bing? As long as they keep swiping ideas from google, it's only fair for google to turn around and swipe an idea from them.

If the front page had happened to use any .js files, they would have been picked up too. But not by 66.249; js goes to 74.125. Which, incidentally, seems to be turning into g###'s poor relation. When it isn't going around with no clothes at all as the faviconbot, you might find it dressed like this:

74.125.19.35 - - <snip> "Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0"


or like this, in its mysterious Preview-less Preview costume:

74.125.63.33 - - <snip> "GET /piwik/piwik.js HTTP/1.1" "http://www.example.com/filename.html" 200 20113 "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11"


If it had not been in plain clothes, it would have got a 403 slammed in its face. Time to fine-tune the htaccess.
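For anyone fine-tuning the same thing, a minimal sketch of the kind of rule that catches the plain-clothes fetches by IP range instead of user-agent (assumes mod_rewrite; the range and file extensions here are illustrative, adjust to taste):

RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^74\.125\.
RewriteCond %{REQUEST_URI} \.(js|css)$ [NC]
RewriteRule .* - [F]

The [F] flag sends the 403 regardless of what costume the user-agent string is wearing.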
5:36 am on Mar 30, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 19, 2004
posts:1939
votes: 0


Lacking any reasonable explanation for why they would need to access those files, my only conclusion is that they are trying to determine more about the links here.
9:17 am on Mar 30, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Apr 30, 2007
posts:1394
votes: 0


Has everyone figured out what goes in the missing spaces?

These IPs belong to Google, but are you certain they are used for the robot? It's very strange.

For instance, I see 66.249.72.nnn trying to log in to my FTP server. So I take it either Googlebot has gone way too far, scanning servers, ports, etc., or some of these IPs are open proxies and it's not Googlebot.

PS: Meaning some systems behind the IPs could be compromised.
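One way to settle whether such an IP is really Googlebot is Google's documented two-step DNS check: a reverse lookup on the IP should give a hostname ending in googlebot.com, and a forward lookup on that hostname should return the original IP. A minimal Python sketch of the check:

import socket

def is_googlebot(ip):
    # Reverse (PTR) lookup: genuine crawler IPs resolve to
    # *.googlebot.com (or *.google.com)
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not host.endswith(('.googlebot.com', '.google.com')):
        return False
    # Forward-confirm: the hostname must resolve back to the same IP
    return ip in socket.gethostbyname_ex(host)[2]

print(is_googlebot('66.249.72.102'))

An open proxy or a spoofer will fail one of the two lookups.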
10:08 am on Mar 30, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13210
votes: 347


Urk. If google's IP has been compromised, nobody is safe. They go everywhere. That's the whole point of being google.

Another off-the-top-of-my-head conjecture: G### is sneaking around in plain clothes, emulating a human, to see whether humans really get what the googlebot thinks they're getting.
10:21 am on Mar 30, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:Apr 27, 2011
posts:97
votes: 0


G### is sneaking around in plain clothes, emulating a human, to see whether humans really get what the googlebot thinks they're getting.


Which is exactly what any reasonable webmaster would expect them to be doing.
10:36 am on Mar 30, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Apr 30, 2007
posts:1394
votes: 0


After reading more about it: the FTP crawling is documented here:
[developers.google.com...]

Weaknesses aside, that's not a good approach to the FTP issue: as far as I can see, the virtual-hosts workaround for FTP is still a draft.
[tools.ietf.org...]
So even if someone wanted to, they could not set up FTP to serve different domains from the same physical resources, serving content to spiders. Not reliably, at least. Google should know that something like this may cause security problems at this time, and should limit it to sites that submit specific port details via their accounts.

[edited by: tedster at 11:43 am (utc) on Mar 30, 2012]
[edit reason] make first link active [/edit]

2:49 pm on Mar 30, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:May 17, 2011
posts: 193
votes: 0


Why does it say it was uploaded on August 18th 2011?
3:21 pm on Mar 30, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Aug 29, 2006
posts:1312
votes: 0


Are these accesses you're talking about from the bot, or are they due to a manual review or some other validation mechanism?

The accesses are never from Googlebot, but they are not human either.

Mountain View pulls, for example, a single JavaScript file, or a couple of CSS files, or whatever else they feel they need to check that they are not being gamed.

exactly what any reasonable webmaster would expect them to be doing

Indeed, the difference being that stealth bots do not add the files to the search index.

The question is, why make this "public service announcement", and why now?

Matt Cutts says "please... unblock... if you can... you don't need to do that now".

If he means that Google will no longer index CSS or JavaScript files, he doesn't say so.

If he means that Google will no longer use unidentified stealth bots, he doesn't say so.

If he means that Google will penalise sites that restrict access, he doesn't say so.

If he means that Google's ranking system is not as smart as they like people to think, and that it is being gamed far too easily, and that he wants webmasters to help his company out by making changes to their sites, he doesn't say that either.

But anything to do with robots.txt compliance is a charade anyway.

...
1:45 pm on Apr 2, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 30, 2002
posts: 1377
votes: 0


Sounds like blocking .js and .css is going to get you your first points towards an over-optimization penalty?
2:31 pm on Apr 2, 2012 (gmt 0)

Full Member

5+ Year Member

joined:Mar 22, 2011
posts:339
votes: 0


HitProf wrote:
Sounds like blocking .js and .css is going to get you your first points towards an over-optimization penalty?

What makes you think that?

--
Ryan
2:51 pm on Apr 2, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 30, 2002
posts: 1377
votes: 0


Google wants to know what your page looks like to the user. If you hide your css and/or js, Googlebot can't see that. They might assume that you are messing around with visibility or content flow. They might not like that, especially if you have done a lot of other things to please Googlebot. Just assuming. This might take a while, but don't be surprised to see Google evolve in this direction over time.
4:16 pm on Apr 2, 2012 (gmt 0)

Full Member

5+ Year Member

joined:Mar 22, 2011
posts:339
votes: 0


HitProf wrote:
Google wants to know what your page looks like to the user. If you hide your css and/or js, Googlebot can't see that. They might assume that you are messing around with visibility or content flow. They might not like that, especially if you have done a lot of other things to please Googlebot. Just assuming.

Is it reasonable to assume that Google's algorithm would apply a penalty because of a lack of positive evidence for any wrongdoing? Even the government in George Orwell's 1984 didn't go that far.

This might take a while, but don't be surprised to see Google evolve in this direction over time.

This isn't an assumption; it's a prediction. I'm curious why you think "disallowing the crawling of external JavaScript and CSS files will result in an over-optimization penalty" is a logical outcome. On what past observations and/or information are you basing this prediction?

Sure, cloaking can result in a penalty, but even that requires some sort of confirmation, like a manual review (if I understand correctly). Google didn't just slap a penalty on your site because your site used JavaScript that Googlebot couldn't understand and therefore might mean you were doing something naughty.

--
Ryan
4:50 pm on Apr 2, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13210
votes: 347


Do you think it's possible to teach google that crawl and index are different things? Not everyone has the capacity to stick a "noindex" tag on something other than an html page.
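For what it's worth, there is a way to put a noindex on something other than an HTML page: the X-Robots-Tag HTTP response header, which Google documents for exactly this purpose. A minimal htaccess sketch (assumes Apache with mod_headers; the file extensions are illustrative):

<FilesMatch "\.(js|css)$">
    Header set X-Robots-Tag "noindex"
</FilesMatch>

Note that this expresses precisely the crawl-versus-index distinction: the file must stay crawlable, because the header is only seen when the bot actually fetches the file.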
5:25 pm on Apr 2, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 30, 2002
posts: 1377
votes: 0


It's just a hunch, I'm curious to hear other opinions. (Somehow I can't make those nice little quote boxes).
9:41 pm on Apr 2, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13210
votes: 347


Somehow I can't make those nice little quote boxes

You can type them in manually as
[quote]blahblah[/quote]
or hit Preview and the buttons will magically appear.

Now, if there's a secret trick to making them say "So-and-so said..." when you're quoting a string of different people and you don't want to put tangor's words into g1smd's mouth...
2:38 am on Apr 3, 2012 (gmt 0)

New User

joined:Apr 3, 2012
posts: 2
votes: 0


I've read the majority of the posts on here, and I think there are plenty of takes on what is right and what is wrong. If Google has decided the bots are smart enough to crawl your .js and .css, then let them do it. For one thing, I just do not think Google would send a crawler through your scripts and style sheets without being capable of interpreting what is in those files. Secondly, if you're hiding something, then it would be a problem. The only open question is the outcome, and that seems to be what everyone is afraid of nowadays. My recommendation: fix what you're doing wrong first, then let them in. But who's to say they aren't doing it already and you just don't know it yet? They could be giving you a warning before the negative effect hits you for running black hat SEO.
10:23 pm on Apr 3, 2012 (gmt 0)

Full Member

5+ Year Member

joined:Mar 22, 2011
posts:339
votes: 0


lucy24 wrote:
Now, if there's a secret trick to making them say "So-and-so said..." when you're quoting a string of different people and you don't want to put tangor's words into g1smd's mouth...

Not that I'm aware of. I add that myself.

--
Ryan
9:53 am on Apr 6, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 24, 2002
posts:894
votes: 0


Today :
/forum.css - 80 - 66.249.72.102 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)

and this is not the first time, though css has been disallowed in robots.txt since day 1, several years ago.

It's pretty hypocritical of Cutts to "ask" to allow their bot to crawl css if they de facto ignore robots.txt in the first place.
1:33 pm on Apr 6, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 13, 2004
posts:825
votes: 10


Google wants to know what your page looks like to the user. If you hide your css and/or js, Googlebot can't see that.


Google IP addresses must be accessing almost all css and js files now. Google's "preview" is likely a foundation for Panda. It would explain the slow update rate of Panda; it must take a lot of horsepower to generate all those previews. With Preview they can parse the Document Object Model (the structure) of all your pages, which means Google can fully understand your page's content structure, js, and css now.

The only thing left for Google to find out is how a site might be serving pages differently via js and css to the many browser types that are out there now, probably especially mobile. They probably can't afford the horsepower to render "previews" for all possible browsers.

So with preview they've already reviewed your css and js files; now they just want to see the variations.

Given that context, it seems less likely Google has many evil intentions in asking webmasters to unblock the few remaining js and css files Google hasn't already reviewed, and fully understood, by parsing the Document Object Model using the "preview" results.

In the past I made one comment on a post about CSS, that being: "How could Google possibly understand CSS?" CSS must be a lot easier to understand than English, Spanish, Japanese, etc., and understanding those is Google Search's goal.
10:11 pm on Apr 6, 2012 (gmt 0)

New User

joined:Apr 3, 2012
posts:2
votes: 0


Should these files become unblocked by someone and then indexed by Google, what about Bing and Yahoo!? How will this affect those results, and what is the likelihood of them requesting the same if they have not already done so?
7:55 pm on Apr 9, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 13, 2004
posts:825
votes: 10


Server side, I embed my CSS in the html source file. JavaScript is directly embedded. Other than photos and ads, my visitors' browsers only have to open one file to see my web page's content (other than a background image). I've figured since the introduction of CSS, Google, and all GOOD search engines, should see the exact content of my website(s).

With GZIP compression, server side, ONE file is the way to go for maximum performance and openness.

For years Google was not looking at CSS. I thought: why hide CSS in files Google does not read? If Google sees every detail of a website, they will have more confidence in their search results, and thereby a site might have a higher ranking. Now, of course, they do want to see CSS.

Unfortunately, at least for one of my sites, after Oct. 13th this openness is not good enough for Google!
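For reference, the compression half of the one-file setup described above is a single line of Apache config (assumes mod_deflate; a sketch, not a complete tuning):

AddOutputFilterByType DEFLATE text/html

With the CSS and JS inlined into the HTML, that one directive gzips the whole page into a single compressed response.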
10:19 pm on Apr 9, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Aug 29, 2006
posts:1312
votes: 0


why hide

Nothing is being hidden.

Disallowing files in robots.txt is a request not to index them.

There is nothing to stop Google accessing and inspecting the files if they want.

And they do so.

...
10:27 pm on Apr 9, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Disallowing files in robots.txt is a request not to index them.

As I understand things, Disallow is a request for the named bot not to CRAWL the file.
11:24 pm on Apr 9, 2012 (gmt 0)

Moderator This Forum from GB 

WebmasterWorld Administrator andy_langton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 27, 2003
posts:3326
votes: 133


Yeah - hence "URL only" listings. The bot knows of the existence of the URL, but (theoretically) not the contents.
12:45 am on Apr 12, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 30, 2002
posts: 2504
votes: 41


The way I look at it now, the Borg (Google) wants to grab CSS and JS so that it can keep users on its advertising portal. The snippets thing is just an attempt to make their advertising portal sticky. Good search engines give people what they want and should have a high bounce rate. But then I'm cynical and I'm not a Google fanboy.

Regards...jmcc
10:32 am on Apr 12, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 24, 2002
posts:894
votes: 0


Today :
/site.css - 80 - 66.249.72.102 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)

That IP number is now put in quarantine, i.e. it gets redirected to robots.txt for a number of visits until it knows what's in it ;o)
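Illustratively, that kind of quarantine can be sketched in htaccess like this (assumes mod_rewrite; the "for a number of visits" counting would need something stateful on top, this just does the redirect):

RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^66\.249\.72\.102$
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* /robots.txt [R=302,L]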
2:15 pm on Apr 12, 2012 (gmt 0)

Full Member

5+ Year Member

joined:Mar 22, 2011
posts:339
votes: 0


jmccormac wrote:
The way I look at it now, the Borg (Google) wants to grab CSS and JS so that it can keep users on its advertising portal.

How is that supposed to work?

--
Ryan
This 103 message thread spans 4 pages.