Forum Moderators: Robert Charlton & goodroi

Javascript in robots.txt

member22

10:09 am on Jan 2, 2014 (gmt 0)

10+ Year Member



Hi,

I use Joomla, and in my robots.txt I have blocked javascript. I was wondering why it is recommended not to block it.

Then I blocked modules, mailto and components, but when I type site:mywebsite.com in Google the pages appear with the following description:

"A description for this result is not available because of this site's robots.txt – learn more."

What do you recommend that I do?

I blocked everything in order to avoid having pages indexed, but I am not sure it is the right thing to do?

Thank you,

Robert Charlton

10:40 am on Jan 2, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



in my robots.txt I have blocked javascript and was wondering why it is recommended not to block it.

member22 - I think you're misinterpreting the message...
A description for this result is not available because of this site's robots.txt – learn more.

The message isn't about not blocking javascript per se with robots.txt... it's about not depending on robots.txt to keep a url or reference to a page out of the index.

robots.txt is used to keep Googlebot from crawling a page. But it won't keep out of the index another reference to that page (i.e., a link) which Googlebot might find on a page elsewhere on the web... if that reference is not blocked by robots.txt. Google will index the reference and return it with the "description not available" message.

If you want to keep any reference to a page out of the visible index, you need to use the meta robots noindex tag in the head section of the page whose url you don't want to appear. There's an apparent paradox here, though, since, for Googlebot to see the meta robots tag on the page, it needs to crawl the page, so you can't use robots.txt to block a page whose meta robots tag you wish to be observed.
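To make the mechanics concrete, here is a minimal sketch of where that tag lives (the page itself is hypothetical):

```html
<!DOCTYPE html>
<html>
<head>
  <title>Page to keep out of the index</title>
  <!-- Googlebot must be able to crawl this page to see this tag,
       so the url must NOT be disallowed in robots.txt -->
  <meta name="robots" content="noindex">
</head>
<body>
  ...
</body>
</html>
```

The key point is the pairing: the page stays crawlable, and the tag tells the crawler not to index it.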

I've had the discussion too many times to go over all the nuances again, but I suggest reading this thread several times. It takes a while to sink in....

Pages are indexed even after blocking in robots.txt
http://www.webmasterworld.com/google/4490125.htm [webmasterworld.com]

I'm sure others will jump in with their elaborations.

member22

9:48 am on Jan 3, 2014 (gmt 0)

10+ Year Member



In fact I had 2 questions, but I wasn't clear in my message.

- the 1st one was: why is it bad to block javascript in robots.txt?

For example on my robots.txt I have the following blocked

Disallow: /administrator/
Disallow: /cache/
Disallow: /components/
Disallow: /images/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/
Disallow: /javascript/

- My 2nd question is: why do I have, for example, my page www.mywebsite.com/administrator indexed (with the message "description not available") in Google when I type site:mywebsite.com, when my robots.txt says
Disallow: /administrator/

Is it because there is a link somewhere else on the web for that page?

Robert Charlton

10:20 am on Jan 3, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Is it because there is a link somewhere else on the web for that page?

Perhaps Googlebot spidered a publicly available server log or something like that. That would be enough to get the "references" or links to these pages noted in Google's index. Since Googlebot obeys the robots.txt disallow, it doesn't crawl the pages, and thus doesn't know whether these are important pages that were accidentally blocked.

I'm assuming you're seeing these in a site:example.com search... not in competitive results.

Again... if you want to understand what's going on, I suggest reading the thread I mentioned, and read it several times.

If you just want to fix the problem, I recommend that you set the meta robots tags on your offending pages to roughly this syntax...

<meta name="robots" content="noindex">

...and that you remove the disallows from robots.txt so that Googlebot can spider your pages, see the meta robots tag, and noindex each page.

(You shouldn't be using robots.txt to keep pages out of the index.)

[edited by: Robert_Charlton at 10:30 am (utc) on Jan 3, 2014]

member22

10:24 am on Jan 3, 2014 (gmt 0)

10+ Year Member



Does having pages like this dilute PageRank?

lucy24

10:25 am on Jan 3, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



why do I have, for example, my page www.mywebsite.com/administrator indexed (with the message "description not available") in Google when I type site:mywebsite.com, when my robots.txt says
Disallow: /administrator/

It is precisely because the page is disallowed. Google knows that the page exists, but it is not allowed to crawl, so it can't show any content.

You are not alone. It took me at least a year to wrap my brain around the idea that
crawl != index
or, if you prefer,
crawl <> index

If you don't want search engines to mention the page's existence, you have to let them crawl it and show them a "noindex" directive. Usually it's in the form of a robots meta
<meta name="robots" content="noindex">

but sometimes you have to do it by setting a header instead. One of those "sometimes" is non-html content such as javascript, for example (cut-and-paste from my own htaccess):
<FilesMatch "\.(js|txt|xml)$">
Header set X-Robots-Tag "noindex"
</FilesMatch>

Now, if you really don't want search engines anywhere within several miles of your content, and also don't want them to mention its existence, you could slap them with a 403 based on IP and/or user-agent. Then they'll get mad and won't even mention the page. But this is Not Nice. (And possibly also not a good idea.) If you want to split hairs I guess it counts as cloaking, though not in the usual sense.
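A rough htaccess sketch of that 403 approach, for illustration only (the path and the user-agent strings are assumptions, and as noted above, this is Not Nice to do to legitimate search engines):

```apache
# Serve 403 Forbidden to named crawlers for a private directory.
# They can't see the content, and will eventually stop listing the urls.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Googlebot|bingbot) [NC]
RewriteRule ^private/ - [F]
```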

Search engines have been asking to crawl javascript because it's an essential part of some pages' content. But you probably don't want them to index it. And some scripts, such as analytics, are clearly none of the search engine's business. (Idle query: Just how many sites have juicy, content-influencing scripts masquerading behind the name "piwik.js"?)

member22

10:29 am on Jan 3, 2014 (gmt 0)

10+ Year Member



I am asking because we had a major bug with Joomla about a year ago where Google indexed over 700 pages for our website (it created those by surfing like crazy through our server) when our website is less than 50 pages.

My worry is that we can't get our ranking back because all those pages that it found and created are still in public server logs?

Could it be possible that it finds those pages in server logs and considers them duplicate content, even though we added a 410 response on our site to remove those pages from Google's index?

According to Webmaster Tools it did remove almost all of them… publicly available server logs seem to be a different thing, no?

phranque

10:45 am on Jan 3, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I blocked everything in order to avoid having pages indexed, but I am not sure it is the right thing to do?

the Disallow: directive in robots.txt means excluded from crawling, not blocked from indexing.

why is it bad to block javascript in robots.txt?

google might think you are showing googlebot different content than live users and you might get "live browser" visits from a mountain view IP address to verify everything looks legit.

Is it because there is a link somewhere else on the web for that page?

google likely discovered the url somewhere, not necessarily somewhere else.
note that besides anchor elements and the other usual places, a discoverable url may appear in a link element in the head of your document, for example.
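as an illustration of that last point, a url can become discoverable through nothing more than head markup like this (hypothetical example):

```html
<head>
  <!-- googlebot can pick up this url even though no visible link exists on the page -->
  <link rel="alternate" href="http://www.example.com/modules/some-page">
</head>
```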

Does having pages like this dilute PageRank?

that depends on a lot of specifics, but disallowing crawling of a url can affect the flow of pagerank on a site.

phranque

10:51 am on Jan 3, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Could it be possible that it finds those pages in server logs and considers them duplicate content, even though we added a 410 response on our site to remove those pages from Google's index?

the 410 won't help if you block googlebot from crawling the url.

according to the webmaster tools it did remove almost all of them

"remove" from where?
always a good idea to use "fetch as googlebot" in GWT to help you understand what the crawlers sees.
perhaps being reported as "blocked by robots.txt"?
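for the 410 to do its job, googlebot has to be allowed to fetch the dead urls. a rough htaccess sketch (the /modules/ pattern is an assumption based on this thread):

```apache
# Return 410 Gone for the leaked urls so googlebot can see they are dead.
# Requires that robots.txt no longer disallows these paths.
RedirectMatch gone ^/modules/
```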

member22

11:46 am on Jan 3, 2014 (gmt 0)

10+ Year Member



Thank you for all your replies.

I think I have figured out what my issue was. I have a robots.txt disallow on /module, but Google still shows pages (with "description not available") that include /module in the web address.

I do believe it does this because I have a robots.txt block on the /module folder, and when I had my bug those pages got indexed.

If I remove the /modules disallow from robots.txt they are all going to be indexed, then I can remove them one by one in GWT, then block the /module folder in robots.txt again and they won't appear anymore. Is that correct?

The other solution: leave the robots.txt block for /module, remove each page that appears in the index with the removal tool, and keep doing that until they are all removed from the index.

Which method do you recommend?

Can it hurt my ranking to have a /module page indexed in Google but not crawlable because of robots.txt?

aakk9999

11:58 am on Jan 3, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"remove" from where?

I am wondering whether this means that the number of pages Google has indexed dropped, and these "fixed" URLs now show under the "Not Found" section of WMT.

note that besides anchor elements and the other usual places, a discoverable url may appear in a link element in the head of your document, for example.

Even worse, Google will use a URL it sees in HTML comments for discovery. If it looks like a URL, Google will try it.

we had a major bug with joomla about a year ago (...) My worry is that we can't get our ranking back because all those pages that it found and created are still on public server logs ?


For Joomla, your robots.txt looks fine to me. With regard to your ranking: a lot has happened in Google's algorithm in the last year.

If you have got the number of indexed pages which Google *can* crawl down to pretty much what your site actually has, and on top of that you have some other pages indexed but blocked by robots.txt, and these pages mostly show in the supplementary index (i.e. only shown when you click on "repeat the search with the omitted results included"), then I do not think you have a problem in this area. That is, I don't think it is these robots-blocked pages in Google's supplementary index that are causing your ranking issue.

What I am trying to say is that even though your site dropped in ranking when you leaked 700 duplicate pages, fixing all of these pages may not restore the ranking you had before, because the algorithm has changed since.

member22

12:48 pm on Jan 3, 2014 (gmt 0)

10+ Year Member



For information, the pages that Google has in its index such as /administrator, /modules, /images with no description don't come from server logs. They come from a major bug when Google's spider got total access to our website and server and started creating hundreds of pages, duplicates and more, and indexing everything; there was no limit. (This came from an upgrade from 1.5 to 2.5 and a bug with one of the modules not functioning correctly. We never figured out which one, had to go back to 1.5, and are now working on starting a new website from scratch. We learned the lesson: never upgrade with Joomla ;-)

I understand what you all mean by using a <meta name="robots" content="noindex">

on the pages you don't want indexed, but the issue is that Google probably has about a 1000 or more in its index, with addresses that look like this:

www.mywebsite.com/module/id-123/ajax ? etc… with an unlimited number of combinations, and we don't know which addresses it has in its index… and this is why we blocked the /module folder.

We only discover the addresses as they slowly appear when we type site:www.mywebsite.com

Once we remove 10 pages with the Google removal tool, 10 new ones appear… and my worry is whether those pages that Google has in its index, even if blocked, can hurt. Can it or can't it hurt? I don't think so, because they are blocked and are just bare links, but… with Google we never know...

aakk9999
Thank you for your reply. Under "Not Found" the number increases, that is true; we are somewhere around 520 pages, when the maximum we reached when we had the bug was 765. (Does it mean we need to get back to 765 for the number to be stable, not increase anymore, and maybe start to see our rankings come back?)

But what I meant by "remove" is the number shown in WMT under index status (our goal is to get back to 38 pages; we were at 765 with the bug and are now at 77). So we are almost back to our correct number, but at the same time under Not Found we are at 520, when the index status said 765 when we had the bug...

By the way, which is more important, the index status or the Not Found count? I don't really know which one to look at in terms of importance.

"Google would use the URL it sees in HTML comments for discovery."
I don't see why I would have links in my HTML that I didn't add or create, so I am not worried about that.

Concerning the algorithm changes, I agree there have been changes, but we are 99.9% sure our ranking change doesn't come from those algorithm changes… because the ranking changes actually appeared when we had the bug.

lucy24

8:18 pm on Jan 3, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have a robots.txt disallow on /module, but Google still shows pages (with "description not available") that include /module in the web address.

You're using the wrong connector. It should be:

I have a robots.txt disallow on /module, AND THEREFORE Google shows pages (with "description not available") that include /module in the web address.