Google Indexing My Javascript!

And Not Following Robots.txt!

Vetteman

3:10 am on Jan 26, 2005 (gmt 0)

On my content site I use some technical words, and when you click on one of them, a JavaScript popup shows its definition.

A few months back, I found all of these html pages in Google's index, and they are only linked by javascript links.

I also have a separate glossary section containing some of these same definitions, but on different pages (these pages use my site template). I was annoyed that Google indexed the popups, potentially exposing me to duplicate content penalties.

I disallowed these popup HTML pages in robots.txt, but months later they are still in Google's index. Worse yet, I found that over 50 of them have a PR of 3.

What should I do? I don't want to leak Pagerank to those pages and I don't want a duplicate content penalty!

macdave

4:26 pm on Jan 26, 2005 (gmt 0)

I believe that Google has been following URLs within JavaScript for quite some time now.

In any case, you can use Google's URL removal tool to get those pages out of the index: [google.com]

encyclo

4:53 pm on Jan 26, 2005 (gmt 0)

You should put your robots.txt through the validator [searchengineworld.com] to check for any problems with your syntax. Also, it would be a good move to add a robots meta tag to the popup pages:

<meta name="robots" content="noindex,nofollow">

As you have seen, just using a Javascript link is not enough to stop Googlebot.
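With dozens of popup pages to update, it may help to confirm that each one actually carries the tag. The following is only a rough sketch using Python's standard `html.parser`. The names `RobotsMetaParser` and `robots_directives` and the sample page are made up for illustration and are not part of any Google tool.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for token in attr_map.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_directives(html_text):
    """Return the set of robots meta directives found in html_text."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives


# Hypothetical popup page, invented for this example.
popup_html = """<html><head>
<title>Example Term</title>
<meta name="robots" content="noindex,nofollow">
</head><body><p>Definition goes here.</p></body></html>"""

found = robots_directives(popup_html)
print("noindex" in found, "nofollow" in found)
```

Running the same function over the saved source of each popup page would flag any page where the tag was forgotten.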

egomaniac

5:18 pm on Jan 26, 2005 (gmt 0)

Are the javascript links in the html code, or are they in external "remote" javascript files?

Putting them in remote files will keep the Googlebot from indexing them and leaking pagerank through them. It's not an issue of whether they can or cannot... but they do not when the code is put into a remote file.

walkman

5:37 pm on Jan 26, 2005 (gmt 0)

It could be that Google is looking to have 12 billion pages. Last time, when they "needed" 8 billion for the press releases, robots.txt was ignored and years-old pages were revived.

BReflection

6:11 pm on Jan 26, 2005 (gmt 0)

"it could be that Google is looking to have 12 Billion pages. Last time when they 'needed' 8 billion for the press releases, robots.txt was ignored, and years-old pages were revived."

cite?

walkman

6:22 pm on Jan 26, 2005 (gmt 0)

"cite?"
Here's one. I've seen many similar reports all over webmaster boards:
[webmasterworld.com...]

Birdman

7:02 pm on Jan 26, 2005 (gmt 0)

Robots.txt will not stop Google from indexing URIs. It only tells them not to read the page. They will still index the links they find while parsing the "crawlable" area of your site.

I believe encyclo's solution is the best, which reminds me... I need to implement this myself.
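To make the crawl-versus-index distinction concrete, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `/gloss/` path, the example.com URLs, and the helper `may_crawl` are all assumptions for illustration. A Disallow rule only answers "may this URL be fetched?". It says nothing about whether the URL itself can be listed, and a crawler that honors the rule never downloads the page, so it would never see a noindex meta tag placed there.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the /gloss/ path is an assumption.
robots_txt = """\
User-agent: *
Disallow: /gloss/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def may_crawl(url, agent="Googlebot"):
    """True if the parsed robots.txt lets this agent fetch the URL."""
    return parser.can_fetch(agent, url)


popup_url = "http://www.example.com/gloss/term.htm"
article_url = "http://www.example.com/articles/page.htm"

print(may_crawl(popup_url))    # False: the page body is never fetched
print(may_crawl(article_url))  # True: links found here can still be listed
```

This is why the meta tag and the robots.txt exclusion work against each other when both are applied to the same pages.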

BigDave

7:12 pm on Jan 26, 2005 (gmt 0)

A robots.txt exclusion will not cause them to remove the files that are already in the index. All it does is keep them from crawling those pages again.

Get rid of the robots exclusion and add meta robots noindex to the files.

If you are concerned about PR leaks, put a small link back to the home page in a copyright notice at the bottom of each of those pages.

I am not surprised that they found your links. They even find URLs in the regular text to follow. What they don't (or I'd better make that "didn't") do is pass PR through any of those links.

Try adding a rel="nofollow" to the links at this point.

Are you sure that no one else is linking to those pages?

Vetteman

3:57 am on Jan 27, 2005 (gmt 0)

Thanks for your help, everyone!

I will definitely use this meta tag: <meta name="robots" content="noindex,nofollow">.

I will try to post about any changes.

RossWal

5:26 am on Jan 27, 2005 (gmt 0)

What about visitors with the toolbar installed? Couldn't that be a way that seemingly orphaned pages get discovered? Wouldn't explain the PR though!

larryhatch

5:45 am on Jan 27, 2005 (gmt 0)

I hope this isn't off the thread, BUT:

IF Google is indexing javascript, is there any chance that means that G will pass PR thru phony JS links to valid sites instead of leaving it to the JS PR hoggers? -LH

Vetteman

2:24 pm on Jan 27, 2005 (gmt 0)

Here is the type of link Googlebot is following: <a href="javascript:popUp('/gloss/leveragedbuyout.htm')">Leveraged Buyout</a>. My lack of technical savvy could mean that these are standard href links which Google does follow; forgive me if that's the case. I keep my javascript in an external file.
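For what it's worth, the popup URL sits in the page markup as plain text, so a crawler would not need to run any JavaScript to find it. The sketch below is not how Googlebot works (nobody in this thread knows that). It only shows that a simple pattern match over the href is enough. The names `QUOTED_PATH` and `paths_in_href` are invented for the example.

```python
import re

# Matches quoted strings that look like site-relative paths.
QUOTED_PATH = re.compile(r"""['"](/[^'"\s]+)['"]""")


def paths_in_href(href):
    """Return path-like strings quoted inside a javascript: href."""
    if not href.lower().startswith("javascript:"):
        return []
    return QUOTED_PATH.findall(href)


href = "javascript:popUp('/gloss/leveragedbuyout.htm')"
print(paths_in_href(href))  # ['/gloss/leveragedbuyout.htm']
```

Keeping the popUp function itself in an external file does not change this, because the path is still written out in the HTML of the linking page.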

Will Google see my site as less authoritative if I try to stop some pages from being indexed in this manner? I'm not hiding anything or doing anything devious, but you never know with Google.