Forum Moderators: Robert Charlton & goodroi
The correct MIME type for javascript files is text/javascript, I believe. Those should not be indexed; I can't see any sense in indexing them at all.
I have no problem with Google scanning the files to see what is in them, and rooting out sites with dodgy redirects to spam, trojans, and malware, but they should not appear in search results.
> The correct MIME type for javascript files is text/javascript, I believe. Those should not be indexed; I can't see any sense in indexing them at all.
I can verify that text/javascript IS indexed. I have an include file on one website that I also include on another, and it's listed in the search results. How can I prevent this?
From that admittedly limited experience, I would guess that files ending in .js or .css are usually excluded naturally (they are never fetched), while files with other extensions are fetched but then not indexed (listed as URL-only) if the declared MIME type is application/x-javascript or text/css. In other words, with a file extension other than .js or .css, Googlebot needs to fetch the file to check the MIME type; when those extensions are used, the MIME type is assumed and the file is ignored. You could put such files in a directory excluded by robots.txt to avoid any listing.
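For the robots.txt approach, a minimal sketch (the /includes/ directory name is hypothetical; use whatever directory actually holds your include files):

```
# Keep all crawlers out of the directory holding include files
User-agent: *
Disallow: /includes/
```

Bear in mind that a robots.txt block prevents fetching, not necessarily a URL-only listing, if the URL is linked from elsewhere.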
JS: [google.com...]
CSS: [google.com...]
The common denominator seems to be that the URI does not end with "js" or "css"
> The common denominator seems to be that the URI does not end with "js" or "css"
Yes, as I said: if the files don't end in .js or .css, Googlebot has to fetch them to check the MIME type and see whether they are worth indexing. And once a file is fetched, it is in the index, even if it is listed as URL-only.
Looks like, for Google at least, file extensions are as important as MIME types.
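The extension-first behaviour described above can be illustrated with Python's standard mimetypes module, which also guesses a type from the extension alone (this mirrors the idea, it is not Googlebot's actual code; the filenames are made up):

```python
import mimetypes

# Use only the built-in extension table, ignoring any system mime.types files,
# so results don't vary with the host configuration.
mimetypes.init(files=[])

# Guess the MIME type from the file extension alone, the way a crawler
# might shortcut a fetch when the extension is unambiguous. An unknown
# extension (like a hypothetical .inc include file) guesses as None,
# so the file would have to be fetched to learn its real type.
for name in ("script.js", "style.css", "header.inc"):
    guessed, _encoding = mimetypes.guess_type(name)
    print(name, "->", guessed)
```

Note that .js may guess as text/javascript or application/javascript depending on the Python version, which echoes the text/javascript vs. application/x-javascript uncertainty in this thread.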
But Google indexed those pages with a cache.
The funny thing is, I blocked the directory those pages are in with my robots.txt, and even used JavaScript links so that only users could click through, but Google still indexed the pages.
So I submitted my robots.txt file to Google's URL removal tool, which got rid of the pages.
There are options to remove URLs using a robots.txt file or via meta tags on the page, as well as removal of pages that no longer exist (if they return a 404 error).
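For the meta-tag route, the page itself carries the directive; a minimal sketch (the robots meta tag is standard, the placement in the page head is the usual convention):

```
<!-- inside the <head> of any page you want kept out of the index -->
<meta name="robots" content="noindex">
```

This only works for HTML pages that crawlers are allowed to fetch; a plain .js or .css file can't carry a meta tag, which is why the robots.txt and URL-removal routes come up in this thread.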
Google has definitely been spidering JavaScript lately on one of my sites, because I have pages linked only with JavaScript links, precisely so the engines wouldn't crawl the URLs.
I have had something similar happen to me before: even folders that were not linked from anywhere got listed, and I attributed those listings to the Google Toolbar.
Do you use the toolbar? If so, that could be how Google found the pages.