Google now indexes javascript

proof

         

claus

7:29 pm on Aug 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just got proof five minutes ago; there's been talk about this for a while, i know.

Take a look at the pages in these serps: [google.com...]

/claus

claus

7:18 pm on Aug 20, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> see inside my <script> tags

Well, it might just ignore it even though it sees it perfectly well - those are two different things, and that was essentially my point.

The script in the title was an exploit a while back, i recall. Never mind - put the poor thread to rest and let's get back to work, i'd say...

MonkeeSage

7:38 pm on Aug 20, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Dolemite:

If you want to take my posts as trying to pick a fight, that is your prerogative, and you are free to respond accordingly. That is not my intention, but your perceptions are beyond my control.

This is excerpted from one of the pages that Google has the JS indexed for:

<title>Monster Magnet on NME.COM - all the news, reviews, features, discography, ecards, links to top sites and more</title>
<meta NAME="Description" CONTENT="Monster Magnet, on NME.COM - all the news, reviews, features, discography, ecards, links to top sites and more">
<meta NAME="Keywords" CONTENT="Monster Magnet, news, reviews, features, discography, albums, singles, live, gigs, ecards, links, tour dates, images, audio, chat, releases, music, websites, sites, biography, archive, info">

<script language="JavaScript1.2">
<!--
// Configure the two variables below to match your site's own info.
var bookmarkurl="http://www.nme.com";
var bookmarktitle="NME.COM - Express Music News";
var MarqueeString = "<a href='/news/105937.htm'>
[...]

claus listed the search results containing this site earlier, and pointed out that the indexed JS appears in a valid JS block, which is also commented out.

Up to this point you have sarcastically denied the existence of this fact, and ignored the main point -- viz., that Google CAN index JS and doesn't ignore <script> tags or comments, but parses them perfectly well. The question is no longer "does it..?"

Shelumi`El
Jordan

S.D.G

Dolemite

7:49 pm on Aug 20, 2003 (gmt 0)

10+ Year Member



Jordan,

We've been over that one...read the first page.

Arnett

8:56 am on Aug 21, 2003 (gmt 0)

10+ Year Member



Take a look at the pages in these serps: [google.com...]

Then try [google.com...] and tell me why no .js files show up in the SERP?

claus

9:04 am on Aug 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> tell me why no .js files show up in the SERP?

Because Google does not display .js files, or .css files for that matter. They're excluded - no big deal. It would be just as if they chose to exclude PDFs, although they don't.

Oh, i might add... this question suggests the thread has not been read. JS is still not included in the ordinary SERPs; there has to be some kind of error for it to show up.

I've posted it as a wish on the Google wishlist thread; that's about where we are. Nothing new really, except that we now know the Gbot can read it.

/claus

Giacomo

9:12 am on Aug 21, 2003 (gmt 0)

10+ Year Member Top Contributors Of The Month



The Gbot is fully capable of reading and parsing javascript

I think you are missing the difference between indexing and parsing...

Parsing = syntactic recognition = understanding (client-side) code as such.

There is no evidence so far that Google has a built-in JS parser. I think Googlebot simply tries to follow anything that looks like a URL, anywhere in the HTML source.
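A naive pattern matcher can illustrate what "follow anything that looks like a URL, anywhere" might mean in practice. This is purely an illustrative guess at crawler behaviour, not Google's actual code; the regex and the page snippet are invented for the sketch:

```python
import re

# One pattern for absolute URLs, one for href/src attribute values.
URL_RE = re.compile(
    r"""https?://[^\s"'<>)]+|(?:href|src)\s*=\s*["']([^"']+)["']""",
    re.IGNORECASE,
)

def find_url_candidates(source: str) -> list:
    """Return every URL-looking string, whether it sits in HTML, in JS,
    or inside a comment -- markup context is ignored entirely."""
    hits = []
    for m in URL_RE.finditer(source):
        hits.append(m.group(1) or m.group(0))
    return hits

page = '''<a href="/about.htm">About</a>
<script><!--
var bookmarkurl = "http://www.example.com/deep/page.htm";
//--></script>'''

print(find_url_candidates(page))
# ['/about.htm', 'http://www.example.com/deep/page.htm']
```

Note that the URL inside the commented-out script block is picked up exactly like the ordinary href - which is all this thread's evidence really requires.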

MonkeeSage

9:20 am on Aug 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google reads inside of (that is: parses, NOT interprets):

"<script language="JavaScript1.2">
<!--
//-->
</script>"

...this is a fact. The evidence was already presented establishing this fact.

Maybe someday they will start putting what they can already read into the SERPs, intentionally. Whether they do or don't doesn't change the fact, though.

Jordan

TallTroll

9:22 am on Aug 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



http:/*/share1.serverspy.net/cgi-bin/monitor.js?mid=3631

Have a peek at that. Where's the HTML error?

claus

9:37 am on Aug 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Giacomo, parsing does not imply understanding. Parsing is the process of breaking a string up into smaller (workable) parts. What you are after is an extra processing layer that adds some logic to these bits.

A javascript engine like the ones found in browsers has "syntactic recognition" or "understanding", but it still needs to do the parsing first. Such a feature is outside the scope of any SE - SEs do not understand HTML, or PDF for that matter; they just look for patterns.

Here's the SE process very roughly:
(0) Identify documents
(1) Read document content
(2) Parse document content
(3) Index parsed document content
(4) Match (3) to search query
(5) Display relevant subset of (4)
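The rough (0)-(5) chain above can be sketched as a toy pipeline. Everything here (the documents, the whitespace tokenizer) is invented for illustration and vastly simpler than any real SE:

```python
from collections import defaultdict

# Step (0): documents identified by the crawler -- contents invented here.
documents = {
    "/news/105937.htm": "Monster Magnet announce new album",
    "/reviews/8192.htm": "Review of the new Monster Magnet record",
}

# Steps (1)-(3): read each document, parse it into tokens, index the tokens.
index = defaultdict(set)
for url, content in documents.items():
    for token in content.lower().split():
        index[token].add(url)

def search(query: str) -> list:
    """Steps (4)-(5): match the query tokens and display the relevant subset."""
    terms = query.lower().split()
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)

print(search("monster magnet"))  # ['/news/105937.htm', '/reviews/8192.htm']
print(search("album"))           # ['/news/105937.htm']
```

The point of the sketch: steps (1)-(3) treat content as plain text, so nothing in them cares whether a token came from body copy or from a script block - any filtering has to be a deliberate extra rule.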

Now, this thread has shown that the whole chain from (0) to (5) can be done for those elements of an HTML document that are javascript. It has also shown that match and display (4-5) happened as an error, because step (0) failed to recognize the proper document type. Pulling links out of JS is step (4), even if it is done internally and not displayed. To do this they have to be able to do (0)-(3) first.

It settles the question of whether G is able to index JS (and comments) or not: it is. That's it - there's no more to it than that. It should be no big thing either, as JS is much simpler to index than PDF or whatever other odd file formats they already handle.

They still choose not to match and display it, but that's something different from not being able to index it.

http://dictionary.reference.com/search?q=parse
"Computer Science. To analyze or separate (input, for example) into more easily processed components"

/claus

Dolemite

9:43 am on Aug 21, 2003 (gmt 0)

10+ Year Member



I promised myself I wouldn't come back to this thread... *sigh*

we know that Gbot can read [javascript] now.

I have a problem with that statement. That it could read javascript seems implicit. If it's on the page, google can "read" it, regardless of what it is. If it is poorly-formed HTML, it seems normally ignored content may be both read and indexed.

Google reads inside of (that is: parses, NOT interprets):

"<script language="JavaScript1.2">
<!--
//-->
</script>"

Yes, it does...when you don't send the right header.

But both of these supposed revelations mean nothing for the 99% of the web that does things 75% correctly.

http:/*/share1.serverspy.net/cgi-bin/monitor.js?mid=3631

Have a peek at that. Where's the HTML error?

The HTML error is that it isn't HTML. ;)

Giacomo

9:50 am on Aug 21, 2003 (gmt 0)

10+ Year Member Top Contributors Of The Month



Giacomo, parsing does not imply understanding. Parsing is the process in which you break up a string into smaller (workable) parts. What you are after is an extra processing layer that adds some logic to these bits.

No, I'm just after parsing as syntactic recognition. I said, "understanding [i.e., recognizing] code as such". I did not mention "interpreting" (i.e., executing) code anywhere.

I remain convinced that Google currently does not recognize (=parse) JS code as JS code. It just indexes it as text. <added:>when the HTML is malformed ;) </added>

Giacomo

10:09 am on Aug 21, 2003 (gmt 0)

10+ Year Member Top Contributors Of The Month



Try searching Google using the filetype: switch set to "js"....

1. What file types are returned in a Google search? [google.com]

.js is not on Google's list of recognized file types.

Giacomo

10:14 am on Aug 21, 2003 (gmt 0)

10+ Year Member Top Contributors Of The Month



Proof that Google does not normally index JavaScript content:

[google.com...]

All of the results have "MM_openBrWindow" either within the <TITLE> tag (which is wrong syntax) or in the page's text.

False alarm. End of story.

TallTroll

10:16 am on Aug 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> .js is not on Google's list of recognized file types.

LOL, nor is .swf, but behold! [google.com] Don't believe everything Google tells you ;)

They've even got some of 'em tagged with "File Format: Shockwave Flash". They see it, they index it. They index it, there's a way to search it, believe.

[edited by: TallTroll at 11:00 am (utc) on Aug. 21, 2003]

Chndru

10:20 am on Aug 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



guess this is a good time to lock this thread :)

MonkeeSage

10:32 am on Aug 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google reads inside of (that is: parses, NOT interprets):

"<script language="JavaScript1.2">
<!--
//-->
</script>"

...this is a fact. The evidence was already presented establishing this fact.

Maybe someday they will start putting what they can already read into the SERPs, intentionally. Whether they do or don't doesn't change the fact, though.

Jordan

Giacomo

10:44 am on Aug 21, 2003 (gmt 0)

10+ Year Member Top Contributors Of The Month



They've even got some of 'em tagged with "File Format: Shockwave Flash".

Yeah. Too bad I don't see a "File Format: JavaScript" when I search using filetype:js. :-)

Google reads inside of (that is: parses, NOT interprets):
"<script language="JavaScript1.2">
<!--
//-->
</script>"
...this is a fact. The evidence was already presented establishing this fact.

To read is not to parse.

My interpretation (based on the only evidence I've seen) of what Google does:

1. Google reads the page's source, parses the HTML and indexes the textual content + any URLs it can find.

2. If the JS code is malformed or misplaced, it is indexed as text as well.

3. From Google's point of view, JS source code is just a comment within a <script> tag, and is normally ignored.
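Points 1-3 can be sketched with a small text extractor. This is a speculative model of the behaviour described, not Google's parser; it uses Python's stdlib HTML parser and two invented page snippets to show how script code that sits outside a recognized <script> element falls through and gets indexed as ordinary text:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """A speculative model of an indexer's text pass: collect page text,
    skipping anything inside a recognized <script> element."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Text inside a well-formed script element is ignored; anything
        # else -- including stray JS -- is collected as page text.
        if not self.in_script and data.strip():
            self.text.append(data.strip())

def visible_text(source: str) -> list:
    p = TextExtractor()
    p.feed(source)
    p.close()
    return p.text

good = '<p>Band news</p><script>var x = "hidden";</script>'
bad = '<p>Band news</p>var x = "leaked";'  # the script element was lost

print(visible_text(good))  # ['Band news']
print(visible_text(bad))   # ['Band news', 'var x = "leaked";']
```

With intact markup the JS never reaches the index; drop the tags (the malformed case) and the same code is indistinguishable from body text.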

Dolemite

10:56 am on Aug 21, 2003 (gmt 0)

10+ Year Member



No, I'm just after parsing as syntactic recognition. I said, "understanding [i.e., recognizing] code as such". I did not mention "interpreting" (i.e., executing) code anywhere.

I remain convinced that Google currently does not recognize (=parse) JS code as JS code. It just indexes it as text. <added:>when the HTML is malformed ;) </added>

This is a key point. Googlebot just doesn't have the logic to handle the malformed HTML in certain cases. It doesn't "know" it's working with JS; it's just handling the error as best it can by treating the code as plain text.

claus

10:59 am on Aug 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



TallTroll, i was wrong in my other answer, sorry.

The document you just posted is returned as no 2 for this query:
[google.com...]

It is matched only because "monitor" is in the filename. It is not parsed as far as i can see: I can't find any serps for "JoinIn (xip,xgame,xmod,xpass)" or "If you know that the GameLauncher is installed then \n"

The file type is not recognized by Google (this is stated in the SERPs), although it offers to show it as HTML (!?). The server returns these headers for the doc:

HTTP/1.1 200 OK
Date: Thu, 21 Aug 2003 09:54:43 GMT
Server: Apache/1.3.27 (Unix) mod_gzip/1.3.26.1a PHP/4.3.2-RC2 mod_perl/1.27 mod_ssl/2.8.14 OpenSSL/0.9.7a
Connection: close
Transfer-Encoding: chunked
Content-Type: application/x-javascript

This is a javascript document - that was where i was wrong in my other answer. It is clearly indexed by file name only; the content is not parsed.

Thanks a lot :) It confirms that external javascript files are not recognized by Google, so they are still safe to use for links.



The other document (the HTML doc mentioned in post #8) in which the javascript was parsed and indexed returns these headers:

HTTP/1.1 200 OK
Date: Thu, 21 Aug 2003 10:11:06 GMT
Server: Apache/1.3.26 (Unix) mod_perl/1.27
Set-Cookie: WEBTRENDS=127.0.0.0.some_number; path=/; expires=Sat, 20-Sep-03 10:11:06 GMT
Cache-control: no-cache=Set-cookie
Content-length: 41018
Last-modified: Thu, 21 Aug 2003 09:41:56 GMT
Expires: Thu, 21 Aug 2003 10:11:56 GMT
Connection: close
Content-Type: text/html

This is a HTML document. I have no idea why Google does not recognize it as such. I've been looking at the source code and i can find nothing interesting apart from some odd comments inserted just after <head>. Perhaps there are binary characters, i don't know.

Anyway: comment tags are safe too

I compared the "show as HTML" with the page source and the JS that is indexed is not the JS at the top, it's located far down in the document.

The indexed javascript starts this way:

<script language="JavaScript1.2">
function ScrapbookPopup( url ) {

- so there are no comment tags inside the script tags.



wrap-up (i hope):

- external javascript files seem to be safe
- comments seem to be safe
- on-page javascript will be indexed by accident only, so it seems to be safe as well

The only new thing to come out of this thread remains that the Gbot (unsurprisingly) is able to index JavaScript. Google still chooses not to whenever possible.

...i guess there's really nothing more to it, sorry about that.

/claus


added:

A lot of new posts during the few hours it took me to write this one... You're all right, i think. JS is nothing but text. It's perfectly possible to index and parse it, but Google chooses to exclude it from whatever subset of the total amount of text makes it to the SERPs - that's really all we know.

I agree with TallTroll that you can infer only as much about Google's inner workings from the SERPs as you can about a book by judging its cover. Plus, what's true today might not be true tomorrow. For today, however, we only know that they seem to refrain from doing something that they can do. It was very likely that they would be able to do it, and that's basically confirmed, so there's nothing new really.

The rest remains guesswork and discussions about wording.

[edited by: claus at 11:19 am (utc) on Aug. 21, 2003]

TallTroll

11:04 am on Aug 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> Proof that Google does not normally index JavaScript content:

As far as I can tell, the deal is this: if it appears in an href="" somewhere, Gbot will try to scoop it up. In the case of .js files, it seems the requirement is for some arguments to follow the .js (i.e. ...function.js?parameter=value), so the URL doesn't end in .js. There are a few systems out there that generate weird-looking URLs with all sorts of odd characters, and it's conceivable that .js?stuffstuffstuff could be generated in a URL, so they have to accept it.

If that is so, Gbot requests the file, and gets it delivered. As Gbot apparently can't yet execute JS, it gets treated as a text file, more or less. Gbot's understanding of the file naming IS sufficient to make these documents available for selection as a discrete group using the filetype: selector at the search interface.
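That naming-based grouping could look something like the sketch below. The classification rule and the URLs are assumptions for illustration, not Google's actual logic - the point is only that an extension bucket can be assigned from the URL alone, with no content parsing or execution:

```python
from os.path import splitext
from urllib.parse import urlparse

def guess_extension(url: str) -> str:
    """Classify a crawled URL purely by the extension in its path, the way
    the filetype:/ext: operator appears to group documents. Naming alone
    decides the bucket -- nothing is fetched, parsed, or executed."""
    path = urlparse(url).path           # the query string (?mid=...) is discarded
    ext = splitext(path)[1].lstrip(".").lower()
    return ext or "html"                # no extension: assume an ordinary page

print(guess_extension("http://share1.example.net/cgi-bin/monitor.js?mid=3631"))  # js
print(guess_extension("http://www.example.com/reviews/8192.htm"))                # htm
print(guess_extension("http://www.example.com/"))                                # html
```

Under this model the monitor.js?mid=3631 file lands in the "js" bucket by name, matching claus's observation that it was indexed by file name only.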

>> It is clearly only indexed by file name and not content parsed.

Absolutely. Gbot "knows" it's JS because it's called filename.js, not because it's actually executing it

Note that js, swf, exe and even esoteric filetypes like .tar are searchable. Also of great interest should be the response when a non-supported filetype is requested. The Google suggestion page makes reference to your search on "ext:requestedfiletype". Trying another search that replaces the "filetype:" switch with "ext:" produces identical results, implying that "ext:" is the operator in internal use. Assuming that ext is short for extension, it seems Google does categorise documents by file extension; they just don't always tell you which ones they support, only which ones they want you to search for

Ahh, this thread reminds me of "the Old Days" you ornery whippersnappers, yeeha, one of the best discussions I've seen in ages

MonkeeSage

11:12 am on Aug 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



[...] JS source code is just a comment within a <script> tag, and is normally ignored.

Yup, that's what I used to think until I saw that Google is not ignoring it and is parsing right through it.

If there were a rule to ignore <script> blocks or comments, then they would have been ignored under all circumstances if they were found. Well, they were found. They were not ignored. Ergo, Google doesn't seem to have any such rule.

The idea that "JS source code is [...] normally ignored" is just an educated inference from the SERPs; it is not an absolute fact straight from Google's mouth (so to speak).

There is now new evidence that warrants the opposite inference -- that Google doesn't just ignore script blocks and comments, but rather parses them.

What do they do with all the parsed blocks? That's anyone's guess. For all we know they throw them away immediately after identifying and parsing so that they have only HTML in their database. Who knows. I sure don't.

Jordan

Dolemite

11:15 am on Aug 21, 2003 (gmt 0)

10+ Year Member



The other document (the HTML doc mentioned in post #8) in which the javascript was parsed and indexed returns these headers:

HTTP/1.1 200 OK
Date: Thu, 21 Aug 2003 10:11:06 GMT
Server: Apache/1.3.26 (Unix) mod_perl/1.27
Set-Cookie: WEBTRENDS=127.0.0.0.some_number; path=/; expires=Sat, 20-Sep-03 10:11:06 GMT
Cache-control: no-cache=Set-cookie
Content-length: 41018
Last-modified: Thu, 21 Aug 2003 09:41:56 GMT
Expires: Thu, 21 Aug 2003 10:11:56 GMT
Connection: close
Content-Type: text/html

This is a HTML document. I have no idea why Google does not recognize it as such. I've been looking at the source code and i can find nothing interesting apart from some odd comments inserted just after <head>. Perhaps there are binary characters, i don't know.

As previously mentioned, that server sometimes sends no headers. I would suggest that that's what googlebot got, and that it thus decided to index the page as plain text.
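That fallback could be modelled roughly like this. The routing table and its labels are invented for illustration, not Google's real behaviour; the sketch only shows how a missing or unknown Content-Type would send a document down the plain-text path, where script contents get indexed verbatim:

```python
def choose_parser(headers: dict) -> str:
    """Hypothetical routing: a usable Content-Type sends the document to the
    matching parser; missing or unknown headers fall back to plain text."""
    ctype = headers.get("Content-Type", "").split(";")[0].strip().lower()
    routes = {
        "text/html": "html-parser (script blocks skipped)",
        "application/x-javascript": "url-scan only",
    }
    return routes.get(ctype, "plain-text (everything indexed)")

print(choose_parser({"Content-Type": "text/html; charset=iso-8859-1"}))
# html-parser (script blocks skipped)
print(choose_parser({}))  # no headers at all
# plain-text (everything indexed)
```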

Giacomo

11:17 am on Aug 21, 2003 (gmt 0)

10+ Year Member Top Contributors Of The Month



>> Proof that Google does not normally index JavaScript content:
As far as I can tell, the deal is this : if it appears in an href="" somewhere, Gbot will try to scoop it.

Of course. Googlebot will follow anything resembling a URL, anywhere. The reason Gbot is crawling .swf and .js files is to see if it can extract other URLs to crawl. URL parsing is probably the only kind of parsing that is performed on .swf and .js files.

<edit reason>grammar</edit reason>

[edited by: Giacomo at 12:25 pm (utc) on Aug. 21, 2003]

Giacomo

11:19 am on Aug 21, 2003 (gmt 0)

10+ Year Member Top Contributors Of The Month



Yup, that's what I used to think untill I saw that Google is not ignoring it and is parsing right through it.

Show proof, please.

MonkeeSage

11:28 am on Aug 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Giacomo:

The website referenced in post #8, p. 1 of this thread.

Also, the idea that it was taken as a plain-text document and that that is why the JS was parsed, while theoretically possible, doesn't actually seem to be the case -- try searching for some of the actual markup on the page (which would also have been indexed as plain text on such an explanation): it finds no matches that I can see...

Jordan

Dolemite

11:31 am on Aug 21, 2003 (gmt 0)

10+ Year Member



The idea that "JS source code is [...] normally ignored" is just an educated inference from the SERPs; it is not an absolute fact straight from Google's mouth (so to speak).

You're making an equally (if not more) unfounded inference with the following:

If there were a rule to ignore <script> blocks or comments, then they would have been ignored under all circumstances if they were found.

(my emphasis)

You have no idea what sort of pattern matching is occurring. No doubt there is a quite complex set of rules in place to handle the wide variety of HTML in various states of compliance.

All we can say is that the pattern matching "breaks" in its attempts to avoid indexing JS when that JS occurs at a certain level of erroneousness. Not every error can be anticipated, so perhaps Googlebot reacts exactly as intended in such situations. We can't know that without knowing the programming spec, but I believe we can safely assume that Googlebot is not intended to index JS, as indexing is not the general case.

Giacomo

12:16 pm on Aug 21, 2003 (gmt 0)

10+ Year Member Top Contributors Of The Month



The website referenced in post #8, p. 1 of this thread.

Click on View as HTML [216.239.51.104]. See? No <script> tag on that page. That's why Google indexed the JS source as plain text, and that's why that page (and thousands of other pages on that site) match document.cookie [google.com].

What Dolemite said. :)

claus

12:54 pm on Aug 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



damn, why didn't i think of viewing the source of the Google cache... i had both freakin' documents open right in front of me for two hours straight but just didn't think of that... boy, do i feel stupid right now. There's nothing to discuss; the G cache shows what the Gbot saw... sort of:

<td><p><font size="3" face="roman">I <B style="color:black;background-color:#A0FFFF">Monster</B> Extras
function ScrapbookPopup( url ) {

- so it's pure text, nothing else. In roman font, even. This code is added by Google (searchterm; ref "sort of"):

<B style="color:black;background-color:#A0FFFF"></B>

Well, thanks for all the critical feedback anyway, it's nice to be proven wrong in this case, although it has made me even more aware of the fact that even big G does not display all that it could.

/claus

Giacomo

1:21 pm on Aug 21, 2003 (gmt 0)

10+ Year Member Top Contributors Of The Month



> Click on View as HTML. See? No <script> tag

Uhm... well, upon further review, maybe the original page had the <script> tag in place (the "View as HTML" version may be slightly different from the regular "Cached" snapshot, AFAIK -- and the current version of that page [nme.com] seems to be OK).

If that is the case, the JS code on that page must have been indexed as plain text because Gbot encountered some other kind of error (probably because of missing/wrong server headers, as RonPK said above [webmasterworld.com]).

Server headers testing seems to confirm RonPK's theory:

Using HTTP/1.1:

Sent data:
HEAD /reviews/8192.htm HTTP/1.1 
Host: www.nme.com
Connection: close
Accept: */*
User-Agent: WebBug/5.0

Received data:

HTTP/1.1 200 OK
Date: Thu, 21 Aug 2003 13:14:17 GMT
Server: Apache/1.3.26 (Unix) mod_perl/1.27
Set-Cookie: WEBTRENDS=62.123.53.31.121301061471657714; path=/; expires=Sat, 20-Sep-03 13:14:17 GMT
Cache-control: no-cache=Set-cookie
Content-length: 26104
Last-modified: Thu, 21 Aug 2003 12:45:50 GMT
Expires: Thu, 21 Aug 2003 13:15:50 GMT
Connection: close
Content-Type: text/html

Using HTTP/1.0:

Sent data:
HEAD /reviews/8192.htm HTTP/1.0 
Accept: */*
User-Agent: WebBug/5.0

Received data:

HTTP/1.1 302 Found
Date: Thu, 21 Aug 2003 13:15:34 GMT
Server: Apache/1.3.26 (Unix) mod_perl/1.27
Location: [nme.com...]
Connection: close
Content-Type: text/html; charset=iso-8859-1

...So Googlebot (using HTTP/1.0) might have stumbled upon that "302 Found" (recursively redirecting to the same URL).
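The WebBug comparison above can be reproduced with a short script. The host and path below are placeholders, and `_http_vsn` is a private (though long-standing) attribute of Python's stdlib HTTP client - there is no public switch for the request version, so this is a pragmatic sketch rather than a supported API:

```python
import http.client

def head_status(host: str, path: str, http10: bool = False) -> int:
    """Send a HEAD request and return the HTTP status code.

    With http10=True the request line goes out as HTTP/1.0, mimicking
    the WebBug session above."""
    conn = http.client.HTTPConnection(host, timeout=10)
    if http10:
        # Private stdlib knobs that control the request version.
        conn._http_vsn = 10
        conn._http_vsn_str = "HTTP/1.0"
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status
    finally:
        conn.close()

# Hypothetical usage against a server that redirects HTTP/1.0 clients:
#   head_status("www.example.com", "/reviews/8192.htm")               # 200?
#   head_status("www.example.com", "/reviews/8192.htm", http10=True)  # 302?
```

A 200 on HTTP/1.1 paired with a 302 on HTTP/1.0, as in the captures above, would support the theory that a 1.0-speaking crawler stumbled into the redirect.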

[edited by: Giacomo at 1:33 pm (utc) on Aug. 21, 2003]

MonkeeSage

1:31 pm on Aug 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Giacomo:

Thank you for insight to look at the cache. :)

Google only indexed it because it thought it was text, on account of some kind of error. We already know that from all the normal SERPs.

But there is something interesting about the cached copy now that I've looked at it. It looks prima facie as though a document.write() error caused the garbled mess at the bottom, but on closer inspection it did not (try turning off JS and viewing the page again: no difference, it's in the actual cached copy).

It includes a function definition "function ScrapbookPopup( url ) {" -- that is to say, it seems to be (partially) an excerpt from a JS block somewhere. Interesting, no? How did a piece of JS end up being seen by Google as page text, abstracted from the JS block?

But I'll leave us to our own speculations about that. :)

Jordan

Ps. claus, the font tags are generated by Google to highlight the words it found; they're not part of the original document.

This 61-message thread spans 3 pages.