Forum Moderators: open
Did anybody see something like that in the logs? I tried a search here at WebmasterWorld and didn't turn up anything. Nor did I see it before in our logs.
This happened a day after a full crawl of our sites, although we have had the respective *.js files in place since the beginning of the year.
I'm curious about what 'Googlebot/Test' is up to.
Well, I haven't seen GBot/Test ignoring robots.txt; on the contrary, my logs indicate it does obey robots.txt, but it doesn't read robots.txt often.
So when I just tried to give it the boot for bloating my logfiles, it was to no avail. I guess this bot figures if it's welcomed once, it's okay to come back and pig out and not ask again. And it does have a big appetite for .js files.
However, its main appetite seems to be JavaScript files.
A year or so ago, GG commented on people using javascript to hide outgoing links and prevent PR from being passed.
[webmasterworld.com...]
Maybe the bot is checking for that, among other things.
Kaled, I did not call you a liar. I apologise, though, for not saying: "you should be careful of making this definitive statement, since your or my experience is not enough to draw definitive conclusions."
Your example in fact shows two indexed files that contain JavaScript functions and that are named .jscr - btw, the only .jscr files I have found at Google. What is different from regular external .js files:
- these files are named .jscr (is this even a valid extension?)
- these files are nowhere specified as JavaScript, just <script></script>
- Googlebot has no way of knowing that these are JavaScript files
- the files could be linked by an href from somewhere
So I don't see this as evidence of GBot indexing JavaScript files.
The only links to these files (on my site) are of the form <SCRIPT SRC="script.jscr">
Are you seriously telling me that Google can't identify these files as javascript? I should also point out that GoogleGuy seemed unsurprised when I mentioned that my javascript files were being indexed.
Enough reasons for me to say that you're either wrong or simply trying to blow another bubble.
Kaled, I did not call you a liar.
Well, I wasn't wrong. As I said, my javascript files ARE INDEXED.
So was I blowing another bubble? And just exactly what bubbles have I blown in the past?
Kaled.
It has been suggested that I move my js files into another directory and use robots.txt to disallow that directory (by Brett I think, but I could be mistaken).
Perhaps other people do exactly that which would explain why javascript files are not more commonly indexed. I don't know, and frankly, I don't care.
Kaled.
I have never heard of external .js files with the suffix ".jscr". I didn't find anything about the suffix jscr [google.com] when searching Google. Even developer.netscape.com doesn't mention this suffix. And the only posts here at WebmasterWorld that mention jscr [google.com] are two of your posts. Oh well, I guess I've learned something new.
In future, when someone mentions anomalous behaviour, ask them to sticky you the relevant url(s) - that way you can investigate. Accusing people of being wrong or blowing bubbles does not progress discussions, does it?
As for my use of the extension .jscr: if I were a cartoon fan I could use the extension .loonytunes, and I believe browsers would still handle it just fine. I think I'm right in saying I could do the same for my html files, but I'm not going to waste time trying it.
Kaled.
PS Yes, I have read your sticky mail.
I will.
>I could use the extension .loonytunes and I believe browsers would still handle it just fine.
And GBot would crawl it, I guess. My intention is not to be pedantic. The point I'm trying to make with the suffix discussion is that this is not a W3C standard, and therefore Google (probably) handles it differently. It's - in my eyes - no proof of Google indexing JavaScript!
I will do a test like this in the next few days:
<script src="test.goofy"></script>
<script src="goofy.js"></script>
... both lines on the same page.
I tend to bet that the .goofy file might get indexed by google while the .js file won't.
>it doesn't read robots.txt often.
Your javascripts (.jscr) return this:
Content-Type: text/plain
while a regular javascript file should return:
Content-Type: application/x-javascript
Afaik, robot indexing is all about MIME types. So when a server isn't set up to return the correct MIME type for a file, obscure files (without appropriate handlers) get treated by robots/indexers as regular text files and therefore get indexed.
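To make the MIME-type argument concrete, here is a minimal sketch (the mapping table and fallback are hypothetical, not Google's or Apache's actual logic) of how a server without a handler for an extension ends up serving text/plain, which an indexer keying off Content-Type could then treat as ordinary indexable text:

```javascript
// Hypothetical sketch of extension-to-MIME handling. A server with
// no mapping for an extension falls back to text/plain, and an
// indexer that keys off Content-Type may then index the file as text.
var mimeTypes = {
  ".js": "application/x-javascript",
  ".html": "text/html"
};

function contentTypeFor(filename) {
  var dot = filename.lastIndexOf(".");
  var ext = dot >= 0 ? filename.slice(dot) : "";
  // Unknown extensions (.jscr, .loonytunes, ...) fall back to text/plain
  return mimeTypes[ext] || "text/plain";
}

function isIndexableAsText(contentType) {
  // Simplified rule for illustration only
  return contentType === "text/plain";
}
```

Under this (assumed) rule, script.js would be skipped as JavaScript while script.jscr would be indexed as plain text - exactly the difference observed in the thread.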
Those of you who see your .js files requested by GBot: would you mind checking the content type of those files using the Server Header Checker [searchengineworld.com] and reporting back?
This would be the correct httpd.conf entry for your files, kaled:
AddType application/x-javascript .jscr
AddType application/x-javascript .loonytunes
Now that makes perfect sense. But I'm not sure my host lets me do anything about this. I'll investigate when I have time. Perhaps if I change from .jscr to .js my host will supply the desired headers.
Kaled.
PS I don't know if it's relevant, but whilst there are few if any .js files indexed by Google, there are many entries for www.....myscript.js?param=value
I didn't think to ask if that applied to on-page or external JS, or both. My bad.
This could - as powdork mentioned - very well put an end to js-redirected doorways.
Unfortunately not:
- the redirect could be triggered by a series of events that the bot surely won't be able to simulate when executing the code;
- furthermore, in JS you can write code on the fly using eval();
- or think of document.write, etc.
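A sketch of the eval() point, assuming a hypothetical doorway page; the string fragments and the URL are invented for illustration. A static scan of the source never sees the assembled target:

```javascript
// Hypothetical doorway sketch: the redirect target is assembled at
// runtime from fragments, so neither the domain nor a complete
// location assignment appears literally in the source.
function buildTarget() {
  var parts = ["exa", "mple", ".com"];   // fragments only
  var host = parts.join("");
  // eval() constructs the value on the fly; scanning this file for
  // "example.com" (or a redirect URL) finds nothing.
  return eval('"http://" + host + "/landing"');
}
// A real doorway would then do: window.location.href = buildTarget();
```

This is why a bot would have to actually execute the script, not just parse it, to discover where the page sends visitors.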
Similarly, document.write is not a major problem; however, all this will take a vast quantity of CPU time. It then takes even more CPU power to analyse what's happening, and perhaps more still to simulate user interaction. And then, if Google decide to make use of all this work, presumably they will ban some pages/sites; but because the algos will be less than perfect, innocent sites will suffer whilst guilty ones will find a way around all this.
Kaled.
Other travel sites appear to have various different techniques to get around this.
I refer to the spider sim here on webmaster world & one called poooodle...
"GET /foresee/stdLauncher.js HTTP/1.1" 302 299 "-" "Googlebot/Test"
and
"GET /foresee/triggerParams.js HTTP/1.1" 302 299 "-" "Googlebot/Test"
There are many more. It may be programmed to look for scripts which are known violators.
I wonder if this will put an end to the js redirected doorway pages that seem to be doing so well now.
While this is the first thing one might think of, I am wondering if this has anything to do with another problem I have seen recently...
Page "cloaking" via client-side Javascript
Where keyword-stuffed spider food is swapped with the actual page contents by selectively "commenting out" the undesired version via a function in an external .js file, based on the userAgent.
For example -
var reg1 = /.o.g.e.ot/i;
var reg2 = /sl.rp/i;
var reg3 = /i.k.o.i/i;
...
if (! reg1.test(navigator.userAgent) &&...
document.write...
Where they're doing a pattern match on the userAgent looking for Googlebot.
Note how they seem to be assuming that Googlebot is actually executing the client-side Javascript code - I'm not sure if they know something we don't, or if this is just confused programming.
However, unless someone has their browser's userAgent set to return "Googlebot", the code still has the desired effect. And because the changes are happening client-side, there are no differences in the page that could be spotted by typical anti-cloaking detection methods.
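Putting the fragments above together, the cloak presumably works something like this sketch (the regexes are the ones quoted from the thread; the page contents and function name are hypothetical):

```javascript
// Sketch of the user-agent cloak described above. The regexes come
// from the thread; the strings returned are invented placeholders.
var reg1 = /.o.g.e.ot/i;   // matches "Googlebot"
var reg2 = /sl.rp/i;       // matches "Slurp" (Inktomi/Yahoo crawler)
var reg3 = /i.k.o.i/i;     // matches "Inktomi"

function pageFor(userAgent) {
  // If no spider pattern matches, document.write swaps in the human
  // version; otherwise the keyword-stuffed markup is left for the bot.
  if (!reg1.test(userAgent) && !reg2.test(userAgent) && !reg3.test(userAgent)) {
    return "human version written via document.write";
  }
  return "keyword-stuffed spider food";
}
```

Since real browsers never report a userAgent matching those patterns, the swap fires for every human visitor, while a bot that doesn't execute JavaScript only ever sees the spider food.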
The site where I came across this is getting #1 listings in the SERPs above valid sites. Even worse, the browser version of these cloaked pages is nothing but affiliate spam!
Google may already be trying to address this problem, which might explain the test bot spidering your external Javascript files.
Per the rules, I can't include specifics regarding the site in my post.
However, GoogleGuy, if you're still out there, I would be more than happy to send you the details so you can pass this on to the right people. I seem to recall you providing a specific addy/subject line during some of the updates for such purposes - just let me know where to send it and what to title it, and I'll be glad to get it to ya.
Within a few weeks the number of pages indexed by Google jumped to over a thousand. Many of them now ranked very high for fairly common words in their market.
Sometime over the last month or so everything has dropped - both the number of pages and the rank.
Perhaps googlebot is now following the .js links in the menu and nailing them for duplicate content? That would be a sad state of affairs.
Attempting to test for Googlebot as the user agent at the scripting level is close to barking mad.
kaled, I would agree, as I said, confused programming.
However, as I also indicated, the code still works, as the UA for most browsers will not identify themselves as Googlebot!
That is also confirmed by the fact that these cloaked pages are successfully fooling Google and enjoying many #1 listings above legitimate sites, based on contents which do not appear when viewed by a browser.
In fact, the page title and descriptive "snippet" in the Google listing are even different from what you see from a browser! (which is what first drew my attention to these pages)
I also noticed that all the affiliate links I saw in my browser (Commission Junction links to eBay listings) did not appear when I looked at the source later. At first I thought they had changed the page, but a closer inspection revealed a second Javascript function, designed to document.write a fake TITLE, along with an IFRAME containing a whole different "cloaked" page to be viewed by the browser!
The result is the entire cloaked page contents are swapped-out, including the title.
What Googlebot sees is a few external Javascript calls, followed by a web page only visible to the spider.
What a browser sees is a different Javascript-generated Title, a Javascript-generated IFRAME containing a whole different web page, and a very large HTML comment.
Spider sees one page, browser sees another. Classic cloaking, done client-side to evade detection.
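A hedged reconstruction of that second function; the title, iframe URL, and markup are hypothetical stand-ins for what was actually observed:

```javascript
// Hypothetical reconstruction of the client-side swap: a second JS
// function document.writes a fake title and an iframe holding the
// browser-only page, then opens an HTML comment to hide the
// spider food that follows in the static source.
function browserVersion() {
  var html = "";
  html += "<title>Fake Title Shown To Browsers</title>";
  html += '<iframe src="affiliate-page.html"></iframe>';
  html += "<!-- ";  // everything after this is commented out for humans
  return html;      // a real page would pass this to document.write
}
```

Because the swap happens after page load in the client, a fetch of the raw HTML (which is all a comparison-based cloaking detector sees) shows no discrepancy at all.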
Would love to get all the details to Google, so they can deal with this.
GoogleGuy, you still out there?