Forum Moderators: open
Did anybody see something like that in the logs? I tried a search here at WebmasterWorld and didn't turn up anything. Nor did I see it before in our logs.
This happened a day after a full crawl of our sites, although we have had the respective *.js files in place since the beginning of the year.
I'm curious about what 'Googlebot/Test' is up to.
Well, I haven't seen GBot/Test ignoring robots.txt; on the contrary, my logs indicate it does obey robots.txt, but it doesn't read robots.txt often.
So when I just tried to give it the boot for bloating my logfiles, it was to no avail. I guess this bot figures if it's welcomed once, it's okay to come back and pig out and not ask again. And it does have a big appetite for .js files.
However, its main appetite seems to be JavaScript files.
A year or so ago, GG commented on people using javascript to hide outgoing links and prevent PR from being passed.
[webmasterworld.com...]
Maybe the bot is checking for that, among other things.
Kaled, I did not call you a liar. I apologise, though, for not saying: "you should be careful of making this definitive statement, since your or my experience is not enough to draw definitive conclusions."
Your example in fact shows two indexed files that contain JavaScript functions and that are named .jscr - btw, the only .jscr files I have found at Google. What is different from regular external .js files:
- these files are named .jscr (is this even a valid extension?)
- these files are nowhere specified as JavaScript, just <script></script>
- Googlebot has no way of knowing that these are JavaScript files
- the files could be linked by an href from somewhere
So I don't see this as evidence of GBot indexing JavaScript files.
The only links to these files (on my site) are of the form <SCRIPT SRC="script.jscr">
Are you seriously telling me that Google can't identify these files as javascript? I should also point out that GoogleGuy seemed unsurprised when I mentioned that my javascript files were being indexed.
Enough reasons for me to say that you're either wrong or simply trying to blow another bubble.
Kaled, I did not call you a liar.
Well, I wasn't wrong. As I said, my javascript files ARE INDEXED.
So was I blowing another bubble? And just exactly what bubbles have I blown in the past?
Kaled.
It has been suggested that I move my js files into another directory and use robots.txt to disallow that directory (by Brett I think, but I could be mistaken).
Perhaps other people do exactly that which would explain why javascript files are not more commonly indexed. I don't know, and frankly, I don't care.
Kaled.
I have never heard of external .js files with the suffix ".jscr". I didn't find anything about the suffix jscr [google.com] when searching Google. Even developer.netscape.com doesn't mention this suffix. And the only posts here at WebmasterWorld that mention jscr [google.com] are two of your posts. Oh well, I guess I've learned something new.
In future, when someone mentions anomalous behaviour, ask them to sticky you the relevant url(s) - that way you can investigate. Accusing people of being wrong or blowing bubbles does not progress discussions, does it?
As for my use of the extension .jscr: if I were a cartoon fan I could use the extension .loonytunes, and I believe browsers would still handle it just fine. I think I'm right in saying I could do the same for my html files, but I'm not going to waste time trying it.
Kaled.
PS Yes, I have read your sticky mail.
I will.
>I could use the extension .loonytunes and I believe browsers would still handle it just fine.
And GBot would crawl it, I guess. My intention is not to be pedantic. The point I'm trying to make with the suffix discussion is that this is not a W3C standard, and therefore Google (probably) handles it differently. It's - in my eyes - no proof of Google indexing JavaScript!
I will do a test like this in the next few days:
<script src="test.goofy"></script>
<script src="goofy.js"></script>
... both lines on the same page.
I tend to bet that the .goofy file might get indexed by google while the .js file won't.
>it doesn't read robots.txt often.
Your javascripts (.jscr) return this:
Content-Type: text/plain
while a regular javascript file should return:
Content-Type: application/x-javascript
Afaik, robot indexing is all about MIME types. So when a server isn't set up to return the correct MIME type for a file, obscure files (without appropriate handlers) get treated by robots/indexers as regular text files and therefore get indexed.
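To make the MIME-type argument concrete, here is a minimal sketch (the mapping table and fallback are hypothetical, not Google's or Apache's actual logic) of how a server without a handler for an extension ends up serving text/plain, which an indexer keying off Content-Type could then treat as ordinary indexable text:

```javascript
// Hypothetical sketch of extension-to-MIME handling. A server with
// no mapping for an extension falls back to text/plain, and an
// indexer that keys off Content-Type may then index the file as text.
var mimeTypes = {
  ".js": "application/x-javascript",
  ".html": "text/html"
};

function contentTypeFor(filename) {
  var dot = filename.lastIndexOf(".");
  var ext = dot >= 0 ? filename.slice(dot) : "";
  // Unknown extensions (.jscr, .loonytunes, ...) fall back to text/plain
  return mimeTypes[ext] || "text/plain";
}

function isIndexableAsText(contentType) {
  // Simplified rule for illustration only
  return contentType === "text/plain";
}
```

Under this (assumed) rule, script.js would be skipped as JavaScript while script.jscr would be indexed as plain text - exactly the difference observed in the thread.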
Those of you who see your .js files requested by GBot: would you mind checking the content type of those files using the Server Header Checker [searchengineworld.com] and reporting back?
This would be the correct httpd.conf entry for your files, kaled:
AddType application/x-javascript .jscr
AddType application/x-javascript .loonytunes
Now that makes perfect sense. But I'm not sure my host lets me do anything about this. I'll investigate when I have time. Perhaps if I change from .jscr to .js my host will supply the desired headers.
Kaled.
PS I don't know if it's relevant, but whilst there are few if any .js files indexed by Google, there are many entries for www.....myscript.js?param=value
I didn't think to ask if that applied to on-page or external JS, or both. My bad.
This could - as powdork mentioned - very well put an end to js-redirected doorways.
Unfortunately not:
- the redirect could be triggered by a series of events that the bot surely won't be able to simulate when executing the code;
- furthermore, in JS you can write code on the fly using eval();
- or think of document.write, etc.
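A sketch of the eval() point, assuming a hypothetical doorway page; the string fragments and the URL are invented for illustration. A static scan of the source never sees the assembled target:

```javascript
// Hypothetical doorway sketch: the redirect target is assembled at
// runtime from fragments, so neither the domain nor a complete
// location assignment appears literally in the source.
function buildTarget() {
  var parts = ["exa", "mple", ".com"];   // fragments only
  var host = parts.join("");
  // eval() constructs the value on the fly; scanning this file for
  // "example.com" (or a redirect URL) finds nothing.
  return eval('"http://" + host + "/landing"');
}
// A real doorway would then do: window.location.href = buildTarget();
```

This is why a bot would have to actually execute the script, not just parse it, to discover where the page sends visitors.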
Similarly, document.write is not a major problem; however, all this will take a vast quantity of CPU time. It then takes even more CPU power to analyse what's happening, and perhaps more still to simulate user interaction. And then, if Google decide to make use of all this work, presumably they will ban some pages/sites; but because the algos will be less than perfect, innocent sites will suffer whilst guilty ones will find a way around all this.
Kaled.
Other travel sites appear to have various different techniques to get around this.
I refer to the spider sim here on webmaster world & one called poooodle...
"GET /foresee/stdLauncher.js HTTP/1.1" 302 299 "-" "Googlebot/Test"
and
"GET /foresee/triggerParams.js HTTP/1.1" 302 299 "-" "Googlebot/Test"
There are many more. It may be programmed to look for scripts which are known violators.
I wonder if this will put an end to the js redirected doorway pages that seem to be doing so well now.
While this is the first thing one might think of, I am wondering if this has anything to do with another problem I have seen recently...
Page "cloaking" via client-side Javascript
Where keyword-stuffed spider food is swapped with the actual page contents by selectively "commenting out" the undesired version via a function in an external .js file, based on the userAgent.
For example -
var reg1 = /.o.g.e.ot/i;
var reg2 = /sl.rp/i;
var reg3 = /i.k.o.i/i;
...
if (! reg1.test(navigator.userAgent) &&...
document.write...
Where they're doing a pattern match on the userAgent looking for Googlebot.
Note how they seem to be assuming that Googlebot is actually executing the client-side Javascript code - I'm not sure if they know something we don't, or if this is just confused programming.
However, unless someone has their browser's userAgent set to return "Googlebot", the code still has the desired effect. And because the changes are happening client-side, there are no differences in the page that could be spotted by typical anti-cloaking detection methods.
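Putting the fragments above together, the cloak presumably works something like this sketch (the regexes are the ones quoted from the thread; the page contents and function name are hypothetical):

```javascript
// Sketch of the user-agent cloak described above. The regexes come
// from the thread; the strings returned are invented placeholders.
var reg1 = /.o.g.e.ot/i;   // matches "Googlebot"
var reg2 = /sl.rp/i;       // matches "Slurp" (Inktomi/Yahoo crawler)
var reg3 = /i.k.o.i/i;     // matches "Inktomi"

function pageFor(userAgent) {
  // If no spider pattern matches, document.write swaps in the human
  // version; otherwise the keyword-stuffed markup is left for the bot.
  if (!reg1.test(userAgent) && !reg2.test(userAgent) && !reg3.test(userAgent)) {
    return "human version written via document.write";
  }
  return "keyword-stuffed spider food";
}
```

Since real browsers never report a userAgent matching those patterns, the swap fires for every human visitor, while a bot that doesn't execute JavaScript only ever sees the spider food.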
The site where I came across this is getting #1 listings in the SERPs above valid sites. Even worse, the browser version of these cloaked pages is nothing but affiliate spam!
Google may already be trying to address this problem, which might explain the test bot spidering your external Javascript files.
Per the rules, I can't include specifics regarding the site in my post.
However, GoogleGuy, if you're still out there, I would be more than happy to send you the details so you can pass this on to the right people. I seem to recall you providing a specific addy/subject line during some of the updates for such purposes - just let me know where to send it and what to title it, and I'll be glad to get it to ya.
Within a few weeks the number of pages indexed by Google jumped to over a thousand. Many of them now ranked very high for fairly common words in their market.
Sometime over the last month or so everything has dropped - both the number of pages and the rank.
Perhaps googlebot is now following the .js links in the menu and nailing them for duplicate content? That would be a sad state of affairs.
Attempting to test for Googlebot as the user agent at the scripting level is close to barking mad.
kaled, I would agree, as I said, confused programming.
However, as I also indicated, the code still works, as the UA for most browsers will not identify themselves as Googlebot!
That is also confirmed by the fact that these cloaked pages are successfully fooling Google and enjoying many #1 listings above legitimate sites, based on contents which do not appear when viewed by a browser.
In fact, the page title and descriptive "snippet" in the Google listing are even different from what you see from a browser! (which is what first drew my attention to these pages)
I also noticed that all the affiliate links I saw in my browser (Commission Junction links to eBay listings) did not appear when I looked at the source later. At first I thought they had changed the page, but a closer inspection revealed a second Javascript function, designed to document.write a fake TITLE, along with an IFRAME containing a whole different "cloaked" page to be viewed by the browser!
The result is the entire cloaked page contents are swapped-out, including the title.
What Googlebot sees is a few external Javascript calls, followed by a web page only visible to the spider.
What a browser sees is a different Javascript-generated Title, a Javascript-generated IFRAME containing a whole different web page, and a very large HTML comment.
Spider sees one page, browser sees another. Classic cloaking, done client-side to evade detection.
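A hedged reconstruction of that second function; the title, iframe URL, and markup are hypothetical stand-ins for what was actually observed:

```javascript
// Hypothetical reconstruction of the client-side swap: a second JS
// function document.writes a fake title and an iframe holding the
// browser-only page, then opens an HTML comment to hide the
// spider food that follows in the static source.
function browserVersion() {
  var html = "";
  html += "<title>Fake Title Shown To Browsers</title>";
  html += '<iframe src="affiliate-page.html"></iframe>';
  html += "<!-- ";  // everything after this is commented out for humans
  return html;      // a real page would pass this to document.write
}
```

Because the swap happens after page load in the client, a fetch of the raw HTML (which is all a comparison-based cloaking detector sees) shows no discrepancy at all.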
Would love to get all the details to Google, so they can deal with this.
GoogleGuy, you still out there?