>> GBot/Test ignoring robots.txt
Well, I haven't seen GBot/Test ignoring robots.txt. On the contrary, my logs indicate it does obey robots.txt, but it doesn't read robots.txt often.
So when I just tried to give it the boot for bloating my logfiles, it was to no avail. I guess this bot figures if it's welcomed once, it's okay to come back and pig out and not ask again. And it does have a big appetite for .js files.
Maybe the bot is checking for that, among other things.
>If you've read many of my posts you would know that I very
>rarely make definitive statements. When I do, you can bet
>your life that I can back them up.
>When you have read and verified the stickymail I am about
>to send you, I trust you will apologise for calling me a liar.
Kaled i did not call you a liar. I apologise though for not saying: "you should be careful of making this definitive statement since your or my experience is not enough to draw definitive conclusions."
- these files are named .jscr (is this valid?)
- the files could be linked by a href from somewhere
> - these files are named .jscr (is this valid?)
YES. Why would you believe otherwise?
> - the files could be linked by a href from somewhere
The only links to these files (on my site) are of the form <SCRIPT SRC="script.jscr">
|Enough reasons for me to say that you're either wrong or simply just trying to blow another bubble |
|Kaled i did not call you a liar. |
So was I blowing another bubble? And just exactly what bubbles have I blown in the past?
One more thought.
It has been suggested that I move my js files into another directory and use robots.txt to disallow that directory (by Brett I think, but I could be mistaken).
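For anyone weighing that suggestion, here is a minimal sketch of how a Disallow rule on a script directory would apply to a request path. The directory name /scripts/ and the function name isDisallowed are made up for illustration, and the matching is a deliberately simplified reading of robots.txt (plain prefix matching, one User-agent line per group), not a description of how Googlebot actually parses the file.

```javascript
// Simplified robots.txt check: only "User-agent" and "Disallow" lines,
// plain prefix matching, and one User-agent line per group.
function isDisallowed(robotsTxt, userAgent, path) {
  let applies = false; // true while we are inside a group that applies to this bot
  let disallowed = false;
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // strip comments
    const idx = line.indexOf(":");
    if (line === "" || idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === "user-agent") {
      applies = value === "*" ||
        userAgent.toLowerCase().includes(value.toLowerCase());
    } else if (field === "disallow" && applies && value !== "" &&
               path.startsWith(value)) {
      disallowed = true;
    }
  }
  return disallowed;
}

// Hypothetical robots.txt that keeps all bots out of one script directory.
const robotsTxt = [
  "User-agent: *",
  "Disallow: /scripts/",
].join("\n");

console.log(isDisallowed(robotsTxt, "Googlebot/Test", "/scripts/menu.js"));
console.log(isDisallowed(robotsTxt, "Googlebot/Test", "/index.html"));
```

Of course, this only helps if the bot re-reads robots.txt after the rule is added, which is exactly the problem described earlier in the thread.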
>> these files are named .jscr (is this valid?)
>YES. Why would you believe otherwise?
I had never heard of external js files with the suffix ".jscr". I didn't find anything about the suffix jscr [google.com] when searching Google. Even developer.netscape.com doesn't mention this suffix. And the only posts here at WebmasterWorld that mention jscr [google.com] are two of your posts. Oh well, I guess I've learned something new.
In future, when someone mentions anomalous behaviour, ask them to sticky you the relevant url(s) - that way you can investigate. Accusing people of being wrong or blowing bubbles does not progress discussions, does it?
As for my use of the extension .jscr, if I were a cartoon fan I could use the extension .loonytunes and I believe browsers would still handle it just fine. I think I'm right in saying I could do the same for my html files but I'm not going to waste time trying it.
PS Yes, I have read your sticky mail.
>In future, ...
>I could use the extension .loonytunes and I believe browsers would still handle it just fine.
I will run a test like this in the next few days:
.. both lines on the same page.
I tend to bet that the .goofy file might get indexed by google while the .js file won't.
|it doesn't read robots.txt often. |
True. On my main site, it read robots.txt on Mar 5 and 19.
The bot visited that site on Mar 5, 15, 16, 17, 19 and obeyed the robots.txt that excludes the directory where all .js files are stored. One of these files is called by the html file notoriously requested by the bot.
Kaled, I guess I found an explanation for your observation (google indexing js): Mime Type / Content Type.
Afaik, robot indexing is all about mime types. So in cases where a server isn't set up to return the correct mime type for a file, obscure files (without appropriate handlers) get treated by robots / indexers as regular text files and therefore get indexed.
Those of you who see their js files requested by gbot: would you mind checking the content type of these files using the Server Header Checker [searchengineworld.com] and reporting back?
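To make the theory concrete, here is a rough sketch of the mechanism. The mapping table, the function names, and the text/plain fallback are assumptions chosen for illustration; the idea that an indexer treats any text/* response as indexable is the guess from this post, not documented Google behaviour.

```javascript
// Hypothetical extension-to-MIME table, standing in for a server's configuration.
const mimeTypes = {
  ".html": "text/html",
  ".js": "application/x-javascript",
};

// Content type a server would send; unmapped extensions fall back to a default.
function contentTypeFor(filename, defaultType = "text/plain") {
  const dot = filename.lastIndexOf(".");
  const ext = dot === -1 ? "" : filename.slice(dot).toLowerCase();
  return mimeTypes[ext] || defaultType;
}

// Assumption from the post: an indexer treats any text/* response as plain
// text worth indexing.
function looksIndexable(contentType) {
  return contentType.startsWith("text/");
}

for (const name of ["script.js", "script.jscr"]) {
  const type = contentTypeFor(name);
  console.log(name, type, looksIndexable(type));
}
```

Under these assumptions the .jscr file goes out as text/plain and looks like ordinary text to an indexer, while the .js file does not.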
This would be the correct httpd.conf entry for your files, kaled:
Now that makes perfect sense. But I'm not sure my host lets me do anything about this. I'll investigate when I have time. Perhaps if I change from .jscr to .js my host will supply the desired headers.
PS I don't know if it's relevant, but whilst there are not any/many .js files indexed by Google, there are many entries for www.....myscript.js?param=value
Yidaki, this bot has grabbed my .js files and all return the correct content type.
This just in ... log files from yesterday show a sharp drop in Google traffic for the three sites where googlebot/test just sucked up .js files.
Anyone else see this? Looks like too sharp of a drop to blame on coincidence.
I didn't think to ask if that applied to on-page or external JS, or both. My bad.
bot changed UA:
hopefully it's trying to spider the links in dropdown boxes
2004-03-23 22:19:56 126.96.36.199 - 188.8.131.52 443 GET /robots.txt - 200 0 2375 170 www.widgets.com Googlebot/Test+(+http://www.googlebot.com/bot.html)
Is now getting files with SSL (443), thought Google didn't bother with our secure pages...
|This could - like powdork mentioned - very well put an end to js-redirected doorways |
- the redirect could be triggered by a series of events that the bot surely won't be able to simulate when executing the code.
- further more in JS you can write code on-the-fly using eval()
- or think of document.write etc.
Similarly, document.write is not a major problem, however, all this will take a vast quantity of CPU time. It then takes even more CPU power to analyse what's happening, and perhaps more still to simulate user interaction. And then, if Google decide to make use of all this work, presumably, they will ban some pages/sites, but because the algos will be less than perfect, innocent sites will suffer whilst guilty ones will find a way around all this.
> the redirect could be triggered by a series of events that the bot surely won't be able to simulate when executing the code.
sure it is - mozilla source code is open source. simply put a trap on the redirect routine.
|sure it is - mozilla source code is open source. simply put a trap on the redirect routine. |
The redirect won't be triggered if executed by a bot, so how would a breakpoint help?
The code doesn't exist at loading time and will only be assembled after user interaction.
Has anyone come to any sort of conclusion as to whether googlebot spiders the text in drop-down JavaScript menus & includes the text as actual body text?
I am working on a travel site & spider simulators show that the huge list of ferry routes for example is counted as body text.
Other travel sites appear to have various different techniques to get around this.
I refer to the spider sim here on webmaster world & one called poooodle...
I noticed that it is indexing email addresses that I place in external .js files, but recently it searched for about 15 .js and .asp files that have never existed on my sites. It even included specific subdirectories which do not exist on my sites. Here are a couple of examples:
"GET /foresee/stdLauncher.js HTTP/1.1" 302 299 "-" "Googlebot/Test"
"GET /foresee/triggerParams.js HTTP/1.1" 302 299 "-" "Googlebot/Test"
There are many more. It may be programmed to look for scripts which are known violators.
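If you want to pull requests like these out of your own logs, something along these lines may help. It is only a sketch: the regular expression assumes the request, status, size, referer and user agent appear in the order shown in the two lines above, and the third sample line (a browser request) is invented for contrast.

```javascript
// Matches: "METHOD path protocol" status size "referer" "user agent"
const lineRe = /"([A-Z]+) (\S+) [^"]*" (\d{3}) (\S+) "([^"]*)" "([^"]*)"/;

function parseLine(line) {
  const m = lineRe.exec(line);
  if (!m) return null; // line is not in the expected format
  return {
    method: m[1],
    path: m[2],
    status: Number(m[3]),
    referer: m[5],
    userAgent: m[6],
  };
}

// Requests for .js or .asp files made by a user agent containing botToken.
function botScriptRequests(lines, botToken) {
  return lines
    .map(parseLine)
    .filter((e) => e !== null &&
      e.userAgent.includes(botToken) &&
      /\.(js|asp)(\?|$)/i.test(e.path));
}

const sampleLines = [
  '"GET /foresee/stdLauncher.js HTTP/1.1" 302 299 "-" "Googlebot/Test"',
  '"GET /foresee/triggerParams.js HTTP/1.1" 302 299 "-" "Googlebot/Test"',
  // Invented browser request, included only to show it is filtered out.
  '"GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
];

for (const hit of botScriptRequests(sampleLines, "Googlebot/Test")) {
  console.log(hit.status, hit.path);
}
```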
I wonder if this will put an end to the js redirected doorway pages that seem to be doing so well now.
While this is the first thing one might think of, I am wondering if this has anything to do with another problem I have seen recently...
Where keyword-stuffed spider-food is swapped with the actual page contents, by selectively "commenting-out" the undesired version via a function in an external js file, based on the userAgent.
For example -
var reg1 = /.o.g.e.ot/i;
var reg2 = /sl.rp/i;
var reg3 = /i.k.o.i/i;
if (! reg1.test(navigator.userAgent) &&...
Where they're doing a pattern match on the userAgent looking for Googlebot.
However, unless someone has their browser's userAgent set to return "Googlebot", the code still has the desired effect. And because the changes are happening client-side, there are no differences in the page that could be spotted by typical anti-cloaking detection methods.
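To check what those three patterns actually match, here is a small sketch that runs them against a few user agent strings. The sample strings and the helper name matchesCrawlerPattern are mine, picked for illustration; the patterns are the ones quoted above.

```javascript
const reg1 = /.o.g.e.ot/i; // matches "Googlebot"
const reg2 = /sl.rp/i;     // matches "Slurp"
const reg3 = /i.k.o.i/i;   // matches "inktomi"

// True if any of the three obfuscated patterns matches the user agent.
function matchesCrawlerPattern(userAgent) {
  return reg1.test(userAgent) || reg2.test(userAgent) || reg3.test(userAgent);
}

// Illustrative user agent strings, not taken from the site in question.
const samples = [
  "Googlebot/Test",
  "Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; http://www.inktomi.com/slurp.html)",
  "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
];

for (const ua of samples) {
  console.log(matchesCrawlerPattern(ua), ua);
}
```

So the dots are just a thin disguise for the spider names, and the check only ever sees whatever the script engine reports as its user agent.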
The site where I came across this is getting #1 listings in the SERPs above valid sites. Even worse, the browser version of these cloaked pages are nothing but affiliate spam!
Per the rules, I can't include specifics regarding the site in my post.
However, GoogleGuy, if you're still out there, I would be more than happy to send you the details so you can pass this on to the right people. I seem to recall you providing a specific addy/subject line during some of the updates for such purposes. Just let me know where and what to title it and I'll be glad to get it to ya.
Well, I don't use java, I don't use redirects, just old fashioned HTML, last night the Googlebot/Test dined on many of my pages - all were ones that I lost titles and descriptions for about a month ago, now. Let's hope this means there is a light at the end of that tunnel!
Attempting to test for Googlebot as the user agent at the scripting level is close to barking mad. There is no reason whatsoever why any scripting engine used by Google would identify itself as anything other than, say, IE6 or Opera, or null or whatever.
And of course a site map page to link to all those new pages.
Within a few weeks the number of pages indexed by google jumped to over a thousand. For many they now ranked very high for fairly common words in their market.
Sometime over the last month or so everything has dropped: the number of pages and the rank.
Perhaps googlebot is now following the .js links in the menu and nailing them for duplicate content? That would be a sad state of affairs.
|Attempting to test for Googlebot as the user agent at the scripting level is close to barking mad. |
kaled, I would agree, as I said, confused programming.
However, as I also indicated, the code still works, as most browsers will not identify themselves as Googlebot in their UA!
That is also confirmed by the fact that these cloaked pages are successfully fooling Google and enjoying many #1 listings above legitimate sites, based on contents which do not appear when viewed by a browser.
In fact, the page title and descriptive "snippet" in the Google listing are even different from what you see from a browser! (which is what first drew my attention to these pages)
The result is the entire cloaked page contents are swapped-out, including the title.
Spider sees one page, browser sees another. Classic cloaking, done client-side to evade detection.
Would love to get all the details to Google, so they can deal with this.
GoogleGuy, you still out there?