Are you sure it was googlebot? There is a hidden text filter but it is only invoked as a result of a spam report at present (although that may be old news).
Was the IP from the normal googlebot range?
126.96.36.199 - - [22/Jul/2003:02:36:50 -0400] "GET /style/MYMENU.js HTTP/1.0" 200 25772 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Is this normal? New?
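For anyone wanting to check their own logs for the same thing, here is a minimal sketch that splits a combined-format log line like the one above into fields and flags requests for .js files from a client that claims to be Googlebot. The function names (parseLogLine, claimsGooglebot, isScriptRequest) are made up for illustration, and the user-agent string can be faked, so the IP range question still has to be answered separately.

```javascript
// Hypothetical helper: split one combined-format access log line into fields.
// Returns null when the line does not match the expected layout.
function parseLogLine(line) {
  const pattern = /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) ([^"]*)" (\d{3}) (\d+|-) "([^"]*)" "([^"]*)"$/;
  const m = pattern.exec(line.trim());
  if (m === null) return null;
  return {
    ip: m[1],
    time: m[2],
    method: m[3],
    path: m[4],
    protocol: m[5],
    status: Number(m[6]),
    bytes: m[7] === '-' ? 0 : Number(m[7]),
    referer: m[8],
    userAgent: m[9],
  };
}

// True when the user-agent string claims to be Googlebot.
// This is only a claim: it says nothing about where the request really came from.
function claimsGooglebot(entry) {
  return /Googlebot\//i.test(entry.userAgent);
}

// True when the requested path (ignoring any query string) ends in .js.
function isScriptRequest(entry) {
  const pathOnly = entry.path.split('?')[0];
  return pathOnly.toLowerCase().endsWith('.js');
}

// The log line quoted in this thread, used as sample input.
const sampleLine = '126.96.36.199 - - [22/Jul/2003:02:36:50 -0400] "GET /style/MYMENU.js HTTP/1.0" 200 25772 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"';
const entry = parseLogLine(sampleLine);
console.log(entry.ip, entry.path, claimsGooglebot(entry), isScriptRequest(entry));
```

Running it over a whole log file and keeping only the entries where both checks are true gives a list of IPs to compare against the ranges you normally see Googlebot crawl from.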
Wow! Thanks for letting us know. This could have tremendous implications for some webmasters. I don't have any external JS, I wonder if others are also seeing this.
There are a few threads about it; it's been noticed for a while.
Assuming that Google are :-
B) doing this as part of their anti-seo/spam strategies
can we be certain that robots.txt will be respected?
Respecting the robots.txt standard does NOT mean that a robot should not read the file, but it does mean that the file should not be indexed. Given that, robots can still respect robots.txt and still read your file, analyze it, and use it in ways other than indexing it.
Anyway, that is my understanding. Happy to be proved wrong.
|respecting robots.txt standard does NOT mean that a robot should not read the file |
In fact, that's exactly what it does mean by my reading:
It's interesting though, if I were Google, I think I'd analyze the file anyway - it would certainly clean house in many SERPS.
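To make the "should not retrieve" reading concrete, here is a rough sketch of the check a compliant robot would run before requesting a URL, using plain prefix matching on Disallow lines. It is an illustration only, not Google's code: the function names, the /style/ rule and the ExampleBot agent are assumptions made up for the example.

```javascript
// Parse robots.txt text into groups of { agents, disallow }.
// Consecutive User-agent lines share one group; Disallow lines attach to it.
function parseRobots(text) {
  const groups = [];
  let current = null;
  let lastWasAgent = false;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    if (line === '') continue;
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === 'user-agent') {
      if (!lastWasAgent) {
        current = { agents: [], disallow: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      // An empty Disallow value means "nothing is disallowed", so it is skipped.
      if (field === 'disallow' && current !== null && value !== '') {
        current.disallow.push(value);
      }
      lastWasAgent = false;
    }
  }
  return groups;
}

// True when the named robot may request the path.
// A group naming the robot wins over the "*" group; no matching group means allowed.
function mayFetch(groups, robotName, path) {
  const name = robotName.toLowerCase();
  let group = groups.find(g =>
    g.agents.some(a => a !== '*' && a !== '' && name.includes(a)));
  if (!group) group = groups.find(g => g.agents.includes('*'));
  if (!group) return true;
  return !group.disallow.some(prefix => path.startsWith(prefix));
}

// Hypothetical robots.txt: everyone is kept out of /style/, ExampleBot is not restricted.
const robotsText = [
  'User-agent: *',
  'Disallow: /style/',
  '',
  'User-agent: ExampleBot',
  'Disallow:',
].join('\n');
const robotsGroups = parseRobots(robotsText);
console.log(mayFetch(robotsGroups, 'Googlebot/2.1', '/style/MYMENU.js'));
console.log(mayFetch(robotsGroups, 'Googlebot/2.1', '/index.html'));
```

Under this reading the check happens before the request is made, so an excluded .js file would never show up in the logs at all.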
Wow! Only two weeks ago I started a topic 'bout this 'cos I needed my JS links to be followed. Now you are telling that gbot retrieves .js files? Wonderful! :) :) :)
I want to give an answer to this: even though I'm relatively new as a webmaster, I have some experience in programming and, after all, JS is a programming language and any robot is a program. So here it goes:
- The implementation to crawl through JS calls would be similar to the one to crawl <a> links, so it could be done.
So, this is my opinion: Google is beginning to follow JS. The cheaters will need to search for another way to cheat, and I won't need to remake my menus to get them spidered and crawled.
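As a rough idea of what "similar to crawling <a> links" could mean in practice, here is a sketch that scans the quoted strings in a .js file and keeps the ones that look like URLs, without running the script. Nobody outside Google knows how (or whether) Googlebot does this; the function name extractUrlCandidates and the sample script are invented for illustration.

```javascript
// Scan JavaScript source for quoted string literals and collect URL-looking ones.
// The script is never executed, so URLs built by concatenation are not reassembled.
function extractUrlCandidates(source) {
  // Single- or double-quoted literals with no escapes and no line breaks.
  const literal = /'([^'\\\n]*)'|"([^"\\\n]*)"/g;
  const candidates = [];
  let m;
  while ((m = literal.exec(source)) !== null) {
    const text = m[1] !== undefined ? m[1] : m[2];
    // An absolute http(s) URL anywhere inside the literal, e.g. inside href="...".
    const absolute = /https?:\/\/[^\s"'<>]+/i.exec(text);
    if (absolute) {
      candidates.push(absolute[0]);
    } else if (/^\/[^\s"'<>]*$/.test(text)) {
      // A literal that is, as a whole, a root-relative path.
      candidates.push(text);
    }
  }
  return candidates;
}

// Hypothetical menu script with a few different ways of writing a link.
const sampleScript = `
var a = 'http://SITENAME.COM/page_1.html';
var b = '/page_3.html';
var c = 'http://' + 'SITENAME' + '.com/page_4.html';
var d = 'href="http://SITENAME.COM/page_7.html"';
`;
const found = extractUrlCandidates(sampleScript);
console.log(found);
```

On this sample the plain literal, the root-relative path and the href string are picked up, while the URL assembled with + is missed. That is the kind of difference a test file with several link styles would expose.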
Have you any evidence yet that Googlebot has requested a page that could only have been discovered by parsing your .js?
That would be interesting to know.... keep watching your logs and let us know if it does!
ah yes... i stand corrected thanks bcolflesh..
Which brings me to the next point. Seeing that the js is delivered to a page that isn't excluded by robots.txt, could Google or any other robot claim that it is quite legitimate to read it, since it actually appears on a legitimate page and they didn't have to crawl it independently?
It may be a fine line here. For example, when they spider some SSI pages, the content may come from many original directories if the page has server side includes. The text and code in this case is "on the page" as far as the robot is concerned, since the processing is done before the page is crawled, but in reality some of it is fetched from other directories. Now js, at least the types I know of, isn't server side, but the output still appears on the page, and it may be argued to be fair game for indexing as part of the page, but not as a separate file if it has been excluded by robots.txt in the source directory.
[edited by: chiyo at 4:56 pm (utc) on July 24, 2003]
That's what I'm wondering about too - I don't think it's a secret that many competitive searches are filled with sites spamming via js files - these guys (if they were smart) have already excluded the folders where their js files live.
I'm saying Google should scan the file if it's linked to an indexed file.
|Have you any evidence yet that Googlebot has requested a page that could only have been discovered by parsing your .js? |
Anyone out there that has had Googlebot follow their external .js file to a page that wasn't linked anywhere else?
|188.8.131.52 - - [21/Jul/2003:03:06:29 -0500] "GET /XXX.js HTTP/1.0" 200 520 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)" |
var url1 = 'http://SITENAME.COM/page_1.html';
var url2 = 'SITENAME.COM/page_2.html';
var url3 = '/page_3.html';
var url4 = 'http://' + 'SITENAME' + '.com/page_4.html';
var url5 = 'http://SITENAME.COM/page_5.html';
var url6 = 'href=http://SITENAME.COM/page_6.html';
var url7 = 'href="http://SITENAME.COM/page_7.html"';
var url8 = "http://SITENAME.COM/page_7.html";
I read that it might be initiated as some kind of spam check after a spam report, looking for hidden text... but my site in question is under 6 weeks old and is very unlikely to have been reported for spam.
Well, I use numerous external JS files to writeln content, links, utilities into webpages and Google is NOT finding any of this.
Here's what the external linking looks like on my pages... might be relevant.
kicked, because evidently there is a newsletter story out there stating this is a new phenomenon. We had a bunch of post submissions suggesting this was new. Obviously - it is NOT.
But you are definitely wrong on this one. This number on www was about 15,000 this morning and it is now 30,000, 12 hours later. Look at all the URLs that Google has listed that end in .txt
>> that end in .txt
"These terms only appear in links pointing to this page: allintext document write a href"