

Googlebot Processing Javascript Functions

     
1:43 pm on May 17, 2012 (gmt 0)

Administrator from GB 

engine

joined:May 9, 2000
posts:22318
votes: 239


Googlebot Processing Javascript Functions [arstechnica.com]
During the last quarter of 2011, Google finally started to figure out how to efficiently solve the problem from its end, and began to roll out bots that could explore the dynamic content of pages in a limited fashion—crawling through the JavaScript within a page and finding URLs within them to add to the crawl. This required Google to allow its crawlers to send POST requests to websites in some cases, depending on how the JavaScript code was written, rather than the GET request usually used to fetch content. As a result, Google was able to start indexing Facebook comments, for example, as well as other "dynamic comment" systems.
Now, based on the logs Pankratov has shown, it appears that rather than just mining for URLs within scripts, the bots are crawling even deeper than comments, processing JavaScript functions in a way that mimics how they run when users click on the objects that activate them. That would give Google search even better access to the "deep Web"—content hidden in databases and other sources that generally hasn't been indexable before.
6:58 pm on May 17, 2012 (gmt 0)

Senior Member

tedster

joined:May 26, 2000
posts:37301
votes: 0


The big deal here - and Google's urgency - is indexing AJAX content, I assume.
7:05 pm on May 17, 2012 (gmt 0)

Junior Member

joined:May 13, 2011
posts:115
votes: 0


Wasn't Googlebot adjusted to crawl Facebook comments, and only those?
8:10 pm on May 17, 2012 (gmt 0)

Senior Member

g1smd

joined:July 3, 2002
posts:18903
votes: 0


Google's bots added a quarter of a million quids' worth of products to the shopping basket of a site last week. They're now blocked.
9:34 pm on May 17, 2012 (gmt 0)

Senior Member

sgt_kickaxe

joined:Apr 14, 2010
posts:3169
votes: 0


Google had switched to a visual method of recording page content a long time ago, before they launched page previews. They've been able to pick up textual comments loaded by javascript for a long time now. The real news is that they've begun trusting googlebot to dig deeper into code too.

I would NOT be surprised if your site (or you as a webmaster) needs to pass a *sniff test* with their visual methods first, or have a trustworthy history, before googlebot opens up your javascript with POST requests.
8:42 am on May 18, 2012 (gmt 0)

Senior Member

sgt_kickaxe

joined:Apr 14, 2010
posts:3169
votes: 0


g1smd, was googlebot cookied through at least part of the checkout process?
11:21 am on May 18, 2012 (gmt 0)

Senior Member from US 


joined:June 12, 2003
posts:702
votes: 9


They were doing this 6+ months ago with all that chatter around Disqus comments and Facebook comments.
1:44 pm on May 18, 2012 (gmt 0)

Senior Member

planet13

joined:June 16, 2010
posts: 3796
votes: 28


Google's bots added a quarter of a million quids' worth of products to the shopping basket of a site last week. They're now blocked.


Huhh... on my shopping cart, I would get these mysterious instances where SEVERAL visitors would all arrive at the same time and add one product to the shopping cart and then leave.

They would all be added at the same minute. At first I thought it was some competitor trying to deplete my inventory, but now it seems more likely that it could be googlebot (because they only order ONE of each item, while a competitor would order hundreds or thousands of an item to deplete the inventory).
7:51 pm on May 18, 2012 (gmt 0)

Senior Member from GB 

dstiles

joined:May 14, 2008
posts:3092
votes: 2


if not real user with real web browser then
do not display ANY forms
end if

Ditto javascript, css, whatever
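Server-side, that policy can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not anyone's production code; the token list and helper names are mine.

```python
# Minimal sketch of the policy above: only send forms (and, in the same
# spirit, script/style references) to requests that look like a real
# browser. The token list and helper names are illustrative assumptions.

BOT_UA_TOKENS = ("googlebot", "bingbot", "slurp", "spider", "crawler")

def looks_like_real_browser(user_agent: str) -> bool:
    """Crude check: no known bot token, and a plausible browser UA."""
    ua = user_agent.lower()
    if any(token in ua for token in BOT_UA_TOKENS):
        return False
    return "mozilla" in ua  # nearly every real browser UA claims Mozilla

def render_page(user_agent: str) -> str:
    html = "<h1>Widgets</h1>"
    if looks_like_real_browser(user_agent):
        # Forms only go out to what looks like a real visitor.
        html += '<form method="post" action="/basket"><button>Add</button></form>'
    return html
```

User-agent strings are trivially spoofed, of course, so a check like this only keeps out bots that identify themselves honestly.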
4:14 am on May 19, 2012 (gmt 0)

Senior Member from CA 


joined:June 18, 2005
posts:1693
votes: 4


Would it run javascript code of a js file located in a directory blocked by robots.txt?
9:23 am on May 19, 2012 (gmt 0)

Senior Member from US 

lucy24

joined:Apr 9, 2011
posts:12716
votes: 244


Would it run javascript code of a js file located in a directory blocked by robots.txt?

Preview definitely would if it could, but then, Preview isn't a robot. So would the plainclothes bingbot.
7:42 pm on May 19, 2012 (gmt 0)

Senior Member from GB 

dstiles

joined:May 14, 2008
posts:3092
votes: 2


koan - you cannot rely on any bot to obey robots.txt in all situations. They used to but, as lucy notes, preview and other plain-clothes bots can do anything.

Detect IP ranges, UAs, headers, whatever, either within an htaccess file or within the page itself (I would have thought webmasters should be doing that anyway to determine real visitors). What you do with the catch when you get it is up to you. I generally throw it back with a 403 or, in the case of JS, do not load it within the page.
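A minimal sketch of that catch-and-403 approach, done in the page handler rather than htaccess. The blocked range and UA tokens below are illustrative assumptions, not a vetted bot list.

```python
import ipaddress

# Sketch of the "catch it and throw back a 403" approach described above.
# The blocked range and UA tokens are illustrative assumptions.
BLOCKED_RANGES = [ipaddress.ip_network("66.249.64.0/19")]  # e.g. a crawler range
BOT_UA_TOKENS = ("bot", "spider", "crawler", "preview")

def filter_request(remote_ip: str, user_agent: str) -> tuple[int, str]:
    """Return (status, body), refusing suspected bots with a 403."""
    addr = ipaddress.ip_address(remote_ip)
    if any(addr in net for net in BLOCKED_RANGES):
        return 403, "Forbidden"
    if any(token in user_agent.lower() for token in BOT_UA_TOKENS):
        return 403, "Forbidden"
    # Real visitors get the full page, JS includes and all.
    return 200, "<html><script src='/js/app.js'></script>...</html>"
```

Doing this in the handler rather than htaccess keeps the decision next to the page, at the cost of running it per request.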
1:54 am on May 20, 2012 (gmt 0)

Senior Member from US 

lucy24

joined:Apr 9, 2011
posts:12716
votes: 244


I generally throw it back with a 403 or, in the case of JS, do not load it within the page.

Do you mean that you include a bit in the js itself to detect the UA and/or IP and act accordingly? So the page gets a little bit fatter but you're shifting the work from your server to the visitor's computer?
6:10 am on May 20, 2012 (gmt 0)

Preferred Member


joined:May 27, 2005
posts:428
votes: 3


Well there goes another area of PRIVACY thanks to Google.

A common practice to protect links and content from being indexed is to use JavaScript to write in the link.

Lord, please forgive them for they do not know what they do... after all they are only criminally insane idiots.
6:40 am on May 20, 2012 (gmt 0)

Senior Member

tedster

joined:May 26, 2000
posts:37301
votes: 0


The processing of javascript links should not be news to anyone who's been paying attention. It's been happening (and discussed here) for several years. Protecting those javascripted links takes another step: Disallow googlebot from crawling your JS file, for instance.
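A robots.txt rule of the kind tedster describes might look like this (the `/js/` path is illustrative):

```
User-agent: Googlebot
Disallow: /js/
```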

This change is about being able to crawl AJAX content, presumably without the clunky hash-bang workaround. The sky is not always falling ;)
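For context, the hash-bang workaround tedster refers to is Google's 2009 AJAX crawling scheme: a URL containing `#!` signals that the crawler may instead fetch an `_escaped_fragment_` variant and receive a server-rendered snapshot. A small sketch of the URL mapping (the function name is mine):

```python
from urllib.parse import quote

def escaped_fragment_url(url: str) -> str:
    """Map a #! URL to the _escaped_fragment_ form a crawler could fetch."""
    if "#!" not in url:
        return url  # not an AJAX-crawlable URL under the scheme
    base, fragment = url.split("#!", 1)
    sep = "&" if "?" in base else "?"
    # The scheme percent-escapes special characters in the fragment value.
    return base + sep + "_escaped_fragment_=" + quote(fragment, safe="=")
```

So `/ajax#!page=2` is crawled as `/ajax?_escaped_fragment_=page=2`, which the server answers with a plain-HTML snapshot.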
8:57 am on May 20, 2012 (gmt 0)

Senior Member


joined:May 24, 2002
posts:894
votes: 0


Disallow googlebot from crawling your JS file, for instance.

As if. If G has a mind to it, G will ignore your robots.txt disallow.

I have caught G more than once disregarding robots.txt rules.
8:47 pm on May 20, 2012 (gmt 0)

Senior Member from GB 

dstiles

joined:May 14, 2008
posts:3092
votes: 2


Lucy - no, it's all part of the page processing. JS never gets sent to bots.

Staffa - see my previous post re: not having to rely on the clunky and easily ignored robots.txt.
9:31 pm on May 20, 2012 (gmt 0)

Senior Member


joined:May 24, 2002
posts:894
votes: 0


dstiles - I know and I certainly do not rely on robots.txt, I'm just surprised that after all those years it's still suggested by tedster

tedster - unless you are using Disallow in the broader sense and not necessarily via robots.txt, in which case "Disallow" threw me off track as it is so specifically associated with robots.txt
10:34 pm on May 20, 2012 (gmt 0)

Senior Member

tedster

joined:May 26, 2000
posts:37301
votes: 0


No - I did mean robots.txt. I've had good luck with it, although I have heard that others ran into trouble. Do you have any idea about what the differences might be? I've mostly used it to keep affiliate links from "being counted."
10:44 am on May 21, 2012 (gmt 0)

Senior Member


joined:May 24, 2002
posts:894
votes: 0


I have no idea whatsoever; the sites are as plain as they come, without any ads or other external input. The javascript and css are purely for visitors and disallowed in robots.txt, and when G ignores this it gets whacked like any other rogue bot.
4:25 pm on May 21, 2012 (gmt 0)

Senior Member from US 

lucy24

joined:Apr 9, 2011
posts:12716
votes: 244


The javascript and css are purely for visitors and disallowed in robots.txt, and when G ignores this it gets whacked like any other rogue bot.

Do you have the googlebot itself reading and acting on javascript? See, I could have sworn I'd caught it myself. Many times. But I pored over logs and all I could find was Preview consistently misbehaving.
5:43 pm on May 21, 2012 (gmt 0)

Senior Member


joined:May 24, 2002
posts:894
votes: 0


Yes, it's Gbot itself occasionally fetching css and js files.
Preview and translate are blocked as standard ;o)
 
