
Googlebot Processing Javascript Functions

   
1:43 pm on May 17, 2012 (gmt 0)

engine (WebmasterWorld Administrator)



Googlebot Processing Javascript Functions [arstechnica.com]
During the last quarter of 2011, Google finally started to figure out how to efficiently solve the problem from its end, and began to roll out bots that could explore the dynamic content of pages in a limited fashion—crawling through the JavaScript within a page and finding URLs within them to add to the crawl. This required Google to allow its crawlers to send POST requests to websites in some cases, depending on how the JavaScript code was written, rather than the GET request usually used to fetch content. As a result, Google was able to start indexing Facebook comments, for example, as well as other "dynamic comment" systems.
Now, based on the logs Pankratov has shown, it appears that rather than just mining for URLs within scripts, the bots are crawling even deeper than comments, processing JavaScript functions in a way that mimics how they run when users click on the objects that activate them. That would give Google search even better access to the "deep Web"—content hidden in databases and other sources that generally hasn't been indexable before.
6:58 pm on May 17, 2012 (gmt 0)

tedster (WebmasterWorld Senior Member)



The big deal here - and Google's urgency - is indexing AJAX content, I assume.
7:05 pm on May 17, 2012 (gmt 0)



Wasn't Googlebot adjusted to crawl Facebook comments, and only those?
8:10 pm on May 17, 2012 (gmt 0)

g1smd (WebmasterWorld Senior Member)



Google's bots added a quarter of a million quid's worth of products to the shopping basket of a site last week. They're now blocked.
9:34 pm on May 17, 2012 (gmt 0)

sgt_kickaxe (WebmasterWorld Senior Member)



Google had switched to a visual method of recording page content a long time ago, before they launched page previews. They've been able to pick up textual comments loaded by javascript for a long time now. The real news is that they've begun trusting googlebot to dig deeper into code too.

I would NOT be surprised if your site (or you as a webmaster) needs to pass a *sniff test* with their visual methods first, or have a trustworthy history, before googlebot starts firing POST requests into your javascript.
8:42 am on May 18, 2012 (gmt 0)

sgt_kickaxe (WebmasterWorld Senior Member)



g1smd, was googlebot cookied through at least part of the checkout process?
11:21 am on May 18, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



They were doing this 6+ months ago with all that chatter around Disqus comments and Facebook comments.
1:44 pm on May 18, 2012 (gmt 0)

planet13 (WebmasterWorld Senior Member)



Google's bots added a quarter of a million quid's worth of products to the shopping basket of a site last week. They're now blocked.


Huh... on my shopping cart, I would get these mysterious instances where SEVERAL visitors would all arrive at the same time, add one product to the shopping cart, and then leave.

They would all be added within the same minute. At first I thought it was some competitor trying to deplete my inventory, but now it seems more likely that it was googlebot (they only order ONE of each item, while a competitor would order hundreds or thousands of an item to deplete the inventory).
7:51 pm on May 18, 2012 (gmt 0)

dstiles (WebmasterWorld Senior Member)



if not real user with real web browser then
    do not display ANY forms
end if

Ditto javascript, css, whatever
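dstiles' pseudocode translates roughly into a server-side gate like the following — a minimal Python sketch; the user-agent substrings and the `render_page` markup are illustrative placeholders, not a complete bot list or a real template, and a production check would also look at IPs and headers:

```python
# Sketch of dstiles' rule: if the client doesn't look like a real browser,
# don't emit any forms. The same gate could skip <script>/<link> tags too.
# BOT_SIGNATURES is an illustrative list, not an exhaustive one.

BOT_SIGNATURES = ("googlebot", "bingbot", "slurp", "spider", "crawler")

def is_probable_bot(user_agent: str) -> bool:
    ua = (user_agent or "").lower()
    # Treat an empty UA as a bot: real browsers always send one.
    return ua == "" or any(sig in ua for sig in BOT_SIGNATURES)

def render_page(user_agent: str) -> str:
    html = "<html><body><p>Widget catalogue</p>"
    if not is_probable_bot(user_agent):
        # Forms (and add-to-basket buttons) only go to real visitors,
        # so a crawling bot has nothing to POST to.
        html += '<form method="post" action="/basket"><button>Add</button></form>'
    return html + "</body></html>"
```

The same condition can gate the javascript and css includes, per the "Ditto javascript, css" remark.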
4:14 am on May 19, 2012 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Would it run javascript code from a .js file located in a directory blocked by robots.txt?
9:23 am on May 19, 2012 (gmt 0)

lucy24 (WebmasterWorld Senior Member)



Would it run javascript code from a .js file located in a directory blocked by robots.txt?

Preview definitely would if it could; but then, Preview isn't a robot. So would the plainclothes bingbot.
7:42 pm on May 19, 2012 (gmt 0)

dstiles (WebmasterWorld Senior Member)



koan - you cannot rely on any bot to obey robots.txt in all situations. They used to, but as lucy notes, Preview and other plain-clothes bots can do anything.

Detect IP ranges, UAs, headers, whatever, either within an .htaccess file or within the page itself (I would have thought webmasters should be doing that anyway to identify real visitors). What you do with the catch when you get it is up to you. I generally throw it back with a 403 or, in the case of JS, simply don't load the script within the page.
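At the application layer, that catch-and-403 approach might be sketched as follows — the network range and UA substrings are placeholders for your own lists (a robust Googlebot check uses a reverse-then-forward DNS lookup rather than a hard-coded range, which is omitted here):

```python
# Sketch of dstiles' suggestion: check the client IP and UA string,
# and answer 403 on a match. The blocklists below are placeholders;
# maintain your own, or verify Googlebot via reverse/forward DNS.
import ipaddress

BLOCKED_NETWORKS = [ipaddress.ip_network("66.249.64.0/19")]  # example range only
BLOCKED_UA_SUBSTRINGS = ("googlebot", "preview")

def access_decision(ip: str, user_agent: str) -> int:
    """Return an HTTP status code: 403 for a blocked client, 200 otherwise."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return 403
    if any(s in (user_agent or "").lower() for s in BLOCKED_UA_SUBSTRINGS):
        return 403
    return 200
```

The same test can instead decide whether to emit the `<script>` tag at all, which is the "do not load it within the page" option.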
1:54 am on May 20, 2012 (gmt 0)

lucy24 (WebmasterWorld Senior Member)



I generally throw it back with a 403 or, in the case of JS, do not load it within the page.

Do you mean that you include a bit in the js itself to detect the UA and/or IP and act accordingly? So the page gets a little bit fatter but you're shifting the work from your server to the visitor's computer?
6:10 am on May 20, 2012 (gmt 0)

5+ Year Member



Well there goes another area of PRIVACY thanks to Google.

A common practice to protect links and content from being indexed is to use JavaScript to write in the link.

Lord, please forgive them for they do not know what they do... after all they are only criminally insane idiots.
6:40 am on May 20, 2012 (gmt 0)

tedster (WebmasterWorld Senior Member)



The processing of javascript links should not be news to anyone who's been paying attention. It's been happening (and discussed here) for several years. Protecting those javascripted links takes another step: Disallow googlebot from crawling your JS file, for instance.

This change is about being able to crawl AJAX content, presumably without the clunky hash-bang workaround. The sky is not always falling ;)
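In robots.txt terms, the suggestion above looks like the following — the path is illustrative; substitute the directory that actually holds your scripts:

```
User-agent: Googlebot
Disallow: /scripts/
```

As other posters in this thread report, compliance with robots.txt is not guaranteed, so server-side detection remains the only enforceable backstop.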
8:57 am on May 20, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Disallow googlebot from crawling your JS file, for instance.

As if! G will follow your robots.txt Disallow only if it has a mind to.

I have caught G disregarding robots.txt rules more than once.
8:47 pm on May 20, 2012 (gmt 0)

dstiles (WebmasterWorld Senior Member)



Lucy - no, it's all part of the page processing. JS never gets sent to bots.

Staffa - see my previous post re: not having to rely on the clunky and easily ignored robots.txt.
9:31 pm on May 20, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



dstiles - I know, and I certainly do not rely on robots.txt. I'm just surprised that after all these years it's still suggested by tedster.

tedster - unless you are using "Disallow" in the broader sense and not necessarily via robots.txt, in which case the word threw me off track, as it is so specifically associated with robots.txt.
10:34 pm on May 20, 2012 (gmt 0)

tedster (WebmasterWorld Senior Member)



No - I did mean robots.txt. I've had good luck with it, although I have heard that others ran into trouble. Do you have any idea about what the differences might be? I've mostly used it to keep affiliate links from "being counted."
10:44 am on May 21, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have no idea whatsoever; the sites are as plain as they come, without any ads or other external input. The javascript and css are purely for visitors and disallowed in robots.txt, and when G ignores this it gets whacked like any other rogue bot.
4:25 pm on May 21, 2012 (gmt 0)

lucy24 (WebmasterWorld Senior Member)



The javascript and css are purely for visitors and disallowed in robots.txt and when G ignores this it gets whacked like any other rogue bot

Do you have the googlebot itself reading and acting on javascript? See, I could have sworn I'd caught it myself. Many times. But I pored over logs and all I could find was Preview consistently misbehaving.
5:43 pm on May 21, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, it's Gbot itself occasionally fetching css and js files.
Preview and translate are blocked as standard ;o)
 
