This document outlines the steps necessary to make your AJAX application crawlable. Once you have fully understood each of these steps, actually making your application crawlable should not take you very long!
Briefly, the solution works as follows: the crawler finds a pretty AJAX URL (that is, a URL containing a #! hash fragment). It then requests the content for this URL from your server in a slightly modified form. Your web server returns the content in the form of an HTML snapshot, which is then processed by the crawler.
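The "slightly modified form" in the quoted text refers to Google's _escaped_fragment_ convention: the crawler rewrites everything after the #! into a query parameter before fetching the page. A minimal sketch of that rewrite (the example URL is hypothetical, and the percent-escaping here is simplified relative to the full specification):

```python
from urllib.parse import quote


def to_crawler_url(pretty_url: str) -> str:
    """Rewrite a pretty #! URL into the form the crawler requests."""
    if "#!" not in pretty_url:
        return pretty_url  # not an AJAX-crawlable URL; request as-is
    base, fragment = pretty_url.split("#!", 1)
    # The fragment becomes an _escaped_fragment_ query parameter;
    # per the scheme, key=value pairs survive literally.
    escaped = quote(fragment, safe="/=&")
    separator = "&" if "?" in base else "?"
    return f"{base}{separator}_escaped_fragment_={escaped}"


print(to_crawler_url("http://www.example.com/page#!state=photos"))
# -> http://www.example.com/page?_escaped_fragment_=state=photos
```

The server is expected to answer that _escaped_fragment_ request with a static HTML snapshot of the AJAX state.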
12:18 am on Mar 5, 2010 (gmt 0)
So the onus is on the site's owner to pay for extra work on the site, then. :(
What's wrong with modifying a standard XML reader as a bot? Surely THAT knows whether a file is XML or not? Don't the headers give a clue?
Have to say I'm glad I'm not coding XML. I tried to read Numbers pages on the IANA site yesterday using Firefox. All I got was garbage. But at least it told me at the top that it was XML.
2:44 am on Mar 7, 2010 (gmt 0)
As I understand it, the issue is not really about reading XML. The issue is indexing a URL with a # mark as being different from the URL without the #. If your site needs this technology, then your various AJAX page states were already NOT getting indexed by the search engines, and this technology is a help, and worth investing the resources to get more search traffic.
I think Google is correct here. The web needs a different way to identify a new AJAX call to the server for page content that wasn't served in the first download. Recycling the simple hash mark for that purpose has definitely been problematic. That's even more the case with the new interest in indexing the page fragments with direct links in the snippet.
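The indexing problem described above comes down to fragment semantics: a plain # fragment is never sent to the server, so two AJAX "pages" that differ only in their fragment resolve to the same HTTP resource. A quick illustration (hypothetical URLs):

```python
from urllib.parse import urldefrag

# Two AJAX states of the same app, distinguished only by hash fragment.
a = "http://www.example.com/app#photos"
b = "http://www.example.com/app#videos"

# Stripping the fragment -- as the HTTP layer does before requesting --
# leaves an identical URL, so a crawler sees one page, not two.
print(urldefrag(a).url == urldefrag(b).url)  # -> True
```

This is why the scheme needed a new signal (#!) rather than the plain hash mark.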
8:15 pm on Mar 7, 2010 (gmt 0)
Surely it's possible to determine the likelihood of a hash link on a page being in-page or external? If the page is XML it's likely to be the latter, surely? Follow the link and see what comes up: if it's not XML, drop it.
Or perhaps I'm being too simplistic. I come from an age when programming was logical and structured, before internet "gurus" got hold of the idea. :)
As to getting more search traffic: if the browser can't display it, surely it ain't much use? The punter (eg me) will go away and find a page that can be viewed.
Or is non-viewable an XML design fault? If so, I think IANA should be told.
9:54 pm on Mar 7, 2010 (gmt 0)
You're right - a major part of the html/xhtml on the web is a far cry from anything you could call disciplined programming. If web pages were held to the standards of true programming, your browser would be blank almost all of the time today!
Also a crawler is definitely not using the same kind of approach that a visual browser uses. Different goals, different end uses. I don't know if this #! approach that Google suggests will really fly or not, but it's not too bad as a first attempt.
10:02 pm on Mar 7, 2010 (gmt 0)
The # on its own already has a specific meaning.
So, I applaud that site owners have a new way (using the #! combination) to signify content they do want crawled, rather than Google 'assuming' and just barging on in.
1:28 am on Mar 11, 2010 (gmt 0)
I like this. Very useful if designed thoughtfully.