This document outlines the steps necessary to make your AJAX application crawlable. Once you have fully understood each of these steps, it should not take you very long to actually make your application crawlable!
Briefly, the solution works as follows: the crawler finds a pretty AJAX URL (that is, a URL containing a #! hash fragment). It then requests the content for this URL from your server in a slightly modified form. Your web server returns the content in the form of an HTML snapshot, which is then processed by the crawler.
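The "slightly modified form" in that quote is the scheme's defining trick: the crawler rewrites the pretty `#!` URL into an "ugly" URL with an `_escaped_fragment_` query parameter, which the server can actually see (fragments are never sent in HTTP requests). A minimal sketch of that mapping, assuming the function name `escaped_fragment_url` and a simplified version of the spec's escaping rules:

```python
# Sketch of the URL mapping in Google's AJAX crawling scheme: the crawler
# rewrites a pretty "#!" URL into the URL it actually requests from the server.
# (Function name and exact escaping details are illustrative; the full escaping
# rules are defined in Google's specification.)
from urllib.parse import quote

def escaped_fragment_url(pretty_url):
    """Map a #! URL to the URL the crawler fetches from the server."""
    base, sep, fragment = pretty_url.partition("#!")
    if not sep:
        return pretty_url  # no hash-bang: not a "pretty AJAX URL", crawled as-is
    # The hash fragment moves into a query parameter, percent-encoded
    # (here we leave '=' intact to keep key=value pairs readable).
    joiner = "&" if "?" in base else "?"
    return base + joiner + "_escaped_fragment_=" + quote(fragment, safe="=")

print(escaped_fragment_url("http://example.com/page#!state=2"))
# http://example.com/page?_escaped_fragment_=state=2
```

The server, seeing `_escaped_fragment_`, returns the HTML snapshot for that state, and the crawler indexes it under the original `#!` URL.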
Msg#: 4091398 posted 2:44 am on Mar 7, 2010 (gmt 0)
As I understand it, the issue is not really about reading XML. The issue is indexing a URL with a # mark as being different from the URL without the #. If your site needs this technology, then your various AJAX page states were already NOT getting indexed by the search engines, so this technology is a help, and worth investing resources in to get more search traffic.
I think Google is correct here. The web needs a different way to identify a new AJAX call to the server for page content that wasn't served in the first download. Recycling the simple hash mark for that purpose has definitely been problematic. That's even more the case with the new interest in indexing the page fragments with direct links in the snippet.
Msg#: 4091398 posted 8:15 pm on Mar 7, 2010 (gmt 0)
Surely it's possible to determine the likelihood of a hash link on a page being in-page or external? If the page is XML it's likely to be the latter, surely? Follow the link and see what comes up: if it's not XML, drop it.
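That guessing game is exactly what the `#!` convention is meant to avoid: rather than the crawler inferring whether a hash link is an in-page anchor or an AJAX state, the page author opts in by using `#!`. A minimal sketch of that distinction (hypothetical function name, not part of any spec):

```python
# Illustrative classifier: under Google's convention a "#!" fragment signals
# an AJAX state the crawler should fetch as a separate page, while a plain "#"
# fragment is treated as an ordinary in-page anchor.
def classify_link(url):
    """Classify a URL by its fragment, per the #! opt-in convention."""
    if "#!" in url:
        return "ajax-state"      # crawler requests an HTML snapshot for this
    if "#" in url:
        return "in-page-anchor"  # same document, no separate crawl needed
    return "plain"               # no fragment at all
```

So no heuristics about XML or following the link are needed; the marker itself is the signal.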
Or perhaps I'm being too simplistic. I come from an age when programming was logical and structured, before internet "gurus" got hold of the idea. :)
As to getting more search traffic: if the browser can't display it, surely it ain't much use? The punter (eg me) will go away and find a page that can be viewed.
Or is non-viewable an XML design fault? If so, I think IANA should be told.
Msg#: 4091398 posted 9:54 pm on Mar 7, 2010 (gmt 0)
You're right - a major part of the html/xhtml on the web is a far, far distance from anything you could call disciplined programming. If web pages were treated as true programming, then your browser would be blank almost all of the time today!
Also a crawler is definitely not using the same kind of approach that a visual browser uses. Different goals, different end uses. I don't know if this #! approach that Google suggests will really fly or not, but it's not too bad as a first attempt.