Google Makes AJAX Applications Crawlable

engine

2:21 pm on Mar 4, 2010 (gmt 0)

Google Makes AJAX Applications Crawlable [code.google.com]
This document outlines the steps that are necessary in order to make your AJAX application crawlable. Once you have fully understood each of these steps, it should not take you very long to actually make your application crawlable!

Briefly, the solution works as follows: the crawler finds a pretty AJAX URL (that is, a URL containing a #! hash fragment). It then requests the content for this URL from your server in a slightly modified form. Your web server returns the content in the form of an HTML snapshot, which is then processed by the crawler.
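For reference, the "slightly modified form" is a query parameter: the crawler takes everything after the #! and re-requests the URL with it in an _escaped_fragment_ parameter, since fragments are never sent to servers. A rough sketch of the mapping in TypeScript (the example URL is made up):

    // The crawler's rewrite under Google's AJAX crawling scheme:
    // everything after "#!" moves into an _escaped_fragment_ query
    // parameter (URL-encoded), because fragments never reach the server.
    function toCrawlerUrl(prettyUrl: string): string {
      const bang = prettyUrl.indexOf("#!");
      if (bang === -1) return prettyUrl; // no AJAX state to expose
      const base = prettyUrl.slice(0, bang);
      const state = prettyUrl.slice(bang + 2);
      const sep = base.includes("?") ? "&" : "?";
      return base + sep + "_escaped_fragment_=" + encodeURIComponent(state);
    }

    // toCrawlerUrl("http://example.com/gallery#!photo=42")
    //   -> "http://example.com/gallery?_escaped_fragment_=photo%3D42"

Your server is then expected to answer that _escaped_fragment_ request with a static HTML snapshot of what a user would see at the pretty URL.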

dstiles

12:18 am on Mar 5, 2010 (gmt 0)

So the onus is on the site's owner to pay for extra work on the site, then. :(

What's wrong with adapting a standard XML reader to act as the bot? Surely THAT knows whether a file is XML or not? Don't the headers give a clue?

Have to say I'm glad I'm not coding XML. I tried to read the Numbers pages on the IANA site yesterday using Firefox, and all I got was garbage. But at least it told me at the top that it was XML.

tedster

2:44 am on Mar 7, 2010 (gmt 0)

As I understand it, the issue is not really about reading XML. The issue is indexing a URL with a # mark as distinct from the same URL without the #. If your site needs this technology, then your various AJAX page states were already NOT getting indexed by the search engines; this technology helps, and it's worth investing the resources to get more search traffic.
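To spell that out: the fragment is a client-side construct that is never sent in the HTTP request, so every #-state of a page collapses into the same URL from a crawler's point of view. For example (URLs made up):

    // Both URLs below produce an identical request line ("GET /gallery"),
    // so a fragment-unaware crawler cannot tell the two page states apart.
    const plain = new URL("http://example.com/gallery");
    const state = new URL("http://example.com/gallery#photo-42");

    console.log(plain.pathname === state.pathname); // true
    console.log(state.hash); // "#photo-42" - lives only in the browser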

I think Google is correct here. The web needs a different way to identify an AJAX call that fetches page content which wasn't served in the first download. Recycling the simple hash mark for that purpose has definitely been problematic, even more so with the new interest in indexing page fragments and giving them direct links in the snippet.

dstiles

8:15 pm on Mar 7, 2010 (gmt 0)

Surely it's possible to determine the likelihood of a hash link on a page being in-page or external? If the page is XML, it's likely to be the latter, surely? Follow the link and see what comes up: if it's not XML, drop it.

Or perhaps I'm being too simplistic. I come from an age when programming was logical and structured, before internet "gurus" got hold of the idea. :)

As to getting more search traffic: if the browser can't display it, surely it ain't much use? The punter (e.g. me) will go away and find a page that can be viewed.

Or is non-viewable an XML design fault? If so, I think IANA should be told.

tedster

9:54 pm on Mar 7, 2010 (gmt 0)

You're right - much of the HTML/XHTML on the web is a far cry from anything you could call disciplined programming. If web pages were held to true programming standards, your browser would be blank almost all of the time!

So the job of trying to surface relevant content from that soup is extremely challenging - especially with the amount of truly malicious JavaScript that is being served out there. No one would want that code to just run on their servers.

Also, a crawler is definitely not using the same kind of approach that a visual browser uses: different goals, different end uses. I don't know whether this #! approach that Google suggests will really fly, but it's not bad as a first attempt.

g1smd

10:02 pm on Mar 7, 2010 (gmt 0)

The # on its own already has a specific meaning: it points to a named anchor within the current page.

So, I applaud that site owners have a new way (using the #! combination) to signify content they do want crawled, rather than Google 'assuming' and just barging on in.
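On the server side, opting in means answering the crawler's rewritten request. A minimal sketch of the idea, assuming a Node-style HTTP server and a made-up renderSnapshot() helper standing in for however you generate the static HTML:

    import * as http from "http";

    // Stand-in for a real renderer (headless browser, server-side
    // template, etc.); the scheme only requires that the snapshot match
    // what a user would see at the pretty #! URL.
    function renderSnapshot(state: string): string {
      return `<html><body><h1>Snapshot for state: ${state}</h1></body></html>`;
    }

    http.createServer((req, res) => {
      const url = new URL(req.url ?? "/", "http://example.com");
      const fragment = url.searchParams.get("_escaped_fragment_");
      res.writeHead(200, { "Content-Type": "text/html" });
      if (fragment !== null) {
        // Crawler request: serve the static HTML snapshot for this state.
        res.end(renderSnapshot(fragment));
      } else {
        // Normal visitor: serve the AJAX application shell as usual.
        res.end("<html><body><script src='/app.js'></script></body></html>");
      }
    }).listen(8080);

Because the rewrite only happens for URLs containing #!, plain # anchors keep their traditional in-page meaning.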

Bentler

1:28 am on Mar 11, 2010 (gmt 0)

I like this. Very useful if designed thoughtfully.