Google's proposal is to insert a new token after the hash mark. That would alert crawling technology that the following information creates a new page state, often by going to the server to update only part of the page content.
The new token Google proposes is an exclamation mark added immediately after the hash mark, so a stateful AJAX URL might look like this:
http://www.example.com/page?query#!state
This approach would allow stateful AJAX urls to be shown in search results. More detail is available in the Google Webmaster Central Blog article [googlewebmastercentral.blogspot.com]
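To make the proposed mapping concrete, here is a minimal sketch (my own illustration in TypeScript, not Google's code) of how a crawler might translate a #! URL into the query-parameter form discussed later in this thread:

// Sketch: translate a stateful #! URL into a crawler-friendly form,
// assuming the _escaped_fragment_ convention described below.
function toCrawlerUrl(url: string): string {
  const hashIndex = url.indexOf("#!");
  if (hashIndex === -1) return url; // no AJAX state token, leave as-is
  const base = url.slice(0, hashIndex);
  const state = url.slice(hashIndex + 2); // everything after "#!"
  const separator = base.includes("?") ? "&" : "?";
  return base + separator + "_escaped_fragment_=" + encodeURIComponent(state);
}

// toCrawlerUrl("http://www.example.com/page?query#!state")
// -> "http://www.example.com/page?query&_escaped_fragment_=state"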
- MAINLY helps ONLY their SE.
- Could possibly change its original intent AFTER the rest of the web decides to use it how the web wishes, instead of how Goog wishes.
Google used up all their "benefit of the doubt" and "this is how we'd like to improve the web" points with their constant changes in the philosophy of rel=nofollow. At the end of the day, this proposal only really helps Google. NOT the 99.9% of the web who don't use or even know what AJAX is.
Many/most webmasters will never have need for AJAX. But for those who use it, making that content more crawlable is a very sane goal.
Could possibly change its original intent AFTER the rest of the web decides to use it how the web wishes, instead of how Goog wishes.
This proposal comes to meet a real need that a portion of the web has already decided on. Not everything Google does is evil.
I'd also like to see a proposal from Bing about the challenge of indexing AJAX modified pages.
Many/most webmasters will never have need for AJAX. But for those who use it, making that content more crawlable is a very sane goal.
Many/most webmasters will never have need for rel=nofollow. But for those who use rel=nofollow, making that change is a very sane goal.
This proposal comes to meet a real need that a portion of the web has already decided on
Same arguments were used around rel=nofollow. -.-
Tell it to someone who gives Goog the benefit of the doubt.
I can see where this is going, even if others don't.
The tweet that circulated was a rumor that may have originally been about this proposal, but it was not accurately understood at all, and it came out garbled.
Hence, it helps Google itself more than any mass of webmasters or web users who are demanding a change to Ajax handling...
Many/most webmasters will never have need for AJAX.
Very good point tedster... Most people can't even implement a simple redirect, canonicalize domains, or remove the /index.html from their directories without expert help, let alone code a site that's AJAX based.
Personally, I own two AJAX-based sites: one I would like to have crawled, and another I do not want crawled. The one I don't want crawled bans not only Googlebot but all other compliant bots in robots.txt, and this proposal certainly (enormously) simplifies the SEO on the other.
Honestly, if it was me and I had the traffic Google does, I would not suggest, I would state: If you run an AJAX based website and would like your site to be crawled and indexed by Google, place an ! after the # symbol to tell GoogleBot how to access the information.
People think Google is 'overstepping' or 'out of line' by suggesting? LMAO. Be glad I'm not in charge at Google, because I wouldn't ask I would dictate, much the same way M$ does with their browsers / software...
Example: return widget shops in various states:
http://www.example.com/returnShoplist#FL would become
http://www.example.com/returnShoplist#!FL - telling G that there is Ajax content
The bot then returns and asks for:
http://www.example.com/returnShoplist?_escaped_fragment_=FL
and your server generates the relevant HTML, e.g. through a headless browser
G would then return http://www.example.com/returnShoplist#!FL for a query on Florida widget shops
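A rough server-side sketch of that flow (assuming a Node/Express setup, which is purely my choice for illustration - the route and snapshot store are invented): when the bot asks for the _escaped_fragment_ version, serve the pre-rendered HTML for that state.

import express from "express";

const app = express();

// Hypothetical snapshot store: state -> pre-rendered HTML,
// e.g. produced ahead of time by a headless browser.
const snapshots: Record<string, string> = {
  FL: "<html><body>Widget shops in Florida...</body></html>",
};

app.get("/returnShoplist", (req, res) => {
  const state = req.query._escaped_fragment_;
  if (typeof state === "string" && snapshots[state]) {
    // Googlebot asking for /returnShoplist?_escaped_fragment_=FL
    res.send(snapshots[state]);
  } else {
    // Normal visitors get the AJAX shell; it reads #!FL client-side.
    res.send("<html><body><!-- AJAX shell page --></body></html>");
  }
});

app.listen(3000);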
Edit: in reply to Future way above - you guys must be able to read a lot faster than me :)
Also, I have some serious doubts about cost / benefit for webmasters vs cost / benefit for big G. I'll certainly not be jumping on this bandwagon anytime soon.
Unfortunately there is now a lot of inappropriate AJAX around the web - the kind of thing that's done mostly just to display someone's technical prowess (geek credentials). That approach hides useful content, and I think such situations are what this proposal is an attempt to resolve.
So if I keep using # (i.e. do not implement #!), would this mean that I can avoid the duplicate content?
Or would it require extensive use of rel="canonical" on such pages?
Or basically, what I am asking is: will pages that use AJAX but do not change # into #! still be ignored and not indexed by Google? In that case it stays within the user's control whether they want the page with the AJAX-changed state indexed or not.
But from what I understand, Google proposes that they index changed state of the page. So I currently have:
www.example.com/page1.html
www.example.com/page1.html#pic1
www.example.com/page1.html#pic2
www.example.com/page1.html#pic3
So Google says all these could appear in SERPS?
And in my example these pages are the same really apart from a different photo.
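For what it's worth, the kind of client-side handler behind such a gallery might look like this (a sketch only - the element ID and image paths are invented). It shows why the fragments are near-duplicates: only the photo changes, and everything else on the page stays the same.

// Swap the photo based on the fragment, e.g. page1.html#pic2
function showPicFromHash(): void {
  const pic = window.location.hash.replace("#", ""); // e.g. "pic2"
  const img = document.getElementById("photo") as HTMLImageElement | null;
  if (img && /^pic\d+$/.test(pic)) {
    img.src = "/images/" + pic + ".jpg"; // the rest of the page is untouched
  }
}

window.addEventListener("hashchange", showPicFromHash);
showPicFromHash(); // handle a direct arrival at page1.html#pic2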
So if I keep using # (i.e. do not implement #!), would this mean that I can avoid the duplicate content?
As I understand Google's proposal, you are correct in this analysis. The issue for Google is the same as for you - currently they cannot adequately distinguish distinct content changes on AJAX-driven pages, and they can't second-guess without creating never-ending duplicates. The proposal would offer the appropriate hints to Googlebot as to what is unique content and what is not.
Google is at the forefront of AJAX usage, but they are not looking to improve the indexing of their own content; rather, they want to promote AJAX adoption and improve their SERPs. It is in their interest to get more AJAX-driven content into their search results (improving search quality), and such inclusion would be mutually beneficial for webmasters who are currently wary of adopting AJAX solutions due to the technical challenge of getting into the current Google index.
For example, in cases where the frame (heading / footer / sidebars) of the page is graphics-heavy and AJAX is used to change the text content to save load time. The generated "page" IS a new page from the content point of view, but Google is not able to crawl it at the moment.
As long as it is in the user's control whether they add the !, this should impact only the ones who want to be impacted.
It is important, I think, to consider both the AJAX proposition and the (already implemented) new consideration of fragment-identifiers as two sides of the same coin. This is the same technical issue: fragment-identifiers are the "old-school" use of "#". Ask yourself - how will your current AJAX implementation be seen in relation to Google's attempts to subdivide pages into named fragments?
Ask yourself - how will your current AJAX implementation be seen in relation to Google's attempts to subdivide pages into named fragments?
Interesting thought...
Perhaps Google can distinguish between a # where the reference is to an anchor within the same page (which, from what I have seen, are the cases where page fragments were shown as "Jump to" links in the SERPS snippet) and a # where there is a request to get some part of the page content from the server using AJAX.
With AJAX pages being indexed, I would imagine that this would not be a "jump to a page fragment", instead it would appear as URL in its own right (including #!etc) in SERPS.
Maybe Google should allocate another "test data centre" like Caffeine to the webmaster community before they put these changes live, because if it ends up with bugs then who knows what this could do to SERPS - for both sites not using AJAX (an influx of new indexed content) and sites using AJAX (which could suddenly trip various filters owing to all the new site content suddenly being indexed).
Or should the SEO strategy for AJAX sites be "add ! to # little by little and see the impact on your site ranking..."
Perhaps Google can distinguish between a # where the reference is to an anchor within the same page (which, from what I have seen, are the cases where page fragments were shown as "Jump to" links in the SERPS snippet) and a # where there is a request to get some part of the page content from the server using AJAX.
Actually aakk9999, that part is fairly easy, because all you really have to do is parse the page referenced with #Reference in the link and check for <a name="Reference">Anchor Text</a>, and you can determine that it's content already on the page.
I may be over-simplifying slightly, but for the most part /page-linked.html#Reference + name="Reference" within an <a > on /page-linked.html indicates a named anchor. It's actually relatively simple to determine an anchor tag link given the overall complexity of spidering the Web. If it's not a named anchor <a name=""></a>, you can reasonably determine it's another type of reference, which would include AJAX.
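A quick sketch of that check (my own illustration - it uses the browser's DOMParser for brevity, where a real spider would use its own HTML parser): given the fetched page and the fragment, see whether the fragment matches an in-page target.

// Decide whether a fragment is an old-school named anchor on the page.
function isNamedAnchor(html: string, fragment: string): boolean {
  const doc = new DOMParser().parseFromString(html, "text/html");
  // <a name="Reference"> marks an in-page target (as does id="Reference");
  // anything else is some other type of reference, which would include AJAX.
  return doc.querySelector(
    'a[name="' + fragment + '"], [id="' + fragment + '"]'
  ) !== null;
}

// isNamedAnchor(pageHtml, "Reference") -> true: plain named anchor
// isNamedAnchor(pageHtml, "!FL") -> false: likely an AJAX state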
In reading the Webmaster Central Blog (I skimmed through it), I think where they have the tough part is actually parsing the JS / AJAX to get to the content, so they want you to serve it to SE spiders, GBot in particular, in a slightly different manner than via JS / AJAX, and the #! would indicate accessibility...
I could be mistaken in the preceding though as it's late and I have had an adult beverage. :)
Since the # is the only character that separates the server-side and client-side parts of the URL, there is not much choice anyway, if you don't want to make major changes to how URLs are interpreted today.
Current AJAX implementations show the default page state to a spider, and also to a human visitor on first arrival, before they click on an AJAX link. That does not create a cloaking problem.
And I don't believe it's a coincidence that #! is a Unix "hash-bang", a 2-character command indicating that the contents of a file is a script which should be executed.
And with this announcement we can abolish any notions that bots can't or don't execute javascript. We've known for some time that they can, and now we can be sure that they will. A URL with a #! requires the user-agent to execute AJAX requests to retrieve content for indexing, which requires the user-agent to execute scripts. QED.
I was wondering about this, and I keep thinking: if that's the case (they can / do execute JS), then why ask anyone to make a change? Why not just spider it if they can?
And then they wonder why Google doesn't love them.
For those sites that already have tons of content buried in AJAX, this will be a great way to get content attribution without having to create a separate channel for accessibility or having to rewrite a bunch of code.
I like it. I hope it gets adopted.
Graceful degradation is the way to go - this ensures that all of that "buried" content will be accessible to the broadest range of users, including Lynx, screen readers, and mobile devices - not just Google.
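As a sketch of what that graceful degradation can look like (the class name, URL, and container ID are all invented for illustration): the link carries a real href that works without script, and JavaScript upgrades it to an AJAX load while writing the #! state into the address bar.

// Progressive enhancement: real hrefs for non-JS user-agents,
// AJAX loading plus a #! fragment when script is available.
document.querySelectorAll<HTMLAnchorElement>("a.ajax-nav").forEach((link) => {
  link.addEventListener("click", async (event) => {
    event.preventDefault(); // only runs when JS is available
    const response = await fetch(link.href); // same URL Lynx or a bot would get
    const html = await response.text();
    const container = document.getElementById("shop-list");
    if (container) container.innerHTML = html;
    if (link.dataset.state) {
      window.location.hash = "!" + link.dataset.state; // e.g. #!FL
    }
  });
});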