Googlebot and XHTML Links - (deprecated) Google News Archive forum at WebmasterWorld - WebmasterWorld

Googlebot and XHTML Links

Googlebot not XHTML Compliant?

daroz

12:47 am on Jan 20, 2003 (gmt 0)

10+ Year Member

I've got a site that uses a flash detection page as well as session cookies (fallback to appending to the URL).

For the last 2 cycles, Googlebot seems to be choking on valid XHTML links. For example, an href of "www.example.com/index.php?val=1&amp;val2=1" is valid XHTML. (If you just used a bare '&' instead of the full '&amp;' it would be invalid.)

According to the W3C specs (http://www.w3.org/TR/xhtml1/#C_12), the user agent, Googlebot in this case, has to replace the '&amp;' with '&' before following that link.

Now here's the problem: the Flash detection page (which obviously isn't detecting Flash -- nor JavaScript) has a worst-case option of falling back to a META refresh tag to move forward. That URL has 2 parameters and uses '&amp;' to separate the values, as it should. However, Googlebot is _not_ converting the '&amp;' to '&' and is passing the URL in the META tag back to the webserver as-is. This leads to the session value not being passed on, so the session restarts and the visitor is sent back to redo the Flash detection.
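To make the failure concrete, here is a minimal Python sketch of what happens to the query string when the entity is decoded versus when it is sent as-is. The URL reuses the example above; the helper name `query_params` is made up for illustration, and this is not a claim about how Googlebot is actually implemented.

```python
from html import unescape
from urllib.parse import urlsplit

# Attribute value exactly as it appears in the XHTML source (example URL from above).
raw_href = "http://www.example.com/index.php?val=1&amp;val2=1"


def query_params(url):
    """Hypothetical helper: split a URL's query string on '&' into a dict."""
    query = urlsplit(url).query
    return dict(pair.split("=", 1) for pair in query.split("&"))


# What a compliant user agent should request: entities decoded first.
decoded = query_params(unescape(raw_href))

# What the server sees if the attribute value is used as-is.
undecoded = query_params(raw_href)

print(decoded)    # the second parameter arrives under the key 'val2'
print(undecoded)  # the second parameter arrives under the key 'amp;val2'
```

In the undecoded case the server never receives a parameter named `val2`, which matches the symptom of the session value going missing.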

Needless to say this is a problem.

Has anyone else seen this?

(And no, I don't want to disable the Flash or session items; Google hasn't had a problem with them before.)

korkus2000

12:36 pm on Jan 20, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Welcome to WebmasterWorld daroz,

Googlebot doesn't do well with session strings in URLs. They are constantly changing, and archiving these links will not help the index. Googlebot also doesn't like dynamic URLs that have many value pairs. Googlebot will not accept cookies either. So what really changed that made Googlebot not like it anymore?

ciml

2:50 pm on Jan 20, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Welcome to WebmasterWorld and the Google News forum, daroz.

Unless the & character is followed by whitespace (e.g. space, tab, newline), it must be escaped in SGML as well as XML, so this is an old issue. Unfortunately, search engines have had problems with entities for as long as I can remember.

If Google is failing to de-entify META refresh attribute values before using the URL, then hopefully someone at Google will notice and fix the bug. I notice that Google deals with ampersands in titles now, which is an improvement on past behaviour if I remember correctly.

daroz

5:54 am on Jan 21, 2003 (gmt 0)

10+ Year Member

korkus...

The only changes to the pages since September 2002, when the redesign was committed (and included in the October index), were two basic things:

1. Content update. (We're talking some image changes and text changes. Maybe an HTML table or two was added/removed; no meta tags or JavaScript have changed.)

2. The pages were revalidated over the end of 2002 to be XHTML 1.0 Transitional compliant. (Many links were not being escaped with '&amp;' and those were corrected.)
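As an illustration of the kind of correction described in item 2, here is a small Python sketch that escapes a URL for use in an XHTML attribute value. The function name `escape_href` is hypothetical, and the site itself may well do this differently (the URLs suggest PHP); it only shows the escaping step.

```python
from html import escape, unescape


def escape_href(url):
    """Hypothetical helper: escape a raw URL for use inside an attribute value."""
    # quote=True also escapes double and single quotes, which matters in attributes.
    return escape(url, quote=True)


raw_url = "index.php?val=1&val2=1"
escaped = escape_href(raw_url)

print(escaped)                       # index.php?val=1&amp;val2=1
print(unescape(escaped) == raw_url)  # True
```

Note that the helper should only be applied to raw URLs; running it on an already escaped value would escape the ampersand a second time.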

ciml...

I hope so too. I'm sending them a snippet of the last Googlebot visit on the 11th, so they might be able to reproduce it.

I'm not fond of serving Googlebot and visitors 2 different things, but I may be forced to do that (by disabling the Flash check page and sessions for Googlebot).