Forum Moderators: open
Nothing is wrong with HTML: after all, XHTML is just XML-compliant markup that corresponds to the HTML 4.01 schema, basically. Still, your page still isn't *quite* XML. And the way things are currently going, you can almost bet that pages in the future will be written mainly in XML (or at least a growing percentage will be).
... But in terms of programmatically generating pages in .NET, for instance, it's a breeze from a programmer's point of view to generate XML content with stylesheets to make it render appropriately.
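For what it's worth, here's a minimal sketch of the kind of thing I mean (file names and element names are just for illustration): an XML document pointing at an XSLT stylesheet, which a capable browser transforms client-side into renderable HTML.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="page.xsl"?>
<page>
  <title>Budget Widgets</title>
  <body>Our widgets are cheap.</body>
</page>
```

And a bare-bones page.xsl to go with it:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Turn the custom XML vocabulary into plain HTML for display -->
  <xsl:template match="/page">
    <html>
      <head><title><xsl:value-of select="title"/></title></head>
      <body>
        <h1><xsl:value-of select="title"/></h1>
        <p><xsl:value-of select="body"/></p>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```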
Unfortunately, only Gecko-based browsers and Opera can handle it...
But, I have faith that the next version of IE can handle it.
If not, XSLT can then be easily incorporated for IE users.
Bottom line, I like playing with the latest and greatest, even though it's not always commercially feasible.
I think the question should be: could Google index the XML and the associated XSLT stylesheet, with its h1 tags, anchors, alts, etc.?
Or, more succinctly: what factors would Google use to rank an XML document?
The XHTML standard is, I believe, supposed to be a transition or bridge from HTML to XML compliance for the web as a whole, according to the W3C, if I remember correctly.
Anyone please feel free to correct anything above.
Separation of presentation from content is one. There are many others, though.
You're right though, john, I was thinking in terms of client-side transformation... Still, I'm amazed that we have to transform a new standard into an old standard for the likes of someone like Google.
"I was trying to surf an xml doc in an old version of Netscape earlier today. Barf, said the browser."
The old Netscape doesn't have an XML parser. Jeez, I'm not trying to be funny, but if you're using an old Netscape I really am worried for Google.
[edited by: tantalus at 12:45 pm (utc) on Dec. 18, 2003]
Doesn't Google's searching the XML as text find the data?
If there are transforms associated then, in a world of browsers that can do the business, the user would see the data transformed..
I'm surprised people are expecting Google to do the transforms.. but then I am new to XML/XSLT.
My impression was that XHTML 1.1 -> 2.0 was here to stay and XSLT can generate it easily enough.
Anyway if you're interested..
I did a search for .xml; here's a couple of examples from the SERPs:
www.****xx.com/weblog/index.xml
File Format: Unrecognized - View as HTML
Similar pages
xx.****.com/index.xml
File Format: Unrecognized - View as HTML
Similar pages
All of them say "File Format: Unrecognized" and all are RSS feeds, if that makes any difference.
Click on "View as HTML" and you get a blank Google cache.
I'd be far more worried if Google wasn't interested in being compatible for the entire searching public. There are still large corporations and government agencies using Netscape 4.7.
Under 5% usage isn't much to worry about when the site is the one Joe_Webmaster's nephew made for him in FrontPage that gets maybe 500 uniques a month looking for budget widgets, but when you're getting into billions of searches, that's a lot of users whose needs would be neglected.
1. Does Googlebot follow an <a href> found in an XML page?
2. What about links that are not in the form of <a href>, such as <link>http://www.domain.com</link>. Does/will google follow those?
I guess my main question is, are pure XML pages currently "dead ends" for Googlebots?
see this thread maybe [webmasterworld.com ]
It seems probable that the days of coding the actual display language will end, and a machine-language standard will be adopted universally, with everything done via command text docs or WYSIWYG, the way console apps are done today, before the computer world gets turned upside down by XML or XHTML.
Wishing html was dead won't make it so.
but I expect it'll be all the rage in ~2006+?
I suspect a little sooner than that.
But it is certainly the future path - a search around for tools and applications that use XML as a mark-up language is testament to that.
It's not about browsers. It's about the integration of many platform-independent systems over the internet, of which browsers form part.
I doubt whether HTML will ever be "dead" - it will just move through various incarnations.
XML is not a mark-up display language, it's a protocol.
are pure XML pages currently "dead ends" for Googlebots?
If you link to a "pure" XML page then probably, yes (although google may index the text, but I very much doubt that). Certainly googlebot would not follow an XML <link> style tag.
TJ
I'd love to know whether you're attaching an XSL stylesheet to your posts :) oops, it's just gone.
I quickly looked at the advanced search on Google and noticed that in the 'return results of the file format' drop-down, neither .txt nor .xml was listed.
It does seem to index the title and follow links too, but that seems to be about it.
XML is not a mark-up display language, it's a protocol.
are pure XML pages currently "dead ends" for Googlebots?
If you link to a "pure" XML page then probably, yes (although google may index the text, but I very much doubt that). Certainly googlebot would not follow an XML <link> style tag.
I agree.
The problem with XML from the point of view of a search engine robot is what it says on the tin, i.e. eXtensible Markup Language. There are lots of different namespaces and flavours of XML, which is one of its appeals; it can be all things to all men. I guess that if it does get past the XHTML stage, then there would probably be a limited range of doctypes and DTDs that search engines would be prepared to crawl and parse.
I think that the tail may be wagging the dog for some time to come on this one, and who in their right mind is going to produce a web page that SE robots can't crawl when it is actually easier to produce one that they can?
Best wishes
Sid
It does seem to index the title and follow links too, but that seems to be about it.
That doesn't surprise me.... although "indexing the title" I don't really understand. Are you sure it's not indexing the anchor text of the link to the XML file?
XML is just text. If you create an XML file, but with an .HTML extension, then google will index it. And if you use <a href=> style tags for link structure, then it will probably follow the link and transfer PR through. But it will not validate, and to google it sure will look ugly.
XML is really just a protocol and data storage format. The data from an XML file is parsed into an HTML file for display to the user. And it's the "display to the user" part that google is interested in.
XML is also used to call a function, method or procedure on a SOAP server or other XML-based server application over a network. Googlebot would have absolutely no interest in that.
<guess>So I suspect what you're seeing in google is indexed anchor text and nothing more</guess>.
TJ
If you want to play with the latest and greatest, that is fine. Go for it. But if you want to be useful to the greatest number of people, then go with HTML. Google's goal is to serve the many, so they need to be concerned with what works with most browsers.
I expect that xml will gain in popularity, but there is no compelling reason for most sites to change from HTML. There are billions of static pages out there that are going to stay on the web for a long time, and they are owned by people that have no interest in being on the bleeding edge. XML will have to be in the browsers for quite a while before it even starts making a dent in the total number of pages out there.
Kinda right... it seems to use the URL as the title, sorry, I wasn't looking.
kirkcerny501
It might be to do with the doctype you are using... i.e.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
"Google needs to worry about the *majority* of their users"
I agree with your sentiments, bigdave, and at the end of the day, as was said previously, all you need do is implement a server-side transformation... but what I don't understand is, when the W3C has been trying to set a new standard (going back two or three years now) with all the big guns represented on the committee, i.e. M$ and IBM etc., I would at least expect Google to show a passing interest, particularly given the potential impact it could have on the net.
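For anyone wondering what a server-side transformation buys you: the server runs the XSLT itself and sends ordinary HTML down the wire, so Googlebot and old browsers never see the raw XML at all. A rough sketch (element names invented for illustration): adding an xsl:output declaration tells the processor to serialize the result as plain HTML 4.01 rather than XML.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Serialize the result as HTML (e.g. <br> rather than <br/>)
       so crawlers and legacy browsers get markup they understand -->
  <xsl:output method="html"
      doctype-public="-//W3C//DTD HTML 4.01 Transitional//EN"/>
  <xsl:template match="/page">
    <html>
      <head><title><xsl:value-of select="title"/></title></head>
      <body><h1><xsl:value-of select="title"/></h1></body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

The same stylesheet you'd hand to the browser can usually be reused on the server, which is why the transition is less painful than it sounds.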
Does Google like <br> better than <br />?
Hi,
I have an XHTML site/page which is in the SERPs at #3 for its primary term. It has quite a few <br /> tags and /> at the end of image tags; this does not seem to harm its SERP ranking.
From a sample of one, I can say that XHTML tags do not seem to affect Google.
Best wishes
Sid
what I don't understand is, when the W3C has been trying to set a new standard (going back two or three years now) with all the big guns represented on the committee, i.e. M$ and IBM etc., I would at least expect Google to show a passing interest, particularly given the potential impact it could have on the net.
Here is something that a lot of people in the computer industry just don't understand: standards bodies don't set standards. They never have and they never will. Even after it is voted in as a standard, it is not a standard.
The only real form of standard is a de facto standard: the one that is actually used.
Look at all the HTML that was deprecated with the introduction of CSS, or the new tags such as <strong>. Did those tags really go away with the introduction of the *new* way to do things, as the "standards" suggest? Hell no! Because the users did not deprecate them. In fact, the older method is usually a lot easier to read.
So what is represented in the HTML 4 docs is not really the standard. Everything new is part of the HTML 4 standard, but those things that have been removed or deprecated from the doc are no less a part of the standard. No one writing a browser will start pretending that they do not know what <b> means, just to meet W3C specs.
I'm sure that there are people at Google looking at XML, and they will be ready to fully index it when they decide there is value in it. Probably the best thing XML supporters can do is to start putting up XML pages so that they can reach a critical mass in the supply of pages. They should just understand that if they want to be on the bleeding edge, it is their blood that will be spilled. Those pages will not rank well for now.
It is better to start new topics in a new thread..
Closed tags such as <br /> are XHTML rather than HTML, so you're a couple of steps ahead of yourself with an HTML 4.01 Transitional header, which is why you're getting the suggestion that the tag is XML.
You might want to read this thread.. [webmasterworld.com ]
HTML 4.01 Transitional... Should I change this doctype statement, delete it, or keep it?