This 51-message thread spans 2 pages.
Semantic HTML: Does mark-up provide enough meaning in web documents?
Opinions?
createErrorMsg




msg:559124
 6:46 pm on Nov 6, 2005 (gmt 0)

You've got a div which contains a single piece of data: the serial number of a product, for example.

Options for mark-up include placing the data in a paragraph tag, or leaving it as just the raw text with the div acting as its element.

Which is, in your opinion, semantically correct?

Things to consider:
(a) Technically, it's not a paragraph, as a paragraph is by definition a collection of sentences. This is only a 14-digit number.
(b) Technically, you can't just put text (it's numbers, but that's still text) uncontained into a block level element. I.e., it's not semantically correct to place it in the container without an appropriate block level parent element.

What would you do? Which do you think is more in line with the ideals of semantic mark-up? Do you buy into the idea that a div, as a generic block level container, is a good enough container for text, or does text need to be contained in an element with semantic value, even if the semantic value doesn't precisely match the purpose of the text? What about the idea of nesting the text in a span set to display:block? How semantic, or non-semantic, is that?

This question is similar to an earlier discussion about semantic markup in poetry [webmasterworld.com]. I know what I think, but I'm not entirely convinced that I'm right. I'd like to know what opinion others have.

cEM

 

moltar




msg:559125
 7:10 pm on Nov 6, 2005 (gmt 0)

You raise a very interesting question! I often think about that too.

Usually I tend to go for an extra <p> tag, but you could also consider the <code> element, or you could even justify <kbd> (if the serial code has to be entered).

I definitely wouldn't use <span> though.

encyclo




msg:559126
 1:01 am on Nov 7, 2005 (gmt 0)

Perhaps we should ask: is the problem really the content or the div itself? Why is there a div in the first place rather than an unordered list, a definition list or even a table?

If we are looking uniquely at how to handle a product serial number, it is best to place it in relation to its context. For example:

<dl>
<dt>WidgetCo WebCam</dt>
<dd>1234567890</dd>
<dt>WidgetCo deluxe camcorder</dt>
<dd>0987654321</dd>
</dl>

or:

<table summary="WidgetCo products">
<tr>
<th id="product-name">Product name</th>
<th id="serial-number">Serial number</th>
</tr>
<tr>
<td headers="product-name">WidgetCo WebCam</td>
<td headers="serial-number">1234567890</td>
</tr>
<tr>
<td headers="product-name">WidgetCo deluxe camcorder</td>
<td headers="serial-number">0987654321</td>
</tr>
</table>

It is not just the meaning of a piece of data in isolation, but its relationship with other data on the same page and externally (via links).

JAB Creations




msg:559127
 2:24 am on Nov 7, 2005 (gmt 0)

This page will hopefully give you a quick reference through which you can pick and choose tags a little better suited to what you are doing...

[w3.org...]

According to the W3C...

The DIV and SPAN elements, in conjunction with the id and class attributes, offer a generic mechanism for adding structure to documents.

- [w3.org...]

So if no other tags seem to describe this piece of information you would be best off using the span element.

tedster




msg:559128
 2:31 am on Nov 7, 2005 (gmt 0)

Where does the concept originate that a div must have a child element to hold its content?

I have pages that validate as HTML 4.01 Strict and they contain product information placed directly inside a div with no paragraphs, spans or other containers -- just the div, and it validates.

JAB Creations




msg:559129
 3:03 am on Nov 7, 2005 (gmt 0)

Let me give the W3C's original quote a little more perspective...

The DIV and SPAN elements, in conjunction with the id and class attributes, (each independently) offer a generic mechanism for adding structure to documents.

Again it's perspective as I have not seen the W3C suggest that divs and spans be used. Remember that divs are block level elements while spans are inline level elements. I think (Ted) that you confused the fact that I suggested them both at the same time without enough clarification.

HTML 4 as far as the validator is currently concerned allows <body><span> while XHTML 1.1 will not validate with inline elements in the same manner.

So if you're using HTML 4 you are allowed to use inline elements in the body without the usage of parent elements, at least according to the validator. I'm not sure what the W3C says about this specific topic.

I would assume an XHTML 1.0 transitional/strict doctype would also perhaps require inline elements to be inside parent elements to validate?

Purple Martin




msg:559130
 3:35 am on Nov 7, 2005 (gmt 0)

Why is there a div? Is the div doing anything useful, for example positioning/styling?

Who said that "a paragraph is by definition a collection of sentences"? I've seen paragraphs that have only one word and yet are still proper paragraphs.

Is this the only product id on the page, or are there many products, each with their own id?

createErrorMsg




msg:559131
 4:01 am on Nov 7, 2005 (gmt 0)

a quick reference through which you can pick and choose tags

I think maybe I wasn't clear in my original post. Mine was a question of theory, not a specific question related to a particular coding situation. Referring me to the W3 documents about which tags are available doesn't confront the question. I know what tags there are. The fact is that certain bits of information fall through the cracks in the mark-up system and I was looking to hear what people thought about what ought to be done with those bits.

We all know the old standard about DIV and SPAN allowing authors to add their own structure to a document, but such structure holds no semantic meaning. Putting, for instance, a serial number, product cost, or day/month/year, in a span inside a div isn't any more semantically meaningful than simply putting the information in the div uncontained. For that matter, it's not any more semantically meaningful than putting it directly into the body of the document.

But I have a problem with uncontained stuff. Ted raises a good point: why should I, or we, have a problem with putting raw text into a div? Personally, I think I've just become so used to using divs as "large-scale" containing blocks that I have a resistance to putting things inside of them without an accompanying semantic parent.

And this is what I was wondering. Do others feel the same way? If so, any idea WHY we've come to feel this way?

Why is there a div in the first place rather than an unordered list, a definition list or even a table?

Encyclo, you, of course, raise a valid point. Discover the meaningful context in which the information resides and go from there. However, even your suggested solutions are mere approximations of meaning. DT and DD, for instance, establish a term/definition relationship, but does this really mirror the relationship between a product name and its serial number? The point, of course, is that an argument could be made either way. And when that happens, how should we go about picking the right one to use? A definition list, a table, a header and paragraph, generic divs and spans? What are the criteria?

And then what about when the demands of the design stand in the way? What if I need, for instance, both the div AND the text containing element in order to attach styles to? What if there is no sensible way to achieve layout demands using the structure of a definition list - or any established structure at all? CSS is extremely versatile, but it does have its limitations, and I'm sure we've all experienced times when the contortionist tricks needed to wrench Structure A into Layout B compromise other principles of good coding. In such a situation, how do we find the right middle ground? Do we use generic, non-structure (div and span), or fudge it with structure that is "close enough"?

Obviously, there is no right answer to any of this, only opinions and personal philosophies, but that's exactly what I'd love to hear. What's your take on this? Which parts of validity and semantics and "good coding" are the most important ones?

cEM

[added]
Why is there a div? Is the div doing anything useful, for example positioning/styling?

Assume so, yes. If the div is necessary, what do you do with the text inside it that has no clear-cut place in the world of semantic markup?

Who said that "a paragraph is by definition a collection of sentences"?

My English degree. ;) But seriously, that's what a paragraph is. If that isn't what a paragraph is, then what is it? And if we conclude that it is "any piece of text of any length" then we've essentially turned it into a div with some default padding, not a semantic element at all.

Granted, a collection of sentences can be one sentence long, and that sentence could be just one word, but not every word in isolation makes a proper sentence. For instance...

The.

...is not a proper sentence. I think the definition needs to be something more substantial.
[/added]

tedster




msg:559132
 4:15 am on Nov 7, 2005 (gmt 0)

Some very interesting points here.

I do use divs for something other than semantic structure -- most especially for giving more VISUAL structure to the page. I made a strong study of print typography before the web was born, and I find that divs and stylesheets give me a way to bring some of that sensibility to a web page.

As an example, I will change the padding or font (or background-color) for a section of text just to give it some way to stand out visually from the surround. If such a div has any semantic meaning at all, it's only "this is more important than the rest of the page" or "this is summary information". I suppose that is semantic information of a kind, though, no?

In fact, at some point the line between visual and semantic information begins to get very fuzzy, as many apparent lines do when we get up real close.

Aside to JAB_Creations: my comments were not made in response to your post, but rather to the opening post.

rocknbil




msg:559133
 4:24 am on Nov 7, 2005 (gmt 0)

IMHO (as one who breaks rules of semantics without regard, :-) ) I say if semantics are important you make your own XML tag

<serial>1234567</serial>

But forgoing XML, I would go with the generic div.

JAB Creations




msg:559134
 4:51 am on Nov 7, 2005 (gmt 0)

CEM - Is this serial the only one on the page? Are there other serials? If there are other serials, that might hint at using a table.

bedlam




msg:559135
 7:07 am on Nov 7, 2005 (gmt 0)

Interesting question. I have the distinct feeling that a lot of html coders (including me...) sometimes want html to be xml ;-)

We all know the old standard about DIV and SPAN allowing authors to add their own structure to a document, but such structure holds no semantic meaning.

I don't agree. As has already been mentioned, it's quite clear that the w3c's opinion is that using generic elements with classes is a way of marking up content in a meaningful way.

This relates to the question I asked in the poetry thread and in my 'heavy markup' offshoot thread: why worry about the 'semantic' aspects of markup at all? The only reasonable answer I've been able to come up with (aside from 'hygienic' considerations...) is that it's potentially useful in case the content in question will ever be transformed, mined or otherwise repurposed using DOM scripting, XSLT or similar, or in case it is expected to be accessed using different kinds of user agents.

Provided that I haven't overlooked anything obvious, I think this suggests that <div class="serial blue-widget">123456</div> is at least meaningful enough to work as well as a dedicated '<serial>' element.

I think that when we take this line of inquiry, we wind up asking the question 'if such markup is not meaningful in the same way as a specific element, what is the relevant difference? E.g. why is one meaningful and not the other?'
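The repurposing argument above can be sketched roughly as follows. This is only an illustration, assuming Python's standard `html.parser` as a stand-in for DOM scripting; the `ClassTextCollector` class, the `texts_with_class` helper and the sample page are hypothetical names made up for the example, not anything from the thread.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying a given class name.

    Sketch only: void elements written without a slash (e.g. <br>) inside
    a matching element are not handled.
    """

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # > 0 while inside a matching element
        self.found = []     # text of each matching element, in page order

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1             # nested tag inside a match
        elif self.wanted_class in classes:
            self.depth = 1
            self.found.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.found[-1] += data


def texts_with_class(markup, wanted_class):
    """Return the stripped text of each element whose class list matches."""
    parser = ClassTextCollector(wanted_class)
    parser.feed(markup)
    parser.close()
    return [text.strip() for text in parser.found]


# Hypothetical page fragment using the generic-div-plus-class approach.
page = """
<div class="serial blue-widget">123456</div>
<p>Some unrelated text.</p>
<div class="serial red-widget">654321</div>
"""
serials = texts_with_class(page, "serial")
print(serials)
```

The element name plays no part in the lookup: the class alone is enough for a script to find every serial number, which is the sense in which the generic div is "meaningful enough" here.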

Who said that "a paragraph is by definition a collection of sentences"?

My English degree. ;) But seriously, that's what a paragraph is. If that isn't what a paragraph is, then what is it? And if we conclude that it is "any piece of text of any length" then we've essentially turned it into a div with some default padding, not a semantic element at all.

I think there are very good reasons for thinking that the way the authors of the html 4 specs thought about paragraphs was in a more or less purely typographical way; to repeat an oft-used example, if a paragraph is a unit of meaning, then it should arguably be able to contain an un/ordered list (at least there is nothing about an un/ordered list that I'm aware of that should prevent its belonging to a 'collection of sentences').

On the other hand, if a paragraph is a simple typographic construct (some arbitrary number of words followed by a hard return), then the <p> element should behave pretty much as it does -- and I have to admit that I think <p> in HTML is, or is almost, '...not a semantic element at all'. If it is a 'semantic' element in the first place, it's nowhere near as rich in that sense as any of the lists or even the address element...

-B

Purple Martin




msg:559136
 3:22 am on Nov 9, 2005 (gmt 0)

Who said that "a paragraph is by definition a collection of sentences"?

My English degree. ;) But seriously, that's what a paragraph is.

Here is an extract from "Around The World In Eighty Days". Notice that one of the paragraphs contains a single word.

‘But have you got the robber’s description?’ asked Stuart.
‘In the first place, he is no robber at all,’ returned Ralph, positively.
‘What! a fellow who makes off with fifty-five thousand pounds, no robber?’
‘No.’
‘Perhaps he’s a manufacturer, then.’
‘The Daily Telegraph says that he is a gentleman.’

If that isn't what a paragraph is, then what is it? And if we conclude that it is "any piece of text of any length" then we've essentially turned it into a div with some default padding, not a semantic element at all.

That's a very interesting question. I think that a <p> tag does still convey meaning even when the paragraph only contains one word, whereas a div is just a generic container with no implied meaning.

----------------

Which still leaves us with the question of which tag you should use for your product id.

If there are several products on the page, each with an id, I would be tempted to arrange them in either a table or some kind of list so that each product's id sits in the relevant table or list tag.

JAB Creations




msg:559137
 6:51 am on Nov 9, 2005 (gmt 0)

I think we need someone who is naturally very fluent in both technical and English! You usually get one or the other but I've never really (in person/personally) seen both! ;)

moltar




msg:559138
 2:36 pm on Nov 9, 2005 (gmt 0)

createErrorMsg appears to be one of them. He has an English degree and he is very good at CSS (he is the forum mod after all!).

createErrorMsg




msg:559139
 3:51 pm on Nov 9, 2005 (gmt 0)

Notice that one of the paragraphs contains a single word.

Absolutely. To reiterate, though: a paragraph is by definition [answers.com] a collection of sentences. That collection may, given certain circumstances (usually in either a narrative piece or in dialog, as in your example), be only one sentence long. Furthermore, that one sentence could conceivably consist of just one word. However, this does not mean that every instance of a single word surrounded by hard returns is a sentence. For example...

A.

The.

At.

Not sentences. Not paragraphs.

In English usage, a paragraph is part of a hierarchy of meaning that starts with memes and builds up through words, clauses, sentences, paragraphs, and so on to the document whole. As a unit of meaning in expository writing, it is defined as a collection of sentences which combine to deliver a main idea and supporting evidence. That meaning is somewhat looser in narrative, dialog and a few other forms of writing, but even then you'll notice that a properly structured paragraph is used to deliver a unit of meaning - one person's turn in speaking, emphasis on a particular term or phrase, etc.

In other words, there is a rhyme and reason to where paragraph breaks occur in written language, and it is not - or should not be - an arbitrary insertion of a hard return. This, of course, is not to imply that anyone here said it was, but only to support the argument that a paragraph serves more than just a typographic function. It is an integral part of making written language comprehensible.

However, bedlam's idea that a paragraph in a web document is intended to be considered typographically is an interesting notion. I would be interested to see if anything in the W3 documentation supports that. I think knowing the intended use of the paragraph in the Semantic Web would go a long way toward settling this. If it is a purely typographical construct, then the concerns raised by ambiguous applications are fairly well resolved: hard return = paragraph. If not, we're still in la-la land.

[added]
I'm beginning to think that bedlam may be right. In reading what I'm capable of understanding in the Semantic Web documents (ugh), it seems to me that they pretty quickly moved beyond the use of HTML markup as a way to deliver meaning within the structure of a document. The progression, according to several documents, seems to have been META tags, semantic HTML markup tags, XML, and finally RDF. XML and RDF allow authors to assign the meaning they need to a document in a far more specific way than HTML allows. From Simon Willison's weblog:

Sure, marking something up as a paragraph or header is more meaningful than leaving it in a semantically uninteresting div or span, but to be truly semantic the markup would need to tell us what the element actually is - a headline, or the author of an article, or a list of navigation options for the page. That kind of information is the realm of XML.

So bedlam, your point is well taken.
[/added]

cEM

jetboy




msg:559140
 4:14 pm on Nov 9, 2005 (gmt 0)

In *isolation*:

<div class="serial">12345</div>

But the semantic weight of a piece of data is not only derived from the actual data, but also the data around it. If the data around it is logically connected, as in encyclo's first example, then the rules change.

<dl>
<dt>Product</dt>
<dd>
<ul>
<li class="serial">12345</li>
<li class="colour">red</li>
<li class="weight">1 ton</li>
</ul>
</dd>
</dl>

The serial number is part of a *list* of attributes that go towards *defining* the nature of the product. Whether that's a wholly legitimate use for a definition list is open to debate, but to answer the question:

does text need to be contained in an element with semantic value, even if the semantic value doesn't precisely match the purpose of the text?

I'd say, "Maybe" ;) It doesn't *need* to be, but it can add a perceived semantic structure to your document if you do.

bedlam




msg:559141
 8:47 pm on Nov 9, 2005 (gmt 0)

A.

The.

At.

Not sentences. Not paragraphs.

Ever read any Faulkner? ;-D

a paragraph serves more than just a typographic function. It is an integral part of making written language comprehensible.

Well, paragraphs are a relatively late development in the history of written languages -- they didn't make an appearance in European books until well after the introduction of movable type [images.google.com].

There's no 'just' to typography; typography's most important use is precisely making written language comprehensible. It's just that it uses spatial and visual methods rather than metadata.

I would be interested to see if anything in the W3 documentation supports that

Sorry, I don't have any direct documentary evidence; it's a conclusion I've drawn based on how <p> can be used. I think that, as in other issues with the specs that we've discussed on the boards, the HTML spec [w3.org] is not sufficiently clear to resolve this question. It says:

Authors traditionally divide their thoughts and arguments into sequences of paragraphs. The organization of information into paragraphs is not affected by how the paragraphs are presented: paragraphs that are double-justified contain the same thoughts as those that are left-justified.

On one hand, this seems to argue against what I'm supposing; note that it's concerned with the structuring of information and separation of that structure from style.

But on the other hand (and there may obviously be historical, pre-w3c reasons for this), <p> is just not allowed to be used in ways that are obviously meaningful (and that clearly correspond to common uses in other written texts).

Again, my favourite simple example is the list. This:

<p>
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Fusce velit. Maecenas justo sem, vulputate ac, accumsan nec, feugiat vitae, lectus. Ut imperdiet.
<ul>
<li>Lorem</li>
<li>Ipsum</li>
<li>Dolor</li>
<li>Sit</li>
<li>Amet</li>
</ul>
Vestibulum lacus. Pellentesque semper, odio ut consectetuer consequat, ante turpis porttitor diam, eu tempor neque nisi quis lorem.
</p>

...is invalid as HTML, but is a perfectly reasonable way to build a paragraph if a paragraph is a unit of related information. I've just never heard any reasonable 'semantic' explanation for this behaviour. The same goes for <blockquote>.

But the semantic weight of a piece of data is not only derived from the actual data, but also the data around it.

This may be a good point. Context does seem relevant to this question, but I don't know if it alone can help much to resolve it. Your example (whose markup I have changed to reflect what I think is a better use of the <dl> element):

<dl>
<dt>Product</dt>
<dd class="serial">12345</dd>
<dd class="colour">red</dd>
<dd class="weight">1 ton</dd>
</dl>

...just moves the question one layer outward in the markup. Where the question was about the serial number and how to mark it up, it's now about the product; in other words, should the dl now be marked up as <dl class="product">? Also notice that with a more natural use of the dl / dt / dd elements than in the given example, the structure needs to have classes in order to be comprehensible (as did the li elements in the previous sample).

Again, I find it useful to think about using the DOM to, for example, highlight all elements of a certain type on a page, or using xslt to transform this content to or from xml. With the classes, the dd elements can be operated on based on the specific kind of information they contain, regardless of the order in which they occur. Without id or class attributes, this is impossible.
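A rough sketch of that transformation, assuming Python's standard `xml.etree.ElementTree` in place of XSLT and assuming the fragment is well-formed XML with a single class name per element. The `dl_to_xml` helper is hypothetical, and `class="product"` on the dl is the speculative markup raised in the question above, not an established convention. The dd elements are deliberately shuffled to show that the classes, not the order, carry the information.

```python
import xml.etree.ElementTree as ET

# The sample definition list, with the dd elements reordered on purpose.
markup = """
<dl class="product">
<dt>Product</dt>
<dd class="weight">1 ton</dd>
<dd class="serial">12345</dd>
<dd class="colour">red</dd>
</dl>
"""


def dl_to_xml(dl_markup):
    """Turn class-labelled dd elements into named XML child elements."""
    dl = ET.fromstring(dl_markup.strip())
    # The dl's own class names the record; fall back to a generic name.
    record = ET.Element(dl.get("class", "item"))
    for dd in dl.iter("dd"):
        kind = dd.get("class")
        if kind is None:
            continue  # without a class there is nothing to key on
        field = ET.SubElement(record, kind)
        field.text = (dd.text or "").strip()
    return record


product = dl_to_xml(markup)
print(ET.tostring(product, encoding="unicode"))
print(product.findtext("serial"))
```

Strip the class attributes from the input and the loop has nothing to distinguish one dd from another, which is the point being made about id and class.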

-B

asomervell




msg:559142
 9:29 pm on Nov 9, 2005 (gmt 0)

I'm honestly dumbfounded by the notion that divs should be used in text. I can understand the use of spans WITHIN a paragraph to apply style, such as <span style="color:#FF0000">Angry little man</span> but...

The DIV and SPAN elements, in conjunction with the id and class attributes, offer a generic mechanism for adding structure to documents.

My interpretation of this, and there's no convincing me otherwise, is that a div should be used to position a block of paragraphs and headings etc. which is exactly "adding structure to documents."

<p> tags should be used for paragraphs. A paragraph should be defined as the collection of one or more sentences.

<h#> tags should be used for various sized headings

If you want to put 10 paragraphs in the middle of a page, use a DIV to get them there.

bedlam




msg:559143
 9:38 pm on Nov 9, 2005 (gmt 0)

My interpretation of this, and there's no convincing me otherwise, is that a div should be used to position a block of paragraphs and headings etc. which is exactly "adding structure to documents."

I think you've slightly missed the point; 'structure' in this sense means 'meaning'. You can see that this is what's meant in the w3c's Introduction to HTML 4 [w3.org]:

HTML has its roots in SGML which has always been a language for the specification of structural markup. As HTML matures, more and more of its presentational elements and attributes are being replaced by other mechanisms, in particular style sheets. Experience has shown that separating the structure of a document from its presentational aspects reduces the cost of serving a wide range of platforms, media, etc., and facilitates document revisions.

Notice how the wording clearly distinguishes 'structure' from 'presentational aspects' which are handled by 'style sheets' -- these 'presentational aspects' include positioning [w3.org].

-B

jetboy




msg:559144
 10:00 pm on Nov 9, 2005 (gmt 0)

I'm honestly dumbfounded by the notion that divs should be used in text. I can understand the use of spans WITHIN a paragraph to apply style, such as <span style="color:#FF0000">Angry little man</span>

Again, all about the context. If the serial number was part of a greater inline structure within a block level element:

<p>The serial number of the widget is <span class="serial">12345</span> which makes it a blue widget</p>

then you'd be absolutely right. However, in isolation, it has to be contained in its own block level element, and so the question is which block level element?

asomervell




msg:559145
 10:02 pm on Nov 9, 2005 (gmt 0)

Why the crusade to drop tags that have semantic meaning though? The ID and layout of the divs for the document DO give meaning but you don't take them right the way down to the last word! I'll put some code together...


<html>
<body>
<div id="header"><img src="images/logo.jpg" alt="ACME" /></div>
<div id="pageSurround">
<div id="colLeft">
<ul class="navLeft">
<li>List Item One</li>
<li>List Item Two</li>
</ul>
</div>
<div id="colMid">
<h1>Big Product</h1>
<p>Sentence One.</p>
<p>Sentence Two. Sentence Three.</p>
<h2>Random Poetry</h2>
<p>Line One<br/>
Line Two<br/>
Line Three</p>
<h2>Product Specs</h2>
<dl>
<dt>Code</dt>
<dd>ERD3014</dd>
<dt>Serial</dt>
<dd>32424442345</dd>
</dl>
</div>
<div id="colRight">
<img src="images/productimage.jpg" alt="ERD3014" />
</div>
</div>
</body>
</html>

Change that to how you'd do it...

asomervell




msg:559146
 10:10 pm on Nov 9, 2005 (gmt 0)

Jetboy, that's exactly how I'd use a span...

I'm obviously continuing on a bit from the poetry thread so... back to having a single piece of data:

Where is it on the page, cEM? Does it have "Serial:" to the left of it? If not, why are you displaying it and how does someone know what it is?

Put it in context, is it:


<h1>Product Title</h1>
<span class="serial">12345678910111</span>
<p>Text about the product.</p>

?

JAB Creations




msg:559147
 11:08 pm on Nov 9, 2005 (gmt 0)

If there is a presentational issue where using a DIV (a block level element) does not allow the desired formatting through CSS then I would go with a SPAN.

However if the DIV (which will naturally stretch to be 100% wide) does not interfere with visual presentation of the page then I suppose you could go with <div id="serial">?

jetboy




msg:559148
 11:18 pm on Nov 9, 2005 (gmt 0)

If there is a presentational issue where using a DIV (a block level element) does not allow the desired formatting through CSS then I would go with a SPAN.

No, no, no. Presentation should have nothing to do with it. That's CSS. <div>s are block-level elements, <span>s are inline. That's a fundamental difference.

For the record, if you put this at the top of your stylesheet:

div { display: inline; }
span { display: block; }

then <div>s would act like <span>s, and <span>s would act like <div>s. Does it mean they're interchangeable? No.

Purple Martin




msg:559149
 1:30 am on Nov 10, 2005 (gmt 0)

In reading what I'm capable of understanding in the Semantic Web documents (ugh), it seems to me that they pretty quickly moved beyond the use of HTML markup as a way to deliver meaning within the structure of a document. The progression, according to several documents, seems to have been META tags, semantic HTML markup tags, XML, and finally RDF.

I can see a problem with allowing web-page semantics to progress beyond HTML to XML. The great thing about HTML tags is that they are defined in a universal standard, so that everyone knows exactly what each tag means. The trouble with XML is that everyone can make up their own tags.

For example: you could invent an XML tag called <productid>, I might invent an XML tag called <prod_id>, and someone else might invent an XML tag called <idProd>. All three tags were intended for the exact same purpose with the exact same meaning, but with everyone inventing their own tags for their own web pages, how is a reader supposed to draw meaning from anything?
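A small sketch of that problem, assuming Python's standard `xml.etree.ElementTree`. The three documents, the extra `<sku>` element and the `SHARED_VOCABULARY` mapping are all hypothetical; the mapping stands for exactly the kind of agreed definition that a reader in the wild does not have.

```python
import xml.etree.ElementTree as ET

# Hypothetical agreement between three authors: each private tag name is
# declared to mean the same thing. Nothing in XML itself provides this.
SHARED_VOCABULARY = {
    "productid": "product-id",
    "prod_id": "product-id",
    "idProd": "product-id",
}

documents = [
    "<item><productid>12345</productid></item>",
    "<item><prod_id>67890</prod_id></item>",
    "<item><idProd>24680</idProd></item>",
    "<item><sku>13579</sku></item>",   # a fourth author nobody told us about
]


def product_ids(xml_documents, vocabulary):
    """Split child elements into recognised product ids and unknown tags."""
    ids, unknown = [], []
    for source in xml_documents:
        for element in ET.fromstring(source):
            if vocabulary.get(element.tag) == "product-id":
                ids.append(element.text)
            else:
                unknown.append(element.tag)
    return ids, unknown


ids, unknown = product_ids(documents, SHARED_VOCABULARY)
print(ids)
print(unknown)
```

The first three tags are readable only because the mapping was supplied in advance; the fourth is well-formed XML and still tells the program nothing.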

XML is great for data transfer from one application to another, because it can be moulded to fit any required data structure perfectly. But I don't think it should take over from HTML because there is no standard set of XML tags each with a universally-understood meaning.

XML and RDF allow authors to assign the meaning they need to a document in a far more specific way than HTML allows.

That's very useful as long as the document stays within a controlled group of users who have all been supplied with a definition of the meaning of every custom tag. As soon as a document gets released into the wild (as every web page is) the XML tags become meaningless, because the meaning of each custom tag hasn't been supplied to everyone in the world. With HTML, the meaning of every tag has been supplied to everyone in the world, in the form of a W3C Recommendation.

rjohara




msg:559150
 4:10 am on Nov 10, 2005 (gmt 0)

XML is great for data transfer from one application to another, because it can be moulded to fit any required data structure perfectly. But I don't think it should take over from HTML because there is no standard set of XML tags each with a universally-understood meaning.

XML and RDF allow authors to assign the meaning they need to a document in a far more specific way than HTML allows.

That's very useful as long as the document stays within a controlled group of users who have all been supplied with a definition of the meaning of every custom tag. As soon as a document gets released into the wild (as every web page is) the XML tags become meaningless, because the meaning of each custom tag hasn't been supplied to everyone in the world. With HTML, the meaning of every tag has been supplied to everyone in the world, in the form of a W3C Recommendation.

Purple Martin is correct that the virtue of HTML has been (and is) that its limited set of elements is universally recognized. That's why HTML took off like a rocket in the 1990s, whereas SGML (from which HTML was derived) had been in existence since the 1960s and nobody except people who programmed things like service manuals for airplanes had ever heard of it.

But as CEM observed in the starting post for this thread, HTML is very limited in its element-set. All of us who work in any special field discover this very quickly when we want to label something as a <serial-number>12345</serial-number> or as a <binomen><genus>Progne</genus><species>subis</species></binomen>.

The names of the allowed elements in HTML constitute what librarians call a "controlled vocabulary," and controlled vocabularies are essential for accurate work in many spheres of activity because if people use varying and irregular terminology then communication is almost certain to fail. (And since computers are much stupider than people, they rely on controlled vocabularies even more.) Learning to work in a particular field often means learning the controlled vocabulary for that field: for a long time (and it may still be so) if you were looking for books about World War I in a Dewey Decimal-based library catalog, you had to look under the heading "The Great War" (as WWI was initially called). Most academic libraries have two big red books in the reference area that contain the Library of Congress Subject Headings: the controlled vocabulary of terms under which books are cataloged. (The Google Generation is used to looking up any random keyword, but that's not how specialized work is done in most fields.)

So what's the solution to this problem:

As soon as a document gets released into the wild (as every web page is) the XML tags become meaningless, because the meaning of each custom tag hasn't been supplied to everyone in the world. With HTML, the meaning of every tag has been supplied to everyone in the world, in the form of a W3C Recommendation.

The solution will be found, as CEM notes, in the things that are being developed under the heading of the Semantic Web. What this boils down to is the ability of groups of people to establish their own controlled vocabularies which are then made public and so are accessible to "everyone in the world." This will allow me to, say, create a webpage containing the markup <binomen><genus>Progne</genus><species>subis</species></binomen>, and also containing a pointer to a public resource (call it "biological systematics: controlled vocabulary") that defines the special (non-HTML) elements I've used. If you view my webpage, your browser will say to itself, "Hmm, I need to download the set of definitions in 'biological systematics: controlled vocabulary' before I can finish rendering this document correctly."
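One existing mechanism that resembles the "pointer to a public resource" described above is the XML namespace: the document declares a URI for its vocabulary, and every custom element is tied to that URI. The following is only a minimal sketch of that idea in Python. The namespace URI, the `page` wrapper element, and the `read_binomen` helper are all assumptions made up for illustration; no such published vocabulary is known to exist at that address, and the code only matches element names against the URI rather than downloading any definitions.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI standing in for a published
# "biological systematics: controlled vocabulary".
BIO_NS = "http://example.org/biological-systematics"

# A page whose custom elements are tied to that vocabulary by namespace.
DOCUMENT = f"""
<page xmlns:bio="{BIO_NS}">
  <bio:binomen>
    <bio:genus>Progne</bio:genus>
    <bio:species>subis</bio:species>
  </bio:binomen>
</page>
""".strip()


def read_binomen(xml_text: str) -> str:
    """Return 'Genus species' from the first binomen in the declared vocabulary."""
    root = ET.fromstring(xml_text)
    ns = {"bio": BIO_NS}
    binomen = root.find("bio:binomen", ns)
    if binomen is None:
        raise ValueError("no binomen element from the expected vocabulary")
    genus = binomen.findtext("bio:genus", namespaces=ns)
    species = binomen.findtext("bio:species", namespaces=ns)
    return f"{genus} {species}"


if __name__ == "__main__":
    print(read_binomen(DOCUMENT))
```

A consumer that does not recognize the namespace URI can still parse the document; it simply has no meaning to attach to the elements, which is the gap the shared vocabulary is meant to close.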

Some of the first visible steps being taken in this direction are happening in the blogging world. The XHTML Friends Network [gmpg.org] is a great example: it defines a controlled vocabulary of attribute/value pairs for HTML anchor elements that specify a variety of personal relationships. There will certainly be more things like this appearing in the not-so-distant future.
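To make the XFN idea concrete, here is a rough sketch of how a program might pick those relationship values out of ordinary anchor elements. It uses only Python's standard `html.parser`. The `XFN_VALUES` set is a partial list of XFN values chosen for illustration, and the sample markup, the example.com URLs, and the `extract_xfn` helper are assumptions, not part of XFN itself.

```python
from html.parser import HTMLParser

# A partial list of XFN relationship values, for illustration only.
XFN_VALUES = {
    "friend", "acquaintance", "contact", "met", "co-worker",
    "colleague", "neighbor", "sibling", "spouse", "me",
}


class XFNLinkParser(HTMLParser):
    """Collect (href, XFN values) for every anchor that carries XFN values."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # rel holds space-separated tokens; keep only the XFN ones.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        xfn = sorted(token for token in rel_tokens if token in XFN_VALUES)
        if xfn:
            self.links.append((attr_map.get("href"), xfn))


def extract_xfn(html_text):
    parser = XFNLinkParser()
    parser.feed(html_text)
    return parser.links


# Hypothetical sample markup.
SAMPLE = (
    '<p><a href="http://example.com/jane" rel="friend met">Jane</a> and '
    '<a href="http://example.com/" rel="nofollow">someone else</a></p>'
)

if __name__ == "__main__":
    for href, relationships in extract_xfn(SAMPLE):
        print(href, relationships)
```

The point of the sketch is that the markup stays plain HTML: the extra meaning rides on an existing attribute, so a user-agent that knows nothing about XFN still renders the link normally.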

createErrorMsg




msg:559151
 4:26 am on Nov 10, 2005 (gmt 0)

Purple_martin, I agree wholeheartedly with everything you just said. I can't see how allowing authors to define their own tags is anything but a step backwards from standardization. My only concern is this...

so that everyone knows exactly what each tag means

Unfortunately, I think the discussion here indicates that everyone doesn't know what each tag means, since we've got several definitions going, all with perfectly valid reasons behind them, for one of the most basic and ubiquitous tags around...the plain <p>. Is it a unit of typography, or a unit of meaning? bedlam suggested above that the two are the same, but I disagree. The typography of a paragraph is that it has a hard return on either side and (usually) an indent. But the decision of where to start a new paragraph and where not to is almost always based on its role as a unit of meaning. I've finished pursuing a particular thread of thought, so I break to a new paragraph. That a double line break and a 3em text indent happen to accompany it is by-the-way.

rjohara, your post deals directly with this and brings the cryptic information in the Semantic Web stuff right into perspective. When you say that controlled vocabularies will be defined, I take this to mean within various areas of study? Is the intent that such vocabularies be stored in a central location, like the DOCTYPE definitions at the W3C, or will fields of study establish their own local locations for these things? I'm fascinated by the possibilities that this sort of collaborative effort could produce, though I'm also skeptical about its success.

cEM



Note: In an earlier post I wrote that a paragraph is part of...

a hierarchy of meaning that starts with memes and builds up through words, clauses, sentences, paragraphs, and so on to the document whole

The word "meme" was a mistype. Meaning units in language start with morphemes (word parts used to convey meaning, such as prefixes, suffixes, and root words), not memes. A meme is something else entirely. Sorry about that.

lexipixel




msg:559152
 5:23 am on Nov 10, 2005 (gmt 0)

One thing seems to be discussed less and less:

economy of data

...(especially now that we can say more and more, and say it faster and faster).

To send the string:

0123456789<BR>

is 14 bytes of data, and

<p>0123456789</p>

is 17 bytes, and

<div>0123456789</div>

is 21 bytes, and

<div class="serial number">0123456789</div>

is 43 bytes...

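It will take roughly 200% more time to transmit the last markup/data example than the first (43 bytes against 14).

What can you gain in exchange for cutting the effective transmission speed to about 1/3 and a roughly 725% increase in the HTML markup-to-data ratio (from 4 bytes of markup per 10 bytes of data to 33 per 10)?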
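For anyone who wants to check the arithmetic, the byte counts and markup-to-data ratios can be computed directly. This is just a quick sketch: the `byte_stats` helper is a made-up name, and it measures only the raw bytes of each fragment, ignoring anything the network or server adds or removes (packet headers, compression), so it says nothing definite about real transfer times.

```python
# The ten-digit serial number used in the examples above.
SERIAL = "0123456789"

# The four markup variants from the post, in the same order.
EXAMPLES = [
    "0123456789<BR>",
    "<p>0123456789</p>",
    "<div>0123456789</div>",
    '<div class="serial number">0123456789</div>',
]


def byte_stats(markup: str, data: str = SERIAL):
    """Return (total bytes, markup-only bytes, markup-to-data ratio)."""
    total = len(markup.encode("ascii"))
    data_bytes = len(data.encode("ascii"))
    markup_bytes = total - data_bytes
    return total, markup_bytes, markup_bytes / data_bytes


if __name__ == "__main__":
    baseline_total = byte_stats(EXAMPLES[0])[0]
    for example in EXAMPLES:
        total, markup_bytes, ratio = byte_stats(example)
        print(
            f"{total:3d} bytes total, {markup_bytes:2d} of markup, "
            f"ratio {ratio:.1f}, {total / baseline_total:.2f}x the first: {example}"
        )
```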

Most people will probably say, "What's another 20 or 30 bytes?" But multiply that by the number of times bloated markup occurs and we could probably be enjoying a minimum of 100% increase in the global speed of the internet.

Since we are getting real nit-picky ---- I wonder if there would also be a fuel savings if we all coded leaner?

Would a data center's electric bill be lower if less data was transmitted and received?

Surely it must cost something to flicker those modem and router lights off and on...

Purple Martin




msg:559153
 5:36 am on Nov 10, 2005 (gmt 0)

rjohara, the idea you've described is fascinating. Make each controlled vocabulary publicly available, then link to it from each page using that controlled vocabulary. Unfortunately this system is unreliable: if I make a page that uses tags from a controlled vocabulary that is owned by someone else and is not tied down by a W3C standard, then my page is at risk of breaking whenever the owner of the controlled vocabulary changes something or, worse still, moves the vocabulary to a different location. I wouldn't want to link my page to someone else's CSS file, so there's no way I want to link my page to something that provides actual meaning to my page rather than just look'n'feel. It's too risky.

Or did you mean that every XML site should have its own controlled vocabulary, even if it duplicates a commonly used convention? Sounds like a lot of extra work to me, with too much potential for errors.

The great thing about the HTML controlled vocabulary is that it is limited enough for user-agent developers to build it into each user-agent. Firefox/JAWS/Googlebot doesn't have to look up the meaning of an <h1> tag every time it comes across one; it already knows.

------------------------------------

cEM, everyone should know exactly what each HTML tag means, because each tag is defined in the controlled vocabulary that is the W3C standard. The fact that we've all had a long discussion about exactly what a <p> tag is for means one of the following: we have failed to understand the spec or tried to read too much into it, or (less likely) the spec itself is inadequate.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved