Need proof of html 3.x being better for SE's

grnidone

4:08 pm on Apr 17, 2001 (gmt 0)

One of the producers for a client asked me which they should use if they were going to make a web page: HTML 3.x or 4.x.

My answer: "Probably 3.x. Browsers can all read 3.x because 4 builds on it, and search engine spiders are very friendly to old technology."

I think I am right, but I need proof. Anyone have anything off the top of their head?

-G

Xoc

6:18 pm on Apr 17, 2001 (gmt 0)

I would say use XHTML 1.0, but use the feature set of HTML 3.2.

grnidone

7:42 pm on Apr 17, 2001 (gmt 0)

OK... What is the reasoning behind that decision? I am having a difficult time figuring out how to say it so that non-techies will understand.

The closest thing I can come up with is that "spiders like old technology." Simple html text is the best.

But I need a little more than that.

-G

msgraph

8:05 pm on Apr 17, 2001 (gmt 0)

All of my pages/sites are in HTML 3.x. Have been and will be for some time to come. Until the search engines firmly state that they want all submitted sites in a new format or else they will not accept them, I'm staying with straight 3.x.

Spiders like old technology and will for a very long time. Search engines would much rather have their databases sort through simple HTML pages with as few tags as possible. Imagine the server load of having to strip out CSS or whatever tags by the tens of millions. That's why almost all of them clearly state in their "help to get listed" sections that they prefer simple text HTML pages with simple graphics.

Xoc

8:30 pm on Apr 17, 2001 (gmt 0)

I should have said: use XHTML 1.0 Transitional with the feature set of HTML 3.2.

The W3C is the group that publishes the standards for the web, which can all be found on [w3.org...]. Every browser, spider, or other tool written for the web is supposed to follow the W3C specifications. If you produce HTML that does not follow the W3C specifications, it is only coincidence if the tool understands the page.

Various versions of the HTML specification have been published over the years. The one whose features almost every browser and spider understands is HTML 3.2. However, XHTML 1.0 is the current spec on what constitutes a web page according to the W3C. The main difference between XHTML 1.0 and previous versions of HTML is that the rules have been tightened up. So if you use the features from HTML 3.2, but code them using the tightened rules of XHTML 1.0, virtually everything that wants to understand your page will be able to: old and new browsers and spiders alike. Furthermore, it means that you don't have to recode your pages in the future.

grnidone

2:52 pm on Apr 18, 2001 (gmt 0)

Thank you Thank you Thank you, both of you. This is exactly what I needed.

-G

Xoc

8:39 pm on Apr 18, 2001 (gmt 0)

This is a blurb off the new tips section of my web site.

Differences between what you are currently doing in HTML and what is necessary for XHTML 1.0. Excerpted from the XHTML 1.0 spec [w3.org], which you can reference for more details and examples. We strongly recommend encoding web pages using the XHTML 1.0 spec, but using the feature set of tags and attributes from the HTML 3.2 spec.

Documents must be well-formed
Element and attribute names must be in lower case
For non-empty elements, end tags are required
Attribute values must always be quoted
Attribute Minimization is not supported
Empty elements must either have an end tag or the start tag must end with />
Whitespace handling in attribute values is different
Enclose the contents of script and style elements in CDATA sections
Certain elements cannot be enclosed in other elements
Use the id attribute to identify fragments, not name

A couple of additional points:

Place a space before the closing /> of an empty element, as in <br />
Don't use the abbreviated form for empty elements where you don't have to. In other words, use <meta></meta> rather than <meta />. The two main ones where you must use the abbreviated form are <br /> and <hr />. Some of the search engines don't seem to process empty elements correctly.
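
As a rough illustration of the first few rules in the list above, here is a minimal Python sketch that checks whether a fragment is well-formed and whether its element and attribute names are lower case. It only uses the standard library XML parser, so it tests well-formedness, not validity against the XHTML DTD. The function names and the sample fragments are made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def uppercase_names(markup: str) -> list:
    """List element and attribute names that are not entirely lower case.

    Assumes the markup is already well-formed.
    """
    root = ET.fromstring(markup)
    names = []
    for element in root.iter():
        names.append(element.tag)
        names.extend(element.attrib)
    return [name for name in names if name != name.lower()]


# Old HTML habits: unquoted attribute value, unclosed <P> and <BR>.
html_style = '<BODY BGCOLOR=white><P>First line<BR>Second line</BODY>'

# Same HTML 3.2 features, written to the XHTML 1.0 rules.
xhtml_style = '<body bgcolor="white"><p>First line<br />Second line</p></body>'

print(is_well_formed(html_style))    # False
print(is_well_formed(xhtml_style))   # True
print(uppercase_names(xhtml_style))  # []
```

A fragment can be well-formed and still break the lower-case rule, which is why the two checks are kept separate.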

BoneHeadicus

9:19 pm on Apr 18, 2001 (gmt 0)

Is there a link to ALL of the HTML Doctypes somewhere?

Xoc

10:21 pm on Apr 19, 2001 (gmt 0)

These are the ones that I found in the HTML recommendations on the W3C web site. The list, of course, doesn't include any from preliminary documents, nor any mutations from improper coding. If it is really important to you, you might convince Brett to unleash his spider to go get a bunch of them.

<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"DTD/xhtml1-strict.dtd">

<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"DTD/xhtml1-transitional.dtd">

<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
"DTD/xhtml1-frameset.dtd">

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
"http://www.w3.org/TR/html4/frameset.dtd">

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"
"http://www.w3.org/TR/REC-html40/strict.dtd">

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN"
"http://www.w3.org/TR/REC-html40/frameset.dtd">

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">

<!DOCTYPE HTML PUBLIC "ISO/IEC 15445:2000//DTD HyperText Markup Language//EN">

<!DOCTYPE HTML PUBLIC "ISO/IEC 15445:2000//DTD HTML//EN">

BoneHeadicus

10:29 pm on Apr 19, 2001 (gmt 0)

So if I use this one:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">

Or this one:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">

on everything I would be safe?

Xoc

11:11 pm on Apr 19, 2001 (gmt 0)

You should use the DOCTYPE that matches the HTML tags and attributes that you are using. When you specify the DOCTYPE, you are actually identifying a particular DTD (Document Type Definition) file on the web (for example, [w3.org]) against which the rest of the HTML file is compared. Any tags or attributes that are defined in the DTD are supposed to be interpreted. Any tags or attributes that are not in the DTD are supposed to be ignored, but the text inside the tags rendered. So if you use the DTD for HTML 2.0, <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">, then you shouldn't be able to use any of the features from HTML 3.2, 4.0, 4.01, or XHTML 1.0 that weren't supported in HTML 2.0.

In practice, a web browser probably renders markup that isn't in the DTD using its current rules. But that's not what it's supposed to do. You should conform to the HTML spec that your DOCTYPE names. If you can't make your document validate, don't include a DOCTYPE declaration at all. Then the web browser will guess which HTML you are using.

I recommend using this DTD for the moment:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">

But that requires using XHTML 1.0.
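
To make the earlier point concrete, that a DOCTYPE names a particular DTD, here is a small Python sketch that pulls the public identifier and the optional system identifier (the DTD location) out of a declaration. The regular expression only handles the PUBLIC form used by the DOCTYPEs listed in this thread, and the names Doctype and parse_doctype are invented for the example.

```python
import re
from typing import NamedTuple, Optional


class Doctype(NamedTuple):
    root: str                 # root element name, e.g. "html"
    public_id: str            # formal public identifier
    system_id: Optional[str]  # DTD location, if the declaration gives one


# Matches <!DOCTYPE root PUBLIC "public id" "optional system id">,
# including declarations that are split across several lines.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<root>\S+)\s+PUBLIC\s+"(?P<public>[^"]*)"'
    r'(?:\s+"(?P<system>[^"]*)")?\s*>',
    re.IGNORECASE,
)


def parse_doctype(document: str) -> Optional[Doctype]:
    """Return the first PUBLIC DOCTYPE found in the document, or None."""
    match = DOCTYPE_RE.search(document)
    if match is None:
        return None
    return Doctype(match.group("root"), match.group("public"), match.group("system"))


page = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"DTD/xhtml1-transitional.dtd">\n<html><head><title>t</title></head></html>'
)
print(parse_doctype(page))
print(parse_doctype('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">'))
print(parse_doctype('<html></html>'))  # None: no DOCTYPE, so a browser has to guess
```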

BoneHeadicus

11:14 pm on Apr 19, 2001 (gmt 0)

Where do I validate for this, Xoc? Or, to put it a better way: is there a good validation program available?

mivox

11:18 pm on Apr 19, 2001 (gmt 0)

Wouldn't the W3C's HTML validator check the validity of your DOCTYPE declarations?

Brett_Tabke

11:33 pm on Apr 19, 2001 (gmt 0)

The thing to remember from all of this is that it is fairly easy to build an indexer to strip HTML 3.2 down into its key entities. Although it isn't that much more difficult with 4.0, there are some constructs that have led me to question whether search engines discount pages for using them.

In the same vein, JavaScript, VBScript, or other client-side scripting can be difficult to parse. The trouble isn't with the standards; it is with the liberties browsers and editors have taken. I've seen no fewer than 12 different formats for embedding JS/VB in code. Off-the-wall stuff you'd never expect to run. Search engine indexers can stumble on that stuff and throw out or degrade the whole page. It becomes much worse when you think in terms of site templates, where you may replicate the code throughout the site. There may be a problem with it and you never even know it until 4 months down the road.

I think the biggest single thing you could impress upon anyone is to run their code through a quality (read: W3C) validator. They may want to take some liberties by leaving out alt attributes or using questionable constructs, but if you can get it close to W3C-acceptable, the search engines should be OK with it.
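
For a sense of what "stripping a page down" involves, here is a minimal Python sketch of a text extractor built on the standard library HTML parser. It is not how any particular search engine works; it is only an assumption about the general shape of such an indexer. It keeps visible text and alt text and drops the contents of script and style elements. The class and function names are made up for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the words a simple indexer might keep from a page."""

    SKIPPED = {"script", "style"}  # element contents that are not indexed

    def __init__(self):
        super().__init__()
        self.words = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        # Keep alt text, since it describes content a spider cannot see.
        for name, value in attrs:
            if name == "alt" and value:
                self.words.extend(value.split())

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.words.extend(data.split())


def index_text(page: str) -> str:
    """Return the indexable text of a page as one space-separated string."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return " ".join(parser.words)


page = (
    '<html><head><title>Widgets</title>'
    '<script language="JavaScript">document.write("hi");</script>'
    '</head><body bgcolor="white"><h1>Blue widgets</h1>'
    '<p>Cheap <img src="w.gif" alt="widget photo"> for sale</p></body></html>'
)
print(index_text(page))  # Widgets Blue widgets Cheap widget photo for sale
```

The simple page goes through cleanly. The unusual script-embedding formats mentioned above are exactly where a parser like this would need extra special cases.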

BoneHeadicus

11:52 pm on Apr 19, 2001 (gmt 0)

I use the W3C validator, but I guess what I'm after here is whether I can SELECT the doctype I want to validate against. You can't do this with the W3C validator as far as I know.

Everyone I've heard talk about the coming XML thing is saying you MUST use the DOCTYPE declaration. I surely don't want to go back and redo stuff.

Xoc

12:39 am on Apr 20, 2001 (gmt 0)

I found a nice document that describes every DOCTYPE that the W3C validator knows about. It can be found at [validator.w3.org...]. However, it doesn't include the XHTML DOCTYPEs.

Brett_Tabke

2:14 am on Apr 20, 2001 (gmt 0)

It parses the XML doc type out of the document, then points the validator at the catalog file in one of these sub directories:


cr-mat~1 <DIR> 04-01-01 9:55a CR-MathML2-20001113 
iso-html <DIR> 04-01-01 9:55a ISO-HTML 
pr-htm~1 <DIR> 04-01-01 9:55a PR-html40-19990824 
pr-xht~1 <DIR> 04-01-01 9:55a PR-xhtml1-19990824 
pr-xht~2 <DIR> 04-01-01 9:55a PR-xhtml1-19991210 
rec-ht~1 <DIR> 04-01-01 9:55a REC-html40-19980424 
rec-ht~2 <DIR> 04-01-01 9:55a REC-html40-971218 
rec-ht~3 <DIR> 04-01-01 9:55a REC-html401-19991224 
rec-xh~1 <DIR> 04-01-01 9:55a REC-xhtml1-20000126 
wd-htm~1 <DIR> 04-01-01 9:55a WD-html-in-xml-19981205 
wd-htm~2 <DIR> 04-01-01 9:55a WD-html-in-xml-19990224 
wd-htm~3 <DIR> 04-01-01 9:56a WD-html-in-xml-19990304 
wd-htm~4 <DIR> 04-01-01 9:56a WD-html40-970708 
wd-htm~5 <DIR> 04-01-01 9:56a WD-html40-970917 
wd-xht~1 <DIR> 04-01-01 9:56a WD-xhtml1-19991124 
cougar  <DIR> 04-01-01 9:56a cougar 
mod  <DIR> 04-01-01 9:56a mod 
old  <DIR> 04-01-01 9:56a old 
pro  <DIR> 04-01-01 9:56a pro 
sp-1 3 <DIR> 04-01-01 9:56a sp-1.3 
spyglass <DIR> 04-01-01 9:56a spyglass
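
The lookup described above can be sketched roughly as follows in Python. The table pairing public identifiers with directory names is a guess based only on the directory names in the listing, not the validator's actual catalog, and catalog_dir_for is a made-up helper name.

```python
import re
from typing import Optional

# Hypothetical mapping from public-identifier prefixes to the directories
# listed above. Order matters: "HTML 4.01" must be tried before "HTML 4.0",
# because the second is a prefix of the first.
CATALOG_DIRS = [
    ("-//W3C//DTD XHTML 1.0", "REC-xhtml1-20000126"),
    ("-//W3C//DTD HTML 4.01", "REC-html401-19991224"),
    ("-//W3C//DTD HTML 4.0", "REC-html40-19980424"),
    ("ISO/IEC 15445:2000", "ISO-HTML"),
]

PUBLIC_ID_RE = re.compile(r'<!DOCTYPE\s+\S+\s+PUBLIC\s+"([^"]*)"', re.IGNORECASE)


def catalog_dir_for(document: str) -> Optional[str]:
    """Pick a catalog directory from the document's DOCTYPE, or None."""
    match = PUBLIC_ID_RE.search(document)
    if match is None:
        return None  # no DOCTYPE found, nothing to look up
    public_id = match.group(1)
    for prefix, directory in CATALOG_DIRS:
        if public_id.startswith(prefix):
            return directory
    return None  # DOCTYPE present but not in this illustrative table


doc = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n'
       '"http://www.w3.org/TR/html4/loose.dtd">')
print(catalog_dir_for(doc))  # REC-html401-19991224
```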

bill

5:16 am on Apr 20, 2001 (gmt 0)

How difficult is it to write and maintain your own DTD? Assuming you wanted to use a W3C-based DTD and just add a few attributes that might not be standard, are there major drawbacks to this?

Xoc

8:12 pm on Apr 20, 2001 (gmt 0)

Well, I was going to say no, then I changed my mind to yes, then I changed it back to no again! I think it's okay. You can copy one of the W3C DTDs and modify it for your needs. I did this in one case because I wanted to validate the entire web site, but there was one automatically generated file whose HTML I didn't have complete control over, and it was a mess. So I created a DTD and stuck the DOCTYPE at the beginning of the HTML file. That worked, more or less. The WDG validating spider ([htmlhelp.org]) takes forever to run if even one file on the site doesn't validate.

My one caveat is that I can imagine a possibility that a browser might do something unexpected if it sees an unknown DOCTYPE at the beginning.

bill

1:52 am on Apr 23, 2001 (gmt 0)

Correct me if I'm wrong, but...

"My one caveat is that I can imagine a possibility that a browser might do something unexpected if it sees an unknown DOCTYPE at the beginning."

Wouldn't you just use the same DOCTYPE declaration and point the browser to a different DTD file maintained on your own server?

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.mydomain.com/loose.dtd">

Xoc

4:59 pm on Apr 23, 2001 (gmt 0)

Probably would work.