My answer: "Probably 3.x. Browsers can all read 3.x because 4 builds on it, and search engine spiders are very friendly to old technology."
I think I am right, but I need proof. Anyone have anything off the top of their head?
-G
The closest thing I can come up with is that "spiders like old technology." Simple html text is the best.
But I need a little more than that.
-G
Spiders like old technology, and will for a very long time. Search engines would much rather have their databases sort through simple HTML pages with as few tags as possible. Imagine the server load of having to strip out CSS or other tags from tens of millions of pages. That's why almost all of them clearly state in their "how to get listed" help sections that they prefer simple text HTML pages with simple graphics.
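To give a feel for the tag stripping described above, here is a minimal Python sketch of pulling the indexable text out of a page. It is only an illustration: the names `TextExtractor` and `extract_text` are made up for this post, and nothing here is claimed to be how any real search engine works.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased from HTMLParser.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Hypothetical helper: return the page text as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The more markup, script, and style a page carries, the more of this work has to be done before any text gets indexed, which is the load argument in a nutshell.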
The W3C is the group that publishes the standards for the web, which can all be found on [w3.org...] . All browsers, spiders, and other tools written for the web are supposed to follow the W3C specifications. If you produce HTML that does not follow them, it is only a coincidence if a tool understands the page.
Various versions of the HTML specification have been published over the years. HTML 3.2 is the one whose features almost every browser and spider understands. However, XHTML 1.0 is the current spec on what constitutes a web page according to the W3C. The main difference between XHTML 1.0 and previous versions of HTML is that the rules have been tightened up. So if you use the features from HTML 3.2, but code them using the tightened rules of XHTML 1.0, virtually everything that wants to understand your page can, both old and new browsers and spiders. It should also mean that you don't have to recode your pages in the future.
-G
Differences between what you are currently doing in HTML and what is necessary for XHTML 1.0. Excerpted from the XHTML 1.0 spec [w3.org], which you can reference for more details and examples. We strongly recommend encoding web pages using the XHTML 1.0 spec, but using the feature set of tags and attributes from the HTML 3.2 spec.
Documents must be well-formed
Element and attribute names must be in lower case
For non-empty elements, end tags are required
Attribute values must always be quoted
Attribute Minimization is not supported
Empty elements must either have an end tag or the start tag must end with />
Whitespace handling in attribute values is different
Enclose script and style elements in CDATA sections
Certain elements cannot be enclosed in other elements
Use the id attribute to identify fragments, not name
A couple of additional points:
Place a space before the closing /> of an empty element, as in <br />
Don't use the abbreviated form for empty elements where you don't have to. In other words, use <meta></meta> rather than <meta />. The two main elements where you must use the abbreviated form are <br /> and <hr />. Some of the search engines don't seem to process empty elements correctly.
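As a rough illustration of the first few rules above, here is a small Python sketch that holds the same page in two styles and checks each for XML well-formedness. The sample pages and the `is_well_formed` helper are made up for this example. The DOCTYPE and the xmlns attribute are left out to keep the check short. Note that a well-formedness check alone will not catch upper-case names or validate against a DTD.

```python
import xml.etree.ElementTree as ET

# HTML 3.2 features written with XHTML 1.0 rules: lower-case names,
# quoted attribute values, no minimized attributes, every element closed.
XHTML_STYLE = """<html>
<head><title>Blue widgets</title></head>
<body bgcolor="#ffffff">
<h1 align="center">Blue widgets</h1>
<p>First paragraph.<br />Second line.</p>
<hr />
<ul compact="compact"><li>One</li><li>Two</li></ul>
</body>
</html>"""

# The same page in the older, looser style: unquoted values, a minimized
# attribute, and unclosed <P>, <BR>, <HR> and <LI> elements.
LOOSE_STYLE = """<HTML>
<HEAD><TITLE>Blue widgets</TITLE></HEAD>
<BODY BGCOLOR=#ffffff>
<H1 ALIGN=center>Blue widgets</H1>
<P>First paragraph.<BR>Second line.
<HR>
<UL COMPACT><LI>One<LI>Two</UL>
</BODY>
</HTML>"""


def is_well_formed(markup):
    """Hypothetical helper: True if the markup parses as XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True
```

Calling `is_well_formed(XHTML_STYLE)` succeeds, while `is_well_formed(LOOSE_STYLE)` fails on the first unquoted attribute value.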
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"DTD/xhtml1-strict.dtd">
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"DTD/xhtml1-transitional.dtd">
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
"DTD/xhtml1-frameset.dtd">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
"http://www.w3.org/TR/html4/frameset.dtd">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"
"http://www.w3.org/TR/REC-html40/strict.dtd">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN"
"http://www.w3.org/TR/REC-html40/frameset.dtd">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<!DOCTYPE HTML PUBLIC "ISO/IEC 15445:2000//DTD HyperText Markup Language//EN">
<!DOCTYPE HTML PUBLIC "ISO/IEC 15445:2000//DTD HTML//EN">
When a web browser meets markup that isn't in the DTD, it probably just renders it using its current rules. But that's not what it's supposed to do. You should conform to whatever the spec for your chosen version of HTML says. If you can't make your document validate, don't put in any DOCTYPE declaration. Then the web browser will guess which HTML you are using.
I recommend using this DTD for the moment:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">
But that requires using XHTML 1.0.
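If you want to see which of the declarations above a page actually carries, something along these lines would do. This is a minimal Python sketch. The table holds only public identifiers copied from the list in this thread, the short labels are my own, and `identify_doctype` is a made-up helper name. It reads the declaration only and does not validate anything.

```python
import re

# Public identifiers copied from the DOCTYPE declarations listed in this thread.
# The labels on the right are informal descriptions, not official names.
KNOWN_PUBLIC_IDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": "XHTML 1.0 Strict",
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "XHTML 1.0 Transitional",
    "-//W3C//DTD XHTML 1.0 Frameset//EN": "XHTML 1.0 Frameset",
    "-//W3C//DTD HTML 4.01//EN": "HTML 4.01 Strict",
    "-//W3C//DTD HTML 4.01 Transitional//EN": "HTML 4.01 Transitional",
    "-//W3C//DTD HTML 4.01 Frameset//EN": "HTML 4.01 Frameset",
    "-//W3C//DTD HTML 4.0//EN": "HTML 4.0 Strict",
    "-//W3C//DTD HTML 4.0 Transitional//EN": "HTML 4.0 Transitional",
    "-//W3C//DTD HTML 4.0 Frameset//EN": "HTML 4.0 Frameset",
    "-//W3C//DTD HTML 3.2 Final//EN": "HTML 3.2",
    "-//IETF//DTD HTML//EN": "IETF HTML",
    "ISO/IEC 15445:2000//DTD HyperText Markup Language//EN": "ISO HTML",
    "ISO/IEC 15445:2000//DTD HTML//EN": "ISO HTML",
}

# Matches <!DOCTYPE html PUBLIC "..."> in any letter case, across line breaks.
DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]+)"', re.IGNORECASE)


def identify_doctype(page):
    """Return a label for the page's DOCTYPE, "unknown", or None if absent."""
    match = DOCTYPE_RE.search(page)
    if match is None:
        return None  # no DOCTYPE: the browser is left to guess
    return KNOWN_PUBLIC_IDS.get(match.group(1), "unknown")
```

A `None` result corresponds to the "no DOCTYPE, let the browser guess" case described above.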
In the same vein, JavaScript, VBScript, or other client-side scripting can be difficult to parse. The trouble isn't with the standards, it is with the liberties browsers and editors have taken. I've seen no fewer than 12 different formats for embedding JS/VB in code. Off-the-wall stuff you'd never expect to run. Search engine indexers can stumble on that stuff and throw out or degrade the whole page. It becomes much worse when you think in terms of site templates, where the same code may be replicated throughout the site. There may be a problem with it and you never even know it until 4 months down the road.
I think the single biggest thing you could impress upon anyone is to run their code through a quality (read: W3C) validator. They may want to take some liberties, such as leaving out alt attributes or using questionable constructs, but if you can get it close to W3C-acceptable, the search engines should be OK with it.
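For a quick local pre-check before going to the validator, here is a minimal Python sketch that lists images with no alt attribute. It is no substitute for a real validator, and `AltChecker` and `images_missing_alt` are names invented for this example.

```python
from html.parser import HTMLParser


class AltChecker(HTMLParser):
    """Records the src of every <img> that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names, and its default
        # handling of <img ... /> also routes through this method.
        if tag == "img":
            attributes = dict(attrs)
            if "alt" not in attributes:
                self.missing.append(attributes.get("src", "(no src)"))


def images_missing_alt(html):
    """Hypothetical helper: return the src values of images lacking alt."""
    checker = AltChecker()
    checker.feed(html)
    checker.close()
    return checker.missing
```

Run over a site template, a check like this would flag a problem once instead of leaving it replicated across every page.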
cr-mat~1 <DIR> 04-01-01 9:55a CR-MathML2-20001113
iso-html <DIR> 04-01-01 9:55a ISO-HTML
pr-htm~1 <DIR> 04-01-01 9:55a PR-html40-19990824
pr-xht~1 <DIR> 04-01-01 9:55a PR-xhtml1-19990824
pr-xht~2 <DIR> 04-01-01 9:55a PR-xhtml1-19991210
rec-ht~1 <DIR> 04-01-01 9:55a REC-html40-19980424
rec-ht~2 <DIR> 04-01-01 9:55a REC-html40-971218
rec-ht~3 <DIR> 04-01-01 9:55a REC-html401-19991224
rec-xh~1 <DIR> 04-01-01 9:55a REC-xhtml1-20000126
wd-htm~1 <DIR> 04-01-01 9:55a WD-html-in-xml-19981205
wd-htm~2 <DIR> 04-01-01 9:55a WD-html-in-xml-19990224
wd-htm~3 <DIR> 04-01-01 9:56a WD-html-in-xml-19990304
wd-htm~4 <DIR> 04-01-01 9:56a WD-html40-970708
wd-htm~5 <DIR> 04-01-01 9:56a WD-html40-970917
wd-xht~1 <DIR> 04-01-01 9:56a WD-xhtml1-19991124
cougar <DIR> 04-01-01 9:56a cougar
mod <DIR> 04-01-01 9:56a mod
old <DIR> 04-01-01 9:56a old
pro <DIR> 04-01-01 9:56a pro
sp-1 3 <DIR> 04-01-01 9:56a sp-1.3
spyglass <DIR> 04-01-01 9:56a spyglass
My one caveat is that I can imagine a possibility that a browser might do something unexpected if it sees an unknown DOCTYPE at the beginning.
Wouldn't you just use the same DOCTYPE declaration and just point the browser to a different DTD file maintained on your own server?
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.mydomain.com/loose.dtd">