Do you mean "filtered" HTML, which purportedly removes Word-specific tags?
Well, it doesn't. Here is an example:
<p class=MsoNormal>The quick <b>brown</b> <i>fox</i> <u>jumps</u> over the lazy
dog…</p>
<ol style='margin-top:0in' start=1 type=1>
<li class=MsoNormal>once</li>
<li class=MsoNormal>twice </li>
<li class=MsoNormal>thrice</li>
</ol>
<p class=MsoNormal>&nbsp;</p>
<ol style='margin-top:0in' start=3 type=1>
<ul style='margin-top:0in' type=disc>
<li class=MsoNormal>amazing what we can do here</li>
</ul>
</ol>
Problems to be seen in the above:
1. The class MsoNormal is declared multiple times, without quotes, and quite likely isn't in your style sheet.
2. The unordered list is not properly nested within the ordered list.
3. Unwanted inline styling.
4. Empty <p> tags - OK, not empty, but with a spurious &nbsp; in there.
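Problems 1, 3 and 4 are mechanical enough to script. Here is a rough sketch in Python (not the PHP I shared earlier); the function name strip_word_attributes is made up for illustration, and a regex pass like this is only an approximation, not a real HTML parser. It does nothing about the list nesting in problem 2, which still needs fixing by hand.

```python
import re

def strip_word_attributes(html: str) -> str:
    """Remove Word's Mso* classes, inline styles and empty paragraphs (regex sketch)."""
    # Drop class=MsoNormal (or any other Mso* class), quoted or unquoted.
    html = re.sub(r'\s+class=(["\']?)Mso\w+\1', '', html)
    # Drop inline style attributes, single- or double-quoted.
    html = re.sub(r'\s+style=(["\']).*?\1', '', html)
    # Drop paragraphs that contain only whitespace or &nbsp; entities.
    html = re.sub(r'<p\b[^>]*>(?:\s|&nbsp;)*</p>\s*', '', html)
    return html

sample = (
    "<p class=MsoNormal>Hi</p>\n"
    "<ol style='margin-top:0in' start=1 type=1>\n"
    "<li class=MsoNormal>once</li>\n"
    "</ol>\n"
    "<p class=MsoNormal>&nbsp;</p>\n"
)
print(strip_word_attributes(sample))
```

The start and type attributes are left alone on purpose, since they may carry real meaning for the list.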
Just what is this MsoNormal class that we can't seem to get away from? Here are its declarations:
margin:0in;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman";
Save it as plain text? Now you lose all formatting. This:
<ol>
<li>A list item</li>
</ol>
becomes this:
1. A list item
and this:
<ul>
<li>A list item</li>
</ul>
becomes this:
* A list item
But if you cut and paste from Word into Notepad, that UL list item brings a disc with it:
•A list item
None of the above is correct markup. If one believes that correct semantic markup influences rankings (I do), then Word should be avoided at all costs.
It is absolutely crazy-making. I am associated with a site that has several content contributors, most of whom simply will not move off of Word for their content creation, no matter how easy I make it for them.
They cut and paste, then email me because the page is broken. I end up having to go in and remove all the span, font and spurious CSS garbage, then redo lists, headings and so on with proper markup.
But Word's ubiquity and "ease of use" prevail no matter how I approach the problem...
Hence that bit of PHP code I shared, since it automates 90% of the cleanup.
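For anyone who wants to see the span and font removal step in isolation, here is a minimal sketch in Python. To be clear, this is not the PHP code mentioned above, just an illustration of the same idea; unwrap_tags and its tags parameter are hypothetical names. It deletes the tags themselves and keeps whatever text they wrapped.

```python
import re

def unwrap_tags(html: str, tags=("span", "font")) -> str:
    """Remove the named tags but keep the content they wrap (regex sketch)."""
    names = "|".join(re.escape(t) for t in tags)
    # Matches opening tags (with any attributes) and closing tags alike.
    pattern = re.compile(r'</?(?:%s)\b[^>]*>' % names, re.IGNORECASE)
    return pattern.sub('', html)

dirty = '<p><span style="x"><font face="Arial">Hello</font></span> world</p>'
print(unwrap_tags(dirty))
```

A regex is good enough for the flat junk Word produces, but it will trip on a ">" inside an attribute value, so treat it as a first pass and eyeball the result.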