normal false none style definitions table msonormaltable mso name tstyle rowband size colband noshow priority qformat parent padding alt 0in 4pt para margin top right bottom 0pt left line height 115 pagination widow orphan font family calibri sans serif ascii theme minor latin fareast times new roman hansi microsoftinternetexplorer4
How do I prevent Google from viewing these as keywords? Our competitor's site only shows 146.
Those are all proprietary "codes" generated by (I'm presuming) a Microsoft Office application such as Word. If you view the source of your page, you will see it is clogged with anywhere from 50% to 75% "code" compared to content. The search engines do not understand these "codes" because they fall outside the normal set of elements for HTML or even XHTML, so they index them as page content.
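A rough way to check that 50% to 75% figure on your own page is sketched below. The helper name `markupRatio` is made up for illustration, and the result is only an estimate: anything `strip_tags()` leaves behind counts as content, whether or not a visitor ever sees it.

```php
<?php
// Hypothetical helper (not a built-in): estimates what share of a page's
// bytes is markup rather than text.
function markupRatio(string $html): float
{
    $total = strlen($html);
    if ($total === 0) {
        return 0.0;
    }
    // strip_tags() removes tags and HTML comments; whitespace is collapsed
    // so that indentation does not count as content.
    $text = trim(preg_replace('/\s+/', ' ', strip_tags($html)));
    return 1 - strlen($text) / $total;
}

// Usage: a single Word-style paragraph
$sample = '<p class=MsoNormal style="margin:0in">Hello</p>';
printf("%.0f%% markup\n", markupRatio($sample) * 100);
```

Running it on the saved Word file and again on a hand-cleaned copy should make the difference obvious.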
The solution? You probably won't like it . . .
Don't use MS Office to create your pages. Even Dreamweaver will create a "cleaner" version of your site with no apparent "visual" difference. This will expose the real keywords and content of your site to the search engines.
Sometimes it is unavoidable. I recently had to deal with this very issue and came up with a PHP solution.
[webmasterworld.com...]
[webmasterworld.com...]
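function cleanUpHTML($text)
{
    // Discard unwanted tags, keeping only basic text and list markup
    $text = strip_tags($text, '<p><b><i><ol><ul><li>');
    // Strip header stuff: drop everything before the first <p (case-insensitive)
    $start = stristr($text, '<p');
    if ($start !== false) {
        $text = $start;
    }
    // Strip all attributes (Word garbage); \w+ keeps the whole tag name
    $text = preg_replace('/<(\w+)[^>]*?>/s', '<$1>', $text);
    // Get rid of useless non-breaking spaces
    $text = preg_replace('/&nbsp;/i', '', $text);
    // Get rid of empty p's
    $text = preg_replace('/<p>\s*<\/p>/i', '', $text);
    // Convert from UTF-8 to EUCJP-WIN (specific to my needs; adjust or drop)
    $text = mb_convert_encoding($text, "EUCJP-WIN", "UTF-8");
    return $text;
}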
The above function was specific to my needs and will probably need tweaking for other applications, but it is working admirably for its intended task.
or even save as SIMPLE HTML...
Do you mean "filtered" HTML, which purportedly removes Word-specific tags?
Well, it doesn't. Here is an example:
<p class=MsoNormal>The quick <b>brown</b> <i>fox</i> <u>jumps</u> over the lazy
dog…</p>
<ol style='margin-top:0in' start=1 type=1>
<li class=MsoNormal>once</li>
<li class=MsoNormal>twice </li>
<li class=MsoNormal>thrice</li>
</ol>
<p class=MsoNormal>&nbsp;</p>
<ol style='margin-top:0in' start=3 type=1>
<ul style='margin-top:0in' type=disc>
<li class=MsoNormal>amazing what we can do here</li>
</ul>
</ol>
Problems to be seen in the above:
1. Declaration of the class MsoNormal (multiple times), sans quotes, and quite likely not in your style sheet.
2. Failure to properly nest the unordered list within the ordered list.
3. Unwanted in-line styling
4. Empty <p> tags - OK, not empty, but with a spurious &nbsp; in there
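The four problems listed above could be flagged automatically before a page goes live. What follows is only a sketch: `findWordResidue` and its labels are invented for this example, and the regular expressions are rough heuristics rather than a real HTML parser.

```php
<?php
// Hypothetical checker: reports which Word-style problems appear in a
// fragment. Each pattern is a heuristic and can be fooled by odd markup.
function findWordResidue(string $html): array
{
    $checks = [
        // class=MsoNormal and friends, with or without quotes
        'Mso class'        => '/\bclass\s*=\s*["\']?Mso\w*/i',
        // any style attribute inside an opening tag
        'inline style'     => '/<\w+[^>]*\sstyle\s*=/i',
        // paragraphs holding nothing but whitespace or &nbsp;
        'empty paragraph'  => '/<p\b[^>]*>(?:\s|&nbsp;)*<\/p>/i',
        // a list opened directly inside another list, with no <li> between
        'list inside list' => '/<(?:ol|ul)\b[^>]*>\s*<(?:ol|ul)\b/i',
    ];
    $found = [];
    foreach ($checks as $label => $pattern) {
        if (preg_match($pattern, $html) === 1) {
            $found[] = $label;
        }
    }
    return $found;
}

// Usage: an empty result means none of the patterns matched
print_r(findWordResidue('<p class=MsoNormal>&nbsp;</p>'));
```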
Just what is that class MsoNormal that we can't seem to be able to get away from?
margin:0in;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman";
Save it as plain text? Now you lose all formatting. This:
<ol>
<li>A list item</li>
</ol>
becomes this:
1. A list item
and this:
<ul>
<li>A list item</li>
</ul>
becomes this:
* A list item
But if you cut and paste from Word into Notepad, that UL list item brings a disc with it:
•A list item
None of the above is correct markup. If one believes that correct semantic markup influences rankings (I do), then Word should be avoided at all costs.
It is absolutely crazy-making. I am associated with a site that has several content contributors, most of whom simply will not move off Word for their content creation, no matter how easy I make it for them.
They cut and paste, then email me because the page is broken. I end up having to go in and remove all the span, font and spurious CSS garbage, then reformat lists, headings, etc. with proper markup.
But Word's ubiquity and "ease of use" prevail no matter how I approach the problem...
Thus that bit of PHP code I shared, since it automates 90% of the cleanup.
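For the plain-text case shown earlier (`1. A list item`, `* A list item`, `•A list item`), the list markup can be rebuilt in a similar spirit. This is a minimal sketch under a few assumptions: `textToMarkup` is a made-up name, the input is UTF-8 with one item per line, and nested lists and numbering restarts are ignored.

```php
<?php
// Hypothetical converter: rebuilds list markup from plain-text lines.
// Lines starting with a bullet (U+2022) or "*" become <ul> items, lines
// like "1. text" become <ol> items, anything else becomes a paragraph.
function textToMarkup(string $text): string
{
    $out = [];
    $open = null; // currently open list tag: 'ul', 'ol' or null
    foreach (preg_split('/\R/u', trim($text)) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        if (preg_match('/^(?:\x{2022}|\*)\s*(.+)$/u', $line, $m)) {
            $type = 'ul';
        } elseif (preg_match('/^\d+\.\s+(.+)$/u', $line, $m)) {
            $type = 'ol';
        } else {
            $type = null;
        }
        // Close the previous list and open a new one when the type changes
        if ($open !== $type) {
            if ($open !== null) {
                $out[] = "</$open>";
            }
            if ($type !== null) {
                $out[] = "<$type>";
            }
            $open = $type;
        }
        $body = htmlspecialchars($type === null ? $line : $m[1]);
        $out[] = $type === null ? "<p>$body</p>" : "<li>$body</li>";
    }
    if ($open !== null) {
        $out[] = "</$open>";
    }
    return implode("\n", $out);
}

// Usage: a bullet pasted from Word via Notepad
echo textToMarkup("\u{2022}A list item"), "\n";
```

Since contributors can paste almost anything, the result would still need a human look before publishing.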
[edited by: brotherhood_of_LAN at 10:30 am (utc) on Mar. 16, 2009]
[edit reason] No personal URLs as per the ToS, use generics. Thanks. [/edit]