Dreamweaver is no better than Notepad++ or Bluefish for cleaning it up.
Wouldn't a global search and replace clean it up just fine, either by the page or throughout the site? The main problem would be how many garbage cleanups will be necessary. You can only search and replace one problem at a time and there could be a lot of problems.
These were archived emails generated by Outlook that needed to be displayed on a webpage. The finished script looks like this:
//discard unwanted tags
$text = strip_tags($text, '<p><b><i><ol><ul><li>');
//strip header stuff
$text = stristr($text, '<P');
//strip all attributes (Word garbage)
$text = preg_replace("/<(\w+)[^>]*?>/s", "<$1>", $text);
//get rid of useless non breaking spaces
$text = preg_replace("/&nbsp;/", "", $text);
//get rid of empty p's
$text = preg_replace("/<p><\/p>/i", "", $text);
//convert the encoding from UTF-8 to EUCJP-WIN
$text = mb_convert_encoding($text, "EUCJP-WIN", "UTF-8");
So far it is working like a champ, although some special MS characters (em-dash, curly quotes, etc.) are stripped completely.
Be advised, this script takes it all down to the most basic of markup, and no attributes are left untouched!
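On the em-dashes and curly quotes that get stripped: one possible workaround is to swap them for plain ASCII before the encoding conversion. This is only a sketch. It assumes PHP 7 or later and that $text is UTF-8 at that point (which the mb_convert_encoding call suggests). The function name normalize_ms_punctuation and the choice of replacements are made up for illustration and are not part of the script above.

```php
<?php
// Hypothetical helper: replaces common Word/Outlook "smart" punctuation
// with plain ASCII so it survives a later encoding conversion.
// Assumes the input string is UTF-8.
function normalize_ms_punctuation(string $text): string
{
    $map = [
        "\u{2018}" => "'",   // left single curly quote
        "\u{2019}" => "'",   // right single curly quote / apostrophe
        "\u{201C}" => '"',   // left double curly quote
        "\u{201D}" => '"',   // right double curly quote
        "\u{2013}" => '-',   // en-dash
        "\u{2014}" => '--',  // em-dash
        "\u{2026}" => '...', // ellipsis
    ];
    // strtr() with an array replaces every key in a single pass.
    return strtr($text, $map);
}
```

If this helps, it would go just before the mb_convert_encoding line, e.g. $text = normalize_ms_punctuation($text);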
"The main problem would be how many garbage cleanups will be necessary." That's exactly the problem. The o: is the simplest bit to clean up; the document, however, will probably be riddled with a zillion <span>s for inline declarations of fonts and margins.
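On the "one problem at a time" point: if the cleanup is scripted rather than done in an editor, preg_replace accepts an array of patterns, so several kinds of garbage can go in a single call. A rough sketch, assuming PHP 7 or later. The function name clean_word_markup is hypothetical, and the four patterns are guesses at typical Word output (o: tags, span wrappers, Mso classes, inline styles), not a complete list.

```php
<?php
// Hypothetical helper: removes several kinds of Word markup in one call.
// The pattern list is illustrative only; real documents will need more.
function clean_word_markup(string $html): string
{
    $patterns = [
        '/<\/?o:[^>]*>/i',             // Office namespace tags such as <o:p>
        '/<\/?span[^>]*>/i',           // opening and closing span tags (contents kept)
        '/\s+class="?Mso[^"\s>]*"?/i', // class attributes starting with Mso
        '/\s+style="[^"]*"/i',         // double-quoted inline style attributes
    ];
    // Patterns are applied in order; fall back to the input if PCRE fails.
    return preg_replace($patterns, '', $html) ?? $html;
}
```

For example, clean_word_markup('<p class="MsoNormal" style="margin:0"><span style="font-size:12pt">Hi</span><o:p></o:p></p>') comes back as <p>Hi</p>.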
When someone sends me a Word document to mark up, I actually forward it to my GMail account and use the "View as HTML" option. Google strips most of the Office markup and replaces it with simple tags which are easier to find and edit as needed.
Well, it may not be that simple but it gets you very close. No matter what you do, you will need to inspect each and every byte of code in the process. I've always found the surefire way is to just cut and paste into Notepad++, paste back into FrontPage and then do my structuring and styling from there. I've seen pasted Word documents generate upwards of 2000% more HTML code depending on the structure of the Word doc. If you get a Word author who knows their stuff and used all the nifty little features in Word, oh boy, watch out!
Not only will you get the HTML code bloat, you'll also have some accompanying CSS embedded in your <head></head> along with being dispersed throughout the document. What a mess that stuff creates.
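For the embedded CSS, note that PHP's strip_tags removes the <style> tags themselves but leaves the rules between them behind as text, so the style blocks need to go first. A minimal sketch, assuming PHP 7 or later and that Word's conditional sections arrive as ordinary HTML comments. The function name strip_embedded_css is made up for illustration.

```php
<?php
// Hypothetical helper: drops <style> blocks (tags and contents) and
// HTML comments, including Word's conditional comments.
function strip_embedded_css(string $html): string
{
    $patterns = [
        '/<style\b[^>]*>.*?<\/style>/is', // whole style blocks, across lines
        '/<!--.*?-->/s',                  // comments such as <!--[if gte mso 9]>...<![endif]-->
    ];
    // Fall back to the input if PCRE reports an error.
    return preg_replace($patterns, '', $html) ?? $html;
}
```

This only deals with <style> blocks and comments; style="..." attributes scattered through the body would still need their own pass.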
That's why we like plain <textarea>s for editing. Teach the authors to use basic HTML and minimize all the code bloat that the WYSIWYG editors are going to create from this whole cut and paste routine. Either that, or strip away all the bells and whistles and tell them they CANNOT perform any cut and paste routines without first going through a program like Notepad or something similar. All that embedded HTML needs to get stripped and there really is only one surefire way to do it.
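If the authors can't be trusted to go through Notepad first, something similar can be approximated on the server when the <textarea> is submitted. The sketch below assumes PHP 7 or later and UTF-8 input; to_plain_text is a hypothetical name, and this is not a claim about what Notepad itself does, only a rough server-side stand-in for it.

```php
<?php
// Hypothetical helper: reduces pasted markup to plain text, roughly what
// a round trip through a plain-text editor would leave behind.
function to_plain_text(string $html): string
{
    // Drop every tag, keeping only the text between them.
    $text = strip_tags($html);
    // Turn entities such as &nbsp; and &amp; back into characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of spaces, tabs and non-breaking spaces to one space.
    $text = preg_replace('/[ \t\x{00A0}]+/u', ' ', $text) ?? $text;
    return trim($text);
}
```

The result should be passed through htmlspecialchars() again before it is echoed into a page.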
The default system Notepad is just fine; I use Notepad++. All these bells and whistles get people into trouble, don't they? :)