1) My business partner manages freelance writers who work for us.
2) Everyone (business partner and freelancers) insists on using MS Word (hey, it's great to write printed stuff...not so much for anything that's going on the Web).
3) Much of this content is for an SEO company - putting the copy online (MySQL database) results in all kinds of problems with all the character conversions that Word does (quotes, apostrophes, en-dash, em-dash, ellipses) even with UTF-8 content type specified.
4) I can't really fix this problem with content type declarations and changing MySQL charsets because I have no direct control over such things.
Bottom line - I need a way to clean up all this stuff in Word AFTER it's produced. Easy enough "find and replace" on the quotes, but there's really no easy way to rid myself of the other ugly characters. Has anyone solved this problem at the source before? i.e. - received "bad" Word files and converted the junk in a consistent, foolproof manner?
Any suggestions are terribly welcome...I'm at my wit's end here!
Thanks in advance,
There must be a way of doing this automatically. One of the online editors has a "Clean up MS Word" option when pasting from Word. Is it Flyspeck?
Basically, all of the ones you mention - curly or "smart" quotes, dashes, etc. - are characters that Windows-1252 stores in the 0x80-0x9F byte range, where its values do not match the Unicode (or ISO-8859-1) code points. I've had this a fair bit where people are working in Windows-1252 (though that is becoming rare now that both Windows and Office default to Unicode). That's what I would look at though.
- look at my long post (2nd to last of mine, 4th from the bottom of the thread I think)
- the links that coopster gives in the very last post
That's always worked for me
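To make the mismatch concrete, here is a minimal Python sketch (not from the linked thread; the name `show_mismatch` is made up for illustration). It decodes the bytes Windows-1252 uses for smart punctuation and compares them with the same bytes read as ISO-8859-1, whose values coincide with the first 256 Unicode code points.

```python
# Bytes that Windows-1252 assigns to the "smart" punctuation Word inserts:
# ellipsis, single quotes, double quotes, en dash, em dash.
SMART_BYTES = [0x85, 0x91, 0x92, 0x93, 0x94, 0x96, 0x97]


def show_mismatch(byte_value: int) -> tuple[int, int]:
    """Return (code point via cp1252, code point via latin-1) for one byte."""
    raw = bytes([byte_value])
    as_cp1252 = ord(raw.decode("cp1252"))   # the character Word meant
    as_latin1 = ord(raw.decode("latin-1"))  # same number as the raw byte
    return as_cp1252, as_latin1


if __name__ == "__main__":
    for b in SMART_BYTES:
        cp, l1 = show_mismatch(b)
        print(f"byte 0x{b:02X}: cp1252 -> U+{cp:04X}, latin-1 -> U+{l1:04X}")
```

If the two columns differ for a byte, anything in the pipeline that guesses the wrong charset will show the wrong character for it, which is probably what is happening between Word and the database here.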
The "other" solution - I have also found something else that works. Copying from Word (with messed up characters) into Notetab Light removes these "bad" characters:
It DOESN'T clean up the ellipses (i.e. - three dots...), but it IS the only editor I've thus far found that will actually recognize Word's messed up version of the ellipses in the "find and replace" mode. Every other editor I've tried (Word, Notepad, Notepad++ and others) won't recognize those to do a find and replace (at least not at the default settings - perhaps with a plugin).
The "trick" to my solution is the step "when the file conversion dialog pops up, check MS-DOS", which I think forces you to use just plain ASCII characters. Whatever it does, it gets rid of characters that Notepad and most editors don't.
Pasting into Notepad is quicker, but is not 100% reliable, depending on which special characters are in the doc.
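For anyone who would rather script the same cleanup than round-trip through an editor, here is a rough Python sketch. It is not what Notetab or the MS-DOS conversion option actually does; it simply assumes the text has already been decoded to Unicode correctly, and the mapping table and the name `to_ascii_punctuation` are my own choices for illustration.

```python
# Map Word's "smart" punctuation (Unicode code points) to plain ASCII.
# The replacements are a judgment call; adjust them to house style.
SMART_TO_ASCII = {
    0x2018: "'",    # left single quote
    0x2019: "'",    # right single quote / apostrophe
    0x201C: '"',    # left double quote
    0x201D: '"',    # right double quote
    0x2013: "-",    # en dash
    0x2014: "--",   # em dash
    0x2026: "...",  # ellipsis
}


def to_ascii_punctuation(text: str) -> str:
    """Replace smart quotes, dashes, and ellipses with ASCII equivalents."""
    return text.translate(SMART_TO_ASCII)


if __name__ == "__main__":
    print(to_ascii_punctuation("\u201cWait\u2026 it\u2019s 5\u20136\u201d"))
```

Characters that are not in the table pass through unchanged, so accented letters and other legitimate non-ASCII text are left alone.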
[edited by: Canton at 5:06 pm (utc) on April 4, 2009]
I just did a cursory trial and '...' converted to '…' in the design view of Dreamweaver, but not in the code view. Although when pasted in the code view, it does render as a slightly odd-looking ellipsis when viewed in Firefox.
Once you scrub, then you rebuild the content/formatting in your HTML editor.
Notepad is required as a middleman to get rid of all the o:p stuff that you will find, and there is probably more of that than actual content.
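If the Word output arrives as HTML, the o:p tags can also be stripped with a short script. This is only a sketch under that assumption: the function name `strip_o_p_tags` is hypothetical, and it removes just the `<o:p>` tags themselves, not the rest of Word's markup (mso- styles, conditional comments and so on).

```python
import re

# Matches opening, closing, and self-closing <o:p> tags, case-insensitively.
O_P_TAG = re.compile(r"</?o:p\b[^>]*>", re.IGNORECASE)


def strip_o_p_tags(html: str) -> str:
    """Remove Word's <o:p> tags, keeping any text that sat between them."""
    return O_P_TAG.sub("", html)


if __name__ == "__main__":
    sample = '<p class="MsoNormal">Hello<o:p></o:p></p>'
    print(strip_o_p_tags(sample))
```

A regex is good enough for a flat tag like this one; for a fuller cleanup an HTML parser would be the safer choice.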