MS Word is Driving me Crazy with Converted Characters
This has been posted about before (I've checked, but I want to solve the problem on the MS Word side). Long story short:
1) My business partner manages freelance writers that work for us.
2) Everyone (business partner and freelancers) insists on using MS Word (hey, it's great to write printed stuff...not so much for anything that's going on the Web).
3) Much of this content is for an SEO company - putting the copy online (MySQL database) results in all kinds of problems with all the character conversions that Word does (quotes, apostrophes, en-dash, em-dash, ellipses) even with UTF-8 content type specified.
4) I can't really fix this problem with content type declarations and changing MySQL charsets because I have no direct control over such things.
Bottom line - I need a way to clean up all this stuff in Word AFTER it's produced. Easy enough "find and replace" on the quotes, but there's really no easy way to rid myself of the other ugly characters. Has anyone solved this problem at the source before? i.e. - received "bad" Word files and converted the junk in a consistent, foolproof manner?
Any suggestions are terribly welcome...I'm at my wit's end here!
Thanks in advance,
What I do, copy the entire page, then paste into Notepad .. then copy from Notepad and paste into my HTML editor.
I don't know if it handles everything, but it simplifies the issues.
Yes, that's what I do too. You have to go back and do some formatting again but it's the only solution I know.
There must be a way of doing this automatically. One of the online editors has a "Clean up MS Word" option when pasting from Word. Is it Flyspeck?
I do that too, but if the character encoding is wrong, it's wrong. Cutting and pasting like that doesn't help.
Basically, all of the ones you mention - curly or "smart" quotes, dashes, etc, are code points that do not match between Windows-1252 and Unicode. I've had this a fair bit where people are working in Windows-1252 (though becoming rare now that both Windows and Office default to Unicode). That's what I would look at though.
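To make the mismatch concrete, here is a rough Python sketch (my own illustration, not something from the linked thread). It prints the Unicode code point, the single Windows-1252 byte, and the UTF-8 bytes for a few of the usual Word characters, and simulates what happens when UTF-8 text is read as Windows-1252. The names WORD_CHARS, describe and mojibake are made up for this example.

```python
# A few characters that Word's AutoFormat commonly inserts.
WORD_CHARS = {
    "left single quote": "\u2018",
    "right single quote / apostrophe": "\u2019",
    "left double quote": "\u201c",
    "en dash": "\u2013",
    "em dash": "\u2014",
    "ellipsis": "\u2026",
}


def describe(ch):
    """Return (Unicode code point, Windows-1252 byte, UTF-8 bytes) as hex strings."""
    return (
        f"U+{ord(ch):04X}",
        ch.encode("cp1252").hex(),
        ch.encode("utf-8").hex(),
    )


def mojibake(text):
    """Simulate UTF-8 bytes being misread as Windows-1252.

    Bytes that Windows-1252 leaves undefined become U+FFFD instead of
    raising an error.
    """
    return text.encode("utf-8").decode("cp1252", errors="replace")


if __name__ == "__main__":
    for label, ch in WORD_CHARS.items():
        print(label, describe(ch))
    print(mojibake("it\u2019s"))
```

The last line shows the familiar three-character junk that appears in place of an apostrophe when the encodings disagree somewhere between Word, the database, and the page.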
- look at my long post (2nd to last of mine, 4th from the bottom of the thread I think)
- the links that coopster gives in the very last post
System: The following message was spliced on to this thread from: http://www.webmasterworld.com/content_copywriting/3884979.htm [webmasterworld.com] by coopster - 11:48 am on April 3, 2009 (utc -6)
I haven't read all the referenced posts yet, but I will. I'm also doing further research into the matter and will write back with things I've found if anything is useful. Copy/paste to Notepad does work for some items, but not all (again, the main culprits from Word always end up wrong). In Word itself, you can turn "OFF" the automatic conversion to smart quotes and the en-dash and em-dash features (Tools > AutoCorrect Options > "AutoFormat" & "AutoFormat As You Type" tabs). That may help some, but I'm not actually producing this content, of course (I use Notepad++) and I don't trust others to remember to turn those features off, esp. if they regularly turn them back "on" for other projects.
From inside MS Word
File: save as plain text
when the file conversion dialog pops up, check MS-DOS
you'll get a warning that the text shown in red will not save correctly in the chosen encoding
check the "Allow character substitution"
the nasties are converted to web usable characters
That's always worked for me
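If the text is already out of Word, the same kind of substitution can be approximated in a script. This is only a sketch: the mapping below is my own guess at sensible replacements, not the table Word actually uses for "Allow character substitution", and SUBSTITUTIONS and to_plain_ascii are hypothetical names.

```python
# Assumed replacements; adjust to taste (e.g. some people prefer " - " for an em dash).
SUBSTITUTIONS = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
    "\u00a0": " ",    # non-breaking space
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(SUBSTITUTIONS)


def to_plain_ascii(text):
    """Replace common Word typographic characters with plain ASCII equivalents."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(to_plain_ascii("\u201cIt\u2019s here\u201d \u2013 wait\u2026"))
```

Anything not listed in the mapping passes through untouched, so accented letters and the like are left alone.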
CrustyAdmin...such a simple, elegant solution. Now, why couldn't I have figured that out? Thank you for that tip - it's simpler than the one I've found on my own (just a short while ago).
The "other" solution - I have also found something else that works. Copying from Word (with messed up characters) into Notetab Light removes these "bad" characters:
It DOESN'T clean up the ellipses (i.e. - three dots...), but it IS the only editor I've thus far found that will actually recognize Word's messed up version of the ellipses in the "find and replace" mode. Every other editor I've tried (Word, Notepad, Notepad++ and others) won't recognize those to do a find and replace (at least not at the default settings - perhaps with a plugin).
Actually I find it quicker to just cut and paste through Notepad because I don't have to save and reopen any files. Or am I misunderstanding these instructions?
Notepad won't get rid of all the special characters because (I think this is why) it uses a Windows character set.
The "trick" to my solution is the "when the file conversion dialog pops up, check MS-DOS" step. That forces you to use just plain ASCII characters, I think, but whatever it does, it gets rid of characters that Notepad and most editors don't.
Paste to Notepad is quicker, but is not 100% depending on what special characters are in the doc.
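One way to check whether a given route really got rid of everything is to list whatever non-ASCII characters are left in the text. A minimal sketch, assuming the cleaned copy is available as a Python string (find_non_ascii is a made-up helper name):

```python
import unicodedata


def find_non_ascii(text):
    """Return (char, code point, Unicode name) for each distinct non-ASCII character."""
    leftovers = {ch for ch in text if ord(ch) > 127}
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in sorted(leftovers)
    ]


if __name__ == "__main__":
    sample = "Notepad left these behind\u2026 \u2014 see?"
    for ch, code_point, name in find_non_ascii(sample):
        print(code_point, name)
```

An empty list means the text is plain ASCII. Anything reported can then be targeted with a find and replace.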
BeeDeeDubbleU and CrustyAdmin - that is my experience exactly, i.e. - Notepad won't always get rid of Word's special characters. added: pardon me, I didn't mean "won't always," I meant "won't get rid of ALL of the bad characters."
[edited by: Canton at 5:06 pm (utc) on April 4, 2009]
I've never tried this .. but what about saving as HTML, then pasting in Notepad to get rid of Word's formatting? Pretty much same-same, I suppose.
I just did a cursory trial and '...' converted to '…' in the design view of Dreamweaver .. not so in the code view. Although when pasted in the code view, it does render as a slightly odd-looking ellipsis when viewed in Firefox.
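If the goal is to keep the typographic characters rather than flatten them, another option is to turn every non-ASCII character into an HTML numeric character reference, which stays plain ASCII in the source whatever encoding the page or database ends up using. A small sketch using Python's built-in "xmlcharrefreplace" error handler (to_char_refs is a name I made up):

```python
def to_char_refs(text):
    """Replace every non-ASCII character with an HTML numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(to_char_refs("It\u2019s done\u2026"))
```

Note that this only touches non-ASCII characters. It does not escape &, < or >, so it is not a substitute for proper HTML escaping.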
I wrote a macro in "TextPad" that accomplishes a search and replace of all culprits from MSWord. As stated above, saving to .txt file within Word also works.
You will always need to clean up Word content prior to publishing. There is no way to get around what you are describing. I use Notepad++ in addition to FrontPage which has a Remove Formatting option that I don't use as much anymore. Notepad++ for the initial cleansing. And then find and replace for the curlies, m/n dashes, etc.
Once you scrub, then you rebuild the content/formatting in your HTML editor.
Notepad is required as a middleman to get rid of all the o:p stuff that you will find. And, there is probably more of that than content.
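For anyone working from Word's HTML output instead of pasting through Notepad, the o:p tags can also be stripped with a regular expression. This is a rough sketch that only removes the o:p tags themselves and keeps whatever text sits between them. It does not attempt to deal with the rest of Word's markup (styles, conditional comments, and so on), and strip_office_tags is a hypothetical name.

```python
import re

# Matches opening, closing, and self-closing o:p tags, with or without attributes.
_OP_TAG = re.compile(r"</?o:p\b[^>]*>", re.IGNORECASE)


def strip_office_tags(html):
    """Remove Word's <o:p> tags from an HTML string, keeping the text between them."""
    return _OP_TAG.sub("", html)


if __name__ == "__main__":
    print(strip_office_tags('<p class="MsoNormal">Hello<o:p></o:p></p>'))
```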