My first response, before even looking at the screenshots, was: It isn't enough to look at the html code. That's simply what is sent to your browser. You need to know how the html was created in the first place, whether hand-rolled or made by custom php or a CMS.
My second response, after a cursory glance at the screenshots, was: Oh, I get it, all the non-ASCII characters are getting converted to decimal entities. (Interestingly there was a post just a few days ago leading to the discovery that some mobiles don't "do" decimal entities; it has to be hexadecimal.)
In fact it's worse: about half of the ordinary letters are being rendered as decimal code for absolutely no reason. I really, really hope this isn't happening in the server, as it means that about every other letter takes up seven times as many bytes as it needs to.
The encoding is random, rather than consistent: here an "o", there a "о" and similar.
One of your two screenshots shows the text as-is-- except for the somewhat glaring issue of <br> where <p></p> seems to be warranted. So what's wrong with starting with that version?
In any case, there's no earthly reason why ordinary ASCII characters would be converted into decimal entities. You may need to lean a little harder on your writer.
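To put a number on the "seven times as many bytes" point, here is a minimal Python sketch. The entity value 1086 is an assumption for illustration (it falls in the range discussed later in this thread), and the sample string is made up, not taken from the actual page.

```python
import html

# Hypothetical snippet: the word "hotel" with two letters written
# as decimal entities (values assumed for illustration).
encoded = "h&#1086;t&#1077;l"

# html.unescape turns numeric entities back into the characters they name.
decoded = html.unescape(encoded)

entity_bytes = len("&#1086;".encode("ascii"))  # bytes used by one entity
plain_bytes = len("o".encode("ascii"))         # bytes used by a plain "o"

print(decoded)
print("pure ASCII after decoding?", decoded.isascii())
print(f"{entity_bytes} bytes per entity vs {plain_bytes} byte per plain letter")
```

If the decoded text is still not pure ASCII, the entities were never standing in for ordinary letters in the first place.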
Thank you very much for your response.
All the text I have is in a Word file (.docx), not from the server...
as for "One of your two screenshots shows the text as-is-- except for the somewhat glaring issue of <br> where <p></p> seems to be warranted. So what's wrong with starting with that version? "
The first screenshot is not normal; it is just how the text appears in Dreamweaver on one computer, and the second screenshot is how it appears on another.
So what should I do to convert all the text to a normal encoding, to avoid problems with search engines?
Have you tried to copy from the browser screen and paste into an actual text editor such as Notepad++? That might possibly give you uniformity that could then be pasted into the appropriate places on your .html document. I have nothing comparable to try it out with, so this is just a basic idea. It is the first thing I'd try to use to fix it if it happened to me.
One other idea - can you open the .docx file and save it as a .txt document?
Yes, I have tried both of them... nothing helps...
Please try to convert the text below to a decent encoding, or tell me how to do it.
Here is a sample of the text (copied from the .docx file) that I am trying to convert to a uniform encoding:
"Last minutе trаvеl - whеrе tо find thе bеѕt deals
Sоmеtimеѕ if timе iѕ оn your ѕidе the vеrу bеѕt wау to trаvеl iѕ Lаѕt Minutе Trаvеl. Yоu can оftеn find thе very bеѕt сhеар trаvеl dеаlѕ this wау. Oftеn you саn find mаnу such deals to еvеn fаrаwау places likе Mаlауѕiа оr Singapore or еvеn some оf your оthеr drеаm lосаtiоnѕ. In fact more оftеn than none the faraway рlасеѕ offer thе bеѕt vаluе fоr mоnеу whеn it iѕ a lаѕt minutе travel dеаl.
Rеаѕоnѕ Why Lаѕt minute trаvеl dеаlѕ аrе ѕо good include:- "
I was going to say:
For heaven's sake. All you have to do is paste the text-- including entities-- into any text editor with an HTML preview function.
But then things get interesting as I realize belatedly that those are not ASCII character entities. I'm really sorry I didn't home in on this in the first place, because it should have been obvious.
What you've got is decimal characters in the 107x-110x range, corresponding to hexadecimal 04xx. Those are Cyrillic letters that happen to have the same letterforms as assorted Roman letters. And there's no legitimate reason for that to happen. There's a CJK process that's loosely analogous and can be legitimate. But this? Nuh-uh.
So there's something going on that either your designer isn't telling you, or you're not telling us.
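For anyone who wants to check this on their own text, here is a small Python sketch that lists every non-ASCII character with its decimal value, hex code point and official Unicode name. The function name `non_ascii_report` is made up for this example, and the sample string is a short stand-in written with escapes so the lookalike letters are unambiguous.

```python
import unicodedata

def non_ascii_report(text):
    """Return (char, decimal, hex code point, Unicode name) for each
    distinct non-ASCII character in text, sorted by code point."""
    report = []
    for ch in sorted(set(text)):
        if ord(ch) > 127:
            report.append(
                (ch, ord(ch), f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
            )
    return report

# Stand-in sample: "Last minute travel" with some lookalike letters swapped in.
sample = "Last minut\u0435 tr\u0430v\u0435l"

for ch, dec, hexcode, name in non_ascii_report(sample):
    print(f"{ch!r}  decimal {dec}  {hexcode}  {name}")
```

Anything reported as CYRILLIC in what is supposed to be English text is a lookalike that needs replacing.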
Is the writer in an area where they might be using a PC configured for the Cyrillic characters by default? Or perhaps they occasionally work with that setting? If so, they should be able to adjust that and send it back in whatever format you use. If neither your machine nor theirs is set up that way, I would triple check for malware at both ends.
"some articles from the writer"
"a PC configured for the Cyrillic characters by default"
I really doubt that's the explanation. If so, you'd be seeing randomized garbage of a very distinctive kind, where one 1-byte encoding is getting interpreted as a different 1-byte encoding. These are clearly unicode characters. It's especially striking in the first screenshot-- once you know what you're looking for-- because the non-Roman characters are in a different font. In fact the shoe didn't drop for me until I tried pasting from html preview back into this thread; SEE's www preview happens to use a serif font in which Cyrillic is extremely similar to Roman in overall size and shape.
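To illustrate the difference, here is a rough Python sketch of what a genuine 1-byte encoding mix-up looks like, next to the kind of text seen in this thread. The choice of cp1251 and cp1252 as the two code pages is an assumption for the demonstration; the point is only the shape of the result.

```python
# A Russian word, written with escapes: "privet" in Cyrillic letters.
word = "\u043f\u0440\u0438\u0432\u0435\u0442"

# Classic mix-up: bytes written as Windows-1251 (Cyrillic) but
# read back as Windows-1252 (Western). Every letter turns into garbage.
raw = word.encode("cp1251")
garbled = raw.decode("cp1252")
print(garbled)

# What this thread shows instead: mostly Latin text with a few
# individual Cyrillic code points mixed in. It reads fine on screen.
mixed = "tr\u0430vel"
print(mixed, [hex(ord(c)) for c in mixed if ord(c) > 127])
```

The first case produces a run of accented Western letters with no readable words; the second stays readable, which is why it is so easy to miss.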
A DOCX file is a compressed file and the newer version of the DOC format. I would first save it as a DOC file for Word 95/2003, making sure the language and character set are Western or UTF-8.
Many versions of Word have the option 'Save As Filtered HTML' in the 'Save As' dialogue. This will filter out all the proprietary MS inline CSS and XML.
There are online pages that do the same all in one. A few add a site footer, beware!
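Since a DOCX is a zip archive with the text stored as XML in `word/document.xml`, it is also possible to pull the text out without Word at all. The following Python sketch uses only the standard library; the function name `docx_paragraphs` is made up, and it ignores tables, headers and footnotes, so treat it as a starting point rather than a full converter.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(source):
    """Return the text of each paragraph in the main document body.

    source may be a file path or a file-like object opened in binary mode.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

Usage would be something like `docx_paragraphs("article.docx")`, where the file name is hypothetical. Note that this extracts the characters exactly as stored, so any lookalike letters in the document come through unchanged and still need cleaning afterwards.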
Thank you all for the feedback!
I solved the problem by converting the docx to jpg and then the jpg to text!
Now I can finally use this text!
P.S. The writer may be in Russia, but I'm not sure.
P.S. 2. It is quite possible that the writer's PC is infected, because he sent me the articles several times (over Fiverr) in .doc, .docx and .pdf, and the same encoding errors appear in all of them.
Huh what? You mean you made images of the text and then did OCR on it? That sounds like the tag end of a Clients From Hell story :)
I'd think it would be faster to globally replace the entities: thing-that-looks-like-e into "e", thing-that-looks-like-o into "o" and so on. There are only about half a dozen different ones, mainly vowels. But I guess it can't have been that time-consuming, if you've already done it.
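The global replace can be sketched in a few lines of Python. The mapping below is an assumption based on the lookalike letters that seem to appear in the sample posted earlier; it may be incomplete, so check the output for any remaining non-ASCII characters.

```python
# Cyrillic lookalikes mapped to the Latin letters they imitate.
# This table is an illustrative guess, not an exhaustive list.
HOMOGLYPHS = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0455": "s",  # CYRILLIC SMALL LETTER DZE
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
}
TABLE = str.maketrans(HOMOGLYPHS)

def to_latin(text):
    """Replace known Cyrillic lookalikes with plain Latin letters."""
    return text.translate(TABLE)

# Stand-in for a phrase from the article: "cheap travel deals".
dirty = "\u0441h\u0435\u0430\u0440 tr\u0430v\u0435l d\u0435\u0430l\u0455"
clean = to_latin(dirty)
print(clean, "| pure ASCII:", clean.isascii())
```

If `isascii()` still reports False after the replacement, some other lookalike is present and needs adding to the table.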
Don't overlook the File Save as SIMPLE HTML which WORD or OFFICE allows. If you are not using a MS valid program you'll get very strange results. Your best move is to download from the Microsoft website the docx VIEWER (it's free). It appears you don't have that minimum app in use, or you are attempting to open docx in an older version of Word.