Using Twitter API to get user timeline
Japanese symbols in updates are incorrectly transformed
I have a client who wants me to include his Twitter timeline on his website. That seems simple enough, and it was, with one exception. A little background first:
I grab a new update from Twitter via one of their APIs every few minutes and save it to an .xml file since the data is already in XML format.
Then I load the .xml file and look for the created_at and text child nodes of each statuses/status parent node, read them using the text method of selectSingleNode, and then write them to the text file that I include on the web page. The text file is pre-formatted with DIV tags and related stylesheet classes for both child nodes.
Okay, here's the problem. The text child node often contains Japanese symbols in Unicode format (&#nnnnn;) and the process of reading them from the .xml file seems to convert them into ASCII characters which then appear on the web page as question marks instead of Japanese symbols.
My question: Is there a way of preserving the Unicode format of the Japanese characters when reading them from the .xml file?
If you can, convert those Unicode characters to UTF-8, then make sure your page uses UTF-8 encoding. There are lots of tutorials on the Interweb describing how to do that. Even then, the display of Japanese characters is dependent on the font family and browser settings. Naturally someone who browses in Japanese would see them no problem, but if your localization settings are all en-us, you might see ? and those annoying little rectangles. I've had good experiences with UTF-8 characters in basic faces like Arial, Times New Roman, Tahoma, etc.
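A minimal sketch of that advice, written in Python rather than the VBScript used in this thread, so treat it as an illustration of the idea and not a drop-in fix. The function name, the CSS class names and the sample date are made up for the example. It builds the pre-formatted DIV fragment and encodes it as UTF-8, so the Japanese character is stored as real bytes instead of a question mark.

```python
import html


def render_update(created_at: str, text: str) -> bytes:
    """Return one pre-formatted update as UTF-8 encoded bytes.

    The class names are hypothetical; use whatever the stylesheet defines.
    """
    fragment = (
        f'<div class="tweet-date">{html.escape(created_at)}</div>\n'
        f'<div class="tweet-text">{html.escape(text)}</div>\n'
    )
    # Encoding explicitly as UTF-8 keeps characters outside ASCII intact.
    return fragment.encode("utf-8")


# "\u307e" is the character that the reference &#12414; stands for.
sample = render_update("Mon Jan 01 00:00:00 +0000 2010", "\u307e")
print(sample)
```

The page that includes the fragment also has to declare the same encoding, for example with a `Content-Type: text/html; charset=utf-8` header or a matching meta tag. Otherwise the browser may guess a different encoding and garble the bytes.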
|If you can, convert those Unicode characters to UTF-8, then make sure your page uses UTF-8 encoding. |
Thanks for your reply. I spent several hours yesterday trying to find examples of this without success.
Everything I tried failed because it converted, for example, &#12414;, into an ASCII character, which when displayed on a web page looked like a question mark, or worse, those squares you mentioned.
If I pasted the text exactly as it was in the XML format I got from Twitter, which was in the format, & # 12414, without the spaces I had to add to accommodate WebmasterWorld's software, it showed up as Japanese characters. The minute I would read from the XML file it would get converted to an ASCII character that displayed as a question mark on a web page.
Anyway, that's all moot at this point because I had to get the job done, so I developed my own approach to the problem which was reading the XML file as a text file, which preserved the formatting, and parsing it using regular expressions. The file is very short, the most recent 20 tweets, and my method works perfectly. This is not how I would have preferred doing it, but at least it's done and I can't see any flaws with my method.
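The read-it-as-text approach described above could look roughly like this. It is a Python sketch, not the poster's actual code, and the sample XML is a cut-down, made-up stand-in for a real timeline file. Because the file is never parsed as XML, the &#nnnnn; references pass through untouched.

```python
import re

# Captures the first created_at and text values inside each status element.
# This assumes the elements always appear in this order, as in the sample.
STATUS_RE = re.compile(
    r"<status>.*?<created_at>(.*?)</created_at>.*?<text>(.*?)</text>.*?</status>",
    re.DOTALL,
)


def extract_updates(raw_xml: str):
    """Return a list of (created_at, text) tuples from the raw file contents."""
    return STATUS_RE.findall(raw_xml)


# Made-up sample data in the shape the thread describes.
raw = (
    "<statuses>"
    "<status><created_at>Mon Jan 01 00:00:00 +0000 2010</created_at>"
    "<text>&#12414;&#12383;</text></status>"
    "</statuses>"
)

for created_at, text in extract_updates(raw):
    print(created_at, text)
```

The trade-off is that the pattern depends on the element order and on the file staying small and regular, which matches what is described here.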
But, if someone can provide an example of how to preserve the Unicode format, I'd love to see it and modify my approach to this problem. Thanks.
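One way to keep the &#nnnnn; form while still using an XML parser is to let the parser decode the text and then re-escape anything outside ASCII before writing it out. The sketch below shows that step in Python; the helper name is made up, and the VBScript equivalent would need its own code (AscW looks like the relevant building block, but that is untested here).

```python
def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal &#nnnnn; reference."""
    # "xmlcharrefreplace" is a standard Python codec error handler.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_char_refs("\u307e\u305f"))  # &#12414;&#12383;
```

The result is plain ASCII, so it should survive a text file or page that is not set up for UTF-8, and the browser turns the references back into Japanese characters.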
hey, if it works... good job GaryK :)
There are no flaws with parsing XML with a regex, as long as your XML is reliably consistent - and short - and your regex is sufficiently precise. I mean, it's not philosophically wrong to treat XML like a string instead of an object.
So, it was reading the XML (what method were you using?) that transformed all those funky characters into Unicode &#nnnn; entities. I wonder if that's a flaw in the parser, 'cuz if that content isn't in a <![CDATA[ ]]> node, the XML will become invalid.
I'm not sure what the correct term is for how I was reading it. I used VBScript and the MSXML2.DOMDocument COM object to do all of it because I'm working with ASP Classic.
I used XMLDocument.Load to load the xml file.
Then I used Set XMLStatusNodes = XMLDocument.selectNodes("statuses/status") to grab the status updates.
Followed by looping thru the status updates using For Each XMLStatusNode In XMLStatusNodes
Next I used XMLStatusNode.selectSingleNode("created_at").text and XMLStatusNode.selectSingleNode("text").text to grab the date and update text for each XMLStatusNode.
And that's where the problem started. When the text node contained what I'm calling Unicode characters (&#nnnn;), they would get converted to ASCII characters. And the ASCII characters would be displayed as question marks on the web page.
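The load, select, loop and read steps above can be mirrored in Python to show what an XML parser normally does with such a reference. This uses the standard library's ElementTree, not MSXML, and made-up sample data, so it only illustrates the general behavior. The parser decodes &#12414; into a real Unicode character. The question mark appears only when that character is later forced into an encoding that cannot represent it.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the statuses/status shape described in the thread.
SAMPLE = (
    "<statuses><status>"
    "<created_at>Mon Jan 01 00:00:00 +0000 2010</created_at>"
    "<text>&#12414;</text>"
    "</status></statuses>"
)


def read_updates(xml_source: str):
    """Return (created_at, text) for each status element."""
    root = ET.fromstring(xml_source)  # root is the statuses element
    updates = []
    for status in root.findall("status"):
        updates.append((status.findtext("created_at"), status.findtext("text")))
    return updates


updates = read_updates(SAMPLE)
text = updates[0][1]
print(ord(text))  # the code point of the decoded character

# Forcing the character into ASCII is what produces the question mark.
lossy = text.encode("ascii", errors="replace")
print(lossy)
```

If MSXML behaves the same way, the value read from the text node is fine and the loss would happen in the write step or in the page encoding. That is an assumption and has not been checked against the setup described here.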
Honestly, I'm not very knowledgeable about XML files other than very basic stuff. I don't know what the <![CDATA[ ]]> node is all about. Never saw it before. But I copied/pasted it to my notes file to learn more about it in the morning.
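On the CDATA question, a small Python sketch of the difference, using the standard library parser as a stand-in for MSXML. A numeric character reference is legal in ordinary element content and gets decoded by the parser. Inside a CDATA section the same characters are treated as literal text.

```python
import xml.etree.ElementTree as ET

# Ordinary element content: the reference is decoded to one character.
plain = ET.fromstring("<text>&#12414;</text>").text

# CDATA section: the parser leaves the eight characters exactly as written.
cdata = ET.fromstring("<text><![CDATA[&#12414;]]></text>").text

print(len(plain), len(cdata))  # 1 8
```

So a feed that carries &#nnnnn; references outside CDATA is still well-formed XML.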
I have no idea how consistent the XML is. I'm guessing very consistent because there are so many third-party products that work with Twitter. It would be quite annoying to make all of them push out updates on a regular basis.
Finally, thanks for the "good job!"