Best way to deal with non-ascii chars

Parsing for oddball characters....

         

trillianjedi

3:48 pm on Sep 21, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've got this home-brew script that adds articles to an RSS feed.

Occasionally, if I copy-paste from somewhere into the text box I end up invalidating my RSS feed because I've copied in some strange character. I've never fully worked out why this is.

Example from today "can't" (which looks correct in the browser when I post) becomes:-

canā\x80\x99t

I don't know where it comes from, but is there any way I can code around it?

I'm thinking perhaps if, when posting, I parse the entire string one character at a time looking for ORD values within a certain range [64..whateveritis]?

I don't need code - that I can do, but some help in identifying what this is and the best approach to dealing with it would be mighty handy :)

TJ
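For anyone landing here later, the character-by-character scan TJ describes can be sketched as below. This is only an illustration in Python rather than the PHP the thread is about; the function name `find_non_ascii` and the cutoff of 127 (the top of 7-bit ASCII) are assumptions, not something from TJ's script.

```python
def find_non_ascii(text):
    """Return (index, character, code point) for each character above 7-bit ASCII."""
    # Assumption: anything with a code point above 127 counts as "oddball".
    return [(i, ch, ord(ch)) for i, ch in enumerate(text) if ord(ch) > 127]


# "can't" typed with a curly apostrophe (U+2019) instead of a plain one.
sample = "can\u2019t"

for index, ch, code in find_non_ascii(sample):
    print(f"position {index}: U+{code:04X}")
```

A scan like this only tells you where the odd characters are. Whether they are a problem depends on the encoding the page and feed declare, which is where the rest of the thread goes.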

coopster

7:16 pm on Sep 21, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



You are cutting and pasting text into a <textarea>? If so, have you checked the data in the POST variable on its way back to you? Is it OK there or no?

jatar_k

7:18 pm on Sep 21, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Is this only something you do, or does it allow input from other users?

You should know to paste into a text editor first before pasting into a web form ;)

trillianjedi

7:20 am on Sep 22, 2006 (gmt 0)


You are cutting and pasting text into a <textarea>? If so, have you checked the data in the POST variable on its way back to you? Is it OK there or no?

Yes - pasting into a textarea. Untested in the POST var, that's what I want to do now.

text editor

Yes, doing that would fix it, but that's exactly what I seek to avoid having to do.

Just me using it - so can be pretty rough/buggy.

coopster

2:02 pm on Sep 22, 2006 (gmt 0)


I would look at two things then. Go ahead and cut/paste into your <textarea> and POST your page. On the processing page, dump two things to your browser before doing anything else.
  1. Look at your $_SERVER superglobal to see what charset and language your browser is using.
  2. Look at the data in the $_POST['yourTextArea'] variable to see how it looks.

Maybe this will get you started ...
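One way to "see how it looks" without the browser's own rendering getting in the way is to dump the submitted bytes as hex and try reading them under more than one charset. The sketch below does that in Python rather than PHP; `describe_bytes` is a hypothetical helper, and the two candidate charsets (UTF-8 and cp1252) are assumptions chosen because they are the usual suspects for pasted text.

```python
def describe_bytes(raw):
    """Return a hex dump of raw plus its decoding under two candidate charsets."""
    hex_dump = " ".join(f"{b:02x}" for b in raw)
    readings = {}
    for charset in ("utf-8", "cp1252"):
        try:
            readings[charset] = raw.decode(charset)
        except UnicodeDecodeError:
            # The bytes are not valid in this charset at all.
            readings[charset] = None
    return hex_dump, readings


# Stand-in for the posted field: "can't" with a curly apostrophe, sent as UTF-8.
raw = "can\u2019t".encode("utf-8")
hex_dump, readings = describe_bytes(raw)
print(hex_dump)
print(readings)
```

If the hex shows `e2 80 99` where the apostrophe should be, the data arrived as UTF-8. A single `92` byte would instead point at Windows-1252.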

trillianjedi

2:21 pm on Sep 22, 2006 (gmt 0)


Thanks coop. I'll try that and see what I get.

Look at your $_SERVER superglobal to see what charset and language your browser is using.

Can I force that somehow, so that the textarea is forced into regular old ASCII or something?

coopster

2:37 pm on Sep 22, 2006 (gmt 0)


I don't believe so, not on the client end of things.

There is an IMPLIED accept-charset attribute in the FORM [w3.org] element you could try playing with, but I don't think that is going to get you anywhere. I've tried other tricks with the IMPLIED attributes before but got nowhere. I'm skeptical about this even being a possibility, so I'm not even sure why I'm mentioning it ;-)

I think the biggest issue is determining what the raw POST data looks like. Does the word "can’t" come over as you expected?

lmo4103

4:02 pm on Sep 22, 2006 (gmt 0)

10+ Year Member



Was "can't" pasted from a Windows application that uses win1252 a.k.a. cp1252 a.k.a. Windows Latin? Windows took the lead in extending the character set and the web did not follow. Maybe the apostrophe in "can't" is a special "Windows" curling apostrophe that displays properly if you have your character encoding set just right, but whacko if displayed in UTF-8. If so, there is the possibility of translating it, but it is an incredible pain.
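The translation mentioned above can be less painful if it is limited to the handful of "smart" punctuation marks that word processors tend to produce. A minimal sketch in Python (not PHP), assuming the text has already been decoded to Unicode; the mapping `SMART_TO_ASCII` and the function `flatten_punctuation` are hypothetical names, and the list of characters is a small sample, not a complete one.

```python
# Hypothetical mapping from common typographic punctuation to plain ASCII.
SMART_TO_ASCII = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark (the curly apostrophe)
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
}

_TABLE = str.maketrans(SMART_TO_ASCII)


def flatten_punctuation(text):
    """Replace the mapped typographic characters with their ASCII stand-ins."""
    return text.translate(_TABLE)


# A Windows-1252 byte string: 0x92 is the curly apostrophe in that code page.
pasted = b"can\x92t".decode("cp1252")
print(flatten_punctuation(pasted))
```

Anything not in the mapping passes through untouched, so accented letters and other legitimate non-ASCII text survive.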

trillianjedi

1:31 pm on Oct 9, 2006 (gmt 0)


Look at your $_SERVER superglobal to see what charset and language your browser is using

Turned out to be a good spot Coop - I was declaring to the browser that I was sending ISO-8859-1 and then promptly serving UTF-8.

Oops :)

Now fixed. Thanks.

TJ
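For later readers, the mismatch TJ describes can be reproduced in a few lines. This is a sketch in Python rather than PHP, and it assumes the curly apostrophe (U+2019) was the character involved, which is the most plausible reading of the `\x80\x99` bytes in the first post but is not confirmed in the thread.

```python
original = "can\u2019t"  # "can't" with a curly apostrophe

# The text is stored and served as UTF-8: the apostrophe becomes three bytes.
as_sent = original.encode("utf-8")

# Something downstream trusts the ISO-8859-1 label and reads one character per byte.
misread = as_sent.decode("iso-8859-1")
print(repr(misread))

# Reversing the wrong step recovers the original text.
repaired = misread.encode("iso-8859-1").decode("utf-8")
print(repaired == original)
```

The repair step is only a rescue for data that has already been mangled. The real fix is the one TJ made: declare the same charset that is actually being served.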