I am using preg_match to parse files in UTF-8, and all Latin and foreign-language characters are read correctly. However, when I parse files in the GBK and GB2312 charsets, the foreign characters I get back do not have the correct Unicode values.
Is there some setting that needs to be changed for preg_match to process the non-UTF-8 charset text differently?
Thanks!
Mansoor
I'm not sure you can do what you want with regex. Can you give the code example that you are working with, please?
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-CN" dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gbk"/>
</head>
...
The current code I have uses preg_match to match text in the page that is of interest. An example call is:
preg_match($pattern, $html, $matches)
If I use any Chinese characters from the HTML page in my regular expression, the resulting regular expression does not yield any matches.
If I use a regular expression made up of HTML tags, the Chinese characters in the text returned by preg_match are different from the characters shown in Firefox.
I suspect that this difference is due to the encoding of the HTML text. My code works fine with non-English and Chinese text encoded in UTF-8.
Is there some way I can convert from one charset to another?
Thanks!
Mansoor
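One possible approach, sketched here rather than taken from the thread: convert the GBK page to UTF-8 before matching, then run preg_match with the /u modifier so the pattern and subject are both treated as UTF-8. This assumes the mbstring extension is loaded and PHP 7 or later; to_utf8() is a made-up helper name, and the sample HTML string is only a stand-in for a fetched page. iconv() should work as an alternative to mb_convert_encoding() where it is available.

```php
<?php
// Hypothetical helper (not from the original post): convert a string
// from a known source charset to UTF-8 using the mbstring extension.
function to_utf8(string $html, string $fromCharset): string
{
    // Argument order: string, target encoding, source encoding
    return mb_convert_encoding($html, 'UTF-8', $fromCharset);
}

// Stand-in for a page served as GBK: encode two Chinese characters
// (U+4E2D U+6587) as GBK bytes so the example is self-contained.
$gbkHtml = mb_convert_encoding("<title>\u{4E2D}\u{6587}</title>", 'GBK', 'UTF-8');

// Convert to UTF-8 first, then match.
$utf8Html = to_utf8($gbkHtml, 'GBK');

// The /u modifier tells PCRE to treat pattern and subject as UTF-8;
// \p{Han} matches characters in the Han script.
$matches = [];
if (preg_match('/<title>(\p{Han}+)<\/title>/u', $utf8Html, $matches)) {
    echo $matches[1], "\n";
}
```

Running the same pattern against $gbkHtml directly would not be expected to work, since the GBK bytes are not valid UTF-8. For a GB2312 page, the same call with 'GB2312' as the source charset should apply, assuming the charset declared in the meta tag is accurate.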