Forum Moderators: coopster


preg match using charset GBK and GB2312

preg_match GBK GB2312

         

mmirza

8:17 pm on Apr 9, 2008 (gmt 0)

10+ Year Member



I need to parse html files with charsets GBK and GB2312.

I am using preg_match to parse files in UTF-8, and all Latin and non-Latin characters are read correctly with it. However, when I parse files with the charsets GBK and GB2312, the non-Latin characters that come back do not have the correct Unicode values.

Is there some setting that needs to be changed for preg_match to process the non-UTF-8 charset text differently?

Thanks!
Mansoor

eelixduppy

8:49 pm on Apr 15, 2008 (gmt 0)



Hello and Welcome to WebmasterWorld!

I'm not sure you can do what you want with regex. Can you give the code example that you are working with, please?

mmirza

2:59 pm on Apr 17, 2008 (gmt 0)

10+ Year Member



The HTML files I need to parse begin with:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-CN" dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gbk"/>
</head>
...

The current code I have uses preg_match to match text in the page that is of interest. An example call is:

preg_match($pattern, $html, $matches)

If I use any Chinese characters from the HTML page in my regular expression, the resulting regular expression does not yield any matches.

If I use a regular expression made up only of HTML tags, the Chinese characters in the text returned by preg_match are different from the characters that are shown in Firefox.

I suspect that this difference is because of the encoding of the HTML text. My code works fine with non-English text, including Chinese, when it is encoded in UTF-8.

Is there some way I can convert from one charset to another?

Thanks!
Mansoor

mmirza

6:11 pm on May 6, 2008 (gmt 0)

10+ Year Member



I found a solution for this issue - I used iconv() - [uk3.php.net...] - to convert non-UTF-8 pages to UTF-8 before calling preg_match on them.
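
For anyone landing here later, a minimal sketch of that approach might look like the following. It is not the exact code from my project: the helper names detect_declared_charset() and to_utf8() are made up for illustration, and it assumes the page declares its charset in a meta tag like the one quoted above, that the iconv extension knows the GBK encoding, and PHP 7 or later.

```php
<?php
// Hypothetical helper: read the charset declared in the page's <meta> tag.
// The declaration itself is plain ASCII, so it can be matched before conversion.
function detect_declared_charset(string $html, string $default = 'UTF-8'): string
{
    if (preg_match('/charset\s*=\s*["\']?([A-Za-z0-9_\-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return $default;
}

// Hypothetical helper: return the page as UTF-8, converting only when needed.
function to_utf8(string $html): string
{
    $charset = detect_declared_charset($html);
    if ($charset === 'UTF-8' || $charset === 'UTF8') {
        return $html;
    }
    // GBK is a superset of GB2312, so decoding as GBK covers both declarations.
    if ($charset === 'GB2312') {
        $charset = 'GBK';
    }
    $converted = iconv($charset, 'UTF-8', $html);
    if ($converted === false) {
        throw new RuntimeException("Could not convert page from $charset to UTF-8");
    }
    return $converted;
}

// Example: a tiny GBK page; the bytes D6 D0 CE C4 are two Chinese characters in GBK.
$html = '<meta http-equiv="Content-Type" content="text/html; charset=gbk"/>'
      . "<title>\xD6\xD0\xCE\xC4</title>";

$utf8 = to_utf8($html);

// The /u modifier makes preg_match treat both pattern and subject as UTF-8.
if (preg_match('/<title>(\p{Han}+)<\/title>/u', $utf8, $matches)) {
    echo $matches[1], "\n";
}
```

Two things to keep in mind: the converted string still contains the original charset=gbk declaration, so do not run it through to_utf8() a second time, and any Chinese characters typed into a pattern only match if the PHP source file itself is saved as UTF-8.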