regex for syntax highlighting

Forum Moderators: coopster

zRonin

11:03 pm on Jul 28, 2006 (gmt 0)

I want to find all <code></code> tags and replace the inner content with C++ style syntax highlighting.

I'm having a lot of trouble figuring out how to search for <code>(ANYTHING EXCEPT </code>)</code>.

Here's what I think should work: <code>[^(?:<\/code>)]+<\/code>

Any suggestions on how to get this to work?

[edited by: coopster at 4:12 pm (utc) on July 29, 2006; edit reason: Disable graphic smile faces]
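
For anyone wondering why the pattern in the question misbehaves: square brackets define a character class, so the group syntax inside them is read as a set of single characters rather than as the string "</code>". A minimal sketch, assuming two made-up test inputs, that seems to show the effect:

```php
<?php
// A character class is a set of single characters, so this class excludes
// each of ( ? : < / c o d e > ) individually, not the string "</code>".
$pattern = '#<code>[^(?:<\/code>)]+<\/code>#';

// No excluded character appears between the tags, so this matches.
$plain = preg_match($pattern, '<code>int x;</code>');

// "double" starts with "d", which the class excludes, so this fails.
$withD = preg_match($pattern, '<code>double x;</code>');

var_dump($plain, $withD); // int(1), int(0)
```

So the pattern only works while the code between the tags happens to avoid those particular characters.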

dreamcatcher

6:03 am on Jul 29, 2006 (gmt 0)

Hi zRonin,

preg_match [uk2.php.net] is ideal for grabbing data between tags:

$data = '<code>I am some code..</code>';
// preg_match() returns 1 on a match and fills $match; $match[1] holds the captured group.
preg_match("/<code>(.+)<\/code>/si", $data, $match);
$code = strip_tags($match[1]);
echo $code;

coopster

2:13 pm on Jul 29, 2006 (gmt 0)

You might also consider highlight_string [php.net] if it is PHP code you are highlighting.

adnovice88

4:03 pm on Jul 29, 2006 (gmt 0)

What dreamcatcher gave will work only if there is a single <code>...</code> pair.

If the tags appear several times, you need to use something like the following:

/\<code\>((?:(?!\<\/code\>).)*)/

My regex is written for Perl, but I'm guessing it should work with PHP as well.

Do test it rigorously.

[edited by: coopster at 4:13 pm (utc) on July 29, 2006; edit reason: Disable graphic smile faces]
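
A minimal sketch of how that lookahead pattern could be tried in PHP with preg_match_all. The sample string, the variable names, and the added "s" flag are assumptions for illustration, and the backslashes before < and > are dropped since they should not be needed:

```php
<?php
// Hypothetical sample input with two separate <code> blocks.
$text = 'abcdefg<code>hijklmn</code>opqrstu<code>vwxy</code>z';

// (?!<\/code>). consumes one character only if "</code>" does not start there.
// The "s" flag is an addition here so that "." also matches newlines.
$pattern = '/<code>((?:(?!<\/code>).)*)/s';

preg_match_all($pattern, $text, $matches);
print_r($matches[1]); // the two captures: hijklmn and vwxy
```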

zRonin

4:15 pm on Jul 29, 2006 (gmt 0)

Thanks for trying, but you're having the same problem I had in the beginning.

Given the following code:

abcdefg<code>hijklmn</code>opqrstu<code>vwxy</code>z

Your regular expression returns hijklmnopqrstuvwxy
I would like it to return hijklmn and vwxy respectively

Is there anything else you can suggest? The code I am highlighting is C++, but I need to run things through my own parser because people on my site need specific highlighting that most C++ users wouldn't.

$code = 'class CMatlabEng {
public:
<code>/* int OutputBuffer(char *p, int n);</code>
void OpenSingleUse(const char *startcmd, void *dcom, int *retstatus);
int GetVisible(bool* value);*/
int SetVisible(bool value); mxArray* GetVariable(const char* name);
<code> int PutVariable(const char *name, const mxArray *mp);
int EvalString(const char* string);
void Open(const char* StartCmd); int Close(); CMatlabEng();
//virtual ~CMatlabEng();</code>

protected:
Engine* pEng;
}; ';

$search = "#<code>(.+)<\/code>#si";

preg_replace_callback($search, 'fetch', str_replace(array("\n"," "," "),array(" "," "," &nbspl  "),$code));

function fetch($matches)
{
print_r($matches[1]);
// return parsed code (cut out for posting)
}

Edit: I have decided that I could convert all < and > within the code and then use the search string '#<code>([^<]+)<\/code>#si', but that feels like applying duct tape rather than actually getting the answer. What's more, I won't have learned the solution for next time.

coopster

4:21 pm on Jul 29, 2006 (gmt 0)

Quantifiers are greedy by default, and preg_match will only return the first match. If you use preg_match_all() [php.net] you can get each and every one back in an array. But actually, if I were you I would just use preg_replace_callback() [php.net] since you have your own function you want to run things through ...
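
A quick check of what the two functions return for the sample string from the post above. As far as I can tell, the greedy (.+) still produces one long match even with preg_match_all, so the quantifier needs attention as well. A minimal sketch; the variable names are made up:

```php
<?php
$text = 'abcdefg<code>hijklmn</code>opqrstu<code>vwxy</code>z';
$greedy = '#<code>(.+)</code>#si';

// preg_match stops after the first match it finds.
preg_match($greedy, $text, $first);

// preg_match_all keeps searching, but the greedy (.+) has already
// swallowed everything up to the last </code>, so only one match exists.
$count = preg_match_all($greedy, $text, $all);

echo $count, "\n";     // 1
echo $all[1][0], "\n"; // hijklmn</code>opqrstu<code>vwxy
```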

coopster

4:30 pm on Jul 29, 2006 (gmt 0)

Sorry, didn't see the preg_replace_callback in there (blushes)

You have it all good to go, except you need to use the Ungreedy modifier [php.net] in your pattern:

$search = "#<code>(.+)<\/code>#Uis";
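
A minimal sketch of that pattern run through preg_replace_callback on the short sample string from earlier in the thread. The callback is a hypothetical stand-in that simply upper-cases the inner text; a real highlighter would go in its place:

```php
<?php
$text = 'abcdefg<code>hijklmn</code>opqrstu<code>vwxy</code>z';
$search = '#<code>(.+)</code>#Uis';

// Hypothetical stand-in for the real highlighter: upper-case the inner text.
$fetchDemo = function (array $matches): string {
    return '<code>' . strtoupper($matches[1]) . '</code>';
};

$result = preg_replace_callback($search, $fetchDemo, $text);
echo $result, "\n"; // abcdefg<code>HIJKLMN</code>opqrstu<code>VWXY</code>z
```

Each <code> block reaches the callback separately, which is the behaviour asked for.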

bedlam

4:56 pm on Jul 29, 2006 (gmt 0)

I'll chime in too ;-)

I'm no regex expert, but everything I've done all week long has involved them for some reason...

First of all, the most appropriate function in PHP for this task is probably preg_replace_callback() [ca3.php.net]--since it allows you to take each match and do some work on it, as opposed to preg_match() and preg_match_all(), which simply return one or more matches from the original. I suspect it's better for this case too since it won't just be a simple replacement of one string with another (else we could use preg_replace()).

Having said that, I'm not quite clever enough to have done a tidy job of it--but here's what I came up with:

<?php
$source = 'Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas dictum sapien vitae neque. Maecenas dui. Vestibulum suscipit, magna ut sodales tempor, sem arcu mollis sapien, cursus mollis ipsum ipsum vel erat. Nulla tempor est at mauris. Proin ullamcorper tortor non tellus. Suspendisse pretium dui ut augue. Aliquam vitae mauris. Cras feugiat nulla a metus. Vivamus eu lectus at sem vestibulum pretium. Sed odio sapien, tempus at, venenatis id, faucibus ac, nulla. Aenean congue lacinia tortor. <code>Morbi</code> adipiscing, lorem a facilisis imperdiet, nulla ante venenatis erat, et <foo>nonummy mi</foo> massa sit amet <code>ipsum. Cras</code> et orci. Ut rutrum ultricies sapien. Ut ut urna in mauris ornare tincidunt. Ut magna sem, iaculis condimentum, laoreet et, elementum et, mi.<code>Aenean auctor placerat</code> orci. Quisque blandit sapien eu nisi dictum rhoncus.';
$pattern = '/<code>(.+?)<\/code>/';
function highlight($matches) {
// Manipulate content here--$matches[1] contains the content of the <code> element:
$output = '' . $matches[1] . '';
// Return the altered string:
return '<code>' . $output . '</code>';
}
echo preg_replace_callback($pattern, 'highlight', $source);
?>

It seems to me that it should have been possible to do this (maybe with lookahead?) without having to re-insert the code tags. However, it does seem to work in all the limited contexts I've tried it in--leaves other tags alone etc.

-b
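
On the lookahead idea: one possible way to avoid re-inserting the tags is to assert them with a lookbehind and a lookahead, so that only the inner text is matched and replaced. A rough sketch, assuming a short made-up input and an upper-casing callback in place of real highlighting:

```php
<?php
$text = 'abcdefg<code>hijklmn</code>opqrstu<code>vwxy</code>z';

// (?<=<code>) and (?=<\/code>) assert the tags without consuming them,
// so only the inner text is replaced.
$pattern = '/(?<=<code>).+?(?=<\/code>)/s';

$result = preg_replace_callback(
    $pattern,
    function (array $matches): string {
        // Hypothetical stand-in for real highlighting.
        return strtoupper($matches[0]);
    },
    $text
);
echo $result, "\n"; // abcdefg<code>HIJKLMN</code>opqrstu<code>VWXY</code>z
```

Whether this is tidier than re-adding the tags in the callback is a matter of taste; the result should be the same.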

zRonin

6:33 pm on Jul 29, 2006 (gmt 0)

Thanks so much for the replies. I've tested all the suggestions and different variations on them, and the one with the best success is the following:


/\<code\>((?:(?!\<\/code\>).)*)/

(This is the pattern I was trying to write in the beginning, but of course mine was invalid.)

[edited by: coopster at 9:47 pm (utc) on July 29, 2006; edit reason: Disable graphic smile faces]

coopster

9:47 pm on Jul 29, 2006 (gmt 0)

That's what is called a negative lookahead assertion (without capturing the subpattern). The "U" modifier achieves the same result here, but this particular "Ungreedy" modifier is not compatible with Perl, so it probably looks strange to any Perl hackers who come across it -- I know it did to me at first, as I had never seen that particular modifier until I started developing PHP regular expression patterns.
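
A small comparison of the two approaches on the sample string from earlier in the thread, as a sketch only (the variable names and the "s" flag on the lookahead pattern are assumptions). The captured groups come out the same, but the full matches differ, which could matter when replacing:

```php
<?php
$text = 'abcdefg<code>hijklmn</code>opqrstu<code>vwxy</code>z';

$lookahead = '/<code>((?:(?!<\/code>).)*)/s';
$ungreedy  = '#<code>(.+)</code>#Us';

preg_match_all($lookahead, $text, $a);
preg_match_all($ungreedy, $text, $b);

// Both capture the same inner text.
var_dump($a[1] === $b[1]); // bool(true)

// The full matches differ: the lookahead pattern, as posted, stops
// before the closing tag, while the U pattern includes it.
echo $a[0][0], "\n"; // <code>hijklmn
echo $b[0][0], "\n"; // <code>hijklmn</code>
```

With preg_replace_callback, the lookahead version would therefore leave the original </code> in place, so the callback should not add another one.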

adnovice88

3:34 am on Jul 30, 2006 (gmt 0)

PHP's U modifier is simpler and easier than Perl's way of making a match ungreedy.

coopster

9:27 pm on Jul 30, 2006 (gmt 0)

It's actually the quantifiers

(?, *, +, {})

that are greedy, not the whole pattern. So you can get the same effect in Perl (as demonstrated here in PHP) by using the non-greedy versions of the same quantifiers:

(??, *?, +?, {}?)

Perl regex:

/<code>(.+?)<\/code>/gis
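
For comparison, a sketch of how the same lazy pattern might be used from PHP, where preg_match_all plays roughly the role of Perl's /g. The second call illustrates a pitfall I believe is worth knowing: with the U modifier the meaning of the trailing ? is swapped, so combining the two makes the quantifier greedy again. The sample string and variable names are assumptions:

```php
<?php
$text = 'abcdefg<code>hijklmn</code>opqrstu<code>vwxy</code>z';

// Perl's /g roughly corresponds to calling preg_match_all in PHP.
preg_match_all('/<code>(.+?)<\/code>/is', $text, $lazy);
print_r($lazy[1]); // hijklmn and vwxy

// With U, the meaning of "?" after a quantifier is swapped,
// so .+? becomes greedy again and yields one long capture.
preg_match_all('/<code>(.+?)<\/code>/Uis', $text, $flipped);
echo count($flipped[1]), "\n"; // 1
```

In other words, pick either the U modifier or the lazy quantifiers, not both.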

adnovice88

4:52 pm on Jul 31, 2006 (gmt 0)

Cool. That simplifies my regex a lot. Thanks, coopster.