I need a regex expert! - PHP Server Side Scripting forum at WebmasterWorld - WebmasterWorld

andytwiz

10:01 pm on Mar 31, 2005 (gmt 0)

Hi

I've been working on this problem for ages now. I'm sure I'll finally crack it but I thought if there is a regex expert out there who could whip it up in a matter of seconds that'd be much more economical!

Here is my full problem (don't worry, this isn't a school assignment, it's from my own software).

Users can specify "templates" which need to be replaced in any text with the "result"

I first need to extract the template format (which is user defined) and then use this template to match and replace within text (I've been doing it with preg_replace_callback())

Template Format:

1) Start with ( { [ or < and end with the matching closing ) } ] or > (hereafter all referred to as brackets)

2) After the bracket may be any character(s) but they must be repeated at the end before the closing bracket

3) There COULD be space(s) after 1 and 2

4) At least one ALL CAPITAL word must follow

5) There may be "separators" between further words as in 4). Separators are non-alphabetic characters and must ALL be the same

Example Templates:

( TEST )
{@TEST¦HELLO ¦ MORE @} [the separator is ¦]

Invalid Templates:
<*YOU@> [no matching * before >]

Once the template format has been determined (i.e. the start code, e.g. {@, the separator, e.g. ¦, and the end code, e.g. @}), the text must be searched for any template matching it, and the matching words extracted and replaced with a specified string.

e.g.
Template: {@TEST¦HELLO ¦ MORE @}
Replacement: [this MORE is the TEST replacement HELLO]

where TEST, HELLO and MORE are found in the text and replaced in the replacement according to the template order.

Ok, I hope this is clear and someone can be bothered to read it and hopefully respond with a super good solution!

Cheers all

ergophobe

5:26 pm on Apr 1, 2005 (gmt 0)

I can't for the life of me get the backreference to a bracket to match.

For example, this matches fine

$tpl = "This is a A%TEMPLATE%A string";
$pattern = "/(.*)(A)([^a-zA-Z0-9])(TEMPLATE)(\\3)(\\2)(.*)/";
preg_match($pattern,$tpl,$match);

And this matches fine

$tpl = "This is a [%TEMPLATE%] string";
$pattern = "/(.*)(\[)([^a-zA-Z0-9])(TEMPLATE)(\\3)(\])(.*)/";
preg_match($pattern,$tpl,$match);

But no matter how many slashes and so on I use to try to escape it, I can't get anything like this to match:

$tpl = "This is a [%TEMPLATE%] string";
$pattern = "/(.*)(\[)([^a-zA-Z0-9])(TEMPLATE)(\\3)(\\2)(.*)/";
preg_match($pattern,$tpl,$match);

It's starting to bug me! Hopefully the elusive regex expert will show...

andytwiz

12:17 am on Apr 2, 2005 (gmt 0)

I've been having a similar problem along the way - I can get references e.g \\2 to work fine, but [^\\2] just doesn't work.. why?

killroy

10:52 am on Apr 2, 2005 (gmt 0)

The problem with negative assertions (what you are calling negative backreferences) has long been known, and they are best used with care.

Take this example:

ABC matches /A(?=BC).*/, and it does NOT match /A(?!BC).*/, because the assertion is tested right after the A, where BC follows. But put something of variable length in front of the assertion and it will still match. Why? ABC matches /A.*(?!BC)/ because the engine will match the A, let .* take BC, and then test (?!BC) at the end of the string, where nothing follows.

The way to make this reliable is to have the asserted part clearly delimited, for example /A..(?<!BC).*/: now it is forced to match BC onto the two dots, and then it can apply the negative lookbehind.

So I suppose the best thing to try is to match the exact characters first and then apply a negative lookbehind (?<!...) afterwards.
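
It is worth running these through preg_match to see exactly which ones match. The following is only a small sketch; the variable names are made up for illustration, and the patterns are the ones discussed above plus a variable-length variant.

```php
<?php
$subject = 'ABC';

// Positive lookahead: BC follows the A, so this matches.
$positive = preg_match('/A(?=BC).*/', $subject);

// Negative lookahead: tested right after the A, where BC follows, so no match.
$negative = preg_match('/A(?!BC).*/', $subject);

// With a variable-length part in front, the engine finds a position where
// the assertion holds: .* consumes "BC" and nothing follows at the end.
$backtracked = preg_match('/A.*(?!BC)/', $subject);

// Negative lookbehind after two forced characters:
// rejected when those two characters are BC, accepted otherwise.
$behindBC = preg_match('/A..(?<!BC).*/', 'ABCD');
$behindXY = preg_match('/A..(?<!BC).*/', 'AXYD');

var_dump($positive, $negative, $backtracked, $behindBC, $behindXY);
```

An assertion never consumes characters, so it only constrains the position the engine is currently at.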

Hope that helps ;)

SN

andytwiz

3:01 pm on Apr 2, 2005 (gmt 0)

Thanks for your reply and explanation.

I'm not quite sure how to implement this though (sorry, I've not been using regular expressions long).

The exact code I'm trying to use a negative backreference in is:

"/([(|{|\[|<]([^a-zA-Z]*))([^\\2]+)([\\2]*[)|}|\]|>])/";

Could you please let me know how to fix that?

killroy

4:20 pm on Apr 2, 2005 (gmt 0)

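[^\2] is never going to work, since [] indicates a character class, not a string. Inside a class \2 is not even a backreference: it is read as an octal escape, so [^\2] means one character which is NOT the control character with code 2, surely not what you want.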

(?!\2) means the following part cannot be equal to \2 and (?<!\2) means the previous part cannot be equal to \2.
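
To make the difference concrete, here is a rough sketch in PHP. The first three checks show what the character class really does; the last pattern is one possible way (an assumption on my part, not the only one) to say "any run of characters that does not contain what group 1 captured", using a negative lookahead in front of every single character. The sample string and variable names are invented.

```php
<?php
// Inside a character class, \2 is the character chr(2), not a backreference.
$classHitsTwo       = preg_match('/^[^\2]$/', '2');    // "2" is accepted
$classHitsBackslash = preg_match('/^[^\2]$/', '\\');   // "\" is accepted
$classHitsChr2      = preg_match('/^[^\2]$/', "\x02"); // only chr(2) is rejected

// (?:(?!\1).)+ consumes one character at a time, as long as the text
// captured by group 1 does not start at that position.
$pattern = '/\{([^a-zA-Z\s]+)((?:(?!\1).)+)\1\}/';
preg_match($pattern, 'x {@@ HELLO WORLD @@} y', $m);

var_dump($classHitsTwo, $classHitsBackslash, $classHitsChr2, $m);
```

Here $m[1] holds the start code and $m[2] everything between the two copies of it.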

[(|{|\[|<] That bit makes little sense: the | characters are just literal members of the class. I'm not entirely sure what you are trying to achieve. I think you should review some regex basics. Here are some guidelines:

[] always represents a character class: it means one character as described inside the brackets. So [abc] means one character which is either "a" OR "b" OR "c". The OR is implicit. Duplicate characters are ignored, so [abc] is the same as [aabbcc]. Most special characters are treated as literals inside [], since a character class has no use for them (apart from exceptions such as "\n", ranges like a-z and a leading ^). It follows that [a|b|c] is actually the same as [abc|] and means one character, either "a" OR "b" OR "c" OR "|". Also, if you want to include the closing bracket ] in the class it has to be escaped, so [\]] means one character which is "]" and [^\]] means one character which isn't "]".

() encloses a group of patterns which can be backreferenced. So (abc) matches the 3-character string "abc" and stores it in a backreference, while (abc|def) stores a 3-character string which is EITHER "abc" OR "def".

If you want to use () just for grouping but not as a backreference you can add the modifier ?: at the start. So (?:ab|de) means either "ab" or "de" but does not add it to the backreferences.

Also "?" means 0 or 1 "*" means 0 or more and "+" means 1 or more.
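
These basics are easy to check with a few one-liners. This is just an illustrative sketch with invented variable names and sample strings:

```php
<?php
// [a|b|c] is a class of four possible characters: a, b, c and a literal |.
$pipeInClass = preg_match('/^[a|b|c]$/', '|');

// Duplicates inside a class change nothing.
$duplicates = preg_match('/^[aabbcc]$/', 'b');

// An escaped ] is a literal member of the class.
$closingBracket = preg_match('/^[\]]$/', ']');

// (abc|def) is alternation between two strings and captures the one that matched.
preg_match('/(abc|def)/', 'xxdefxx', $captured);

// (?:...) groups without capturing, so (f) is group 1 here.
preg_match('/(?:ab|de)(f)/', 'def', $nonCapturing);

var_dump($pipeInClass, $duplicates, $closingBracket, $captured, $nonCapturing);
```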

Since matching brackets are not the same character, you cannot use backreferences for them and you will have to write special cases for each, unless you don't mind mixing brackets, like (@string@} for example.

Also you said "any character(s)" after the bracket. This is ambiguous, because for [ABCBA] you don't know if it's [A BCB A] or [AB C BA]. So I would suggest deciding: either a single character (any), or a string not beginning with a capital letter or white space, since the first word inside the brackets has to be a capital word.

Ok, let's see if I can write something that does what u want it to do...

\(([^\s]?)\s*([A-Z]+(?:\s*([^a-zA-Z])\s*[A-Z]+(?:\s*\3\s*[A-Z]+)*)?)\s*\1\)

OK, this I tested on these expressions:

MATCH ( ABC )
MATCH ( ABC | DEF )
MATCH (ABC|DEF)
MATCH ( ABC | DEF | GHI )
MATCH (ABC|DEF|GHI)
MATCH (@ABC|DEF@)
NO MATCH (@ABC|DEF#)    Different bracket characters
NO MATCH (@ABC|DEF/EFG#)   Different word separators
MATCH ( ABC)
MATCH (? ABC |BLAH| GHI?)
NO MATCH (? ABC | blah | GHI?) Not all words are in capitals
NO MATCH (?ABC|blah|GHI?)  Not all words are in capitals
MATCH (?ABC|BLAH|GHI?)
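
A short PHP sketch for re-running a selection of these cases. It assumes the ASCII pipe | as the separator (a multi-byte separator would need the u modifier), and the last case is an extra one added here so that the separator check is exercised on its own, with matching bracket characters.

```php
<?php
$pattern = '/\(([^\s]?)\s*([A-Z]+(?:\s*([^a-zA-Z])\s*[A-Z]+(?:\s*\3\s*[A-Z]+)*)?)\s*\1\)/';

// subject => expected preg_match() result
$cases = [
    '( ABC )'               => 1,
    '(ABC|DEF)'             => 1,
    '( ABC | DEF | GHI )'   => 1,
    '(@ABC|DEF@)'           => 1,
    '(@ABC|DEF#)'           => 0, // different bracket characters
    '(@ABC|DEF/EFG#)'       => 0,
    '( ABC)'                => 1,
    '(? ABC |BLAH| GHI?)'   => 1,
    '(? ABC | blah | GHI?)' => 0, // not all words are in capitals
    '(?ABC|BLAH|GHI?)'      => 1,
    '(@ABC|DEF/EFG@)'       => 0, // added case: only the separators differ
];

$results = [];
foreach ($cases as $subject => $expected) {
    $results[$subject] = preg_match($pattern, $subject);
    printf("%-8s %s\n", $results[$subject] ? 'MATCH' : 'NO MATCH', $subject);
}
```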

Again, since opening and closing brackets are different characters ( ( != ) and { != } ), we have to repeat the pattern for each type of bracket, separate the alternatives with "|" and enclose the whole in (?:) so the wrapper itself doesn't add a new capturing subpattern. Capture groups are numbered across the whole pattern, though, so the backreferences in each alternative have to point at that alternative's own groups.

So this is the final pattern:

\(([^\s]?)\s*([A-Z]+(?:\s*([^a-zA-Z])\s*[A-Z]+(?:\s*\3\s*[A-Z]+)*)?)\s*\1\)
|\{([^\s]?)\s*([A-Z]+(?:\s*([^a-zA-Z])\s*[A-Z]+(?:\s*\6\s*[A-Z]+)*)?)\s*\4\}
|\[([^\s]?)\s*([A-Z]+(?:\s*([^a-zA-Z])\s*[A-Z]+(?:\s*\9\s*[A-Z]+)*)?)\s*\7\]
|\<([^\s]?)\s*([A-Z]+(?:\s*([^a-zA-Z])\s*[A-Z]+(?:\s*\12\s*[A-Z]+)*)?)\s*\10\>

Now, to use this in an environment that encloses the pattern in // such as JS or PHP, you will have to escape any slashes inside the pattern, and in a double-quoted PHP string the backslashes need doubling too (or use a single-quoted string).

Of course, if you don't need the word separators to be the same character, the pattern can be greatly simplified. Currently it accepts any single non-white-space character as the bracket character, so [ABA] is seen as [A B A], NOT [ ABA ]. If you want to allow longer strings for the bracketing you will also have to modify the pattern. As it is, the bracketing of a string like "<?php Hello World php?>" would be read like this: "<?" "php Hello World php" "?>" which may not be what you want.
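
For what it's worth, here is a rough sketch of how the whole thing could be wired into preg_replace_callback(). Several things in it are assumptions rather than part of the pattern above: it uses the branch reset group (?|...), which needs PCRE 7.2 or later (PHP 5.2.4+) and lets every alternative keep \1 and \3; the function name expandTemplates and its parameters are invented; and it guesses that the words in the replacement string are placeholders filled in by position.

```php
<?php
// One bracket-independent body; inside (?|...) each alternative numbers
// its groups from 1 again, so \1 and \3 stay valid in all four.
$body = '([^\s]?)\s*([A-Z]+(?:\s*([^a-zA-Z])\s*[A-Z]+(?:\s*\3\s*[A-Z]+)*)?)\s*\1';
$pattern = '/(?|\(' . $body . '\)|\{' . $body . '\}|\[' . $body . '\]|<' . $body . '>)/';

// Hypothetical helper: $names are the words of the user's template, in order;
// each one found in $replacement is swapped for the word at the same position.
function expandTemplates(string $text, string $pattern, array $names, string $replacement): string
{
    return preg_replace_callback($pattern, function (array $m) use ($names, $replacement) {
        // $m[2] holds the words plus separators; pull out the capital words.
        preg_match_all('/[A-Z]+/', $m[2], $words);
        if (count($words[0]) !== count($names)) {
            return $m[0]; // different number of words: leave the text untouched
        }
        return strtr($replacement, array_combine($names, $words[0]));
    }, $text);
}

$names = ['TEST', 'HELLO', 'MORE'];
$replacement = '[this MORE is the TEST replacement HELLO]';
$out = expandTemplates('Before {@ONE|TWO | THREE @} after', $pattern, $names, $replacement);
echo $out, "\n"; // Before [this THREE is the ONE replacement TWO] after
```

Building the pattern from one $body string keeps the four alternatives from drifting apart when the pattern is changed later.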

Let me know if you are unclear about anything and I'll try to clarify.

Regards,
Sven

[edited by: ergophobe at 4:41 pm (utc) on April 2, 2005]
[edit reason] fixed sidescroll [/edit]

andytwiz

5:05 pm on Apr 2, 2005 (gmt 0)

Thanks, that's fantastic.

You were right, I didn't really understand the [] but I do now :D

killroy

5:11 pm on Apr 2, 2005 (gmt 0)

And BTW, (A|B|C) matches the same single characters as [ABC]

SN

ergophobe

5:17 pm on Apr 2, 2005 (gmt 0)

Argghhh I'm an idiot! Obviously

$pattern = "/(.*)(\[)([^a-zA-Z0-9])(TEMPLATE)(\\3)(\\2)(.*)/";

is not going to match as it would only match

$tpl = "This is a [%TEMPLATE%[ string";

I don't know what I was [not] thinking!
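
For anyone landing here later, a small sketch of the point: a backreference repeats the captured text itself, so a backreference to "[" can only ever match another "[". One way around it (an assumption, not the only approach) is to look up the closing bracket and put it into the pattern as a literal. The $closers map and bracketPattern() are invented names.

```php
<?php
$pattern = '/(.*)(\[)([^a-zA-Z0-9])(TEMPLATE)(\3)(\2)(.*)/';

// \2 repeats the captured "[", so the closing "]" can never satisfy it.
$withClosing = preg_match($pattern, 'This is a [%TEMPLATE%] string');
$withOpening = preg_match($pattern, 'This is a [%TEMPLATE%[ string');

// Map each opening bracket to its partner and insert both as escaped literals.
$closers = ['(' => ')', '{' => '}', '[' => ']', '<' => '>'];

function bracketPattern(string $open, array $closers): string
{
    return '/' . preg_quote($open, '/')
        . '([^a-zA-Z0-9])(TEMPLATE)\1'
        . preg_quote($closers[$open], '/') . '/';
}

$fixed = preg_match(bracketPattern('[', $closers), 'This is a [%TEMPLATE%] string');

var_dump($withClosing, $withOpening, $fixed);
```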