ignore whitespace during search and replace

         

ergophobe

9:06 pm on Apr 4, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have a file which may contain strings that need replacing, similar to Smarty templates. For example,

{* EMAIL *}

gets replaced with

somebody@somewhere.tld

I want to ignore (or eliminate) any whitespace between the {* *} delimiters, but nowhere else. Basically I'm looking for the equivalent of a regular expression that would not take account of whitespace, sort of like

$pattern = "/EMAIL/i"

would not take account of case. The /x modifier will not do what I want since it ignores whitespace in the pattern, not in the text being searched.

The only thing I can think of for the moment is to explode, process, and implode the string, but I figure there has to be a better way.

Tom

ergophobe

9:46 pm on Apr 4, 2004 (gmt 0)

I can, by the way, get this to work using looping, and that may be the only way:

$string = "adf asdf asdf asdf asd{* EMAIL*} asdfasdf asdf asdf{* E MAIL *}asdf asdf {*E MA I L *}";

// if there's whitespace between {* and the closing *, keep looping
$match = "/\{\*[^\s\*]*\s/i";
while (preg_match($match, $string)) {
    // strip one whitespace character from each placeholder per pass
    $pattern = "/(\{\*[^\s\*]*)\s/i";
    $replace = "$1";
    $string = preg_replace($pattern, $replace, $string);
}

// no whitespace, now do the replacement I want
$pattern = "/\{\*EMAIL\*\}/i";
$replace = "email@domain.com";
$string = preg_replace($pattern, $replace, $string);

Since regex is fairly processor intensive, I was hoping to find a better way.
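
For what it's worth, the loop above can probably be collapsed into a single pass with preg_replace_callback. This is only a sketch, and it assumes PHP 7 or later (much newer than this thread). The function name collapsePlaceholderWhitespace is made up for illustration.

```php
<?php
// Hypothetical helper: remove whitespace inside {* ... *} and nowhere else.
function collapsePlaceholderWhitespace(string $text)
{
    return preg_replace_callback(
        '/\{\*(.*?)\*\}/s',              // one placeholder, contents captured lazily
        function (array $m) {
            // Strip every whitespace character from the captured contents.
            return '{*' . preg_replace('/\s+/', '', $m[1]) . '*}';
        },
        $text
    );
}

$string = 'asd{* EMAIL*} asdf{* E MAIL *}asdf {*E MA I L *}';
echo collapsePlaceholderWhitespace($string), "\n";
// prints: asd{*EMAIL*} asdf{*EMAIL*}asdf {*EMAIL*}
```

After that, the plain /\{\*EMAIL\*\}/i replacement from the post works unchanged. The regex engine walks the string once here instead of once per whitespace character.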

ergophobe

5:13 pm on Apr 6, 2004 (gmt 0)

bump

Anyone have a better way?

coopster

6:11 pm on Apr 6, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Better? Hmmm, I don't know if I would go so far as to say better. But you could build your regex to find optional space characters between the lettering you expect, in this case "E-M-A-I-L"...

It isn't pretty...

$pattern = "/\{\s*\*\s*E\s*M\s*A\s*I\s*L\s*\*\s*\}/Uis"; 
$replacement = "email@domain.com";
$newstring = preg_replace($pattern, $replacement, $string);
...but it should do the job.

Note: This regex is also checking for optional white space between the braces and the asterisks...

timster

9:08 pm on Apr 6, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You can let PHP build the "spacey" expression itself like this:

$pattern = "{*EMAIL*}"; # or anything
$spaceyPat = patternAllowsSpaces($pattern);
$newstring = preg_replace("/$spaceyPat/", $replacement, $string);

function patternAllowsSpaces($mystring) {
    $mystring = preg_quote($mystring);
    $mystring = preg_replace("/([^\\\])(?=.)/", "$1\s*", $mystring);
    return $mystring;
}

coopster

9:31 pm on Apr 6, 2004 (gmt 0)

Nice. I like that.

ergophobe

10:53 pm on Apr 6, 2004 (gmt 0)

Timster,

Thanks! I'll check that out.

Coopster, thanks for the idea. I should have specified that I won't actually know what the search string is except at run time. I will want to match the value, whatever it is (e.g. EMAIL), with something (probably the name of a constant or perhaps a value in a DB). To make your solution work, I would need to parse the string one character at a time to make the "pattern" part... or use Timster's fancy regex.

Essentially, these are meant to be template variables that the webmaster can use and I'm trying to make it so the script is as liberal as possible in what it accepts.
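
One way to handle names that are only known at run time is to match any {* ... *} placeholder, normalise whatever is inside it, and look that up in a table. This is a rough sketch, assuming PHP 7 or later. The function name expandTemplate and the $values array are invented for the example; the real values might come from constants or a DB as described above.

```php
<?php
// Hypothetical helper, not from the thread: fill {* NAME *} placeholders
// from a lookup table that is only known at run time.
function expandTemplate(string $text, array $values)
{
    return preg_replace_callback(
        '/\{\s*\*(.*?)\*\s*\}/s',
        function (array $m) use ($values) {
            // Drop all whitespace inside the delimiters and ignore case.
            $name = strtoupper(preg_replace('/\s+/', '', $m[1]));
            // Unknown names are left in the text untouched.
            return array_key_exists($name, $values) ? $values[$name] : $m[0];
        },
        $text
    );
}

$values = ['EMAIL' => 'somebody@somewhere.tld']; // assumed data source
echo expandTemplate('Write to {* e MA il *} or call {*PHONE*}', $values), "\n";
// prints: Write to somebody@somewhere.tld or call {*PHONE*}
```

Leaving unknown placeholders alone is a design choice for the sketch; a stricter script might log them or replace them with an empty string.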

ergophobe

11:45 pm on Apr 6, 2004 (gmt 0)

Okay,

Now looking at it, I have to fess up it's brilliant, I just wish I understood it. I don't understand this part:

[^\\\]

I would assume that the first \ escapes the second and the third escapes the ], which would screw up the character class. Obviously that's not the case. I get the lookahead, the backreference and all that. I just don't get the character class.

Works great, I just can't break it down correctly.

coopster

12:56 pm on Apr 7, 2004 (gmt 0)

Yeah, it is very difficult to get a grip on. I posted my research and notes in message #5 of this thread [webmasterworld.com]. Hopefully somebody else can add to it or correct the thought process if it is misleading...

timster

12:58 pm on Apr 7, 2004 (gmt 0)

[^\\\]

The purpose of that homely little snippet is to match anything that is not a backslash. (The preceding line, preg_quote, puts a backslash in front of every character that has special meaning in a regex, and we don't want to add "\s*" after those backslashes.)

The square brackets make a character class. If you write [a7,] that would match anything that's an "a", a "7" or a ",". The ^ (caret) at the beginning of the character class negates it, so it matches anything that isn't in the brackets. Since the backslash has a special meaning, it has to be escaped (with another backslash).

But I confess, I don't really know why PHP demands three backslashes here instead of 2, except to say it seems to have something to do with how the line gets interpreted. (Only 2 backslashes are required in Perl or Grep to do this. Can anyone explain?)

It should be noted that the subroutine I posted won't work properly on any input string that contains a backslash.
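
On the three-backslash question, it may help to look at the string PHP actually hands to the regex engine. A small sketch (the variable names are only for illustration): in a PHP string literal \\ collapses to one backslash, while \] is not a PHP escape, so both characters are kept.

```php
<?php
// What the regex engine receives after PHP has parsed each string literal.
$three = "[^\\\]";   // \\ becomes \, then \] is kept as is: the regex is [^\\]
$two   = "[^\\]";    // \\ becomes \, then ] follows: the regex is [^\]
$four  = "[^\\\\]";  // \\ twice: also [^\\], the unambiguous spelling

echo $three, "\n";   // prints: [^\\]
echo $two, "\n";     // prints: [^\]
var_dump($three === $four);                    // bool(true)

var_dump(preg_match('/' . $three . '/', 'a')); // int(1): "a" is not a backslash
var_dump(@preg_match('/' . $two . '/', 'a'));  // bool(false): the class never closes
```

So with three backslashes the regex engine sees an escaped backslash inside the class, and with two it sees an escaped ] and no end to the class. Perl and grep need only two because there is no string-literal pass in front of the pattern.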

coopster

1:20 pm on Apr 7, 2004 (gmt 0)

Actually, it will, timster, if you use the optional second parameter of preg_quote:
$mystring = preg_quote($mystring, '/');

ergophobe

3:50 pm on Apr 7, 2004 (gmt 0)

I don't really know why PHP demands three backslashes here instead of 2,

That's the part that threw me. I was thinking "The first backslash escapes the second one, so that's 'not backslash', but then the third one escapes the bracket and that screws up the character class".

Coopster's explanation in the thread he referenced makes sense (or let me say, it has a simple rule that's applicable and easy to remember: "PHP parses the string first, then sends it to the regex engine". Whether or not it makes sense is another matter).

Sure enough, if I remove the third \, I get the php warning:

Warning: Compilation failed: missing terminating ] for character class at offset 11

So in other words, that parses out to "\]" and escapes the bracket closing the character class - exactly the effect I expected the three slashes to have!

Whew!

timster

4:06 pm on Apr 7, 2004 (gmt 0)

Thanks Coopster, that line fixes the script for slashes ('/') and I didn't see that was broken.

My warning was about backslashes ('\'). My regex didn't add a "\s*" after a backslash, but I went ahead and fixed that.

function patternAllowsSpaces($mystring) {
    $mystring = preg_quote($mystring, '/');
    $mystring = preg_replace("/([\\\]{2}|[^\\\])(?=.)/", "$1\s*", $mystring);
    return $mystring;
}
# [\\\]{2} matches exactly 2 backslash characters, which means a literal backslash
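
For anyone who wants to see what the function above produces, here is a sketch of the same idea written with single-quoted strings and four backslashes, which sidesteps the three-backslash puzzle. It assumes PHP 7 or later, and the name spaceTolerantPattern is made up for the example.

```php
<?php
// Hypothetical rewrite of the function above, same technique.
function spaceTolerantPattern(string $literal)
{
    // Escape regex metacharacters and the / delimiter.
    $quoted = preg_quote($literal, '/');
    // After every escaped backslash (two characters) or every character that
    // is not a backslash, allow optional whitespace, except at the very end.
    // Four backslashes in PHP source reach the regex engine as two.
    return preg_replace('/(\\\\{2}|[^\\\\])(?=.)/s', '$1\s*', $quoted);
}

$pattern = spaceTolerantPattern('{*EMAIL*}');
echo $pattern, "\n";
// prints: \{\s*\*\s*E\s*M\s*A\s*I\s*L\s*\*\s*\}

echo preg_replace('/' . $pattern . '/i', 'email@domain.com', 'Write to {* E MA I L *} today'), "\n";
// prints: Write to email@domain.com today
```

The generated pattern is the same one coopster wrote out by hand earlier, only built at run time.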

ergophobe

4:35 pm on Apr 7, 2004 (gmt 0)

Thanks guys.

I'm glad I posted. I learned a fair bit about some of the obscurities (to me anyway) of regular expressions. I would never have gotten that on my own.

BTW, this thread also makes me think that it's better to single quote replacement strings with backreferences. I hadn't thought before of what would happen in PHP if I had a variable named $1 and a replacement pattern of "$1\s*".

Tom
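
A quick sketch on the quoting point. PHP variable names cannot start with a digit, so "$1" is never interpolated and the double-quoted version happens to be safe. Single quotes still remove the need to think about it, and they matter as soon as the replacement contains something that does look like a variable. The $name variable below is invented for the example.

```php
<?php
$double = "$1\s*";   // $1 cannot be a variable and \s is not a PHP escape
$single = '$1\s*';   // same characters, with no interpolation to worry about
var_dump($double === $single);   // bool(true)

$name  = 'oops';
$risky = "$name\s*";             // interpolated by PHP before preg_replace sees it
echo $risky, "\n";               // prints: oops\s*
echo '$name\s*', "\n";           // prints: $name\s*
```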

coopster

4:42 pm on Apr 7, 2004 (gmt 0)

Nice work, timster. I like the function here. Have saved for future use.

BTW, anybody copying and pasting this for use: WebmasterWorld changes the pipe symbol into a broken bar (¦) when a message is posted, so make sure the alternation in your code is a real pipe (|) typed from your keyboard.