Forum Moderators: coopster


preg replace with (this) or (that)

         

csdude55

4:06 am on Sep 21, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm currently running two consecutive replaces:

$str = preg_replace('#\s{2,}#', ' ', $str);
$str = preg_replace('#\s?([=),;+])\s?#', '$1', $str);


I realize that I could mash these into one by doing:

// using /x here for readability
$str = preg_replace('#
\s?([=),;+])\s? |
(\s){2,}
#x', '$1$2', $str);


Do you see any problem with this? I think it's OK, but using both $1 and $2 like that scares me.

lucy24

5:29 am on Sep 21, 2022 (gmt 0)




\s?([=),;+])\s?
Don't you mean
\s*([=),;+])\s*
to account for punct preceded or followed by multiple spaces?

Yeah, I'd test a few sample strings to make sure it doesn't fly into a tizzy when asked to use a nonexistent capture (either $1 or $2). If it simply comes through as "" then all is well.
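A quick way to run that test, as a sketch: the function name `squeeze_combined` and the sample strings are made up for illustration, and the pattern is the combined one from the first post with `\s*` swapped in. In PHP's preg_replace, a reference to a group that did not take part in the match should come through as an empty string.

```php
<?php
// Combined pattern from the thread, using \s* around the punctuation.
// Group 1 holds the punctuation; group 2 holds the LAST whitespace
// character of a run, because (\s){2,} repeats a capturing group.
function squeeze_combined(string $str): string
{
    return preg_replace('#\s*([=),;+])\s*|(\s){2,}#', '$1$2', $str);
}

// Hypothetical sample strings, not taken from the thread.
foreach (["a  =  b", "a    b", "x ,y", "a\t\nb"] as $sample) {
    var_dump(squeeze_combined($sample));
}
```

One difference from the two-step version shows up in the last sample: the run "\t\n" collapses to "\n" rather than to a plain space, since `$2` is whatever whitespace character ended the run.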

Got a vague notion I just recently had something similar where it spat out a bit of garbage, but that was in javascript, whose behavior can't be relied on to be identical.

csdude55

6:43 pm on Sep 21, 2022 (gmt 0)




Don't you mean...

Good catch, thanks!

So far I'm not seeing any problems in tests, but when I go live and have a million hits there's no telling what'll happen. I can't think of a situation where both $1 and $2 will match, so in theory it should be OK...

Thinking about it, would this be the same?

$str = preg_replace('#
\s*
(
[=),;+] |
\s
)+
\s*
#x', '$1', $str);


I mashed the \s into the first group instead of making a second one, then added a + after the group.

It FEELS the same, and preliminary tests have the same result.

lucy24

6:59 pm on Sep 21, 2022 (gmt 0)




[=),;+]|\s
Do they have to be separate at all?
\s*([=),;+\s])\s*
It does create a tiny bit of a hiccup if the string is made up entirely of spaces--RegEx picks up all of them and then has to take one step back for “Oh, oops, I was supposed to leave room for the last space”--but you do end up capturing either the punct or a single space, which would be what you want to end up with.
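To see the single-class version in action, here is a small sketch. The function name `squeeze_single_class` and the inputs are hypothetical; the pattern is the one quoted just above.

```php
<?php
// One character class covers both the punctuation and whitespace.
// The leading \s* has to give back one character when the run is
// whitespace only, so the group always captures exactly one character.
function squeeze_single_class(string $str): string
{
    return preg_replace('#\s*([=),;+\s])\s*#', '$1', $str);
}

echo squeeze_single_class("a   b ;  c"), "\n"; // a b;c
echo squeeze_single_class("    "), "|\n";      // one space, then |
```

A run of plain whitespace shrinks to its last character, and whitespace around punctuation disappears entirely.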

Pattern can contain close-parenthesis but not open-parenthesis? Or is that just an artifact of posting?

In the specific context of this function, does it matter if you end up with a " " space, an nbsp, a line break, a tab, or, or, or? In generated HTML, I guess it really doesn't matter; they'll all come through to the user as “ ”.

csdude55

7:13 pm on Sep 21, 2022 (gmt 0)




Do they have to be separate at all?

Touche. I guess not, really.

If I leave the + off of the group, and both the opening and closing \s are optional, then wouldn't it unnecessarily replace every whitespace with a whitespace?

Pattern can contain close-parenthesis but not open-parenthesis? Or is that just an artifact of posting?

The plan here is to include the CSS file on the first load between <style></style> tags, then the bottom of the page loads the CSS file to cache. Then I set a cookie, so subsequent pageviews (cookie exists) just load the CSS via the regular <link> tag.

This shaved about 250ms off of the initial page load... or usable time, I guess I could say.

These regexes are really just minimizing the CSS before writing it to the HTML. The only time there are ( ) is when it's something like gradient(...), so it's possible to have a space after the ) but never before.

If it misses a space here or there it's not a tragedy, I'm just trying to make it as small as possible to improve load time.

lucy24

7:25 pm on Sep 21, 2022 (gmt 0)




Dang, I was just rushing in to point out one further problem. If your pattern says (blahblah)+ then $1 will be only the last occurrence of blahblah, not the whole run. To get the whole thing, it would have to be ((blahblah)+) or ((?:blahblah)+).
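A small check of what a quantified group actually captures in PCRE, sketched with made-up inputs. The `$lossy` pattern is the `(...)+` version posted earlier in the thread (without the /x layout); the `$kept` pattern wraps the repetition in an outer capturing group as suggested above.

```php
<?php
// A repeated capturing group keeps only its final iteration.
preg_match('#([a-c])+#', 'abc', $m);
var_dump($m[1]); // string(1) "c"

// Wrapping the repetition in an outer group captures the whole run.
preg_match('#((?:[a-c])+)#', 'abc', $n);
var_dump($n[1]); // string(3) "abc"

// Consequence for the minifier: the semicolon is matched but then lost,
// because the final iteration of the group was the space after it.
$lossy = preg_replace('#\s*([=),;+]|\s)+\s*#', '$1', 'a ; b');
$kept  = preg_replace('#\s*((?:[=),;+]|\s)+)\s*#', '$1', 'a ; b');
var_dump($lossy); // string(3) "a b"
var_dump($kept);  // string(4) "a; b"
```

So with a `+` on the capturing group, "preliminary tests have the same result" can hold only as long as no punctuation is followed by whitespace.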

wouldn't it unnecessarily replace every whitespace with a whitespace?
I guess. End result the same, but a smidgen of extra work. If there's an enormous number of them, you could do some benchmark testing to see if there's any meaningful difference.

Anyway! Ah, minifying CSS. Yes, it does help to get a sense of what the function is actually supposed to do. I suspect this is one of those situations where keeping it as two steps would end up being simpler. First replace all \s{2,} (or, at a savings of one byte, \s\s+) with " " since CSS doesn't care what kind of space it is. And then replace the pair punct + space with punct alone.
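The two-step approach, wrapped in a function as a minimal sketch. The name `minify_two_step` and the sample CSS are hypothetical; the two patterns are the ones from the opening post, with `\s*` around the punctuation.

```php
<?php
// Step 1: collapse runs of two or more whitespace characters to one space.
// Step 2: drop whitespace on either side of the listed punctuation.
function minify_two_step(string $css): string
{
    $css = preg_replace('#\s{2,}#', ' ', $css);
    return preg_replace('#\s*([=),;+])\s*#', '$1', $css);
}

echo minify_two_step("a {  color: red ;  margin: 0 , 1 }"), "\n";
// a { color: red;margin: 0,1 }
```

Note that step 1 leaves a lone tab or line break untouched, since `\s{2,}` needs at least two characters; using `\s+` there would normalize those as well.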

csdude55

4:09 am on Sep 22, 2022 (gmt 0)




That was how I originally had it, @lucy24, but I thought I might shave a little load time if I could remove one of those preg_replace() functions.

I technically had it as 3:

// convert comments, line breaks, and tabs to a single whitespace
$str = preg_replace('#/\*[^*]*\*+([^/][^*]*\*+)*/|\r\n|\r|\n|\t#', ' ', $str);

// remove repeating whitespace
$str = preg_replace('#\s{2,}#', ' ', $str);

// remove opening or trailing whitespace, or whitespace that's
// following certain characters
$str = preg_replace('#^\s|\s$|\s?([=),;+])\s?#', '$1', $str);

lucy24

3:15 pm on Sep 22, 2022 (gmt 0)




If step 1 is to change all spaces to " " then any and all following lines can say " " instead of \s.

I note the \r\n because that was another thing I thought of after last posting: If you're composing the code on a Windows machine, AND it's a Windows-based server, then every line break is actually two characters. So right there you've reduced a lot of characters.

Writing this out to see if the comments package comes out to the same as what you've got:
((/\*+[^*]*\*+/(\s*/\*+[^*]*\*+/)*))
(wow! those asterisks make it look confusing, don't they)
though that lone \s could now be expressed as " " (quotation marks for posting purposes)
and for the last bit you could just say
[\r\n\t]+

Another way to express, er, non-space spaces would be
[^\S ]
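A short sketch of what that class matches, using a made-up input string: every whitespace character except the literal space.

```php
<?php
// [^\S ] reads as "not a non-space, and not a literal space",
// i.e. tabs, line breaks and other whitespace, but never ' ' itself.
$str = "a\tb\nc d\r\ne";
$out = preg_replace('#[^\S ]#', ' ', $str);

var_dump($out);                         // "a b c d  e" (CR and LF each become a space)
var_dump(preg_match('#[^\S ]#', ' '));  // int(0): a plain space does not match
```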

Edit to change from [ code ] to [ fixed ] to eliminate confusing unintended italics and color highlighting.

csdude55

6:34 pm on Sep 22, 2022 (gmt 0)




I've never been 100% about that... I'm coding on Windows, but the server is Linux. Does that mean that I can forget about \r\n? I've used \r\n|\r|\n for as long as I can remember.

Thinking about it, doesn't "\s" include [\n\r\t], as well as ' '?

I guess that, really, since I'm initially converting \r\n|\r|\n|\t to ' ' and then removing duplicates, I could replace them all with \s in the first replace instead of testing for each of them manually.
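As a sketch of that idea (the function name `normalize_whitespace` and the input are hypothetical): since `\s` already covers \r, \n and \t, a single `\s+` replacement handles Windows and Unix line endings, tabs, and repeated spaces in one pass.

```php
<?php
// \s+ swallows any mix of CR, LF, tab and space as one run,
// so a separate \r\n|\r|\n|\t alternation is not needed.
function normalize_whitespace(string $str): string
{
    return preg_replace('#\s+#', ' ', $str);
}

var_dump(normalize_whitespace("a\r\n\tb  c\nd")); // string(7) "a b c d"
```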

lucy24

9:18 pm on Sep 22, 2022 (gmt 0)




doesn't "\s" include [\n\r\t], as well as ' '
Yes, that's what I meant by “non-space spaces” :) If you wanted to pick out all space characters except the literal " " you'd have to do something like [^\S ] meaning “neither a non-space nor a literal %20 space”.

There may be a setting in whatever program you use for uploading that lets you tell it how to do line breaks. (Digression: Years ago when I was involved with {organization that treated Windows as the norm} I had to set my text editor to \r\n (CRLF) so things wouldn't break at their end. It was very annoying because SubEthaEdit didn't recognize the $ anchor if the immediately following character was \r (CR); it only worked with \n (LF). Many of my RegExes got vastly simpler when I was able to switch to \n alone.)

:: poring over Fetch prefs ::

Huh. You can set line endings for download but not upload. Oh well then.

csdude55

5:10 am on Sep 23, 2022 (gmt 0)




I was looking at this regex that you posted, but got lost:

// using /x for readability
$str = preg_replace('#
(
(
/\*+
[^*]*
\*+
/
(
\s*
/
\*+
[^*]*
\*+
/
)*
)
)#x', ' ', $str);


Was this in lieu of this one?

$str = preg_replace('#
/\*
[^*]*
\*+
(
[^/]
[^*]*
\*+
)*
/#x', ' ', $str);


What was the advantage? I obviously see the advantage of using /\*+ (since I do often use /*** comment ***/), but after that I got lost.

BTW, I did a speed test between these two over 1000 iterations:

// First option, 0.95533585548401s
$str = preg_replace('#^ +| +$| *([=),;+ ])+ *#', '$1', $str);
$str = preg_replace('#/\*+[^*]*\*+([^/][^*]*\*+)*/|\s+#', ' ', $str);

// Second option, 0.62886810302734s
$str = preg_replace('#^ +| +$| *([=),;+ ])+ *#', '$1',
preg_replace('#/\*+[^*]*\*+([^/][^*]*\*+)*/|\s+#', ' ', $str)
);


Both are pretty slow, of course, which is why I'm trying to minimize it as much as I can. But the second option (with nested preg_replace()) is consistently faster!
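For anyone wanting to repeat the comparison, a rough timing harness might look like the sketch below. The helper `time_it`, the two wrapper functions, and the sample CSS are all hypothetical; `hrtime()` needs PHP 7.3 or later. One thing to watch: as posted, the first option runs the trim/punctuation pattern before the comment pattern, while the nested version runs them in the opposite order, so the two are not doing identical work. The sketch applies the same order in both so that only the call style differs. The patterns are copied unchanged from the post.

```php
<?php
const COMMENTS_AND_SPACE = '#/\*+[^*]*\*+([^/][^*]*\*+)*/|\s+#';
const TRIM_AND_PUNCT     = '#^ +| +$| *([=),;+ ])+ *#';

// Two separate statements, comments and whitespace first.
function minify_sequential(string $str): string
{
    $str = preg_replace(COMMENTS_AND_SPACE, ' ', $str);
    return preg_replace(TRIM_AND_PUNCT, '$1', $str);
}

// Same order of operations, written as one nested call.
function minify_nested(string $str): string
{
    return preg_replace(TRIM_AND_PUNCT, '$1',
        preg_replace(COMMENTS_AND_SPACE, ' ', $str));
}

// Runs $fn repeatedly and returns the elapsed time in seconds.
function time_it(callable $fn, int $iterations = 1000): float
{
    $start = hrtime(true); // monotonic clock, nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (hrtime(true) - $start) / 1e9;
}

// Made-up input; a real test would load the actual stylesheet.
$css = str_repeat("/* note */ a { color : red ; }\n", 200);

printf("sequential: %.4fs\n", time_it(fn () => minify_sequential($css)));
printf("nested:     %.4fs\n", time_it(fn () => minify_nested($css)));
```

Running each variant several times, and in both orders, helps separate a real difference from warm-up noise.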

lucy24

5:36 am on Sep 23, 2022 (gmt 0)




Huh. I wasn't even sure there IS a difference between your version and mine; I just wrote it out from scratch to see if they'd come out the same. In my version, the optional bit (blahblah)* is to deal with multiple comments in one fell swoop if you ever happen to have two or more in a row.
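The two comment patterns can be compared directly. This is only a sketch with made-up inputs, and the variable names are hypothetical. On these samples the versions do differ: the multi-comment pattern takes adjacent comments in one match, but it has no way past an asterisk in the middle of a comment, which the `([^/][^*]*\*+)*` loop in the other pattern handles.

```php
<?php
$single = '#/\*[^*]*\*+([^/][^*]*\*+)*/#';          // csdude55's pattern
$multi  = '#(/\*+[^*]*\*+/(\s*/\*+[^*]*\*+/)*)#';   // lucy24's pattern

$adjacent = '/* a */ /* b */';  // two comments in a row
$starred  = '/* a * b */';      // asterisk inside the comment body

var_dump(preg_match_all($single, $adjacent)); // int(2): one match per comment
var_dump(preg_match_all($multi, $adjacent));  // int(1): both in one match
var_dump(preg_match($single, $starred));      // int(1)
var_dump(preg_match($multi, $starred));       // int(0): inner * blocks the match
```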

Incidentally, I just recently discovered that using [ fixed ] instead of [ code ] is often more readable. (I understand color-highlighting, assuming you're in some specific language--not php, evidently--but why the ### would code ever be italicized?) It even blocks auto-linking, just like [ code ] does.

// First option, 0.95533585548401s 
$str = preg_replace('#^ +| +$| *([=),;+ ])+ *#', '$1', $str);
$str = preg_replace('#/\*+[^*]*\*+([^/][^*]*\*+)*/|\s+#', ' ', $str);

// Second option, 0.62886810302734s
$str = preg_replace('#^ +| +$| *([=),;+ ])+ *#', '$1',
preg_replace('#/\*+[^*]*\*+([^/][^*]*\*+)*/|\s+#', ' ', $str)
);
It was all those slashes that prompted me to experiment, since the italics in [ code ] made it look as if there are three kinds of slash.

Fun fact: I actually have no idea what # means in those preg_replace expressions :)

csdude55

5:58 pm on Sep 23, 2022 (gmt 0)




Fun fact: I actually have no idea what # means in those preg_replace expressions :)

Haha, sorry about that! I use the # in place of / as the opening and closing delimiter, so that I don't have to escape slashes in the regex. AFAIK you can use any ASCII non-alphanumeric character, other than a backslash or whitespace.

So these are all the same:

/(.*)/
#(.*)#
!(.*)!
@(.*)@
~(.*)~

and so on.
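A quick sketch to confirm the delimiters behave the same, using a made-up subject string. The last two lines show the practical payoff: with / as the delimiter a literal slash needs escaping, with # it does not.

```php
<?php
$patterns = ['/(.*)/', '#(.*)#', '!(.*)!', '@(.*)@', '~(.*)~'];

$results = [];
foreach ($patterns as $pattern) {
    preg_match($pattern, 'hello', $m);
    $results[] = $m[1];
}
var_dump(array_unique($results)); // a single entry: "hello"

// Same regex, two delimiters: only the first needs the slash escaped.
var_dump(preg_match('/a\/b/', 'a/b')); // int(1)
var_dump(preg_match('#a/b#', 'a/b'));  // int(1)
```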

Thanks for the tip on [ fixed ]! I can't even begin to explain the italics, either. It looks like they started after the _ in "preg_replace" :-O