
Fastest way to remove 3 potential matches from a string

     
7:27 am on Sep 17, 2019 (gmt 0)

Senior Member (joined Mar 15, 2013)


I'm removing 3 parameters from the query string: h, q, and start. I've been doing this:

$r_uri = $_SERVER['QUERY_STRING'];

$r_uri = str_replace("h=" . $_GET['h'], "", $r_uri);
$r_uri = str_replace("q=" . $_GET['q'], "", $r_uri);
$r_uri = str_replace("start=" . $_GET['start'], "", $r_uri);

$r_uri = rtrim($r_uri, "&");


This isn't perfect because the parameters could be anywhere in the string, so I could potentially end up with multiple & in a row. But that doesn't really hurt anything so it's not a big deal.

But since I'm rebuilding everything and this is on every page, I thought I might see if I could speed it up a few microseconds.

I've read some benchmarks that find a single preg_replace is about 37% slower than a single str_replace. But since I'm doing 3 str_replace() AND an rtrim(), am I correct that a single preg_replace would be faster?

Something like this (not tested, just typed up at 3am):

$r_uri = preg_replace('/((h=' . $_GET['h'] . ')|(q=' . $_GET['q'] . ')|(start=' . $_GET['start'] . '))&?/i', '', $r_uri);


Going further, I've read that sprintf() is usually by far the fastest, but I really don't use sprintf() a lot so I'm not super comfortable with the formatting. Is there a way that I could accomplish the same thing using sprintf()?
7:31 am on Sept 17, 2019 (gmt 0)

Senior Member (joined Mar 15, 2013)


Note, I realize that in this example it's probably faster to use a loop and avoid expressions altogether, like this:

$r_uri = '';

foreach ($_GET as $key => $val) {
    if ($key != 'h' && $key != 'q' && $key != 'start')
        $r_uri .= $key . '=' . $val . '&';
}

if ($r_uri)
    $r_uri = rtrim($r_uri, '&');


but I'm using this as an excuse to maybe learn something about sprintf(), or maybe even learn more about the relative speeds of each approach.
2:23 pm on Sept 17, 2019 (gmt 0)

Preferred Member (joined Dec 11, 2013)


I'd not use sprintf() since it's slower than preg_replace or str_replace.

If the string always (or most of the time) contains one of the parameters to be removed, I'd use preg_replace (as above, without the IF conditions) since it's most likely the fastest of them. Otherwise, I'd use strpos (or stripos) to check whether the string contains a parameter, and then use str_replace (or str_ireplace) and rtrim as in your example.
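
Roughly, that second route would look something like this (untested sketch, reusing the parameter names and $_GET values from your example):

$r_uri = $_SERVER['QUERY_STRING'];

// only pay for a str_replace() when the parameter is actually in the string
foreach (array('h', 'q', 'start') as $p) {
    if (isset($_GET[$p]) && strpos($r_uri, $p . '=') !== false) {
        $r_uri = str_replace($p . '=' . $_GET[$p], '', $r_uri);
    }
}

$r_uri = rtrim($r_uri, '&');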
2:33 pm on Sept 17, 2019 (gmt 0)

brotherhood_of_lan (Senior Member from GB)


Have a look at [php.net...]

Something like
parse_str($_SERVER['QUERY_STRING'],$res);
@unset($res['h'],$res['q'],$res['start']);
echo http_build_query($res);

Haven't tested it but it should work
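For example, with a query string like a=1&h=2&q=foo&start=10 it should echo just a=1.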
6:08 pm on Sept 17, 2019 (gmt 0)

lucy24 (Senior Member from US)


I could potentially end up with multiple & in a row
You could always globally replace
&{2,} or &&+
with
&
alone, if the && sequence makes you anxious. (It would me :))

I'd think about a pattern involving global delete of
\b(h|q|start)=[^&]*
and then reduce any resulting && separately, because sometimes two steps forward and one back is less bother than doing it all in one step. I said * rather than + in case you accidentally find yourself with a null parameter.
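
Spelled out in PHP, the two steps would be something along these lines (untested, just to show the shape):

// step 1: globally delete the named parameters, whatever their values
$r_uri = preg_replace('/\b(h|q|start)=[^&]*/', '', $_SERVER['QUERY_STRING']);

// step 2: collapse any && runs left behind, then tidy the ends
$r_uri = preg_replace('/&{2,}/', '&', $r_uri);
$r_uri = trim($r_uri, '&');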
7:14 pm on Sept 17, 2019 (gmt 0)

Senior Member (joined Mar 15, 2013)


I'd not use sprintf() since it's slower than preg_replace or str_replace.

Is it? My information came from this discussion in 2007, so it could be outdated by now:

[simplemachines.org...]

From their tests, sprintf() seemed to consistently be the fastest. There was some debate on whether strtr() might be faster, but all of the tests had sprintf() as faster than str_replace() or preg_replace().

Something like
parse_str($_SERVER['QUERY_STRING'],$res);
@unset($res['h'],$res['q'],$res['start']);
echo http_build_query($res);

That's pretty slick :-) I've used parse_str() a lot, but http_build_query() was a new one for me!

You could always globally replace
&{2,} or &&+
with
&
alone, if the && sequence makes you anxious. (It would me :))

I'd think about a pattern involving global delete of
\b(h|q|start)=[^&]*
and then reduce any resulting && separately, because sometimes two steps forward and one back is less bother than doing it all in one step. I said * rather than + in case you accidentally find yourself with a null parameter.

You've led me to another question... is there any speed difference between using a wildcard and the real data? I've always tried to use the most specific data possible in an effort to limit the number of wasted cycles the regex performs, but I don't know if that's really valuable.
8:12 pm on Sept 17, 2019 (gmt 0)

lucy24 (Senior Member from US)


is there any speed difference to use a wildcard versus the real data?
It depends what you mean by wildcard. Notably, {site that really ought to know better} is extremely fond of giving examples in the form
.*exact-string-here
which is grossly inefficient, especially when there's more than one .* plus exact-string sequence in the same pattern. You definitely want to avoid anything that calls for repeated backtracking: “Oh, whoops, I was supposed to leave room for this specific other string after the wildcard capture”.

You could do benchmark testing on patterns like
[^&]+
vs.
[\w.,-]+
(I'm making this up at random), but I really doubt the difference is significant in any real-life application. As so often, it ends up being more a matter of personal coding style: which form of the expression will be least likely to cause you distress when you come back in two years' time to fine-tune it? If you then have to go look up “what the heck does this part mean?” then the time you spend refreshing your memory might outweigh all those picoseconds you saved over the years.
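
If you do want numbers, a crude timing loop is all it takes (a sketch, nothing scientific, with an invented test string):

// time each candidate pattern over many iterations on a small string
$qs = 'a=1&h=12345&q=blue+widgets&start=20';
$patterns = array('/\b(h|q|start)=[^&]+/', '/\b(h|q|start)=[\w.,+-]+/');

foreach ($patterns as $p) {
    $t = microtime(true);
    for ($i = 0; $i < 100000; $i++) {
        preg_replace($p, '', $qs);
    }
    echo $p, ': ', round(microtime(true) - $t, 4), "s\n";
}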
8:49 pm on Sept 17, 2019 (gmt 0)

Preferred Member (joined Dec 11, 2013)


I meant PHP7, not the 2007 reality ;)

Another point, I can tell you that I've been through the "making the site as fast as possible" phase, but it will not improve your rankings at all. You'll spend weeks on it, your site will load in less than 500ms, but it won't matter as far as your rankings are concerned.
11:12 pm on Sept 17, 2019 (gmt 0)

Senior Member (joined Mar 15, 2013)


@lucy24, I gotcha. As a guideline I always try to make my regex as specific as I can; eg, "h" and "start" will always be a number (if they exist at all) so I could use [0-9]* instead of [^&]*. In theory that would prevent the regex from looking for letters and characters so it should be marginally faster? But then, if I can plug in "123456" instead of [0-9] then it would have even fewer things to consider, in theory making it even faster still.

I only have one server to test with, and since it has live websites on it then none of my benchmark tests are really fair :-( So I kinda have to run with what others have experienced and hope for the best.

@Selen, I'm really not concerned with rankings. It looks like the wide majority of my traffic comes from people going directly to my domain or searching for the domain, and it's been like that for YEARS!

But what I have found is that the faster I can get each page to load, the more pageviews per session I get. It's like, my average user is willing to give me about 10 minutes of attention, regardless of whether they look at 4 pages or 20 within those 10 minutes. So as I'm rebuilding, I figure it's worth every little microsecond here and there... worst case scenario it doesn't make any difference, but best case I increase my pageviews a little (which means more ad impressions) :-)
11:43 pm on Sept 17, 2019 (gmt 0)

penders (Senior Member)


I'm curious how you were intending to use sprintf() here? It doesn't seem like the right tool for the job, regardless of performance?
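
(For context, sprintf() just substitutes values into a fixed format string; there's nothing in it that removes substrings. The parameter names here are made up:)

// sprintf() fills placeholders in a template; it can't strip existing parameters
$r_uri = sprintf('page=%d&sort=%s', 3, 'date');   // "page=3&sort=date"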

If you go the regex / string replacement route then you would need a regex like lucy24 suggests, I would think (checking word boundaries, or more specifically checking for start/end anchors or "&"). An issue with your original str_replace() (and initial preg_replace() suggestion) is that they potentially replace too much: they replace any URL parameter name that ends with "h", "q" or "start" (although by matching the specific value you are perhaps preventing this to some extent).

...but I really doubt the difference is significant in any real-life application. ...


With regard to performance, by all means analyse it from an "academic" standpoint, but otherwise I agree with lucy24... the focus should probably be on readable/maintainable code and not so much on performance in this instance. What you are doing looks like something you'd need to do once per request? In which case, any of the above working methods would be indistinguishable from the next, performance-wise. If, however, you were performing thousands of such replacements per request then that could be a different matter.
12:02 am on Sept 18, 2019 (gmt 0)

penders (Senior Member)


I always try to make my regex as specific as I can; eg, "h" and "start" will always be a number (if they exist at all) so I could use [0-9]* instead of [^&]*. In theory that would prevent the regex from looking for letters and characters so it should be marginally faster? But then, if I can plug in "123456" instead of [0-9] then it would have even fewer things to consider, in theory making it even faster still.


But [^&]* isn't necessarily "looking for letters and characters", it's simply looking for "not &".

But also... what would be the desired result should an "invalid" character be injected - do you want that URL parameter removed regardless? e.g. if "h=123a" were present and the regex specifically matches digits, then this URL parameter remains. Otherwise, if you use the generalised regex, it will be removed.
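
For example (simplified patterns, with a trailing "&"/end-of-string check added so the whole parameter has to match):

$qs = 'h=123a&q=foo';

// digits-only pattern: "h=123a" does not match, so it stays in the string
echo preg_replace('/\bh=[0-9]*(&|$)/', '', $qs), "\n";   // h=123a&q=foo

// generalised pattern: the parameter is stripped regardless of its value
echo preg_replace('/\bh=[^&]*(&|$)/', '', $qs), "\n";    // q=foo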
2:39 am on Sept 18, 2019 (gmt 0)

Senior Member (joined Mar 15, 2013)


I'm curious how you were intending to use sprintf() here? It doesn't seem like the right tool for the job, regardless of performance?


I had no clue, honestly. I just saw in the link I gave before that it was benchmarked against the other commands. I rarely use sprintf() other than to send MySQL commands, so I didn't know whether there was a way to use it here that I wasn't aware of.

I've done several informal bench tests, though, and so far I'm surprised at the results! In order of fastest to slowest:

Using three str_replace() and rtrim():
1.38

Using the foreach ($_GET as $key => $val) loop:
1.90

Using the regex wildcard, preg_replace('/(h|start|q)=[^&]*/i'...:
2.00

Using parse_str and http_build_query:
5.79

Using the regex without the wildcard, preg_replace('/((h=' . $_GET['h']...:
6.89

I can post the entire script that I used if you guys are interested, but I was surprised that using several str_replace was still the fastest. I was more surprised, though, that using preg_replace with the wildcard was faster than without the wildcard.
2:42 am on Sept 18, 2019 (gmt 0)

Senior Member (joined Mar 15, 2013)


I should note that I'm using PHP 5.6.40. Someone mentioned that 7.x requires mysqli and my sites all use mysql, so I can't update until I've rebuilt everything.
5:43 am on Sept 18, 2019 (gmt 0)

brotherhood_of_lan (Senior Member from GB)


Using parse_str and http_build_query:


You won't need to use parse_str in your example since you have the vars available in _GET - I just included it since your OP used QUERY_STRING as the input, though maybe you want to keep the unset vars for later.
7:22 am on Sept 18, 2019 (gmt 0)

Senior Member (joined Mar 15, 2013)


True, but I need the params later on in the script so I can't just remove them from $_GET.

Modifying it to this, though, cut the time down to 1.00! Which would make it the fastest option of all:

$qs = $_GET;

// why did you have @unset? That threw an error for me so I'm guessing it was a typo
unset($qs['h'], $qs['q'], $qs['start']);

// rebuild it manually instead of using http_build_query
if (count($qs) > 0) {
    $r_uri .= '?';

    foreach ($qs as $key => $val)
        $r_uri .= $key . '=' . $val . '&';
}

// removing this had practically no impact on the speed
$r_uri = rtrim($r_uri, '&');
7:37 am on Sept 18, 2019 (gmt 0)

brotherhood_of_lan (Senior Member from GB)


You're right about the @, I don't use PHP as much lately so a bit rusty.

foreach ($qs as $key => $val)
$r_uri .= $key . '=' . $val . '&';


Bear in mind that _GET vars are not URL-encoded; using something like http_build_query would re-encode them, which no doubt makes it slower.
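
For example (a made-up value with a space in it):

// $_GET values arrive already decoded, e.g. a search for "blue widgets"
$qs = array('q' => 'blue widgets', 'page' => 2);

echo http_build_query($qs), "\n";        // q=blue+widgets&page=2  (re-encoded)
echo 'q=' . $qs['q'] . '&page=2', "\n";  // q=blue widgets&page=2  (raw space)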
7:43 am on Sept 18, 2019 (gmt 0)

Senior Member (joined Mar 15, 2013)


That's a good point. In my particular case I'm using $foo = urlencode($r_uri) afterward anyway, so I don't think it would matter. But it might make a difference for future readers.
11:02 pm on Sept 21, 2019 (gmt 0)

ergophobe (Senior Member)


I've read some benchmarks that find a single preg_replace is about 37% slower than a single str_replace.


Way back when (perhaps 15 years ago now), I benchmarked several different text manipulations that were supposedly faster or slower. To get a consistent measurable number (more than 1ms), I typically had to run these hundreds or thousands of times on small strings.

Worrying about optimizations like this is just completely foolish unless every CPU cycle matters and you have maxed out every other optimization possibility. Rather than focusing on differences like this, which probably make less than 1ms difference in page generation, focus on things with huge differences that can take seconds off. Those primarily fall into a couple of categories:

- very slow queries
- front end optimizations (critical path CSS, image optimization, lazy loading images below the fold, etc).

As for the code in question, which is parsing (apparently) just the incoming request (and therefore is only parsing one URL per page load), you should choose the code that is the most maintainable for you. Code it however you understand it best, then move on to profiling your queries and your front-end load times.

Obviously, if you *have* profiled the page and you have verified that this is a bottleneck, that's a different story.

At one time I was using PHP to get the image size so that I could inject it dynamically into the HTML to avoid reflow and in theory speed up page load. It turned out PHP (at the time at least) was very bad at that, and 90% of the page load time was devoted to getting the image sizes. So I'm not saying you should avoid PHP optimizations altogether. I'm just saying that with small strings, the difference between one text-processing method and another is probably not in the top 100 optimizations you could make on your site, and the time spent thinking about it could be used far more productively on something else.

[edited by: ergophobe at 11:06 pm (utc) on Sep 21, 2019]

11:03 pm on Sept 21, 2019 (gmt 0)

ergophobe (Senior Member)