Forum Moderators: coopster

Message Too Old, No Replies

Determining when a & in the URL should be literal instead of delimiter

         

csdude55

5:29 pm on Jan 11, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm seeing a small amount of traffic to links that look like this:

example.com/page.php?name=foo&bar

where "name" should be "foo&bar".

All of my internal links have the "name" value encoded, but these are coming from other sites that are linking to an old format from before I had them encoded. But either way, my end thinks that name=foo and there's a separate param of bar='', and I would like to modify the script to understand what is intended.

I can think of a few ways to "fix" it, but I'm not sure if this is way too complex and there's a much easier way?

// option 1, tested and works but it's the slowest to process
// surprisingly, this is marginally faster without in_array()
if (in_array(false, $_GET, true)) {
if (preg_match('/(?:^|&)name=(\w+&\w+)(?:&|$)/', $_SERVER['QUERY_STRING'], $matches))
$_GET['name'] = $matches[1];
}

// option 2, tests faster and allows for multiple &, but it doesn't guarantee that they're inline so
// it would convert "$name=foo&lorem=ipsum&bar" to "$name=foo&bar"
if (in_array(false, $_GET, true)) {
foreach ($_GET as $key => $val) {
if (!$val) {
$_GET['name'] = $_GET['name'] . '&' . $key;
unset($_GET[$key]);
}
}
}


I feel like the first option is the safest, but the second is by far the fastest. I benchmarked both options with and without in_array over 10,000 iterations, and the results were:

in_array, then foreach _GET: 0.0005497932434082
foreach _GET: 0.00074601173400879
preg_match: 0.0018899440765381
in_array, then preg_match: 0.0020120143890381

Before I run with the second one, is there a better way to accomplish this (read, "faster" and/or "safer")?

lucy24

5:45 pm on Jan 11, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Before getting hung up on speed, consider how often these requests actually occur. Sometimes saving your own sanity is worth a few milliseconds. Of all requests involving the parameter “name”, what proportion then run afoul of the spurious ampersand? Is it feasible to ask the linking sites to update?

As an entirely different alternative, could the requests be redirected (assuming Apache) before they ever get to the php handler? My first thought would be to pick apart {QUERY_STRING} --which is treated as a single literal string--and send requests where you want them to go. This might also be helpful if search engines are still requesting the old form.

w3dk

6:10 pm on Jan 11, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



> Before I run with the second one

Just to emphasis lucy24's post... "speed" is not a factor here. Code readability and reliability are the only factors here. You've already demonstrated that the second one is functionally incorrect, so why go with that?


if (in_array(false, $_GET, true)) {


Not sure why you are calling this as an initial check? (bool)false is never normally a value in the $_GET array unless you have already manually edited the $_GET array (generally a bad idea). And there wouldn't seem to be a need to check the entire array, as you only need to check if $_GET['name'] is already set?

csdude55

6:20 pm on Jan 11, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I get an email when they happen in case there's a crack attempt that needs my attention. I get maybe 3 or 4 of these each day (out of 100,000 requests), so they're relatively rare but worth doing SOMETHING to fix. I'm also trying to forbid unapproved params (in a separate thread), so my main goal is to make sure that I'm not squashing these links by mistake.

The ones I saw yesterday all came from Facebook, but I'm not sure what the actual page is. I'm guessing that my user shared his page from my site, though. And like you said, search engines could be pinging them, too.

I'm totally open to fixing it in Apache... are you thinking something like this?

RewriteCond name=(\w+)&(\w+)(?:&|$) [NC]
RewriteRule ^ ${'SCRIPT_NAME'}?name=%1\%26%2 [QSA,NC,L]


That wouldn't remove the original foo= and bar, though, would it?

csdude55

6:31 pm on Jan 11, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Code readability and reliability are the only factors here. You've already demonstrated that the second one is functionally incorrect, so why go with that?

@w3dk, generally speaking I can read a loop and an array easier than a regex, which is why the second one (with the foreach()) was my original solution. And I feel like I've read that if you have to use a regex then you've already lost? Or something like that... LOL

The second one also allows for names with multiple &, but so far I haven't seen that so I don't think that's really relevant. At least, not yet.

Not sure why you are calling this as an initial check?

My logic was to see if there are any $_GET keys with a false value, which I think would only be true if there's a &bar key set with no value. If this is false then there's no need to move on to the loop. Is that logic wrong? It made it process marginally faster, but at 0.0002 it's not exactly a relevant enough bump if it can cause incorrect matches.

w3dk

7:07 pm on Jan 11, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



My logic was to see if there are any $_GET keys with a false value, which I think would only be true if there's a &bar key set with no value. If this is false then there's no need to move on to the loop. Is that logic wrong?


If there is a "bar" URL parameter with no value then it would be set to an empty string, not (bool)false. Since you are doing a strict comparison, it will never match and the internal preg_match() or foreach() loop is never actually processed?!

It made it process marginally faster


Well, the only way the outer in_array() could have made it run faster was if it always evaluated to false and the inner code was never actually run?!


I'm totally open to fixing it in Apache...


IMO this should be done in PHP, not Apache. (And I believe lucy24 was suggesting an external redirect to help SEO.)

Doing this in Apache isn't necessarily going to be any more performant for your visitors, since it is being processed on every request, not just the specific page requests that need to check this. PHP is also going to be arguably easier to maintain.

w3dk

8:05 pm on Jan 11, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



And I feel like I've read that if you have to use a regex then you've already lost? Or something like that... LOL


Hhhmm, yes I've heard that, but context is key and is missing here. Sometimes a regex is just the right tool for the job.

Just a thought regarding the traversal of the $_GET array (like option 2)... I wouldn't necessarily assume that the order of the elements in the $_GET array is guaranteed to be the same as the order of the parameters in the query string.

option 2 would also break if "bar=" or "bar=0" was present in the query string.

csdude55

8:12 pm on Jan 11, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Erp, that'll do it! LOL Removing the strict value from in_array (the 3rd element of "true") definitely changed the speed:

_GET 0.0031840801239014
in_array, _GET: 0.0037670135498047
in_array, preg_match: 0.0055911540985107
preg_match 0.0083949565887451

That's with me manually assigning a string of $_SERVER['QUERY_STRING'] and array elements inside of the iteration. If you're interested:

[sandbox.onlinephpfunctions.com...]

I mean, 0.005s over 10,000 iterations is irrelevant in the grand scheme of things, so it really is a matter of readability and which actually works better. In my case, I think that it's more important to ensure that the empty param is immediately following the "name" param than to allow for multiple &, so in_array... preg_match is probably my best solution.

csdude55

8:13 pm on Jan 11, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



option 2 would also break if "bar=" or "bar=0" was present in the query string.

That's why I was using STRICT in in_array... what would be the appropriate value to check for if bar is set with no value? Or does it vary?

w3dk

8:43 pm on Jan 11, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



That's why I was using STRICT in in_array...


But you were using a loose comparison in the inner foreach() loop. ie. "if (!$val)" - both "" (empty string) and "0" would evaluate to false here.

If the "bar" URL parameter is present but has no value then $_GET['bar'] is an empty string. But note that "?bar" and "?bar=" in the query string would both result in the same $_GET value. You could only tell the difference in this regard by examining the original QUERY_STRING and explicitly checking for (the absence of) a trailing "=".

csdude55

9:11 pm on Jan 11, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Ohh, I gotcha. Using preg_match resolves that issue, too, so that seems to be the best solution. Which I've slightly modified to catch and unset the empty parameter:

if (in_array(false, $_GET) &&
preg_match('/(?:^|&)name=(\w+&(\w+))(?:&|$)/', $_SERVER['QUERY_STRING'], $matches)) {
$_GET['name'] = $matches[1];
unset($_GET[$matches[2]]);
}