parsing g urls

         

smallcompany

3:45 am on Mar 2, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have an ongoing project with my PHP-based tracking script. I have an issue, and a curiosity question.

Curiosity question:
After a recent PHP update, a script broke, then got fixed with the help of a couple of people from this community. As I turned on all of the PHP errors and notices (6143), I went after a notice that said: Undefined offset: 1
The code causing it was
list(,$querystring) = split("\?", $querystring);

After looking around, I came up with this:
list(,$querystring) = array_pad(split("\?", $querystring),2,null);

I sort of understand that by doing this, I have excluded an "empty" value, but it's still quite foggy to me. For example, what's the meaning of "2" in the code?
No errors, no more notices, script works for this part.

An issue:
One part of the script deals with Google, both ads and organic search. This part works fine except that from time to time it will throw in the https:// from parsed URL. Here is the code:
function google ($ques, $querystring, $referer, $url)
{
    $patterns = array('/www\./', '/\.com/' , '/\.co/', '/google\./');
    $replacements = array('', '', '', '');

    list(,$querystring) = array_pad(split("\?", $querystring),2,null);

    $v2 = preg_replace("/^([^\&]+\&)*$ques=([^\&]*)(\&[^\&]+)*$/", "$2", $querystring);

    // check for google.com/google.co.country/google.com.country/google.country with or without www .
    $country = preg_replace($patterns, $replacements, $url);

    // (country == 'google') => non-country case, www.google.com/google.com
    if ($country == "google") $country = "US";
    return array($v2, $referer . "-$country");
}

The code above basically gets the country code from the URL. If it's google.com, it'll come out as 01-US (01 is a reference for Google from an array that is a part of the script). If it's, e.g., Google from Germany, it'll come out as 01-de, and so on.
Now, when that https:// shows up, it is like this within the variable (few examples):
01-https://de
01-https://google
01-https://fr
01-https://br
...

My limited knowledge tells me that https:// should be a part of that regex line in the $patterns = array.
The script has more lines of code, including a part where I addressed an http(s) issue I had in the past:
$url = preg_replace("|https?://([^\/]+)/.*|", "$1", $_SERVER['HTTP_REFERER']);
$nonurl = preg_replace("|https?://([^\/]+)/(.*)|", "$2", $_SERVER['HTTP_REFERER']);

but that obviously did not cover the issue I have outlined here.

In any case, how can I ensure that http(s):// part does not show up?

Thank you

whitespace

9:03 am on Mar 2, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



I went after a notice that said: Undefined offset: 1


You should certainly try to avoid E_NOTICE messages (they can always be avoided). However, they don't necessarily indicate an error; they are an early warning signal that there could be one. So, unless you know the code, you don't know whether a given notice is an error or not.

PHP will return NULL when you try to access an array offset that is undefined (E_NOTICE: "Undefined offset"). In the case of your original script, this is probably "OK" (assuming it is OK / expected to not have a query string sometimes)? However, if the $querystring parameter should always contain a query string, then this is an error! Simply "masking" the E_NOTICE with your modified code might simply be masking the error! It is not always correct to simply make the notice "go away".


list(,$querystring) = array_pad(split("\?", $querystring),2,null);


The list() construct is just a convenient way to assign values of an array to variables. The first value of the array is assigned to the first variable, the second to the second variable, etc. In this example, the first variable is omitted, so $querystring (the 2nd variable) will be assigned the value of the 2nd array element (returned from the split() function). If the 2nd array element (offset 1) does not exist (ie. there is no query string) then you get the E_NOTICE message and NULL is assigned.

The array_pad() function call, with the value 2 as the second argument, simply ensures that a 2 element array is always returned. The array is padded with null (3rd argument) values if it is not long enough. So, you don't get an E_NOTICE message when accessing the second element, because there is always a second element.
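To make the padding concrete, here is a minimal sketch. It assumes explode() in place of split() (so that it also runs on PHP 7 and later), and the two URLs are made-up sample inputs, not values from the original script:

```php
<?php
// Hypothetical sample inputs; explode() stands in for the deprecated split().
$withQuery    = 'http://www.example.com/search?q=php';
$withoutQuery = 'http://www.example.com/search';

// Without a "?" explode() returns a one-element array, so offset 1 is missing.
// array_pad() tops the array up to 2 elements, filling the gap with null.
list(, $a) = array_pad(explode('?', $withQuery, 2), 2, null);
list(, $b) = array_pad(explode('?', $withoutQuery, 2), 2, null);

var_dump($a); // string(5) "q=php"
var_dump($b); // NULL
```

The "2" is therefore just the minimum length of the array handed to list(); it matches the number of positions list() reads from.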

However, this is a bit messy. split() is deprecated (as of PHP 5.3, and removed in PHP 7), there is no need for a regex or multiple function calls, and... PHP provides a function specifically for doing this!


$querystring = parse_url($querystring, PHP_URL_QUERY);


This should do exactly the same thing as your code above, provided $querystring is not hideously malformed. $querystring will be NULL if there is no query string portion.
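As a rough illustration of the one-liner above, the sketch below runs parse_url() on a made-up referrer and then reads a single parameter with parse_str(). Using parse_str() in place of the regex in google() is only a suggestion here, and the parameter name "q" is an assumption standing in for whatever $ques holds:

```php
<?php
// Hypothetical referrer value, for illustration only.
$referer = 'https://www.google.de/search?q=php+tracking&hl=de';

// Only the query string portion is returned.
$query = parse_url($referer, PHP_URL_QUERY); // "q=php+tracking&hl=de"

// parse_str() splits and URL-decodes the parameters into an array.
$params = array();
if (is_string($query)) {
    parse_str($query, $params);
}
$keyword = isset($params['q']) ? $params['q'] : null; // "php tracking"

// With no query string portion the result is NULL, and no notice is raised.
$none = parse_url('https://www.google.de/', PHP_URL_QUERY);
```

The is_string() check covers both the NULL case and the FALSE that parse_url() can return for seriously malformed input.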

(To be continued...)

smallcompany

5:43 pm on Mar 3, 2016 (gmt 0)



And again, thanks very much! I get it better now.

Also, I have used the suggested line of code with no negative impact - the variables are coming through - so I consider this part as optimized, thank you.

whitespace

7:47 pm on Mar 9, 2016 (gmt 0)



Sorry, I intended to get back to this...

My limited knowledge tells me that https:// should be a part of that regex line in the $patterns = array


Well, yes, that would certainly seem to remove "https://" from the result (careful with those slashes though). And turn "01-https://google" into "01-US" (should that not be a lowercase "us"?). But, in the code you've posted you don't actually need the regex-bits, a "simple" string replace would suffice. (Then again, do you need any of that? It looks like you are stripping away all the bits you don't want, to be left with what you do? Why not just go straight for the jugular and pluck out what you do want? The TLD?)
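One way to "pluck out" the TLD directly might look like the sketch below. google_country() is a hypothetical helper, not part of the posted script, and it assumes the input is either a full referrer URL or a bare host name. It keeps the posted behaviour of mapping google.com to "US":

```php
<?php
// google_country() is a hypothetical helper, not part of the original script.
function google_country($referer)
{
    // Accept either a full URL or a bare host name.
    $host = parse_url($referer, PHP_URL_HOST);
    if (!is_string($host)) {
        $host = $referer;
    }
    $host = strtolower($host);

    // Matches google.com, google.de, google.co.uk, google.com.br,
    // with or without a leading sub-domain such as "www.".
    // The capture group holds the last label of the host name.
    if (!preg_match('/(?:^|\.)google\.(?:com?\.)?([a-z]{2,3})$/', $host, $m)) {
        return null; // not a Google host
    }

    // Non-country case (google.com) is reported as "US", as in the posted code.
    return $m[1] === 'com' ? 'US' : $m[1];
}
```

Because the host is extracted first, a stray "https://" never reaches the country part, whatever form the referrer takes.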

To be honest, I find the whole function a bit perplexing. It seems overly complex for what it does - but I don't really know what it does. What are the inputs and expected output?

...except that from time to time it will throw in the https:// from parsed URL.


Why does this only happen "from time to time"? One of the inputs would seem to be different "from time to time". Why? Is that expected?

...how can I ensure that http(s):// part does not show up?


Could it be "http://" as well? Although Google does appear to be universally "https://" these days? Regex: "@https?://@".
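One possible cause of the "from time to time" behaviour, not confirmed in this thread, is a referrer with no slash after the host name: the posted pattern requires that slash, and preg_replace() returns the subject unchanged when nothing matches. The sketch below uses made-up referrer values to show this, followed by the "@https?://@" regex (anchored with "^" here, which is an addition):

```php
<?php
// Pattern from the posted script; it needs a "/" after the host to match.
$pattern = '|https?://([^\/]+)/.*|';

// Hypothetical referrer values, with and without a trailing slash.
$withSlash    = preg_replace($pattern, '$1', 'https://www.google.de/');
$withoutSlash = preg_replace($pattern, '$1', 'https://www.google.de');

var_dump($withSlash);    // string(13) "www.google.de"
var_dump($withoutSlash); // string(21) "https://www.google.de"

// "@" is the delimiter, so the slashes need no escaping.
$stripped = preg_replace('@^https?://@', '', $withoutSlash);

var_dump($stripped);     // string(13) "www.google.de"
```

If that is what is happening, the unmatched value would pass through the $patterns replacements as "https://de", which is the shape of the output reported.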