andreasfriedrich - 2:40 pm on Jan 30, 2003 (gmt 0)
This thread continues the collection of PHP tricks in our Php bag of tricks [webmasterworld.com], which among others is referenced in the Perl and PHP CGI Scripting library [webmasterworld.com].
Validating a URI
Validating a URI is a task that comes up quite often. You may have a form where users have to enter a valid URL. You might want to check that a referrer [webmasterworld.com] that you will echo on a page is valid, to prevent injection of bad code and linking to bad sites.
While PHP provides the [url=http://www.php.net/parse_url]parse_url[/url] function to parse a URL and return its components, it still lacks some functionality that comes in handy when validating a URI.
Features that my
How to use my is_url function:
For the impatient among you, here's a complete example first. It checks whether the HTTP_REFERER is valid and converts it into an absolute URI before using it on our page.
Step by step guide through the above example:
Now let's have a closer look at how it does just that.
After initializing the $cleaned and $error variables we call the is_url() function. The first parameter is the HTTP_REFERER as contained in the Referer header field of the client request. The value of this field may be either an absolute or a relative URI. Since the client sends this header, it may of course contain just about anything, including code the user wants to inject.
The second parameter specifies which components need to be present for is_url to consider the URI valid. We pass only the PATH constant, since that is all a relative URI requires. To check for an absolute URI we would pass SCHEME+AUTHORITY+PATH as the second argument.
As the third and fourth arguments we pass references to the $error and $cleaned variables. They will be filled with a value indicating the missing components and with the cleaned components of the URI, respectively.
If there is no SCHEME and no AUTHORITY we know that we have a relative URI which we need to turn into an absolute one. The base URI that we use to resolve it is the requested URI which is contained in $_SERVER['SCRIPT_URI'].
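The resolution step can be pictured with a minimal sketch. resolve_path() is a hypothetical helper, not the code from the example above; it only merges paths and removes "." and ".." segments, and it assumes the base path comes from an absolute URI.

```php
<?php
// Hypothetical helper: resolve a relative path against the path of a base URI.
// Query and fragment handling is left to the caller.
function resolve_path($basePath, $relPath)
{
    if ($relPath === '') {
        return $basePath;
    }
    if ($relPath[0] !== '/') {
        // Relative path: keep the base path up to and including its last "/".
        $pos = strrpos($basePath, '/');
        $dir = ($pos === false) ? '/' : substr($basePath, 0, $pos + 1);
        $relPath = $dir . $relPath;
    }
    $out = array();
    $segments = explode('/', $relPath);
    $last = count($segments) - 1;
    foreach ($segments as $i => $seg) {
        if ($seg === '.' || $seg === '..') {
            // ".." drops the previous segment, but never the leading empty one.
            if ($seg === '..' && count($out) > 1) {
                array_pop($out);
            }
            // A trailing "." or ".." leaves a directory, so keep the final "/".
            if ($i === $last) {
                $out[] = '';
            }
            continue;
        }
        $out[] = $seg;
    }
    return implode('/', $out);
}

// Usage with a made-up base path:
$path = resolve_path('/shop/cart.php', '../index.php');
```

The scheme and authority of the base URI would then be put in front of the resolved path to obtain the absolute URI.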
Now we can be reasonably sure [1] that we have an absolute URI. If the authority component is not equal to our server name, the referrer is from another domain and we will not use it, since we do not want to link to some other site.
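A sketch of such a comparison follows. same_host() is a hypothetical helper; stripping userinfo and port before comparing is an assumption that goes a little beyond the plain equality test described above.

```php
<?php
// Hypothetical helper: does the authority component name our own host?
function same_host($authority, $serverName)
{
    // Drop an optional userinfo part ("user@") and an optional port (":8080").
    $host = preg_replace('/^.*@/', '', $authority);
    $host = preg_replace('/:\d*$/', '', $host);
    // Host names are case-insensitive.
    return strcasecmp($host, $serverName) === 0;
}

// In the referrer example the second argument would be $_SERVER['SERVER_NAME'].
$ok = same_host('WWW.Example.com:80', 'www.example.com');
```

Removing the userinfo first matters because an authority such as www.example.com@evil.test points at evil.test, not at our server.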
When we have an absolute URI that is from our domain, we assemble the cleaned parts to form a URI again.
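Putting the parts back together could look roughly like this. assemble_uri() and the array keys are assumptions made for illustration; the keys of the real $cleaned array may differ.

```php
<?php
// Hypothetical helper: rebuild a URI string from its components.
// Assumed keys: scheme, authority, path, query, fragment (empty string = absent).
function assemble_uri(array $p)
{
    $uri = '';
    if ($p['scheme'] !== '') {
        $uri .= $p['scheme'] . ':';
    }
    if ($p['authority'] !== '') {
        $uri .= '//' . $p['authority'];
    }
    $uri .= $p['path'];
    if ($p['query'] !== '') {
        $uri .= '?' . $p['query'];
    }
    if ($p['fragment'] !== '') {
        $uri .= '#' . $p['fragment'];
    }
    return $uri;
}

$uri = assemble_uri(array(
    'scheme'    => 'http',
    'authority' => 'www.example.com',
    'path'      => '/shop/item.php',
    'query'     => 'id=7',
    'fragment'  => '',
));
```

Before echoing the result into a page it is still worth running it through htmlspecialchars(), since characters such as & are legal in a URI but need escaping in HTML.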
Code of my is_url function:
Here's the code for is_url:
Step by step guide through the above code:
We define some constants that we use to specify the required parts and that is_url() uses to encode which parts of the URI are missing.
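As an illustration, such constants might be defined as powers of two so that they can be added together and tested individually. The values below, the QUERY and FRAGMENT names, and the decode_components() helper are assumptions; the text only names SCHEME, AUTHORITY, PATH and AUTHORITY_WF.

```php
<?php
// Assumed values: one bit per component.
define('SCHEME', 1);
define('AUTHORITY', 2);
define('PATH', 4);
define('QUERY', 8);
define('FRAGMENT', 16);
define('AUTHORITY_WF', 32); // assumed to flag an authority with illegal characters

// Hypothetical helper: turn a combined value back into component names.
function decode_components($mask)
{
    $names = array(
        SCHEME    => 'scheme',
        AUTHORITY => 'authority',
        PATH      => 'path',
        QUERY     => 'query',
        FRAGMENT  => 'fragment',
    );
    $out = array();
    foreach ($names as $flag => $name) {
        if ($mask & $flag) {
            $out[] = $name;
        }
    }
    return $out;
}

// Requirements for an absolute URI, as passed in the second argument.
$required = SCHEME + AUTHORITY + PATH;
```

Because each constant occupies its own bit, adding them never produces an ambiguous value, and a single bitwise AND tells whether a given component is part of the set.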
$string is the URI that we want to check.
$components is a numeric value specifying which components are required for a valid URI. Use the constants defined above.
$error is a variable that is passed by reference. It will contain a numeric value specifying which components were missing. Note that it reports all components that are missing, not just those that were required. Use the constants to decode that value.
$cleaned is an array that is passed by reference. It will contain the cleaned parts of the URI, i.e. they will contain only allowed characters. Invalid characters in the query and fragment components are URL-encoded.
We split the URI into its components using the regular expression given in RFC2396. We could have used PHP's parse_url, but I liked the RE better.
If the RE does not match at all, the given URI is something else and we return immediately with a return value of false.
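The splitting step can be sketched as follows. The pattern is the one from Appendix B of RFC 2396; split_uri() and the array keys are hypothetical names chosen for this sketch.

```php
<?php
// Hypothetical helper: split a URI reference into its five components.
function split_uri($string)
{
    // RFC 2396, Appendix B. "!" is used as delimiter because the pattern
    // itself contains "/" and "#".
    $re = '!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!';
    if (!preg_match($re, $string, $m)) {
        return false;
    }
    // preg_match() drops trailing groups that did not take part in the match,
    // so pad the result to make every index available.
    $m = array_pad($m, 10, '');
    return array(
        'scheme'    => $m[2],
        'authority' => $m[4],
        'path'      => $m[5],
        'query'     => $m[7],
        'fragment'  => $m[9],
    );
}

$parts = split_uri('http://www.example.com/a/b.php?x=1#top');
```

Since every group in the pattern is optional, it accepts almost any input, so the real filtering happens in the later checks for missing components and illegal characters.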
Here we check the parts returned by the RE and build the $_error variable.
Next we check the components for illegal characters. They are simply deleted from the URI. Note that the scheme is not checked. It should be! The authority is not cleaned up either. You can check whether $error contains AUTHORITY_WF to tell whether it has illegal characters. Better yet, add some clean-up code as well.
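One way to picture the clean-up is shown below. Both helpers are hypothetical, and the character classes are only an approximation of the sets allowed by RFC 2396; in particular a stray "%" is let through without checking that two hex digits follow.

```php
<?php
// Hypothetical helper: delete every character that is not allowed in a path.
function clean_path($path)
{
    return preg_replace('#[^A-Za-z0-9_.!~*\'()%:@&=+$,;/-]#', '', $path);
}

// Hypothetical helper: percent-encode characters not allowed in a query
// or fragment instead of deleting them.
function clean_query($query)
{
    return preg_replace_callback(
        '#[^A-Za-z0-9_.!~*\'()%;/?:@&=+$,-]#',
        function ($m) {
            return rawurlencode($m[0]);
        },
        $query
    );
}

$path  = clean_path('/a/<b>.php');
$query = clean_query('x=1&y=<script>');
```

Deleting changes what the path points to, while encoding keeps the information, which may be why the query and fragment are treated differently.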
So far we have only recorded which components are missing. Now we need to determine whether any required parts are missing. This is done in the foreach loop: when a component is required and missing, we add that component's numeric value to the $error variable. If $error is larger than zero we return false, otherwise we return true. In both cases we then assign $error the value of our internal $_error variable, which contains all the missing components.
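The two stages, recording what is missing and then testing it against the requirements, might be sketched like this. The constant values, the array keys and both helper names are assumptions.

```php
<?php
// Assumed values: one bit per component.
define('SCHEME', 1);
define('AUTHORITY', 2);
define('PATH', 4);
define('QUERY', 8);
define('FRAGMENT', 16);

// Hypothetical helper: bitmask of every component that is empty.
function missing_components(array $parts)
{
    $flags = array('scheme' => SCHEME, 'authority' => AUTHORITY, 'path' => PATH,
                   'query' => QUERY, 'fragment' => FRAGMENT);
    $missing = 0;
    foreach ($flags as $name => $flag) {
        if (!isset($parts[$name]) || $parts[$name] === '') {
            $missing += $flag;
        }
    }
    return $missing;
}

// Hypothetical helper: true when no required component is missing.
function has_required($missing, $required)
{
    $failed = 0;
    foreach (array(SCHEME, AUTHORITY, PATH, QUERY, FRAGMENT) as $flag) {
        if (($required & $flag) && ($missing & $flag)) {
            $failed += $flag;
        }
    }
    return $failed === 0;
}

// A relative URI: only the path is present.
$missing = missing_components(array('scheme' => '', 'authority' => '',
    'path' => '/index.php', 'query' => '', 'fragment' => ''));
$relativeOk = has_required($missing, PATH);
$absoluteOk = has_required($missing, SCHEME + AUTHORITY + PATH);
```

The loop mirrors the description above; the same test could also be written as ($required & $missing) === 0.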
Have fun giving it a try.
1 To be absolutely sure one would need to check for cases where there is an AUTHORITY but no SCHEME.