
PHP Server Side Scripting Forum

Bag-O-Tricks for PHP II
some code snippets that should be helpful for all in creating dynamic sites
andreasfriedrich




msg:1296699
 2:40 pm on Jan 30, 2003 (gmt 0)

This thread continues the collection of PHP tricks in our PHP bag of tricks [webmasterworld.com], which, among others, is referenced in the Perl and PHP CGI Scripting library [webmasterworld.com].


Validating a URI

Validating a URI is a task that comes up quite often. You may have a form where users have to enter a valid URL. You might also want to check that a referrer [webmasterworld.com] that you echo on a page is valid, to prevent injection of bad code and linking to bad sites.

While PHP provides the parse_url [php.net] function to parse a URL and return its components, it still lacks some functionality that comes in handy when validating a URI.
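
To illustrate the gap, here is a minimal sketch of what parse_url returns for an absolute and a relative reference. The URLs are made-up placeholders. Note that parse_url names the host component "host" rather than "authority", and that components which are not present are simply absent from the result, so the caller still has to work out what is missing.

```php
<?php
// Illustrative input only; example.com is a placeholder host.
$absolute = parse_url('http://www.example.com:8080/dir/page.php?x=1#top');
$relative = parse_url('page.php?x=1');

// $absolute contains scheme, host, port, path, query and fragment.
print_r($absolute);

// $relative contains only path and query. parse_url() does not report
// which components are missing and does not clean up any of the parts.
print_r($relative);
```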

-o0o-

Features that my is_url function provides:

  • Lets you specify which components are required
  • Reports which components are missing
  • Cleans up the parts to comply with RFC2396 and return them as an array
  • Returns true if all required components are present

-o0o-

How to use my is_url function:

For the impatient among you, here's a complete example first. It checks whether the HTTP_REFERER is valid and converts it into an absolute URI before using it on our page.


$cleaned = array();
$error = 0;
if (is_url($_SERVER['HTTP_REFERER'], PATH, $error, $cleaned)) {
    if ($error & (SCHEME + AUTHORITY)) {
        $_SERVER['HTTP_REFERER'] = make_abs($_SERVER['HTTP_REFERER'],
                                            $_SERVER['SCRIPT_URI']);
    } elseif ($cleaned['authority'] != $_SERVER['SERVER_NAME']) {
        echo "Referer is from other domain. We do not include it.";
    } else {
        $_SERVER['HTTP_REFERER'] = make_uri($cleaned);
        echo "<br>$_SERVER[HTTP_REFERER]";
    }
} else {
    echo "errors: $error<br>";
}

-o0o-

Step by step guide through the above example:

Now let's have a closer look at how it does just that.


$cleaned = array();
$error = 0;
if (is_url($_SERVER['HTTP_REFERER'], PATH, $error, $cleaned)) {

After initializing the $cleaned and $error variables we call the is_url() function. The first parameter is the HTTP_REFERER as contained in the Referer header field of the client request. The value of this field may be either an absolute or a relative URI. Since the client controls this header, it can of course contain just about anything the user wants, including malicious code.

The second parameter specifies which components need to be present for is_url to consider the URI valid. We only pass the PATH constant since that is all that is required for a relative URI. If we were to check for an absolute URI we would use SCHEME+AUTHORITY+PATH as the second argument.

As the third and fourth arguments we pass references to the $error and $cleaned variables. They will be filled with a value indicating the missing components and with the cleaned components of the URI, respectively.
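
As a rough sketch of how the value in $error can be decoded, the snippet below turns the bit mask back into component names. The constants carry the same values as the ones defined for is_url. The helper missing_components() is hypothetical and not part of the original code.

```php
<?php
// Same values as the constants used by is_url().
defined('SCHEME')       || define('SCHEME', 1);
defined('AUTHORITY')    || define('AUTHORITY', 2);
defined('PATH')         || define('PATH', 4);
defined('QUERY')        || define('QUERY', 8);
defined('FRAGMENT')     || define('FRAGMENT', 16);
defined('AUTHORITY_WF') || define('AUTHORITY_WF', 32);

// Hypothetical helper: list the names of all flags set in $error.
function missing_components($error) {
    $names = array(
        SCHEME       => 'scheme',
        AUTHORITY    => 'authority',
        PATH         => 'path',
        QUERY        => 'query',
        FRAGMENT     => 'fragment',
        AUTHORITY_WF => 'authority not well-formed',
    );
    $missing = array();
    foreach ($names as $flag => $name) {
        if ($error & $flag) {
            $missing[] = $name;
        }
    }
    return $missing;
}

// A bare relative reference such as "page.php" has a path but nothing else.
echo implode(', ', missing_components(SCHEME + AUTHORITY + QUERY + FRAGMENT));
```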


if ($error & (SCHEME + AUTHORITY)) {
$_SERVER['HTTP_REFERER'] = make_abs($_SERVER['HTTP_REFERER'],
$_SERVER['SCRIPT_URI']);
}

If there is no SCHEME and no AUTHORITY we know that we have a relative URI which we need to turn into an absolute one. The base URI that we use to resolve it is the requested URI which is contained in $_SERVER['SCRIPT_URI'].
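
One detail worth spelling out: the condition $error & (SCHEME + AUTHORITY) is already true when only one of the two components is missing. The following sketch contrasts that test with a stricter one that requires both to be missing. Both helper functions are hypothetical names introduced for illustration only.

```php
<?php
// Same values as the constants used by is_url().
defined('SCHEME')    || define('SCHEME', 1);
defined('AUTHORITY') || define('AUTHORITY', 2);

// True if the scheme, the authority, or both are missing.
// This is what the condition in the example above tests.
function lacks_scheme_or_authority($error) {
    return ($error & (SCHEME + AUTHORITY)) != 0;
}

// True only if both the scheme and the authority are missing.
function lacks_scheme_and_authority($error) {
    $mask = SCHEME + AUTHORITY;
    return ($error & $mask) == $mask;
}

// "//www.example.com/page.php" has an authority but no scheme.
$error = SCHEME;
var_dump(lacks_scheme_or_authority($error));
var_dump(lacks_scheme_and_authority($error));
```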


elseif ($cleaned['authority'] != $_SERVER['SERVER_NAME']) {
echo "Referer is from other domain. We do not include it.";
}

Now we can be reasonably sure (see note 1) that we have an absolute URI. If the authority component is not equal to our server name, the referrer is from another domain and we will not use it, since we do not want to link to some other site.


} else {
$_SERVER['HTTP_REFERER'] = make_uri($cleaned);
echo "<br>$_SERVER[HTTP_REFERER]";
}

When we have an absolute URI that is from our domain we assemble the cleaned parts to form a URI again.

-o0o-

Code of my is_url function:

Here's the code for is_url:


define('SCHEME', 1);
define('AUTHORITY', 2);
define('PATH', 4);
define('QUERY', 8);
define('FRAGMENT', 16);
define('AUTHORITY_WF', 32); # AUTHORITY_WELLFORMED
function is_url($string, $components, &$error, &$cleaned) {
    $error = 0; // first clear error variable
    $_error = 0;
    #
    $ret = ereg("^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?",
                $string, $regs);
    #
    // return false if we were not even able to parse uri
    if (!$ret) return false;
    #
    // check the separate parts
    if (empty($regs[2])) $_error += SCHEME;
    if (empty($regs[4])) $_error += AUTHORITY;
    if (!empty($regs[4]) and strcmp($regs[2], 'http') == 0) {
        // do we have an ok hostname?
        if (!ereg("((([a-z0-9]+)[a-z0-9_]|\\-)+\\.)+". // subdomain + domain
                  "[a-z]{2,4}".                        // TLD
                  ":?[0-9]{0,5}$",                     // port
                  $regs[4])) {
            $_error += AUTHORITY_WF;
        }
    }
    if (empty($regs[5])) $_error += PATH;
    if (empty($regs[7])) $_error += QUERY;
    if (empty($regs[9])) $_error += FRAGMENT;
    #
    if ($cleaned != '') {
        // strip characters that are not allowed in each component
        $cleaned['scheme'] = $regs[2];
        $cleaned['authority'] = $regs[4];
        $cleaned['path'] =
            preg_replace("{[^-/:@&=+$,_.!~*()'a-zA-Z0-9]}", '', $regs[5]);
        $cleaned['query'] =
            preg_replace("{[^-;/?:@&=+$,_.!~*'()A-Za-z0-9%]}", '',
                         urlencode_querystring($regs[7]));
        $cleaned['fragment'] =
            preg_replace("{[^-;/?:@&=+$,_.!~*'()A-Za-z0-9%]}", '',
                         urlencode($regs[9]));
    }
    #
    // collect the components that are both required and missing
    foreach (array(SCHEME, AUTHORITY, AUTHORITY_WF, PATH, QUERY, FRAGMENT)
             as $comp) {
        if ($components & $comp and $_error & $comp) $error += $comp;
    }
    #
    if ($error > 0) {
        $error = $_error;
        return false;
    }
    $error = $_error;
    return true;
}

-o0o-

Step by step guide through the above code:


define('SCHEME', 1);
define('AUTHORITY', 2);
define('PATH', 4);
define('QUERY', 8);
define('FRAGMENT', 16);
define('AUTHORITY_WF', 32); # AUTHORITY_WELLFORMED

We define some constants that we use to specify the required parts and that is_url() uses to encode which parts of the URI are missing.


function is_url($string, $components, &$error, &$cleaned) {

The parameters are:

  • $string is the URI that we want to check.
  • $components is a numeric value specifying which components are required for a valid URI. Use the constants defined above.
  • $error is a variable that is passed by reference. It will contain a numeric value specifying which components were missing. Note that it reports all components that are missing, not just those that were required. Use the constants to decode that value.
  • $cleaned is an array that is passed by reference. It will contain the cleaned parts of the URI, i.e. they will contain only allowed characters. Invalid characters in the query and fragment components are URL-encoded.


$ret = ereg("^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?",
$string, $regs);
#
// return false if we were not even able to parse uri
if (!$ret) return false;

We split the URI into its components using the regular expression given in RFC2396. We could have used PHP's parse_url, but I liked the RE better.

If the RE does not match at all, the given URI is something else and we return immediately with a return value of false.
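
The ereg functions were deprecated in PHP 5.3 and removed in PHP 7.0, so on a current PHP version the same split has to be done with preg_match. The following is a sketch of that step only, using the same pattern wrapped in "~" delimiters. The helper name split_uri() and the example URL are assumptions made for illustration.

```php
<?php
// Hypothetical helper: split a URI reference with the same pattern,
// using preg_match() instead of ereg().
function split_uri($string) {
    $pattern = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~';
    if (!preg_match($pattern, $string, $regs)) {
        return false;
    }
    // preg_match() leaves out trailing groups that did not take part in
    // the match, so pad the array to make $regs[2] .. $regs[9] readable.
    $regs += array_fill(0, 10, '');
    return array(
        'scheme'    => $regs[2],
        'authority' => $regs[4],
        'path'      => $regs[5],
        'query'     => $regs[7],
        'fragment'  => $regs[9],
    );
}

print_r(split_uri('http://www.example.com/dir/page.php?x=1#top'));
print_r(split_uri('page.php'));
```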


if (empty($regs[2])) $_error += SCHEME;
if (empty($regs[4])) $_error += AUTHORITY;
if (!empty($regs[4]) and strcmp($regs[2], 'http') == 0) {
// do we have an ok hostname?
if (!ereg("((([a-z0-9]+)[a-z0-9_]|\\-)+\\.)+".// subdomain + domain
"[a-z]{2,4}".// TLD
":?[0-9]{0,5}$",// port
$regs[4])) {
$_error += AUTHORITY_WF;
}
}
if (empty($regs[5])) $_error += PATH;
if (empty($regs[7])) $_error += QUERY;
if (empty($regs[9])) $_error += FRAGMENT;

Here we check the parts returned by the RE and build the $_error variable.


$cleaned['scheme'] = $regs[2];
$cleaned['authority'] = $regs[4];
$cleaned['path'] =
preg_replace("{[^-/:@&=+$,_.!~*()'a-zA-Z0-9]}", '', $regs[5]);
$cleaned['query'] =
preg_replace("{[^-;/?:@&=+$,_.!~*'()A-Za-z0-9%]}", '',
urlencode_querystring($regs[7]));
$cleaned['fragment'] =
preg_replace("{[^-;/?:@&=+$,_.!~*'()A-Za-z0-9%]}", '',
urlencode($regs[9]));

Here we check the components for illegal characters, which are simply deleted from the URI. Note that the scheme is not checked. It should be! The authority is not cleaned up either. You can check whether $error contains AUTHORITY_WF to tell whether it contains illegal characters. Better yet, add some clean-up code for it as well.
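
To make the clean-up concrete, here is a small sketch that applies the path character class from the code above to a made-up path. It also shows one possible way to add the missing scheme check, based on the scheme syntax in RFC2396 (a letter followed by letters, digits, "+", "-" or "."). The function is_valid_scheme() is a hypothetical addition, not part of is_url.

```php
<?php
// The character class is the one is_url() uses for the path component.
// The input path is made up for illustration.
$path = '/my dir/<b>page</b>.php';
$clean = preg_replace('{[^-/:@&=+$,_.!~*()\'a-zA-Z0-9]}', '', $path);
echo $clean, "\n"; // the space and the angle brackets are deleted

// Hypothetical scheme check: a letter, then letters, digits, "+", "-", ".".
function is_valid_scheme($scheme) {
    return preg_match('/^[a-z][a-z0-9+.-]*$/i', $scheme) === 1;
}

var_dump(is_valid_scheme('http'));
var_dump(is_valid_scheme('1http'));
```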


foreach (array(SCHEME, AUTHORITY, AUTHORITY_WF, PATH, QUERY, FRAGMENT)
as $comp) {
if ($components & $comp and $_error & $comp) $error += $comp;
}
#
if ($error > 0) {
$error = $_error;
return false;
}
$error = $_error;
return true;

So far we have only checked which components were missing. Now we need to determine whether any required parts are missing. This is done in the foreach loop. When a component is required and it is missing, we add that component's numeric value to the $error variable. When $error is larger than zero we return false. Otherwise we return true. In both cases we assign the $error variable the value of our internal $_error variable, which contains all the missing elements.
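
The required-versus-missing logic can be tried out on its own. The sketch below mirrors the foreach loop in a hypothetical helper, required_but_missing(), using the same constant values. Since every flag is a distinct bit, the loop gives the same result as a plain bitwise AND of the two masks.

```php
<?php
// Same values as the constants used by is_url().
defined('SCHEME')       || define('SCHEME', 1);
defined('AUTHORITY')    || define('AUTHORITY', 2);
defined('PATH')         || define('PATH', 4);
defined('QUERY')        || define('QUERY', 8);
defined('FRAGMENT')     || define('FRAGMENT', 16);
defined('AUTHORITY_WF') || define('AUTHORITY_WF', 32);

// Hypothetical helper mirroring the foreach loop in is_url().
function required_but_missing($components, $missing) {
    $result = 0;
    foreach (array(SCHEME, AUTHORITY, AUTHORITY_WF, PATH, QUERY, FRAGMENT)
             as $comp) {
        if (($components & $comp) && ($missing & $comp)) {
            $result += $comp;
        }
    }
    return $result;
}

$required = SCHEME + AUTHORITY + PATH;       // we insist on an absolute URI
$missing  = AUTHORITY + QUERY + FRAGMENT;    // what the parser found missing

// Only AUTHORITY is both required and missing, so is_url() would fail here.
var_dump(required_but_missing($required, $missing) === AUTHORITY);

// If only PATH is required, nothing required is missing.
var_dump(required_but_missing(PATH, $missing) === 0);
```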

-o0o-

Have fun giving it a try.

Andreas

-o0o-


Note: The WebmasterWorld posting software deletes spaces preceding the exclamation point "!" character. It also replaces a solid vertical pipe symbol with a broken vertical pipe "¦" symbol. Both of these changes will need to be undone in any code you copy from WebmasterWorld. Make sure to include a space preceding the "!" in mod_rewrite code, and always replace "¦" with a solid vertical pipe.

1 To be absolutely sure, one would need to check for cases where there is an AUTHORITY but no SCHEME.

 
