How does preg match all work?

using preg_match_all to extract links

majjk

6:10 pm on Dec 27, 2007 (gmt 0)

Hello everyone...
I've found a script that is validating links on a webpage. However, it seems to get into trouble when coming across links containing the tilde sign. Part of a function in this script looks like this:


preg_match_all("|href\=\"?'?`?([[:alnum:]:?=&@/;._-]+)\"?'?`?|i", $html, &$matches);
$links = array();
$ret = $matches[1];
for($i=0;isset($ret[$i]);$i++) {
if(preg_match("|http://[[:alnum:]:?=&@/;._-]+|i",$ret[$i])) {
$links[] = $ret[$i];
} elseif(preg_match("|^/(.*)|i",$ret[$i])) {
$links[] = "http://".$info["host"]."".$ret[$i];
} elseif(preg_match("/^mailto:(.*)/i",$ret[$i])) {
}
}

Can someone please explain to me how preg_match_all works in this case, i.e. what it is looking for etc... so that I can make it accept the tilde sign.

[edited by: eelixduppy at 7:14 pm (utc) on Dec. 27, 2007]
[edit reason] disabled smileys [/edit]

d40sithui

6:38 pm on Dec 27, 2007 (gmt 0)

While I am not a regex guru, it does look like it is looking for a link, as you say. The pattern looks for the string "href", followed by an equals sign, then optionally a double quote, a single quote, and a backtick. After that it captures one or more letters, digits, and a handful of special characters, followed again by an optional double quote, single quote, and backtick. The i at the end makes the match case-insensitive. To make it accept the tilde, you can just add it to the list of special characters, like so:

("|href\=\"?'?`?([[:alnum:]~:?=&@/;._-]+)\"?'?`?|i")
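
To see what the call actually hands back, here is a minimal sketch. The sample HTML string is made up for illustration, and the `&` in front of `$matches` is left out because current PHP versions no longer accept call-time pass-by-reference. With the default flags, `$matches[0]` holds the full matches and `$matches[1]` holds whatever the parentheses captured, which is why the script reads `$matches[1]`.

```php
<?php
// Hypothetical sample input, not taken from the original script.
$html = '<a href="http://www.example.com/page.html">one</a> '
      . '<a href=/about/index.html>two</a>';

// Same pattern as in the script, with the tilde added right after [:alnum:].
$pattern = "|href\=\"?'?`?([[:alnum:]~:?=&@/;._-]+)\"?'?`?|i";

// Returns the number of full matches found.
$count = preg_match_all($pattern, $html, $matches);

// $matches[0]: the whole text each match covered, including href= and quotes.
// $matches[1]: only the part captured by the parentheses, i.e. the link itself.
print_r($matches[0]);
print_r($matches[1]);
```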

eelixduppy

6:40 pm on Dec 27, 2007 (gmt 0)

You just need to add it to the character class like this:


preg_match_all("|href\=\"?'?`?([[:alnum:]:?=&@/;._-~]+)\"?'?`?|i", $html, &$matches);
$links = array();
$ret = $matches[1];
for($i=0;isset($ret[$i]);$i++) {
if(preg_match("|http://[[:alnum:]:?=&@/;._-~]+|i",$ret[$i])) {
$links[] = $ret[$i];
} elseif(preg_match("|^/(.*)|i",$ret[$i])) {
$links[] = "http://".$info["host"]."".$ret[$i];
} elseif(preg_match("/^mailto:(.*)/i",$ret[$i])) {
}
}

If you are interested in learning about regular expressions you should visit the following links to get you started:
[php.net...]
[php.net...]
[php.net...]

[edit]
a tad bit late :)

majjk

9:21 pm on Dec 27, 2007 (gmt 0)

Thanks. I sort of suspected that was what I had to do. The problem now is that the output from my script has changed somehow. I now get output such as:


301 Moved Permanently http://members.example.com/~abc/def/

which is fine, just what I'm looking for. Before adding the tilde it just displayed:


301 Moved Permanently http://members.example.com/

The problem is that the script now seems to have a problem with hyphens. Links with a hyphen in the URL now come up as:


Error: that url you entered seems not to exist.
http://www.example

even if the link is fine and the website is up and running. Before I made the tilde change, it simply displayed:


 200 OK

which is how it should come up if the website is fine. The full script looks like this:


<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
 "http://www.w3.org/TR/html4/loose.dtd">
<html>
 <head>
 <meta http-equiv="Pragma" content="no-cache">
 <meta http-equiv="content-type" content="text/html; charset=utf-8">
 <title>Link checker</title>
 </head>
 <body>
<?php
 function GetUrls( $url ) {
 $info = @parse_url( $url );
 $fp = @fsockopen( $info["host"], 80, $errno, $errstr, 10 );
 if (!$fp ) {
 return false;
 } else {
 if( empty( $info["path"] ) ) {
 $info["path"] = "/";
 }
 if( isset( $info["query"] ) ) {
 $query = "?".$info["query"]."";
 } else {
 $query = "";
 }
 $out = "GET ".$info["path"]."".$query." HTTP/1.0\r\n";
 $out .= "Host: ".$info["host"]."\r\n";
 $out .= "Connection: close \r\n";
 $out .= "User-Agent: free-php_org_uk_link_checker/1.0\r\n\r\n";
 fwrite( $fp, $out );
 $html = '';
 while (!feof( $fp ) ) {
 $html .= fread( $fp, 8192 );
 }
 fclose( $fp );
 }
 $pieces = explode( "\r\n\r\n", $html,2 );
 $html = $pieces[1];
 unset( $pieces );
 preg_match_all("|href\=\"?'?`?([[:alnum:]:?=&@/;._-~]+)\"?'?`?|i", $html, &$matches);
 $links = array();
 $ret = $matches[1];
 for($i=0;isset($ret[$i]);$i++) {
 if(preg_match("|http://[[:alnum:]:?=&@/;._-~]+|i",$ret[$i])) {
 $links[] = $ret[$i];
 } elseif(preg_match("|^/(.*)|i",$ret[$i])) {
 $links[] = "http://".$info["host"]."".$ret[$i];
 } elseif(preg_match("/^mailto:(.*)/i",$ret[$i])) {
 }
 }
 return $links;
 }
 function GetUniqueUrls( $url ) {
 $urls = GetUrls( $url );
 if (!$urls ){
 return false;
 }
 $uurls = array();
 for( $i=0;isset($urls[$i]);$i++ ) {
 if(!in_array($urls[$i], $uurls)) {
 $uurls[] = $urls[$i];
 }
 }
 return $uurls;
 }
function getheaders($url) {
 $info = @parse_url($url);
 $fp = @fsockopen($info["host"], 80, $errno, $errstr, 10);
 if (!$fp) {
 print "<br>Error: that url you entered seems not to exist.\n";
 } else {
 if(empty($info["path"])) {
 $info["path"] = "/";
 }
 if(isset($info["query"])) {
 $query = "?".$info["query"];
 } else {
 $query = "";
 }
 $out = "GET ".$info["path"]."".$query." HTTP/1.0\r\n";
 $out .= "Host: ".$info["host"]."\r\n";
 $out .= "Connection: close \r\n";
 $out .= "User-Agent: free-php_org_uk_link_checker/1.0\r\n\r\n";
 fwrite($fp, $out);
 $html = '';
 $html .= fread($fp, 1455);
 @fclose($fp);
 }
 $pieces = explode("\r\n\r\n", $html,2);
 $headerinfo = $pieces[0];
 unset($pieces);
 return $headerinfo;
}
function getsatuscode($header) {
 // Default to an empty status in case no HTTP status line is found.
 $status = "";
 $headers = explode( "\r\n", $header );
 for( $i=0;isset( $headers[$i] );$i++ ) {
 if( preg_match( "/HTTP\/[0-9A-Za-z +]/i",$headers[$i] ) ) {
 $status = preg_replace( "/http\/[0-9]\.[0-9]/i","",$headers[$i] );
 }
 }
 return $status;
}

if(isset($_GET['url'])) {
 print "<div id=\"devcheckedlinks\">\n";
 $done = getsatuscode(getheaders($_GET['url']));
 print "Checking link: ".$_GET['url']." ...<br />\n";
 print "".$done."<br /><br />\n";
 @flush();
 @ob_flush();
 $urls = GetUniqueUrls($_GET['url']);
 for($i=0;isset($urls[$i]);$i++) {
 $done = getsatuscode(getheaders($urls[$i]));
$findcode200 = substr_count($done, "200"); //HTTP Status Code - 200 OK
$findcode302 = substr_count($done, "302"); //HTTP Status Code - 302 Found
$findcode307 = substr_count($done, "307"); //HTTP Status Code - 307 Temporary Redirect
if ($findcode200==1 || $findcode302==1 || $findcode307==1) {
$done = '<font color="green">' . $done . ' / </font>';
$printstuff2 = "";
}
else {
$done = '<br>' . $done;
$printstuff2 = $urls[$i]." ...<br>\n";
}
print $done."\n";
print $printstuff2;
 @flush();
 @ob_flush();
 sleep(3);
 }
 print "</div><br /><br /><b>DONE</b>\n";
} else {
?>
<p>Type in the box below the page
uri you want to check then click the "Check links" button. Then the link
checker will crawl your web page and get the links out of it and check them.
It will return a list of links it has checked and tell you the status.</p>
<?php
 print "<form action=\"linkchecker.php\" method=\"get\">\n";
 print "<p><input type=\"text\" name=\"url\" value=\"http://\" size=\"40\" />\n";
 print "<input type=\"submit\" value=\"Check links\" />\n";
 print "</p></form>\n";
}
?>
 </body>
</html>

(Obviously the script goes all the way down till here. webmasterworld somehow doesn't want to display it as code all the way...)
Any ideas what the problem is?

[edited by: eelixduppy at 10:19 pm (utc) on Dec. 27, 2007]
[edit reason] removed specifics [/edit]

PHP_Chimp

9:52 pm on Dec 27, 2007 (gmt 0)

preg_match_all("|href\=\"?'?`?([[:alnum:]:?=&@/;._-~]+)\"?'?`?|i", $html, &$matches);

The - inside a character class means a series of characters, so you have a series of characters from _ to ~ (I have no idea what that series is). Place the - as the last character in the character class, or escape it.
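
For what it is worth, the range can be worked out from the ASCII table: `_` is code 95 and `~` is code 126, so `_-~` covers `_`, the backtick, the lowercase letters a to z, `{`, `|`, `}` and `~`. The hyphen itself (code 45) is outside that range, and because it now acts as a range operator it is no longer a literal member of the class, so matching stops at the first hyphen in a URL. A small sketch to check this, where the variable names and the test URL are only illustrative:

```php
<?php
// Character class as posted: "_-~" is read as a range from "_" (95) to "~" (126).
$rangeClass   = '/^[[:alnum:]:?=&@\/;._-~]+/';
// Hyphen moved to the end of the class, where it is taken literally.
$literalClass = '/^[[:alnum:]:?=&@\/;.~_-]+/';

$url = 'http://www.example-site.com/~abc/'; // made-up test URL

preg_match($rangeClass, $url, $m1);
preg_match($literalClass, $url, $m2);

echo $m1[0], "\n"; // match ends just before the hyphen
echo $m2[0], "\n"; // match covers the whole URL

// List the characters covered by the accidental range.
$covered = '';
for ($c = ord('_'); $c <= ord('~'); $c++) {
    $covered .= chr($c);
}
echo $covered, "\n";
```

The truncated first result has the same shape as the cut-off URL in the error output above.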

A regex to do the same thing that is a little easier to read would be -


'%href=["\'`]([\w:?=&@/;.~-]+)["\'`]%i'

All I have done is move your "?'?`? into a character class. At the moment your regex would also match -
href="'`somelink.html"'`
If you put the quoting characters into a class, they can't appear several times in succession unless you allow them to.
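
A quick way to compare the two approaches, as a sketch only: the test strings are invented, and the class-based pattern is written here as a single-quoted PHP string with the `i` flag. The run of optional quotes accepts all three quote characters in a row, while the character class accepts exactly one on each side.

```php
<?php
// Original style: each quote character is individually optional.
$optionalQuotes = "|href\=\"?'?`?([[:alnum:]:?=&@/;.~_-]+)\"?'?`?|i";
// Character-class style: exactly one quote character on each side.
$quoteClass = '%href=["\'`]([\w:?=&@/;.~-]+)["\'`]%i';

$odd    = 'href="\'`somelink.html"\'`'; // three quote characters on each side
$normal = 'href="somelink.html"';       // an ordinary quoted attribute

$oddWithOptional = preg_match($optionalQuotes, $odd);   // accepted
$oddWithClass    = preg_match($quoteClass, $odd);       // rejected
$normalWithClass = preg_match($quoteClass, $normal);    // accepted

var_dump($oddWithOptional, $oddWithClass, $normalWithClass);
```

One trade-off to keep in mind: the class version requires a quote character, so an unquoted attribute such as href=/about.html no longer matches unless a ? is added after each class.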

<edit>
About the [ code ] block stopping: it ends at the blank line you have in the code about three quarters of the way down. The block seems to exit at blank lines... I guess that is to allow for people who forget to end their code block.

[edited by: PHP_Chimp at 9:53 pm (utc) on Dec. 27, 2007]

majjk

7:12 pm on Dec 29, 2007 (gmt 0)

The - inside a character class means a series of characters, so you have a series of characters from _ to ~ (I have no idea what that series is). Place the - as the last character in the character class, or escape it.

I put the hyphen last, which solved all my problems.
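
For reference, the fix amounts to something like the following sketch. The sample HTML and the `$host` value are placeholders, the character class is pulled into a hypothetical `$class` variable so both patterns share it, and the loop is a simplified foreach version of the one in the script.

```php
<?php
// Placeholder input and host, for illustration only.
$html = '<a href="http://members.example.com/~abc/def/">a</a>'
      . '<a href="http://www.example-site.com/">b</a>'
      . '<a href="/local-page.html">c</a>'
      . '<a href="mailto:someone@example.com">d</a>';
$host = 'www.example.com';

// Tilde added, hyphen kept as the last character so it stays literal.
$class = '[[:alnum:]:?=&@/;._~-]+';

preg_match_all("|href\=\"?'?`?(" . $class . ")\"?'?`?|i", $html, $matches);

$links = array();
foreach ($matches[1] as $link) {
    if (preg_match('|http://' . $class . '|i', $link)) {
        $links[] = $link;                      // absolute URL, keep as is
    } elseif (preg_match('|^/|', $link)) {
        $links[] = 'http://' . $host . $link;  // root-relative, prepend host
    }
    // mailto: links and anything else are skipped, as in the original loop
}

print_r($links);
```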

Thanks!