preg_match_all("|href\=\"?'?`?([[:alnum:]:?=&@/;._-]+)\"?'?`?|i", $html, &$matches);
$links = array();
$ret = $matches[1];
for($i=0;isset($ret[$i]);$i++) {
if(preg_match("|http://[[:alnum:]:?=&@/;._-]+|i",$ret[$i])) {
$links[] = $ret[$i];
} elseif(preg_match("|^/(.*)|i",$ret[$i])) {
$links[] = "http://".$info["host"]."".$ret[$i];
} elseif(preg_match("/^mailto:(.*)/i",$ret[$i])) {
}
}
Can someone please explain to me how preg_match_all works in this case, i.e. what it is looking for, so that I can make it accept the tilde sign?
[edited by: eelixduppy at 7:14 pm (utc) on Dec. 27, 2007]
[edit reason] disabled smileys [/edit]
("|href\=\"?'?`?([[:alnum:]~:?=&@/;._-]+)\"?'?`?|i")
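To answer the "what is it looking for" part, here is a minimal sketch of that pattern taken apart piece by piece. The HTML fragment is made up for illustration, and the tilde sits at the front of the character class as in the line above; treat it as an illustration rather than a drop-in replacement for the script.

```php
<?php
// Made-up HTML fragment (hypothetical data, not from the original script).
$html = '<a href="http://members.example.com/~abc/def/">one</a> '
      . "<a href='/local-page.html'>two</a>";

// |...|i   : "|" is the pattern delimiter, "i" makes the match case-insensitive
// href\=   : the literal text href=
// \"?'?`?  : an optional ", ' or ` before the URL
// ( ... )  : capture group 1, the URL itself
// [[:alnum:]~:?=&@/;._-]+ : one or more letters, digits or the listed symbols;
//            the hyphen is last in the class, so it is a literal hyphen
$pattern = "|href\=\"?'?`?([[:alnum:]~:?=&@/;._-]+)\"?'?`?|i";

preg_match_all($pattern, $html, $matches);

// $matches[0] holds the full matches, $matches[1] the captured URLs.
print_r($matches[1]);
```

With this input, `$matches[1]` should contain the two URLs, including the one with the tilde.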
preg_match_all("|href\=\"?'?`?([[:alnum:]:?=&@/;._-~]+)\"?'?`?|i", $html, &$matches);
$links = array();
$ret = $matches[1];
for($i=0;isset($ret[$i]);$i++) {
if(preg_match("|http://[[:alnum:]:?=&@/;._-~]+|i",$ret[$i])) {
$links[] = $ret[$i];
} elseif(preg_match("|^/(.*)|i",$ret[$i])) {
$links[] = "http://".$info["host"]."".$ret[$i];
} elseif(preg_match("/^mailto:(.*)/i",$ret[$i])) {
}
}
If you are interested in learning about regular expressions you should visit the following links to get you started:
[php.net...]
[php.net...]
[php.net...]
[edit]
a tad bit late :)
301 Moved Permanently http://members.example.com/~abc/def/
which is fine, just what I'm looking for. Before adding the tilde it just displayed:
301 Moved Permanently http://members.example.com/
The problem is that the script now seems to have a problem with hyphens. Links with a hyphen in the URL now come up as:
Error: that url you entered seems not to exist.
http://www.example
even if the link is fine and the website is up and running. Before making this change for the tilde it simply displayed:
200 OK
which is how it should come up if the website is fine. The full script looks like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title>Link checker</title>
<body>
<?php
function GetUrls( $url ) {
$info = @parse_url( $url );
$fp = @fsockopen( $info["host"], 80, $errno, $errstr, 10 );
if (!$fp ) {
return false;
} else {
if( empty( $info["path"] ) ) {
$info["path"] = "/";
}
if( isset( $info["query"] ) ) {
$query = "?".$info["query"]."";
} else {
$query = "";
}
$out = "GET ".$info["path"]."".$query." HTTP/1.0\r\n";
$out .= "Host: ".$info["host"]."\r\n";
$out .= "Connection: close \r\n";
$out .= "User-Agent: free-php_org_uk_link_checker/1.0\r\n\r\n";
fwrite( $fp, $out );
$html = '';
while (!feof( $fp ) ) {
$html .= fread( $fp, 8192 );
}
fclose( $fp );
}
$pieces = explode( "\r\n\r\n", $html,2 );
$html = $pieces[1];
unset( $pieces );
preg_match_all("|href\=\"?'?`?([[:alnum:]:?=&@/;._-~]+)\"?'?`?|i", $html, &$matches);
$links = array();
$ret = $matches[1];
for($i=0;isset($ret[$i]);$i++) {
if(preg_match("|http://[[:alnum:]:?=&@/;._-~]+|i",$ret[$i])) {
$links[] = $ret[$i];
} elseif(preg_match("|^/(.*)|i",$ret[$i])) {
$links[] = "http://".$info["host"]."".$ret[$i];
} elseif(preg_match("/^mailto:(.*)/i",$ret[$i])) {
}
}
return $links;
}
function GetUniqueUrls( $url ) {
$urls = GetUrls( $url );
if (!$urls ){
return false;
}
$uurls = array();
for( $i=0;isset($urls[$i]);$i++ ) {
if(!in_array($urls[$i], $uurls)) {
$uurls[] = $urls[$i];
}
}
return $uurls;
}
function getheaders($url) {
$info = @parse_url($url);
$fp = @fsockopen($info["host"], 80, $errno, $errstr, 10);
if (!$fp) {
print "<br>Error: that url you entered seems not to exist.\n";
} else {
if(empty($info["path"])) {
$info["path"] = "/";
}
if(isset($info["query"])) {
$query = "?".$info["query"];
} else {
$query = "";
}
$out = "GET ".$info["path"]."".$query." HTTP/1.0\r\n";
$out .= "Host: ".$info["host"]."\r\n";
$out .= "Connection: close \r\n";
$out .= "User-Agent: free-php_org_uk_link_checker/1.0\r\n\r\n";
fwrite($fp, $out);
$html = '';
$html .= fread($fp, 1455);
@fclose($fp);
}
$pieces = explode("\r\n\r\n", $html,2);
$headerinfo = $pieces[0];
unset($pieces);
return $headerinfo;
}
function getsatuscode($header) {
$headers = explode( "\r\n", $header );
for( $i=0;isset( $headers[$i] );$i++ ) {
if( preg_match( "/HTTP\/[0-9A-Za-z +]/i",$headers[$i] ) ) {
$status = preg_replace( "/http\/[0-9]\.[0-9]/i","",$headers[$i] );
}
}
$rules = $pieces[1];
unset( $pieces );
return $status;
}
if(isset($_GET['url'])) {
print "<div id=\"devcheckedlinks\">\n";
$done = getsatuscode(getheaders($_GET['url']));
print "Checking link: ".$_GET['url']." ...<br />\n";
print "".$done."<br /><br />\n";
@flush();
@ob_flush();
$urls = GetUniqueUrls($_GET['url']);
for($i=0;isset($urls[$i]);$i++) {
$done = getsatuscode(getheaders($urls[$i]));
$findcode200 = substr_count($done, "200"); //HTTP Status Code - 200 OK
$findcode302 = substr_count($done, "302"); //HTTP Status Code - 302 Found
$findcode307 = substr_count($done, "307"); //HTTP Status Code - 307 Temporary Redirect
if ($findcode200==1 || $findcode302==1 || $findcode307==1) {
$done = '<font color="green">' . $done . ' / </font>';
$printstuff2 = "";
}
else {
$done = '<br>' . $done;
$printstuff2 = $urls[$i]." ...<br>\n";
}
print $done."\n";
print $printstuff2;
@flush();
@ob_flush();
sleep(3);
}
print "</div><br /><br /><b>DONE</b>\n";
} else {
?>
<p>Type in the box below the page
uri you want to check then click the "Check links" button. Then the link
checker will crawl your web page and get the links out of it and check them.
It will return a list of links it has checked and tell you the status.</p>
<?php
print "<form action=\"linkchecker.php\" method=\"get\">\n";
print "<p><input type=\"text\" name=\"url\" value=\"http://\" size=\"40\" />\n";
print "<input type=\"submit\" value=\"Check links\" />\n";
print "</p></form>\n";
}
?>
</body>
</html>
(Obviously the script goes all the way down to here. WebmasterWorld somehow doesn't want to display it as code all the way...)
Any ideas what the problem is?
[edited by: eelixduppy at 10:19 pm (utc) on Dec. 27, 2007]
[edit reason] removed specifics [/edit]
preg_match_all("|href\=\"?'?`?([[:alnum:]:?=&@/;._-~]+)\"?'?`?|i", $html, &$matches);
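The hyphen problem most likely comes from where the tilde was added. Inside a character class, `_-~` is read as a range running from `_` to `~`, so the hyphen stops being a literal character and the match ends at the first hyphen in the URL. The sketch below compares the line as posted with a version where the hyphen is moved to the end of the class; the test URL and variable names are made up for the comparison.

```php
<?php
// Hypothetical link with both a hyphen and a tilde in it.
$html = '<a href="http://www.example-site.com/~abc/">x</a>';

// As posted: "_-~" is a range from "_" to "~". It covers the backtick,
// lowercase letters, "{", "|", "}" and "~", but not "-" itself.
$asPosted = "|href\=\"?'?`?([[:alnum:]:?=&@/;._-~]+)\"?'?`?|i";

// Hyphen moved to the end of the class, where it is a literal character.
$hyphenLast = "|href\=\"?'?`?([[:alnum:]:?=&@/;._~-]+)\"?'?`?|i";

preg_match_all($asPosted, $html, $a);
preg_match_all($hyphenLast, $html, $b);

echo $a[1][0], "\n"; // http://www.example
echo $b[1][0], "\n"; // http://www.example-site.com/~abc/
```

The truncated first result matches the `http://www.example` seen in the error output. The same change would apply to the `http://` check a few lines further down in the script.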
A regex to do the same thing that is a little easier to read would be:
"%href=["'`]([\w:?=&@/;.~-]+)["'`]%"
<edit>
The thing with the [ code ] block stopping is the blank line you have in the code about three quarters of the way down. The block seems to exit at blank lines... I guess it is to allow for people who forget to end their code block.
[edited by: PHP_Chimp at 9:53 pm (utc) on Dec. 27, 2007]