The forumx index page is the starting point of the crawl; any links pointing within forumx are inserted into the DB and await spidering.
Each URL in the DB whose `visited` field is NULL is awaiting spidering; once it has been spidered and indexed, the field is set to 1 instead.
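For context, here is a minimal sketch of the `categories` table this assumes, reconstructed from the columns the script touches (the column types are guesses, not the actual schema):

```sql
-- Assumed layout of `categories`, inferred from the queries below.
CREATE TABLE categories (
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT,
    url      VARCHAR(255) NOT NULL,
    depth    TINYINT      NOT NULL DEFAULT 0,
    visited  TINYINT      NULL,      -- NULL = queued, 1 = spidered
    pageinfo MEDIUMTEXT   NULL,      -- full page source
    PRIMARY KEY (id)
);
```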
So the script does this...
- Grabs the first URL in the DB whose `visited` field is still NULL
- Fetches the page; if the status code returned in the server header is 200, continue parsing the doc
- Split page into server headers, body headers and body of document
- extract all links that point to a file within the /forumx/ folder and remove fluff from preg_match
- insert all URLs to /forumx/ found on the page into the DB
- insert entire page contents into DB
This is the problem: the entire document is held in the variable $page right from the beginning.
$page should be inserted into the DB alongside the links, but for some reason $page only gets inserted about 50% of the time.
i.e. the spider has gone through 1000 pages but only 500 of the DB records contain the contents of $page.
I can't for the life of me understand why $page is not being inserted every time, and why it only makes it in sporadically as pages are found.
As you can see in the script below, the two INSERT statements could be merged into one. I previously had this:
mysql_query("UPDATE `categories` SET visited = '1',pageinfo='$page' WHERE `id` = '$id' LIMIT 1;");
But the end result was no better. On top of $page not being inserted all the time, "visited" was not getting updated from NULL to 1.
Am I right in thinking there is no syntax error in the code below? The more this fails to work 100%, the more I'd like to believe the software is at fault :)
<?php
// Fetch the document with CURL/HTTP1.1
$result = mysql_query("SELECT id,url FROM categories WHERE visited IS NULL LIMIT 1");
$array = mysql_fetch_array($result);
$url = $array['url'];
$id = $array['id'];
$ch = curl_init ("http://site.org$url");
curl_setopt ($ch, CURLOPT_HEADER, 1);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
$page = curl_exec ($ch);
curl_close ($ch);
// Check status code of document
preg_match("/^HTTP\/\d\.\d (.{3})/A",$page,$matches);
$statuscode = $matches[1];
if($statuscode == 200)
{
// CONTINUE PARSING THE DOCUMENT
$headers = preg_split ("/<html/ims", $page);
$doc = preg_split ("/<body/ims", $headers[1]);
$linkvolume = preg_match_all("'href=\"/certain/folder[^\"]*\"'ims",$doc[1],$matches);
foreach($matches[0] as $link)
{
$link = preg_replace("/^href=\"|\"$/","",$link);
$depth = substr_count($link, "/") - 2;
mysql_query("INSERT INTO categories (`url`,`depth`) VALUES ('$link','$depth')");
}
mysql_query("UPDATE `categories` SET visited = '1' WHERE `id` = '$id' LIMIT 1;");
mysql_query("UPDATE `categories` SET pageinfo = '$page' WHERE `id` = '$id' LIMIT 1;");
$a = '';
// echo $doc[1];
}
elseif($statuscode[0] == 4 || $statuscode[0] == 5)
{
// 404 or similar, do stuff
$a = 'DELETED';
mysql_query("DELETE FROM categories WHERE url='$url'");
}
elseif($statuscode[0] == 3)
{
// redirect or similar, do stuff
$a = 'DELETED';
mysql_query("DELETE FROM categories WHERE url='$url'");
}
else
{
// Delete URL from queue since it provided an unusual server status code
$a = 'FUNNY STATUS CODE - DELETED';
mysql_query("DELETE FROM categories WHERE url='$url'");
}
?>
<html>
<head>
<meta http-equiv="refresh" content="1;url=http://localhost/thispage.htm">
</head>
<body>
<?= $a,$url;?>
</body>
</html>
If anyone can point out the seemingly obvious error that's stopping this from working OK, I would be grateful if you'd post :)
Cheers
Richard
This looks like the second SQL query does not execute in about 500 cases.
1.) check the result of the query
2.) if it's false, echo out mysql_error() to get the error message.
Then you may realize that you have to escape $page before you put it into the DB, because of this ' little char:
mysql_query("UPDATE `categories` SET pageinfo = '".escape($page)."' WHERE `id` = '$id' LIMIT 1;");
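(The `escape()` above is presumably shorthand; the old mysql extension's actual function is `mysql_real_escape_string()`. Here is a standalone sketch of why the escaping matters; `addslashes()` stands in for `mysql_real_escape_string()` only so the example runs without a DB connection.)

```php
<?php
// Why the raw $page breaks the UPDATE: any apostrophe in the fetched HTML
// terminates the SQL string literal early. addslashes() is used here only
// so this runs without a live connection; real code should use
// mysql_real_escape_string() instead.
$page = "<body>Don't let quotes through</body>";
$id   = 7;

$bad  = "UPDATE `categories` SET pageinfo = '$page' WHERE `id` = '$id' LIMIT 1;";
$good = "UPDATE `categories` SET pageinfo = '" . addslashes($page) . "' WHERE `id` = '$id' LIMIT 1;";

echo $bad,  "\n"; // the apostrophe in "Don't" ends the literal -> SQL syntax error
echo $good, "\n"; // the apostrophe is sent as \' -> MySQL parses it fine
```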
1) Check the result
I changed php.ini to report all errors; I'm not getting an error when the page is not inserted, even though $page does indeed exist.
2) escape
Cool, I'll have to try that. I was doing it like this just now:
mysql_query("UPDATE `categories` SET pageinfo = '<!-- $page -->' WHERE `id` = '$id' LIMIT 1;");
just in case it was the HTML.
The "crawler" is halfway through grabbing links, I will try your suggestion when it runs out of breath :)
$r = mysql_query(...);
if (!$r) { echo mysql_error(); }
Then you'll get a clear answer about what's happening with your query. Don't make thousands of requests just to troubleshoot.
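To avoid sprinkling that check around every call, the same pattern can live in a tiny fail-fast wrapper (the name `query_or_die` is made up here, not a built-in):

```php
<?php
// Minimal fail-fast helper for the old mysql extension: run a query,
// and stop with the MySQL error plus the offending SQL if it fails.
function query_or_die($sql)
{
    $r = mysql_query($sql);
    if ($r === false) {
        die("Query failed: " . mysql_error() . "\nSQL was: $sql\n");
    }
    return $r;
}

// Usage:
// query_or_die("UPDATE `categories` SET visited = '1' WHERE `id` = '$id' LIMIT 1;");
```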
You have an error in your SQL syntax near '1'
and
query was empty
Those are my alternating error messages... something to work on now....
Weird thing is, it seems to be inserting all the data it's meant to now.
Something to debug before it goes 'live' anyway!
Cheers for the pointer
I never used to hit problems with this syntax....
The script is still running, just noticed an extra line....
mysql_query("UPDATE `categories` SET visited = '1' WHERE `id` = '$id' LIMIT 1;");
Seems that "visited" doesn't get updated by the first query, as the spider sticks to requesting the same page... after removing the 2nd query.
However, $page is now being inserted, and still displaying the errors mentioned above ;)
I usually just copy and paste my SQL statements from phpMyAdmin and put in my variables... this is the first time I've had a problem borrowing their queries...
//added
I tried your statement... something about escape not being a function? Probably a missing apostrophe or something, no doubt.
hakre, I'll send you a sticky of exactly what I'm doing, so you can see for yourself...
Andreas