The forumx index page is the starting point of the crawl; any links pointing within forumx are inserted into the DB and await spidering.
Each URL in the DB whose `visited` field is NULL is awaiting spidering; once it has been spidered and indexed, the field is set to 1 instead.
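For context, here is a minimal sketch of the `categories` table this assumes, reconstructed from the columns the script touches (the column types are guesses, not the actual schema):

```sql
-- Assumed layout of `categories`, inferred from the queries below.
CREATE TABLE categories (
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT,
    url      VARCHAR(255) NOT NULL,
    depth    TINYINT      NOT NULL DEFAULT 0,
    visited  TINYINT      NULL,      -- NULL = queued, 1 = spidered
    pageinfo MEDIUMTEXT   NULL,      -- full page source
    PRIMARY KEY (id)
);
```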
So the script does this...
- Grabs the first URL in the DB whose `visited` field is still NULL
- Fetches the page; if the status code returned in the server header is 200, continue parsing the doc
- Split page into server headers, body headers and body of document
- extract all links that point to a file within the /forumx/ folder and remove fluff from preg_match
- insert all URLs to /forumx/ found on the page into the DB
- insert entire page contents into DB
This is the problem: the entire document is held in the variable $page right from the beginning.
$page should be inserted into the DB alongside the links, but for some reason $page only gets inserted about 50% of the time.
i.e. the spider has gone through 1000 pages but only 500 of the DB records contain the contents of $page.
I can't for the life of me understand why $page is not being inserted every time, and why it only makes it in sporadically as pages are found.
As you can see in the script below, the two INSERT statements could be merged into one. I previously had this:
mysql_query("UPDATE `categories` SET visited = '1',pageinfo='$page' WHERE `id` = '$id' LIMIT 1;");
But the end result was no better. On top of $page not being inserted all the time, "visited" was not getting updated from NULL to 1.
Am I right in thinking there is no syntax error in the code below? The more this fails to work 100%, the more I'd like to believe the software is at fault :)
<?php
// Fetch the document with CURL/HTTP1.1
$result = mysql_query("SELECT id,url FROM categories WHERE visited IS NULL LIMIT 1");
$array = mysql_fetch_array($result);
$url = $array['url'];
$id = $array['id'];
$ch = curl_init ("http://site.org$url");
curl_setopt ($ch, CURLOPT_HEADER, 1);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
$page = curl_exec ($ch);
curl_close ($ch);
// Check status code of document
preg_match("/^HTTP\/\d\.\d (.{3})/A",$page,$matches);
$statuscode = $matches[1];
if($statuscode == 200)
{
// CONTINUE PARSING THE DOCUMENT
$headers = preg_split ("/<html/ims", $page);
$doc = preg_split ("/<body/ims", $headers[1]);
$linkvolume = preg_match_all("'href=\"/certain/folder[^\"]*\"'ims",$doc[1],$matches);
foreach($matches[0] as $link)
{
$link = preg_replace("/^href=\"|\"$/","",$link);
$depth = substr_count($link, "/") - 2;
mysql_query("INSERT INTO categories (`url`,`depth`) VALUES ('$link','$depth')");
}
mysql_query("UPDATE `categories` SET visited = '1' WHERE `id` = '$id' LIMIT 1;");
mysql_query("UPDATE `categories` SET pageinfo = '$page' WHERE `id` = '$id' LIMIT 1;");
$a = '';
// echo $doc[1];
}
elseif($statuscode[0] == 4 || $statuscode[0] == 5)
{
// 404 or similar, do stuff
$a = 'DELETED';
mysql_query("DELETE FROM categories WHERE url='$url'");
}
elseif($statuscode[0] == 3)
{
// redirect or similar, do stuff
$a = 'DELETED';
mysql_query("DELETE FROM categories WHERE url='$url'");
}
else
{
// Delete URL from queue since it provided an unusual server status code
$a = 'FUNNY STATUS CODE - DELETED';
mysql_query("DELETE FROM categories WHERE url='$url'");
}
?>
<html>
<head>
<meta http-equiv="refresh" content="1;url=http://localhost/thispage.htm">
</head>
<body>
<?= $a,$url;?>
</body>
</html>
If anyone can point out the seemingly obvious error that's stopping this from working OK, I would be grateful if you'd post :)
Cheers
Richard
This looks like the second SQL query does not execute in about 500 cases.
1.) check the result of the query
2.) if it's false, echo out mysql_error() to get the error message.
Then you may realize that you have to escape $page before you put it into the DB, because of this ' little char:
mysql_query("UPDATE `categories` SET pageinfo = '".escape($page)."' WHERE `id` = '$id' LIMIT 1;");
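(The `escape()` above is presumably shorthand; the old mysql extension's actual function is `mysql_real_escape_string()`. Here is a standalone sketch of why the escaping matters; `addslashes()` stands in for `mysql_real_escape_string()` only so the example runs without a DB connection.)

```php
<?php
// Why the raw $page breaks the UPDATE: any apostrophe in the fetched HTML
// terminates the SQL string literal early. addslashes() is used here only
// so this runs without a live connection; real code should use
// mysql_real_escape_string() instead.
$page = "<body>Don't let quotes through</body>";
$id   = 7;

$bad  = "UPDATE `categories` SET pageinfo = '$page' WHERE `id` = '$id' LIMIT 1;";
$good = "UPDATE `categories` SET pageinfo = '" . addslashes($page) . "' WHERE `id` = '$id' LIMIT 1;";

echo $bad,  "\n"; // the apostrophe in "Don't" ends the literal -> SQL syntax error
echo $good, "\n"; // the apostrophe is sent as \' -> MySQL parses it fine
```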
1) Check the result
I changed php.ini to report all errors; I'm not getting an error when the page is not inserted, even though $page does indeed exist.
2) escape
Cool, I'll have to try that. I was doing it like this just now:
mysql_query("UPDATE `categories` SET pageinfo = '<!-- $page -->' WHERE `id` = '$id' LIMIT 1;");
just in case it was the HTML.
The "crawler" is halfway through grabbing links, I will try your suggestion when it runs out of breath :)
$r = mysql_query(...);
if (!$r) { echo mysql_error(); }
Then you'll get a clear answer about what's happening with your query. Don't make thousands of requests just to troubleshoot.
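To avoid sprinkling that check around every call, the same pattern can live in a tiny fail-fast wrapper (the name `query_or_die` is made up here, not a built-in):

```php
<?php
// Minimal fail-fast helper for the old mysql extension: run a query,
// and stop with the MySQL error plus the offending SQL if it fails.
function query_or_die($sql)
{
    $r = mysql_query($sql);
    if ($r === false) {
        die("Query failed: " . mysql_error() . "\nSQL was: $sql\n");
    }
    return $r;
}

// Usage:
// query_or_die("UPDATE `categories` SET visited = '1' WHERE `id` = '$id' LIMIT 1;");
```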
You have an error in your SQL syntax near '1'
and
query was empty
Those are my alternating error messages... something to work on now....
Weird thing is, it seems to be inserting all the data it's meant to now.
Something to debug before it goes 'live' anyway!
Cheers for the pointer
I never used to hit problems with this syntax....
The script is still running, just noticed an extra line....
mysql_query("UPDATE `categories` SET visited = '1' WHERE `id` = '$id' LIMIT 1;");
Seems that "visited" doesn't get updated by the first query, as the spider sticks to requesting the same page... after removing the 2nd query.
However, $page is now being inserted, and still displaying the errors mentioned above ;)
I usually just copy and paste my SQL statements from phpMyAdmin and put in my variables... this is the first time I've had a problem borrowing their queries...
//added
I tried your statement... something about escape not being a function? Probably a missing apostrophe or something, no doubt.
hakre, I'll send you a sticky of exactly what I'm doing, so you can see for yourself...
Andreas