PHP/Curl/mysql

Which one is misbehaving, or is it me

         

brotherhood of LAN

8:36 am on Jan 22, 2003 (gmt 0)


I have a makeshift script here that is intended to strip pages of links within a certain directory on the website, i.e. webmasterworld.com/forumx

The forumx index page is the starting point of the crawl; any links pointing to within forumx are inserted into the DB and await spidering.

Each URL in the DB whose "visited" field is NULL is awaiting spidering; once it has been spidered and indexed, it is set to 1 instead.

So the script does this...

- Grabs the first URL in the DB that has not been visited yet (visited IS NULL)
- Seeks the page; if the status code returned in the server header is 200, continue parsing the doc
- Split page into server headers, body headers and body of document
- Extract all links that point to a file within the /forumx/ folder and remove fluff from preg_match
- Insert all URLs to /forumx/ found on the page into the DB
- Insert entire page contents into DB
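
For illustration, the link-extraction step in the list above can be sketched on its own. This is a minimal sketch under assumptions, not the script from this thread: extract_folder_links() is a hypothetical helper, and capturing the URL in a regex group is a swapped-in alternative to stripping the href="..." fluff afterwards. The depth rule (slash count minus 2) is the one the script uses.

```php
<?php
// Hypothetical helper (not part of the original script): returns every
// link under $folder found in $html, mapped to its directory depth.
function extract_folder_links($html, $folder)
{
    // Capture only the URL inside href="...", so no cleanup pass is needed.
    $pattern = '~href="(' . preg_quote($folder, '~') . '[^"]*)"~i';
    preg_match_all($pattern, $html, $matches);

    $links = [];
    foreach (array_unique($matches[1]) as $link) {
        // Same depth rule as the script: count the slashes, subtract 2.
        $links[$link] = substr_count($link, '/') - 2;
    }
    return $links;
}

// Made-up sample markup, only to exercise the helper.
$html = '<a href="/forumx/12.htm">a</a> '
      . '<a href="/other/1.htm">b</a> '
      . '<a href="/forumx/sub/3.htm">c</a>';

print_r(extract_folder_links($html, '/forumx/'));
```

With the sample markup, the link outside /forumx/ is skipped, /forumx/12.htm gets depth 0 and /forumx/sub/3.htm gets depth 1.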

This is the problem. The entire document is held in the variable $page right from the beginning.

$page should be inserted into the DB alongside the links, but for some reason, $page only gets inserted about 50% of the time.

i.e. the spider has gone through 1000 pages but only 500 of the DB records contain the contents of $page.

I can't for the life of me understand why $page is not being inserted all of the time, and why it's only sporadically being inserted with each page it finds.

As you can see in the script below, the two UPDATE statements could be merged into one. I previously had this
mysql_query("UPDATE `categories` SET visited = '1',pageinfo='$page' WHERE `id` = '$id' LIMIT 1;");

But the end result was no better. On top of $page not being inserted all the time, "visited" was not getting updated from NULL to 1.

Am I right in thinking there is no syntax error in the code below? The more this doesn't work 100%, the more I would like to believe it is the software at fault :)

<?php
// Fetch the document with CURL/HTTP1.1
$result = mysql_query("SELECT id,url FROM categories WHERE visited IS NULL LIMIT 1");
$array = mysql_fetch_array($result);
$url = $array['url'];
$id = $array['id'];

$ch = curl_init ("http://site.org$url");
curl_setopt ($ch, CURLOPT_HEADER, 1);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
$page = curl_exec ($ch);
curl_close ($ch);

// Check status code of document
preg_match("/^HTTP\/\d\.\d (.{3})/A",$page,$matches);
$statuscode = $matches[1];

if($statuscode == 200)
{
// CONTINUE PARSING THE DOCUMENT
$headers = preg_split ("/<html/ims", $page);
$doc = preg_split ("/<body/ims", $headers[1]);

$linkvolume = preg_match_all("'href=\"/certain/folder[^\"]*\"'ims",$doc[1],$matches);
foreach($matches[0] as $link)
{
$link = preg_replace("/^href=\"|\"$/","",$link);
$depth = substr_count($link, "/") - 2;
mysql_query("INSERT INTO categories (`url`,`depth`) VALUES ('$link','$depth')");
}
mysql_query("UPDATE `categories` SET visited = '1' WHERE `id` = '$id' LIMIT 1;");
mysql_query("UPDATE `categories` SET pageinfo = '$page' WHERE `id` = '$id' LIMIT 1;");

$a = '';

// echo $doc[1];
}
elseif($statuscode[0] == 4 || $statuscode[0] == 5)
{
// 404 or similar, do stuff
$a = 'DELETED';
mysql_query("DELETE FROM categories WHERE url='$url'");
}
elseif($statuscode[0] == 3)
{
// redirect or similar, do stuff
$a = 'DELETED';
mysql_query("DELETE FROM categories WHERE url='$url'");
}
else
{
// Delete URL from queue since it provided an unusual server status code
$a = 'FUNNY STATUS CODE - DELETED';
mysql_query("DELETE FROM categories WHERE url='$url'");
}
?>
<html>
<head>
<meta http-equiv="refresh" content="1;url=http://localhost/thispage.htm">
</head>
<body>

<?= $a,$url;?>

</body>
</html>

If anyone can point out the seemingly obvious error that's stopping this working OK, I would be grateful if you would post :)

Cheers
Richard

hakre

9:05 am on Jan 22, 2003 (gmt 0)


hi richard,

this looks like the second sql query does not execute in about 500 cases.

1.) check the result of the query
2.) if it's false, echo out mysql_error() to get the error message.

then you may realize that you have to escape $page before you put it into the db, because of this little char: '


mysql_query("UPDATE `categories` SET pageinfo = '".escape($page)."' WHERE `id` = '$id' LIMIT 1;");

brotherhood of LAN

9:11 am on Jan 22, 2003 (gmt 0)


hey hakre

1) Check the result

I changed php.ini to report all errors; I'm not getting an error when the page is not getting inserted, even though $page does indeed exist.

2) escape

Cool, I'll have to try that. I was doing it like this just now
mysql_query("UPDATE `categories` SET pageinfo = '<!-- $page -->' WHERE `id` = '$id' LIMIT 1;");

just in case it was the HTML.

The "crawler" is halfway through grabbing links, I will try your suggestion when it runs out of breath :)

hakre

9:28 am on Jan 22, 2003 (gmt 0)


but don't fish in the darkness. it's not only about php's error reporting; you can check whether mysql successfully executed the query:

$r = mysql_query(...);
if (!$r) { echo mysql_error(); }

then you'll get a clear answer about what happens with your query. don't make 1000s of requests only to trouble-shoot.

brotherhood of LAN

10:34 am on Jan 22, 2003 (gmt 0)


fishing with a torch now ;)

You have an error in your SQL syntax near '1'
and
query was empty

those are my alternating error messages...something to work on now....

weird thing is it seems to be inserting all the data that it's meant to now.

Something to debug before it goes 'live' anyway!

Cheers for the pointer

hakre

10:42 am on Jan 22, 2003 (gmt 0)


can you post the code of the mysql_query() again?

brotherhood of LAN

10:56 am on Jan 22, 2003 (gmt 0)


$result = mysql_query("UPDATE `categories` SET `visited` = '1', `pageinfo` = '<!-- $page -->' WHERE `id` = '$id' LIMIT 1;");

I never used to hit problems with this syntax....

The script is still running, just noticed an extra line....
mysql_query("UPDATE `categories` SET visited = '1' WHERE `id` = '$id' LIMIT 1;");

Seems that "visited" doesn't get updated using the first query, as the spider sticks to requesting the same page...after removing the 2nd query

However, $page is now being inserted, while still displaying the errors mentioned above ;)

I usually just copy and paste my SQL statements from phpmyadmin and put in my variables.....this is the first time I've had a problem borrowing their queries...

//added
I tried your statement...something about escape not being a function? probably a missing apostrophe or something no doubt.

hakre, I'll send you a sticky of exactly what I'm doing, you can see for yourself ...

andreasfriedrich

11:08 am on Jan 22, 2003 (gmt 0)


Well, escape does not work because there is no such function in PHP. Use either addslashes() or mysql_escape_string() to escape all special characters.
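
To make the effect of escaping concrete, here is a minimal sketch that only builds the query strings and prints them; it does not talk to a database. build_pageinfo_update() and the sample $page value are hypothetical, made up for illustration, and the sketch uses addslashes() because it needs no MySQL extension to run. An unescaped apostrophe ending the SQL string early would be consistent with pages failing only some of the time, namely those whose HTML happens to contain one.

```php
<?php
// Hypothetical helper for illustration; the table and column names
// follow the queries posted earlier in the thread.
function build_pageinfo_update($page, $id)
{
    // addslashes() puts a backslash before ', ", \ and NUL bytes.
    $safePage = addslashes($page);
    // Casting the id to int keeps anything but a number out of the query.
    $safeId = (int) $id;
    return "UPDATE `categories` SET visited = '1', pageinfo = '$safePage' "
         . "WHERE `id` = '$safeId' LIMIT 1";
}

// Made-up page content containing apostrophes.
$page = "<a href='/forumx/1.htm'>it's a link</a>";

// Unescaped: the first apostrophe in $page closes the SQL string early.
echo "UPDATE `categories` SET pageinfo = '$page' WHERE `id` = '7' LIMIT 1", "\n";

// Escaped: every apostrophe inside $page is preceded by a backslash.
echo build_pageinfo_update($page, 7), "\n";
```

Printing the final query string next to mysql_error() is also a quick way to see exactly what MySQL was asked to parse.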

Andreas