$connection = fsockopen($host, 80, $errorNumber, $errorString, 10);
if ($connection) {
    // tell the server which document we want
    fputs($connection, "GET $url HTTP/1.0\r\n");
    // Host isn't required by HTTP/1.0, but some sites complain otherwise
    fputs($connection, "Host: $host\r\n\r\n");
}
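To get the answer back you read from the same socket until the server closes it. A minimal sketch; the split_http_response() helper below is hypothetical, not part of the original code, and just splits the raw response at the first blank line:

```php
<?php
// Hypothetical helper: split a raw HTTP response into header lines and body.
// The header block ends at the first blank line (\r\n\r\n).
function split_http_response($raw) {
    $parts = explode("\r\n\r\n", $raw, 2);
    $headers = explode("\r\n", $parts[0]);
    $body = isset($parts[1]) ? $parts[1] : '';
    return array($headers, $body);
}

// Collecting everything the server sends back on the socket:
// $raw = '';
// while (!feof($connection)) {
//     $raw .= fgets($connection, 1024);
// }
// fclose($connection);
// list($headers, $body) = split_http_response($raw);
```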
Any help is much appreciated.
The first thing that came up for me is that I need to identify who I am.
The second thing I thought of was that it isn't exactly IE 6 fetching the URLs; it's more like an NT4 server.
How do I identify myself as something like "ikbenhet1 crawler", or something obvious enough that sites can really ban my crawler if they want to?
Should/can I change the NT4 user agent to something more common that spiders use?
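If you stick with fopen()-style fetching, PHP lets you set the User-Agent that the URL wrappers send via the user_agent ini setting, so you can identify the crawler there. The bot name and info URL below are placeholders for your own:

```php
<?php
// Set the User-Agent that fopen()/file_get_contents() send for http:// URLs.
// "ikbenhet1 crawler" and the info URL are placeholders for your own values.
ini_set('user_agent', 'ikbenhet1 crawler (+http://www.example.com/bot-info.html)');

// Any wrapper-based fetch after this point identifies itself as the crawler:
// $page = file_get_contents('http://www.example.com/');
```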
Using fopen() gives you HTTP 1.0, and you can't retrieve the page's headers; with cURL you can.
You can also set the user agent, the referrer, or generally anything that involves HTTP 1.1.
$ch = curl_init($urltospider);
curl_setopt($ch, CURLOPT_HEADER, 1);         // include response headers in the output
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the page instead of printing it
curl_setopt($ch, CURLOPT_USERAGENT, "ikbenhet1 crawler");
curl_setopt($ch, CURLOPT_REFERER, "referer to a spider info page?");
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
echo $page = curl_exec($ch);
curl_close($ch);
This will grab the page and return the headers and document, provided the server is reachable within 20 seconds. Make sure to have a referer or user agent pointing to an info page ;) It may lessen the chance of getting banned.
Where do I do this? I only have access to my virtual hosting, not the "root" of the server.
Sorry if it's obvious; I don't understand what they want me to do or where to do it.
Sorry, I may not be much help here; I can't remember how I got mine working ;) Do you have access to your php.ini file? Do you know which PHP version/server you are on, etc.? Someone reading might be able to help a whole lot more.
There should be an option in php.ini to enable the cURL extension; you just need to uncomment it:
extension=php_curl.dll
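If you're on shared hosting and can't see php.ini, you can at least check from a script whether the extension is already loaded. A quick diagnostic, nothing more:

```php
<?php
// Quick check whether the curl extension is available on this host.
if (extension_loaded('curl') && function_exists('curl_init')) {
    echo "curl is available\n";
} else {
    echo "curl is NOT available -- ask your host to enable it\n";
}
```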
I'm always installing things localhost-style, so I'm not sure if your shared hosting will be a problem... has anyone else installed it?
To use PHP's cURL support you must also compile PHP --with-curl[=DIR], where DIR is the location of the directory containing the lib and include directories. In the "include" directory there should be a folder named "curl" containing the easy.h and curl.h files, and there should be a file named "libcurl.a" in the "lib" directory. These functions were added in PHP 4.0.2.
You have an up-to-date PHP version then... I guess you'd need access to php.ini, re: the above.
I have no idea how virtual hosting will affect your situation, but I doubt it will stop you from running something like cURL.
What if I put:
- the bot information page after Referer
- "ikbenhetcrawler" after User-Agent
$put .= 'Host: '.$url["host"]."\r\n";
$put .= "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0b; Windows 98)\r\n";
$put .= "Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, image/tiff, multipart/x-mixed-replace, */*\r\n";
$put .= "Accept-Charset: iso-8859-1, utf-8, iso-10646-ucs-2, macintosh, windows-1252, *\r\n";
$put .= 'Referer: '.$GLOBALS['url']."\r\n";
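The same header-building approach, sketched with the crawler identifying itself instead of masquerading as IE. The bot name and info-page URL are placeholders:

```php
<?php
// Build an HTTP/1.0 request in which the crawler identifies itself honestly.
// The bot name and info-page URL are placeholders for your own values.
function build_crawler_request($host, $path) {
    $put  = "GET $path HTTP/1.0\r\n";
    $put .= "Host: $host\r\n";
    $put .= "User-Agent: ikbenhet1 crawler (+http://www.example.com/bot-info.html)\r\n";
    $put .= "Accept: text/html, */*\r\n";
    $put .= "Connection: close\r\n\r\n"; // blank line terminates the header block
    return $put;
}

// Sending it over a socket opened with fsockopen():
// fwrite($connection, build_crawler_request('www.example.com', '/index.html'));
```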
I'll have to crawl the directory again to see if this actually solves the problem; I've got a hunch it will, since many sites block blank user agents by default via .htaccess.
For now the user agent is: Mozilla/4.0 (compatible; MSIE 6.0b; Windows 98). That should be OK, I guess.
Thank you!
added> yep, it works; no more access denied. Thanks.