$connection = fsockopen($host, 80, $errorNumber, $errorString, 10);
if ($connection) {
    // tell the server which document we want
    fputs($connection, "GET $url HTTP/1.0\r\n");
    // Host isn't required by HTTP/1.0, but some sites complain otherwise
    fputs($connection, "Host: $host\r\n\r\n");
}
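To get the answer back you read from the same socket until the server closes it. A minimal sketch; the split_http_response() helper below is hypothetical, not part of the original code, and just splits the raw response at the first blank line:

```php
<?php
// Hypothetical helper: split a raw HTTP response into header lines and body.
// The header block ends at the first blank line (\r\n\r\n).
function split_http_response($raw) {
    $parts = explode("\r\n\r\n", $raw, 2);
    $headers = explode("\r\n", $parts[0]);
    $body = isset($parts[1]) ? $parts[1] : '';
    return array($headers, $body);
}

// Collecting everything the server sends back on the socket:
// $raw = '';
// while (!feof($connection)) {
//     $raw .= fgets($connection, 1024);
// }
// fclose($connection);
// list($headers, $body) = split_http_response($raw);
```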
Any help is much appreciated.
The first thing that came up for me is that I need to identify who I am.
The second thing I thought of was that it isn't exactly IE 6 fetching the URLs; it's more like an NT4 server.
How do I identify myself as something like "ikbenhet1 crawler", or something obvious enough that sites can really ban my crawler if they want to?
Should/can I change the NT4 user agent to something more common that spiders use?
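If you stick with fopen()-style fetching, PHP lets you set the User-Agent that the URL wrappers send via the user_agent ini setting, so you can identify the crawler there. The bot name and info URL below are placeholders for your own:

```php
<?php
// Set the User-Agent that fopen()/file_get_contents() send for http:// URLs.
// "ikbenhet1 crawler" and the info URL are placeholders for your own values.
ini_set('user_agent', 'ikbenhet1 crawler (+http://www.example.com/bot-info.html)');

// Any wrapper-based fetch after this point identifies itself as the crawler:
// $page = file_get_contents('http://www.example.com/');
```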
Using fopen() gives you HTTP 1.0, and you can't retrieve the page's headers; with cURL you can.
You can also set the user agent, the referrer, or generally anything that involves HTTP 1.1.
$ch = curl_init($urltospider);
curl_setopt($ch, CURLOPT_HEADER, 1);         // include response headers in the output
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the page instead of printing it
curl_setopt($ch, CURLOPT_USERAGENT, "ikbenhet1 crawler");
curl_setopt($ch, CURLOPT_REFERER, "referer to a spider info page?");
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
echo $page = curl_exec($ch);
curl_close($ch);
This will grab the page and return the headers and document, provided the server is reachable within 20 seconds. Make sure to have a referer or user agent pointing to an info page ;) It may lessen the chance of getting banned.
Where do I do this? I only have access to my virtual hosting, not the "root" of the server.
Sorry if it's obvious; I don't understand what they want me to do or where to do it.
Sorry, I may not be much help here; I can't remember how I got mine working ;) Do you have access to your php.ini file? Do you know which PHP version/server you are on, etc.? Someone reading might be able to help a whole lot more.
There should be an option in php.ini to enable the cURL extension; you just need to uncomment it:
extension=php_curl.dll
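If you're on shared hosting and can't see php.ini, you can at least check from a script whether the extension is already loaded. A quick diagnostic, nothing more:

```php
<?php
// Quick check whether the curl extension is available on this host.
if (extension_loaded('curl') && function_exists('curl_init')) {
    echo "curl is available\n";
} else {
    echo "curl is NOT available -- ask your host to enable it\n";
}
```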
I'm always installing things localhost-style, so I'm not sure if your shared hosting will be a problem... has anyone else installed it?
To use PHP's cURL support you must also compile PHP --with-curl[=DIR], where DIR is the location of the directory containing the lib and include directories. In the "include" directory there should be a folder named "curl" containing the easy.h and curl.h files, and there should be a file named "libcurl.a" in the "lib" directory. These functions were added in PHP 4.0.2.
You have an up-to-date PHP version then... I guess you'd need access to php.ini, re: the above.
I have no idea how virtual hosting will affect your situation, but I doubt it will stop you from running something like cURL.
What if I put:
- the bot information page after Referer
- "ikbenhetcrawler" after User-Agent
$put .= 'Host: '.$url["host"]."\r\n";
$put .= "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0b; Windows 98)\r\n";
$put .= "Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, image/tiff, multipart/x-mixed-replace, */*\r\n";
$put .= "Accept-Charset: iso-8859-1, utf-8, iso-10646-ucs-2, macintosh, windows-1252, *\r\n";
$put .= 'Referer: '.$GLOBALS['url']."\r\n";
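The same header-building approach, sketched with the crawler identifying itself instead of masquerading as IE. The bot name and info-page URL are placeholders:

```php
<?php
// Build an HTTP/1.0 request in which the crawler identifies itself honestly.
// The bot name and info-page URL are placeholders for your own values.
function build_crawler_request($host, $path) {
    $put  = "GET $path HTTP/1.0\r\n";
    $put .= "Host: $host\r\n";
    $put .= "User-Agent: ikbenhet1 crawler (+http://www.example.com/bot-info.html)\r\n";
    $put .= "Accept: text/html, */*\r\n";
    $put .= "Connection: close\r\n\r\n"; // blank line terminates the header block
    return $put;
}

// Sending it over a socket opened with fsockopen():
// fwrite($connection, build_crawler_request('www.example.com', '/index.html'));
```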
I'll have to crawl the directory again to see if this actually solves the problem; I've got a hunch it will, since many sites block blank user agents by default via .htaccess.
For now the user agent is: Mozilla/4.0 (compatible; MSIE 6.0b; Windows 98). That should be OK, I guess.
Thank you!
added> yep, it works; no more access denied. Thanks.