

gethostbyaddr problem

         

Patrick Taylor

10:20 pm on Nov 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




// Get host
if ($HTTP_SERVER_VARS["HTTP_X_FORWARDED_FOR"] != "") {
    $le_host = @gethostbyaddr($HTTP_SERVER_VARS["HTTP_X_FORWARDED_FOR"]);
} else {
    $le_host = @gethostbyaddr($HTTP_SERVER_VARS["REMOTE_ADDR"]);
}

// Check $le_host for strings
if (strstr($le_host, "googlebot")) {
    $bot = 'googlebot';
} else if (strstr($le_host, "msnbot")) {
    $bot = 'msnbot';
} else if (strstr($le_host, "slurp")) {
    $bot = 'slurp';
}

if (strlen($bot) > 0) {

    // Connect to the database.
    etc etc...

    // Make the query - insert stats
    $query = "INSERT INTO botstats (host, bot, date) VALUES ('$le_host', '$bot', NOW() )";

    // Run the query.
    @mysql_query ($query);

    // Close the database connection.
    mysql_close();

}

This is supposed to be a simple little script that tracks crawlers (only three thus far). For some reason nothing is going into the database, even though the pages are being crawled by msnbot etc. I would appreciate a pointer on where I'm going wrong.

Receptional Andy

11:25 am on Nov 15, 2005 (gmt 0)



Is this because you are checking REMOTE_ADDR (which is an IP) rather than the user agent? gethostbyaddr is likely to resolve to something like crawlx.google.com rather than a hostname containing "googlebot". Should you be checking $_SERVER['HTTP_USER_AGENT'] instead?
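As a rough illustration of the user-agent check suggested here, the sketch below looks for the three crawler names from the first post inside $_SERVER['HTTP_USER_AGENT']. The helper name bot_from_user_agent() and the token list are assumptions made for this example, not part of the original script, and a user agent can be spoofed, so treat the result as a hint rather than proof.

```php
<?php
// Hypothetical helper (not from the original script): map a User-Agent
// string to one of the three crawler labels used in the thread.
function bot_from_user_agent(string $userAgent): string
{
    // Substring to look for => label to record.
    $tokens = [
        'googlebot' => 'googlebot',
        'msnbot'    => 'msnbot',
        'slurp'     => 'slurp',
    ];

    foreach ($tokens as $needle => $label) {
        // stripos() is case-insensitive, so "Googlebot/2.1" matches as well.
        if (stripos($userAgent, $needle) !== false) {
            return $label;
        }
    }

    return ''; // no known crawler token found
}

// The header may be missing entirely, so fall back to an empty string.
$ua  = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$bot = bot_from_user_agent($ua);
```

Returning an empty string for "no match" keeps the later strlen($bot) > 0 test from the first post usable without an undefined-variable notice.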

Patrick Taylor

5:01 pm on Nov 15, 2005 (gmt 0)




I'm now using:


if (getenv(HTTP_X_FORWARDED_FOR)) {
    $ip = getenv('HTTP_X_FORWARD_FOR');
    $host = gethostbyaddr($ip);
} else {
    $ip = getenv('REMOTE_ADDR');
    $host = gethostbyaddr($ip);
}

and it still doesn't work. On another tracking script I made, the example in my first post returns a "host" string like 'msnbot.msn.com' etc so I thought it should have worked.

I think I'm after the referrer's host, aren't I?

I will try your suggestion though. Thanks.
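Two details in the snippet from this post may be worth a second look: HTTP_X_FORWARDED_FOR is passed to getenv() without quotes in the if condition, and the next line reads 'HTTP_X_FORWARD_FOR' (missing "ED"), which is a different variable name. Whether that explains the empty table is only a guess. The sketch below shows one way the same idea could be written with consistent key names; the helper client_ip() is hypothetical, and note that X-Forwarded-For is supplied by the client or a proxy and can be forged.

```php
<?php
// Hypothetical helper (not from the thread): choose the client IP the way
// the posted snippet appears to intend, using the same key name throughout.
function client_ip(array $server): string
{
    if (!empty($server['HTTP_X_FORWARDED_FOR'])) {
        // The header may hold a comma-separated chain of addresses;
        // this sketch simply takes the first entry.
        $parts = explode(',', $server['HTTP_X_FORWARDED_FOR']);
        return trim($parts[0]);
    }

    return isset($server['REMOTE_ADDR']) ? $server['REMOTE_ADDR'] : '';
}

$ip = client_ip($_SERVER);

// Only attempt the reverse lookup for something that looks like an IP,
// because gethostbyaddr() returns false on malformed input.
$host = (filter_var($ip, FILTER_VALIDATE_IP) !== false) ? gethostbyaddr($ip) : '';
```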

Patrick Taylor

5:52 pm on Nov 15, 2005 (gmt 0)




$_SERVER['HTTP_USER_AGENT'] is just the browser.

Receptional Andy

9:36 am on Nov 16, 2005 (gmt 0)



"I think I'm after the referrer's host, aren't I?"

I'm not sure that you are ;)

Most spiders crawl from a number of different IP addresses. One for Slurp is 66.196.65.38. If you use gethostbyaddr, this will return si1004.inktomisearch.com. Or, for another example, some Googlebot IPs do resolve to xx.googlebot.com, while others resolve to hostnames like a15.google.com.

The user agent, however, will almost always contain the spider name, which is what you seem to want to check for with your script.
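To make the hostname point concrete, here is a minimal sketch that labels a reverse-DNS hostname by its domain suffix instead of searching for the spider's name. The function name bot_from_host() and the suffix table are assumptions built from the hostnames mentioned in this thread; in particular, not every host under google.com or msn.com is necessarily a crawler.

```php
<?php
// Hypothetical helper: label a hostname by its domain suffix.
function bot_from_host(string $host): string
{
    // Suffix => label. The table is an assumption based on the example
    // hostnames quoted in the thread, not an authoritative list.
    $suffixes = [
        '.googlebot.com'     => 'googlebot',
        '.google.com'        => 'googlebot',
        '.inktomisearch.com' => 'slurp',
        '.msn.com'           => 'msnbot',
    ];

    $host = strtolower($host);

    foreach ($suffixes as $suffix => $label) {
        // Compare the tail of the hostname with the suffix.
        if (substr($host, -strlen($suffix)) === $suffix) {
            return $label;
        }
    }

    return ''; // hostname does not end in a known crawler domain
}
```

With this table, si1004.inktomisearch.com is labelled 'slurp' even though the string "slurp" never appears in the hostname, which is the mismatch described above.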

Patrick Taylor

5:14 pm on Nov 16, 2005 (gmt 0)




I seem to have this working, with


gethostbyaddr($_SERVER['REMOTE_ADDR'])

... and you're right - I am looking for the spider name. For instance 'inktomisearch' for Yahoo, 'googlebot' for Google, etc. I think when I tested for the User Agent on a similar exercise in the past I was getting 'ns4' whenever the visit was from a spider.

The code I am now using - for the host address - does seem to correlate to strings containing the name of the spider.

Thanks for the helpful response, and please correct me if I'm wrong!

Best regards.
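One caveat with relying on gethostbyaddr($_SERVER['REMOTE_ADDR']): when the lookup fails, gethostbyaddr() returns the IP address unchanged, and it returns false for malformed input, so it can help to treat both cases as "no hostname". The sketch below is only an illustration; resolve_host() and its optional $resolver parameter are hypothetical names, with the parameter added so the function can be exercised without a live DNS query.

```php
<?php
// Hypothetical wrapper around gethostbyaddr() that returns '' when no
// usable hostname is available.
function resolve_host(string $ip, ?callable $resolver = null): string
{
    // Reject anything that is not a valid IPv4/IPv6 address up front.
    if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
        return '';
    }

    // Default to PHP's own reverse lookup; a stub can be passed for testing.
    $resolver = $resolver ?? 'gethostbyaddr';
    $host = $resolver($ip);

    // false: malformed input; same value as $ip: the lookup failed.
    if ($host === false || $host === $ip) {
        return '';
    }

    return strtolower($host);
}

// Example call with a stub resolver standing in for a real DNS answer.
$host = resolve_host('66.196.65.38', function ($ip) {
    return 'si1004.inktomisearch.com';
});
```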