Although it is the deepbot, it crawled only my homepage and the pages linked from the homepage. Usually this is the behaviour of freshbot.
I've noticed the same things. Deepbot doesn't appear to be going too deep right now, although it's going a little deeper on my site than freshbot usually does.
Just wondering what programs you guys are using to
know instantly when Googlebot is crawling your site?
I browse my raw log files, looking for googlebot.
216.xxx.xxx.xxx IPs are deepbot
64.xxx.xxx.xxx IPs are freshbot
That is, unless something has changed this month ;)
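For anyone who would rather not eyeball the logs, here is a minimal PHP sketch of the same check. It assumes the rule of thumb above still holds (216.x is deepbot, 64.x is freshbot) and that each log line starts with the client IP. The function name classify_googlebot_line is made up for illustration.

```php
<?php
// Hypothetical helper: label a raw log line as deepbot or freshbot using
// the rule of thumb from this thread (216.* = deepbot, 64.* = freshbot).
// Assumes the line starts with the client IP address.
function classify_googlebot_line($line)
{
    // Ignore anything that does not mention Googlebot at all.
    if (stripos($line, 'googlebot') === false) {
        return null;
    }
    // Take the first whitespace-delimited field as the IP.
    $parts = preg_split('/\s+/', trim($line));
    $ip = $parts[0];
    if (strpos($ip, '216.') === 0) {
        return 'deepbot';
    }
    if (strpos($ip, '64.') === 0) {
        return 'freshbot';
    }
    return 'unknown';
}
```

You could run it over each line returned by file() on your own log file. Note that the sample log lines later in this thread show the image crawler on a 64.x address too, so this rule cannot tell it apart from freshbot.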
I'm sure I picked up this code here on the forum, but I'm not sure where...here it is. It emails me and it adds a line to googlebot.txt, a text file on the server. Remember to chmod it for write access.
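<?php
// Log and email a notice whenever Googlebot requests this page.
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($agent, 'googlebot') !== false)
{
// Rebuild the requested URL, including any query string.
$url = "http://".$_SERVER['SERVER_NAME'].$_SERVER['PHP_SELF'];
if (!empty($_SERVER['QUERY_STRING']))
{$url .= '?'.$_SERVER['QUERY_STRING'];}
$today = date("F j, Y, g:i a");
// Reverse DNS lookup of the crawler's IP address.
$host = gethostbyaddr($_SERVER['REMOTE_ADDR']);
mail("myemail@mydomain.org", "googlebot detected on ".$_SERVER['SERVER_NAME'], "$today - Google crawled $url \n $host");
// Append a line to googlebot.txt (the file must be writable).
$logfile = @fopen('googlebot.txt', 'a');
if ($logfile)
{
fputs($logfile, "$today - Google crawled $url $host\n");
fclose($logfile);
}
}
?>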
One thing I did notice: Googlebot is following links that
include a session tag this month (www.foobar.com/list.html?id=112233).
They didn't use to do this, and it is creating many repetitive
hits. But they are hitting all my other pages too. (It is
a retail site with 60k products, each with its own page...)
Anyone know of a way I can have Googlebot follow only links
that do not have the session tag?
thanks!
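One possible approach, offered only as a sketch and not confirmed by anyone in this thread: if the site generates its own links, it could leave the session tag off when the visitor looks like Googlebot. The function name build_link is hypothetical, and the id parameter name is taken from the example URL above.

```php
<?php
// Hypothetical helper: build an internal link, appending the session tag
// only for visitors whose user agent does not look like Googlebot.
// Assumes the session tag is passed as the "id" query parameter.
function build_link($path, $sessionId, $userAgent)
{
    if (stripos($userAgent, 'googlebot') !== false) {
        return $path; // crawler gets the bare URL
    }
    return $path . '?id=' . urlencode($sessionId);
}
```

This only works if the pages still behave sensibly when no session tag is present.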
Googlebot:
216.239.46.55 GET /suntan.htm Googlebot/2.1 +(+http://www.googlebot.com/bot.html)
Freshbot:
64.68.82.17 GET /shades.htm Googlebot/2.1 +(+http://www.googlebot.com/bot.html)
Imagebot:
64.68.86.59 GET /mai-tai.jpg Googlebot/2.1 +(+http://www.googlebot.com/bot.html)
DavidT: Another thing, is Freshbot stupid? It can't seem to understand a 301 redirect from www.domain.com to domain.com. It keeps requesting pages and gets nothing but 301s in return.
I would like to know about this as well. Google isn't following my server redirects and keeps trying "www." (I'm trying to concentrate my PR and I just plain hate the "www." on domains ;-)
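For reference, here is a minimal PHP sketch of the kind of www to non-www 301 redirect being discussed. It is an illustration under assumptions, not anyone's actual server setup: the function name canonical_redirect_target is made up, and plain http is assumed.

```php
<?php
// Hypothetical helper: return the non-www URL to redirect to, or null
// when the request already uses the bare domain.
function canonical_redirect_target($host, $requestUri)
{
    // Only hosts that begin with "www." need redirecting.
    if (stripos($host, 'www.') !== 0) {
        return null;
    }
    // Drop the 4-character "www." prefix and keep the requested path.
    return 'http://' . substr($host, 4) . $requestUri;
}

// Usage at the very top of a page, before any output is sent:
// $target = canonical_redirect_target($_SERVER['HTTP_HOST'], $_SERVER['REQUEST_URI']);
// if ($target !== null) {
//     header('Location: ' . $target, true, 301);
//     exit;
// }
```

Whether the crawler then switches to the bare domain is up to Google; this only shows what the server sends back.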