www.mysite.net/123_Stuff_from_db
Google is not biting, but I've had some luck with the new MSN search. The MSN spider, however, often lists pages with session IDs like so:
www.mysite.net/123_Stuff_from_db?PHPSESSID=f0oba7...
I guess that might be why Google and other spiders eschew the pages. But I'm a little reluctant to break the site for people with cookies turned off by doing this:

ini_set('session.use_trans_sid', false);

Instead, I'm thinking of using cloaking to turn off session IDs in URLs for visiting spiders. Is this worth it, or is it better just to require users to have cookies on?
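(For reference, here's roughly what I mean by the first option; just a sketch, not code from my actual site.)

<?php
// Sketch only: stop PHP from appending ?PHPSESSID=... to URLs
// for visitors who don't accept the session cookie.
ini_set('session.use_trans_sid', false);
session_start();
?>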
Personally, I restrict cookies a fair bit and will often just leave sites that require them "unnecessarily" (in my opinion anyway).
Tom
Are you worried about penalties? About the hassle of sniffing all the important bots?
I'm worried a little about penalties (especially if I make a mistake) and weighing that against how many visitors might be turned off by a cookies requirement. (Hmm...and here I thought I was asking a PHP question, go figure.)
Um...should I be worried about the hassle of sniffing all the bots? I was figuring on checking my logs and accommodating any spider kind enough to visit my home page.
But getting back to PHP, how would I accomplish this trick? Do I need to call the ini_set command before session_start()?
I was thinking that you would just sniff for the googlebot using $_SERVER['HTTP_USER_AGENT'] and then simply not start the session at all if it's the google bot.
if (!substr_count($_SERVER['HTTP_USER_AGENT'], "unique string for google bot"))
{
    session_start();
}
Could that work?
Tom
I was thinking that you would just sniff for the googlebot using $_SERVER['HTTP_USER_AGENT'] and then simply not start the session at all if it's the google bot.
OK, yes, that makes sense -- I was making it too complicated.
There's no point in shutting off trans_sid just for bots: they don't use cookies, so that's effectively the same as shutting off the session entirely. (Right?)
So I'll have to make sure the content I'd like indexed can be reached without sessions, and then conditionally start the session for everyone except the bots I like.
Many thanks, Tom!
The good thing about that is that it only affects bots. It doesn't affect users with cookie blocking.
The bad thing is that you'll need to create a list of all the bots that you want to let in. Presumably you have enough log data already, though, to get the big ones right off.
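If digging through the raw access logs is a pain, a quick-and-dirty alternative (just a sketch; the log path is arbitrary) is to have PHP record every user-agent string for a while and then skim the file for crawler names:

<?php
// Sketch: append each visitor's user-agent string to a file so the
// crawler strings can be picked out later.
error_log($_SERVER['HTTP_USER_AGENT'] . "\n", 3, '/tmp/agents.log');
?>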
Cheers,
Tom
<?php
// Only start a session when the user agent identifies itself as a
// normal browser (they all include "Mozilla" somewhere in the string).
if (preg_match("/Mozilla/i", $_SERVER['HTTP_USER_AGENT'])){
    session_start();
}
?>
In other words, a session will only start if the browser's HTTP_USER_AGENT contains "Mozilla" somewhere in the identifier. All versions of Netscape/Mozilla and Internet Explorer do this.
Not sure about Opera and Safari...
--Mark
********************* Timster's Post ********************
Here's the code, such as it is -- nothing to write home about. It's an incomplete list of spiders, and would clearly be more future-proof if it used regular expressions instead of substr_count().
function is_spider() {
    # Return 1 if the user agent looks like one of the spiders we want to let in.
    if (substr_count($_SERVER['HTTP_USER_AGENT'], ".googlebot.com")) return 1;
    if (substr_count($_SERVER['HTTP_USER_AGENT'], ".inktomisearch.com")) return 1;
    # Comment this one out if MSN indexing falls apart
    if (substr_count($_SERVER['HTTP_USER_AGENT'], "+http://search.msn.com/msnbot.htm")) return 1;
    if (substr_count($_SERVER['HTTP_USER_AGENT'], "Yahoo! Slurp")) return 1;
    if (substr_count($_SERVER['HTTP_USER_AGENT'], "(compatible; Ask Jeeves/Teoma)")) return 1;
    if (substr_count($_SERVER['HTTP_USER_AGENT'], "grub-client-1.4.3")) return 1;
    if (substr_count($_SERVER['HTTP_USER_AGENT'], "Wget/1.9")) return 1;
    if (substr_count($_SERVER['HTTP_USER_AGENT'], "W3C_Validator/1.305.2.148 libwww-perl/5.800")) return 1;
    # if (substr_count($_SERVER['REMOTE_HOST'], ".w3.org")) return 1;
    return 0;
}
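Wiring it up at the top of each page would look something like this (my own sketch of how the function might be used; adjust to taste):

<?php
// Sketch: only start a session for visitors that don't look like
// one of the spiders matched in is_spider().
if (!is_spider()) {
    session_start();
}
?>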