
?PHPSESSID, Cloaking, and Spiders

Another post about getting dynamic pages spidered


timster

2:09 pm on Dec 3, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've got a site that uses Apache redirects to convince search engines to index dynamic pages, like so:

www.mysite.net/123_Stuff_from_db

Google is not biting, but I've had some luck with the new MSN search. But the MSN spider often lists pages with session ID's like so:

www.mysite.net/123_Stuff_from_db?PHPSESSID=f0oba7...

I guess that might be why Google and other spiders eschew the pages. But I'm a little reluctant to break the site for people with cookies turned off, like this:

ini_set('session.use_trans_sid', false);

I'm thinking of using cloaking to turn off session ID's in URL's for visiting spiders. Is this worth it, or is it better just to require users to have cookies on?

ergophobe

4:45 am on Dec 4, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is it really cloaking if all you're doing is serving up links without a session id? I suppose it is in a very technical sense - you are serving different content to the SE. But the meat of the content is the same. Are you worried about penalties? About the hassle of sniffing all the important bots?

Personally, I restrict cookies a fair bit and will often just leave sites that require them "unnecessarily" (in my opinion anyway).

Tom

timster

11:22 pm on Dec 5, 2004 (gmt 0)




Are you worried about penalties? About the hassle of sniffing all the important bots?

I'm worried a little about penalties (especially if I make a mistake) and weighing that against how many visitors might be turned off by a cookies requirement. (Hmm...and here I thought I was asking a PHP question, go figure.)

Um...should I be worried about the hassle of sniffing all the bots? I was figuring on checking my logs and accommodating any spider kind enough to visit my home page.

But getting back to PHP, how would I accomplish this trick? Do I need to call the ini_set command before session_start()?

ergophobe

4:18 am on Dec 6, 2004 (gmt 0)




I was maybe thinking too simplistically. I was assuming that the script would still run without a session running, but that it would just use default values and would only let the bot into those parts of the site that don't require user login or whatever.

I was thinking that you would just sniff for the googlebot using $_SERVER['HTTP_USER_AGENT'] and then simply not start the session at all if it's the google bot.

if (!substr_count($_SERVER['HTTP_USER_AGENT'], "unique string for google bot")) {
    session_start();
}

Could that work?

Tom

timster

4:04 pm on Dec 6, 2004 (gmt 0)




I was thinking that you would just sniff for the googlebot using $_SERVER['HTTP_USER_AGENT'] and then simply not start the session at all if it's the google bot.

OK, yes, that makes sense -- I was making it too complicated.

There's no point in shutting off trans_sid for bots: they don't use cookies, so that's effectively the same thing as shutting off the session entirely. (Right?)

So I'll have to make sure the content I'd like indexed can be reached without sessions, and then conditionally start the session for everyone except the bots I like.
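Something like this, maybe -- just a rough sketch, and the bot substrings are placeholders I'd swap for whatever user agents actually show up in my logs:

```php
<?php
// Sketch of the plan: skip sessions for known bots, start them for
// everyone else. The substrings below are placeholders -- fill them in
// from the user agents that actually appear in your logs.
function looks_like_spider($ua) {
    $bots = array("Googlebot", "msnbot", "Yahoo! Slurp");
    foreach ($bots as $bot) {
        if (substr_count($ua, $bot)) return true;
    }
    return false;
}

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (!looks_like_spider($ua)) {
    session_start();
}
```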

Many thanks, Tom!

ergophobe

7:48 pm on Dec 6, 2004 (gmt 0)




Yeah, that's what I was thinking. Either you accept the session id in the URL or you accept that there will be no session data available to bots. Once you've accepted the latter, there's no reason to start sessions for them at all.

The good thing about that is that it only affects bots. It doesn't affect users with cookie blocking.

The bad thing is that you'll need to create a list of all the bots that you want to let in. Presumably you have enough log data already, though, to get the big ones right off.
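If digging through the raw Apache logs is a pain, you could also just record the agents yourself and skim the file now and then -- a rough sketch, with an example path:

```php
<?php
// Sketch: append each visitor's user agent to a flat file for later
// review. The path here is only an example -- pick one outside the
// web root on a real site.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'unknown';
file_put_contents('/tmp/user_agents.log', $ua . "\n", FILE_APPEND);
```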

Cheers,

Tom

elklabone

3:22 pm on Sep 14, 2005 (gmt 0)

10+ Year Member



What about doing it this way:

<?php
if (preg_match("/Mozilla/i", $_SERVER['HTTP_USER_AGENT'])) {
    session_start();
}
?>

In other words, sessions will only start if the browser's HTTP_USER_AGENT contains “Mozilla” somewhere in the identifier. All versions of Netscape/Mozilla and Internet Explorer do this.

Not sure about Opera and Safari...

--Mark

ergophobe

4:24 pm on Sep 14, 2005 (gmt 0)




This thread was locked, so Timster sent me this via sticky to post. It's his solution:

********************* Timster's Post ********************

Here's the code, such as it is -- nothing to write home about. It's an incomplete list of spiders, and would clearly be more future-proof if it used regular expressions instead of substr_count.

function is_spider() {

    if (substr_count($_SERVER['HTTP_USER_AGENT'], ".googlebot.com")) return 1;
    if (substr_count($_SERVER['HTTP_USER_AGENT'], ".inktomisearch.com")) return 1;

    # Comment this out if MSN indexing falls apart
    if (substr_count($_SERVER['HTTP_USER_AGENT'], "+http://search.msn.com/msnbot.htm")) return 1;

    if (substr_count($_SERVER['HTTP_USER_AGENT'], "Yahoo! Slurp")) return 1;
    if (substr_count($_SERVER['HTTP_USER_AGENT'], "(compatible; Ask Jeeves/Teoma)")) return 1;
    if (substr_count($_SERVER['HTTP_USER_AGENT'], "grub-client")) return 1;
    if (substr_count($_SERVER['HTTP_USER_AGENT'], "Wget/1.9")) return 1;

    if (substr_count($_SERVER['HTTP_USER_AGENT'], "W3C_Validator/1.305.2.148 libwww-perl/5.800")) return 1;

    # if (substr_count($_SERVER['REMOTE_HOST'], ".w3.org")) return 1;

    return 0;
}
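The pages then just check it before starting a session. A condensed, self-contained variant of the same idea (only two of the checks, and with the user agent passed in as a parameter so it's easy to test in isolation):

```php
<?php
// Condensed variant of is_spider() above: same technique, just two of
// the checks, with the user agent taken as a parameter.
function is_known_spider($ua) {
    if (substr_count($ua, ".googlebot.com")) return 1;
    if (substr_count($ua, "Yahoo! Slurp")) return 1;
    return 0;
}

// At the top of each page, before any output:
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (!is_known_spider($ua)) {
    session_start();
}
```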

madmac

5:05 pm on Sep 14, 2005 (gmt 0)

10+ Year Member



>> In other words, sessions will only start if the browser HTTP_USER_AGENT contains “Mozilla”

Some spiders also contain Mozilla in the user agent. And not all browsers do.