Forum Moderators: rogerd

phpbb 2.0.11 and googlebot

session id hack no longer works

   
12:51 am on Jan 12, 2005 (gmt 0)

10+ Year Member



Referring to this thread [webmasterworld.com], which I cannot reply to (annoying, btw).

This is the standard code for removing the session id:

global $SID, $HTTP_SERVER_VARS;

if (!empty($SID) && !eregi('sid=', $url) && !strstr($HTTP_SERVER_VARS['HTTP_USER_AGENT'], 'Googlebot') && !strstr($HTTP_SERVER_VARS['HTTP_USER_AGENT'], 'slurp@inktomi.com;'))

Since upgrading to phpBB 2.0.11, this code no longer appears to work: my logs now show Google requests with the session id appended, which in effect means Google spiders the same pages over and over. Has anyone else noticed this since upgrading?

2:21 am on Jan 12, 2005 (gmt 0)



I am not trying to be crass but why don't you ask the phpbb community?

They are great and provide awesome support.

6:37 am on Jan 12, 2005 (gmt 0)

10+ Year Member



I wouldn't mind the answer to this one too. And since I could never join the phpBB community (they never replied to my account queries), it would be nice to see it here ... or at least a link to it.

5:40 pm on Jan 12, 2005 (gmt 0)

10+ Year Member



"I am not trying to be crass but why don't you ask the phpbb community?"

1. This hack has been discussed here as has upgrading to 2.0.11.

2. The phpBB community is an unorganized zoo; I don't usually find it very helpful.

3:24 pm on Jan 13, 2005 (gmt 0)

WebmasterWorld Administrator rogerd is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Any updates on this topic? Spiderability and indexability are major issues for many of the members here, and getting rid of the session IDs for bots is important.

7:03 pm on Jan 13, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have a fix for you folks. In sessions.php, find this function:

function append_sid($url, $non_html_amp = false)
{
...
...
}

Replace the whole function with this modified version below:



function append_sid($url, $non_html_amp = false)
{
    global $SID;
    if ( !empty($SID) && !preg_match('#sid=#', $url) )
    {
        $agents = array('Googlebot', 'slurp@inktomi.com', 'Msnbot');
        $ref = $_SERVER['HTTP_USER_AGENT'];
        foreach ( $agents as $agent )
        {
            if ( strpos( $agent, $ref ) !== false ) { return $url; }
        }
        $url .= ( ( strpos($url, '?') != false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . $SID;
    }
    return $url;
}

You may add as many user agents as you like. Be sure that the case is correct because this function IS CASE SENSITIVE. You can change the function strpos() to stripos() to make it case insensitive.

I tested this code by switching my user agent string in Firefox, which is really easy :)

regards,
Birdman

9:16 pm on Jan 13, 2005 (gmt 0)

10+ Year Member



I have implemented the mod found in this

[able2know.com...]

thread. It addresses spiderability, linking, conversion of urls to a static form, etc. It has made a big difference in getting my forum indexed.

9:31 pm on Jan 13, 2005 (gmt 0)

10+ Year Member



Awesome, thanks Birdman!

9:52 pm on Jan 13, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"why don't you ask the phpbb community?"

I asked them and could get no reply to a simple question such as:

"does the google hack work with 2.0.11?"

Gave up there in the end.

All I know now is that Googlebot has only visited 3 messages since I upgraded. Before that, Googlebot had indexed nearly 20,000. I don't know whether it's just not the right time for a deep crawl or whether 2.0.11 breaks the sid removal.

--

Birdman, I tried your hack. I don't know if I made a typo or not (cutting and pasting from the thread put all the code on one line), but new visitors to the messageboard are being greeted with a blank page. Only if they refresh the page does the messageboard then appear.

Maybe it is also due to the fact that I changed the strpos to stripos and added a few more bots?

slurp@inktomi is out of date now, isn't it? Hasn't it been replaced by "Yahoo! slurp"?

I added 'Yahoo! slurp' to the list. Does the ! or the space need slashing?

Cheers.

10:20 pm on Jan 13, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No escaping of the user agent strings is needed. Let's see what you have after your changes.

I'm not sure on the blank page deal. I would expect an error to show on a complete script abort. Once again, let's see your code and I'll test it on my server.
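
To illustrate the point about escaping, here is a minimal sketch. The user agent string is only an example of the kind of value a crawler might send, and the variable names are made up for the demonstration. strpos() compares the text literally, so the "!" and the space need no backslashes; escaping would only come into play if the name were used inside a regular expression, where preg_quote() handles it.

```php
<?php
// Example user agent string; the exact text a crawler sends may differ.
$user_agent = 'Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)';

// strpos() does a literal substring search, so '!' and the space are matched as-is.
$found = strpos($user_agent, 'Yahoo! Slurp') !== false;

// Only a regular expression would need escaping; preg_quote() adds it for us.
$pattern = '#' . preg_quote('Yahoo! Slurp', '#') . '#';
$found_regex = preg_match($pattern, $user_agent) === 1;

var_dump($found, $found_regex);
```

Both checks should report a match for this example string.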

6:26 am on Jan 14, 2005 (gmt 0)

10+ Year Member



Great stuff, mod implemented, seems to be running fine, we'll have a look when Googlebot spiders next.

9:28 am on Jan 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This is the code I had previously which worked prior to 2.0.11 (and maybe still does but google hasn't crawled the messageboard for 3 weeks...)

function append_sid($url, $non_html_amp = false)
{
    global $SID, $HTTP_SERVER_VARS;

    if (!empty($SID) && !eregi('sid=', $url) && !strstr($HTTP_SERVER_VARS['HTTP_USER_AGENT'], 'Googlebot') && !strstr($HTTP_SERVER_VARS['HTTP_USER_AGENT'], 'slurp@inktomi.com;'))
    {
        $url .= ( ( strpos($url, '?') != false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . $SID;
    }

    return $url;
}

Here's your version with a couple of changes (extra bots, stripos used)


function append_sid($url, $non_html_amp = false)
{
    global $SID;
    if ( !empty($SID) && !preg_match('#sid=#', $url) )
    {
        $agents = array('Googlebot', 'Yahoo! Slurp', 'msnbot', 'ia_archiver', 'Gigabot', 'appie', 'seekbot', 'sensis', 'scooter', 'mirago');
        $ref = $_SERVER['HTTP_USER_AGENT'];
        foreach ( $agents as $agent )
        {
            if ( stripos( $agent, $ref ) !== false ) { return $url; }
        }
        $url .= ( ( strpos($url, '?') != false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . SID;
    }
    return $url;
}

If I use that code, and then visit the messageboard the page is completely blank:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1"></HEAD>
<BODY></BODY></HTML>

If I then click Refresh I see the full messageboard.

12:10 pm on Jan 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Frank_rizzo, I now realize what's wrong. I just tested with stripos and got an error (undefined function: stripos()). Turns out it's a PHP 5-only function. Here's a different version that is still case insensitive.


function append_sid($url, $non_html_amp = false)
{
    global $SID;
    if ( !empty($SID) && !preg_match('#sid=#', $url) )
    {
        $agents = array('Googlebot', 'Yahoo! Slurp', 'msnbot', 'ia_archiver', 'Gigabot', 'appie', 'seekbot', 'sensis', 'scooter', 'mirago');
        $ref = strtolower($_SERVER['HTTP_USER_AGENT']);
        foreach ( $agents as $agent )
        {
            if ( strpos( strtolower($agent), $ref ) !== false ) { return $url; }
        }
        $url .= ( ( strpos($url, '?') != false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . SID;
    }
    return $url;
}

This one should work. :)

10:38 am on Jan 15, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That's more like it. Cheers!

BTW, I think the original google mod does still work as the log file is now showing googlebot retrieving some messageboard messages. Clearly it doesn't need to deep crawl the whole messageboard and is just picking a few.

I've implemented your version of the mod now. Will let you know how it goes.

7:25 pm on Jan 17, 2005 (gmt 0)

10+ Year Member



function append_sid($url, $non_html_amp = false)
{
    global $SID;
    if ( !empty($SID) && !preg_match('#sid=#', $url) )
    {
        $agents = array('Googlebot', 'Yahoo! Slurp', 'msnbot', 'ia_archiver', 'Gigabot', 'appie', 'seekbot', 'sensis', 'scooter', 'mirago');
        $ref = strtolower($_SERVER['HTTP_USER_AGENT']);
        foreach ( $agents as $agent )
        {
            if ( strpos( strtolower($agent), $ref ) !== false ) { return $url; }
        }
        $url .= ( ( strpos($url, '?') != false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . SID;
    }
    return $url;
}

I believe that should be "$SID;"

7:50 pm on Jan 17, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, absolutely correct! I had it in my first one but then I copied Frank_Rizzo's and made adjustments to it, rather than my original.

Sorry guys. I better PM Rizzo..

10:20 pm on Jan 17, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for the update.

I wondered why I couldn't access the control panel anymore. That typo had no effect on regular browsing but caused a weird frame error when accessing the acp!

10:23 pm on Jan 17, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just checking the case of the missing $ .

Hmm, I must have hit a delete when repaginating (?) the cut 'n pasted code. re: it was all on one long line.

Ooops.

9:54 pm on Feb 10, 2005 (gmt 0)

10+ Year Member



After wondering why the above code STILL didn't work, I discovered someone didn't do his homework :) The code fails at strpos ( $agent, $ref ), which is like looking for a haystack in a needle... Ahem!

So to correct the above:

function append_sid($url, $non_html_amp = false)
{
    global $SID;

    if ( !empty($SID) && !preg_match('#sid=#', $url) )
    {
        $agents = array('Googlebot', 'Yahoo', 'Msnbot');
        $ref = $_SERVER['HTTP_USER_AGENT'];
        foreach ( $agents as $agent )
        {
            if ( strpos ( $ref, $agent ) !== false )
            {
                return $url;
            }
        }
        $url .= ( ( strpos($url, '?') != false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . $SID;
    }

    return $url;
}
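
For anyone piecing the thread together, here is a rough sketch that combines the corrections discussed above: the haystack (the user agent) comes first in strpos(), the variable is $SID rather than SID, and the comparison is made case insensitive with strtolower() so it also runs where stripos() is missing. The helper is_known_bot() is a made-up name for illustration and is not part of phpBB, the bot list is only a sample, and the "?" check uses !== false, a small change from the code above. Treat it as a starting point to test on your own board, not a drop-in replacement.

```php
<?php
// Hypothetical helper (not part of phpBB): returns true when the user agent
// contains any of the listed bot names, ignoring case.
function is_known_bot($user_agent, $agents)
{
    $ua = strtolower($user_agent);
    foreach ($agents as $agent) {
        // strpos(haystack, needle): look for the short bot name inside the full user agent.
        if (strpos($ua, strtolower($agent)) !== false) {
            return true;
        }
    }
    return false;
}

// Sketch of append_sid() with the thread's corrections applied.
function append_sid($url, $non_html_amp = false)
{
    global $SID;

    if (!empty($SID) && !preg_match('#sid=#', $url)) {
        $agents = array('Googlebot', 'Yahoo', 'msnbot'); // sample list, extend as needed
        $user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
        if (is_known_bot($user_agent, $agents)) {
            return $url; // bots get the URL without a session id
        }
        $separator = (strpos($url, '?') !== false) ? ($non_html_amp ? '&' : '&amp;') : '?';
        $url .= $separator . $SID;
    }
    return $url;
}

// Quick check with made-up values.
$GLOBALS['SID'] = 'sid=abc123';
$_SERVER['HTTP_USER_AGENT'] = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
echo append_sid('viewtopic.php?t=1'), "\n"; // bot: no session id appended
$_SERVER['HTTP_USER_AGENT'] = 'Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/1.0';
echo append_sid('viewtopic.php?t=1'), "\n"; // browser: session id appended
```

Because the matching is case insensitive, the capitalisation of the names in the list no longer matters.
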
2:38 am on Feb 28, 2005 (gmt 0)

10+ Year Member



I'm using this one:
[phpbb.com...]

Along with this (left the sessions.php part out because the above one already does that).
[able2know.com...]
Works great.
Even got it to work with the categories hierarchy (v2.0.4) properly.
Now it's only a matter of waiting and checking the logs to see what the bots are doing :)

 
