Forum Moderators: rogerd

phpbb 2.0.11 and googlebot

session id hack no longer works

     
12:51 am on Jan 12, 2005 (gmt 0)

Preferred Member

10+ Year Member

joined:May 16, 2003
posts:592
votes: 0


Referring to this thread [webmasterworld.com], which I cannot reply to (annoying, btw).

This is the standard code for removing the session id:

global $SID, $HTTP_SERVER_VARS;

if ( !empty($SID) && !eregi('sid=', $url) && !strstr($HTTP_SERVER_VARS['HTTP_USER_AGENT'], 'Googlebot') && !strstr($HTTP_SERVER_VARS['HTTP_USER_AGENT'], 'slurp@inktomi.com;') )

Since upgrading to phpbb 2.0.11 this code does not appear to work: I now have log entries from Google with the session id appended, which in effect results in Google spidering the same pages over and over. Has anyone else noticed this since upgrading?
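
To see how many distinct pages a crawler actually fetched, it can help to strip the session id from the logged URLs before comparing them. A minimal sketch, assuming the parameter is always named sid as in the code above; strip_sid() is a made-up helper for illustration, not a phpBB function, and the session ids are invented:

```php
<?php
// Hypothetical helper, not part of phpBB: drop the "sid" query parameter
// so that URLs differing only by session id compare as equal.
function strip_sid($url)
{
    // Remove "sid=<value>", keeping the separator that preceded it.
    $url = preg_replace('/([?&])sid=[^&#]*&?/', '$1', $url);
    // Drop a "?" or "&" left dangling at the end.
    return rtrim($url, '?&');
}

// Made-up session ids, for illustration only.
$first  = strip_sid('viewtopic.php?t=5&sid=0123456789abcdef0123456789abcdef');
$second = strip_sid('viewtopic.php?sid=fedcba9876543210fedcba9876543210&t=5');

// Both reduce to the same URL, i.e. the same page fetched twice.
var_dump($first === $second);
```

If the same stripped URL shows up many times in the log for one bot, that would suggest the session id is still being appended for it.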

2:21 am on Jan 12, 2005 (gmt 0)

Preferred Member

joined:Apr 22, 2004
posts:528
votes: 0


I am not trying to be crass but why don't you ask the phpbb community?

They are great and provide awesome support.

6:37 am on Jan 12, 2005 (gmt 0)

Full Member

10+ Year Member

joined:Jan 19, 2004
posts:330
votes: 0


I wouldn't mind the answer to this one too. And since I could never join the phpBB community (they never replied to my account queries), it would be nice to see it here ... or at least a link to it.
5:40 pm on Jan 12, 2005 (gmt 0)

Preferred Member

10+ Year Member

joined:May 16, 2003
posts:592
votes: 0


I am not trying to be crass but why don't you ask the phpbb community?

1. This hack has been discussed here as has upgrading to 2.0.11.

2. The phpbb community is an unorganized zoo; I don't usually find it very helpful.

3:24 pm on Jan 13, 2005 (gmt 0)

Administrator

WebmasterWorld Administrator rogerd is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Aug 2, 2000
posts:9685
votes: 0


Any updates on this topic? Spiderability and indexability are major issues for many of the members here, and getting rid of the session IDs for bots is important.
7:03 pm on Jan 13, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 22, 2002
posts:2546
votes: 0


I have a fix for you folks. In sessions.php, find this function:

function append_sid($url, $non_html_amp = false)
{
...
...
}

Replace the whole function with this modified version below:



function append_sid($url, $non_html_amp = false)
{
    global $SID;

    if ( !empty($SID) && !preg_match('#sid=#', $url) )
    {
        $agents = array('Googlebot', 'slurp@inktomi.com', 'Msnbot');
        $ref = $_SERVER['HTTP_USER_AGENT'];
        foreach ( $agents as $agent )
        {
            if ( strpos( $agent, $ref ) !== false ) { return $url; }
        }
        $url .= ( ( strpos($url, '?') != false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . $SID;
    }

    return $url;
}

You may add as many user agents as you like. Be sure that the case is correct because this function IS CASE SENSITIVE. You can change the function strpos() to stripos() to make it case insensitive.

I tested this code by switching my user agent string in Firefox, which is really easy :)

regards,
Birdman

9:16 pm on Jan 13, 2005 (gmt 0)

New User

10+ Year Member

joined:Dec 2, 2002
posts:26
votes: 0


I have implemented the mod found in this

[able2know.com...]

thread. It addresses spiderability, linking, conversion of urls to a static form, etc. It has made a big difference in getting my forum indexed.

9:31 pm on Jan 13, 2005 (gmt 0)

Preferred Member

10+ Year Member

joined:May 16, 2003
posts:592
votes: 0


Awesome, thanks Birdman!
9:52 pm on Jan 13, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 17, 2002
posts:1181
votes: 5


"why don't you ask the phpbb community?"

I asked them and could get no reply to a simple question such as:

"does the google hack work with 2.0.11?"

Gave up there in the end.

All I know now is that Googlebot has only visited 3 messages since I upgraded. Before that, Googlebot had indexed nearly 20,000. I don't know whether it's just not the right time for a deep crawl or whether 2.0.11 breaks the sid removal.

--

Birdman, I tried your hack. I don't know if I made a typo or not (cutting and pasting from the thread put all the code on one line), but new visitors to the messageboard are being greeted with a blank page. Only if they refresh the page does the messageboard then appear.

Maybe it is also due to the fact that I changed the strpos to stripos and added a few more bots?

slurp@inktomi is out of date now, isn't it? Hasn't it been replaced by "Yahoo! slurp"?

I added 'Yahoo! slurp' to the list. Does the ! or the space need slashing?

Cheers.

10:20 pm on Jan 13, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 22, 2002
posts:2546
votes: 0


No escaping of the user agent strings is needed. Let's see what you have after your changes.

I'm not sure on the blank page deal. I would expect an error to show on a complete script abort. Once again, let's see your code and I'll test it on my server.

6:26 am on Jan 14, 2005 (gmt 0)

Full Member

10+ Year Member

joined:Jan 19, 2004
posts:330
votes: 0


Great stuff, mod implemented, seems to be running fine, we'll have a look when Googlebot spiders next.
9:28 am on Jan 14, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 17, 2002
posts:1181
votes: 5


This is the code I had previously which worked prior to 2.0.11 (and maybe still does but google hasn't crawled the messageboard for 3 weeks...)

function append_sid($url, $non_html_amp = false)
{
    global $SID, $HTTP_SERVER_VARS;

    if ( !empty($SID) && !eregi('sid=', $url) && !strstr($HTTP_SERVER_VARS['HTTP_USER_AGENT'], 'Googlebot') && !strstr($HTTP_SERVER_VARS['HTTP_USER_AGENT'], 'slurp@inktomi.com;') )
    {
        $url .= ( ( strpos($url, '?') != false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . $SID;
    }

    return $url;
}

Here's your version with a couple of changes (extra bots, stripos used)


function append_sid($url, $non_html_amp = false)
{
    global $SID;

    if ( !empty($SID) && !preg_match('#sid=#', $url) )
    {
        $agents = array('Googlebot', 'Yahoo! Slurp', 'msnbot', 'ia_archiver', 'Gigabot', 'appie', 'seekbot', 'sensis', 'scooter', 'mirago');
        $ref = $_SERVER['HTTP_USER_AGENT'];
        foreach ( $agents as $agent )
        {
            if ( stripos( $agent, $ref ) !== false ) { return $url; }
        }
        $url .= ( ( strpos($url, '?') != false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . SID;
    }

    return $url;
}

If I use that code, and then visit the messageboard the page is completely blank:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1"></HEAD>
<BODY></BODY></HTML>

If I then click Refresh I see the full messageboard.

12:10 pm on Jan 14, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 22, 2002
posts:2546
votes: 0


Frank_rizzo, I now realize what's wrong. I just tested with stripos and got an error (undefined function: stripos()). Turns out it's a PHP 5 only function. Here's a different version that is still case insensitive.


function append_sid($url, $non_html_amp = false)
{
    global $SID;

    if ( !empty($SID) && !preg_match('#sid=#', $url) )
    {
        $agents = array('Googlebot', 'Yahoo! Slurp', 'msnbot', 'ia_archiver', 'Gigabot', 'appie', 'seekbot', 'sensis', 'scooter', 'mirago');
        $ref = strtolower($_SERVER['HTTP_USER_AGENT']);
        foreach ( $agents as $agent )
        {
            if ( strpos( strtolower($agent), $ref ) !== false ) { return $url; }
        }
        $url .= ( ( strpos($url, '?') != false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . SID;
    }

    return $url;
}

This one should work. :)
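
If you would rather keep calling stripos() on a PHP 4 host, one option is to define it yourself when it is missing. A minimal sketch; this fallback is an illustration, not something phpBB ships, and the user-agent string is only a sample. Like the built-in, it takes the string being searched first and the text to look for second:

```php
<?php
// Fallback for PHP 4, where stripos() is undefined. On PHP 5 and later
// the built-in exists and this definition is skipped.
if (!function_exists('stripos')) {
    function stripos($haystack, $needle, $offset = 0)
    {
        // Lowercase both sides, then do an ordinary case-sensitive search.
        return strpos(strtolower($haystack), strtolower($needle), $offset);
    }
}

// Sample user-agent string, for illustration only.
$ua = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
var_dump(stripos($ua, 'googlebot') !== false);
```

The function_exists() guard means the same file should load on either PHP version without a redeclaration error.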

10:38 am on Jan 15, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 17, 2002
posts:1181
votes: 5


That's more like it. Cheers!

BTW, I think the original google mod does still work as the log file is now showing googlebot retrieving some messageboard messages. Clearly it doesn't need to deep crawl the whole messageboard and is just picking a few.

I've implemented your version of the mod now. Will let you know how it goes.

7:25 pm on Jan 17, 2005 (gmt 0)

Preferred Member

10+ Year Member

joined:May 16, 2003
posts:592
votes: 0


function append_sid($url, $non_html_amp = false)
{
    global $SID;

    if ( !empty($SID) && !preg_match('#sid=#', $url) )
    {
        $agents = array('Googlebot', 'Yahoo! Slurp', 'msnbot', 'ia_archiver', 'Gigabot', 'appie', 'seekbot', 'sensis', 'scooter', 'mirago');
        $ref = strtolower($_SERVER['HTTP_USER_AGENT']);
        foreach ( $agents as $agent )
        {
            if ( strpos( strtolower($agent), $ref ) !== false ) { return $url; }
        }
        $url .= ( ( strpos($url, '?') != false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . SID;
    }

    return $url;
}

I believe that should be "$SID;"
7:50 pm on Jan 17, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 22, 2002
posts:2546
votes: 0


Yes, absolutely correct! I had it in my first one but then I copied Frank_Rizzo's and made adjustments to it, rather than my original.

Sorry guys. I better PM Rizzo..

10:20 pm on Jan 17, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 17, 2002
posts:1181
votes: 5


Thanks for the update.

I wondered why I couldn't access the control panel anymore. That typo had no effect on regular browsing but caused a weird frame error when accessing the acp!

10:23 pm on Jan 17, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 17, 2002
posts:1181
votes: 5


Just checking the case of the missing $ .

Hmm, I must have hit a delete when repaginating (?) the cut 'n pasted code. re: it was all on one long line.

Ooops.

9:54 pm on Feb 10, 2005 (gmt 0)

New User

10+ Year Member

joined:Feb 10, 2005
posts:6
votes: 0


After wondering why the above code STILL didn't work, I discovered someone didn't do his homework :) The code fails at strpos ( $agent, $ref ), which is like looking for a haystack in a needle... Ahem!

So to correct the above:

function append_sid($url, $non_html_amp = false)
{
    global $SID;

    if ( !empty($SID) && !preg_match('#sid=#', $url) )
    {
        // Visitors whose user agent contains one of these names get no session id.
        $agents = array('Googlebot', 'Yahoo', 'Msnbot');
        $ref = $_SERVER['HTTP_USER_AGENT'];
        foreach ( $agents as $agent )
        {
            // Haystack first, needle second: look for the bot name inside the user agent.
            if ( strpos( $ref, $agent ) !== false )
            {
                return $url;
            }
        }
        $url .= ( ( strpos($url, '?') != false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . $SID;
    }

    return $url;
}
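
The matching step can also be checked on its own, without switching the browser's user agent. A minimal sketch, assuming the check is pulled out into a helper; is_listed_bot() is a hypothetical name, not a phpBB function, and the two user-agent strings are only samples. Both sides are lowercased here, so a list entry such as 'Msnbot' would still match an agent that identifies itself in lowercase:

```php
<?php
// Hypothetical helper (not part of phpBB): true when any listed agent
// name occurs in the visitor's user-agent string, ignoring case.
function is_listed_bot($user_agent, $agents)
{
    $haystack = strtolower($user_agent);
    foreach ($agents as $agent) {
        // strpos(haystack, needle): search the long string for the short one.
        if (strpos($haystack, strtolower($agent)) !== false) {
            return true;
        }
    }
    return false;
}

$agents    = array('Googlebot', 'Yahoo', 'Msnbot');
$googlebot = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
$firefox   = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0';

var_dump(is_listed_bot($googlebot, $agents)); // crawler string
var_dump(is_listed_bot($firefox, $agents));   // ordinary browser string

// Reversed arguments: the long user-agent string cannot occur inside
// the short name 'Googlebot', so this comparison never matches.
var_dump(strpos('Googlebot', $googlebot) !== false);
```

Run from the command line, this should print true for the crawler string and false for the other two checks, which is why the earlier versions never skipped the session id for any bot.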
2:38 am on Feb 28, 2005 (gmt 0)

New User

10+ Year Member

joined:Feb 28, 2005
posts:5
votes: 0


I'm using this one:
[phpbb.com...]

Along with this (left the sessions.php part out because the above one already does that).
[able2know.com...]
Works great.
Even got it to work with the categories hierarchy (v2.0.4) properly.
Now it's only a matter of waiting and checking the logs to see what the bots are doing :)