
Community Building and User Generated Content Forum

    
phpbb 2.0.11 and googlebot
session id hack no longer works
Reflection




msg:1561745
 12:51 am on Jan 12, 2005 (gmt 0)

Referring to this thread [webmasterworld.com], which I cannot reply to (annoying, btw).

This is the standard code for removing the session id:

global $SID, $HTTP_SERVER_VARS;

if (!empty($SID) && !eregi('sid=', $url) && !strstr($HTTP_SERVER_VARS['HTTP_USER_AGENT'], 'Googlebot') && !strstr($HTTP_SERVER_VARS['HTTP_USER_AGENT'], 'slurp@inktomi.com;'))

Since upgrading to phpBB 2.0.11, this code no longer appears to work: my logs now show Google requests with the session id appended, which in effect results in Google spidering the same pages over and over. Has anyone else noticed this since upgrading?
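To illustrate why an appended session id makes a crawler fetch the same page repeatedly, here is a minimal sketch. The helper name strip_sid() and the sample URLs are made up for illustration and are not part of phpBB; the sketch only assumes the session id travels as a "sid" query parameter, as in the code above.

```php
<?php
// Hypothetical helper (not part of phpBB): remove a "sid" query parameter from a URL.
function strip_sid($url)
{
    $parts = explode('?', $url, 2);
    if (count($parts) < 2) {
        return $url; // no query string, nothing to strip
    }
    $kept = array();
    foreach (explode('&', $parts[1]) as $pair) {
        // Keep every parameter except the session id.
        if ($pair !== '' && strncmp($pair, 'sid=', 4) !== 0) {
            $kept[] = $pair;
        }
    }
    return count($kept) > 0 ? $parts[0] . '?' . implode('&', $kept) : $parts[0];
}

// Two visits by a crawler receive two different session ids (values are illustrative).
$first  = 'viewtopic.php?t=42&sid=aaaabbbbccccddddeeeeffff00001111';
$second = 'viewtopic.php?t=42&sid=99998888777766665555444433332222';

var_dump($first === $second);                       // bool(false): looks like two pages
var_dump(strip_sid($first) === strip_sid($second)); // bool(true): same page underneath
```

To a crawler the two URLs are distinct, so each new session can look like a fresh set of pages to index.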

 

vabtz




msg:1561746
 2:21 am on Jan 12, 2005 (gmt 0)

I am not trying to be crass but why don't you ask the phpbb community?

They are great and provide awesome support.

buksida




msg:1561747
 6:37 am on Jan 12, 2005 (gmt 0)

I wouldn't mind the answer to this one too. And since I could never join the phpBB community (they never replied to my account queries), it would be nice to see it here ... or at least a link to it.

Reflection




msg:1561748
 5:40 pm on Jan 12, 2005 (gmt 0)

I am not trying to be crass but why don't you ask the phpbb community?

1. This hack has been discussed here as has upgrading to 2.0.11.

2. The phpBB community is an unorganized zoo; I don't usually find it very helpful.

rogerd




msg:1561749
 3:24 pm on Jan 13, 2005 (gmt 0)

Any updates on this topic? Spiderability and indexability are major issues for many of the members here, and getting rid of the session IDs for bots is important.

Birdman




msg:1561750
 7:03 pm on Jan 13, 2005 (gmt 0)

I have a fix for you folks. In sessions.php, find this function:

function append_sid($url, $non_html_amp = false)
{
...
...
}

Replace the whole function with this modified version below:



function append_sid($url, $non_html_amp = false)
{
    global $SID;
    if ( !empty($SID) && !preg_match('#sid=#', $url) )
    {
        $agents = array('Googlebot', 'slurp@inktomi.com', 'Msnbot');
        $ref = $_SERVER['HTTP_USER_AGENT'];
        foreach ( $agents as $agent )
        {
            if ( strpos( $agent, $ref ) !== false ) { return $url; }
        }
        $url .= ( ( strpos($url, '?') != false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . $SID;
    }
    return $url;
}

You may add as many user agents as you like. Be sure that the case is correct because this function IS CASE SENSITIVE. You can change the function strpos() to stripos() to make it case insensitive.
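As a small illustration of the case-sensitivity point, the sketch below compares strpos() and stripos() on a sample user-agent string. The string is an illustrative value, not taken from anyone's logs, and stripos() needs PHP 5 or later.

```php
<?php
// Illustrative user-agent string (an assumption, not copied from real logs).
$ua = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

// strpos() is case sensitive: 'googlebot' does not match 'Googlebot'.
var_dump(strpos($ua, 'googlebot') !== false);   // bool(false)

// stripos() ignores case, so the same search succeeds.
var_dump(stripos($ua, 'googlebot') !== false);  // bool(true)

// A match at the very start returns position 0, which is why the
// comparison has to be !== false rather than a loose != false.
var_dump(strpos('Googlebot/2.1', 'Googlebot')); // int(0)
```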

I tested this code by switching my user agent string in Firefox, which is really easy :)

regards,
Birdman

Swordsman




msg:1561751
 9:16 pm on Jan 13, 2005 (gmt 0)

I have implemented the mod found in this

[able2know.com...]

thread. It addresses spiderability, linking, conversion of urls to a static form, etc. It has made a big difference in getting my forum indexed.

Reflection




msg:1561752
 9:31 pm on Jan 13, 2005 (gmt 0)

Awesome, thanks Birdman!

Frank_Rizzo




msg:1561753
 9:52 pm on Jan 13, 2005 (gmt 0)

"why don't you ask the phpbb community?"

I asked them and could get no reply to a simple question such as:

"does the google hack work with 2.0.11?"

Gave up there in the end.

All I know now is that Googlebot has only visited 3 messages since I upgraded. Googlebot had previously indexed nearly 20,000. I don't know whether it's just not the right time for a deep crawl or whether 2.0.11 breaks the sid removal.

--

Birdman, I tried your hack. I don't know if I made a typo or not (cutting and pasting from the thread put all the code on one line), but new visitors to the messageboard are being greeted with a blank page. Only if they refresh the page does the messageboard then appear.

Maybe it is also due to the fact that I changed the strpos to stripos and added a few more bots?

slurp@inktomi is out of date now, isn't it? Hasn't it been replaced by "Yahoo! slurp"?

I added 'Yahoo! slurp' to the list. Does the ! or the space need escaping?

Cheers.

Birdman




msg:1561754
 10:20 pm on Jan 13, 2005 (gmt 0)

No escaping of the user agent strings is needed. Let's see what you have after your changes.

I'm not sure on the blank page deal. I would expect an error to show on a complete script abort. Once again, let's see your code and I'll test it on my server.

buksida




msg:1561755
 6:26 am on Jan 14, 2005 (gmt 0)

Great stuff, mod implemented, seems to be running fine, we'll have a look when Googlebot spiders next.

Frank_Rizzo




msg:1561756
 9:28 am on Jan 14, 2005 (gmt 0)

This is the code I had previously which worked prior to 2.0.11 (and maybe still does but google hasn't crawled the messageboard for 3 weeks...)

function append_sid($url, $non_html_amp = false)
{
    global $SID, $HTTP_SERVER_VARS;

    if (!empty($SID) && !eregi('sid=', $url) && !strstr($HTTP_SERVER_VARS['HTTP_USER_AGENT'], 'Googlebot') && !strstr($HTTP_SERVER_VARS['HTTP_USER_AGENT'], 'slurp@inktomi.com;'))
    {
        $url .= ( ( strpos($url, '?') != false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . $SID;
    }

    return $url;
}

Here's your version with a couple of changes (extra bots, stripos used)


function append_sid($url, $non_html_amp = false)
{
    global $SID;
    if ( !empty($SID) && !preg_match('#sid=#', $url) )
    {
        $agents = array('Googlebot', 'Yahoo! Slurp', 'msnbot', 'ia_archiver', 'Gigabot', 'appie', 'seekbot', 'sensis', 'scooter', 'mirago');
        $ref = $_SERVER['HTTP_USER_AGENT'];
        foreach ( $agents as $agent )
        {
            if ( stripos( $agent, $ref ) !== false ) { return $url; }
        }
        $url .= ( ( strpos($url, '?') != false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . SID;
    }
    return $url;
}

If I use that code, and then visit the messageboard the page is completely blank:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1"></HEAD>
<BODY></BODY></HTML>

If I then click Refresh I see the full messageboard.

Birdman




msg:1561757
 12:10 pm on Jan 14, 2005 (gmt 0)

Frank_Rizzo, I now realize what's wrong. I just tested with stripos and got an error (undefined function: stripos()). It turns out it's a PHP 5-only function. Here's a different version that is still case insensitive.


function append_sid($url, $non_html_amp = false)
{
    global $SID;
    if ( !empty($SID) && !preg_match('#sid=#', $url) )
    {
        $agents = array('Googlebot', 'Yahoo! Slurp', 'msnbot', 'ia_archiver', 'Gigabot', 'appie', 'seekbot', 'sensis', 'scooter', 'mirago');
        $ref = strtolower($_SERVER['HTTP_USER_AGENT']);
        foreach ( $agents as $agent )
        {
            if ( strpos( strtolower($agent), $ref ) !== false ) { return $url; }
        }
        $url .= ( ( strpos($url, '?') != false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . SID;
    }
    return $url;
}

This one should work. :)

Frank_Rizzo




msg:1561758
 10:38 am on Jan 15, 2005 (gmt 0)

That's more like it. Cheers!

BTW, I think the original google mod does still work as the log file is now showing googlebot retrieving some messageboard messages. Clearly it doesn't need to deep crawl the whole messageboard and is just picking a few.

I've implemented your version of the mod now. Will let you know how it goes.

Reflection




msg:1561759
 7:25 pm on Jan 17, 2005 (gmt 0)

function append_sid($url, $non_html_amp = false)
{
    global $SID;
    if ( !empty($SID) && !preg_match('#sid=#', $url) )
    {
        $agents = array('Googlebot', 'Yahoo! Slurp', 'msnbot', 'ia_archiver', 'Gigabot', 'appie', 'seekbot', 'sensis', 'scooter', 'mirago');
        $ref = strtolower($_SERVER['HTTP_USER_AGENT']);
        foreach ( $agents as $agent )
        {
            if ( strpos( strtolower($agent), $ref ) !== false ) { return $url; }
        }
        $url .= ( ( strpos($url, '?') != false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . SID;
    }
    return $url;
}

I believe that should be "$SID;"

Birdman




msg:1561760
 7:50 pm on Jan 17, 2005 (gmt 0)

Yes, absolutely correct! I had it in my first one but then I copied Frank_Rizzo's and made adjustments to it, rather than my original.

Sorry guys. I better PM Rizzo..

Frank_Rizzo




msg:1561761
 10:20 pm on Jan 17, 2005 (gmt 0)

Thanks for the update.

I wondered why I couldn't access the control panel anymore. That typo had no effect on regular browsing but caused a weird frame error when accessing the acp!

Frank_Rizzo




msg:1561762
 10:23 pm on Jan 17, 2005 (gmt 0)

Just checking the case of the missing $ .

Hmm, I must have hit a delete when repaginating (?) the cut 'n pasted code. re: it was all on one long line.

Ooops.

Erwin_D




msg:1561763
 9:54 pm on Feb 10, 2005 (gmt 0)

After wondering why the above code STILL didn't work, I discovered someone didn't do his homework :) The code fails at strpos ( $agent, $ref ), which is like looking for a haystack in a needle... Ahem!

So to correct the above:

function append_sid($url, $non_html_amp = false)
{
    global $SID;

    if ( !empty($SID) && !preg_match('#sid=#', $url) )
    {
        $agents = array('Googlebot', 'Yahoo', 'Msnbot');
        $ref = $_SERVER['HTTP_USER_AGENT'];
        foreach ( $agents as $agent )
        {
            if ( strpos ( $ref, $agent ) !== false )
            {
                return $url;
            }
        }
        $url .= ( ( strpos($url, '?') != false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . $SID;
    }

    return $url;
}
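The haystack/needle order can be checked in isolation with a short sketch like the one below. The helper name is_listed_agent() and the sample user-agent strings are assumptions for illustration, not part of phpBB. The sketch uses stripos() (PHP 5 or later) rather than strpos(), so a list entry still matches a crawler that reports its name in a different case.

```php
<?php
// Hypothetical helper (not part of phpBB): true if $user_agent contains any entry of $agents.
function is_listed_agent($user_agent, $agents)
{
    foreach ($agents as $agent) {
        // Haystack first (the full user-agent string), needle second (the bot name).
        if (stripos($user_agent, $agent) !== false) {
            return true;
        }
    }
    return false;
}

$agents = array('Googlebot', 'Yahoo', 'Msnbot');

// Illustrative user-agent strings (assumptions, not copied from real logs).
$bot     = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
$browser = 'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20041107 Firefox/1.0';

var_dump(is_listed_agent($bot, $agents));      // bool(true)
var_dump(is_listed_agent($browser, $agents));  // bool(false)

// Reversed arguments: the long string is searched for inside the short one,
// so the test can never succeed for a real user agent.
var_dump(strpos('Googlebot', $bot) !== false); // bool(false)
```

Keeping the match in a separate function like this also makes it easy to test against different user-agent strings without touching append_sid() itself.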

993ti




msg:1561764
 2:38 am on Feb 28, 2005 (gmt 0)

I'm using this one:
[phpbb.com...]

Along with this (left the sessions.php part out because the above one already does that).
[able2know.com...]
Works great.
Even got it to work with the categories hierarchy (v2.0.4) properly.
Now it's only a matter of waiting and checking the logs to see what the bots are doing :)


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved