
Optimizing phpBB for Google


BartVB

7:35 pm on May 21, 2003 (gmt 0)

10+ Year Member



Does anyone here know of a better way to get in contact with the engineers at Google? I filled in their contact form 3 or 4 times over the course of several months but I never received a reply :\

I'm one of the senior developers of [url=www.phpbb.com]phpBB[/url], one of the open source forum systems. IMO we, as the development team, could help Google quite a bit in making the spider/index process more efficient. phpBB creates a lot of 'useless' links for the spider. An example: every post in a topic has a link that points to itself. So if you have a page like viewtopic.php?t=1234 then you'll have 25 links like viewtopic.php?p=654321, viewtopic.php?p=654334, etc. Because of that, Google indexes a LOT of redundant pages. Furthermore, it's fairly easy to serve a different (very minimal) template set to Google to reduce parsing time and bandwidth consumption. But so far we haven't had any response from the Google team. IMO it would be nice if they could give us some hints, tips and pointers.
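The redundant per-post links described above could be collapsed onto their topic-level URL before the page is sent to a crawler. A minimal sketch, not actual phpBB code: canonical_url(), post_to_topic() and the lookup array are hypothetical stand-ins for a real post_id-to-topic_id query.

```php
<?php
// Sketch: collapse a per-post link onto its topic-level URL so a crawler
// sees one canonical page instead of 25 self-referencing post links.
// post_to_topic() and its lookup table are hypothetical stand-ins for a
// real post_id -> topic_id database query.

function post_to_topic(int $post_id, array $post_index): ?int
{
    // $post_index maps post IDs to the topic that contains them.
    return $post_index[$post_id] ?? null;
}

function canonical_url(string $url, array $post_index): string
{
    // Only rewrite viewtopic.php?p=NNN style links.
    if (preg_match('/viewtopic\.php\?p=(\d+)/', $url, $m)) {
        $topic_id = post_to_topic((int) $m[1], $post_index);
        if ($topic_id !== null) {
            return "viewtopic.php?t=$topic_id";
        }
    }
    return $url; // leave everything else untouched
}

// All 25 per-post links of topic 1234 collapse to one URL:
$post_index = [654321 => 1234, 654334 => 1234];
echo canonical_url('viewtopic.php?p=654321', $post_index); // viewtopic.php?t=1234
```

With something like this in the link-building code path, a spider would only ever see one URL per topic page.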

rogerd

7:58 pm on May 21, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Hey, Bart, welcome to WebmasterWorld!

This is a timely topic, as I'm currently comparing phpBB and a couple of others for a forum upgrade, and all seem to have some search engine clinkers.

I think you have come to the right place for advice on how to work with Google. Don't hold your breath waiting for Google itself to give you optimization advice, although one never knows...

My quick take on getting content indexed properly without duplication but with maximum indexing:

1) Best - rewrite the URLs to look static, like the pages on this forum. Browse around here to see how multiple page threads are handled. I realize this might create more installation difficulties, but is the cleanest solution for more than just Google.

2) Keep dynamic URLs short and not redundant. Google is getting better at indexing dynamic content, but long query strings and different strings that lead to the same content are bad.

3) Don't use session variables. These are another big problem for search engines. There have been extensive posts at phpbb.org about hacks to eliminate session variables for spiders. Eliminating them for everyone would be even better.
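Point 1 (static-looking URLs) is typically done with Apache's mod_rewrite. A minimal sketch, assuming Apache with mod_rewrite enabled; the URL naming convention here is an illustration, not an official phpBB format, and the forum's templates would also need to be changed to emit these static-looking URLs:

```apache
# .htaccess sketch -- map static-looking URLs onto phpBB's real scripts.
RewriteEngine On
# /forum-7.html       -> viewforum.php?f=7
RewriteRule ^forum-([0-9]+)\.html$ viewforum.php?f=$1 [L]
# /topic-1234-25.html -> viewtopic.php?t=1234&start=25 (second number = paging offset)
RewriteRule ^topic-([0-9]+)-([0-9]+)\.html$ viewtopic.php?t=$1&start=$2 [L]
```

The rewrite alone only makes the server answer at the new URLs; spiders still have to find them, which is why the template links matter as much as the rules.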

I'm sure others will have input. Good luck, and thanks for dropping by.

PS - note that "serving different content" is also known as cloaking. One person's helpful distinction between visitors is another person's punishable offense. I'd get some additional input on this issue before you roll anything out.

Swamper

8:15 pm on May 21, 2003 (gmt 0)

10+ Year Member



Welcome Bart from a new user also!

Thought you might want to have a look at my latest problem. Maybe you can also help me out.

The problem I have/had is Googlebot choking on session IDs.
Follow my whole thread here.

[webmasterworld.com...]

It is actually a mod of phpBB for phpNuke that I use. I have implemented a hack to session.php to check the user agent and disable sessions for Googlebot, but it does not seem to stop Googlebot from trying to use the session IDs in the URL. Unless they are trying to reindex with the 80,000-plus hits to my site that they attempted this weekend.
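The kind of user-agent hack being described usually looks something like the following. This is a sketch, not Swamper's actual code: is_spider(), the bot list, and this append_sid() signature are all illustrative, not phpBB's own.

```php
<?php
// Sketch of the "no SID for spiders" idea: detect known crawler user
// agents and skip appending the session ID to generated URLs.
// is_spider() and this append_sid() are illustrative, not phpBB's own.

function is_spider(string $user_agent): bool
{
    // Partial list for illustration; real hacks match many more bots.
    foreach (['googlebot', 'slurp', 'msnbot'] as $bot) {
        if (stripos($user_agent, $bot) !== false) {
            return true;
        }
    }
    return false;
}

function append_sid(string $url, string $sid, string $user_agent): string
{
    if (is_spider($user_agent)) {
        return $url; // spiders get clean, session-free URLs
    }
    $sep = (strpos($url, '?') === false) ? '?' : '&';
    return $url . $sep . 'sid=' . $sid;
}

echo append_sid('viewtopic.php?t=42', 'abc123', 'Googlebot/2.1');
// viewtopic.php?t=42
```

One caveat consistent with Swamper's report: Googlebot will keep requesting old sid-laden URLs it has already discovered for a while, so the hack stops new SID URLs from being handed out but doesn't instantly purge the old ones.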

If you can help please do!

Swamper

John_Creed

8:18 pm on May 21, 2003 (gmt 0)

10+ Year Member



I have a very large phpBB forum and would benefit from such improvements.

Maybe GoogleGuy can offer some help on this.

Swamper

8:45 pm on May 21, 2003 (gmt 0)

10+ Year Member



Also in the link in my last post here is a reply from google about disabling session id's for google and cloaking.

It appears that it's a go and won't be penalized.

stuntdubl

8:58 pm on May 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



1) Best - rewrite the URLs to look static, like the pages on this forum. Browse around here to see how multiple page threads are handled. I realize this might create more installation difficulties, but is the cleanest solution for more than just Google.

If anyone finds out the best and easiest way to do this please post it here or sticky me as well. I have just created my first phpBB, but all the configuration was done by my host, so I am a bit in the dark.

I would really like to make the pages appear static as mentioned, for best "indexability".

panos

9:38 pm on May 21, 2003 (gmt 0)

10+ Year Member



If anyone finds out the best and easiest way to do this please post it here or sticky me as well. I have just created my first phpBB, but all the configuration was done by my host, so I am a bit in the dark.

Some people have already found the solution.

I just did a search and found these threads:

[phpbb.com...]

[phpbb.com...]

[phpbb.com...]

[phpbb.com...]

hope this helps

rogerd

9:55 pm on May 21, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



There appear to be a few useful hacks there. Maybe Bart can read up on this (including the 25 page Google hack thread) and work to incorporate appropriate logic into the original source code. Hacks are fine, but I hate having to modify software just to get basic functionality - it makes upgrading a huge hassle.

It seems like software designers operate in a search engine vacuum - I just had a shopping cart author release an "upgrade" that hoses the URLs. When I try to offer suggestions that would benefit the whole user community, I come up against the author's dated and inaccurate knowledge of how SEs work. Argghhh... (Not talking about Bart, of course! He's here at WebmasterWorld! ;))

PS - I looked at one solution that claimed to convert the dynamic URLs to static ones. Well, it works - except that a gigantic session ID is in every "static" URL! Somehow, I think the indexing of that site won't benefit from that particular hack... ;)

ruserious

10:25 pm on May 21, 2003 (gmt 0)

10+ Year Member



I am not BartVB, but reading his post I think his intentions are a bit different. It is fairly well known how to get Google to index you; the only problem is the session ID (the other factors mentioned here only play a minor role). The only way is (legitimate) cloaking and dropping SIDs for bots.

I think what Bart is getting at is that the forum is an "application" rather than a purely static informational site. As such it has many links, data, buttons, etc. which only serve the user in using the application, but which are "worthless baggage" for a search engine. Hence the question of how to minimize the useless overhead created by making the regular site available as a whole.
I found that robots.txt alone is not a good way. Reason: Google will still serve these pages in the SERPs solely based on the anchor text/links pointing to them, even if the pages themselves are not crawled.

Viable solutions (again IMHO):
- Only drop the session ID for "valuable" links, that is, links that clearly identify a viewforum?f=xxx or a viewtopic?t=xxx page. The problem with viewtopic?p=xxx is that it leads to massive duplicate content (25 posts all on the same page make 25 different ?p=xxx links that lead to the same page).
- Close the forum "application" to bots (as it already is through SIDs, but also with nofollow, noarchive meta tags in the HTML header) and make an archive available. This is what has been done for vB, and IMHO it is the better approach, because the served pages will be more likely to please Google and searchers, as they only contain the information and not the rest of the baggage (smaller file size, etc.). Because the forums are closed, you will also not have a problem with duplicate content. And those archives are also a useful feature for users.
Furthermore you'd have more control over which forums are archived, or maybe even control at the topic level. Much like the "library" here at WebmasterWorld.
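The second approach, closing the live forum to bots while keeping an archive crawlable, can be sketched like this. Purely illustrative: robots_meta() and the idea of flagging archive pages are invented for the example, not part of phpBB or vB.

```php
<?php
// Sketch: live forum pages tell robots to stay out, while a stripped-down
// archive copy of each topic stays crawlable. robots_meta() and the
// archive-page flag are illustrative, not part of phpBB.

function robots_meta(bool $is_archive_page): string
{
    if ($is_archive_page) {
        // Archive pages: plain HTML, no SIDs, free to crawl and index.
        return '<meta name="robots" content="index,follow">';
    }
    // Live application pages: keep bots out entirely.
    return '<meta name="robots" content="noindex,nofollow,noarchive">';
}

echo robots_meta(false);
// <meta name="robots" content="noindex,nofollow,noarchive">
```

The template's header would emit one tag or the other depending on which version of the page is being rendered.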

Anyway, those are my thoughts. I also don't believe you'll get any help from Google (and GoogleGuy) on this, as that would be totally against how they have been communicating up to this point. But then again: I'd like to be proven wrong... :D

EliteWeb

10:26 pm on May 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I used the information on the phpBB site and Google ate up all the pages; my hits doubled on some sites and tripled on others :D

ncsuk

10:29 pm on May 21, 2003 (gmt 0)

10+ Year Member



Where exactly is this info?

ScottM

10:35 pm on May 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



[phpbb.com...]

is the thread on removing session IDs.

I've tried it and it works very well.

Basically you cloak for Google...and although the very term 'cloak' causes some people to get a little worried, in this case, I believe Google has been pretty straightforward in saying they would not penalize this type of cloaking.

jimh009

11:02 pm on May 21, 2003 (gmt 0)

10+ Year Member



Don't be so sure about the hack that allows the spidering of phpBB2 forums. I used that hack on my site, and it worked for many months - until now. During the last pass through, Google really messed up something that affected my home page.

As such, I had to reinstate the SIDs and ban all spiders from the forum, just for good measure.

Might want to read the message I put in another forum today about it.

[webmasterworld.com...]

Jim

eaden

10:51 am on May 22, 2003 (gmt 0)

10+ Year Member



A few things that phpBB can do:

- Easiest and biggest impact: use <h2> etc. tags instead of <span class="heading"> etc. Google doesn't know <span class="foo"> from <span class="bar">, but it does know what a <h2> is.

- Include the "no append_sid() for Google" cloak.

- Remove "View forum" and "View topic" from the title, i.e. instead of "MyForum.com :: View topic - Test Post" it should be "MyForum.com - Test Post".

- Make index.php renameable so that it's possible to install phpBB in the root of your server. E.g. I edited the PHP files and replaced all "index.$phpEx" with "forum.php". So now my forum is at mysite.com/forum.php and mysite.com/viewtopic.php, but index.php has nothing to do with the forum. It was quite a bit of work to rename them all. This gives the forum better PageRank.

- maybe remove some of the redundant links, e.g. only have one link to the current forum on a view topic page.

- all I can think of for now.
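The title cleanup eaden suggests is essentially a string replacement. A sketch: clean_title() is a made-up name, and real phpBB assembles its titles in the templates rather than through a function like this.

```php
<?php
// Sketch: strip the ":: View topic -" / ":: View forum -" boilerplate
// from page titles so the thread name carries the weight.
// clean_title() is illustrative; phpBB builds titles in its templates.

function clean_title(string $title): string
{
    return str_replace(
        [' :: View topic -', ' :: View forum -'],
        [' -', ' -'],
        $title
    );
}

echo clean_title('MyForum.com :: View topic - Test Post');
// MyForum.com - Test Post
```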

ncsuk

10:59 am on May 22, 2003 (gmt 0)

10+ Year Member



It would be great if I knew how to modify php ;(

tombot

11:13 am on May 22, 2003 (gmt 0)

10+ Year Member



Interesting. I am glad to see that you are concerned about making your software Google friendly. Just a few days ago I was looking at forum software that is exactly that and what I came up with was yours, phpBB. I didn't realize that the successful sites were using hacks, but it's good to know that those are available.

I will be using phpBB when I start my forum.

trillianjedi

11:26 am on May 22, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I posted a while back the problems with SEO on phpBB and PostNuke/phpNuke type sites:-

phpBB:-

1. Page titles are very poor (same title for every page - no relation to content).

2. Dynamic URLs

1 is tricky but very important for good SEO. Take a look at the forum on this very site. See the page titles? They reflect the name of the thread - that's good spider food and great for users/searchers. phpBB does not do that "out of the box".

2 is quite easy to fix with mod_rewrite (or a custom version of the same).

In my view, if you're building a new forum around phpBB - get those things fixed before you launch. If you can't fix it or can't pay someone who can, use something else.

PostNuke/phpNuke:-

Same two as above plus:-

3. CSS code makes <h1> style tagging impossible

4. No "Description" tag in pages

5. Same meta tag data in every page (you can set it to "dynamic meta data" but this just fills the meta tags with the entire contents of the page! i.e. Spam).
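Point 4 (the missing description tag) is fixable by deriving a short description from the opening post. A sketch: truncate_description() is an invented helper, and where you would hook it in depends on the CMS.

```php
<?php
// Sketch: derive a sane <meta name="description"> from the opening post
// instead of dumping the entire page contents into the meta tags.
// truncate_description() is illustrative, not a phpBB/phpNuke function.

function truncate_description(string $post_text, int $max_len = 160): string
{
    // Strip markup and collapse whitespace to plain text.
    $text = trim(preg_replace('/\s+/', ' ', strip_tags($post_text)));
    if (strlen($text) <= $max_len) {
        return $text;
    }
    // Cut at the last word boundary before the limit.
    $cut = substr($text, 0, $max_len);
    $pos = strrpos($cut, ' ');
    return rtrim($pos !== false ? substr($cut, 0, $pos) : $cut) . '...';
}

echo '<meta name="description" content="'
    . htmlspecialchars(truncate_description('<p>Hello   world, the first post.</p>'))
    . '">';
```

This gives each thread page a unique, content-derived description without the "entire page in the meta tags" spam problem described above.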

Again, if you have the opportunity to fix those before you launch, do it. It's easier in the beginning - if you break it, it doesn't matter - you lose nothing.

Without those 5 things fixed, your website's chance of success is severely limited.

TJ

trillianjedi

11:35 am on May 22, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sorry, my post is misleading regarding page titles.

phpBB on its own will do good page titles, but running as a module within PostNuke/phpNuke, it won't.

Those are faults with PostNuke/phpNuke and not phpBB.

TJ

Swamper

4:47 pm on May 22, 2003 (gmt 0)

10+ Year Member



My problem was not trying to get Google to index the forums; it was trying to get Google to stop indexing the forums part of my phpNuke site with the phpBB mod.
Google was getting stuck on session IDs ONLY in the forums. It racked up over 120,000 hits to my site in less than 3 days.

Looking for the correct solution, I found that I could disable session IDs for Google only.

[phpbb.com...]

The only change I had to make to my phpbbtonuke mod from [toms-home.com...] was to leave in the line

$url = "../../../modules.php?name=Forums&file=$url";

This seems to have pleased Googlebot, as it has slowed down indexing with session IDs. I imagine it will come back with those session IDs once more to verify.

I also contacted Google via email and the reply I got was as follows.

Shutting off session IDs for Googlebot will allow for a more efficient crawl of your site by our robots. In addition, shutting off session IDs should have no effect on the ranking or the inclusion of your site in the Google index.

In addition, the Google index does include pages that have question marks in their URLs, including dynamically generated pages. However, these pages comprise a very small portion of our index. Pages that contain question marks in their URLs can cause problems for our crawler and may be ignored. If you suspect that URLs containing question marks are causing your pages not to be included, you may want to consider creating copies of those pages without question marks in the URL for our crawler. If you do this, please be sure to include robots.txt on the pages with the question marks in the URL that block our crawler to ensure that these pages are not seen as having duplicate content.
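Google's advice about blocking the question-mark versions translates, in a phpBB/phpNuke context, into something like this robots.txt sketch. The paths are illustrative and assume question-mark-free copies exist as Google suggests; adjust them to your own setup:

```
# Sketch: keep the crawler on the static copies and off the dynamic
# originals, so the two versions aren't seen as duplicate content.
User-agent: *
Disallow: /viewtopic.php
Disallow: /viewforum.php
Disallow: /modules.php
# Static copies (e.g. /topic-1234.html) are not listed, so they stay crawlable.
```

Note that robots.txt matches on URL prefixes, so disallowing /viewtopic.php covers every query-string variant of that script.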

The site in question is [newbands.ca...] .
Originally the server load and MySQL usage were quite high the first two days of Google trying to do its best.

As I said earlier, this seems to have pleased the Google gods and my site is still intact. My rating hasn't changed as of yet.

This is slightly unrelated, yet it still seems applicable to the original question and should help phpBB-enabled sites get listed better.