Forum Moderators: open
I'm one of the senior developers of [url=www.phpbb.com]phpBB[/url], one of the Open Source forum systems. IMO we, as the development team, could help Google quite a bit in making the spider/index process more efficient. phpBB creates a lot of 'useless' links for the spider. An example: every post in a topic has a link that points to itself. So if you have a page like viewtopic.php?t=1234, then you'll have 25 links like viewtopic.php?p=654321, viewtopic.php?p=654334, etc. Because of that, Google indexes a LOT of redundant pages. Furthermore, it's fairly easy to serve a different (very minimal) template set to Google to reduce parsing time and bandwidth consumption, etc. But so far we haven't had any response from the Google team. IMO it would be nice if they could give us some hints, tips and pointers.
This is a timely topic, as I'm currently comparing phpBB and a couple of others for a forum upgrade, and all seem to have some search engine clinkers.
I think you have come to the right place for advice on how to work with Google. Don't hold your breath waiting for Google itself to give you optimization advice, although one never knows...
My quick take on getting content indexed properly without duplication but with maximum indexing:
1) Best - rewrite the URLs to look static, like the pages on this forum. Browse around here to see how multiple page threads are handled. I realize this might create more installation difficulties, but is the cleanest solution for more than just Google.
2) Keep dynamic URLs short and not redundant. Google is getting better at indexing dynamic content, but long query strings and different strings that lead to the same content are bad.
3) Don't use session variables. These are another big problem for search engines. There have been extensive posts at phpbb.org about hacks to eliminate session variables for spiders. Eliminating them for everyone would be even better.
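Tip 1 above (static-looking URLs) is usually done with Apache's mod_rewrite. A minimal .htaccess sketch, assuming Apache with mod_rewrite enabled; the `topic-1234.html` naming convention is made up for illustration, not something phpBB emits:

```apache
# .htaccess sketch: map a static-looking URL onto phpBB's real script.
RewriteEngine On
# /topic-1234.html  ->  /viewtopic.php?t=1234
RewriteRule ^topic-([0-9]+)\.html$ viewtopic.php?t=$1 [L]
```

Note the rewrite only handles incoming requests; the forum's templates would also have to emit links in the static form, or spiders will never see them.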
I'm sure others will have input. Good luck, and thanks for dropping by.
PS - note that "serving different content" is also known as cloaking. One person's helpful distinction between visitors is another person's punishable offense. I'd get some additional input on this issue before you roll anything out.
Thought you might want to have a look at my latest problem. Maybe you can also help me out.
The problem I have/had is googlebot choking on session IDs.
Follow my whole thread here.
[webmasterworld.com...]
It is actually a mod of phpBB for phpNuke that I use. I have implemented a hack to session.php to check the user agent and disable sessions for googlebot, but it does not seem to be stopping googlebot from trying to use the session IDs in the URL. Unless they are trying to reindex, given the 80,000-plus hits to my site that they attempted this weekend.
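The user-agent hack described above usually boils down to a small check in the session code. A sketch, assuming the forum exposes the raw User-Agent string; the helper name and bot list are my own, not phpBB's:

```php
<?php
// Hypothetical helper: decide whether the visitor is a known spider,
// so the session ID can be left off URLs for it.
function is_search_bot($userAgent)
{
    // Substrings are assumptions; extend the list for other crawlers.
    $bots = array('Googlebot', 'Slurp', 'msnbot');
    foreach ($bots as $bot) {
        if (stripos($userAgent, $bot) !== false) {
            return true;
        }
    }
    return false;
}

// In session handling: skip URL-propagated sessions for known bots.
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$disableSidInUrls = is_search_bot($userAgent);
```

The key point is that the check must run before any `sid=` parameter is appended to generated links, otherwise the spider keeps collecting session URLs anyway.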
If you can help please do!
Swamper
1) Best - rewrite the URLs to look static, like the pages on this forum. Browse around here to see how multiple page threads are handled. I realize this might create more installation difficulties, but is the cleanest solution for more than just Google.
If anyone finds out the best and easiest way to do this please post it here or sticky me as well. I have just created my first phpBB, but all the configuration was done by my host, so I am a bit in the dark.
I would really like to make the pages appear static as mentioned, for best "indexability".
If anyone finds out the best and easiest way to do this please post it here or sticky me as well. I have just created my first phpBB, but all the configuration was done by my host, so I am a bit in the dark.
Some people have already found the solution. I just did a search and found these threads:
[phpbb.com...]
[phpbb.com...]
[phpbb.com...]
[phpbb.com...]
Hope this helps.
It seems like software designers operate in a search engine vacuum - I just had a shopping cart author release an "upgrade" that hoses the URLs. When I try to offer suggestions that would benefit the whole user community, I come up against the author's dated and inaccurate knowledge of how SEs work. Argghhh... (Not talking about Bart, of course! He's here at WebmasterWorld! ;))
PS - I looked at one solution that claimed to convert the dynamic URLs to static ones. Well, it works - except that a gigantic session ID is in every "static" URL! Somehow, I think the indexing of that site won't benefit from that particular hack... ;)
I think what Bart is getting at is that the forum is an "application" rather than a purely static informational site. As such, it has many links, data, buttons, etc. which only serve the user in using the application, but those things are "worthless baggage" for a search engine. Hence the question of how to minimize the useless overhead that is created by making the regular site available as a whole.
I found that robots.txt alone is not a good way. Reason: Google will still serve these pages in the SERPS solely based on the anchor-text/links pointing to them, even if the pages are themselves not crawled.
Viable Solutions (again IMHO):
- only dropping the session ID for "valuable links", that is, links that clearly identify a viewforum?f=xxx or a viewtopic?t=xxx page. The problem with viewtopic?p=xxx is that it leads to massive duplicate content (25 posts all on the same page make 25 different ?p=xxx links that lead to the same page).
- Closing the forum "application" to bots (as it already is through SIDs, but also with a nofollow, noarchive meta tag in the HTML header) and making an archive available. This is what has been done for vB. And IMHO it is the better approach, because the served pages will be more likely to please Google and searchers, as they contain only the information and not the rest of the baggage (smaller file size, etc.). Because the forums are closed, you will also not have a problem with duplicate content. And those archives are also a useful feature for users.
Furthermore you'd have more control over which forums are archived, or maybe even at the topic level. Much like the "library" here at webmasterworld.
Anyway, those are my thoughts. I also don't believe you'll get any hints from google (and GoogleGuy) on this, as that would be totally against how they have been communicating up to this point. But then again: I'd like to be proven wrong... :D
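The first option above, dropping the session ID only for valuable links when a spider is visiting, could be wrapped around phpBB's URL builder. A sketch under stated assumptions: the function name is my own, and I'm passing the SID and bot flag in explicitly rather than reading phpBB's globals:

```php
<?php
// Sketch of an append_sid()-style wrapper: for spiders, return the
// canonical URL with no session ID so each page has one stable address.
function append_sid_cloaked($url, $sid, $isBot)
{
    if ($isBot || $sid === '') {
        return $url;
    }
    // Preserve any existing query string when attaching the SID.
    $separator = (strpos($url, '?') === false) ? '?' : '&';
    return $url . $separator . 'sid=' . $sid;
}
```

For example, `append_sid_cloaked('viewtopic.php?t=1234', $sid, true)` returns the plain `viewtopic.php?t=1234`, which is exactly the "valuable link" form a spider should see.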
is the thread to remove session IDs.
I've tried it and it works very well.
Basically you cloak for Google... and although the very term 'cloak' causes some people to get a little worried, in this case I believe Google has been pretty straightforward in saying they would not penalize this type of cloaking.
As such, I had to reinstate the SIDs and ban all spiders from the forum, just for good measure.
Might want to read the message I put in another forum today about it.
[webmasterworld.com...]
Jim
- Easiest and biggest impact: use <h2> etc. tags instead of <span class="heading"> etc. Google doesn't know <span class="foo"> from <span class="bar">, but it does know what an <h2> is.
- include the cloak that stops append_sid() from adding session IDs for Google
- remove "view forum" and "view topic" from the title i.e. instead of "MyForum.com :: View topic - Test Post" it should be "MyForum.com - Test Post"
- make index.php renameable so that it's possible to install phpBB in the root of your server. E.g. I edited the PHP files and replaced all "index.$phpEx" with "forum.php". So now my forum is at mysite.com/forum.php and mysite.com/viewtopic.php, but index.php has nothing to do with the forum. It was quite a bit of work to rename them all. This gives the forum better PageRank.
- maybe remove some of the redundant links, e.g. only have one link to the current forum on a view topic page.
- all I can think of for now.
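The title fix suggested above ("MyForum.com - Test Post" instead of "MyForum.com :: View topic - Test Post") is a one-line change to wherever the page title is assembled. A sketch with a hypothetical helper; the variable names are assumptions, not phpBB's actual template code:

```php
<?php
// Build a per-page <title> without the "View topic -" boilerplate,
// assuming $boardName and $topicTitle come from the board config
// and the current topic row.
function make_page_title($boardName, $topicTitle)
{
    if ($topicTitle === '') {
        // Index and other non-topic pages fall back to the site name.
        return $boardName;
    }
    return $boardName . ' - ' . $topicTitle;
}
```

So a topic page gets a title that reflects its thread name, which is the "spider food" discussed later in this thread.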
I will be using phpBB when I start my forum.
phpBB:-
1. Page titles are very poor (same title for every page - no relation to content).
2. Dynamic URLs
1 is tricky but very important for good SEO. Take a look at the forum on this very site. See the page titles? They reflect the name of the thread - that's good spider food and great for users/searchers. phpBB does not do that "out of the box".
2 is quite easy to fix with mod_rewrite (or a custom version of the same).
In my view, if you're building a new forum around phpBB - get those things fixed before you launch. If you can't fix it or can't pay someone who can, use something else.
PostNuke/phpNuke:-
Same two as above plus:-
3. CSS code makes <h1> style tagging impossible
4. No "Description" tag in pages
5. Same meta tag data in every page (you can set it to "dynamic meta data" but this just fills the meta tags with the entire contents of the page! i.e. Spam).
Again, if you have the opportunity to fix those before you launch, do it. It's easier in the beginning - if you break it, it doesn't matter - you lose nothing.
Without those 5 things fixed, your website's chance of success is severely limited.
TJ
Looking for the correct solution, I found that I could disable session IDs for Google only.
[phpbb.com...]
The only change I had to make to my phpbbtonuke mod from [toms-home.com...] was to leave in the line
$url = "../../../modules.php?name=Forums&file=$url";
This seems to have pleased googlebot, as it has slowed down indexing with session IDs. I imagine it will come back with those session IDs once more to verify.
I also contacted Google via email and the reply I got was as follows.
Shutting off session IDs for Googlebot will allow for a more efficient crawl of your site by our robots. In addition, shutting off session IDs should have no effect on the ranking or the inclusion of your site in the Google index.

In addition, the Google index does include pages that have question marks in their URLs, including dynamically generated pages. However, these pages comprise a very small portion of our index. Pages that contain question marks in their URLs can cause problems for our crawler and may be ignored. If you suspect that URLs containing question marks are causing your pages not to be included, you may want to consider creating copies of those pages without question marks in the URL for our crawler. If you do this, please be sure to include robots.txt on the pages with the question marks in the URL that block our crawler to ensure that these pages are not seen as having duplicate content.
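The robots.txt blocking that Google's email describes could look roughly like this. A sketch only: the paths are hypothetical examples for a phpBB install under /forum/, and the `*` wildcard shown is a Googlebot extension, not part of the original robots.txt standard:

```
# robots.txt sketch: keep the question-mark duplicates out of the crawl
User-agent: Googlebot
Disallow: /forum/*p=
Disallow: /forum/*sid=
```

As noted earlier in the thread, blocking crawling alone may not keep URLs out of the SERPs entirely if they are heavily linked, so this is a complement to dropping SIDs, not a replacement.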
The site in question is [newbands.ca...] .
Originally the server load and MySQL load were quite high the first two days of Google trying to do its best.
As I said earlier, this seems to have pleased the Google gods and my site is still intact. My ranking hasn't changed as of yet.
This is slightly unrelated, yet it still seems applicable to the original question and should help phpBB-enabled sites to get listed better.