I have a forum that is well indexed by Google. But, as with most dynamic web tools, the same content can be accessible from different URLs. (I am using SMF, if anyone's interested).
I have used .htaccess mod_rewrite and robots.txt extensively to create SE-friendly URLs and to keep duplicate URLs for the same content out of the index. But, until recently, I had overlooked one area.
The URL for each thread is something like this:
http://www.example.com/forum/index.php/topic,1234.0.html
But each posting within a thread has a URL similar to this:
http://www.example.com/forum/index.php/topic,1234.msg321.html
Now, Google has been happily indexing both types of URL and, in search results, simply choosing one over the other to display. The other URL is consigned to the supplemental index, which is kind of fine; I don't need both indexed.
The first 'root' URL would obviously be the ideal one to keep, if only because it is often linked to internally. But Google seems to prefer the second type of URL, partly because that is the URL used in the 'Latest Posts' feature on the front page (so it is the most accessible to Googlebot) and also because external web sites sometimes link to a specific post rather than the 'root' of the thread.
My dilemma is whether to exclude the second format of URL via robots.txt, to try to get some kind of consistency.
Or should I not fix what isn't broken? I have not suffered any kind of duplicate content 'penalty' so far (touch wood), and blocking these URLs now might exclude hundreds of pages.
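For clarity, the kind of rule I've been weighing looks like this (note that the * wildcard in Disallow is a Googlebot extension, not part of the original robots.txt standard, so other crawlers would ignore it):

User-agent: Googlebot
# Block the per-post .msg URLs but leave the thread 'root' URLs crawlable
Disallow: /forum/index.php/topic,*.msg

That pattern would catch /forum/index.php/topic,1234.msg321.html but not /forum/index.php/topic,1234.0.html.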
Any thoughts?
[edited by: tedster at 7:07 pm (utc) on Nov. 1, 2006]
[edit reason] use example.com [/edit]
Now, Google has been happily indexing both types of URL and, in search results, simply choosing one over the other to display.
In my opinion, you shouldn't try to manipulate that situation. Good for you for recognizing that there is potential for chaos; if trouble ever does start, you'll know what to do. But it sounds like Google is handling the duplicate issue just fine in your case, and trying to "fix" something here might actually make trouble.
Indeed. I tried the robots rule for a week or two and it seemed to slow down the indexing process somewhat. But that makes sense, as the 'root' URLs are buried one level deeper than the .msg URLs.
I've gone back to the way it was and I'll see how it goes.
Any other opinions welcome.
It seems to me that, in essence, these different URLs are not actually duplicate content in the true sense. And perhaps all this thinking about blocking duplicate URLs is too much of a linear approach to a subject that... well, isn't all that linear. Especially when we are talking about content management systems and forums.
For example, when Google indexes one of the .msg URLs on my forum, it is essentially indexing a specific item of information.
And I've noticed, on occasion, that when searching for the same thread using slightly different keywords, Google puts one version of the thread before the other in its results (i.e. it is not always the same URL that is supplemental, or in the main index). So, rather than the different URLs being treated as 'duplicates', Google is quite cleverly acknowledging that the web isn't made up of lots of static documents with a beginning and an end, but is rather a cyclical repository of information that can be stored and accessed in numerous different ways. Especially in the case of forums, where a thread can be extremely long and cover a range of subjects and viewpoints, it seems only logical that Google would index that thread at different points, depending on how its robots see the information being presented.
It seems to me that this is quite different to a web site having duplicate versions of its content scattered around different places.
Or am I missing the whole point here...?
Anyone else experiencing this? My eBay listings contain duplicate content for approximately 8,000 products. On relevant keyword searches, I find that my eBay listing URLs come up high in the SERPs while my own site's URLs are buried in supplemental. Is this causing a dupe penalty?
Most of the pages went supplemental in early October, and I have waited for a rebound, which has not happened. I also submitted a reinclusion request. The site has a PR 5 index page and many URLs below it with PR 4 and lower. The URLs with PR are indexed, but the thousands of category and product URLs are supplemental. I have compared the site against competitors' sites with similar products, PR, and backlinks, and they are not having the supplemental issue.
I have in place 301 redirects for canonical issues, and also to change PHP queries to friendly HTML URLs.
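For context, the canonical-host piece is the usual .htaccess pattern (with example.com standing in for my actual domain):

RewriteEngine On
# Canonical host: 301 any non-www request over to the www version
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]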
Link to URL "B" on the page, and set up an internal rewrite from "B" to "A" as well as an external 301 redirect from "A" to "B".
If a user asks for "A" they are redirected to "B" with a 301. If they ask for "B" then the URL is silently and internally rewritten as "A" and the content is served.
This does not cause a loop because one of them is an internal rewrite, and one is a physical redirect.
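A sketch of that in .htaccess, using a made-up product.php script as the example (the THE_REQUEST check is one way to make sure the 301 only fires on what the client originally asked for, so the internal rewrite below can't re-trigger it):

RewriteEngine On

# External 301: a request for "A" (the dynamic URL) is redirected to "B".
# THE_REQUEST holds the original request line, untouched by internal
# rewrites, so the rewritten request never matches this condition.
RewriteCond %{THE_REQUEST} ^GET\ /product\.php\?id=([0-9]+)\ HTTP
RewriteRule ^product\.php$ /product-%1.html? [R=301,L]

# Internal rewrite: a request for "B" is silently served by "A";
# the visitor's address bar still shows /product-123.html.
RewriteRule ^product-([0-9]+)\.html$ /product.php?id=$1 [L]

The trailing "?" on the redirect target drops the old query string from the new URL.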