As anyone interested in SEO knows, duplicate content is a serious issue. No one will deny that it causes damage, especially on new sites.
I need to remove all query strings from the root folder "/", but allow any query string that feeds an existing page.
I'm running a phpBB forum and, ironically, I installed a few SEO mods, including a zero-duplicate module. However, it does not address the simple issue of query strings appended to my root URL.
-----------------------------------
The Problem: Query Strings Appended to My URL / Domain
Yahoo is linking to my site & creating duplicate content:
example.com/?f=4
example.com/?f=5
example.com/?sid=5f987d8f7987
All of these links point to my home page, so they're creating endless duplicate entries in Google and Yahoo.
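For the narrowest form of the problem described above (query strings on the root URL only), a minimal sketch might look like the following. It assumes the rules live in the .htaccess file in the document root and that www.example.com is the preferred hostname; it makes no exceptions for any parameter, so it is a starting point rather than the full solution.

```apache
RewriteEngine On
# Only act when the query string is non-empty
RewriteCond %{QUERY_STRING} .
# In a root .htaccess file, ^$ matches a request for "/" itself
# The trailing "?" on the target drops the original query string
RewriteRule ^$ http://www.example.com/? [R=301,L]
```

Because the redirect target carries no query string, the condition fails on the follow-up request and the rule does not loop.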
-----------------------------------
Two Exceptions:
There are only 2 cases where I want to allow a query string
There are pages with "SID" in the parameter. These can't be touched.
-----------------------------------
Simple Solutions Fail
Yes, there are simple solutions out there, but they allow no exceptions, so they damage the forum to the point where it can no longer function!
I need to allow query strings on any URL whose file name ends in .php, for example:
example.com/viewforum.php?f=5&p=5
URLs like this one here can't have the query string stripped, or the site no longer functions!
-----------------------------------
Working Towards a Solution
This didn't work for me:
RewriteCond %{THE_REQUEST} ^GET\ /.*\;.*\ HTTP/
RewriteCond %{QUERY_STRING} !^$
RewriteRule .* http://example.com%{REQUEST_URI}? [R=301,L]
Aside from the fact that it did nothing, it attempts to strip all query strings--we can't have that or my site won't function.
This also fails to work:
# Redirect anything with a query string, force www, use same path, and remove all the query string parts.
RewriteCond %{QUERY_STRING} .
RewriteRule (.*) http://www.example.com/$1? [R=301,L]
-----------------------------------
What the Solution would do
I need a piece of htaccess that strips all queries, unless the page ends in .php
example.com/viewforum.php?f=5&p=8 would need to remain untouched, while
example.com/?f=5&p=8 would need to be stripped down to example.com
There are no public directories on this site (except virtual ones). If possible, it would also be nice to strip query strings from URLs like this:
example.com/forum5/?f=5&p=8
I think stripping queries from any page that doesn't end in .php would solve this--since there's no ".php" before this query.
I know how to allow good bots:
# REMEMBER YOU ONLY NEED TO START MOD REWRITE ONCE
RewriteEngine On
# REWRITE BASE
RewriteBase /
# Allow nice bots
SetEnvIfNoCase User-Agent .*google.* search_robot
SetEnvIfNoCase User-Agent .*yahoo.* search_robot
SetEnvIfNoCase User-Agent .*bot.* search_robot
SetEnvIfNoCase User-Agent .*ask.* search_robot
Order Deny,Allow
Deny from All
Allow from env=search_robot
Anyway, I've been researching this for several hours, and I'm completely stumped...
Apparently this is my solution--I googled my brains out, and instead found this by browsing this forum! (go figure.... maybe google isn't all-that after all...)
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^?]*)\?
RewriteRule !\.php$ http://www.example.com/%1? [R=301,L]
Thank you for providing the solution already. I guess I didn't search enough... At least you guys get some free content...
[edited by: jdMorgan at 5:31 am (utc) on Mar. 5, 2009]
[edit reason] Formatting fixed. [/edit]
I know the "!" means "not" (right?)...but how do I squeeze in another "not" statement so that I can allow queries with SID?
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^?]*)\?
RewriteRule !\.php$ http://www.example.com/%1? [R=301,L]
RewriteCond %{HTTP_USER_AGENT} googlebot|slurp|teoma|msnbot [NC]
RewriteCond %{QUERY_STRING} !^SID=
RewriteCond %{REQUEST_URI} !\.php$
RewriteRule ^(.+)$ http://www.example.com/$1? [R=301,L]
Comment-out the first RewriteCond line for initial testing with your browser. Then uncomment it and test with a user-agent spoofer after getting the other parts working first.
Remember to completely flush your browser cache before testing any changes to your server config code.
Jim
I'm actually using a slightly modified version of these instructions on multiple sites, because I have found problems with query strings getting indexed by search engines.
Thanks for this jdmorgan:
RewriteCond %{HTTP_USER_AGENT} googlebot|slurp|teoma|msnbot [NC]
RewriteCond %{QUERY_STRING} !^SID=
RewriteCond %{REQUEST_URI} !\.php$
RewriteRule ^(.+)$ http://www.example.com/$1? [R=301,L]
Will this still work if I just do it for all users, regardless? I'm not too worried about the advertising query string tracking until my URLs are clean.
This didn't seem to work:
RewriteCond %{QUERY_STRING} !^SID=
RewriteCond %{REQUEST_URI} !\.php$
RewriteRule ^(.+)$ http://www.example.com/$1? [R=301,L]
I wish I understood apache directives better....
That tells us almost nothing. Didn't work in what way, specifically?
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^?]*)\?
RewriteCond %{QUERY_STRING} !^SID=
RewriteCond %{QUERY_STRING} !^np=
RewriteRule !\.php$ http://example.com/%1? [R=301,L]
Anyway, let me state what the above does... Unless the query string starts with SID= or np=, it removes the query string and the question mark with a 301 redirect, for both bots and users (unless it's a .php page).
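One thing worth checking with the rule above: RewriteCond patterns are case-sensitive by default, and the example URLs earlier in the thread use a lowercase sid= parameter, which !^SID= would not exempt. A hedged variant is sketched below. It assumes the parameters may arrive in either case and at any position in the query string; the (^|&) prefix and the [NC] flags are additions for illustration, not part of the posted rule.

```apache
RewriteEngine On
# Assumption: sid= / np= may appear in either case, first or after an "&"
RewriteCond %{QUERY_STRING} !(^|&)sid= [NC]
RewriteCond %{QUERY_STRING} !(^|&)np= [NC]
# Capturing condition kept last so %1 refers to its group:
# the requested path without the query string
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^?]*)\?
RewriteRule !\.php$ http://example.com/%1? [R=301,L]
```

If the session parameter on a given install is always uppercase and always first, the original conditions are sufficient as posted.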
I noticed my portal mod needs to allow an ?np= query so people can scroll through recent post topics. I'm using the Board3.DE portal mod in conjunction with the PHPBB SEO Simple MOD. The PHPBB SEO "no duplicate" mod supposedly already takes care of every possible duplicate--however, I'm concerned because somehow Yahoo has query strings on indexed URLs (a duplicate content issue for Yahoo/Google). I admit these pages were indexed before I installed the mod, so maybe the mod is 'taking care of it'. However, I'm too proactive not to do something about it immediately (duplicate content is just such a huge problem).
Bots are already pushed off ?SID queries by a PHPBB SEO mod I installed--otherwise, I would use the statement with the user-agent condition from your first reply (jdMorgan).
RewriteCond %{HTTP_USER_AGENT} googlebot|slurp|teoma|msnbot [NC]
Thanks for providing that example actually--it may come in handy in the future; those are usually the only specific bot user agents I'm concerned about.
On a final note, I may need to later insert a condition to allow queries on a specific .html page.. unanswered.html; however, this is not critical...
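A rough sketch of how such an exception might be added is shown below. It assumes unanswered.html is requested from the site root and that this redirect runs before any internal rewrite touches the URL; the REQUEST_URI condition is hypothetical and untested against this particular forum setup.

```apache
RewriteEngine On
RewriteCond %{QUERY_STRING} !^SID=
RewriteCond %{QUERY_STRING} !^np=
# Hypothetical exception: leave query strings alone on /unanswered.html
RewriteCond %{REQUEST_URI} !^/unanswered\.html$
# Capturing condition kept last so %1 refers to its group
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^?]*)\?
RewriteRule !\.php$ http://example.com/%1? [R=301,L]
```

The conditions are ANDed, so each additional "not" line simply carves out one more case where the redirect is skipped.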
For now, I'm more than content--I'm thrilled I got the support I needed here. I knew this was a good forum. I've been finding solutions here through Google searches for a long time, probably years. This is just the first time I decided to post here. Thanks for the awesome free support!
[edited by: jdMorgan at 12:08 am (utc) on Mar. 7, 2009]
[edit reason] example.com [/edit]