htaccess to remove advertising query strings and prevent duplicates

how to remove advertising query strings from the root url

         

deadstar32

3:30 am on Mar 5, 2009 (gmt 0)

10+ Year Member



I'm new to this forum--I've been impressed with answers I've read here, so I figure there are experts around that can easily answer this question...(I hope!)
-----------------------------------
Htaccess
I'm completely stumped & need some expert advice on removing some advertising query strings from the base or root URL without touching any other query strings.

As anyone interested in SEO knows, duplicate content is a serious issue; no one will deny that it causes damage, especially on new sites.

I need to remove all query strings from the root folder "/", but allow any query string that feeds an existing page.

I'm running a phpBB forum and, ironically, I installed a few SEO mods, including a zero-duplicate module. However, it does not address the simple issue of query strings present in my root URL.
-----------------------------------
The Problem: Query Strings appending my URL / Domain

Yahoo is linking to my site & creating duplicate content:
example.com/?f=4
example.com/?f=5
example.com/?sid=5f987d8f7987

All of these links point to my home page, so it's creating endless duplicate entries in Google & Yahoo.
-----------------------------------
Two Exceptions:
There are only 2 cases where I want to allow a query string:
1. Pages with "SID" in the query string. These can't be touched.
2. URLs that point to a .php file (see the example under "Simple Solutions Fail").

-----------------------------------
Simple Solutions Fail
Because these simple solutions allow no exceptions

Yes, there are simple solutions out there, but they damage the forum so it can no longer function!

I need to allow query strings anywhere there's a .php on the file, for example:

example.com/viewforum.php?f=5&p=5

URLs like this one here can't have the query string stripped, or the site no longer functions!

-----------------------------------
Working Towards a Solution

This didn't work for me:
RewriteCond %{THE_REQUEST} ^GET\ /.*\;.*\ HTTP/
RewriteCond %{QUERY_STRING} !^$
RewriteRule .* http://example.com%{REQUEST_URI}? [R=301,L]

Aside from the fact that it did nothing, it attempts to strip all query strings--we can't have that or my site won't function.
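
A likely reason the first attempt did nothing (an inference from the pattern, not something confirmed in this thread): its first condition only matches when the request line contains a literal semicolon (\;), so a URL such as example.com/?f=4 never triggers the rule. A minimal sketch of the same attempt keyed on a question mark instead might look like this. It still strips every query string, including those on .php pages, so it only illustrates the pattern fix and does not meet the requirements stated above.

```apache
RewriteEngine On

# Same idea as the attempt above, but testing for a literal "?" in the raw
# request line (e.g. "GET /?f=4 HTTP/1.1") instead of ";".
RewriteCond %{THE_REQUEST} ^GET\ /[^?]*\?
# Only act when the query string is not empty.
RewriteCond %{QUERY_STRING} !^$
# The trailing "?" on the target removes the query string from the redirect.
RewriteRule .* http://example.com%{REQUEST_URI}? [R=301,L]
```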

This also fails to work:
# Redirect anything with a query string, force www, use same path, and remove all the query string parts.
RewriteCond %{QUERY_STRING} .
RewriteRule (.*) http://www.example.com/$1? [R=301,L]
-----------------------------------
What the Solution would do

I need a piece of htaccess that strips all queries, unless the page ends in .php

example.com/viewforum.php?f=5&p=8 would need to remain untouched, while

example.com/?f=5&p=8 would need to be stripped down to example.com

There are no public directories on this site (except virtual ones). If possible, it would also be nice to strip query strings from URLs like this:
example.com/forum5/?f=5&p=8

I think stripping queries from any page that doesn't end in .php would solve this--since there's no ".php" before this query.

deadstar32

3:34 am on Mar 5, 2009 (gmt 0)

10+ Year Member



I don't see an edit option, so forgive the double post:
There's TWO... maybe 3 cases where I can't have query strings stripped:
1-SID in the query (?SID=f98d7sf9878f); if that's removed, the forum might not function in some cases
2-the page ends in .php
3-most ideally, I should just detect the user agent and really strip things for bots....

I know how to allow good bots:
# REMEMBER YOU ONLY NEED TO START MOD_REWRITE ONCE
RewriteEngine On
# REWRITE BASE
RewriteBase /
# Allow nice bots
SetEnvIfNoCase User-Agent .*google.* search_robot
SetEnvIfNoCase User-Agent .*yahoo.* search_robot
SetEnvIfNoCase User-Agent .*bot.* search_robot
SetEnvIfNoCase User-Agent .*ask.* search_robot

Order Deny,Allow
Deny from All
Allow from env=search_robot

Anyway, I've been researching this for several hours, and I'm completely stumped...

deadstar32

4:59 am on Mar 5, 2009 (gmt 0)

10+ Year Member



Solution
This was posted by jdMorgan
(Moderator of this Forum )
Here:
[webmasterworld.com...]

Apparently this is my solution--I googled my brains out, and instead found this by browsing this forum! (go figure.... maybe google isn't all-that after all...)

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^?]*)\?
RewriteRule !\.php$ http://www.example.com/%1? [R=301,L]

Thank you for providing the solution already. I guess I didn't search enough... At least you guys get some free content...

[edited by: jdMorgan at 5:31 am (utc) on Mar. 5, 2009]
[edit reason] Formatting fixed. [/edit]

deadstar32

5:03 am on Mar 5, 2009 (gmt 0)

10+ Year Member



Sigh.. I forgot... I also need to allow any queries with SID... or some users might have issues using the forum, logging in, etc...

I know the "!" means "not" (right?)...but how do I squeeze in another "not" statement so that I can allow queries with SID?

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^?]*)\?
RewriteRule !\.php$ http://www.example.com/%1? [R=301,L]

jdMorgan

5:27 am on Mar 5, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




RewriteCond %{HTTP_USER_AGENT} googlebot¦slurp¦teoma¦msnbot [NC]
RewriteCond %{QUERY_STRING} !^SID=
RewriteCond %{REQUEST_URI} !\.php$
RewriteRule ^(.+)$ http://www.example.com/$1? [R=301,L]

Replace the broken pipe "¦" characters with solid pipes before use; posting on this forum modifies the pipe characters.

Comment-out the first RewriteCond line for initial testing with your browser. Then uncomment it and test with a user-agent spoofer after getting the other parts working first.

Remember to completely flush your browser cache before testing any changes to your server config code.

Jim

g1smd

7:49 pm on Mar 5, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Disallow: /?
in robots.txt would also help drop these URLs for sites that cannot apply the redirect. The redirect is the better method.

deadstar32

1:17 am on Mar 6, 2009 (gmt 0)

10+ Year Member



Disallow: /? is solid advice for someone with a fresh site with a 'clean slate'. The problem is, many bad URLs already exist, and adding Disallow: /? could be a problem if Yahoo, Google, and MSN don't go back to those pages and receive my 302 instructions.

I'm actually using a slightly modified version of these instructions on multiple sites, because I have found problems with query strings getting indexed by search engines.

Thanks for this jdmorgan:
RewriteCond %{HTTP_USER_AGENT} googlebot¦slurp¦teoma¦msnbot [NC]
RewriteCond %{QUERY_STRING} !^SID=
RewriteCond %{REQUEST_URI} !\.php$
RewriteRule ^(.+)$ http://www.example.com/$1? [R=301,L]

Will this still work if I just do it for all users, regardless? I'm not too worried about the advertising query string tracking until my URLs are clean.

This didn't seem to work:
RewriteCond %{QUERY_STRING} !^SID=
RewriteCond %{REQUEST_URI} !\.php$
RewriteRule ^(.+)$ http://www.example.com/$1? [R=301,L]

I wish I understood Apache directives better....
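
Two things in the rule above could explain a failure, though the thread does not confirm either. First, in a .htaccess file the path tested by RewriteRule has its leading slash removed, so a request for example.com/?f=4 presents an empty path, and ^(.+)$ needs at least one character. Second, with the user-agent condition removed, nothing checks that a query string is actually present, so a URL like /forum5/ with no query would be redirected to itself repeatedly. A minimal sketch that addresses both, assuming the rules live in the root .htaccess file:

```apache
RewriteEngine On

# Skip queries that start with SID= (as in the posted conditions).
RewriteCond %{QUERY_STRING} !^SID=
# Require a non-empty query string so clean URLs are not redirected to themselves.
RewriteCond %{QUERY_STRING} .
# Leave .php pages alone.
RewriteCond %{REQUEST_URI} !\.php$
# ^(.*)$ also matches the empty path that .htaccess sees for "/".
RewriteRule ^(.*)$ http://www.example.com/$1? [R=301,L]
```

With this variant, example.com/?f=4 should redirect to http://www.example.com/ while example.com/viewforum.php?f=5&p=5 is left alone.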

g1smd

11:02 am on Mar 6, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



*** receive my 302 instructions ***

302 is absolutely the wrong thing to be doing here. Do ensure it is a 301.

jdMorgan

2:26 pm on Mar 6, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> This didn't seem to work:

That tells us almost nothing. Didn't work in what way, specifically?

  • How did you test (e.g. what were your input URLs?)
  • What was the result?
  • What page loaded?
  • What did you see in your browser address bar?
  • Any server errors?
  • Anything in the server error log?
  • How did these results differ from your expectations?
Jim

deadstar32

11:32 pm on Mar 6, 2009 (gmt 0)

10+ Year Member



jdMorgan, I was able to combine two of your examples into a series of rules that work perfectly for me:

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^?]*)\?
RewriteCond %{QUERY_STRING} !^SID=
RewriteCond %{QUERY_STRING} !^np=
RewriteRule !\.php$ http://example.com/%1? [R=301,L]

Anyway, let me state what the above does... unless a query string has ?SID= or ?np= it removes the query string and the question mark with a 301 redirect, for both bots & users (unless it's a .php page).
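
For readers following along, here is the same rule set with comments, plus one hedged adjustment. RewriteCond patterns are case-sensitive unless [NC] is given, and the session URL in the first post uses lowercase ?sid=, so this sketch matches the parameter names case-insensitively and anywhere in the query string. That adjustment is an assumption about how the forum writes its session parameter, not part of the posted solution.

```apache
RewriteEngine On

# Match only when the raw request line contains "?"; capture the path
# (minus the leading slash) as %1, e.g. "forum5/" from "GET /forum5/?f=5 HTTP/1.1".
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^?]*)\?
# Leave session and portal paging queries alone. (^|&) allows the parameter
# anywhere in the query string; [NC] also accepts lowercase "sid=" (assumed).
RewriteCond %{QUERY_STRING} !(^|&)sid= [NC]
RewriteCond %{QUERY_STRING} !(^|&)np= [NC]
# Apply only to paths that do not end in .php; the trailing "?" on the
# target drops the query string from the redirect.
RewriteRule !\.php$ http://example.com/%1? [R=301,L]
```

Under this sketch, example.com/forum5/?f=5&p=8 should redirect to http://example.com/forum5/, while example.com/?sid=5f987d8f7987 and example.com/viewforum.php?f=5&p=5 are left untouched.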

I noticed my portal mod needs to allow an ?np= query to allow people to scroll through recent post topics. I'm using the Board3.DE portal mod in conjunction with PHPBB SEO Simple MOD. The PHPBB SEO "no duplicate" mod supposedly already takes care of every possible duplicate; however, I'm just concerned because somehow Yahoo has query strings on indexed URLs (duplicate content issue for Yahoo/Google). I admit these pages were indexed before I installed the mod, so maybe the mod is 'taking care of it'. However, I'm too proactive to not do something about it immediately (duplicate content is just such a huge problem).

Bots are already pushed off ?SID queries with a PHPBB SEO mod I installed. Otherwise, I would use the statement with the user agent condition from your first reply (jdMorgan).

RewriteCond %{HTTP_USER_AGENT} googlebot¦slurp¦teoma¦msnbot [NC]

Thanks for providing that example actually--it may come in handy in the future; those are usually the only specific bot user agents I'm concerned about.

On a final note, I may need to later insert a condition to allow queries on a specific .html page.. unanswered.html; however, this is not critical...
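
One way such an exception might be added is with one more negated condition. This is a sketch only: it assumes the page is served at /unanswered.html in the site root, which the thread does not state, and that these rules run before any internal rewrites from the SEO mod.

```apache
RewriteEngine On

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^?]*)\?
RewriteCond %{QUERY_STRING} !^SID=
RewriteCond %{QUERY_STRING} !^np=
# Assumed extra exemption: keep query strings on /unanswered.html.
RewriteCond %{REQUEST_URI} !^/unanswered\.html$
RewriteRule !\.php$ http://example.com/%1? [R=301,L]
```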

For now, I'm more than content--I'm thrilled I got the support I needed here. I knew this was a good forum. I've probably been finding solutions here through Google searches for years. This is just the first time I decided to try to post here. Thanks for the awesome free support!

[edited by: jdMorgan at 12:08 am (utc) on Mar. 7, 2009]
[edit reason] example.com [/edit]

g1smd

12:02 am on Mar 7, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's not free :) You now have to find a couple of questions somewhere in the forum that you can answer - that way the workload is spread over a large number of people.