WebmasterWorld: Sitemaps, Meta Data, and robots.txt Forum
Moderators: goodroi

Disallowing Any Link with a Question Mark in it?
What is the correct way to do this? Is this the best approach?

 6:09 am on Feb 9, 2009 (gmt 0)

I no longer use any CMS, forum software, etc. on a website I manage. So, other than the built-in search engine, there is no longer any content on the site that has an address with a question mark in the URL. Any content that needed to be 301 redirected has been.

However, I notice the occasional link to non-existent pages showing up in the browser. (I think some of these URLs were indexed or linked to as a result of the somewhat flaky CMS/forum software that we used to use.)

For example:

www.example.com/directory/page.html is a legitimate URL.

However, when an inbound link points to www.example.com/directory/page.html?anyoldgibberish, the same page shows up.

I am concerned about duplicate URLs, amongst other things. So I thought that I should disallow any URL with "?" in it.

I'm aware that wildcards aren't supported by all bots, but Google is the main issue here, so I was planning to use:

User-agent: *
Disallow: *?*

Is that the best approach, or does anyone have any better options?



 1:11 pm on Feb 10, 2009 (gmt 0)

Question marks will never go away completely, since other websites can link to you using URLs with "?" in them. You could play around with .htaccess, or do as you are doing and add a robots.txt wildcard.

In most situations I just ignore this issue. I monitor my activity reports on a monthly basis to make sure this little issue does not become a big one. When I do need to take action, I prefer the .htaccess solution, since it consolidates the link juice of the URL variations into the one good URL I want to keep.


 2:00 pm on Feb 10, 2009 (gmt 0)

If the URL-path is valid, why not 301 redirect to that valid path, but with the query string removed? This will clean up the search results as well as old user bookmarks, and preserve any PR conferred by the link.
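For anyone wanting to try that, a minimal .htaccess sketch (assuming Apache with mod_rewrite available, and assuming, from the example above, that only .html pages receive these stray query strings):

```apache
# Sketch: strip any query string from .html URLs with a 301 redirect.
# Assumes Apache + mod_rewrite; adjust the URL pattern to your site.
RewriteEngine On
# Only fire when a query string is actually present (avoids a redirect loop)
RewriteCond %{QUERY_STRING} .
# The trailing "?" on the target discards the query string
RewriteRule ^(.+\.html)$ /$1? [R=301,L]
```

On Apache 2.4 or later, the [QSD] flag can be used on the RewriteRule instead of the trailing "?" to the same effect.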



 5:28 pm on Feb 10, 2009 (gmt 0)

If you want to use robots.txt for that "clean up", this should be the right way:

Disallow: /*?

Depending on your situation, there may be more value in Jim's response about 301.

In my example, I have URLs that come from paid search, where I append variables so I know which campaign, ad group, keyword, etc. the traffic and sales have come from. Since I already have my original page, and the rest are all the same page with variables attached (duplicate content), I believe the 301 solution would fit better than blocking via robots.txt.

It would be great if someone could comment more on the difference.



 5:51 pm on Feb 10, 2009 (gmt 0)

Disallow: /*?

Be aware that this only works with the major search engines. It is an undocumented, semi-proprietary "extension" to the original Standard for Robot Exclusion, and you cannot count on it to work with any search engine unless they explicitly say they support it on their "Webmaster Help" page.

So if you use this, you must decide on a method --and implement that method-- to handle robots that do not support this Disallow syntax in robots.txt.
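One hedged way to handle that split in robots.txt itself - a sketch only; the `/search` path is an assumption standing in for whatever non-wildcard paths actually generate "?" URLs on your site:

```
# Wildcard rule for engines that document support for it (e.g. Googlebot)
User-agent: Googlebot
Disallow: /*?

# Plain-path fallback for all other robots; no wildcards used
User-agent: *
Disallow: /search
```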

Another suggestion that we often see is to use the Google, Yahoo, and MSN "Webmaster Tools" to remove unwanted URL listings from the SERPs. Again the problem is, what about all the other search engines?

As above, I recommend a 301 redirect to the same URL-path, but with the query string removed, rather than using a non-universal Disallow syntax in robots.txt or any particular search engines' "Webmaster Tools."



 8:12 pm on Feb 10, 2009 (gmt 0)

a 301 redirect to the same URL-path, but with the query string removed

...but only in the case where you don't need the query string and just want to get rid of it - which should be the case for the initial post.

In my case, I use PHP to write data from the query string into a cookie (from that point on, I don't need the stuff after the "?" anymore). If I do a 301 redirect, I lose all of my information, since .htaccess gets read first and the 301 happens before the PHP script runs (even with the PHP script being initiated from within .htaccess by auto_prepend_file).


 10:35 pm on Feb 10, 2009 (gmt 0)

So you could use mod_rewrite to check for the cookie if you're on Apache 2.x or later, or check it with your script (on any server version) and let your script generate the 301 if the query string isn't needed.

If the cookie is there, it's safe to remove the query string, and if not, leave it alone. And of course, you can check the URL-path to see if query-removal should be applied as well.
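A sketch of that cookie check in mod_rewrite - the cookie name `campaign` is hypothetical; substitute whatever your PHP script actually sets:

```apache
# Sketch: strip the query string only after the (hypothetical) "campaign"
# cookie has been set, so the PHP script still sees the data on first visit.
RewriteEngine On
RewriteCond %{HTTP_COOKIE} campaign=
RewriteCond %{QUERY_STRING} .
RewriteRule ^(.+)$ /$1? [R=301,L]
```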

But let's get Mr. Bunny's problem taken care of first...



 10:49 am on Feb 11, 2009 (gmt 0)

Thanks all.

And yes, the .htaccess solution would be the best way to simply remove the query string. But, as Jim knows, I am a complete tool when it comes to mod_rewrite. :(

I am also scared about it clashing with some old code that I used when the CMS and forum were running. It dealt with PHPSESSID issues in the old forum URLs. It still has some function due to some PHP includes in static pages (which added the login box for the forum), but I should really just go through the site and remove the includes in the affected pages.

RewriteCond %{QUERY_STRING} !action=.
RewriteRule (.*) http://www.example.com/$1? [R=301,L]

Is this something that I could adapt? Or should I just start again?

Overall, I thought that the robots.txt approach would be the least 'scary' for me and less likely to cause damage.

So far it is only the occasional URL that seems to be acquiring the query strings, although they seem to be popping up in Webmaster Tools with more frequency.


 9:26 am on Feb 13, 2009 (gmt 0)

Hmmmm. Maybe I should use a Canonical tag [webmasterworld.com]?
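For reference, a canonical tag goes in the <head> of the page and names the preferred URL, so query-string variants consolidate onto it - a minimal sketch using the example URL from this thread:

```html
<!-- In the <head> of /directory/page.html: any "?variant" of this page
     points search engines back to the clean URL -->
<link rel="canonical" href="http://www.example.com/directory/page.html">
```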


 2:27 pm on Feb 13, 2009 (gmt 0)

If you can clearly differentiate between the old PHPSESSID URLs and the URLs with gibberish query strings, then "rule clashing" should not be a problem. Everything in mod_rewrite depends on the URLs (and query strings) and other things that are present in the client's HTTP request and in the server context information. If you can define the problem in terms of those variables, then mod_rewrite can be used to fix the problem.

For example, if all of the "gibberish query" URLs end in ".html", and none of the PHPSESSID URLs do, then this fact can be used to easily keep the two rules from interfering with each other.
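On that basis, the rule quoted earlier could be adapted as a sketch like this - assuming the gibberish-query URLs all end in ".html", the old PHPSESSID/forum URLs do not, and the original action= exclusion should be kept:

```apache
# Sketch: only .html URLs with a query string are redirected, so the
# old PHPSESSID/forum rules (matching non-.html paths) are untouched.
RewriteCond %{QUERY_STRING} .
RewriteCond %{QUERY_STRING} !action=.
RewriteRule ^(.+\.html)$ http://www.example.com/$1? [R=301,L]
```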

It doesn't matter whether a technique is "scary" or not. What matters is using the right technique to fix the problem at its root, so you don't have to deal with other problems sprouting up.



 9:28 am on Feb 16, 2009 (gmt 0)

Thanks, as ever, Jim.

Of course, you are right. Unfortunately, the 'right technique' for me has to be one that I know how to implement. I am pretty confident with robots.txt. I know nothing about mod_rewrite, other than filling in the gaps in bits of code that others helpfully provide me with. And even then, I worry. ;)
