Query String URLs getting indexed as duplicate pages

My site uses a lot of .cgi scripts that rely heavily on query strings. I don't want these URLs indexed as individual pages because, in most cases, they are in fact the same page as another. Here's what I mean:

A visitor comes to mysite.com/forums. He reads a couple of posts and finds one he'd like to highlight for future reference. So, he clicks this link:

http://mysite.com/forums.cgi?highlightpost&postid=9983745

Clicking the link highlights the post and returns the visitor to the same page he was viewing.

The problem is, Google sees that link as a separate URL. It "clicks" the link, and essentially just reloads the page it was just on. The result is that I have dozens or even hundreds of "duplicate" pages in the Google index. Obviously, these pages have real content that should be indexed. Because both URLs are served by the same page, a noindex meta tag would also drop the real page from the index, so meta tags aren't an option. One URL is just the page, and the other has a query string, but the content is identical.

I tried the following in my robots.txt file:

Disallow: /forums.cgi?highlightpost

The hope was that the prefix matching in robots.txt syntax would disallow any URL beginning with "/forums.cgi?highlightpost".
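As a rough check of that expectation, here is a minimal Python sketch of plain prefix matching against a Disallow rule. It is a simplification, not how any particular crawler is implemented: it ignores User-agent groups, Allow lines, and the `*` and `$` extensions, and the names `DISALLOW_PREFIXES` and `is_disallowed` are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical rule list mirroring the single Disallow line above.
DISALLOW_PREFIXES = ["/forums.cgi?highlightpost"]


def is_disallowed(url, prefixes=DISALLOW_PREFIXES):
    """Return True if the URL's path plus query starts with a Disallow prefix."""
    parts = urlsplit(url)
    # Rules are compared against the path and query, not the scheme or host.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return any(target.startswith(prefix) for prefix in prefixes)


print(is_disallowed("http://mysite.com/forums.cgi?highlightpost&postid=9983745"))  # True
print(is_disallowed("http://mysite.com/forums.cgi"))  # False
```

Under these assumptions the rule does cover the highlight URL while leaving the plain forum page crawlable, so the pattern itself looks right. Note that a robots.txt rule controls crawling, which is not the same thing as removing URLs that are already in an index.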

So far, Google has kept all those query-string URLs in its index, even though Googlebot hits our site at least once per day and seems to check robots.txt on every visit. I would have expected it to see that it wasn't supposed to hang on to those pages and drop them, but so far that hasn't been the case.

I'm not sure exactly how to phrase the question, but I hope the situation is clear. I want to keep those duplicate query-string URLs from being indexed, and meta tags aren't an option. Robots.txt doesn't seem to have done the job. Might this be a job for .htaccess, and if so, how would I go about doing it?

I want to remove all these pages from the Google index, plus prevent them from being indexed in the future.

Thanks,

Matthew
