
Query String URLs getting indexed as duplicate pages

Meta tags not an option, robots.txt doesn't seem to work. What's next?

     

MatthewHSE

8:34 pm on Jun 10, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



My site uses a lot of .cgi scripts that rely heavily on query strings. I don't want these URLs indexed as individual pages because, in most cases, they are actually the same page as another. Here's what I mean:

A visitor comes to mysite.com/forums. He reads a couple of posts and finds one he'd like to highlight for future reference. So, he clicks this link:

http://mysite.com/forums.cgi?highlightpost&postid=9983745

Clicking the link highlights the post and returns the visitor to the same page he was viewing.

The problem is, Google sees that link as a separate URL. It "clicks" the link and essentially just reloads the page it was on. The result is that I have dozens or even hundreds of "duplicate" pages in the Google index. These pages have real content that should be indexed, so noindex meta tags aren't an option: both URLs are in fact the same page. One is just the page, the other has a query string, but the content is identical.

I tried the following in my robots.txt file:

Disallow: /forums.cgi?highlightpost

The hope was to use the wildcard nature of robots.txt syntax to disallow any URL containing "?highlightpost".

So far, Google has kept all those query-stringed URLs in its index, even though Googlebot hits our site at least once a day and seems to check robots.txt on every visit. I would have thought it would see that it wasn't supposed to hang on to those pages and drop them, but so far that hasn't been the case.

Actually, I'm not even sure just what I'm asking here, but I'm sure you have a good understanding of the situation. I want to keep those duplicate query-string URLs from being indexed, and meta tags aren't an option. Robots.txt doesn't seem to have done the job. Might this be a job for .htaccess, and if so, how would I go about it?
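Here's the sort of thing I was imagining, completely untested, and assuming Apache with mod_rewrite enabled (the bot name match is just illustrative):

RewriteEngine On
# Only apply to Googlebot (add other bots as needed)
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
# Only when the query string starts with "highlightpost"
RewriteCond %{QUERY_STRING} ^highlightpost
# Return 403 Forbidden for forums.cgi on those requests
RewriteRule ^forums\.cgi$ - [F]

No idea whether serving bots a 403 is actually the right way to get the URLs dropped, though.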

I want to remove all these pages from the Google index, plus prevent them from being indexed in the future.

Thanks,

Matthew

nakulgoyal

12:32 am on Jun 11, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Try Google's URL removal feature.

MatthewHSE

12:47 am on Jun 11, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I looked at that, but honestly there are a LOT of these pages already indexed, and it will be a continuous problem and an ongoing job as my site develops. Is there any way to automatically prevent those pages from being indexed? Perhaps use .htaccess to keep bots off pages with certain words in the URLs, along the lines of the rewrite sketch in my first post?

corz

4:45 am on Jun 11, 2004 (gmt 0)

10+ Year Member



that "wildcard nature" you hoped for is activated something like this
(from Google's own webmaster pages) ..

User-agent: Googlebot
Disallow: /*.doc$

similar to .htaccess syntax.
I happened to spot this earlier on, looking for something else.

;o)
(or
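p.s. adapted to the URLs in this thread, that would presumably be something like this (untested, and Googlebot-only, since the * and $ wildcards are a Google extension that other crawlers may ignore):

User-agent: Googlebot
Disallow: /*highlightpost

once Google treats those URLs as disallowed, the removal feature mentioned above should be able to take them out of the index.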