General Search Engine Marketing Issues Forum

Query String URLs getting indexed as duplicate pages
Meta tags not an option, robots.txt doesn't seem to work. What's next?
MatthewHSE

WebmasterWorld Senior Member 10+ Year Member
Msg#: 4748 posted 8:34 pm on Jun 10, 2004 (gmt 0)

My site uses a number of .cgi scripts that rely heavily on query strings. I don't want these URLs indexed as individual pages because, in most cases, they're the same page as another URL. Here's what I mean:

A visitor comes to mysite.com/forums. He reads a couple of posts and finds one he'd like to highlight for future reference. So, he clicks this link:

http://mysite.com/forums.cgi?highlightpost&postid=9983745

Clicking the link highlights the post and returns the visitor to the same page he was viewing.

The problem is that Google sees that link as a separate URL. It "clicks" the link and essentially just reloads the page it was on. The result is that I have dozens or even hundreds of "duplicate" pages in the Google index. These pages have real content that should be indexed, so noindex meta tags aren't an option: the same script serves both URLs, and a noindex tag on one would apply to the other as well. One is just the page, the other has a query string, but the content is identical.

I tried the following in my robots.txt file:

Disallow: /forums.cgi?highlightpost

The hope was to use the wildcard nature of robots.txt syntax to disallow any URL containing "?highlightpost."
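A Disallow line only takes effect inside a User-agent block, so the complete entry would presumably have looked something like this (the catch-all User-agent line here is an assumption):

User-agent: *
Disallow: /forums.cgi?highlightpost

For what it's worth, the original robots.txt standard defines Disallow as a plain prefix match with no wildcard support, so whether a "?highlightpost" pattern works at all depends on the crawler.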

So far, Google has kept all those query-string URLs in its index, even though Googlebot hits our site at least once a day and always seems to fetch robots.txt when it comes by. I would have thought it would see that it wasn't supposed to hang on to those pages and drop them, but so far that hasn't been the case.

Actually, I'm not even sure just what I'm asking here, but I'm sure you have a good understanding of the situation. I want to keep those duplicate query-string URLs from being indexed, and meta tags aren't an option. Robots.txt doesn't seem to have done the job. Might this be a job for .htaccess, and if so, how would I go about it?

I want to remove all these pages from the Google index, plus prevent them from being indexed in the future.

Thanks,

Matthew

nakulgoyal

WebmasterWorld Senior Member 10+ Year Member
Msg#: 4748 posted 12:32 am on Jun 11, 2004 (gmt 0)

Try Google's URL removal feature.

MatthewHSE

WebmasterWorld Senior Member 10+ Year Member
Msg#: 4748 posted 12:47 am on Jun 11, 2004 (gmt 0)

I looked at that, but honestly there are a LOT of these pages already indexed, and it will be an ongoing job as my site develops. Is there any way to automatically prevent those pages from being indexed? Perhaps use .htaccess to keep bots off pages with certain words in the URLs?
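That part is doable in principle. A minimal sketch, assuming Apache with mod_rewrite enabled (the patterns here are illustrative, not tested against this site):

# Refuse Googlebot requests for highlightpost URLs
RewriteEngine On
# Apply only to requests identifying themselves as Googlebot
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
# ...and only when the query string contains "highlightpost"
RewriteCond %{QUERY_STRING} (^|&)highlightpost [NC]
# Return 403 Forbidden instead of serving the page
RewriteRule ^forums\.cgi$ - [F]

One caveat: a 403 keeps the bot off the page, but blocking crawling isn't the same as removal; Google may still list the bare URL until it gets dropped or is explicitly removed.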

corz

10+ Year Member
Msg#: 4748 posted 4:45 am on Jun 11, 2004 (gmt 0)

That "wildcard nature" you hoped for is activated with something like this (from Google's own webmaster pages):

User-agent: Googlebot
Disallow: /*.doc$

It's similar to .htaccess syntax. I happened to spot this earlier on while looking for something else.
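Adapted to the URLs in this thread, that would presumably be (an untested sketch):

User-agent: Googlebot
Disallow: /*?highlightpost

i.e. disallow any URL containing "?highlightpost", for Googlebot specifically.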

;o)
