Quickest way to get pages (duplicate) out of Google

Forum Moderators: Robert Charlton & goodroi

Ma2T

12:33 am on Sep 28, 2006 (gmt 0)

Hey there,

I have a number of pages listed in Google that are all the exact same page (my index page, about 200 times).

They are all index.php followed by some parameter.

I am worried about duplicate content. I have a few choices, and I would like your advice.

I cannot add a "noindex" tag to these pages because of the way the site is built.

Options:
1): I cannot remove these "pages" to produce a 404, so I have set up an .htaccess rule to 301 redirect these URLs to my main domain.

2): I also added something to my robots.txt to stop Google from crawling them.

Question:
Which would be best: block Google via robots.txt, or let Google crawl the URLs and find the 301 redirects?

I could use Google's automatic removal tool, but I read here that it just "hides" them and does not solve the problem.

Any help would be much appreciated.
Thanks guys.

shogun_ro

5:41 am on Sep 28, 2006 (gmt 0)

Make the .htaccess 301 redirect and allow Google to crawl those links. Don't block them in robots.txt.

hashcrunch

5:43 am on Sep 28, 2006 (gmt 0)

I didn't get it. How is a single page indexed 200 times, and if that's the case, how would you set up a 301 redirect? It would be 200 pages redirecting to itself. Could you make it a bit clear?

SuddenlySara

5:57 am on Sep 28, 2006 (gmt 0)

I think the Google bots finding spam would be more than welcome here.

200 different indexes?

SuddenlySara

6:09 am on Sep 28, 2006 (gmt 0)

could you make it a bit clear-er?

shogun_ro

7:53 am on Sep 28, 2006 (gmt 0)

"I didn't get it, how a single page is indexed 200 times and if its the case then how would you put 301 re-direct because it would be 200 pages re-directing to itself. Could you make it a bit clear? "

There will be 200 URLs redirecting to the main domain.
This is possible with .htaccess.

g1smd

9:33 am on Sep 28, 2006 (gmt 0)

This question, or a variant of it, has come up about 20 times so far this month.

You CAN modify the PHP script. You add two lines of code right at the beginning of the page. In fact, if you use Apache, you can add the code site-wide using PHP's auto_prepend_file setting. The code simply tests what the requested URL was, and then for stuff that you do not want indexed it either does a 301 redirect to the canonical form (exactly the same way that an index.html to / redirect works, but for blah.php?whatever to blah.php) OR the script adds the <meta name="robots" content="noindex"> tag to the page it serves. Your choice which one to use.
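A minimal sketch of the "test the requested URL, then redirect to the canonical form" step, written in Python rather than PHP purely to show the logic. The function name canonical_redirect and the rule "any query string marks an unwanted duplicate" are assumptions for illustration; a real site would decide exactly which parameters count as unwanted.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_redirect(requested_url):
    """Return (301, canonical_url) for an unwanted URL form, else None.

    Assumption: any query string on the requested page marks a duplicate.
    """
    parts = urlsplit(requested_url)
    if not parts.query:
        # Already the canonical form: serve the page normally.
        return None
    # Drop the query string (and any fragment) to get the canonical URL.
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return (301, canonical)
```

Under these assumptions, blah.php?whatever maps to a 301 pointing at blah.php, while a request for blah.php itself is served as usual.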

In fact, the script could even be made to return a 404 status code for unwanted accesses, or this could be done through .htaccess to test the URL format and serve 404 to the unwanted requests.

All of the stuff that you say is not possible is actually possible using just a couple of lines of PHP scripting OR a couple of lines of code in the .htaccess file. Have you got a programmer telling you that it isn't possible? If so, what he means is "it's too much work for me", or "I don't actually know how to do it". That's a bit different to "it can't be done". :-)

Alternatively you can use robots.txt to disallow the unwanted URL formats. I did that a few months ago to get a 50 000 thread forum that was exposing 750 000 URLs to Google relisted as 50 000 thread URLs and a few thousand thread index pages. The other 680 000 URLs were delisted within a few months, except for about 20 000 that show as Supplemental but will drop out at the next Supplemental Index update. There are several previous threads here about that site.
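For the robots.txt route, the basic convention is that each Disallow value is matched as a prefix of the URL path, query string included. A rough Python sketch of that prefix test follows; the rule /index.php? is a hypothetical example, not one quoted in this thread, and individual crawlers layer their own extensions (such as wildcards) on top of plain prefix matching.

```python
def is_disallowed(path_and_query, disallow_prefixes):
    """True if the URL path (including any query string) starts with a rule."""
    # An empty Disallow value blocks nothing, so empty prefixes are skipped.
    return any(p and path_and_query.startswith(p) for p in disallow_prefixes)


# Hypothetical rule: block index.php only when parameters follow it.
rules = ["/index.php?"]
```

With that hypothetical rule, /index.php?p=45 would be blocked while /index.php and / stay crawlable.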

Ma2T

6:01 pm on Sep 28, 2006 (gmt 0)

Thanks for the replies people.

For those who do not understand, index.php can be indexed as index.php?p=45, index.php?tag=tag1, index.php?etc.

Although these are different urls, they all point to the same duplicate page.

g1smd, thanks for your reply, although I did say "I" can't. I'm sure many amazing things can be done, but not with my knowledge, using "is archive, is home" etc. test tags, as the system thinks all these pages are the main page... and of course I want the main page indexed.

I have set up a 301 redirect via .htaccess to solve these problems, but no one has fully answered my question.

Should I leave this 301 in place and let Google find them, or should I block Google from accessing these pages via robots.txt?

As for the other option of the meta tag in these URLs, I do not know how to add it to just these URLs.

g1smd

6:08 pm on Sep 28, 2006 (gmt 0)

I thought my post above answered everything.

The 301 redirect will get the duplicates dropped. So you can do it that way. It will take months for the Supplemental Results to disappear.

You can add the noindex tag. The script is modified so that the very first thing that happens is that the script asks "what is the full URL the request was for?". If that URL contains the parameters, then the script simply writes <meta name="robots" content="noindex"> into the stream of HTML code that is sent to the browser, and if it doesn't contain them, then it does not write the tag.
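The noindex variant can be sketched the same way. This is Python rather than PHP, only to show the shape of the logic, and the names robots_meta_for and render_head are made up for this illustration. It assumes that any URL carrying a query string should get the tag.

```python
import html
from urllib.parse import urlsplit

NOINDEX_TAG = '<meta name="robots" content="noindex">'


def robots_meta_for(requested_url):
    """Return the noindex tag if the URL carries parameters, else ''."""
    has_params = bool(urlsplit(requested_url).query)
    return NOINDEX_TAG if has_params else ""


def render_head(requested_url, title):
    """Build a <head> block, adding noindex only for parameterised URLs."""
    lines = ["<head>", f"<title>{html.escape(title)}</title>"]
    tag = robots_meta_for(requested_url)
    if tag:
        lines.append(tag)
    lines.append("</head>")
    return "\n".join(lines)
```

The check runs before any HTML is written, which matches the "very first thing that happens" ordering described above.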

Ma2T

10:45 pm on Sep 28, 2006 (gmt 0)

Thanks for taking your time to re-explain things g1smd, very much appreciated.

The script editing sounds good, but I don't think I am up to that challenge.

As long as I hear you say the 301 is okay and will get the job done, I'm happy.

Cheers g1smd, much appreciated.

g1smd

10:53 pm on Sep 28, 2006 (gmt 0)

Yeah, both will get the job done.

The 301 redirect will herd a little bit more PageRank over to the "correct" URLs, compared to the other method, but there is very little in it.

Don't worry that the Supplemental Results for the redirected URLs stay around in the index for many months. They always do.

Your measure of success is in seeing that the URLs that you do want to be indexed, do get indexed, and that they do show up with a full title and snippet in the search results.

Once the redirect is in place you can safely ignore any Supplemental Results that show up in search results and which return a 301 or a 404 response. Google will eventually drop them from view. Sometimes it can take a whole year for that to happen. They will not be classed as duplicate content. They cannot harm anything.

In a few months' time, when things have had time to settle down, you will need to carefully look at any pages where the URL returns "200 OK" but is still marked as Supplemental. Those might indicate that some problem still remains.
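As a rough sketch of that triage rule (in Python; the function name triage and the labels it returns are made up for illustration), the decision depends only on the HTTP status the URL returns and on whether the listing is marked Supplemental:

```python
def triage(status_code, is_supplemental):
    """Classify a listed URL following the advice above."""
    if status_code in (301, 404):
        # Redirected or gone: expected to drop out of view eventually.
        return "ignore"
    if status_code == 200 and is_supplemental:
        # A live page still marked Supplemental may point to a remaining problem.
        return "investigate"
    return "ok"
```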

In particular, make sure that every page has a unique title tag and a unique meta description too. Make sure that it fits the content of the page that it is on.

Ma2T

11:03 pm on Sep 28, 2006 (gmt 0)

Very nice explanation g1smd, great in fact.

You have put a lot of my worries to rest, which is great. Cheers for that.

All of my content seems to get listed correctly right now with unique titles and snippets. I recently set up a script that takes the first 25 words of my article automatically, and/or gives me the choice to write something unique and specific for a meta description. I'm hoping this will work well.
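A script along those lines could look something like the sketch below. It is in Python and is not the poster's actual script; the function name meta_description and its parameters are assumptions for illustration.

```python
def meta_description(article_text, custom=None, max_words=25):
    """Use the hand-written description if given, else the first max_words words."""
    if custom and custom.strip():
        return custom.strip()
    # split() with no arguments collapses runs of whitespace and newlines.
    words = article_text.split()
    return " ".join(words[:max_words])
```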

Thanks again g1smd.

Ma2T

10:53 pm on Oct 3, 2006 (gmt 0)

I would just like to give you guys an update on my situation, as I find it quite interesting.

I did a site: command search on my domain, and the pages that I wanted out of Google have gone, every one of them. These 100+ pages are no longer returned whatsoever.

I expected them to go Supplemental, but now there is no sign of them at all, which is great! (I'm now hoping they will not return.)

I did not use the google removal tool either.

I'm not sure how or why they are completely gone. I first blocked Google from accessing these URLs in my robots.txt file (the Google Sitemaps tool reported that it could not access them). Then later I set up a 301 redirect to my main domain.

Interesting? maybe.

I would like once again to thank everyone here for their help, especially g1smd. Cheers

g1smd

11:14 pm on Oct 3, 2006 (gmt 0)

Glad that things are happening!

It is likely that in one or two months' time, about a quarter to a third of the removed URLs will reappear as Supplemental Results and then hang around for many months.

Don't worry if they do; they are not classed as being Duplicate Content at that point. There is little you can do to control that action. It just seems to work that way now.

The site:www.domain.com -inurl:www search may prove useful. That shows certain types of Supplemental Results, even those with a www in the URL.

CainIV

4:53 am on Oct 4, 2006 (gmt 0)

is_archive(), is_post() and is_date() are functions of WordPress.

If you are finding duplicate pages in your WordPress installation, read the following checklist on eliminating these:

[webmasterworld.com...]

Post 344.

g1smd

6:35 pm on Oct 4, 2006 (gmt 0)

Post 344 is that particular person's post count on that day.

It's the other 7-digit number that identifies the unique post.

CainIV

6:39 pm on Oct 4, 2006 (gmt 0)

DOH!

Ma2T

12:16 am on Oct 5, 2006 (gmt 0)

Many thanks g1smd, it's great to see things happening after much worry and work. I won't worry about the Supplementals either.

Thanks for the great info :)

CainIV, thanks for the info, it's a nice list, and I have done every one of them over the last few weeks ;). Thanks for the link.