I have a number of pages listed in Google which are all exactly the same page (my index page, about 200 times).
They are all index.php(followed by some attribute).
I am worried about duplicate content, I have a few choices, and I would like your advice.
I cannot add a "noindex" tag to these pages because of the way the site is built.
1): I cannot remove these "pages" to produce a 404, so I have set up a .htaccess to 301 redirect these URLs to my main domain.
2): I also added something to my robots.txt to stop google from getting on them.
Which would be best, stop google by the robots.txt, or allow google on and find the 301 redirects?
I could use the automatic removal from google, but I read here that that was just "hiding" them, and not solving the problem.
Any help would be much appreciated.
You CAN modify the PHP script. You add two lines of code right at the beginning of the page. In fact if you use Apache, you can add the code site-wide using the Auto-Prepend file. The code simply tests what the requested URL was and then for stuff that you do not want indexed it either does a 301 redirect to the canonical form (exactly the same way that an index.html to / redirect works, but for blah.php?whatever to blah.php) OR the script adds the <meta name="robots" content="noindex"> tag to the page it serves. Your choice which one to use.
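A rough sketch of the redirect variant, assuming PHP 7.1 or later and assuming that every index.php?something URL is an unwanted duplicate of the home page. The function name canonical_target and the www.example.com address are placeholders made up for illustration; the file would be loaded through PHP's auto_prepend_file setting or included at the top of index.php.

```php
<?php
// Hypothetical prepend file. CANONICAL_HOME is a placeholder address.
const CANONICAL_HOME = 'http://www.example.com/';

/**
 * Return the URL to redirect to, or null when the request is already canonical.
 * Assumption: only /index.php with a non-empty query string is a duplicate.
 */
function canonical_target(string $requestUri): ?string
{
    $path  = parse_url($requestUri, PHP_URL_PATH);
    $query = parse_url($requestUri, PHP_URL_QUERY);

    if ($path === '/index.php' && is_string($query) && $query !== '') {
        return CANONICAL_HOME;
    }
    return null;
}

// Only act on real web requests, not when the file is run from the command line.
if (PHP_SAPI !== 'cli') {
    $target = canonical_target($_SERVER['REQUEST_URI'] ?? '/');
    if ($target !== null) {
        // Send a permanent redirect and stop before any page output.
        header('Location: ' . $target, true, 301);
        exit;
    }
}
```

If some parameters lead to real, distinct pages on your site, the test inside canonical_target would need to let those through.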
In fact, the script could even be made to return a 404 status code for unwanted accesses, or this could be done through .htaccess to test the URL format and serve 404 to the unwanted requests.
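For the .htaccess route, something along these lines might do it. This is only a sketch: it assumes mod_rewrite is enabled, that the .htaccess file sits in the site root, and that www.example.com stands in for the real domain.

```apache
RewriteEngine On

# Match only direct client requests for /index.php with a query string,
# so internal rewrites to index.php are left alone.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/index\.php\?
# The trailing "?" on the target drops the original query string.
RewriteRule ^index\.php$ http://www.example.com/? [R=301,L]

# To serve a 404 instead, replace the RewriteRule above with:
# RewriteRule ^index\.php$ - [R=404,L]
```

Test it on a few of the unwanted URLs with a header checker before relying on it.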
All of the stuff that you say is not possible is actually possible using just a couple of lines of PHP scripting OR a couple of lines of code in the .htaccess file. Have you got a programmer telling you that it isn't possible? If so, what he means is "it's too much work for me", or "I don't actually know how to do it". That's a bit different to "it can't be done". :-)
Alternatively you can use robots.txt to disallow the unwanted URL formats. I did that a few months ago to get a 50 000 thread forum that was exposing 750 000 URLs to Google relisted as 50 000 thread URLs and a few thousand thread index pages. The other 680 000 URLs were delisted within a few months, except for about 20 000 that show as Supplemental but will drop out at the next Supplemental Index update. There are several previous threads here about that site.
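As an illustration only, a robots.txt rule for the index.php?something case could look like the following. It assumes all the unwanted URLs begin with /index.php? and that nothing you want crawled does.

```
User-agent: *
Disallow: /index.php?
```

This leaves / and /index.php itself crawlable. Bear in mind that a crawler obeying this rule never requests the blocked URLs, so it will not see any 301 redirect or noindex tag served from them.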
For those who do not understand, index.php can be indexed as index.php?p=45, index.php?tag=tag1, index.php?etc.
Although these are different urls, they all point to the same duplicate page.
g1smd, thanks for your reply, although I did say "I" can't. I'm sure many amazing things can be done, but not with my knowledge, using "is archive, is home" etc. test tags, as the system thinks all these pages are the main page... and of course I want the main page indexed.
I have set up a 301 redirect via .htaccess to solve these problems, but no one has fully answered my question.
Should I leave this 301 in place and let Google find them, or should I block Google from accessing these pages via robots.txt?
As for the other option of the meta tag in these urls, I do not know how to add it to just these urls.
The 301 redirect will get the duplicates dropped. So you can do it that way. It will take months for the Supplemental Results to disappear.
You can add the noindex tag. The script is modified so that the very first thing that happens is that the script asks "what is the full URL the request was for?". If that URL contains the parameters, then the script simply writes <meta name="robots" content="noindex"> into the stream of HTML code that is sent to the browser, and if it doesn't contain them, then it does not write the tag.
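A minimal sketch of that test, assuming any request with a query string is one of the unwanted duplicates. The function name robots_meta_tag is made up for this example.

```php
<?php
/**
 * Return a noindex meta tag for URLs that carry parameters,
 * or an empty string for the plain URL.
 * Assumption: every parameterised URL is a duplicate.
 */
function robots_meta_tag(string $requestUri): string
{
    $query = parse_url($requestUri, PHP_URL_QUERY);
    if (is_string($query) && $query !== '') {
        return '<meta name="robots" content="noindex">' . "\n";
    }
    return '';
}

// In the template, somewhere inside <head>:
// echo robots_meta_tag($_SERVER['REQUEST_URI'] ?? '/');
```

The echo line goes in whatever file writes out the head section of the page, so the tag only appears on the parameterised URLs.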
See also: [webmasterworld.com...]
The 301 redirect will herd a little bit more PageRank over to the "correct" URLs, compared to the other method, but there is very little in it.
Don't worry that the Supplemental Results for the redirected URLs stay around in the index for many months. They always do.
Your measure of success is in seeing that the URLs that you do want to be indexed, do get indexed, and that they do show up with a full title and snippet in the search results.
Once the redirect is in place you can safely ignore any Supplemental Results that show up in search results and which return a 301 or a 404 response. Google will eventually drop them from view. Sometimes it can take a whole year for that to happen. They will not be classed as duplicate content. They cannot harm anything.
In a few months' time, when things have had time to settle down, you will need to look carefully at any pages where the URL returns "200 OK" but is still marked as Supplemental. Those might indicate that some problem still remains.
In particular, make sure that every page has a unique title tag and a unique meta description too. Make sure that it fits the content of the page that it is on.
You have put a lot of my worries to rest, which is great. Cheers for that.
All of my content seems to get listed correctly right now with unique titles and snippets. I recently set up a script that takes the first 25 words of my article automatically, and/or gives me the choice to write something unique and specific for a meta description. I'm hoping this will work well.
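For anyone wanting to do something similar, here is one way such a script could work. It is a guess at the approach, not the actual script, and the function name meta_description is invented for the example.

```php
<?php
/**
 * Build the text for a meta description.
 * Uses the hand-written description when one is supplied; otherwise
 * falls back to the first $maxWords words of the article with tags removed.
 */
function meta_description(string $articleHtml, string $custom = '', int $maxWords = 25): string
{
    $custom = trim($custom);
    if ($custom !== '') {
        $text = $custom;
    } else {
        // Strip markup, then split on any run of whitespace.
        $words = preg_split('/\s+/', strip_tags($articleHtml), -1, PREG_SPLIT_NO_EMPTY);
        $text  = implode(' ', array_slice($words, 0, $maxWords));
    }
    // Escape so the text is safe inside content="...".
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Usage inside <head>:
// echo '<meta name="description" content="' . meta_description($article, $myText) . '">';
```

In the usage line, $article and $myText are stand-ins for wherever your system keeps the article body and the optional hand-written description.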
Thanks again g1smd.
I did a site: command search on my domain, and the pages that I wanted out of Google have gone, every one of them. These 100+ pages are no longer returned whatsoever.
I expected them to go supplemental, but now there is no sign of them at all, which is great! (I'm now hoping they will not return.)
I did not use the google removal tool either.
I'm not sure how or why they are completely gone. I first blocked google accessing these urls in my robots.txt file, (google sitemaps tool reported that it could not access them). Then later I set up a 301 redirect to my main domain.
I would like once again to thank everyone here for their help, especially g1smd. Cheers
It is likely that in one or two months' time about a quarter to a third of the removed URLs will reappear as Supplemental Results and then hang around for many months.
Don't worry if they do; they are not classed as being Duplicate Content at that point. There is little you can do to control that action. It just seems to work that way now.
The site:www.domain.com -inurl:www search may prove useful. That shows certain types of Supplemental Results, even those with a www in the URL.
If you are finding duplicate pages in your WordPress installation, read the following checklist on eliminating these: