Forum Moderators: Robert Charlton & goodroi
I guess the reason I ask is that there are many websites, competitors of mine, that are much worse in terms of duplicate/similar content - so how was I identified and not other larger websites?
-- there are many ways for multiple urls to give the same exact content, and none of them are a happy situation for a site. Plug those technical holes first, if you have them.
Some are associated with duplicate content as Tedster explains; and there are many types of duplicate content as you can see.
There are other types of Supplemental Result though. One is where a Supplemental Result represents a page that is now 404, or even where the whole domain has expired. Google shows the result long after the real content is no longer available.
Another type of Supplemental Result is where the page is simply the previous version of the page. The current version is shown as a normal result, but if you search for keywords that were on the page some 8 to 30 months ago (and which are no longer on the current version of the page) then you see the same page as a Supplemental Result. The snippet will usually also show that same old content, but the cache will always be the one from recent days or weeks (except for a brief time last week when the old cache would show against the old results in several datacentres).
There are other types, too, and I am still working on what triggers their appearance, and what factors are required for that data to remain indexed.
In the BigDaddy datacentres there are Supplemental Results with cache dates going back to 2004 January. In the "experimental" datacentre some types of Supplemental Result have greatly increased in number and the "exact match quoted search" no longer works. In the "cleanup" datacentre most Supplemental Results with cache dates prior to 2005 June have been chucked away (at long last), but a large number of newer (dated 2005 July to 2006 March) Supplemental Results (mostly for other pages) have now appeared instead.
If the supplemental is the "old version" of the page, then Google will hang on to it for years. Don't worry about those.
If it is a result of "duplicate" content (e.g. www vs. non-www) then get the redirects in place, and hope it all works out.
If the supplemental is for a page that is 404, then make sure the URL really does return a 404 status, and wait for Google to update their index.
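The quickest way to check that last point is to request the dead URL yourself and look at the raw status code. A minimal sketch in Python (the function names are my own, and `status_of` does a live fetch): the key point is that only a genuine 404, or a 410 Gone, tells Google the page can be dropped; a "page not found" page served with status 200 does not.

```python
import urllib.request
import urllib.error

def status_of(url):
    """Return the HTTP status code a crawler would see for url."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status          # 2xx and followed 3xx land here
    except urllib.error.HTTPError as err:
        return err.code                 # 4xx/5xx raise HTTPError

def tells_google_page_is_gone(status):
    """Only 404 (or 410 Gone) lets Google drop the result from the index."""
    return status in (404, 410)
```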
site:domain.com
site:domain.com -inurl:www
site:www.domain.com
Make a search for some content on your page that exists right now, and then do it for some words that were on the old page but are NOT on the current version of the page. See if any of those searches show the selected page as a normal result or as a supplemental result. Look at the cache date in both cases too.
I was quite lucky to spot this, as my example was a telephone number that had changed. It was very easy to see where Google showed pages that supposedly had that old number on them (there it was in the snippet) but when you clicked on those results you got through to a new version of the page with the old number nowhere to be seen, or else you got a 404 error, or found that the whole domain had expired.
You might have to think a lot harder to find a useful search phrase that will apply to your site.
None of these apply to my site, yet on some datacenters I have a ton of supplemental pages. I have some supps that are just old pages from 2005 but others are current and have been classified as supps for some reason. The only thing I can think of is that they are just too similar and are being penalized as duplicate content. I have begun changing the pages and making each one a bit more unique...we'll see if it helps.
How do you plug the https and http hole?
Google recommends separate robots.txt files. See [webmasterworld.com...] for more.
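The idea behind the separate files is to serve a blocking robots.txt over https and the normal one over http, so the secure duplicates of every page stay out of the index. A minimal WSGI sketch of that idea (the application and constant names are my own assumptions, not anything Google publishes):

```python
# Serve a different robots.txt depending on the request scheme.
ALLOW_ALL = "User-agent: *\nDisallow:\n"     # normal robots.txt for http
BLOCK_ALL = "User-agent: *\nDisallow: /\n"   # block everything on https

def robots_app(environ, start_response):
    """WSGI app for /robots.txt: block crawling of the https duplicates."""
    body = BLOCK_ALL if environ.get("wsgi.url_scheme") == "https" else ALLOW_ALL
    data = body.encode("ascii")
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(data)))])
    return [data]
```

On Apache the same effect is usually achieved with a rewrite rule that maps /robots.txt to a second file when the port is 443.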
That thread is completely inconclusive, by the way. Nobody lists an actual tested and workable solution to the https and http problem. It is yet another NEW Google problem to add to the growing list of stuff Google can't do.
I dealt with this in another thread at [webmasterworld.com...]
I set out the best course of action in the last post. It worked well for us and Google were actually very helpful.
All the best.
I mean, really, how many people have had to spend countless hours of their time trying to figure out why Google crushed them, instead of improving their own sites? I calculate my time spent on correcting MSN mistakes at near zero.
Make sure that every page of the site has a unique title and meta description.
Make sure that every page of the site links back to "/" and to the main section indexes.
Make sure that all domain.com accesses are redirected to the same page in the www.domain.com version of the site.
If you have multiple domains, then use the 301 redirect on those such that only one domain is indexed.
If you have pages that say to bots "Error. You Are Not Logged In", for example "newthread", "newreply", "editProfile" and "sendPM" links in a forum, then make sure the link has rel="nofollow" on it, and the target page has <meta name="robots" content="noindex"> on it too.
If you have a CMS, forum, or cart that has pages that could have multiple URLs, then get the script modified to put a <meta name="robots" content="noindex"> tag on all but one "version" of the page.
Use the site: search to see what you have indexed, and work to correct these issues. The presence of Supplemental Results, URL-only entries, or hitting the "repeat this search with omitted results included" message very quickly are all indications that you have stuff that needs fixing.
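The host-canonicalisation rule in the checklist above can be sketched in a few lines. This is a hedged illustration, not a drop-in implementation (`CANONICAL_HOST` and the function name are my own; in practice you would do this in .htaccess or your server config): every request for a non-canonical hostname should answer 301 with the same path on the one host you want indexed.

```python
CANONICAL_HOST = "www.example.com"   # hypothetical: the one host to be indexed

def canonical_redirect(host, path):
    """Return (status, location) if a 301 is needed, else None."""
    if host != CANONICAL_HOST:
        # Same path, canonical host, permanent redirect.
        return 301, f"http://{CANONICAL_HOST}{path}"
    return None                      # already on the canonical host
```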
It is a sad fact that systems like vBulletin, PHPbb, osCommerce, and a whole range of popular scripted sites, have a large number of SEO-related design errors built in to them. The designers are clever programmers, but have no clue about SEO or how their site will interact with search engines; and the situation isn't getting any better.
Run Xenu LinkSleuth over your site, and run a few pages through [validator.w3.org] too - just in case.
If you have done all of that, then you'll just have to wait for Google to fix whatever they have broken at their end.
a "custom error page" that actually returns a 200 header for the bad url
This is a big problem. I'm running into various site structures where content is short term (e.g. real estate listings, job postings, etc.). Once the listing is removed, the page is still there with all the template stuff and a message stating that the listing is no longer available. The server is returning a 200 response instead of a 404/410. I've come across sites with thousands upon thousands of these pages and it's a mess.
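One cheap way to audit a site for this "soft 404" behaviour is to request a URL that cannot possibly exist: if the server answers 200, it never returns real 404s. A sketch (function names are my own; `serves_soft_404s` does a live fetch):

```python
import urllib.request
import urllib.error
import uuid

def probe_url(base_url):
    """Build a guaranteed-nonexistent URL under base_url."""
    return f"{base_url.rstrip('/')}/{uuid.uuid4().hex}.html"

def serves_soft_404s(base_url):
    """True if a missing page comes back with status 200 instead of 404."""
    try:
        with urllib.request.urlopen(probe_url(base_url)) as resp:
            return resp.status == 200   # 200 for a missing page = soft 404
    except urllib.error.HTTPError:
        return False                    # a real 404/410 was sent; all good
```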
Are they MAD or something?
I've actually had discussions with a few who do this. They state that they are maintaining the PageRank with the template page and will populate it with other content when it becomes available. Yup, they're MAD. Especially when there are thousands upon thousands of pages like that. ;)