Forum Moderators: Robert Charlton & goodroi
Recently, in one of Matt's videos, he also commented that the matter was complex.
When I looked into these forums [ unless I missed something ] I could see nothing that described the elements in a high-level format that could be broken down and translated into a framework for easy management.
Does anyone believe they have mastered the comprehensive management of dupe content on Google into a format that can be shared on these forums?
The DMCA procedure takes too long. Has anybody had any experience of reporting the site to the plagiarist's ISP?
Scraper sites do also come into the equation, but only where they publish exact or almost-exact copies. Normal publishing of the same articles on multiple unrelated sites is not so much of a factor, because the rest of the navigation, content, and code on those pages will be quite different.
[edited by: g1smd at 11:41 pm (utc) on Aug. 28, 2006]
Honestly, it's not really worth it unless both you and the person you are suing have deep pockets. Additionally, the person you are suing could live in another country and might be impossible to locate.
>> The DMCA procedure takes too long. Has anybody had any experience of reporting the site to the plagiarist's ISP? <<
Yes. I wrote a firmly worded notification to the site owner, cc'ing the Admin Contact, and the management of the site's ISP/hosting company of an intent to file a DMCA complaint.
Within 48 hours corrective action had taken place to the point where I felt no need to submit the DMCA complaint.
It is a good idea to document your taking exception to someone's infringing on your branded materials.
The stuff to be worried about, and fixing up, is where the same content appears at multiple URLs.
This has been hashed around, but what percentage of duplicated content posted on external URLs triggers this 'filter' in Google, if you will?
I notice some high-end article websites doing fine in Google, although they are publishing content that is duplicated on other websites, and most of them have the same meta description, title, and keywords.
Does this mean that these websites have been deemed to have the original content, while the rest are filtered for the same content? What decides who gets to wear that hat: internal PageRank, TrustRank, age?
A hard and fast answer, please: if I have content that was published on my website long ago, and that others have since republished (long before this fiasco began), is it wiser to delete it or, better yet, rewrite it?
Todd
>> Recently, in one of Matt's videos, he also commented that the matter was complex. When I looked into these forums [ unless I missed something ] I could see nothing that described the elements in a high-level format that could be broken down and translated into a framework for easy management. Does anyone believe they have mastered the comprehensive management of dupe content on Google into a format that can be shared on these forums? <<
Ah yes, it seems that I have comprehensively managed duplicate content on these forums!
This thread could be accessed using:
www.webmasterworld.com/google/3060898.htm
www.webmasterworld.com/google/3060898-1-30.htm
www.webmasterworld.com/google/3060898.htm&printfriendly=1
www.webmasterworld.com/google/3060898-1-30.htm&printfriendly=1
We are having a problem on our site that may have been caused by this issue of different URLs accessing basically the same content.
On our site, a page about blue widgets would have a lot of text and info about blue widgets, and then a comments box at the bottom. People could post their info there. It was possible to navigate back/forwards within the comments box and also to sort by different criteria, which led to URLs like blue_widgets.php?comments_page=2 or blue_widgets.php?comments_sort=date.
All these pages would have had the same main content about blue widgets, just showing different comments. Many of these are listed in supplemental now.
We put 'noindex' tags on all the pages a couple of months ago, apart from the main 'blue widget' page. Is this enough even though these have already been listed as Supplemental? I may be splitting hairs here; I only bring this up because you mention using noindex to stop these pages getting into the Supplemental index, but then mention using redirects after they are already in there.
The pages that are shown as Supplemental for a site: search seem to be very volatile, however - the cache dates on them are actually going backwards. A few months ago all pages were moved to Supplemental, but were fairly recent versions of the page. A month or so ago, the Supplemental results were then pages from around early May. Now the Supplemental results are only showing yet another block from early March.
Thanks for any advice.
Yes! And I thought that I had spelt it out quite clearly in several threads already. Which bits are you not getting? I'll try to find another way to explain them again.
.
>> We put 'noindex' tags on all the pages a couple of months ago apart from the main 'blue widget' page, is this enough even though these have already been listed in supplemental? <<
That is the correct approach. You have one canonical URL that can be indexed. Google will now drop the other URLs from their index after one year has passed.
.
>> but then mention using redirects after they are already in there <<
Check my example post again: [webmasterworld.com...]
Only the URLs with differing parameters need the meta robots noindex tag. These will all be on the same domain as the canonical URL.
The URLs at non-www, on other domains, etc, need the 301 redirect to the canonical form, so that all content appears at one domain in the search results.
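As a concrete sketch (the page and parameter names here are taken from the earlier blue-widgets post, so adjust to your own site): the parameter variants would each carry this in their head section, while the bare canonical page omits it:

```html
<!-- On blue_widgets.php?comments_page=2, blue_widgets.php?comments_sort=date, etc. -->
<!-- but NOT on the canonical blue_widgets.php itself -->
<meta name="robots" content="noindex,follow">
```

The "follow" part lets the spider still pass through any links on those variant pages, even though the pages themselves stay out of the index.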
.
The cache date shown for a supplemental result sometimes depends on the search query you used.
Google has multiple results for a single URL. The normal index holds the current result. The supplemental result may contain data about what was on the previous version of the page.
During a supplemental update, Google:
- clears away older supplemental results for URLs that no longer exist or have been redirected for a long time (like a year or more),
- creates newer supplemental results (refreshes them) for duplicate content that is still duplicate and still served with "200 OK",
- creates brand new supplemental results for URLs that have been edited or redirected very recently (and will hold on to those for a year or more).
Those updates occurred in at least August 2005, February 2006(?), and August 2006.
Anyhow, we use subdomains (which can be useful with multiple production servers that access shared assets, like images), so it's inconvenient to 301 everything besides www.example.com, although I guess I don't have much choice. I think it makes something of an argument for absolute rather than relative links. In our case we use mostly relative links, so one link to one of those subdomains can result in our entire site (or most of it) being indexed that way.
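The relative-link point is easy to demonstrate: a relative href resolves against whichever hostname the crawler fetched the page from. A small illustration (the subdomain name img2.example.com is made up here):

```python
from urllib.parse import urljoin

# The same relative link "blue.html" resolves against whichever
# host the crawler happened to reach the page through:
via_www = urljoin("http://www.example.com/widgets/", "blue.html")
via_sub = urljoin("http://img2.example.com/widgets/", "blue.html")

print(via_www)  # http://www.example.com/widgets/blue.html
print(via_sub)  # http://img2.example.com/widgets/blue.html
```

So a single inbound link to a subdomain can seed a crawl in which every relative link keeps the site on that subdomain - duplicating the whole site at a second hostname.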
I would just like to ask what happens if the site uses a CMS, or something similar, that does all sorts of redirects to reach the destination page, as they always do.
Most CMS systems I know of (JSP, ASP, etc.) do this, and redirect to a URL that, even worse, includes a session ID!
What I am asking is: can Google crawl this, will it find duplicate content (it will), and why, in this day and age, is this still a problem?
My answer is that -
a) this type of shop is not good for Google
b) but after years of this type of shop it should be
My question is: how is this solved - not for me and the people I know, but for the innocent people who buy shop software to publish their online shop (a shop that will never make it into Google)?
I have just reviewed what I have written and I know the answer - us techies will sort it.
What happens to the rest of them?
Why is it fair that we can fiddle about technically and help them - what about the sites that don't have a clue?
The only guys that win at the moment know this. I know, I have deployed thousands of e-commerce projects that have needed to be re-written.
But aside from the fact I am a great guy - who re-writes the other guys stuff?
Aren't we going backwards if we now have to re-write everything for the new Google?
What happens if he bought a site that doesn't use all of the above?
There are studies that prove that the average Joe on the street doesn't buy a "known" e-commerce platform that is nice and open source - because he is not a techie.
In fact in the UK it just doesn't happen.
So lobbying just doesn't cut it. In fact lobbying is bobbins quite frankly.
In fact unless you are from the US the word "lobby" is crap - "lobby" means nothing over here. We don't "lobby" - and the fact that US companies "lobby" US government pisses us off big time - that is where scandals originate from.
How will "lobbying" help the thousands of businesses that don't use your off-the-shelf software? The majority of e-commerce providers in the UK don't use off-the-shelf crap, so how do we lobby that?
When I managed physical-world shops, I eventually ended up being responsible for the implications of electrical codes, plumbing codes, health regulations, carpentry, architectural choices, and even the quality of the soil under the shop and the seasonal changes in the water table. None of it seemed directly related to selling our widgets, but it comes with the territory of doing business in a shop.
And so I think it goes on the web. If you want people to locate you through Google (and you certainly can set up shop on the web without that approach) then you need to stay in touch with the evolution of their technology and with how your chosen technologies may interact with theirs.
There are at least two distinct issues here --
1. content intentionally published at several different locations
2. content that was intentionally published at only one address, but that the domain's server technology unintentionally allows to be accessed through many different URLs.
It's that second case that seems to blind-side people so very often. It's this second situation we need to look at most of all.
What happens in the best possible scenario is that Google chooses what seems to be the "best" address for that content and filters out the rest, so that search results don't end up being a choice of ten addresses for the same exact thing. But sometimes Google gets flooded with so many addresses that appear to be the same content that something goes "tilt" and big problems appear for the site.
Google is only working to give their end users a good search experience. If you have good content related to the topic searched on, then Google would like to show your page in the results too. But if you create a big "cloud" of alternate addresses, then you are making things very difficult for them. And truth be told, any one site probably needs Google more than Google needs any one particular site.
That's the reality of the situation, as I understand it.
I said earlier that I thought the alignment between software and the SERPs frankly sucks [ it's like pre-Windows days - way too technical and uncoordinated - IMHO ], which demonstrates, if I'm right, that apart from a global handful of webmasters, few have any clue about the implications of the way they do things with online CMSs.
Three developers that I know well still pump out SE-incompatible CMS solutions for top-branded sites and clients - I mean, we need to get some alignment happening [ sorry - I'm getting excited :) ]
How about a big red sticker on the box :
GUARANTEED - will not cause you any duplicate content problems.
GOOGLE COMPATIBLE
Money back guarantee.
Once one does it, everybody else will do it or lose out. G could be instrumental in some QC drive.
And what would be great is if Google kept the manufacturers in the loop about designing their products properly. Just a bit of PR once in a while would help raise the awareness. Just my thought - I'll clock off now!
[edited by: Whitey at 6:20 am (utc) on Aug. 30, 2006]
My site is indexed in Google like this: www.mysite.com and www.mysite.com/index.php - probably some links point back to my home page with the .php on them.
I didn't think it was a huge problem until I discovered that the two have different dates in the cache!
How can I fix this problem?
I had a major headache with this, but a site of mine was on ASP with shared hosting, so I had no access to IIS. In the end I did a 301 redirect from index.asp to the domain, and changed the default page, which is now not linked to at all (as all links to the home page now point to the domain).
There is some longer code that will handle all index pages on the site even those in folders and subfolders. See the thread at [webmasterworld.com...] for some examples.
First, redirect all instances of *.domain.com/*/*/index.php to the www.domain.com/*/*/ version.
Next redirect all non-www URLs to the www version. This catches all the other URLs that need redirecting.
Do the index redirect first. Do NOT do the non-www redirect first.
You need to avoid redirecting domain.com/index.php over to www.domain.com/index.php and then on to www.domain.com/. Doing it the other way avoids this redirection chain.
You want this to happen:
- domain.com/index.php redirects directly to www.domain.com/
- www.domain.com/index.php redirects directly to www.domain.com/
- domain.com/ redirects directly to www.domain.com/
- domain.com/anything.else redirects directly to www.domain.com/anything.else
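Those four rules can be sketched in .htaccess roughly like this (a sketch assuming Apache with mod_rewrite enabled, and www.domain.com standing in for your canonical host - adjust the names to suit):

```apache
RewriteEngine On

# Index redirect FIRST: /index.php on any hostname goes straight to the canonical root
RewriteRule ^index\.php$ http://www.domain.com/ [R=301,L]

# Then the non-www redirect catches every other URL on a non-canonical host
RewriteCond %{HTTP_HOST} !^www\.domain\.com$ [NC]
RewriteRule ^(.*)$ http://www.domain.com/$1 [R=301,L]
```

With the rules in this order, domain.com/index.php hits the first rule and lands directly on www.domain.com/ in a single hop, instead of chaining through www.domain.com/index.php.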
The redirected URL will continue to appear as a supplemental result for about a year after the redirection is implemented. The cache for that URL will be frozen. It will continue to show the content just as it was, a few weeks before the redirect was implemented. You cannot change that action. Ignore it.
# Index redirect first, so any /index.php goes straight to www.mysite.com/
RewriteRule ^index\.php$ http://www.mysite.com/ [R=301,L]
# Then send any non-empty, non-www hostname to the www version, keeping the path
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.mysite\.com$ [NC]
RewriteRule (.*) http://www.mysite.com/$1 [R=301,L]
I also have multiple paths to the same categories (which leads to different URLs for the same page), which is another issue... Oh, and g1smd - your PM box is full.
If I remove every article from my site that duplicates an article on another site, and return a 404 in a header check for that URL, would you expect the same effect on ranking to be seen (assuming the change in late July was caused by duplicate content)?
Google will continue to show the redirected URLs as supplemental results for one year before dropping them. Don't worry about that, they are not harming things as long as the redirect is installed and working.
For example...
www.mysite.com/widget/
www.mysite.com/widget/default.asp
www.mysite.com/widget/default.asp?Partner=ABC
www.mysite.com/widget/default.asp?Partner=DEF
If all these URLs had identical page content and the partner parameter is only used for tracking referring sites, would there be a duplicate content penalty?
Would Google resolve these URLs?