Forum Moderators: Robert Charlton & goodroi
We need to keep this thread focused on the following:
- Changes in your own site's ranking in the SERPs (lost or gained positions, or disappearance of the site).
- Changes you have noticed in the new SERPs (both google.com and your local Google site), especially with regard to the nature of the top 10 or 20 ranking sites.
- Stability of the SERPs, i.e. do you get the same SERPs when you run the same query within the same day or over 2-3 successive days (both google.com and your local Google site)?
- Effective, ethical measures to deal with the above-mentioned changes.
Thanks.
There are a LOT more of the former. In fact I'm beginning to think there may be a staggering number, which helps explain why Google has trouble swatting them down.
Outland - TheBear's right about following the money - it's got to be VERY profitable to steal content if it gets indexed, especially above the originator. In one case a site took our very original state descriptions and now outranks us. Their biz up, ours down.
But part of the big problem is that NO content is 100% original. We use our own writers plus database plus public domain. Some would consider it fine quality and some might call it pulp non-fiction.
Well, the following might be 100% original:
3uldjfe.,jdjdi..sh92, &%#$@!, &%^$#!
I beg to differ with you on this point. I know I have content that is totally original in my case studies as well as writings on several topics - you couldn't find the subject words anywhere on the internet until they got swiped from my site.
As for what the benefit is for copying content - in my experience, someone hires a person to set up a website and that person searches G on the topic to get text to put on the page - this has happened DOZENS of times to my site just in the last few months.
So not only are these folks copying my content, but they are competitors who are new to the field, and now they rank above me.
The scraper sites search G and copy the first 10 results. As I've mentioned before, I had the pleasure of being in the top 10 for many thousands of search phrases, so I'm in many thousands of directory scrapers.
"NO content is 100% original"
SailorJ - we probably agree on this; I was being philosophical and meant the 100% literally. In a James Joyce novel you'll find word combinations from other works, therefore it's not "100% original" even though he'd be considered BY FAR one of the most original English-speaking authors.
This is not a trivial issue because Google needs to make those determinations via the algo. MikeNo's question is very important to these "anti spam" updates. I've been assuming duplication is determined by a percentage of page content and I'd (wildly) guess they are ratcheting down the percentage, triggering more and more dupe content filters and penalties.
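Just to make my guess concrete, here's the kind of crude percentage check I have in mind. This is a toy sketch I made up; it is obviously not Google's actual method, and the function names, sample text and shingle size are all invented:

# Toy sketch of a "percentage of page content" style dupe check.
import re

def shingles(text, n=5):
    # break text into overlapping n-word chunks
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_percentage(page_a, page_b, n=5):
    # percentage of page A's shingles that also appear in page B
    a, b = shingles(page_a, n), shingles(page_b, n)
    if not a:
        return 0.0
    return 100.0 * len(a & b) / len(a)

original = "Our hand written state descriptions cover history, climate and travel tips."
scraped = "Our hand written state descriptions cover history, climate and travel tips, says some site."

print(overlap_percentage(scraped, original))  # 70.0 here; ratchet the cutoff down and more pages get flagged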
And another thing I'm now concerned about is that I re-wrote some paragraphs in order to escape the thieves. I re-worded sentences, changed from 3rd person to 1st, and swapped in alternate thesaurus entries for some words.
Does Google only look at exact phrase matches, or does it go beyond that and use LSI or some other method to look at dup content more broadly?
GG: can you give any input on this - like is it beyond exact phrase match?
Joe
So what do people say constitutes copying, sufficient for a dupe penalty?
Definitely, when someone copies your whole page it's duplicate content. Also, if you have an article, chart, graphic, anything like that duplicated by someone else without your permission, it would be duplicate content. I see it kind of like copyright law. It's OK to quote a snippet from a work if it is used as a review, critique, reference or such. So even when scraper sites copy a snippet, they are within their rights, as they are using it to tell people about our site, whether we like it or not.
Now that's my definition, and it's pretty close to copyright guidelines. But I don't know if that is what Google means. Are they looking more for duplicated content on the same site or across a network of sites? Or is it that plus my definition as well?
I diversify. I build more than one site per niche and tweak each one differently. If I get more than one in the top 3, then that's gravy. That is my advice to you all: diversify. And stop acting like you are on Google welfare. They don't owe you anything. They don't have any "OBLIGATIONS" to do anything. They are a public company whose only obligation is to their shareholders, not to you guys. And this is coming from a die-hard Google hater...
Someone please put this thread out of its misery.. I can see this thread going on and then three months later, someone posts, I'm noticing update "Charlie"....
I do it for promotion and for the public good, as do many others. Apparently, if an AP story is circulated to 100 or more websites, which is often the case, those sites, typically newspaper-type sites, are NOT penalized because they are paying for copyrighted material?
It would be nice to know these things up front. I've been doing this for many months on political topics and for years on sports, so NOW I get penalized?
Personally, I think I should be able to post MY articles wherever they benefit me and the most people. Google needs to find another way to eliminate spammers and scrapers, IMO.
Sailor seems to indicate he thinks that just the snippet from G is sufficient. If that were the case, I think whole sections of our site (weekly, unique, hand-written columns that share a template containing the title, a motto, and some writer bio info common to all of them) would be enough to be considered "duplicates" of each other by this definition.
What about pages on different sites which all happen to use a very common expression like "Today is the first day of the rest of your life"? Are they copying or being copied?
More questions:
- Can a page be a duplicate of another page on the same site just by copying templates as in my example above? Without consistency pages look like cr@p and visitors get confused.
- Can G detect and differentiate if a single page is copying from MORE THAN ONE other site, like a scraper does?
- What does this mean for educational sites which include PhD theses with embedded, permitted quotes from multiple famous works? Are they or the originals penalized? Or newspaper stories which directly quote a current public figure in today's news?
You are right - the scrapers who copy a snippet are not the issue, unless they 302 to you and thereby possibly cause a problem.
There are other scrapers who somehow go after the first paragraph on a page - I don't know if they do it manually or automatically. I have a lot of 'thin' pages in the programming examples area, so I think I am at risk on those pages.
Don't panic!
If you can somehow get out word of your location, we will send the police to stop the person who is holding the gun to your head and forcing you to read this thread.
I think SteveB was talking about me having four different versions of my home page: site.com, www.site.com, www.site.com/index.html, www.site.com/index.shtml. I did not create these. I write to one page, www.site.com/index.html. I link to www.site.com internally.
Steve said this is suicide, but I didn't know I was committing it, really. If I had known I would have done whatever I needed to do to keep the site straight.
Steve also said Google has created four PageRanks and four caches for me. First, I never pay any attention to PageRank. It just never seemed to be that important. I always managed to get good rankings (I was #1 for a number of keywords and key phrases before Bourbon) without it.
On the cache, what is a cache but a snapshot or copy of my page(s)? And who gave Google the right to make that copy? I didn't, and I own the content, not Google. So, essentially, they are taking my content and penalizing me for it. Nice!
Like I said, I'm not a techie, and I didn't know you needed a complete education in webmastering to compete in this mess. And once again, I rank fine in Y and MSN, so why does G have to be so difficult? I thought they were better? In my estimation, they are not better. They are a PITA.
I 2nd that emotion.
I don't buy the big deal of www/nonwww/index.htm/noindex.htm
I believe the most important thing is that your internal linking consistently points to the same form of each URL.
I have about 35 index.htm's for subdirectories. All the non-www versions have no PR/cache/backlinks, only the www versions do. Maybe that's a problem, but it isn't a problem that would wipe out an entire site.
I presently don't have any control over it anyway.
I'm sure 95% of folks with websites don't have a clue what the heck you guys are talking about and a lot of them rank just fine.
I'll go back under the bridge now.
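For what it's worth, if anyone wants to see where their own home page stands, a throwaway script along these lines will print the raw status each common variant returns without following redirects. The URLs here use example.com purely as a placeholder for your own domain:

# Quick check of which home page variants a server answers for.
import urllib.error
import urllib.request

VARIANTS = [
    "http://example.com/",
    "http://www.example.com/",
    "http://www.example.com/index.html",
    "http://www.example.com/index.shtml",
]

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # stop urllib from following redirects so we see the raw status code
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirect)

for url in VARIANTS:
    try:
        status = opener.open(url, timeout=10).status
    except urllib.error.HTTPError as e:
        status = e.code  # 301/302/404 etc. show up here
    except urllib.error.URLError as e:
        status = "error: %s" % e.reason
    print(url, "->", status)

# Ideally one variant answers 200 and the rest 301 to it; four separate
# 200s means a bot sees four "different" copies of the same page.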
Someone please put this thread out of its misery..
Dude - arguably the most important post regarding Bourbon, by GoogleGuy, was *today*. You may also fail to realize that due to a New Orleans Voodoo Curse if this thread dies everybody who has posted in it will expire as well, so we are all in this together.
My original question posed to GG was:
>With this update I am finding more and more content thieves hijacking large chunks of my web pages for Adsense and other things. How much is this duplication affecting my rankings and others? To me it seems a large portion of the Google algo would be penalizing for duplication.<
My opinion is this problem is well out of control, and GG knows it. And the money trails lead more and more to Adsense scrapers that engage in this. But it's doubtful I'm going to get an answer like "yeah, Adsense scrapers can wreck your rankings with duplicate content." I would bet on no answer at all unless it was a definite no. And if it were a no, that would be like saying I can build 100 domains all with the same content. Doesn't stand to reason.
Look, the days are over when you can sloppily throw something on the Internet and magically have someone else sort it out for you. Now more than ever *optimization* is important, and that includes taking the time to both learn and do, including constructing your website(s) consistently and sensibly. Instead you can choose to put four near duplicate copies of pages on the Internet on the same domain, link to them all and confuse the hell out of an easily confused bot, and then rant about how "they" ruined your site or business. No, you "ruined" it by not giving it enough loving care and attention.
"I didn't know you needed a complete education in webmastering to compete in this mess."
You need one before you should be allowed to wildly blame other people for problems you created, when the basic solution has been posted many times, including by Google Guy (also several times). Stop complaining and get off your butt and solve the problems you created.
You should. Unless you fix it, you'll be in this thread for years.
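The fix people keep describing is to pick one canonical form and 301-redirect every other variant to it at the server level. Purely to illustrate the mapping (this is not anyone's actual config; the host and paths are placeholders), the logic boils down to something like:

from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.example.com"  # placeholder: pick one host and stick to it

def canonical(url):
    # map any home page variant to the single URL you want indexed
    scheme, host, path, query, frag = urlsplit(url)
    if host in ("example.com", "www.example.com"):
        host = CANONICAL_HOST
    if path in ("", "/index.html", "/index.shtml"):
        path = "/"
    return urlunsplit((scheme, host, path, query, frag))

for variant in ("http://example.com",
                "http://www.example.com/index.html",
                "http://www.example.com/index.shtml"):
    print(variant, "-> 301 ->", canonical(variant))

# In practice the redirect itself lives in your server config, and your
# internal links should all point at the canonical form too.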
I don't care how many times you say it. I can buy sucky positions in the SERPs, but I can't buy a total wipeout overnight due to some of these issues (other than outright dup content and, to a lesser extent, 302's). I also can't buy going URL-only for 80% of pages in 10 days because of these issues (www/non-www).
In my case it is likely some other problem, since I got rid of non-www in Feb by fixing my internal links. And I don't think it has to do with a bunch of missing </p>'s, or the limited use of tables for formatting, or the other 1000 things we're all throwing out there.
When you don't rank for an exact, unique page title amongst 150 results, it is something beyond www/non-www; we call it a penalty.
I'm sure 95% of folks with websites don't have a clue what the heck you guys are talking about and a lot of them rank just fine.
I'm sure a portion of that 95% suddenly lost a good chunk of their traffic and don't have a clue why. That is why it's so important that Google solve the problem in such a way that the average Joe or Jane doesn't have to be a technical expert in order to avoid penalties like this.
On another topic. Jay, sure some people have been venting their frustration but there has been a lot of interesting discussion as well. I've sure learned a lot about the real world of the Internet in these Bourbon threads.
Does Google only look at exact phrase matches, or does it go beyond that and use LSI or some other method to look at dup content more broadly?
Great question and I'm guessing they are experimenting with different approaches. I'd speculate wildly that they define "duplicate content" as pages that share some percentage of text info relating to the query and then rank those on the basis of PR. PR quirks sometimes cause legitimate pages, whose content has been duped, to fall in SERPs.
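To spell that wild guess out, I picture something like the toy filter below: cluster pages whose text overlap crosses a threshold, then keep only the highest-PR member of each cluster. Every name, number and snippet here is invented for illustration; it is not Google's algorithm:

def similar(a, b, threshold=0.8):
    # crude similarity: shared words as a fraction of the shorter page
    wa, wb = set(a["text"].lower().split()), set(b["text"].lower().split())
    return len(wa & wb) / min(len(wa), len(wb)) >= threshold

def dedupe_by_pr(pages):
    survivors = []
    for page in sorted(pages, key=lambda p: p["pr"], reverse=True):
        if not any(similar(page, kept) for kept in survivors):
            survivors.append(page)  # highest-PR copy wins its cluster
    return survivors

pages = [
    {"url": "originator.example/state-guide", "pr": 3,
     "text": "hand written state travel guide with history and climate"},
    {"url": "scraper.example/copy", "pr": 5,
     "text": "hand written state travel guide with history and climate plus ads"},
    {"url": "other.example/recipes", "pr": 2,
     "text": "recipes for bourbon barbecue sauce"},
]

for p in dedupe_by_pr(pages):
    print(p["url"])  # the higher-PR scraper survives; the originator gets filtered out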