I agree with the OP, but Google makes it quite clear that they don't index duplicate content. Now, does that equal an entire-site ban? Well, if the whole site is exclusively duplicate content, yes. I guess what I really want to know is whether a site that offers some unique content in addition to this clone content still gets banned.
Let's say you have a site www.abc123.com with a DMOZ listing. With that come all the clones.
Suddenly the clones are gone, or at least not counted by Google. Does abc123.com suffer for this?
IF Google has been ignoring or derating the clone links, then little or no problem.
IF Google HAS been counting all those clone listings, then lots of innocent sites
could be hurt, while the sites DMOZ did NOT like could benefit. -Larry
"Now they need to move on to Wikipedia and ban all the Wikipedia clones. I'm not talking about places that copy sections and then add to them or personalize them, just exact dupes. I also think they should ban sites that copy wikipedia exactly and then add a little extra scraped crap around it (answers.com for example)."
Twist, but that would go against the wikipedia's mission statement, which is to spread as much as possible around the web via mirror sites. I doubt that Google would do something which wikipedia is expressly opposed to.
Just do a search for some unique DMOZ text (such as the title and first few words of the description) and you'll see hundreds of results appear with the same linking structures... almost all supplemental.
The data seem to show Google randomly banning 37 percent of sites using any DMOZ data. Google does not appear to even be using the very obvious low-tech method of looking at the DMOZ list of sites using DMOZ data and banning all the ones that do not appear to have legal staffs or other ways to fight back.
jackson88: Actually, Google does index 63 percent of the DMOZ using sites, mostly 620,000 page clones. On one of the site: searches Google returned 330,000 pages! DMOZ encourages more webmasters to build DMOZ clones with 1.8 million backlinks each. Google bans them eventually. But are they catching up or getting further behind? If we check next year will it be 55 percent banned or 20 percent banned?
Google has a colossal problem with outright spammer scraper sites on throwaway domain names. ODP clones are not a priority. Trying to distinguish between "good" directory sites and "bad" directory sites is not even on their radar. Honest webmasters are the big losers in the spam wars as more and more clicks go to spammers and more honest sites get caught in the collateral damage.
jimbeetle: The study doesn't mention PR but a banned site would have a PR of zero on all its pages, wouldn't it?
Some of the sites seemed to have an ODP clone tacked on to a site about some subject. Others seemed to be stand alone clones of ODP. Actually you can get a pretty good feel by reading ODP's descriptions of the sites in:
[edited by: tedster at 5:05 am (utc) on April 6, 2006]
|Twist, but that would go against the wikipedia's mission statement, which is to spread as much as possible around the web via mirror sites. I doubt that Google would do something which wikipedia is expressly opposed to. |
What purpose does creating wikipedia clones all over the net serve? I see no difference between this and DMOZ. People copy the content exactly then completely cover the pages with as many ads as allowed. Who does this benefit, besides wikipedia? So wikipedia saves a few bucks in bandwidth cost, and surfers who didn't find what they wanted on wikipedia now have to click through 7-8 more wiki clones, which is a complete waste of their time. Not to mention that all wiki clones are covered with annoying ads. If wiki needs money, why don't they just cut the BS and put the ads on their site? At least the money would benefit wikipedia and not help crappy scraper sites like answers.com add one more useless result to the serps.
Besides, what do you think the DMOZ dupers' next target will be? Dupers aren't exactly in the business of creating original worthwhile content. Just looking to spam serps with others' work and their ads.
Twist: that's definitely a valid point, and I know a lot of people feel that way, but I have to say I disagree with the wikipedia analogy for a few reasons:
Answers is a great example of a company that has added value to the wikipedia model. They are not just a clone, they have over a hundred different databases of reference information. So you get everything they have for your search on one page. For that reason I prefer it to wikipedia. You just get more. How can you knock them for that?
Did they create the information? No, but they found a way to improve upon it, by having more resources side by side which the user can filter through. I think that is what wikipedia is all about; improving itself through free access. Just because those improvements weren't done on the wikipedia site makes no difference to them because they have no ads, and don't care at all about traffic.
|The study doesn't mention PR but a banned site would have a PR of zero on all its pages, wouldn't it? |
Well, that kind of goes to the heart of the questions that some of us have. Basically, is it a ban, or is it a dupe content penalty? Without some basics it's impossible to figure out anything. That's kind of why we've been asking questions and waiting for answers.
|What purpose does creating wikipedia clones all over the net serve? |
Like with free software people can make forks or do what answers.com does. This is the whole concept of wikipedia.
Wikipedia is essentially pointless without the free usage option, that is the base wikipedia was made for. The information itself is/was somewhere on the web already and wp editors scraped all free content together.
No copying and improving on wikipedia content means essentially all the money donated to wikipedia could have gone to some useful cancer charity.
Wikipedia is essentially a mega scraper site itself.
Google = robot assisted scraping
Wikipedia = human assisted scraping
These scraping singularities have to be abolished if the web is to become a place where everyone can do business and not just 10 megasites.
Commercially, luckily the spam has already reached such an extent that using google is pointless and you go to a site you trust. Hopefully the evergreen content section will move the same way. Then everyone can make a living and the days of the scraper sites are gone.
If I'm not mistaken, don't these topics appear here about every 12-18 months? My opinion? Google is just flushing its index again. It's almost like there is this threshold for allowing the clones to propagate. Once that threshold is reached, Google purges them from the index. Is there an abundance of older clones that were not affected? Was it mostly newer clones that have gone missing?
What value does a dMoz clone serve anyway? Sorry dMoz, you guys/gals have whored your data for years now and many have abused it. Even though you have guidelines in place for such abuse it still happens and it is like poison. Once the abuse starts, it is much easier to take out entire networks rather than try to pick the rotten ones from the bunch.
If I were dMoz, that data would be licensed and paid for. No freebies. You'd eliminate 95% of the garbage by moving that data into a paid model.
Back to the topic. It won't be long before many who are using regurgitated content find that it is pretty worthless in the overall scheme of things. Users don't want to perform searches and have to dig their way through directory listings and wiki content to find what they are looking for. Some of it is valid, most of it is junk.
|What about the future of directories in general, or non-dmoz clones? |
Here's the problem. dMoz was the poster child for directory structure. That was way back when. The model hasn't changed. Many have followed the dMoz model of categorization and/or taxonomy.
Think about this for a moment, if you were a Google, Yahoo!, MSN Search Quality Engineer and you wanted to present more relevant results to your visitors, what is one area you know for sure has been abused and mostly replicated over the years? Directories! They are a dime a dozen and not one has really changed the face of the industry. They all do things very similar.
There is only room on the Internet for a handful of global directories. Anything more than that is worthless. Now, there is plenty of room for niche specific directories. It'll be a little while before they start to propagate and start finding themselves without traffic. The established leaders in the niche will remain untouched as long as they've not stepped too far outside the "guidelines".
|Are users really excited about DMOZ data, even if repackaged? |
Not any more! The average experienced Internet user has been bombarded with SERPs that contain pages that all look the same (basically). After a while you become oblivious to them (they are like banner ads) and you click your back button as quick as you click on the SERP.
|I think all smaller directories are under attack. |
If their data is being regurgitated elsewhere, yes, that data may be coming under fire. But, is the authority, the originator of the content, coming under fire too?
|In other words, why are some apparently banned and not others; what might be the common factor, since DMOZ data alone doesn't do it. |
Oh, I'll bet if someone sampled 5,000 or 10,000 URIs that appear to be affected, there would be visible footprints all over the place.
Here's my vision of the Google Search Quality department...
They have this huge 10,000 square foot vault where there are 100" monitors lined up along the walls all synchronized to display this detailed map of the index. There are men and women stationed at their consoles wearing white lab coats with the Google logo emblazoned on them along with all their geeky accoutrements.
Each one of those men and women is responsible for certain sectors of the map. They have various routines they can run to see link networks, duplicate data, etc. They have F Keys assigned to nuke those networks when they reach a threshold.
- F1 - Purge 25% based on this criteria.
- F2 - Purge 50% based on this criteria.
- F3 - Purge 75% based on this criteria.
- F4 - Purge
- F5 - Reduce by PR1
- F6 - Reduce by PR2
- F7 - Reduce by PR3
- F8 - PR0
- F9 - Permanently Remove (Life Sentence, eligible for parole in 25 years.)
- F10 - Remove Forever (Life and a Day. No parole eligibility.)
- F11 - Shuffle Index (This is to make you go back and undo stuff on your site. You then present more flags to the GQE. ;) This is also industry specific.)
- F12 - Repeat
If I were a Google Quality Engineer, I'd purge all but the clones that were established and didn't have particular footprints. No, I wouldn't purge newer clones unless they fell within the criteria I'm using for the purge. That set of criteria could be quite extensive and is definitely based on years of research and the infinite data I have available to me as a GQE.
Hmm, but wikipedia or Dmoz content is as valid elsewhere as on the original site. (Not having a dmoz clone myself, btw.) Back when there were a zillion copies of the apache help files around, I didn't much care where I read them.
Much worse is if you look for info on a video camera and all results are pretty much useless, consisting of empty pages linking somewhere else equally useless. In the end I went to amazon and read the reviews there.
This whole internet and everything is free mentality is the problem. If you have to pay, people might also have to pay you and while one would possibly have the same amount of money the economy itself would get a boost, which wouldn't be a bad thing.
Well if they kill DMOZ clones and Wikiclones yet another Gnu project comes along and gets copied..
Either Google follows the Gnu [or sa whatever] principle that you can copy and make money, like they did themselves, or they should ban the whole GNU thing altogether.
A normal library index would state that "The wizard of Oz" is in library x y and z and it would still have the same content .., you can choose yourself where you want to read it, it's still the "Wizard of Oz"...
|Hmm, but wikipedia or Dmoz content is as valid elsewhere as on the original site. |
Not if it becomes used by thousands of other sites in pretty much the same format. There is no value there. I would hope that if I did a search for something that I'd see the authority sites listed first. I typically don't go past page one anyway for my personal searches because I power search. But the average user doesn't really understand the advanced search and they are left sifting through all the duplicate stuff. It's a mess.
Personally, the more duplicate data the search engines can purge the more refined their indexes become. The authority sites will be the ones to withstand the whims of the algos.
It's first come, first served.
Keep in mind DMOZ has old origins and stuff like Google's PR was not and still today is not a concern while editing DMOZ.
I have a site that is listed in DMOZ and I happen to edit a cat in DMOZ. Why should I not use "my" cat also on my site's links page and be done with it? On my own site I also have more freedom from the guidelines (read: rules; editors get slapped for not following them) and can add things that I cannot add to DMOZ.
My rationale for becoming a DMOZ editor was exactly that: I already kept a collection of links, why not edit theirs and use it on my site as well?
No, I don't have the entire directory on it, just the one relevant cat.
As for the backlinks asked for by DMOZ: yes they are mandatory to use any data coming from DMOZ, not giving that attribution will lead you to be blacklisted on DMOZ (no editor can add your site anymore).
And I see far fewer reciprocal link requests but instead see more submittals of their site to DMOZ ever since I have the attribution.
Moreover those backlinks don't generate all that much PR:
2 have the category in their URL so they are spread out over all the cats. And one points to the about page at DMOZ, which "only" has PR 8.
I agree people just copying the entire database and slapping ads on them are not worthy of a google listing as they clutter the SERPS (and same goes for any other scraper/content replicator). But there are sensible uses as well, just don't copy the entire thing.
And I need no PR on my links page, that page is intended to help people find more information that I'm not giving them, or that I give them in a manner they don't like.
Hmm, with freely copyable content there is no authority site, but only an originator site. The fundamental error is to attach a value to a copy process. It's a copy ... The only valid option is to humanly rank the value of the copy process; the content itself is a copy, hence it is meaningless ranking it. It can be ranked down because it is so full of ads that it is a useless copy, just as a xerox with loads of dust on it is useless. That is a valid process. But since Google lives off ads it can't really rank them by ad content, hence it pretends to use the duplicate content "ranking".
Meaningful ranking of copies has to be external to the actual content: things like webserver speed [aka wikipedia dying its usual slow death in your browser], ad content, site context.
What is really much worse, using the apache help file example, is when you get a pretend copy and there is nothing on that page, as is the case with most commercial content today.
This is great news if true. It is Google's prerogative to include in their index whatever they deem useful to the user. DMOZ clones are simply clutter and it is unethical for DMOZ to piggyback off of Google's success...thereby creating clutter in Google's search results.
DMOZ should achieve its success on its own without backdoor entries into a SE index. Who needs a SE (or directory) within a SE?
> There is a new study that says Google
I saw the title to this thread and I thought, "who bumped a 4 year old thread!?" back to the top.
WT!? Google has banned/pr0'd/booted any sites that use dmoz data for 3-4 years. If you have sites that were not pr0'd 4 years ago - well count your lucky stars - they finally caught up with you.
> banning entire sites
fruit from the poison tree. So many sites were using dmoz data as bot bait that they had to take action.
Until Google removes its own directory (a DMOZ clone) from its index then I'm not going to be happy about this.
Disclosure: I do have a DMOZ clone, but I make no money from it; I just had a cool directory name that was sitting around so I made a clone.
My experience says not true, or if it is, then it's done with some degree of sophistication. I have several sites based on DMOZ, some using all of DMOZ, some just a set of sub-categories and the PR has not been lost on any of them. However none of them are "DMOZ clones" that just do an out and out copy, they all have added value, and perhaps Google can recognize that.
One of them just jumped from PR 2 to PR 3.
|However none of them are "DMOZ clones" that just do an out and out copy, they all have added value, and perhaps Google can recognize that. |
Perhaps they can. If you were fortunate enough to have avoided getting purged, then you must be doing something right. Or, you just haven't crossed that threshold yet.
|One of them just jumped from PR 2 to PR 3. |
Hmmm, them's grounds for celebration!
PR2 to PR3 has no effect on the universe, none whatsoever. Especially viewing it from the public Toolbar.
I just noticed, in my niche, that two of my competitors removed the Dmoz clone from their sites.....
they had not been penalized: maybe just prevention...
What if you simply disallow robots to access the DMOZ part of your site? Are you still in danger?
Probably not. Google aren't running a vendetta against DMOZ clones, they're just trying to protect the integrity of their index.
That means (1) filtering substantially duplicated pages from their index, and (2) identifying and discounting duplicated links, so that a single link from DMOZ doesn't become wildly overvalued.
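On the robots.txt question above: a minimal sketch of what "disallow robots from the DMOZ part of your site" could look like, checked with Python's standard `urllib.robotparser`. The `/directory/` path and the abc123.com URLs are assumptions made up for illustration. This only shows how a well-behaved crawler would read the rules; it says nothing about how Google actually treats the rest of the site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the DMOZ-derived pages are assumed to live
# under /directory/ (an illustrative path, not from the thread).
ROBOTS_TXT = """\
User-agent: *
Disallow: /directory/
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The clone section is blocked; the rest of the site stays open.
clone_ok = is_crawlable(ROBOTS_TXT, "Googlebot",
                        "http://www.abc123.com/directory/Arts/")
home_ok = is_crawlable(ROBOTS_TXT, "Googlebot",
                       "http://www.abc123.com/index.html")
```

Keeping the clone under a single path prefix is what makes a one-line `Disallow` rule enough.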
Using DMOZ data should make those pages subject to duplicate content penalty and IMO rightly so unless you add your own value to that content other than slapping AdSense on the page.
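To make "substantially duplicated" a bit more concrete, here is a rough sketch of one textbook way to score how similar two pages are: word shingles compared with Jaccard similarity. Nobody outside Google knows what their filter really does, so treat this purely as an illustration; the sample strings and the shingle size `k=3` are assumptions.

```python
def shingles(text: str, k: int = 3) -> set:
    """Split text into the set of overlapping k-word sequences."""
    words = text.lower().split()
    count = max(len(words) - k + 1, 1)  # at least one shingle for short text
    return {tuple(words[i:i + k]) for i in range(count)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


# Made-up sample text standing in for a directory listing.
dmoz_entry = ("Example Widgets - Manufacturer of industrial widgets, "
              "with product specifications and a dealer locator.")
clone_page = dmoz_entry  # an exact dupe of the listing
own_review = ("We toured the widget factory last spring and came away "
              "impressed by the new dealer program.")

dupe_score = jaccard(dmoz_entry, clone_page)
review_score = jaccard(dmoz_entry, own_review)
```

An exact copy scores at the top of the range, while text you wrote yourself about the same site scores near the bottom, which is the "add your own value" point in a nutshell.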
It's unfortunate that dMoz has progressed to the point where it is today. They are considered the authority directory out there, or they used to be. All it would take is for one directory to show up with an equal to or greater than number of quality listings and dMoz would be history overnight.
But, I don't see that happening. dMoz has been around for a long time and has its roots in the web. It's not going anywhere soon. Google will continue to purge the clones and maybe someday Yahoo! and MSN will catch on. Kudos to Google for doing their best in this area. They aren't perfect, but every now and then they make some good moves.
dMoz should have moved into a paid model years ago. They could easily have been the de facto directory on the Internet. Some cleanup behind the scenes amongst staff and off we go. But no, it continues to receive bad press from Webmasters all over the world and that is not good.
I look forward to the day that I don't see a post here from someone complaining about not being able to get a listing in dMoz. That would be so nice and welcomed. Too many people fret over a listing with dMoz when they should just submit (following the guidelines) and forget about it. dMoz doesn't hold the same value it once did. It's a link, that is all.
Just noticed the Gigablast [dir.gigablast.com] search engine seems to have been affected by this.
Other than the pagerank displayed in the Google copy of DMOZ what does Google provide to justify that it keeps its own copy of DMOZ in the index?
On the whole, I think this is a positive move, however the glaring hypocrisy really hammers home their "Do no evil" mantra.
|Other than the pagerank displayed in the Google copy of DMOZ what does Google provide to justify that it keeps its own copy of DMOZ in the index? |
Actually, they have two duplicate copies of DMOZ:
1. http://directory.google.com/ [directory.google.com]
2. http://www.google.com/dirhp?hl=en [google.com]
|what does Google provide to justify that it keeps its own copy of DMOZ in the index? ... the glaring hypocrisy really hammers home their "Do no evil" mantra. |
You call THAT "evil"? You've got to start reading the newspapers.
Have you geeks forgotten Google's reason for popularity?
Google is popular for having the most relevant pages turn up for your search. Sites regurgitating DMOZ data are not relevant search pages.
This went for 7 pages?
Just a note -- this thread is about Google's treatment of DMOZ clones. I just cleaned out a bunch of off-topic criticism and defense of DMOZ itself -- that conversation has been done to death in our Directories Forum.
We're here to talk about Google, please. No automated tape loops!
While reading this forum and when designing sites in my head I've often wondered whether I shouldn't just pretend google didn't exist. Too often I'm complicating matters in order to do what I think would please google.