Google not only runs its own copy of the entire Open Directory, it also indexes that copy in Google Search.
Neither Google nor DMOZ advises webmasters that running any DMOZ data on your site is very likely to get your entire site banned. DMOZ actively encourages sites to use its data. It even encourages webmasters to use free software for producing grossly duplicative, redundant “clones” of the entire 620,000-page Open Directory. I can understand how that could be intensely irritating to search engines.
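To see how little effort a clone takes, here is a rough sketch of pulling page data out of the ODP dump. The element, attribute, and namespace names below are assumptions about the published RDF dump format, not a tested parser:

```python
# Rough sketch: turning an ODP/DMOZ RDF dump into "clone" page data.
# Element/attribute/namespace names are assumptions about the dump
# format (content.rdf.u8), not verified against a real file.
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.0/}"  # Dublin Core namespace (assumed)

def listings(dump_path):
    """Yield (url, title, description) for every site listed in the dump."""
    for _, elem in ET.iterparse(dump_path):
        if elem.tag.endswith("ExternalPage"):
            url = elem.get("about")                # the listed site's URL (assumed attribute)
            title = elem.findtext(DC + "Title", "")
            desc = elem.findtext(DC + "Description", "")
            yield url, title, desc
            elem.clear()  # keep memory flat on a multi-hundred-MB file

# Feed each tuple into a page template and you have a 620,000-page
# site with zero original content -- which is exactly the problem.
```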
The whole idea and promise of the “Open” Directory Project was that the data was to be freely available for use by any web site. This is effectively a fraud if only Google and their friends can use “Open” Directory data without risking being banned by Google.
The 71,816 DMOZ editors are also being victimized. They were told that they were contributing to an “Open” Directory, not acting as unpaid editors for a $100 billion company.
I think Google and DMOZ both need to be considerably more “open” about this issue and develop standards for allowable use of DMOZ data, if any. If there is no acceptable use for folks who are not Google friends or partners, that should be made clear.
Suddenly the clones are gone, or at least not counted by Google. Does abc123.com suffer for this?
IF Google has been ignoring or derating the clone links, then little or no problem.
IF Google HAS been counting all those clone listings, then lots of innocent sites could be hurt, while the sites DMOZ did NOT like could benefit. -Larry
Twist, but that would go against Wikipedia's mission statement, which is to spread its content as widely as possible around the web via mirror sites. I doubt that Google would do something that Wikipedia is expressly opposed to.
jackson88: Actually, Google does still index 63 percent of the DMOZ-using sites, mostly 620,000-page clones. On one of the site: searches, Google returned 330,000 pages! DMOZ encourages ever more webmasters to build DMOZ clones with 1.8 million backlinks each. Google bans them eventually. But are they catching up or falling further behind? If we check next year, will it be 55 percent banned or 20 percent banned?
Google has a colossal problem with outright spammer scraper sites on throwaway domain names. ODP clones are not a priority. Trying to distinguish between "good" directory sites and "bad" directory sites is not even on their radar. Honest webmasters are the big losers in the spam wars as more and more clicks go to spammers and more honest sites get caught in the collateral damage.
jimbeetle: The study doesn't mention PR but a banned site would have a PR of zero on all its pages, wouldn't it?
Some of the sites seemed to have an ODP clone tacked onto a site about some subject. Others seemed to be stand-alone clones of the ODP. Actually, you can get a pretty good feel by reading ODP's descriptions of the sites in:
[dmoz.org]
Twist, but that would go against Wikipedia's mission statement, which is to spread its content as widely as possible around the web via mirror sites. I doubt that Google would do something that Wikipedia is expressly opposed to.
What purpose does creating Wikipedia clones all over the net serve? I see no difference between this and DMOZ. People copy the content exactly, then completely cover the pages with as many ads as allowed. Who does this benefit, besides Wikipedia? So Wikipedia saves a few bucks in bandwidth costs, and surfers who didn't find what they wanted on Wikipedia now have to click through 7-8 more wiki clones, which is a complete waste of their time. Not to mention that all wiki clones are covered with annoying ads. If Wikipedia needs money, why don't they just cut the BS and put the ads on their own site? At least the money would benefit Wikipedia and not help crappy scraper sites like answers.com add one more useless result to the SERPs.
Besides, what do you think the DMOZ dupers' next target will be? Dupers aren't exactly in the business of creating original, worthwhile content. They're just looking to spam the SERPs with others' work and their own ads.
Answers is a great example of a company that has added value to the Wikipedia model. They are not just a clone; they have over a hundred different databases of reference information. So you get everything they have for your search on one page. For that reason I prefer it to Wikipedia. You just get more. How can you knock them for that?
Did they create the information? No, but they found a way to improve upon it, by having more resources side by side that the user can filter through. I think that is what Wikipedia is all about: improving itself through free access. That those improvements weren't made on the Wikipedia site makes no difference to them, because they have no ads and don't care at all about traffic.
The study doesn't mention PR but a banned site would have a PR of zero on all its pages, wouldn't it?
Well, that kind of goes to the heart of the questions that some of us have. Basically, is it a ban, or is it a dupe content penalty? Without some basics it's impossible to figure out anything. That's kind of why we've been asking questions and waiting for answers.
What purpose does creating Wikipedia clones all over the net serve?
As with free software, people can make forks or do what answers.com does. This is the whole concept of Wikipedia.
Wikipedia is essentially pointless without the free-usage option; that is the basis Wikipedia was built on. The information itself is/was somewhere on the web already, and WP editors scraped all the free content together.
Without copying and improving on Wikipedia content, essentially all the money donated to Wikipedia could have gone to some useful cancer charity.
Wikipedia is essentially a mega scraper site itself.
Google = robot-assisted scraping
Wikipedia = human-assisted scraping
These scraping singularities have to be abolished if the web is to become a place where everyone can do business, not just 10 megasites.
Commercially, luckily, the spam has already reached such an extent that using Google is pointless, so you go to a site you trust. Hopefully the evergreen-content segment will move the same way. Then everyone can make a living and the days of the scraper sites are gone.
What value does a dMoz clone serve anyway? Sorry dMoz, you guys/gals have whored your data for years now and many have abused it. Even though you have guidelines in place against such abuse, it still happens, and it is like poison. Once the abuse starts, it is much easier to take out entire networks than to try to pick the rotten ones from the bunch.
If I were dMoz, that data would be licensed and paid for. No freebies. You'd eliminate 95% of the garbage by moving that data into a paid model.
Back to the topic. It won't be long before many who are using regurgitated content find that it is pretty worthless in the overall scheme of things. Users don't want to perform searches and have to dig their way through directory listings and wiki content to find what they are looking for. Some of it is valid, most of it is junk.
What about the future of directories in general, or of non-dMoz clones?
Here's the problem. dMoz was the poster child for directory structure. That was way back when. The model hasn't changed. Many have followed the dMoz model of categorization and/or taxonomy.
Think about this for a moment: if you were a Google, Yahoo!, or MSN Search Quality Engineer and you wanted to present more relevant results to your visitors, what is one area you know for sure has been abused and mostly replicated over the years? Directories! They are a dime a dozen and not one has really changed the face of the industry. They all do things very similarly.
There is only room on the Internet for a handful of global directories. Anything more than that is worthless. Now, there is plenty of room for niche-specific directories. It'll be a little while before they start to propagate and find themselves without traffic. The established leaders in each niche will remain untouched as long as they've not stepped too far outside the "guidelines".
Are users really excited about DMOZ data, even if repackaged?
Not any more! The average experienced Internet user has been bombarded with SERPs full of pages that all look (basically) the same. After a while you become oblivious to them (they are like banner ads) and you click your back button as quickly as you clicked the SERP listing.
I think all smaller directories are under attack.
If their data is being regurgitated elsewhere, yes, that data may be coming under fire. But, is the authority, the originator of the content, coming under fire too?
In other words, why are some apparently banned and not others? What might be the common factor, since DMOZ data alone doesn't do it?
Oh, I'll bet if someone sampled 5,000 or 10,000 URIs that appear to be affected, there would be visible footprints all over the place.
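Here's a rough sketch of what that sampling could look like in practice. The footprint strings and the input list are hypothetical placeholders, not a definitive fingerprint set:

```python
# Rough sketch: fetch each affected URL and grep for a DMOZ-clone
# footprint. Footprint strings and input file are placeholders.
import urllib.request

FOOTPRINTS = [
    "help build the largest human-edited directory",  # ODP attribution line (assumed wording)
    "dmoz.org",
]

def has_footprint(url):
    """True if the page appears to carry a known clone footprint."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            html = resp.read(200_000).decode("utf-8", errors="replace").lower()
    except (OSError, ValueError):
        return False
    return any(f in html for f in FOOTPRINTS)

# urls = open("affected_urls.txt").read().split()   # hypothetical sample
# hits = sum(has_footprint(u) for u in urls)
# print(f"{hits}/{len(urls)} sampled pages carry a clone footprint")
```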
Here's my vision of the Google Search Quality department...
They have this huge 10,000 square foot vault where there are 100" monitors lined up along the walls all synchronized to display this detailed map of the index. There are men and women stationed at their consoles wearing white lab coats with the Google logo emblazoned on them along with all their geeky accoutrements.
Each one of those men and women is responsible for certain sectors of the map. They have various routines they can run to see link networks, duplicate data, etc. They have F-keys assigned to nuke those networks when they reach a threshold.
If I were a Google Quality Engineer, I'd purge all but the clones that were established and didn't carry particular footprints. No, I wouldn't purge newer clones unless they fell within the criteria I'm using for the purge. Those criteria could be quite extensive, and they are definitely based on years of research and the infinite data I have available to me as a GQE.
Much worse is when you look for info on a video camera and all the results are pretty much useless, consisting of empty pages linking somewhere else equally useless. In the end I went to Amazon and read the reviews there.
This whole "the internet and everything on it is free" mentality is the problem. If you have to pay, people might also have to pay you, and while one might end up with the same amount of money, the economy itself would get a boost, which wouldn't be a bad thing.
Well, if they kill DMOZ clones and wiki clones, yet another GNU project comes along and gets copied.
Either Google follows the GNU [or SA, whatever] principle that you can copy and make money, like they did themselves, or they should ban the whole GNU thing altogether.
A normal library index would state that "The Wizard of Oz" is in libraries X, Y, and Z, and it would still have the same content; you can choose yourself where you want to read it, it's still "The Wizard of Oz".
Hmm, but Wikipedia or DMOZ content is as valid elsewhere as on the original site.
Not if it gets used by thousands of other sites in pretty much the same format. There is no value there. I would hope that if I did a search for something, I'd see the authority sites listed first. I typically don't go past page one anyway for my personal searches because I power search. But the average user doesn't really understand advanced search, and they are left sifting through all the duplicate stuff. It's a mess.
Personally, I think the more duplicate data the search engines can purge, the more refined their indexes become. The authority sites will be the ones to withstand the whims of the algos.
It's first come, first served.
I have a site that is listed in DMOZ, and I happen to edit a cat in DMOZ. Why should I not use "my" cat on my site's links page as well and be done with it? Meanwhile, I have more freedom from the guidelines (read: rules; editors get slapped for not following them) and can add things to my site that I cannot add to DMOZ.
My rationale for becoming a DMOZ editor was exactly that: I already kept a collection of links, so why not edit theirs and use it on my site as well?
No, I don't have the entire directory on it, just the one relevant cat.
As for the backlinks asked for by DMOZ: yes, they are mandatory if you use any data coming from DMOZ; not giving that attribution will get you blacklisted on DMOZ (no editor can add your site anymore).
And ever since I added the attribution, I see far fewer reciprocal-link requests and instead see more people submitting their sites to DMOZ.
Moreover, those backlinks don't generate all that much PR: two have the category in their URL, so their PR is spread out over all the cats, and one points to the about page at DMOZ, which "only" has PR 8.
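A back-of-the-envelope illustration of why that dilution matters, using the classic published PageRank formula with made-up numbers:

```python
# Back-of-the-envelope: PageRank passed by one DMOZ category page,
# using the classic formula PR(A) = (1 - d) + d * sum(PR(T) / C(T)).
# The figures below are illustrative guesses, not measured values.
d = 0.85            # damping factor from the original PageRank paper
pr_category = 4.0   # hypothetical internal (non-toolbar) score of one category page
outlinks = 40       # listings plus subcategory links on that page

pr_passed = d * pr_category / outlinks
print(f"each listed site receives ~{pr_passed:.3f}")  # ~0.085 -- not much
```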
I agree that people just copying the entire database and slapping ads on it are not worthy of a Google listing, as they clutter the SERPs (and the same goes for any other scraper/content replicator). But there are sensible uses as well; just don't copy the entire thing.
And I need no PR on my links page; that page is intended to help people find information that I'm not giving them, or that I give in a manner they don't like.
Meaningful ranking of copies has to be based on factors external to the actual content: things like web-server speed [aka Wikipedia dying its usual slow death in your browser], ad load, site context.
What is really much worse, to use the Apache help-file example, is when you get a pretend copy and there is nothing on the page, as with most commercial content today.
DMOZ should achieve its success on its own, without backdoor entries into an SE index. Who needs an SE (or directory) within an SE?
I saw the title of this thread and thought, "Who bumped a 4-year-old thread back to the top!?"
WT!? Google has banned/pr0'd/booted sites that use DMOZ data for 3-4 years. If you have sites that were not pr0'd 4 years ago, well, count your lucky stars; they finally caught up with you.
> banning entire sites
Fruit from the poison tree. So many sites were using DMOZ data as bot bait that they had to take action.
One of them just jumped from PR 2 to PR 3.
However, none of them are "DMOZ clones" that do an out-and-out copy; they all have added value, and perhaps Google can recognize that.
Perhaps they can. If you were fortunate enough to have avoided getting purged, then you must be doing something right. Or, you just haven't crossed that threshold yet.
One of them just jumped from PR 2 to PR 3.
Hmmm, them's grounds for celebration!
PR2 to PR3 has no effect on the universe, none whatsoever. Especially viewing it from the public Toolbar.
That means (1) filtering substantially duplicated pages from their index, and (2) identifying and discounting duplicated links, so that a single link from DMOZ doesn't become wildly overvalued.
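For the curious, here is a minimal sketch of point (1), shingle-based near-duplicate detection. This is the textbook version of the idea, not Google's actual pipeline:

```python
# Minimal sketch of near-duplicate detection via word shingles.
# Real engines use far more elaborate fingerprints (simhash and
# friends) at web scale; this just shows the core idea.

def shingles(text, k=5):
    """Set of k-word shingles from a page's visible text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}

def resemblance(a, b):
    """Jaccard similarity of two shingle sets; near 1.0 means near-duplicate."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0

# Point (2) is the same logic applied at the link level: if N
# near-duplicate pages all carry the same outbound link, count it
# once, not N times.
```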
But, I don't see that happening. dMoz has been around for a long time and has its roots in the web. It's not going anywhere soon. Google will continue to purge the clones, and maybe someday Yahoo! and MSN will catch on. Kudos to Google for doing their best in this area. They aren't perfect, but every now and then they make some good moves.
dMoz should have moved into a paid model years ago. They could easily have been the de facto directory on the Internet. Some cleanup behind the scenes amongst staff and off we go. But no, it continues to receive bad press from webmasters all over the world, and that is not good.
I look forward to the day that I don't see a post here from someone complaining about not being able to get a listing in dMoz. That would be so nice and so welcome. Too many people fret over a listing with dMoz when they should just submit (following the guidelines) and forget about it. dMoz doesn't hold the same value it once did. It's a link, that is all.
Other than the pagerank displayed in the Google copy of DMOZ what does Google provide to justify that it keeps its own copy of DMOZ in the index?
On the whole, I think this is a positive move; however, the glaring hypocrisy really undercuts their "Do no evil" mantra.
Other than the pagerank displayed in the Google copy of DMOZ what does Google provide to justify that it keeps its own copy of DMOZ in the index?
Actually, they have two duplicate copies of DMOZ:
1. http://directory.google.com/ [directory.google.com]
2. http://www.google.com/dirhp?hl=en [google.com]