| 2:04 pm on Apr 4, 2006 (gmt 0)|
Does your study show that these DMOZ clone pages are actually banned -- that is, there are no results in a site: query? Or does it show that cloned pages are simply not getting PR? What happens if a normal search is done on the domain name of the clone site?
As a user, I WANT Google to filter out exactly duplicate information from the search results -- maybe this is also part of the picture.
| 3:23 pm on Apr 4, 2006 (gmt 0)|
Where did you get this information from that google is banning these sites?
| 3:54 pm on Apr 4, 2006 (gmt 0)|
The study was done using the sites listed in "Sites Using ODP Data" published on the Open Directory Project (ODP) site at:
DMOZ also publishes "tools" for using ODP data, including plenty of free scripts for making exact-copy clones of ODP's 620,000-page directory. Some of the tools are "live" real-time "scraper" scripts that allow a site to "virtually" host 620,000 pages of ODP data without actually hosting any of them. ODP is an edited directory, so presumably they approved of the listed "tools". They also apparently have no objection to the real-time parasitic clones, even though ODP is now so slow that the parasites frequently time out (20 seconds) waiting for ODP to send them a page so they can pass it along to their own visitors.
The study checked sites in the ODP list using the site: search on Google, Yahoo Search, and MSN Search. About 50 percent of them were banned by at least one search engine, and Google was much more likely to ban these sites (Google 37%, Yahoo 11%, MSN 9%). It did not seem to matter whether the site was an exact live clone or was using only a small amount of ODP data in a much more responsible way.
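For clarity, the arithmetic behind those percentages can be sketched like this. Note that the site names and per-engine results below are hypothetical placeholders, not data from the study:

```python
# Minimal sketch of the tally behind the percentages above.
# The per-site results here are made up; the real study checked
# each site with a site: query on each engine.

ENGINES = ("google", "yahoo", "msn")

def ban_rates(results):
    """results: dict mapping site -> set of engines that banned it.
    Returns (per-engine ban rate, share banned by at least one engine)."""
    total = len(results)
    per_engine = {
        e: sum(1 for banned in results.values() if e in banned) / total
        for e in ENGINES
    }
    any_banned = sum(1 for banned in results.values() if banned) / total
    return per_engine, any_banned

# Hypothetical sample of four clone sites
sample = {
    "clone-a.example": {"google"},
    "clone-b.example": {"google", "yahoo"},
    "clone-c.example": set(),
    "clone-d.example": {"msn"},
}
rates, any_rate = ban_rates(sample)
```

With this toy sample, Google bans 2 of 4 sites (50%) and 3 of 4 are banned by at least one engine (75%), mirroring the pattern the study reports at scale.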
We have a site that we think uses ODP data very responsibly but was caught in the collateral damage resulting from the spam war between ODP and Google. Our site was banned after operating for more than five years.
| 5:16 pm on Apr 4, 2006 (gmt 0)|
I am sorry about your site, Altair.
But it is good news. I do hope this will decrease the weight of dmoz links, and be the final nail in the coffin of directories as time-travel "so 2 years ago" lists of links.
| 5:25 pm on Apr 4, 2006 (gmt 0)|
I too have a DMOZ clone site that was banned by Google after a while. I'm disappointed but I can hardly blame Google for doing so. How can they effectively list hundreds and hundreds of sites that are substantially identical?
| 5:41 pm on Apr 4, 2006 (gmt 0)|
They are not only banning the directory portion of the sites using DMOZ/ODP data. They are banning the entire site for the use of that data.
| 5:41 pm on Apr 4, 2006 (gmt 0)|
what about the future of directories in general, or non-dmoz clones?
| 5:45 pm on Apr 4, 2006 (gmt 0)|
Perhaps there's something involved in the ban instead of, or in addition to, the use of dmoz data. If using dmoz data is the basis for a ban, it is a long way from fully implemented, judging by one experiment I conducted today for the grins and giggles of it.
I went and picked a small out of the way subsub...cat, albeit with a commercial tinge to it.
I lifted a phrase from the description of a site in said lowerlevelcat, plugged it into google as a quote, presto chango, thousands of results returned, with enough snippet to know it was returning clone site after clone site after clone site.
Tis unlikely anyone would ever search that specific phrase except as an experiment, but given the obvious duplication present, perhaps that's a good thang.
| 5:59 pm on Apr 4, 2006 (gmt 0)|
helleborine: Do you suppose DMOZ is doing what they are doing to get more PageRank for their sites? Maybe Google should ban DMOZ, the ultimate link farm! It's OK to hate directories if you love search engines, but I am a little uncomfortable with the idea that Page and Brin get to determine everything about what I get to see on the Web. If there are no directories, search engines get total control over what we see.
jomaxx: I agree, all those identical clones create an impossible situation.
I assume you didn't put up your site with the idea you were going to get banned. Neither did we. Our site only uses a small amount of ODP data. The data is substantially reformatted, reordered, and repaged in a way that we think adds value, and we provide a lot of additional non-ODP data. Our users like it. But apparently it is not good enough to be in Google's index of 8,000,000,000 web pages! You can't please everybody, but one person at Google has more weight than thousands or even millions of other people.
| 6:19 pm on Apr 4, 2006 (gmt 0)|
You're using DMOZ's data for free, and at the same time accusing them of acting in bad faith and suggesting that THEY, by definition the authoritative site for this data, should be banned from Google?
That just sounds like sour grapes.
| 7:03 pm on Apr 4, 2006 (gmt 0)|
Are users really excited about DMOZ data, even if repackaged?
| 7:51 pm on Apr 4, 2006 (gmt 0)|
ownerrim: I think all smaller directories are under attack. Kinderstart, (the company suing Google), does not use ODP data in their directory.
jomaxx: Those 71,000 editors are doing a great job, but someone at DMOZ is doing a lot of things that certainly look like they are calculated to irritate the search engines. I distinguish between the editors and advertised purpose of ODP (good), and the sleazy practices and the fact that ODP isn't really open anymore (bad). Google would not dare ban DMOZ and keep using DMOZ data in their own clone. Even an 800 pound gorilla has limits.
| 8:05 pm on Apr 4, 2006 (gmt 0)|
I still want to know: what does "banned" mean in this case? Are we talking completely removed from the index? Penalized? Filtered? What results show in a site: query? Are cloned pages simply not getting PR? What happens if a normal search is done on the domain name of the clone site?
kevinpate's post above has already noticed that some sites with DMOZ data are showing up in searches. So just using cloned data does not automatically cause a site to be banned.
| 8:30 pm on Apr 4, 2006 (gmt 0)|
Sooner or later Google will ban dmoz too, and may start paid inclusion like Yahoo. The reason being, DMOZ data is no longer as fair as it used to be a couple of years back. Almost everyone in the web trade knows that dmoz is no longer as neutral as it used to be, and the honesty of its editors is now often being questioned.
So why would Google want to be associated with a site whose credibility comes under fire every now and then?
| 8:57 pm on Apr 4, 2006 (gmt 0)|
Whether the people at DMOZ suggested it or not, why would a webmaster want to throw up a dup copy of DMOZ. Sure Google did it, but they also added PR ratings to it. I understand that this isn't much of a change, but I actually like seeing how my PR rates against others in a category. I see no reason to ever see an exact copy of DMOZ, just show me DMOZ itself.
All I can say is, about time.
Now they need to move on to Wikipedia and ban all the Wikipedia clones. I'm not talking about places that copy sections and then add to them or personalize them, just exact dupes. I also think they should ban sites that copy wikipedia exactly and then add a little extra scraped crap around it (answers.com for example).
I hate seeing referrals from answers.com that I know are coming from a link I added to Wikipedia: a section of Wikipedia that I contributed to by writing an article and adding a link to some photos. They have taken my work, done nothing to improve it, slapped ads all around it, and, to top it off, they outrank Wikipedia because they have saturated their page with a particular keyword by adding more scraped content. Sounds like a good enough reason for a permanent ban to me, or at the very least quit putting their scraped content above Wikipedia in the SERPs. After seeing them outrank Wikipedia I no longer have any desire to ever again spend time contributing to Wikipedia. Why should I help the ________ at answers.com make money from my work?
| 9:45 pm on Apr 4, 2006 (gmt 0)|
Did you have the banner attribution to dmoz?
If you reorganized, reclassified, and repaged the original data (as I did, and I'm doing well), there's no way Google can catch the duplicate. I may be wrong, but I find it very hard to believe they could.
| 9:55 pm on Apr 4, 2006 (gmt 0)|
Tedster: Banned means banned, delisted, blacklisted, censored, or "removed from the index". A site: search returns zero or one page. Sometimes the site: search returned a very small number of pages (like 29 in a 600,000 page clone) but that was not counted as "banned". Sometimes the site: search returned only links, where the search engine had not actually indexed the page but only indexed a link on someone else's page. That was not counted as "banned" either. If you count the sites that were obviously severely penalized but not actually totally delisted, the numbers would be substantially higher.
kevinpate: Yes it would be trivial for Google to find all the clones by a means just like you describe: look for the characteristic URL structure. But they don't seem to be doing that. Instead they are randomly banning sites that use any ODP data, regardless of how it is used. Possibly they are banning sites containing ODP data or other directories that are nominated by an enemy, competitor, or other person that doesn't like the site via the "spam report" form.
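A footprint check of the kind described above could be as simple as matching the characteristic ODP category path in a site's URLs. This is purely an illustrative sketch, not Google's actual method; the pattern and the set of top-level categories are assumptions based on the public ODP layout:

```python
import re

# Illustrative only: ODP clones typically preserve the directory's
# category path, e.g. /Arts/Music/Styles/ or /Top/Regional/Europe.
# This is NOT Google's actual detection method, just a sketch of how
# a characteristic URL structure can serve as a footprint.
ODP_PATH = re.compile(
    r"/(?:Top/)?(?:Arts|Business|Computers|Games|Health|Home|News|"
    r"Recreation|Reference|Regional|Science|Shopping|Society|Sports|World)"
    r"(?:/[A-Z][\w%+-]*)+/?$"
)

def looks_like_odp_clone(url_path: str) -> bool:
    """Return True if a URL path follows the familiar ODP category layout."""
    return bool(ODP_PATH.search(url_path))
```

A crawler could flag hosts where a large fraction of discovered paths match this pattern, then compare the page bodies against the known ODP dump; but as noted above, the observed banning pattern doesn't look like the output of any such systematic sweep.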
| 10:06 pm on Apr 4, 2006 (gmt 0)|
|If you reorganized, reclassified, and repaged the original data (as I did, and I'm doing well), there's no way Google can catch the duplicate. I may be wrong, but I find it very hard to believe they could. |
I agree. I have used ODP to locate additional sites but I do not use their dead links nor do I use their listings as an exclusive source. Probably most importantly I do not use their descriptions. My site was banned for seven weeks last year for reasons I can only theorize about, but it is doing fine now.
If sites that are purely ODP clones disappear I'm pleased about it.
| 10:23 pm on Apr 4, 2006 (gmt 0)|
fischermx: DMOZ wants you to put three links back to DMOZ on EVERY page that contains DMOZ data thereby creating hundreds of millions of backlinks from all those clones. I consider that to be another sleazy practice no doubt earning them even more love and affection, NOT, from search engines. We do provide text attribution but not the three links.
As indicated earlier, the banning pattern suggests that Google is not using any sophisticated method to find the directory sites. I envision a room full of people (I guess you would call them censors) sitting in front of monitors and manually going through sites that are nominated by the spam form or maybe more general mechanical means. The censors are careful not to ban a site that looks like it is big enough to have lawyers, publicists, or friends on Capitol Hill. That is why it has to be manual.
You better hope nobody drops a dime on you. Good luck.
| 10:26 pm on Apr 4, 2006 (gmt 0)|
>>Kinderstart, (the company suing Google), does not use ODP data in their directory.
No, they don't use ODP data but they have deliberately hogged PR by NOT giving OBLs, but rather having them open up in their own frames. Since when is that actually a "link to a site"? Some would call it a linking scheme designed to manipulate the rankings - for which they didn't get banned (thrown out altogether) but it seems to me that any "authority" score they had went right down the toilet for their spammy practices.
| 10:27 pm on Apr 4, 2006 (gmt 0)|
If the sites in this study truly are 'banned' and pages not simply being filtered out of the SERPs, I have a couple of quick questions.
Are the sites related in any way? If so, is there any inter-linking among the sites? Do they happen to be in the same industry? If not related, do they share any other characteristics? Maybe the same directory or CMS software? Other footprints of some sort (besides the obvious ODP prints)? Do the sites primarily use ODP data or is there a considerable amount of other data content?
I guess what I'm trying to get at is if there is anything else in common that might be causing a ban and not the normal filtering of dupe content.
| 10:32 pm on Apr 4, 2006 (gmt 0)|
|I envision a room full of people (I guess you would call them censors) sitting in front of monitors and manually going through sites |
He, he. You might not be far off. That sounds exactly like that Google 'quality control' stuff that popped up a while back. One of the facets of the program supposedly was to rank pages by the value they added to the index.
Nobody really knows that much about it, but from what was seen I don't think anything would result in an actual ban, just a downgrading of a page's importance.
| 10:38 pm on Apr 4, 2006 (gmt 0)|
GREAT NEWS! if true
About time. Duplicate content after duplicate content, not to mention the zillions of backlinks all these clones produce. What exactly is the point of them other than to fill the SERPs with more useless junk that we desperately need removed?
If ever there was a way to GAME Google it was via DMOZ editors listing their own sites about three or four times in the directory, ignoring other competing sites and letting all the clone sites give them backlinks after backlinks pushing them up in Google SERPS without any substance.
I just truly hope this is true and that after all this time they are waking up to this biased DMOZ problem. If so I will raise a glass to Google; this can't come soon enough imo.
Let's hope in addition they realise it's time to drop using the outdated directory data altogether. No one uses it anyway other than the editors!
If I need to find something I use a search engine!
| 10:42 pm on Apr 4, 2006 (gmt 0)|
Does this matter? I mean, I have every sympathy for anyone who has found their site dropped or de-listed for innocent reasons. However, DMOZ is strictly yesterday's news.
Most webmasters submit to DMOZ because it's a free link, not to add to the utopian ideal they spout.
More importantly, they have nothing like 70,000+ editors. A more accurate figure is fewer than 3,000.
Directories made sense pre-Google, but monolithic beasts as badly run as DMOZ are dead, and have been for some time. Any search engine that thinks they have value (and that would be none of them) certainly doesn't want their content slavishly reproduced all over the web: it adds absolutely no value.
Over the last few years Google's clone of DMOZ has slipped off their front page, principally because nobody clicked it. Its days must be numbered. The frequency that they refresh it is also dropping and the PR figures don't make a lot of sense i.e. it doesn't get much attention from Google.
As for people copying chunks of it and expecting to benefit from other people's efforts, well, what do you expect? All gravy trains stop eventually.
[edited by: tedster at 4:53 am (utc) on April 5, 2006]
[edit reason] off topic [/edit]
| 12:14 am on Apr 5, 2006 (gmt 0)|
|A site: search returns zero or one page. |
Thanks, Altair. I just want to be sure we're all talking about the same thing here. We recently had a discussion about the Big Daddy "supplemental club" that got almost completely lost in the fog of imprecise language.
The next factor I'd like to study (and I will as soon as I can) is what is the remainder of the content on these banned DM clones. In other words, why are some apparently banned and not others; what might be the common factor, since DMOZ data alone doesn't do it.
| 12:16 am on Apr 5, 2006 (gmt 0)|
jimbeetle: The sites were taken from the DMOZ alphabetic list of sites "using ODP data". It didn't seem likely they were related any more than any other random sample.
"Quality" did not seem to be a factor. Some of the sites that were not banned had pop up ads and other obnoxious features that the banned sites did not have.
[edited by: tedster at 4:36 am (utc) on April 5, 2006]
| 12:24 am on Apr 5, 2006 (gmt 0)|
|The next factor I'd like to study (and I will as soon as I can) is what is the remainder of the content on these banned DM clones. In other words, why are some apparently banned and not others; what might be the common factor, since DMOZ data alone doesn't do it. |
tedster, there's a published paper out there that talks about using file structures/filepaths for detecting duplicates, near duplicates or mirror sites. I believe Andrei Broder may possibly have authored or co-authored it.
Bingo! Here's a good one. Krishna Bharat, Andrei Z. Broder: Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content.
I'd say the ODP clones are none other than replicated content, so this isn't a bad place to start researching.
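The Bharat/Broder line of work compares documents by their overlapping word n-grams ("shingles") and scores resemblance with Jaccard similarity; near-identical pages score close to 1.0. A minimal sketch of that idea (simplified, without the hashing and sampling the paper uses at scale):

```python
def shingles(text: str, k: int = 4) -> set:
    """Return the set of k-word shingles (contiguous word n-grams)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a: str, b: str, k: int = 4) -> float:
    """Jaccard similarity of the two shingle sets: |A & B| / |A | B|.
    Exact duplicates score 1.0; unrelated documents score near 0.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not (sa or sb):
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

An ODP clone that reproduces the directory descriptions verbatim would share nearly all of its shingles with the source, so it scores close to 1.0 against the ODP dump even if templates and navigation differ.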
| 1:27 am on Apr 5, 2006 (gmt 0)|
|talks about using file structures/filepaths for detecting duplicates, near duplicates or mirror sites |
That was one of the things going through my mind as I clicked through to this thread. It's a very easily discernible footprint.
Altair, sorry if I missed it somewhere in this thread, but not sure if you replied to Tedster's question about PR. And not just for the ODP-generated pages, but other pages on the affected sites. Just trying to see if it's a site-wide thingy or restricted to particular pages.