
Directories Forum

    
Why doesn't DMOZ use automated selection solutions?
The Contractor - msg:3391437 - 12:06 pm on Jul 11, 2007 (gmt 0)


System: The following 6 messages were cut out of thread at: http://www.webmasterworld.com/directories/3389478.htm [webmasterworld.com] by webwork - 7:05 am on July 12, 2007 (utc -5)


>But don't forget this: the majority of websites aren't listable. You might have to review 3, 5, 10, or 1000 websites (depending on the category) to add one listing. The ODP gets about 10-40 suggestions for every listable site.

I often wondered why they didn't "fix" this. Let's face it, 99% of "listed" deeplinks are added by an editor (wild guess), not by the submission process. I never quite understood why the "system" allows people to submit their site/url to 10K categories if they wanted to, along with each individual page. This would save an enormous amount of time and system resources by "fixing" this single issue. Maybe they have already fixed this or improved the backend in some way and I don't know about it.

DMOZ is the "best" directory out there in my opinion. It has its flaws in both management and execution, but the dedicated volunteer editors more than make up for this.

 

Rosalind - msg:3391612 - 3:56 pm on Jul 11, 2007 (gmt 0)

>I often wondered why they didn't "fix" this. Let's face it, 99% of "listed" deeplinks are added by an editor (wild guess), not by the submission process. I never quite understood why the "system" allows people to submit their site/url to 10K categories if they wanted to, along with each individual page. This would save an enormous amount of time and system resources by "fixing" this single issue. Maybe they have already fixed this or improved the backend in some way and I don't know about it.

It's almost certainly down to a lack of automated solutions, and I can only imagine that this is a political decision rather than one based on not having the capabilities.

For instance, it's quite easy to filter submissions and grade them according to the likelihood that they have followed the guidelines. You can examine a number of things in a submission: the category, the text, the domain extension and its likely suitability for that category, the number of previous submissions, whether it's a deep link, and so on. Then you sort the submissions in order of least spam, so that the submissions that look okay to the script get priority. The more factors that go into this, the better it works, much like a SE algorithm. It's surprising how effective such an approach can be.
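For illustration, a minimal sketch of such a grading pass (the signals, weights, and field names here are hypothetical, not anything the ODP is known to run):

    # Hypothetical grading pass: score each suggestion by how likely it is
    # to have ignored the guidelines, then review the cleanest ones first.
    def spam_score(sub, prior_count):
        score = 0.0
        # Deep links are normally editor-added, so a submitted one is suspect.
        if sub["url"].rstrip("/").count("/") > 2:
            score += 2.0
        # Repeated submissions of the same URL count against it.
        score += 0.5 * prior_count
        # Crude text signal: promotional superlatives in the description.
        for word in ("best", "cheapest", "#1", "guaranteed"):
            if word in sub["description"].lower():
                score += 1.0
        return score

    subs = [
        {"url": "http://example.com/", "description": "Widget maker in Leeds"},
        {"url": "http://example.org/buy.htm", "description": "Best cheapest widgets!"},
    ]
    prior = {"http://example.com/": 0, "http://example.org/buy.htm": 4}
    for sub in sorted(subs, key=lambda s: spam_score(s, prior[s["url"]])):
        print(sub["url"])  # least-spammy suggestions surface first

The more independent signals added to such a scorer, the harder it is to game any single one, which is the point of the comparison to a search engine algorithm.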

hutcheson - msg:3391750 - 6:31 pm on Jul 11, 2007 (gmt 0)

Well, in every community the choice of community tools is always a "political" process. You can generate tools all day long, but if the people don't find them useful, they don't get used. On the other hand, if the people find tools useful, they'll ask for more of them.

You can assume that the ODP community is no different. Where people see the value in an automated approach, there will be an attempt (often multiple attempts) to create the automation. And there are always ongoing discussions of what can be automated, and how well the current tools work.

One early ODP example: Some of you may be old enough to remember when Yahoo was 10% or more dead links (back before link ghouls or annual directory renewal fees, so dead links stayed dead and buried). Robozilla was the first link-rot checker for any major directory.
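As a rough illustration of what a link-rot pass does, a minimal sketch (Robozilla's actual internals aren't documented here; the URLs are placeholders):

    # Illustrative link-rot pass: fetch each listed URL and flag the dead
    # ones for editor review. Not Robozilla's actual code.
    import urllib.request
    import urllib.error

    def is_alive(url, timeout=10):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status < 400
        except (urllib.error.URLError, OSError, ValueError):
            return False

    listed_urls = ["http://example.com/", "http://example.org/gone.htm"]
    dead = [u for u in listed_urls if not is_alive(u)]
    print("flag for review:", dead)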

We don't publicly discuss internal automated spam approaches because, given quick turnaround on penetration tests, spammers can trivially breach any automated antispam approach. (For instance, consider the amount of marketing spam that passes through all of GOOGLE'S highly sophisticated automated tests--a regular complaint here.) But as long as spammers don't know what happens to their submittal spam, they can't adapt; they'll just keep wasting effort on techniques that don't work. I don't feel the lack of any specific automatic filters.

The Contractor - msg:3391752 - 6:33 pm on Jul 11, 2007 (gmt 0)

>It's almost certainly down to a lack of automated solutions, and I can only imagine that this is a political decision rather than one based on not having the capabilities.

Sure it is, but I believe things can change without changing or abandoning the original intent of the ODP. There are way too many editor and system resources being spent on this issue.
A simple "I’m sorry, the URL you suggested has already been suggested for review in CategoryName" or "I’m sorry, the URL you suggested is already listed in CategoryName" would solve multiple problems, all of which stem from multiple submissions. The problem has been the conflict between the instruction "Identify the single best category for your site. The Open Directory has an enormous array of subjects to choose from. You should submit a site to the single most relevant category." and a system that allows multiple submissions of the same URL/site to multiple categories.
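For illustration, a minimal sketch of the proposed check, assuming the directory can look up a URL in its listings and in its unreviewed pool (the data stores here are hypothetical stand-ins):

    # Hypothetical lookup tables standing in for the directory's real stores.
    listed = {"http://example.com/": "Business/Industrial_Goods_and_Services"}
    pending = {"http://example.org/": "Shopping/Widgets"}

    def submission_notice(url):
        """Return a rejection message, or None if the suggestion is new."""
        if url in listed:
            return "I'm sorry, the URL you suggested is already listed in %s." % listed[url]
        if url in pending:
            return ("I'm sorry, the URL you suggested has already been "
                    "suggested for review in %s." % pending[url])
        return None

    print(submission_notice("http://example.com/"))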

hutcheson - msg:3391892 - 9:41 pm on Jul 11, 2007 (gmt 0)

>This would solve multiple problems which all stem from multiple submissions.

It wouldn't solve editors' problems. It would do a great deal to solve professional spammers' single biggest problem ... which is the reason editors would get along without it, even if it was one way to solve a problem we really had.

The Contractor - msg:3392362 - 11:20 am on Jul 12, 2007 (gmt 0)

>It wouldn't solve editors' problems.

It certainly would help when some editors' internet connections time out trying to edit in categories where the unreviewed queue is full of duplicate submissions and deeplinks.

True story: Last winter a company contacted me about doing some development. I took on and completed the project. They asked me to look at another of their clients' sites, try to solve the problems they were having with ranking, etcetera, and answer "why we can’t get into DMOZ". There was nothing wrong with the 6-year-old site as far as meeting DMOZ requirements; it was a legitimate manufacturing business (no ecommerce). They had been religiously submitting every 3 months to five different categories of DMOZ under Shopping for over three years. After a lengthy explanation of how DMOZ works, I went to look for the correct category and found a perfect match under a small subcategory of Top: Business: Industrial Goods and Services: Factory Automation. Guess what: they were listed there....

I think a simple notification upon trying to submit the site would have gone a long way in curbing what in many cases is considered submission spam. This development company followed the same mentality with their other clients and had a decent-sized client base. They later told me they found four other clients' sites listed in other categories they were unaware of....

The above is simply a case where people do not read the submission guidelines, and even when they do, they figure that if example.com can have "1460 deeplinks", why can't their site have five or six or ten... You cannot expect submitters to understand what even many editors have problems with...can you?

>It would do a great deal to solve professional spammers' single biggest problem

Please explain, as that is a pretty broad statement.... How could the following quote from my other message help spammers with their submissions?

>A simple "I’m sorry, the URL you suggested has already been suggested for review in CategoryName" or "I’m sorry, the URL you suggested is already listed in CategoryName". This would solve multiple problems, all of which stem from multiple submissions.


>even if it was one way to solve a problem we really had.

So the problem I’m speaking of doesn’t even exist... Denial of problems will never solve them.

I hope my comments are not misconstrued as attacking dmoz or the editors in any way.

gpmgroup - msg:3393030 - 8:31 pm on Jul 12, 2007 (gmt 0)

The simplest answer would be to profile submitters.

Create a whole new class of submitter accounts. Sure, there would be a range of profiles, from spammers to scholars. But the editors would quickly learn which profiles contained the valuable nuggets on a regular basis - far more efficient than wading through an open, spam-ridden queue day in, day out.

In fact, if the queues contain as much toxic sludge as some of the editors suggest, it's not surprising editors choose not to visit the queues very often.

To anyone with a genuine interest in the subject matter they are editing, queues with lots of nuggets on relevant sites would be far more interesting/appealing/satisfying (not to mention a more efficient use of time) than manually searching or surfing for sites.
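For illustration, a minimal sketch of such profiled queues, assuming each suggestion carries an account name and each account has a historical acceptance rate (both hypothetical; nothing like this exists in the ODP's actual system):

    from collections import defaultdict

    def profiled_queues(suggestions, accept_rate):
        """Group suggestions by account; best track records surface first."""
        queues = defaultdict(list)
        for account, url in suggestions:
            queues[account].append(url)
        return sorted(queues.items(),
                      key=lambda item: accept_rate.get(item[0], 0.0),
                      reverse=True)

    suggestions = [("archivist1", "http://example.com/"),
                   ("spammer9", "http://example.net/buy.htm"),
                   ("archivist1", "http://example.org/")]
    accept_rate = {"archivist1": 0.8, "spammer9": 0.02}
    for account, urls in profiled_queues(suggestions, accept_rate):
        print(account, urls)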

jtbell - msg:3393540 - 11:25 am on Jul 13, 2007 (gmt 0)

>It certainly would help when some editors' internet connections time out trying to edit in categories where the unreviewed queue is full of duplicate submissions and deeplinks.

As I understand it, when a suggestion arrives for a URL that has already been suggested, it replaces the older suggestion, so there are never any true duplicates. However, this doesn't prevent multiple deeplink suggestions.
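A minimal sketch of that replace-on-resubmit behaviour, assuming the unreviewed pool is keyed by URL (the field layout is hypothetical):

    # Replace-on-resubmit: the queue is keyed by URL, so a new suggestion
    # overwrites the old one, but each distinct deep link gets its own slot.
    unreviewed = {}

    def suggest(url, category, description):
        unreviewed[url] = (category, description)  # replaces any older entry

    suggest("http://example.com/", "Shopping/Widgets", "Widgets for sale")
    suggest("http://example.com/", "Shopping/Widgets", "New blurb")
    assert len(unreviewed) == 1   # never a true duplicate
    suggest("http://example.com/a.htm", "Shopping/Widgets", "A deep link")
    assert len(unreviewed) == 2   # but multiple deeplinks still pile up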

Rosalind - msg:3393580 - 12:12 pm on Jul 13, 2007 (gmt 0)

>As I understand it, when a suggestion arrives for a URL that has already been suggested, it replaces the older suggestion, so there are never any true duplicates. However, this doesn't prevent multiple deeplink suggestions.

One solution for this would be to disallow all deeplink submissions via the add-url form, but allow editors to add deeplinks. Then compile a list of exceptions to this rule for websites such as Blogger, Geocities, and other providers of free hosting. And for those exceptions, allow only one deeplink per account (so you would get only username.example.com, rather than username.example.com/something.htm, username.example.com/other.htm, etc).
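A minimal sketch of such a form-side rule, assuming a hand-maintained exception list for free hosts (the list, the URL handling, and the omitted one-deeplink-per-account bookkeeping are all illustrative):

    from urllib.parse import urlparse

    FREE_HOSTS = {"blogspot.com", "geocities.com"}  # illustrative exceptions

    def form_accepts(url):
        """Accept site roots; reject deep links unless a free-host sub-site."""
        parts = urlparse(url)
        host = (parts.hostname or "").removeprefix("www.")
        root_domain = ".".join(host.split(".")[-2:])
        if parts.path not in ("", "/"):
            return False    # page-level deep links are left to editors
        if host != root_domain and root_domain not in FREE_HOSTS:
            return False    # subdomain deep link on an ordinary site
        return True         # site root, or username.freehost.com

    assert form_accepts("http://username.blogspot.com/")
    assert not form_accepts("http://username.blogspot.com/page.htm")
    assert form_accepts("http://www.example.com/")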

The Contractor - msg:3393810 - 3:30 pm on Jul 13, 2007 (gmt 0)

>As I understand it, when a suggestion arrives for a URL that has already been suggested, it replaces the older suggestion, so there are never any true duplicates.

When I made the statement about duplicates, I was referring to a URL/site that is already listed, or sitting in an unreviewed queue, in another category. If the site is already listed in a topical category and a regional category (if relevant), there is very little hope that the submission will be approved by an editor (in most categories). Why allow such submissions to clog up the unreviewed queue?

This could easily be handled as Rosalind describes.

hutcheson - msg:3394934 - 5:33 am on Jul 15, 2007 (gmt 0)

>When I made the statement about duplicates, I was referring to a url/site that is already listed in or sitting in an unreviewed in another category.

"Site submittal status reports" are the same thing whatever fancy language you wrap them in. They've been discussed often before. Spammers all want them desperately. People who aren't follow the submittal policy, are always ready to swear they'll follow it religiously henceforthward, if only ... Well, one in a hundred may be telling the truth, but most of them demonstrably aren't. (So, of course, none of them are, or ought to be, believed.)

And people who just follow the submittal policy have no possible use for suggestion status.

And of course editors have no use for that information -- if we're determined to add a site to a category, it really doesn't matter whether it was suggested or not. And if we haven't determined to add the site, then it still doesn't matter whether it was suggested or not.

hutcheson - msg:3394935 - 5:40 am on Jul 15, 2007 (gmt 0)

>The simplest answer would be to profile submitters.

The simple fact is, if profiling submitters helped find good sites, we'd already be using it to help find good sites. (It's been discussed often enough internally!) The problem is, the number of submitters with enough good suggestions to matter is too small to matter. When we discuss it, there's always the first question: "Have you ever seen a good submitter? How can you recognize one?" And ... there's no pattern, probably because there are simply too few examples to spot any patterns that might exist.

gpmgroup - msg:3395138 - 3:39 pm on Jul 15, 2007 (gmt 0)

When you first look at a huge spam queue there are often no patterns. If everything pours into one place, then even browsing the queue is soul-destroying, never mind working on it.

What I was suggesting is a new class of account for webmasters/archivists/academics/collectors etc. They could log into their submitter accounts much the same way editors can, and suggest sites relating to their passion(s).

These accounts would feed a queue which editors could browse and select from.

A submitter with a real interest in a subject or subjects would quickly become easily identifiable by the quality of the sites they suggest. Anyone using their account to spam would be quickly identified too, and anyone using a different account for each submission would be no better off than the current system.

You could also use the account just to say thank you to the submitters - a "hey, that's a great list of sites" type of thank you would often go a long way toward getting another list from people with a real passion for a subject :)

Rosalind - msg:3395285 - 8:12 pm on Jul 15, 2007 (gmt 0)

It would be fairly easy to measure the percentage of approvals, and give good submitters priority. The only problem would be that all of the good submitters would have to start out on the slushpile, with the assumption that they were spammers, because what spammers would do is register for a new account every time. So whilst submitters who managed to become known quantities would get their submissions viewed much more quickly, for anyone new (the majority) it wouldn't be any better than before.
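A minimal sketch of that priority scheme, including the cold-start default that keeps unknown accounts on the slushpile (all names and data shapes are hypothetical):

    def priority(account, approved, submitted):
        """Fraction of past suggestions accepted; new accounts score 0."""
        total = submitted.get(account, 0)
        if total == 0:
            return 0.0    # unknown quantity: starts on the slushpile
        return approved.get(account, 0) / total

    submitted = {"regular": 50, "newbie": 0}
    approved = {"regular": 30}
    queue = ["newbie", "regular"]
    queue.sort(key=lambda a: priority(a, approved, submitted), reverse=True)
    print(queue)   # known-good submitters get reviewed first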

hutcheson - msg:3395919 - 3:53 pm on Jul 16, 2007 (gmt 0)

gpmgroup, I've spent quite a bit of time (over the past few years) thinking about something like what you suggest. But ... we have to face the facts: it's a solution looking for a problem.

What do you call someone with a passion about a subject, who is capable of earning the trust of the ODP volunteers?

WE call them "editors." They may also do other things, but if they show that they can be trusted, and they're doing a significant amount of work, then we don't want them "suggesting sites more efficiently from the outside"; we want them "just adding the sites."

And there simply aren't enough people who can't (or won't) be editors, and yet still fit the profile of "people who can be trusted to give good information of the kind the ODP wants."

So, you might as well spend your time figuring out how Martians can use the Mars Rovers' communications systems to access dmoz.org. You may come up with a clever, brilliant, or even elegant solution. But you can't find a problem to go with it.

The Contractor - msg:3396027 - 5:19 pm on Jul 16, 2007 (gmt 0)

"Site submittal status reports" are the same thing whatever fancy language you wrap them in. They've been discussed often before. Spammers all want them desperately. People who aren't follow the submittal policy, are always ready to swear they'll follow it religiously henceforthward, if only ... Well, one in a hundred may be telling the truth, but most of them demonstrably aren't. (So, of course, none of them are, or ought to be, believed.)

I can only say comments like that never cease to amaze me...hehe

Who fricken cares about site submission status... I certainly don't, and anyone who knows better shouldn't either. I have not suggested a single site to DMOZ in at least a couple of years...

I'm merely pointing something out, it can fall on deaf and denying ears for all I care (as it has).

gpmgroup - msg:3396243 - 9:02 pm on Jul 16, 2007 (gmt 0)

>gpmgroup, I've spent quite a bit of time (over the past few years) thinking about something like what you suggest. But ... we have to face the facts: it's a solution looking for a problem.

There is a huge problem with DMOZ.

Let's take an example site.

We suggested a site over 4 years ago. The site is for a company over 100 years old, which has 11 offices, employs over 100 people, and manages 750,000 acres of the UK. (There is no advertising, no mirrors, no affiliates: a single site for a single UK company, with a .co.uk domain registered in 1999. The site is currently averaging 1 million visitors a year.)

Nearly four years ago, six months after submission, we were told the site had been moved by a “Meta editor concentrating primarily on the UK sections of the directory” to the category above the one we submitted to.

We waited and waited, quietly keeping an eye on the categories from time to time.

Earlier this year, after the DMOZ rebuild from the crash, we followed your advice here in this forum and resubmitted to the higher category. (Even though I personally still feel the site sits more naturally in the category below, which we originally submitted to. But hey, if the “Meta editor concentrating primarily on the UK sections of the directory” says it should be in the category above….)

The categories concerned have both been edited since that submission, but our client’s site has not been listed. (Both categories are smaller than they were in 2006 [18:17 & 29:28].)

Now that leaves several possibilities:

1) DMOZ is inundated with submissions and/or spam and doesn’t have time to add valuable sites in some areas.
2) The site is not good enough to be added to DMOZ.
3) There are not enough editors.
4) The editors may have a personal interest in ensuring some sites cannot be listed.
5) The editors have a personal interest in ensuring some categories remain small.

3 is unlikely, as both categories have been edited on several occasions, and it was refuted in several threads including the recent one on DMOZ getting smaller [webmasterworld.com...]

4 and 5 wouldn’t matter for private directories, but for DMOZ they are very important, because they are at odds with the social contract and the principles of open source.

>We will make every effort to build a high quality and comprehensive directory. We will make every effort to evaluate all sites submitted to the directory. However, we do not guarantee all submitted sites will get listed. We will be highly selective and judicious about sites we add, and how we organize them

If 4 & 5 are occurring it is not acceptable to pretend that they are not occurring and use 1 as an excuse to cover this up as this would be extremely misleading to companies who rely on DMOZ to provide data to base their algorithms on.

I therefore assumed 1 to be the case and I find it puzzling when you write

>gpmgroup, I've spent quite a bit of time (over the past few years) thinking about something like what you suggest. But ... we have to face the facts: it's a solution looking for a problem.

hutcheson - msg:3396272 - 9:34 pm on Jul 16, 2007 (gmt 0)

>DMOZ is inundated with submissions and or spam and doesn’t have time to add valuable sites in some areas.

You seem to be assuming that DMOZ _has_ a global concept of "valuable site". But it doesn't. It has a global concept of "listable site", and several thousand different personal concepts of "interesting subject."

And, historically speaking, "industrial" categories have been over-represented in webmaster forums vis-a-vis volunteer editor concerns (you might, although I wouldn't, say they are under-represented in volunteer editor interests vis-a-vis the typical webmaster forum participant).

So it's not a matter of "too few" editors in the ODP in general. It's simply a matter of too few surfers interested in that particular topic. And the presence of spam is not necessarily an issue -- a topic may be intrinsically uninteresting, or it may be rendered boring by the presence of large volumes of spam. But remember, "uninteresting" and "interesting" are concepts defined by each editor.

And you can't bank on "ODP time" because there isn't any such thing. There is only "individual editor time." The ODP is the union of individual editor interests.

This explains why 4 and 5 are not merely wrong -- they are inoperative. There literally aren't and can't be any common editor interests in that sense. The largest group of editors are American, and couldn't possibly have an interest in making sure that a British manufacturer isn't listed.

-----------------

But, in any case, I think you missed my whole point. Nobody disputes that there are listable sites, that have been suggested, and the suggestions are not always looked at immediately. But your proposal doesn't and wouldn't affect that fact at all.

What you proposed was that suggestions from TRUSTED suggestors would receive some sort of priority (or potential priority). The problem was not that there are no good suggestions. The problem is that there is no class of suggestors that can be procedurally "trusted" -- that is, there is no way of building trust other than the time-immemorial way: by doing something right over and over until people figure out that you always do it right. And ... people who suggest enough sites to gain that kind of trust (remember, there is no OTHER kind of trust!) can qualify to become editors.

So, there is no need for ANOTHER mechanism to represent the existence of established trust. We have the "editor" mechanism.

gpmgroup - msg:3409052 - 12:02 am on Jul 31, 2007 (gmt 0)

>And, historically speaking, "industrial" categories have been over-represented in webmaster forums vis-a-vis volunteer editor concerns (you might, although I wouldn't, say they are under-represented in volunteer editor interests vis-a-vis the typical webmaster forum participant).
>So it's not a matter of "too few" editors in the ODP in general. It's simply a matter of too few surfers interested in that particular topic.

This particular site isn’t an “industrial” site – in fact it has articles on very big widgets ;-) that most Americans try to see on their tour of the UK, and that many if not most people dream of owning at some point in their life.

So to try and use FUD in this way to dismiss the failings of DMOZ is scraping the bottom of the barrel somewhat. :)


>And the presence of spam is not necessarily an issue -- a topic may be intrinsically uninteresting, or it may be rendered boring by the presence of large volumes of spam. But remember, "uninteresting" and "interesting" are concepts defined by each editor.

If there were sufficient editors you would expect these inadequacies to be ironed out.

>But, in any case, I think you missed my whole point. Nobody disputes that there are listable sites, that have been suggested, and the suggestions are not always looked at immediately. But your proposal doesn't and wouldn't affect that fact at all.

Has to be the understatement of the year: “and the suggestions are not always looked at immediately” cough ;) Four and a half years :) That's like 30 years in the non-internet world LOL

>The problem was not that there are no good suggestions. The problem is that there is no class of suggestors that can be procedurally "trusted"

This shows a very closed-minded approach. It isn’t about trusting anything implicitly or procedurally; it’s about establishing possible trust statistically by grouping submissions from the same source together.

And, in doing so, providing a more efficient use of editors’ time to find groups of sites.

>So, there is no need for ANOTHER mechanism to represent the existence of established trust. We have the "editor" mechanism.

I have several interests, so I looked at the categories for which I know the subjects inside out. These subjects are diverse; I could not realistically apply to be an editor for each of those sections of the tree. I could easily add 10-20 sites to several of those categories (none of which I have any affiliation with, nor do I have websites in those fields).

Out of the sites listed in DMOZ across all of those categories, I found one site worth bookmarking that I had not seen before. In all of those categories, it would be fair to say, many of the DMOZ entries were dated and/or academically lightweight.

This isn’t a reflection on the quality of editors; it is simply the breadth of subject matter that editors who oversee large sections of the tree need to have in order to add quality sites. Without help they just cannot hope to ever find the quality of sites needed to make DMOZ a first or even second point of search for most surfers.

Even with the 75,151 editors quoted on the front page (a figure that could perhaps be considered misleading), that’s 8 categories each; with a perhaps more realistic 6,000 editors, that’s almost 100 categories each. (Assuming more than 5,900 are actually interested in editing more than a handful of categories :))

hutcheson - msg:3409092 - 1:13 am on Jul 31, 2007 (gmt 0)

>>And, historically speaking, "industrial" categories have been over-represented in webmaster forums vis-a-vis volunteer editor concerns (you might, although I wouldn't, say they are under-represented in volunteer editor interests vis-a-vis the typical webmaster forum participant.
So it's not a matter of "too few" editors in the ODP in general. It's simply a matter of too few surfers interested in that particular topic.

>This particular site isn’t an “industrial” site – In fact it has articles on very big widgets ;-) that most American’s try to see on their tour of the UK, and many if not most people dream of owning at some point in their life.

>So to try and use FUD in this way to dismiss the failings of DMOZ is scraping the bottom of the barrel somewhat. :)

And, of course, compared to the ODP, the MARKETING industry is of all industries MOST over-represented in forums such as these.

>And the presence of spam is not necessarily an issue -- a topic may be intrinsically uninteresting, or it may be rendered boring by the presence of large volumes of spam. But remember, "uninteresting" or "interesting" are concepts defined by each editor

>If there were sufficient editors you would expect these inadequacies to be ironed out.

The silliness of this can be seen by turning it around. If there were "sufficient webmasters", would MY full range of interests be represented, or would we just see more spam? So far as I can see, it's clearly the latter.

Or, more abstractly, the apparent difference in interests between the cooperative volunteer communities and the professional marketers is not an artifact of sample size: it represents a genuine social divide.

>>The problem was not that there are no good suggestions. The problem is that there is no class of suggestors that can be procedurally "trusted"

>This shows a very closed mind approach. It isn’t about trusting anything implicitly or procedurally, its about establishing possible trust statistically by "grouping submissions from the same source to together”

Remember, it takes a certain number of data points to establish any statistical validity. You're assuming that there do exist single sources that provide enough submissions to establish trust. And there aren't. So it's not a matter of procedural limitations. If your mythical super-submitter existed, we could consider how we might spot him most efficiently. But he simply doesn't exist. So the greatest efficiency gains come from not wasting any resources looking for him.

>I have several interests so I looked at the categories for which I know the subjects inside out. These subjects are diverse; I could not realistically apply to be an editor for each of those sections of the tree. I could easily add 10 – 20 sites to several of those categories (Non of which I have any affiliation to or even have websites in those fields)

Not so. If you could add 10-20 sites to even one category, you'd have a track record for applying to a second category. If you chose the first two categories to be "close together" taxonomically, the next step might well be the "common parent category" -- and from there, perhaps a medium-sized category that wasn't so close, or a remote small category.

And, because effective editing in one focussed category will lead to large numbers of sites that belong in "the next category over" (or to deep-content sites that can be deeplinked around the tree), and because INSIDE site submissions ARE tracked for reputation purposes, a good editor can easily build a reputation leading to high-level permissions, even if he only edits in fairly focussed areas. A typical editall might have only 20-30 category applications on his log.
