
Directories Forum

DMOZ ODP data real numbers
This is sad.
Marcos




msg:487389
 12:37 am on Nov 7, 2002 (gmt 0)

- Number of sites/URLs in the ODP dump:
3,432,089

- Number of non-duplicated sites/URLs in the ODP dump:
3,122,622

- Number of unique domain names:
1,884,225

- Around 10% of those domain names are redirections to spam sites, affiliates, and link farms. We think that a similar number of ODP sites may actually be expired domain names, re-registered by spammers.

The existing ODP data needs some serious work, and the domain count is far too low: there are around 39 million registered domain names out there, waiting to be classified.

This is sad.
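For readers who want to reproduce counts like these, here is a minimal sketch against the public RDF content dump. The filename, the ExternalPage attribute pattern, and the www-stripping are all assumptions about the dump layout, not necessarily how these figures were produced:

```python
# Count total listings, deduplicated URLs, and unique domains in an
# ODP RDF content dump. Assumes a local, decompressed copy; the
# filename and the attribute regex are guesses at the dump layout,
# and one ExternalPage tag per line is assumed.
import re
from urllib.parse import urlparse

DUMP = "content.rdf.u8"  # hypothetical local copy of the dump
page_re = re.compile(r'<ExternalPage about="([^"]+)"')

total = 0
unique_urls = set()
domains = set()

with open(DUMP, encoding="utf-8", errors="replace") as fh:
    for line in fh:
        m = page_re.search(line)
        if not m:
            continue
        total += 1
        url = m.group(1)
        unique_urls.add(url)
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]  # crude: treat www.foo.com and foo.com as one
        if host:
            domains.add(host)

print(f"listings:       {total}")
print(f"unique URLs:    {len(unique_urls)}")
print(f"unique domains: {len(domains)}")
```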

 

hurlimann




msg:487390
 12:47 am on Nov 7, 2002 (gmt 0)

>There are around 39 million registered domain names out there, waiting to be classified.

You presume DMOZ should list all URLs. That is not its aim. Many domains are not worth listing.

>Around 10% of those domain names are redirections to spam sites and link farms.

Where did you get that figure? I doubt it is that high, and as the directory is edited by humans, most spam never gets in.

Even the backup crawler available to DMOZ editors sniffs out any sort of redirection and tags it for further investigation.

heini




msg:487391
 12:48 am on Nov 7, 2002 (gmt 0)

Ooh - now wait - I wouldn't want a directory with 39 million domains to wade through... ;)

Added: It has to be said, however, that some areas are way underrepresented in DMOZ.

[edited by: heini at 12:50 am (utc) on Nov. 7, 2002]

Bobby_Davro




msg:487392
 12:50 am on Nov 7, 2002 (gmt 0)

But what percentage of those 39M domains have useful content? I know that many people have problems getting their sites listed in the ODP, but I suspect that many more are added within a reasonable timespan.

I frequently go looking for domain names for clients/new ventures and there are a ridiculous number of registered domains with nothing on them. Even worse are those domains bought in bulk that redirect to an email or hosting company.

I bet that there aren't actually that many quality sites that aren't already listed.

skibum




msg:487393
 1:32 am on Nov 7, 2002 (gmt 0)

There is content worth listing on about 15 percent of the domains owned here.

rcjordan




msg:487394
 1:49 am on Nov 7, 2002 (gmt 0)

I own about 300 domains, only 35 or so have content, another 10 are "service" domains (rate cards, how to advertise, contact info, etc). The rest aren't activated.

europeforvisitors




msg:487395
 2:18 am on Nov 7, 2002 (gmt 0)

I bet that there aren't actually that many quality sites that aren't already listed.

I think there are. Not all quality sites are created by people who know about DMOZ, who understand DMOZ's relationship with Google, or who are interested enough in chasing traffic to submit their sites.

Remember, a "quality site" may simply be a Ph.D. thesis or a collection of highly specialized links that someone has posted on the Web--often in a subdirectory of a university department's Web site. Such sites may not even have been created with the public in mind, but they're extremely valuable to people who are searching for information on certain topics.

Dante_Maure




msg:487396
 3:09 am on Nov 7, 2002 (gmt 0)

There are around 39 million registered domain names out there, waiting to be classified.

newly registered domain names per day: 39,900
(27.7 newly registered domain names per minute)
Source: WhoIsReport, Aug 2002

You think DMOZ should be trying to keep up with that? lol

A solid directory's goal is to be a valuable resource to its users by offering up valuable on-topic content - not to keep up with the number of domain names registered or websites launched.

If another directory is going to come along and make a dent in the Yahoo/DMOZ game, it's not going to be because of more indexed pages.

It's going to be due to an even higher standard in editorial policy and better maintenance of its existing links.

Marcos




msg:487397
 4:00 am on Nov 7, 2002 (gmt 0)

Hi hurlimann,

>You presume DMOZ should list all URLs.

Oh, not at all, although it would make a lot of SEO guys happy :)

I'm just pointing out that 1,884,225 vs. 39,000,000 is quite a difference. Does every single site "deserve" to be listed? Maybe not. Does every single business deserve an entry in the Yellow Pages? Maybe. Where is the balance? Who gets in, and who does not? Well, not many, it seems ;)

>>>Around 10% of those domain names are redirections to
>>>spam sites and link farms.
>Where did you get that figure? I doubt it is that high, and
>as it is edited by humans, most spam never gets in.

You just need to follow the redirections. Using an unmarked crawler you'll find about 150,000 of them. Sometimes it is just a Geocities "Sorry, this page no longer exists" page, and sometimes it is a completely different domain name, unrelated to the original listing.

Spam only gets in when the editor is the spammer, that's right. But after the editing is done, time passes, domain names expire, and new owners register them. Apparently spammers use the ODP data to find expiring domain names. Those domains get instant traffic and a good Google listing. No big deal; you should know about that.

Humans do it better, but only once, apparently. It seems that not all of them re-edit every single site monthly, hunting for sites that change ownership. That is no surprise, I guess.
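For illustration, a minimal version of that redirection check, assuming stock urllib, a generic browser User-Agent, and invented helper names (the actual crawler used is not described in the thread):

```python
# Follow each listed URL and flag listings whose final destination
# lands on a different domain -- the pattern of an expired,
# re-registered domain now redirecting elsewhere.
import urllib.request
from urllib.parse import urlparse

def bare_host(url):
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host

def redirect_target(url, timeout=10):
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.geturl()  # final URL after any redirects
    except OSError:
        return None  # dead site, timeout, DNS failure, etc.

def flag_offsite_redirects(listed_urls):
    for url in listed_urls:
        final = redirect_target(url)
        if final and bare_host(final) != bare_host(url):
            yield url, final  # listing now points at a different domain
```

Fed the URL list extracted from the dump, a loop like this would produce the kind of off-domain redirect census described above.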

europeforvisitors




msg:487398
 4:13 am on Nov 7, 2002 (gmt 0)

Spam only gets in when the editor is the spammer, that's right.

Or if the editor is new and hasn't learned to distinguish spam from legitimate sites.

I'm a DMOZ editor with a very small category, and I count myself lucky that I get hardly any spam submissions. It must take a very special kind of person to work in a category that attracts a never-ending flood of spam submissions - probably somebody who enjoys the challenge of separating the wheat from the chaff!

Side note: I'm starting to see an increase in spam submissions to my own site. I make it very clear on my "Submit a URL" page that I don't link to accommodations sites or affiliate sites, but a surprising number of Webmasters ignore my warning and submit anyway. I'm really getting fed up.

rfgdxm1




msg:487399
 4:46 am on Nov 7, 2002 (gmt 0)

>The existing ODP data needs some serious work, and the domain count is far too low: there are around 39 million registered domain names out there, waiting to be classified.

And how many of those 39 million registered domain names ever bothered to apply to the ODP? And how many deserve inclusion? Admittedly the ODP does have some serious problems, mostly in the commercial cats that get spammed heavily. Who wants to slog through all of that for no pay? I'm a low-level ODP editor in a non-commercial cat that I can't imagine will ever have a spam problem. And I sure as heck wouldn't even want to consider being an editor of a heavily spammed cat.

Brett_Tabke




msg:487400
 11:10 am on Nov 7, 2002 (gmt 0)

I bet that there aren't actually that many quality sites that aren't already listed.

I completely agree. You could slice 30% of the ODP URLs out and it would be a better directory for it. I'd like to see the ODP start regularly shutting off submissions at random intervals.

There are categories that are trying to list every conceivable tidbit of info around a topic. That's nonsense. It's not a search engine, it's a directory!

There has to be some quality control. Editors are far more worried about whether a listing needs a period or a comma than about whether the site needs to be in there in the first place.

Also, I would love to see a Yahoo-style "what's new" page so we could track bonehead additions a little better.

rfgdxm1




msg:487401
 11:27 am on Nov 7, 2002 (gmt 0)

>I completely agree. You could slice 30% of the odp urls out and it would be a better directory for it. I'd like to see the ODP start regularly shutting off submissions at random intervals.

Shutting off submissions at random intervals sounds like a bad idea. However, I do agree that selectivity makes sense for a directory, although I've got to wonder what logic can be used for selectivity WRT commercial sites. Basically they are all the same: they are shilling something.

Dumpy




msg:487402
 4:03 pm on Nov 7, 2002 (gmt 0)

It appears that DMOZ is caving in on itself by being too successful.

ODP unreviewed: 984,104 sites

How can this problem EVER be resolved?

Perhaps everyone should be made an editall.



Bobby_Davro




msg:487403
 4:24 pm on Nov 7, 2002 (gmt 0)

I think the obvious solution is to pay editors for each review. Charge commercial submissions a $12 fee and use that to subsidise the non-commercial submissions. Pay an editor $4 per review. That way, two non-commercial reviews are paid for every time a business submits a site.

$4 per review is a pretty good amount given the time it takes to review a site for inclusion in the ODP. I bet the review rate would soar if there was some money involved.

rogerd




msg:487404
 4:37 pm on Nov 7, 2002 (gmt 0)

I agree that paid reviews would cut the backlog in a hurry (in conjunction with expanding editing privileges for proven editors). I think preventing abuse might be a challenge, though. A hungry editor could blast through a hundred submittals with a cursory glance and minimal edits, netting a tidy paycheck but not helping the directory much. Something like this happened with the old GoGuides directory, where your editing powers increased as you processed more sites - some editors seemed to go on a volume binge.

Marcos




msg:487405
 5:20 pm on Nov 7, 2002 (gmt 0)

Hi Brett,

>There are categories that are trying to list every
>conceivable tidbit of info around a category. That's
>nonsense.

I beg to disagree. If you are looking for, say, a restaurant in Las Vegas, you would like to have as many as possible listed and categorized. Who decides that the small mom-and-pop place you like does not deserve a listing? The editor? Well, maybe he or she just doesn't share your taste for home-made food, and you would like to find the place next to COMDEX, wouldn't you? A search engine will not provide such info. That's the directory's job.

In most categories, the more sites indexed and RANKED the better. Sometimes you will feel like following the "Editor's Choice"; sometimes you will not.

Quality sites are ranked first. Low-quality sites are at the bottom. That's easy. But what about the spam, you'll say. We think spam MUST be categorized... as spam.

You may think a spam or affiliate-sites category is worthless. Think again. An XML file containing undesirable sites has a lot of value. Actually, it would be priceless for email filters, corporate and personal firewalls, censorware, and even as a do-not list for search engines. Such a category would be a worthy tool for a number of us tech folks.
We do believe categories like "popup from hell websites" or "worthless adult mirror sites" have a place in the directory world. :)
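One possible shape for such a feed, with every element and attribute name invented purely for illustration:

```xml
<!-- Hypothetical "categorized spam" feed; all names are invented. -->
<blocklist generated="2002-11-07">
  <site url="http://example.com/"
        category="popup-from-hell"
        reason="redirects to an affiliate link farm"/>
  <site url="http://example.net/"
        category="expired-domain-squat"
        reason="domain re-registered; original content gone"/>
</blocklist>
```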

rafalk




msg:487406
 6:08 pm on Nov 7, 2002 (gmt 0)

984,104 sites

That figure is grossly inaccurate to say the least.

Marcos




msg:487407
 6:09 pm on Nov 7, 2002 (gmt 0)

>ODP unreviewed: 984,104 sites

Where did you get that from?

Dumpy




msg:487408
 6:26 pm on Nov 7, 2002 (gmt 0)

I was given the following as of Nov. 5, 2002:

Root Level Categories (ranked by unreviewed sites)

Rank  Group  Category         Total Sites  Unreviewed      %
  1    1a    World                848,729     264,903    31%
  2    2a    Business             223,428     186,631    84%
  3    3a    Arts                 288,406      84,834    29%
  4    3b    Bookmarks            222,099      80,321    36%
  5    4a    Shopping             101,915      68,956    68%
  6    4b    Regional             691,453      58,868     9%
  7    5a    Adult                106,096      52,665    50%
  8    5b    Society              130,251      43,605    33%
  9    5c    Recreation            87,389      32,467    37%
 10    6a    Health                62,701      18,169    29%
 11    6b    Science               81,089      15,229    19%
 12    7a    Sports                83,285      13,020    16%
 13    7b    Reference             57,105      11,676    20%
 14    7c    Games                 48,280      11,421    24%
 15    7d    Home                  36,149       9,594    27%
 16    7e    News                  47,647       5,357    11%
 17    7f    Kids and Teens        14,790       2,725    18%
 18    8a    Complete ODP           1,269         646    51%
 19    8b    Test                      86         383   445%
 20    8c    Computers                 54           5     9%

Total unreviewed: 984,104

Marcos




msg:487409
 6:34 pm on Nov 7, 2002 (gmt 0)

>I was given the following as of Nov. 5, 2002:

Damn!

P.S.: You don't want to know.

rogerd




msg:487410
 7:07 pm on Nov 7, 2002 (gmt 0)

One might think this backlog would create a sense of urgency... but I'm sure there is a legitimate concern that the damage done by bad editors would take much longer to correct than it would take a better editor to work through the backlog a month later.

Bobby_Davro




msg:487411
 9:18 pm on Nov 7, 2002 (gmt 0)

rogerd,
Yes, paying per review may cause problems with editors dealing with submissions en masse, but like any business you would have to have some quality checking. People not doing their job properly get the sack and don't get paid, as long as you make the quality of their reviews a requirement for payment.

jmccormac




msg:487412
 10:14 pm on Nov 7, 2002 (gmt 0)

In msg #2 (a reply to Marcos' post #1) hurlimann posted:
">There are around 39 million registered domain names out there, waiting to be classified.

You presume DMOZ should list all URLs. That is not its aim. Many domains are not worth listing.
"

Traditionally the .com/.net/.org figures have had about a 75% delegation rate. My own work on the .info domain during the summer produced results with roughly the same percentage: of the domains registered, about 75% will be properly set up, with nameservers responding and proper SOAs. The number of active websites drops considerably from that 75%. The number with active, useful, and continually updated content could be in the region of 30%.

">Around 10% of those domain names are redirections to spam sites and link farms.

Where did you get that figure? I doubt it is that high, and as it is edited by humans, most spam never gets in."

I'd have to agree with Marcos on this one. Link farms, or link swamps, have certain characteristics that allow you to identify them very quickly. For example, one well-known domain squatter has easily identifiable nameservers. Thus, if the DNS for a particular domain listed in Dmoz points to these nameservers, the domain has been snatched after it expired. This particular company has a robots.txt that excludes all spiders, so Robozilla, being a polite spider, will not reindex the new, cybersquatted website, and the Dmoz entry stands. This is a problem that Dmoz has yet to solve, and it could be solved fairly easily. I'll run some basic link swamp checks on the Dmoz domains tonight and post the results.
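For illustration, those two checks could look something like the sketch below. The squatter nameserver names are placeholders (the real ones are deliberately not reproduced), and the `host` utility is assumed to be installed:

```python
# Two checks on a Dmoz-listed domain: (1) does its DNS now point at
# a known squatter's nameservers, and (2) does its robots.txt block
# all spiders (so a polite crawler like Robozilla never re-reads it)?
import subprocess
import urllib.robotparser

# Placeholder list; a real check would carry known squatter NS names.
KNOWN_SQUATTER_NS = {"ns1.squatter-example.com", "ns2.squatter-example.com"}

def nameservers(domain):
    # Shell out to `host -t ns`, as an editor might; output lines look
    # like "example.com name server ns1.example.net."
    out = subprocess.run(["host", "-t", "ns", domain],
                         capture_output=True, text=True).stdout
    return {line.rsplit(" ", 1)[-1].rstrip(".").lower()
            for line in out.splitlines() if "name server" in line}

def blocks_all_spiders(domain):
    rp = urllib.robotparser.RobotFileParser(f"http://{domain}/robots.txt")
    try:
        rp.read()
    except OSError:
        return False  # no robots.txt reachable; treat as not blocking
    # If a generic crawler may not fetch the root page, a polite
    # spider will never see the replacement content either.
    return not rp.can_fetch("Robozilla", f"http://{domain}/")

def looks_snatched(domain):
    return bool(nameservers(domain) & KNOWN_SQUATTER_NS) or blocks_all_spiders(domain)
```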

The human editors in Dmoz mostly do a good job. However, concentrating on directory spam tends to allow the directory to be polluted via the backdoor in the way these link swamps do it.

"Even the backup crawler available to DMOZ editors sniffs out any sort of redirection and tags it for further investigation."

Too much faith in technology is a bad thing. :) I haven't seen the code for Robozilla, and I don't know if it is smart enough to detect subtle changes. Besides, if the robots.txt of the site being spidered blocks all spiders, then Robozilla will not spider it. In another thread, someone mentioned that Robozilla had a two-month cycle. This would potentially allow a cybersquatter three months of usage out of the domain. A lot of the newer registrars are on a 45-day deletion of unpaid domains, and as a result the two-month cycle is too long to reliably detect changes. It is possible to design linkswamp detection into the structure of Dmoz. Actually, it is a very easy thing to do, and it would allow link swamps to be detected and eliminated automatically. However, Dmoz would have to choose a Day 0 for the index and proceed from there. From the delayed updates, it appears that more important things, such as the structure of Dmoz, come first.

I am not a Dmoz editor but I do run a few search engines/directories and domain usage analysis is part of what I do on a daily basis. (Luckily the .ie domain is so small - otherwise I could not find the time to follow these threads. :) )

Regards...jmcc

Marcos




msg:487413
 11:33 pm on Nov 7, 2002 (gmt 0)

Hi jmccormac,

>A lot of the newer registrars are on a 45-day deletion of
>unpaid domains, and as a result the two-month cycle is too
>long to reliably detect changes.

Good point. That's probably the easiest way to catch them.

>I haven't seen the code for Robozilla and I don't know if
>it is smart enough to detect subtle changes.

Actually, it is a very difficult task. Many of those spam sites behave differently when accessed by Robozilla, Googlebot, and Mozilla. You can't just use a slightly impolite robot. If you really want to know what is going on, you need a truly rogue crawler, one that can impersonate Googlebot, Robozilla, and an XP user, and compare the results. Unfortunately, we are not allowed to do that, because such a crawler could be infringing on others' intellectual property. That's why we have to "guess" at the 10% number using less efficient methods, and that's why Robozilla can't smarten up that much.
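Purely to illustrate the comparison involved, such a check might look like the sketch below. Every User-Agent string here is a placeholder (Robozilla's real one included), and dynamic pages would produce false positives, so a hit is a flag for human review, not a verdict:

```python
# Fetch the same URL under several User-Agent strings and compare
# body hashes; differing bodies per agent is a cloaking signal.
import hashlib
import urllib.request

USER_AGENTS = {
    "browser":   "Mozilla/5.0 (Windows NT 5.1)",  # an "XP user"
    "googlebot": "Googlebot/2.1 (+http://www.google.com/bot.html)",
    "robozilla": "Robozilla/1.0",                 # placeholder string
}

def body_hash(url, ua, timeout=10):
    req = urllib.request.Request(url, headers={"User-Agent": ua})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return hashlib.md5(resp.read()).hexdigest()
    except OSError:
        return None

def looks_cloaked(url):
    hashes = {name: body_hash(url, ua) for name, ua in USER_AGENTS.items()}
    # More than one distinct body among successful fetches -> flag it.
    return len({h for h in hashes.values() if h}) > 1
```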

hurlimann




msg:487414
 12:10 am on Nov 8, 2002 (gmt 0)

>Too much faith in technology is a bad thing.

Agreed, jmccormac, and I don't - but I was amazed at the tools available to DMOZ editors.

Robozilla has its limitations, but become an editor and you can access a whole load of very clever third-party tools that address some of the issues you refer to.

Linkswamps are currently the problem of the month, but automatic linkswamp detection is hard.

Two examples:

A) A high-PR site is taken over, and one link and three keywords are added for PR value. Detect this and you detect all sites that are updated; the result would overwhelm the editors. Don't detect it and the spammer wins.

B) Brand A sells out to Brand D. Detect this, treat it as spam, and be sued.

One man's SEO is another's spam!

rfgdxm1




msg:487415
 2:50 am on Nov 8, 2002 (gmt 0)

>ODP unreviewed: 984,104 sites

>How can this problem EVER be resolved?

I brought this up in an internal ODP editors forum, and nobody seemed to have any real solutions. The problem as I see it is this: the ODP is a bunch of volunteer editors. This system can work fairly well for non-commercial cats, because (1) it is easier to find volunteers for a hobby interest, and (2) non-commercial cats don't tend to be overwhelmed by spam, so reviewing their submissions isn't much work. However, who wants to slog through a huge pile of unreviewed commercial submissions, many of which are just spam, for no pay? It ain't me, babe. There is a structural reason for the backlog of nearly a million sites in unreviewed.

jmccormac




msg:487416
 7:54 am on Nov 8, 2002 (gmt 0)

In msg #25 Marcos posted:
">A lot of the newer registrars are on a 45-day deletion of
>unpaid domains, and as a result the two-month cycle is too
>long to reliably detect changes.

Good point. That's probably the easiest way to catch them.
"

It should be simple enough, Marcos: the editor would just run 'host -t ns domname.com' and store the results as a check. It is easy enough to automate. When Robozilla is run, any website with changed nameservers could be flagged for attention.

The IP of a website could be stored as well, though this would cause problems with multi-homed sites. Linkswamps tend to show up easily because of the huge number of sites pointing to the same IP. However, some editorial oversight would be required here, because smaller webhosting companies can often have up to 200 websites on a single IP. The better way would be to use the DNS data along with the IP address as a set of rules to check whether the site is hosted on a known linkswamp. A sketch of what that automation could look like follows below.
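A minimal sketch of that snapshot-and-compare idea, again assuming the `host` utility is installed; the in-memory dict stands in for whatever store would sit alongside each RDF entry:

```python
# Store each listing's nameservers and IP at review time, then flag
# any nameserver change (expired-and-snatched candidate) and any
# suspiciously crowded IP (linkswamp candidate) on the next pass.
import socket
import subprocess

def ns_set(domain):
    out = subprocess.run(["host", "-t", "ns", domain],
                         capture_output=True, text=True).stdout
    return frozenset(line.rsplit(" ", 1)[-1].rstrip(".").lower()
                     for line in out.splitlines() if "name server" in line)

def snapshot(domains):
    snap = {}
    for d in domains:
        try:
            snap[d] = (ns_set(d), socket.gethostbyname(d))
        except OSError:  # DNS lookup failed; skip this domain
            continue
    return snap

def changed_since(old, new):
    # Yield domains whose nameservers changed between runs.
    for d, (ns, _ip) in new.items():
        if d in old and old[d][0] != ns:
            yield d

def crowded_ips(snap, threshold=200):
    # Group listings by IP; huge clusters suggest a linkswamp, though
    # mega-hosters can legitimately exceed the threshold, so a human
    # still has to look at the hits.
    by_ip = {}
    for d, (_ns, ip) in snap.items():
        by_ip.setdefault(ip, []).append(d)
    return {ip: ds for ip, ds in by_ip.items() if len(ds) >= threshold}
```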

The spam is a bit more difficult to handle. While the DNS/IP rules can help, it seems to be more of a moving target.

The rogue-spider idea is probably the best way to determine whether a site belongs to a spammer. However, it would need a lot of different IPs to be effective. One linkswamp company that I have just checked has 95,289 .com/.net/.org domains. I am comparing these domains with the websites listed in the Dmoz RDF, and a very definite pattern of hits is emerging.

Regards...jmcc
(About half-way through the A domains; so far, 172 hits have been detected in the Dmoz RDF.)

[edited by: jmccormac at 8:20 am (utc) on Nov. 8, 2002]

jmccormac




msg:487417
 8:16 am on Nov 8, 2002 (gmt 0)

In msg #26 hurlimann posted:
"
Linkswamps are currently the problem of the month, but automatic linkswamp detection is hard.
"
From a starting point they can be difficult to detect, hurlimann.

The sheer number of .com/.net/.org domains can appear to be an unmanageable problem when you see them all at once. :) But when you start to break them down, it becomes a lot clearer.

Once the characteristics of a linkswamp are established (the DNS data and website IP data), it becomes a lot easier to spot them. The key is to identify the patterns associated with linkswamp activity. The danger is getting linkswamps mixed up with mega-hosters.

"2 examples:

A) A high PR site is taken over and one link and three keywords are added for PR value. Detect this and you detect all sites that are updated. The result would overwelm the editors. Don't detect it and the spammer wins.
"

This would be the case when you only look at the webpage data. By using the DNS/WWW data to verify that it is still the same site, some of the false positives could be eliminated and the editors' jobs would be made easier. The problem though is that the data in Dmoz would have to be pre-processed/cleaned and a Day 0 chosen. Then all entries in Dmoz could have the relevant DNS/IP data stored with the entry data. The DNS/IP part is actually very simple to do - it would take some time to complete. The other aspect, cleaning the dataset, is the hard part.

"
b) Brand A sells out to Brand D. Detect this, treat as spam and be sued.
"

This would be flagged by the process above and it would be up to the editor to verify. Again it may not scale well on a directory wide basis but used with automatic linkswamp detection and various other DNS/IP rules that would be built up over time, it would be possible to handle.

It may be possible to state in the terms and conditions of submission that websites that are sold/transferred have to be resubmitted. (nasty legal escape clause. :) )

Regards...jmcc

Marcos




msg:487418
 11:44 am on Nov 8, 2002 (gmt 0)

Hi jmcc,

>(About half-way through the A domains; so far, 172 hits
>have been detected in the Dmoz RDF.)

Take it easy. It took us three weeks and five software versions ;)
