New DMOZ Search Database

created for those who need to check new listings

Dumpy

1:00 pm on Nov 26, 2002 (gmt 0)

As a courtesy I have put up a search engine database current as of this weekend. I will keep it up until DMOZ is able to crawl their own site.

Sticky mail me for the URL...

Brett_Tabke

5:54 pm on Nov 26, 2002 (gmt 0)

just post the url to the direct download.

Dumpy

6:02 pm on Nov 26, 2002 (gmt 0)

Thank you Brett:

The New DMOZ Search Database is at:

[bbsorg.com...]

Nick_W

6:06 pm on Nov 26, 2002 (gmt 0)

Hmm... Not too keen on the flashing banner and the popup telling me my computer was in danger of attack, Dumpy ;(

Though I'm sure your program works well, I got a bit of a headache upon first visit ;)

nick

Dumpy

6:14 pm on Nov 26, 2002 (gmt 0)

Nick_w

I'm saving up to buy a castle in the south of France with 572 hectares surrounding it....I only have $2.3 million to go. Everyone is invited!

Pop-ups are once every 24 hours for each user...

Nick_W

6:15 pm on Nov 26, 2002 (gmt 0)

hehe, deal me in!

Nick

Brett_Tabke

6:21 pm on Nov 26, 2002 (gmt 0)

I thought you meant you had a new rdf dump!?

Dumpy

6:33 pm on Nov 26, 2002 (gmt 0)

Brett:

If Netscape published the software...I would have it! I use the RDF Dump every day....but not every day for over two months. I am giving thought to going to PLAN B....which will require a lot of crawling.

If the full url is a problem...you might remove the http portion and people could cut and paste.

steveb

11:21 pm on Nov 26, 2002 (gmt 0)

"As a curtesy I have put up a search engine database current as of this weekend. I will keep it up until DMOZ is able to crawl their own site."

I don't get it. It's just the same as the DMOZ search, not up to date.

Dumpy

1:33 am on Nov 27, 2002 (gmt 0)

Steveb

The DMOZ search database on their site was from the RDF dump of mid-September.

Since then many new sites and changes have taken place. If you want to take the time to drill down in the directory you will see some of them...and possibly find your site has been added. However, what if you were placed into a different category...it would be difficult to find it. I crawled the site this weekend and created the database as it exists NOW.

As of now it is the ONLY search database of DMOZ that is current. I'll wipe it out as soon as DMOZ is capable of creating its own database.

I haven't included the World directory, but if it is wanted I can do it in a few hours. Understand this effort is a trivial exercise of a simple program run by a dumb machine...I just play with my dog and watch the horse outside my window munch grass. One doesn't have to be the brightest bulb on the tree to click a mouse and run telnet.

Besides I am retired! Machines work, retired people don't.

jmccormac

5:18 am on Nov 27, 2002 (gmt 0)

Must have taken a while to spider the Dmoz site, Dumpy. :)

From what I can see, you are using Htdig to spider the site rather than using the RDF. The problem for all the directories and search engines picking up off Dmoz is that they use the RDF. It would be a non-trivial exercise to take the website data with the new additions and integrate them with the RDF. Though now that I think about it, a possible back of an envelope solution exists if the site html can be converted to RDF and thence to SQL and compared with the existing RDF data in SQL. The main problem though is bandwidth rather than storage. I'd guess the raw HTML (each dynamic page as a static webpage) from Dmoz would produce something in the region of 3GB of html. However for small scale branches (I was thinking of doing the Regional/Europe/Ireland branch) it is perfectly feasible on a limited bandwidth connection.

The first thing would be to write a parser for the HTML. The program would take the ExternalPage links from the Sept RDF and check for their existence in the current page. It would then check for ExternalPage links not in the Sept RDF page. If it finds any, it would parse the details into SQL and modify the relevant page details. (I wrote a set of parsers/SQL converters/static html page generator programs in March and I still have the notes somewhere.) The only thing for distribution would be to write an SQL to RDF converter to generate the structure and content RDFs.
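A minimal sketch of that comparison step, for illustration only. It assumes the ExternalPage links for one category have already been pulled out of the Sept RDF into a set, and that every absolute http link on the current page is a listing (which a real Dmoz page would not guarantee). The names `LinkCollector`, `diff_links`, and the sample URLs are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect absolute http(s) href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Relative links (category navigation) are skipped.
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.add(value)


def diff_links(rdf_links, page_html):
    """Return (still_listed, missing, new) relative to the RDF snapshot."""
    collector = LinkCollector()
    collector.feed(page_html)
    page_links = collector.links
    rdf_links = set(rdf_links)
    still_listed = rdf_links & page_links   # in the RDF and still on the page
    missing = rdf_links - page_links        # in the RDF, gone from the page
    new = page_links - rdf_links            # on the page, not in the RDF
    return still_listed, missing, new


# Hypothetical data standing in for one category page.
sample_rdf = {"http://example.com/", "http://example.org/old"}
sample_html = (
    '<ul>'
    '<li><a href="http://example.com/">Example</a></li>'
    '<li><a href="http://example.net/new">New site</a></li>'
    '<li><a href="/Regional/">Regional</a></li>'
    '</ul>'
)
kept, missing, new = diff_links(sample_rdf, sample_html)
```

Titles and descriptions would need a format-specific parser on top of this; the sketch only covers the link comparison.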

For some reason, all these programming problems seem trivial around 0500 Hrs. Hypnogogic states are great for delusions about the complexity of programming tasks. :)

Regards...jmcc

steveb

9:52 am on Nov 27, 2002 (gmt 0)

Dumpy,

None of the sites I know of that have been added since the RDF dump show in your search results.

steveb

9:59 am on Nov 27, 2002 (gmt 0)

<poof>

Now I do find them. Weird.

Dumpy

4:05 pm on Nov 27, 2002 (gmt 0)

I was just reading a message written a year ago by a top person at ODP, saying that on any given day 3000 to 3500 new sites are added to the directory.

Assuming the same condition exists today, 60 days @ 3000 would indicate 180,000 or more sites are added to the directory since the last updated RDF Dump.
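The estimate can be checked with a couple of lines. This is only the arithmetic from the figures quoted above (60 days, 3000 to 3500 additions a day); it counts gross additions and says nothing about removals. The function name is made up for the example.

```python
def estimated_additions(days, per_day_low, per_day_high):
    """Gross additions over a period, as a (low, high) range."""
    return days * per_day_low, days * per_day_high


low, high = estimated_additions(60, 3000, 3500)
print(f"{low:,} to {high:,} sites")
```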

These sites are not filtering through the internet...except by sites that actually crawl ODP.

jmccormac

5:19 pm on Nov 27, 2002 (gmt 0)

"Assuming the same condition exists today, 60 days @ 3000 would indicate 180,000 or more sites are added to the directory since the last updated RDF Dump."

It does not take into account the natural attrition rate of domain names and websites. Dmoz's method of relying on user submissions is perhaps a good way to limit the number of sites submitted. Also does the number of submissions actually equal the number of inclusions?

Regards...jmcc

g1smd

12:08 am on Nov 28, 2002 (gmt 0)

>> Also does the number of submissions actually equal the number of inclusions? <<

As mentioned by other people on this board over the last year or two: Some parts of the directory see 95% spam submissions, while other parts see 95% listable content. The rest see anything in between those figures.

multex

10:47 pm on Nov 29, 2002 (gmt 0)

My impression, based on a fairly large sampling, is that slightly over half the submissions are fairly obvious spam. (Of course, this does not mean most submitters are spammers; one jerk with a thousand submissions skews the average considerably.) Another, say, 5-10% are "fairly devious spam". Certain targets are spam-intensive. Some unregulated patent-nostrum drug and cosmetic concoction categories, and hotel directories are way over 99% spam, even after removing the duplicate submissions. The latter sites are mostly deceptive and deceitful, to boot. If you ever want an instant reputation as a cretin, a liar AND a jerk, all you have to do is promote a hotel-reservations site. It's even more certain than selling used cars, or going to law school.

Most of the rest of the sites are submitted to the wrong category, but a small majority are submitted to somewhere near the right place.

This is not as bad as it might sound: for most topics, the outside submissions are still a better place to look for new sites than the search engines. (We thank you all for your help!) And the spammers poison the search engines also.

Lisa

12:31 am on Nov 30, 2002 (gmt 0)

I am not seeing anything new in DMOZ that I know is there...

I would just use a Google search:

site:dmoz.org term1 term2

jmccormac

4:32 am on Nov 30, 2002 (gmt 0)

"I am not seeing anything new in DMOZ that I know is there..."

It seems that Dmoz is doing a bit of cleaning up. Some sites that are double-listed on RDF generated pages are cleaned up to single entries. Though this could be an isolated case as I think that the category has a new editor. Some entries (cybersquatter pickups) have been deleted. I've figured out a quicker method of checking for new links on Dmoz though it only works on a page by page basis. I haven't had a chance to look at a wider sample of Dmoz pages though if the Dmoz page format is consistent it should be just a case of parsing the HTML.

It would be easy to track new entries/deletions - just use a MySQL database of ExternalPage links from the RDF that should be in a certain page, then download the page from Dmoz, check if the links are in the page and if not flag them for deletion. If new links are found, parse them into SQL and add them to the db. I'll have some free time downloading the .com/net/org zonefiles for a European domain ownership analysis project on Sunday so I will implement it for some of the Regional/Europe/Ireland pages.
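A rough sketch of that bookkeeping, using Python's built-in sqlite3 as a stand-in for MySQL so it runs without a server. The table layout, the status values, and the function names are assumptions made for the example, and the list of links currently on the page is taken as already extracted from the downloaded HTML.

```python
import sqlite3


def make_db(rows):
    """Create an in-memory table of (page, url) pairs taken from the RDF."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE links (page TEXT, url TEXT, status TEXT, "
        "PRIMARY KEY (page, url))"
    )
    conn.executemany("INSERT INTO links VALUES (?, ?, 'rdf')", rows)
    return conn


def sync_page(conn, page, current_links):
    """Flag RDF links missing from the page and add links not yet in the db."""
    current = set(current_links)
    known = {
        url
        for (url,) in conn.execute("SELECT url FROM links WHERE page = ?", (page,))
    }
    for url in known - current:
        # Not deleted outright, only flagged so it can be reviewed.
        conn.execute(
            "UPDATE links SET status = 'flagged' WHERE page = ? AND url = ?",
            (page, url),
        )
    for url in current - known:
        conn.execute("INSERT INTO links VALUES (?, ?, 'new')", (page, url))
    conn.commit()
    return dict(conn.execute("SELECT url, status FROM links WHERE page = ?", (page,)))


# Hypothetical category and URLs.
page = "Regional/Europe/Ireland"
conn = make_db([(page, "http://example.ie/a"), (page, "http://example.ie/b")])
result = sync_page(conn, page, ["http://example.ie/a", "http://example.ie/c"])
```

Flagging instead of deleting keeps a record of what dropped out between the dump and the live page, which is handy if a link disappears only because the page failed to download completely.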

Regards...jmcc

EliteWeb

7:37 am on Nov 30, 2002 (gmt 0)

I hope DMOZ peepz can work the dump-bugs out :) Seems there are a few issues with it but I have faith it will come out properly.

I'm one person who will wait for dmoz to do their official RDF update. So much of my data relies on the dump file.

whats up skip

10:57 am on Dec 21, 2002 (gmt 0)

I am just receiving an error when I run this search:

ht://Dig error
htsearch detected an error. Please report this to the webmaster of this site. The error message is:

Unable to read word database file
Did you run htmerge?

Dumpy

12:30 pm on Dec 21, 2002 (gmt 0)

I took the database down. It was getting dated and the DMOZ people were very upset at me for demonstrating how simple it is to do a site search.

I am creating databases for subject areas, e.g. health, games, business, shopping, music, sports, etc. Of course none of these sites will EVER be included in DMOZ.

Hardwood Guy

11:26 am on Dec 26, 2002 (gmt 0)

Whoaaaa. I tried the link mentioned...http://bbsorg.com/search.html. I can't even get into the search box. Pop-ups galore! I'm assuming it's not operational, Dumpy. Looks like a casino on my monitor.

Dumpy

12:31 pm on Dec 26, 2002 (gmt 0)

You didn't see that on my site!

You would have seen an error page because I deleted the database...if you did a search. Or, just the search page if you didn't. If you made an error in spelling the html page you would have been redirected to my main search site.

I don't have pop-ups, or casinos!

If DMOZ proves incapable of creating a new RDF Dump or search for their site, I will create a new database...I'm waiting for some sort of word on that subject from anyone in the know. So far, information has been slow in coming and in error!

steveb

1:10 pm on Dec 26, 2002 (gmt 0)

"So far, information has been slow in coming and in error!"

Actually it has been very prompt and quite accurate. It appears not to matter to you though.

Dumpy

3:17 pm on Dec 26, 2002 (gmt 0)

DMOZ published a message in their newsletter that the new RDF was completed and would be available as we read the newsletter.

There has been nothing else published since.

I have passed the point of understanding THE DMOZ ATTITUDE!

choster

4:15 pm on Dec 26, 2002 (gmt 0)

The ODP newsletter, as all publications, was released with best available information at the time of publication.

Second, name one other search engine or directory which makes its internally circulated newsletter available to the public? The cause of and attempts to resolve problems with the RDF feed are available here and elsewhere practically as soon as they are reported to editors.

Who's the one with the attitude?

Dumpy

5:19 pm on Dec 26, 2002 (gmt 0)

Netscape has stated that the editors and data users were the DMOZ first priority.

Thank GOD, I'm not a second priority.

As a data user I seem to be sucking hind tit. At least the editors seem to have some contact with Netscape, but data users have NONE. Netscape doesn't even do the courtesy of publishing current information on the RDF page...as it did four months ago when it published an erroneous RDF Dump.

You can't put a shine on a rotten apple.

rafalk

7:27 pm on Dec 26, 2002 (gmt 0)

Editors have the exact same "contact" with Netscape as you do - which is practically none.

g1smd

8:53 pm on Dec 26, 2002 (gmt 0)

>> Netscape doesn't even do the curtesy of publishing current information on the RDF page... <<

*cough* [dmoz.org...] *cough*
