If Netscape published the software...I would have it! I use the RDF Dump every day...but not every day for over two months. I am giving thought to going to PLAN B...which will require a lot of crawling.
If the full url is a problem...you might remove the http portion and people could cut and paste.
The DMOZ search database on their site was from the RDF dump of mid-September.
Since then many new sites and changes have taken place. If you want to take the time to drill down in the directory you will see some of them...and possibly find your site has been added. However, what if you were placed into a different category...it would be difficult to find it. I crawled the site this weekend and created the database as it exists NOW.
As of now it is the ONLY search database of DMOZ that is current. I'll wipe it out as soon as DMOZ is capable of creating its own database.
I haven't included the World directory, but if it is wanted I can do it in a few hours. Understand this effort is a trivial exercise of a simple program run by a dumb machine...I just play with my dog and watch the horse outside my window munch grass. One doesn't have to be the brightest bulb on the tree to click a mouse and run telnet.
Besides I am retired! Machines work, retired people don't.
From what I can see, you are using Htdig to spider the site rather than using the RDF. The problem for all the directories and search engines picking up off Dmoz is that they use the RDF. It would be a non-trivial exercise to take the website data with the new additions and integrate them with the RDF. Though now that I think about it, a possible back of an envelope solution exists if the site html can be converted to RDF and thence to SQL and compared with the existing RDF data in SQL. The main problem though is bandwidth rather than storage. I'd guess the raw HTML (each dynamic page as a static webpage) from Dmoz would produce something in the region of 3GB of html. However for small scale branches (I was thinking of doing the Regional/Europe/Ireland branch) it is perfectly feasible on a limited bandwidth connection.
The first thing would be to write a parser for the HTML. The program would take the ExternalPage links from the Sept RDF and check for their existence in the current page. It would then check for ExternalPage links not in the Sept RDF page. If it finds any, it would parse the details into SQL and modify the relevant page details. (I wrote a set of parsers/SQL converters/static html page generator programs in March and I still have the notes somewhere.) The only thing for distribution would be to write an SQL to RDF converter to generate the structure and content RDFs.
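A rough sketch of the page check described above, in Python. It is only an illustration: the real Dmoz page layout is not shown in this thread, so the sketch assumes that any absolute link pointing off the directory's own host is a listing. The names `LinkCollector`, `diff_page` and the `own_host` default are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that point off the directory's own host."""

    def __init__(self, own_host="dmoz.org"):  # own_host is an assumption
        super().__init__()
        self.own_host = own_host
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        host = urlparse(href).netloc.lower()
        # Relative links and mailto: links have no host; skip those and
        # anything on the directory's own host or its subdomains.
        internal = host == self.own_host or host.endswith("." + self.own_host)
        if host and not internal:
            self.links.add(href)


def diff_page(rdf_links, page_html):
    """Compare links expected from the RDF with links found in the live page.

    Returns (missing, added): links in the RDF but not on the page, and
    links on the page that the RDF does not know about.
    """
    collector = LinkCollector()
    collector.feed(page_html)
    expected = set(rdf_links)
    return expected - collector.links, collector.links - expected
```

Feeding it the saved HTML of one category page plus the set of ExternalPage URLs the Sept RDF lists for that category gives the two sets to act on. URLs are compared as plain strings, so a trailing slash or a change of case would show up as one deletion plus one addition.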
For some reason, all these programming problems seem trivial around 0500 Hrs. Hypnagogic states are great for delusions about the complexity of programming tasks. :)
Regards...jmcc
Assuming the same condition exists today, 60 days @ 3000 would indicate 180,000 or more sites are added to the directory since the last updated RDF Dump.
These sites are not filtering through the internet...except by sites that actually crawl ODP.
Assuming the same condition exists today, 60 days @ 3000 would indicate 180,000 or more sites are added to the directory since the last updated RDF Dump.
It does not take into account the natural attrition rate of domain names and websites. Dmoz's method of relying on user submissions is perhaps a good way to limit the number of sites submitted. Also does the number of submissions actually equal the number of inclusions?
Regards...jmcc
Most of the rest of the sites are submitted to the wrong category, but a small majority are submitted to somewhere near the right place.
This is not as bad as it might sound: for most topics, the outside submissions are still a better place to look for new sites than the search engines. (We thank you all for your help!) And the spammers poison the search engines also.
I am not seeing anything new in DMOZ that I know is there...
It seems that Dmoz is doing a bit of cleaning up. Some sites that are double-listed on RDF generated pages are cleaned up to single entries. Though this could be an isolated case as I think that the category has a new editor. Some entries (cybersquatter pickups) have been deleted. I've figured out a quicker method of checking for new links on Dmoz though it only works on a page by page basis. I haven't had a chance to look at a wider sample of Dmoz pages though if the Dmoz page format is consistent it should be just a case of parsing the HTML.
It would be easy to track new entries/deletions - just use a MySQL database of ExternalPage links from the RDF that should be in a certain page, then download the page from Dmoz, check if the links are in the page and if not flag them for deletion. If new links are found, parse them into SQL and add them to the db. I'll have some free time downloading the .com/net/org zonefiles for a European domain ownership analysis project on Sunday so I will implement it for some of the Regional/Europe/Ireland pages.
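For what it's worth, here is a minimal sketch of that bookkeeping. It uses Python's built-in sqlite3 in place of MySQL so that it runs without a server; the table name `external_pages`, its columns and the status values are invented for the illustration and are not from any existing setup.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create the (hypothetical) table of ExternalPage links per category."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS external_pages ("
        " category TEXT NOT NULL,"
        " url TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'rdf',"  # 'rdf', 'flagged' or 'new'
        " PRIMARY KEY (category, url))"
    )
    return conn


def sync_category(conn, category, page_links):
    """Flag links missing from the live page and add links the db lacks."""
    page_links = set(page_links)
    known = {
        row[0]
        for row in conn.execute(
            "SELECT url FROM external_pages WHERE category = ?", (category,)
        )
    }
    gone = known - page_links   # in the db but no longer on the page
    new = page_links - known    # on the page but not yet in the db
    with conn:  # one transaction: commit on success, roll back on error
        conn.executemany(
            "UPDATE external_pages SET status = 'flagged'"
            " WHERE category = ? AND url = ?",
            [(category, url) for url in gone],
        )
        conn.executemany(
            "INSERT INTO external_pages (category, url, status)"
            " VALUES (?, ?, 'new')",
            [(category, url) for url in new],
        )
    return gone, new
```

Links are only flagged, not deleted, so a page that failed to download properly does not wipe out a category; a later pass could delete anything that stays flagged across several runs.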
Regards...jmcc
I am creating databases for subject areas, e.g. health, games, business, shopping, music, sports, etc. Of course none of these sites will EVER be included in DMOZ.
You would have seen an error page because I deleted the database...if you did a search. Or, just the search page if you didn't. If you made an error in spelling the html page you would have been redirected to my main search site.
I don't have pop-ups, or casinos!
If DMOZ proves incapable of creating a new RDF Dump or search for their site, I will create a new database...I'm waiting for some sort of word on that subject from anyone in the know. So far, information has been slow in coming and in error!
Second, name one other search engine or directory which makes its internally circulated newsletter available to the public? The cause of and attempts to resolve problems with the RDF feed are available here and elsewhere practically as soon as they are reported to editors.
Who's the one with the attitude?
Thank GOD, I'm not a second priority.
As a data user I seem to be sucking hind tit. At least the editors seem to have some contact with Netscape, but data users have NONE. Netscape doesn't even do the courtesy of publishing current information on the RDF page...as it did four months ago when it published an error RDF Dump.
You can't put a shine on a rotten apple.
*cough* [dmoz.org...] *cough*