They do run a check, but then the results are checked manually by a human editor. The bot puts the 404 page into the unreviewed queue. Also, the current public side of the dmoz is quite old. They reverted to an older index while they are performing their "upgrades". The situation may not be as bad as it seems.
Yes, there is "Robozilla", a robot that checks URLs category by category. But it can only detect abandoned domains when it receives a 404 error.
If the domain has changed subject, if there is a redirect to another URL, or if there is now a page saying "this domain is for sale", Robozilla doesn't get any error: the URL is live and running.
Yes, as msr986 says, there is also the time it takes for an editor to look at the red flag next to the listing, examine the URL, and either OK the deletion of the URL or change it (sometimes www.site.zz/index.htm becomes www.site.zz/home/default.asp). And then we have to wait for the editor's changes to go public.
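To make the "live but dead" cases above concrete, here is a minimal sketch of the kind of check a bot would need beyond plain status codes. The phrase list and the category names are my own illustration, not Robozilla's actual logic:

```python
# Illustrative classifier for a link-check result. The parked-domain
# phrases and return labels are assumptions for the sketch, not what
# Robozilla really does (Robozilla only reacts to HTTP errors).
PARKED_PHRASES = ("this domain is for sale", "domain for sale", "buy this domain")

def classify(status_code, final_url, requested_url, body_text):
    """Classify one checked listing."""
    if status_code in (404, 410):
        return "dead"            # hard dead link: the easy case
    if status_code >= 500:
        return "server-error"    # possibly temporary; recheck later
    lowered = body_text.lower()
    if any(p in lowered for p in PARKED_PHRASES):
        return "parked"          # URL is live, but the site is gone
    if final_url != requested_url:
        return "moved"           # silent redirect; needs a human look
    return "ok"
```

The last two branches are exactly the cases a 404-only bot misses: the server happily returns 200, so only content inspection or redirect tracking reveals the problem.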
|And why aren't 404's flagged and reindexed every xy days or months and then after a special period automatically deleted from the index? The editors are pretty much in stress already - so every automation is worth a try, no!? |
A quick answer to one of your many questions....
Principally because, as an editor, I don't want to delete sites. I'll use the 404 (or the 500 or whatever) as a cue to find out where the site is now. You'd be amazed how many people do things like this:
- change their domain from *.com to *.net (or whatever) without telling anyone;
- "tidy up" their URLs -- maybe renaming all the *.html to *.htm -- again without telling anyone;
- move from one free host to another -- eg from Geocities to Freeserve.
If I can find the new site, I'll update the URL. But this sometimes takes weeks -- especially if I have to wait for the new site to show up in search engines so I can find it.
But it would be useful if Robozilla moved sites to Unreviewed if, say, they've been dead for a couple of checks. That would remove them from the public side, while still leaving them available for sleuthing behind the scenes.
|Definitely you have to do it! Damn, i don't have the time to contribute .. |
Aw, come on, you know you wanna... ;)
Once the upgrade is done, it's possible that erroneous listings will be taken out of public view, but for now they stay in view until sorted. In some cases that's fine - we get a few 'false positives' where sites are temporarily down or block Robozilla's user agent string.
FYI, editors can run a considerably more sophisticated link checker called TulipChain, which not only detects 4xx and 5xx errors but also flags up redirects, including HTTP 3xx responses, refresh meta tags and framed redirects. It's not built into the editing software, but it is available and offers a more powerful alternative to Robozilla. It also checks sites that have been submitted but haven't yet been reviewed.
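For the curious, here's roughly how a meta-refresh check like the one just described can work. This is a sketch of the general technique, not TulipChain's actual code, and the regex is deliberately simplified (it assumes http-equiv comes before content in the tag):

```python
import re

# Simplified meta-refresh matcher: finds tags like
#   <meta http-equiv="refresh" content="5; url=http://new.example/">
# Real pages are messier (attribute order, entities), so treat this
# as an illustration only.
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*'
    r'content=["\']?\s*\d+\s*;\s*url=([^"\'>\s]+)',
    re.IGNORECASE)

def find_meta_refresh(html):
    """Return the redirect target of a meta-refresh tag, or None."""
    m = META_REFRESH.search(html)
    return m.group(1) if m else None
```

A meta refresh returns a perfectly healthy 200, which is why a status-code-only bot like Robozilla sails right past it.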
This link might be of interest
That link should be a front-runner for the Darwin Awards. How thick do you have to be to keep crawling pages that have been *static* for more than six weeks?
TulipChain? I always thought those things (the actual things made of flowers) were called 'daisy chains'? Is DMOZ run by a bunch of 5-pointers or what? ;)
Ps. If you don't get the joke, just forget it -- it is a bit obscure.
Another approach to the problem of whether or not to "get rid of" links that are found dead by a dead link checking bot is to mark them dead in place and create summary reports of where they exist. At least that's what we do on Zeal:
If the information provided in the above-referenced dmoz survey proves reliable, it certainly raises a serious question about the quality of the dmoz directory. In a related thread [webmasterworld.com] about a year ago, a link rot of about 10% was estimated based on a sample of 250,000 URLs.
I think surveys like these highlight the need for modern directories to use better technology and more automation in both the submission and update procedures. The key to understanding how this can be done is to look at how a search engine works. In a similar fashion to a search engine, a 2nd generation directory uses the web sites themselves as the data source. Such a paradigm would also open the door to self-organized, automated web directories capable of handling high volumes of submissions and updates.
1) A similar system exists - Robozilla looks for dead URLs and marks them, and we have tools like the RGBSeeker to spot them and deal with them.
2) Don't forget that, due to the server upgrade, the usual tasks were stopped for some time. The first of these is Robozilla and the second is the updating of the public pages. The latter has just resumed, so changes made in the meantime should become visible some time from now. As long as no updates to the public pages were made, we were of course unable to remove dead links from them.
There are some unclear things, and some assumptions that are plain wrong, on [tagword.com]
- How fresh is their ODP data? If they pulled an RDF dump it might be reasonably fresh. Assuming they crawled some time before the 8 August date given, it is probably from early to mid-July (no RDF was published during the recent server migration). If OTOH they are spidering dmoz.org pages, it is more than 2 months out of date.
- What's with the "TOTAL # of websites to check: 7,688,375" number? That is about double the number of entries in the ODP.
- The assumption that all 301 and 302 return codes indicate incorrect entries is mistaken. I for one have intentionally listed a lot of URLs from which the user is redirected. The typical example is [company.com...] redirecting, via a browser/accept-language detection script, to [company.com...] . Obviously I am not going to list the second URL (which is specific to my configuration) but the first one. Another frequent case is when a webmaster misunderstands how a directory index file is meant to work and puts a redirect in place from [anothercompany.com...] to [anothercompany.com...] . In this case I for one list the first URL because it is likely to be more stable.
- A lot of 403s arise from sites blocking bots, wget etc. a bit overzealously, e.g. blocking anything that does not claim to be some kind of Mozilla. A lot of 500s and some other return codes arise from browser detection scripts that do something like: "if MSIE detected, serve page foo.html. Elsif Netscape detected, serve page bar.html. Elsif Safari detected, serve page baz.html. Else die horribly because this case was not tested for."
These sites work with web browsers, which is why, when we check the flagged entries, we reset the error flag.
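A checker can tell those bot-blocking 403s apart from genuinely broken sites by retrying with a browser-like identity. A rough sketch - the user-agent strings, labels, and the `fetch` callback are made up for illustration, not part of any ODP tool:

```python
# Illustrative User-Agent; real checkers would mimic an actual browser string.
BROWSER_UA = "Mozilla/5.0 (compatible; link-check sketch)"

def check_with_fallback(url, fetch):
    """fetch(url, user_agent) -> HTTP status code (callback supplied by caller).

    Probe with a plain bot identity first; on a 403, retry pretending
    to be a browser before concluding the site is actually broken."""
    status = fetch(url, "linkbot/0.1")   # hypothetical bot UA
    if status == 403:
        status = fetch(url, BROWSER_UA)
        if status < 400:
            return "blocks-bots"         # works in a browser: reset the flag
    return "ok" if status < 400 else "error"
```

The "blocks-bots" outcome corresponds exactly to the case above: the site works for visitors, so the editor resets the error flag rather than deleting the listing.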
"Link rot" means different things to different people, Go2. Define it.
|"Link rot" means different things to different people, Go2. Define it |
On the http level I would use the following definition of link rot (taken from About.com [websearch.about.com] ):
"Definition: The name given to a link that leads to a web page or site that has either moved or no longer exists."
(the dmoz survey indicated a 16% link rot on the http level)
On the content level I would define link rot as any link that results in an error page or a web page which is irrelevant given the context of the link.
(the dmoz survey estimated a 20-25% link rot on the content level)
Even if you assume the strictest definition of link rot, if you're spidering dmoz.org your count is going to be seriously distorted by the following: "edit" links that go to internal ODP pages. There's one on each page. Count on getting over 450,000 403 errors unless the parser excludes those.
The majority of categories have FAQs and Descriptions, and many of those have hard links that are broken. Those technically should not be included in your count. For example, go to this FAQ [dmoz.org] and scroll down to question 1.4. Both links are broken. The same goes for internal links in the FAQs and Descriptions.
Many ODP categories still have a hard search link at the bottom to defunct search engines like NorthernLight. Unless excluded, that too will throw off your count considerably.
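If anyone wants to reproduce the survey from a dmoz.org spider, a simple substring filter along these lines would drop the internal links mentioned above before counting. The substrings here are illustrative guesses at the offending URL patterns, not an exhaustive or verified list:

```python
# Illustrative exclusion list for a dmoz.org spider: internal or
# boilerplate links that should not count toward the link-rot figure.
EXCLUDE_SUBSTRINGS = (
    "dmoz.org/cgi-bin",       # assumed path of the per-page "edit" links (403 to outsiders)
    "northernlight.com",      # defunct search-engine links in category footers
)

def is_external_listing(url):
    """True if the URL looks like an actual listed site, not directory plumbing."""
    return not any(s in url for s in EXCLUDE_SUBSTRINGS)

urls = ["http://example.com/page.html", "http://dmoz.org/cgi-bin/apply.cgi"]
listings = [u for u in urls if is_external_listing(u)]
```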
I'm the one doing the survey.
To answer questions:
That number actually counts each link twice in the ODP data. I state on the page that I am doing multiple checks "just in case" a site is down, a server is being rebooted, or there are connectivity issues for some reason.
The sampling is still going on, but seems to be stabilizing at around 16.5% bad links...
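For what it's worth, counting only links that fail on every probe is what keeps transient outages from inflating that figure. A sketch, assuming results are stored per URL as lists of status codes from the repeated checks (my own data layout, not the survey's):

```python
def rot_rate(results):
    """results: {url: [status, status, ...]} from repeated probes.

    A URL counts as rotten only if *every* probe failed, so a server
    that was merely rebooting during one pass doesn't get counted."""
    bad = sum(1 for probes in results.values()
              if all(status >= 400 for status in probes))
    return bad / len(results)
```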
At that rate, I am not sure I want to use the data, as that level of fault is not acceptable. I am going to continue the test, however, as a few people seem interested in the final results.
Perhaps they have not been updating the RDF dumps, but I noticed many 404's while checking the dmoz.org site. With 57,000 editors this seems very odd to have this high a level of 404's.. Could it be maybe that MS worm taking out Windows hosts?
>With 57,000 editors this seems very odd to have this high a level of 404's..
That's the number of editors that have ever been with the ODP. I remember a meta posting that the number of currently active editors is much smaller. And most editors can only edit in very limited areas of the ODP.
"I noticed many 404's while checking the dmoz.org site."
If you have been doing this for the past seven weeks you may have pointlessly wasted more time than any person in human history. The pages have not changed in that time. Literally thousands and thousands of sites have been added and deleted but are not showing on the public pages. The - pages - are - static.
It's like checking every day to see if Francisco Franco is still dead.
I'm the one doing the survey.
Good to hear from you.
If you want your survey to be useful, I'd urge you to use the RDF as your data source - it is at the moment almost two months more current than the public pages. We had a Robozilla run in mid-June; the corrections from that run are contained in the RDF but not yet on the public pages.
As I mentioned above 301s and 302s are not necessarily indicative of an error - these URLs are quite often listed intentionally. It would be interesting, though, to do a selective check of these redirections and filter for patterns in the Location: header. Do you by any chance store your results in a format that could be processed by others? Would it be possible to make your result dataset, esp. all non-200s with category path and HTTP code, available for download on your site? We use a very diverse toolset to hunt for broken links, changed content and hijacks, and one more approach can be of use.
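For the pattern filtering I have in mind, even a crude split on the Location: header's host would be a start. A sketch - the function name and labels are mine, and www vs. bare-domain redirects would need extra handling before this is trustworthy:

```python
from urllib.parse import urlsplit

def redirect_kind(listed_url, location):
    """Rough triage of a 301/302 Location: header.

    Same-host redirects (index.htm -> /home/default.asp, or
    accept-language detection scripts) are usually fine as listed;
    a redirect to a different host is more likely a moved site,
    a parked domain, or a hijack, and deserves a human look."""
    listed_host = urlsplit(listed_url).hostname
    target_host = urlsplit(location).hostname
    if target_host is None or listed_host == target_host:
        return "same-host"    # relative or internal redirect: probably intentional
    return "cross-host"       # flag for manual review
```

Run over a dump of non-200 results, a split like this would separate the harmless redirects I described above from the ones actually worth an editor's time.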