If I remember correctly they had a crawler to check for this, no!? Don't they have the technical resources to run a check frequently?
Yes, as msr986 says, there is the time it takes for an editor to look at the red sign near the listing, examine the URL, and either approve deleting the URL or change it (sometimes www.site.zz/index.htm becomes www.site.zz/home/default.asp). And then we have to wait for the editor data to go public.
And why aren't 404s flagged and rechecked every xy days or months, and then automatically deleted from the index after a set period? The editors are pretty much under stress already, so every bit of automation is worth a try, no!?
A quick answer to one of your many questions....
Principally because, as an editor, I don't want to delete sites. I'll use the 404 (or the 500 or whatever) as a cue to find where the site is now. You'd be amazed how many people do things like this:
But it would be useful if Robozilla moved sites to Unreviewed if, say, they've been dead for a couple of checks. That would remove them from the public side, while still leaving them available for sleuthing behind the scenes.
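For what it's worth, the "dead for a couple of checks" rule suggested above could be sketched roughly as follows. This is not how Robozilla actually works; the `Listing` class, the `MAX_FAILED_CHECKS` threshold and the status names are assumptions made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical threshold: the thread only says "a couple of checks".
MAX_FAILED_CHECKS = 2


@dataclass
class Listing:
    url: str
    status: str = "public"   # "public" or "unreviewed" (assumed status names)
    failed_checks: int = 0   # consecutive failed checks so far


def is_failure(http_code: int) -> bool:
    """Treat any 4xx or 5xx response as a failed check."""
    return 400 <= http_code <= 599


def record_check(listing: Listing, http_code: int) -> Listing:
    """Update a listing after one link check.

    A failure increments the counter; reaching the threshold moves the
    listing to "unreviewed" so it leaves the public side but is not deleted.
    A success resets the counter but never moves a listing back on its own,
    since that decision is left to an editor.
    """
    if is_failure(http_code):
        listing.failed_checks += 1
        if listing.failed_checks >= MAX_FAILED_CHECKS:
            listing.status = "unreviewed"
    else:
        listing.failed_checks = 0
    return listing
```

Counting consecutive failures rather than total failures means a single outage between two good checks does not push a site off the public pages.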
FYI editors can run a considerably more sophisticated link checker called TulipChain, which not only detects 4xx and 5xx errors but also flags redirects, including HTTP 3xx responses, refresh meta tags and framed redirects. It's not built into the editing software, but it is available and offers a more powerful alternative to Robozilla. It also checks sites that have been submitted but haven't yet been reviewed.
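To give an idea of what those checks involve, here is a minimal sketch using only the Python standard library. It is not TulipChain's code; the class name, the `classify` function and its labels are invented for this example, and the "framed" label is only a crude hint, since any page with a frame gets flagged.

```python
from html.parser import HTMLParser


class RedirectHintParser(HTMLParser):
    """Collects meta-refresh targets and frame sources from a page."""

    def __init__(self):
        super().__init__()
        self.meta_refresh_targets = []
        self.frame_sources = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        http_equiv = (attributes.get("http-equiv") or "").lower()
        if tag == "meta" and http_equiv == "refresh":
            # The content attribute looks like "0; URL=http://www.site.zz/"
            content = attributes.get("content") or ""
            _, _, rest = content.partition(";")
            rest = rest.strip()
            if rest.lower().startswith("url="):
                self.meta_refresh_targets.append(rest[4:].strip(" '\""))
        elif tag in ("frame", "iframe") and attributes.get("src"):
            self.frame_sources.append(attributes["src"])


def classify(http_code: int, html: str = "") -> str:
    """Return a coarse label for one checked URL."""
    if 400 <= http_code <= 599:
        return "error"
    if 300 <= http_code <= 399:
        return "http-redirect"
    parser = RedirectHintParser()
    parser.feed(html)
    if parser.meta_refresh_targets:
        return "meta-refresh"
    if parser.frame_sources:
        return "framed"
    return "ok"
```

A flagged entry is only a candidate for review: plenty of legitimate sites use frames or a refresh tag.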
[zeal.com...]
Cheers,
Andre
I think surveys like these highlight the need for modern directories to use better technology and more automation in both the submission and update procedures. The key to understanding how this can be done is to look at how a search engine works. In a similar fashion to a search engine, a 2nd generation directory uses the web sites themselves as the data source. Such a paradigm will also open the door for self organized, automated web directories, capable of handling high volumes of submissions and updates.
1) A similar system exists: Robozilla looks for dead URLs and marks them, and we have tools like the RGBSeeker to spot them and deal with them.
2) Don't forget that, due to the server upgrade, the usual tasks were stopped for some time. The first one is Robozilla and the second one is the updating of the public pages. The latter has just begun its work, so changes made in the meantime should become visible soon. As long as no updates to the public pages were made, we were of course unable to remove dead links from them.
- How fresh is their ODP data? If they pulled an RDF it might be reasonably fresh. Assuming they had already been crawling for some time before the 8 August date given, it is probably from the beginning to the middle of July (no RDF was published during the recent server migration). If OTOH they are spidering dmoz.org pages, it is more than 2 months out of date.
-What's with the "TOTAL # of websites to check: 7,688,375" number? That is about double the number of entries in the ODP.
- The assumption that all 301 and 302 return codes are indicative of incorrect entries is mistaken. I for one have listed a lot of URLs from which the user is redirected. The typical example is [company.com...] redirecting, via a browser/accept-language detection script, to [company.com...] . Obviously I am not going to list the second URL (which is specific to my configuration) but the first one. Another frequent case is when a webmaster misunderstands how a directory index file is meant to work and puts a redirect in place from [anothercompany.com...] to [anothercompany.com...] . In this case I for one list the first URL because it is likely to be more stable.
- A lot of 403s arise from sites blocking bots, wget etc. a bit overzealously, e.g. blocking anything that does not say it is some kind of Mozilla. A lot of 500s and some other return codes arise from browser detection scripts that do something like: "if MSIE detected, serve page foo.html . Elsif Netscape detected, serve page bar.html . Elsif Safari detected, serve page baz.html . Else die horribly because this case was not tested for.".
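As a rough illustration of the point about 301s and 302s, a checker could look at where a redirect leads before calling the entry broken. The sketch below is only one possible heuristic; the function names and the three labels are made up here, and the `.example` hosts in the usage note are placeholders, not the sites mentioned above.

```python
from urllib.parse import urljoin, urlsplit


def host_of(url: str) -> str:
    """Lower-cased host name with a leading "www." removed."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def classify_redirect(listed_url: str, location: str) -> str:
    """Label a redirect by comparing the listed URL with its Location header.

    "same-host-from-directory": a directory URL redirecting within its own
        site, e.g. language detection or a misconfigured index file; the
        listed URL is probably the right one to keep.
    "same-host-other": a page moved within the same site.
    "different-host": the redirect leaves the site; worth an editor's look.
    """
    target = urljoin(listed_url, location)  # Location may be relative
    if host_of(target) != host_of(listed_url):
        return "different-host"
    listed_path = urlsplit(listed_url).path or "/"
    if listed_path.endswith("/"):
        return "same-host-from-directory"
    return "same-host-other"
```

For example, `classify_redirect("http://company.example/", "/en/index.html")` gives "same-host-from-directory", while a redirect to an unrelated domain gives "different-host".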
These sites work with web browsers, which is why, when we check the flagged entries, we reset the error flag.
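One way a survey script could account for this is to retry a failing URL with a browser-like User-Agent before counting it as bad. The following is a minimal sketch under that assumption; both User-Agent strings and the labels returned by `interpret` are invented for the example.

```python
import urllib.error
import urllib.request

# Both User-Agent strings are hypothetical examples.
BOT_UA = "ExampleLinkChecker/0.1"
BROWSER_UA = "Mozilla/5.0 (compatible; ExampleLinkChecker/0.1)"


def fetch_status(url, user_agent, timeout=10):
    """Return the final HTTP status code, or None if no response arrived.

    urlopen follows redirects, so a 301/302 chain reports the status of the
    page it ends on.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code       # 4xx and 5xx responses arrive as exceptions
    except (urllib.error.URLError, OSError):
        return None             # DNS failure, refused connection, timeout


def interpret(bot_status, browser_status):
    """Compare the result seen by a bot with the one seen by a 'browser'."""
    def ok(status):
        return status is not None and 200 <= status <= 399

    if ok(bot_status):
        return "ok"
    if ok(browser_status):
        return "user-agent-sensitive"   # the site works, it just dislikes bots
    return "failing"
```

A typical call would be `interpret(fetch_status(url, BOT_UA), fetch_status(url, BROWSER_UA))`; only "failing" entries would then count toward a link rot figure.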
"Link rot" means different things to different people, Go2. Define it.
On the http level I would use the following definition of link rot (taken from About.com [websearch.about.com] ):
"Definition: The name given to a link that leads to a web page or site that has either moved or no longer exists."
(the dmoz survey indicated a 16% link rot on the http level)
On the content level I would define link rot as any link that results in an error page or a web page which is irrelevant given the context of the link.
(the dmoz survey estimated a 20-25% link rot on the content level)
That number actually counts each link in the ODP data "twice". I state on the page that I am doing multiple checks "just in case" the site is down, the server is being rebooted, or there are connectivity issues for some reason.
The sampling is still going on, but seems to be stabilizing at around 16.5% bad links...
At that rate, I am not sure that I want to use that data, as that level of fault is not acceptable. I am going to continue the test, however, as there seem to be a few people who are interested in the final results.
Perhaps they have not been updating the RDF dumps, but I noticed many 404s while checking the dmoz.org site. With 57,000 editors it seems very odd to have this high a level of 404s. Could it maybe be that MS worm taking out Windows hosts?
If you have been doing this the past seven weeks you may have pointlessly wasted more time than any person in human history. The pages have not changed in that time. Literally thousands and thousands of sites have been added and deleted but are not showing on the public pages. The - pages - are - static.
It's like checking every day to see if Francisco Franco is still dead.
Hey,
I'm the one doing the survey.
Good to hear from you.
If you want your survey to be useful I'd urge you to use the RDF as your data source - it is at the moment almost two months more current than the public pages. We had a Robozilla run in mid-June; the corrections from that are contained in the RDF but not yet on the public pages.
As I mentioned above 301s and 302s are not necessarily indicative of an error - these URLs are quite often listed intentionally. It would be interesting, though, to do a selective check of these redirections and filter for patterns in the Location: header. Do you by any chance store your results in a format that could be processed by others? Would it be possible to make your result dataset, esp. all non-200s with category path and HTTP code, available for download on your site? We use a very diverse toolset to hunt for broken links, changed content and hijacks, and one more approach can be of use.
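To make the request concrete, a downloadable result set could be as simple as a CSV file holding all non-200 results. The column names below (`category`, `url`, `http_code`, `location`) are only a suggestion, not a format anyone in this thread has agreed on, and the helper names are invented for the sketch.

```python
import csv
import io
from collections import Counter

# Suggested columns: category path, listed URL, HTTP code, Location header.
FIELDS = ["category", "url", "http_code", "location"]


def write_non_200(results, out):
    """Write every result whose code is not 200 as CSV; return the row count."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    written = 0
    for row in results:
        if int(row["http_code"]) != 200:
            writer.writerow({field: row.get(field, "") for field in FIELDS})
            written += 1
    return written


def count_by_code(csv_text):
    """Tally the rows of such a CSV file by HTTP code."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(int(row["http_code"]) for row in reader)
```

With the category path in every row, the file can be sorted by category, or filtered on the `location` column to look for redirect patterns.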