Will site be dropped from DMOZ?

Valid link erroneously reported as broken

geekay

5:39 am on Apr 9, 2004 (gmt 0)

DMOZ is obviously using the spidering services of a certain company (the name of which is not important here) to periodically check the validity of all the links in the directories.

On my server, all requests for www.example.com/dir return a 301 and are redirected to www.example.com/dir/. I now notice in my web log that this particular spider requested /dir but did not follow the redirect. It did not wait for the status 200 page; it just quit.
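For anyone who wants to reproduce what the log shows, here is a minimal sketch in Python. It assumes nothing about the spider in question; it only demonstrates a client that makes one request and stops at the 301 instead of following it. The function name fetch_once is made up for this illustration. Python's http.client never follows redirects on its own, which is why it is used here.

```python
import http.client


def fetch_once(host, path, port=80, timeout=10):
    """Send a single GET and return (status, Location header).

    http.client does not follow redirects, so a 301 is reported
    as-is, much like the behaviour seen in the web log.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection closes cleanly
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()
```

Calling fetch_once("www.example.com", "/dir") against a server set up as described above would be expected to report the 301 together with the Location of the /dir/ page, while a request for "/dir/" would go straight to the 200.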

My question is, could this result in my site being automatically dropped from DMOZ, or will the validity of the link be manually checked before excluding?

ncw164x

5:51 am on Apr 9, 2004 (gmt 0)

What makes you think that it is another company's link checker? They have their own spider, called Robozilla.

ncw164x

RFranzen

6:03 am on Apr 9, 2004 (gmt 0)

Like ncw said, we don't use a third party's spider.

Assuming that our robozilla did decide that a listed site was unavailable, it would be flagged "red", not deleted. An editor will eventually manually check red links. I'm not sure, but I think if a red from last month has not yet been handled manually, a 2nd failure this month would have the site move to our unreviewed queue. While not the same as auto-delete, this would cause the site to disappear publicly until someone has a chance to check.

-- Rich

geekay

6:30 am on Apr 9, 2004 (gmt 0)

This was not Robozilla. Some investigation and comparing notes with other webmasters led me to the conclusion that this bot is specifically spidering single pages listed on DMOZ. It has not been seen requesting any other pages. But perhaps this is a coincidence due to a sample that is too small. It is also very possible that this bot is using the links in DMOZ for other web research purposes, because the links in DMOZ are such a representative collection.
BTW, my site's link in DMOZ is correct, /dir/.

podman

6:50 am on Apr 9, 2004 (gmt 0)

And this is how the myths about ODP start. A spider from a third party company lost my web site. DMOZ fails again.

hutcheson

6:59 am on Apr 9, 2004 (gmt 0)

Well, obviously anybody -- honest directory purveyor, spammer, mass murderer -- can download the ODP data, and writing a spider is fairly trivial. At least one legitimate user (thumbshots.org) is known to be spidering it periodically.

We have our own link checker, which identifies itself as "robozilla". And there are some editor-written link checkers, which editors can run on a particular category.

But there is no third-party spider that we use. And a 301 redirect won't cause a site to be removed; it will cause the link to be updated.

ncw164x

7:02 am on Apr 9, 2004 (gmt 0)

Please explain this one to me "podman"

Anyone can download the dmoz rdf file which can be used on your website.

Why would someone's link checking, which is totally independent of dmoz, lead to the failure of dmoz?

ncw164x

geekay

7:14 am on Apr 9, 2004 (gmt 0)

*lol*, podman, I think I understood your humour perfectly.
The more I learn about this matter the more convinced I am that this bot is just a legitimate and useful web research spider. It is sufficient for the bot to register how many requests return a 301, how many a 200, etc. It would not, however, be proper for me to disclose the bot's name from my web log. It could be that this usage of DMOZ has bandwidth implications.

g1smd

6:22 pm on Apr 9, 2004 (gmt 0)

Ummm, they might not even be using dmoz.org at all.

They are probably using their own local copy of the RDF file, which is freely available at [rdf.dmoz.org...] as well as older copies in the /archive folder.

hutcheson

11:56 pm on Apr 9, 2004 (gmt 0)

Reading this again: you say "Valid link erroneously reported as broken" ...

But reported to whom? From whom? And how did you intercept it? Exactly what did it say? -- perhaps you misinterpreted it.

Without knowing the details, almost anything that can be said (except the fact that the ODP doesn't use any such thing) is wild-eyed tinfoil-hat mouth-frothing fantasy.

geekay

5:33 am on Apr 10, 2004 (gmt 0)

Reported to: "certain company".
Reported from: their "spider".
Intercept: by checking my "web log", which said my server returned a "301" which was not followed, as it normally should have been.
There is not enough space in the heading to use very precise expressions and English is not my native language. "Broken" link gives, I believed, a good idea of the presumed cause for a site being dropped.

Sorry for being a "wild-eyed tinfoil-hat mouth-frothing" dreamer, but if everybody knew everything there would be no need for this forum. I think it is better that people are encouraged to ask instead of continuing to have fantasies. I am grateful to Rich Franzen, who got my post exactly right and gave a complete and open answer. Now nobody needs to be uninformed about how ODP works.

My own msg #8 contains the explanation of the spider. There is no need to continue the discussion. But it could be added that a research spider programme that omits the end slash, which is included in the link it uses as source, and therefore returns an unwarranted 301 to the statistics it collects, obviously has a flaw. In any case this is not a true 301.

g1smd

6:14 pm on Apr 10, 2004 (gmt 0)

Umm, I still have no idea why you say that they are using the ODP as a source but then say that whoever it was missed off the trailing slash. We already know that the ODP data includes the trailing slash. If they missed it off then they were NOT using ODP data, were they, right?

If it was a third-party company spidering your site then I fail to see why the ODP or ODP editors are involved in any way whatsoever. This is half a story, using unconnected factors, speculation and guesswork, and undoubtedly leads to a wrong conclusion.

Sounds like a non-event to me.

mbauser2

5:15 am on Apr 11, 2004 (gmt 0)

>In any case this is not a true 301.

You need to move past this notion of "a true 301". There is no law of physics or jurisprudence that requires a robot to follow the redirect. For all you know, the bot could have been intentionally programmed to not follow redirects.

geekay

6:34 pm on Apr 12, 2004 (gmt 0)

It is clear enough that this spider was/is indeed using ODP's collection of links as a source for its research tasks. I said that already in my msg #4 and I assure that conclusion is sufficiently well founded. I have, by now, also found the home page of this company. g1smd, in his first message, #9, explained that ODP's RDF file is freely available, and it certainly must be a good source (one of many) for this statistical research.

I said in my msg #8 that this spider obviously collects data for statistical purposes, e.g. how many requests to servers worldwide return a 301. mbauser2 has not read that message of mine. Of course this spider is programmed not to follow 301s, as it just needs to count the number of 301 responses it gets (and 200s etc.). But I did not know all that when I started this thread.

However, it does look like this spider programme shortens links that end with "index.html", e.g. "/dir/index.html", to just /dir, thus itself creating a 301 status. This can be considered a flaw in a research programme, as it biases the statistics the company prepares. I argued additionally that a redirect caused by a missing trailing slash should in no case be counted as a true redirect. Nearly all servers automatically redirect requests for /dir to /dir/ even if there is no mod_rewrite.
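To make the distinction concrete, here is a rough Python sketch of how a statistics-gathering tool could separate slash-only redirects from other 301s. This is purely an illustration and not a description of what the spider in question does; the function name is_slash_only_redirect is hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def is_slash_only_redirect(requested_url, location):
    """Return True if the redirect target equals the requested URL
    plus a trailing slash (e.g. /dir -> /dir/), and nothing else changed."""
    # The Location header may be relative, so resolve it first.
    target = urljoin(requested_url, location)
    req, tgt = urlsplit(requested_url), urlsplit(target)
    same_site = (req.scheme, req.netloc.lower(), req.query) == (
        tgt.scheme,
        tgt.netloc.lower(),
        tgt.query,
    )
    return (
        same_site
        and not req.path.endswith("/")
        and tgt.path == req.path + "/"
    )
```

With this check, a 301 from http://www.example.com/dir to http://www.example.com/dir/ could be tallied separately from a 301 that points to a genuinely different page, so it would not inflate the redirect count.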

The last has, of course, nothing to do with ODP; I simply found it noteworthy. The above "flaw" caused me to post my question in the first instance. Although my posting later proved to be unnecessary, I am disappointed over the discouraging treatment I received here from one fellow member. I find it useless to continue this discussion.

hutcheson

8:53 pm on Apr 12, 2004 (gmt 0)

>I assure that conclusion is sufficiently well founded.

As founded as the assumption that your site was about to be dropped from the ODP?

Sorry, we are interested in the phenomena, but real data will get you a lot further, and a lot more help, than empty assurances and wild assumptions.

RFranzen

6:17 am on Apr 13, 2004 (gmt 0)

Hutch, let it drop, please. Geekay is not attacking, nor has he been. His first post was a confused question, that's all. Since post #4, it seems to me he understands the situation, but several posters seemingly can't get past post #1.

-- Rich

tschild

8:41 pm on Apr 14, 2004 (gmt 0)

One important note to make, to address webmasters' concerns: whether a site is spidered by Robozilla or by one of the several editor-run checking tools, the result of this spidering will not cause the listing to be automatically dropped, i.e. without human intervention. Rather it will at most be marked in place, or listed in a list/report for editors, as a listing that needs to be looked at.

hutcheson

10:00 pm on Apr 14, 2004 (gmt 0)

tschild is correct. However, the converse isn't true. An editor can drop a "non-responding" site without waiting for the autochecker. This isn't REAL likely to happen--I've seen it happen maybe once or twice as a result of webmaster error (returning an odd status as a result of a misconfigured server). Also, editors tend to trim the trailing filename from the URL if that works -- we'd rather have the shorter URL, and give the webmaster the ability to change the home file name, at the expense of a 301. This has occasionally been known to be done even when it broke the URL. Best-practice recommendation: make sure the directory and the index.htm file are both acceptable URLs.
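A webmaster who wants to act on that recommendation could list both forms of a URL and check each one by hand or with a script. The Python sketch below only builds the list; the helper name url_variants and the set of index file names are assumptions made for this example, not anything the ODP uses.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed index file names; adjust to whatever your server uses.
INDEX_NAMES = ("index.htm", "index.html")


def url_variants(url):
    """Return the URL as given plus, if it ends in an index file,
    the trimmed directory form (with its trailing slash kept)."""
    parts = urlsplit(url)
    directory, _, filename = parts.path.rpartition("/")
    variants = [url]
    if filename in INDEX_NAMES:
        trimmed = parts._replace(path=directory + "/")
        variants.append(urlunsplit(trimmed))
    return variants
```

For http://www.example.com/dir/index.htm this yields the original URL and http://www.example.com/dir/. Both should be requested and both should end up at a working page.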