
Directories Forum

    
Will site be dropped from DMOZ?
Valid link erroneously reported as broken
geekay




msg:483364
 5:39 am on Apr 9, 2004 (gmt 0)

DMOZ is obviously using the spidering services of a certain company (the name of which is not important here) to periodically check the validity of all the links in the directories.

On my server all requests for www.example.com/dir return a 301 and are redirected to www.example.com/dir/. I now notice in my web log that this particular spider requested /dir but did not follow the redirect. It did not wait for the status 200 page; it simply quit.
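The behaviour described above can be reproduced with a small offline model. This is only a sketch: `toy_server` and `crawl` are hypothetical names, and the server is a plain function standing in for a real web server that answers /dir with a 301 pointing to /dir/. A client that does not follow redirects records exactly one request with status 301, which is what shows up in the log.

```python
from typing import Callable, List, Optional, Tuple

# (status code, Location header or None)
Response = Tuple[int, Optional[str]]


def toy_server(path: str) -> Response:
    """Hypothetical stand-in for a server that adds the trailing slash via 301."""
    if path == "/dir":
        return 301, "/dir/"
    if path == "/dir/":
        return 200, None
    return 404, None


def crawl(path: str,
          server: Callable[[str], Response],
          follow_redirects: bool,
          max_hops: int = 5) -> List[Tuple[str, int]]:
    """Return every (path, status) pair the client would leave in the log."""
    seen: List[Tuple[str, int]] = []
    for _ in range(max_hops):
        status, location = server(path)
        seen.append((path, status))
        if follow_redirects and status in (301, 302) and location:
            path = location  # make a second request for the redirect target
            continue
        break
    return seen


no_follow = crawl("/dir", toy_server, follow_redirects=False)
follow = crawl("/dir", toy_server, follow_redirects=True)
```

With `follow_redirects=False` the log holds only the 301 for /dir; with `follow_redirects=True` a second entry, the 200 for /dir/, follows it.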

My question is, could this result in my site being automatically dropped from DMOZ, or will the validity of the link be manually checked before excluding?

 

ncw164x




msg:483365
 5:51 am on Apr 9, 2004 (gmt 0)

What makes you think that it is another company's link checker? They have their own spider, called Robozilla.

ncw164x

RFranzen




msg:483366
 6:03 am on Apr 9, 2004 (gmt 0)

Like ncw said, we don't use a third party's spider.

Assuming that our robozilla did decide that a listed site was unavailable, it would be flagged "red", not deleted. An editor will eventually manually check red links. I'm not sure, but I think if a red from last month has not yet been handled manually, a 2nd failure this month would have the site move to our unreviewed queue. While not the same as auto-delete, this would cause the site to disappear publicly until someone has a chance to check.

-- Rich

geekay




msg:483367
 6:30 am on Apr 9, 2004 (gmt 0)

This was not Robozilla. Some investigation and comparing notes with other webmasters led me to the conclusion that this bot is specifically spidering single pages listed on DMOZ. It has not been seen requesting any other pages. But perhaps this is a coincidence due to a sample that is too small. It is also very possible that this bot is using the links in DMOZ for other web research purposes, because the links in DMOZ are such a representative collection.
BTW, my site's link in DMOZ is correct, /dir/.

podman




msg:483368
 6:50 am on Apr 9, 2004 (gmt 0)

And this is how the myths about ODP start. A spider from a third party company lost my web site. DMOZ fails again.

hutcheson




msg:483369
 6:59 am on Apr 9, 2004 (gmt 0)

Well, obviously anybody -- honest directory purveyor, spammer, mass murderer -- can download the ODP data, and writing a spider is fairly trivial. At least one legitimate user (thumbshots.org) is known to be spidering it periodically.

We have our own link checker, which identifies itself as "robozilla". And there are some editor-written link checkers, which editors can run on a particular category.

But there is no third-party spider that we use. And a 301 redirect won't cause a site to be removed; it will cause the link to be updated.

ncw164x




msg:483370
 7:02 am on Apr 9, 2004 (gmt 0)

Please explain this one to me "podman"

Anyone can download the dmoz rdf file which can be used on your website.

Why does someone link checking which is totally independent of dmoz lead to the failure of dmoz?

ncw164x

geekay




msg:483371
 7:14 am on Apr 9, 2004 (gmt 0)

*lol*, podman, I think I understood your humour perfectly.
The more I learn about this matter the more convinced I am that this bot is just a legitimate and useful web research spider. It is sufficient for the bot to register how many requests return a 301, how many a 200, etc. It would not, however, be proper for me to disclose the bot's name from my web log. It could be that this usage of DMOZ has bandwidth implications.
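As a rough illustration of that kind of tally, the sketch below counts status codes per user agent from web log lines. It assumes the common "combined" log layout, and the bot name `ExampleResearchBot` and the sample lines are invented placeholders, not the real bot or real log entries.

```python
import re
from collections import Counter

# Matches the quoted request and the status code that follows it,
# e.g. "GET /dir HTTP/1.0" 301
LOG_LINE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def tally_statuses(lines, agent_substring):
    """Count response codes for log lines whose text mentions the given agent."""
    counts = Counter()
    for line in lines:
        if agent_substring not in line:
            continue
        match = LOG_LINE.search(line)
        if match:
            counts[int(match.group("status"))] += 1
    return counts


# Hypothetical sample lines in combined log format.
SAMPLE_LOG = [
    '192.0.2.1 - - [09/Apr/2004:05:12:01 +0000] "GET /dir HTTP/1.0" 301 236 "-" "ExampleResearchBot/1.0"',
    '192.0.2.7 - - [09/Apr/2004:05:13:44 +0000] "GET /dir/ HTTP/1.0" 200 5120 "-" "Mozilla/4.0"',
    '192.0.2.1 - - [09/Apr/2004:05:15:09 +0000] "GET /other/ HTTP/1.0" 200 812 "-" "ExampleResearchBot/1.0"',
]

bot_counts = tally_statuses(SAMPLE_LOG, "ExampleResearchBot")
```

Here `bot_counts` holds one 301 and one 200 for the placeholder bot; the browser request is ignored.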

g1smd




msg:483372
 6:22 pm on Apr 9, 2004 (gmt 0)

Ummm, they might not even be using dmoz.org at all.

They are probably using their own local copy of the RDF file, which is freely available at [rdf.dmoz.org...] as well as older copies in the /archive folder.
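For anyone curious what working from a local copy might look like, here is a minimal sketch that pulls listed URLs out of RDF text. It assumes each listing appears as an `ExternalPage` element with an `about` attribute; the sample snippet is made up for illustration and is not taken from the real dump. A regular expression is used instead of a full XML parser purely to keep the sketch short.

```python
import re

# Assumption: each listing looks like <ExternalPage about="URL"> in the dump.
EXTERNAL_PAGE = re.compile(r'<ExternalPage\s+about="([^"]+)"')


def listed_urls(rdf_text):
    """Return the URLs of all ExternalPage entries, in document order."""
    return EXTERNAL_PAGE.findall(rdf_text)


# Hypothetical excerpt, for illustration only.
SAMPLE_RDF = """
<ExternalPage about="http://www.example.com/dir/">
  <d:Title>Example directory page</d:Title>
  <topic>Top/Example</topic>
</ExternalPage>
<ExternalPage about="http://www.example.org/">
  <d:Title>Another example</d:Title>
  <topic>Top/Example</topic>
</ExternalPage>
"""

urls = listed_urls(SAMPLE_RDF)
```

A spider fed from such a list requests exactly the URLs as listed, trailing slashes included, unless it rewrites them itself.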

hutcheson




msg:483373
 11:56 pm on Apr 9, 2004 (gmt 0)

Reading this again: you say "Valid link erroneously reported as broken" ...

But reported to whom? from whom? and how did you intercept it? Exactly what did it say? -- perhaps you misinterpreted it.

Without knowing the details, almost anything that can be said (except the fact that the ODP doesn't use any such thing) is wild-eyed tinfoil-hat mouth-frothing fantasy.

geekay




msg:483374
 5:33 am on Apr 10, 2004 (gmt 0)

Reported to: "certain company".
Reported from: their "spider".
Intercept: by checking my "web log", which said my server returned a "301" which was not followed, as it normally should have been.
There is not enough space in the heading to use very precise expressions and English is not my native language. "Broken" link gives, I believed, a good idea of the presumed cause for a site being dropped.

Sorry for being a "wild-eyed tinfoil-hat mouth-frothing" dreamer, but if everybody knew everything there would be no need for this forum. I think it is better that people are encouraged to ask instead of continuing to have fantasies. I am grateful to Rich Franzen, who got my post exactly right and gave a complete and open answer. Now nobody needs to be uninformed about how ODP works.

My own msg #8 contains the explanation of the spider. There is no need to continue the discussion. But it could be added that a research spider programme that omits the end slash, which is included in the link it uses as source, and therefore returns an unwarranted 301 to the statistics it collects, obviously has a flaw. In any case this is not a true 301.

g1smd




msg:483375
 6:14 pm on Apr 10, 2004 (gmt 0)

Umm, I still have no idea why you say that they are using the ODP as a source but then say that whoever it was missed off the trailing slash. We already know that the ODP data includes the trailing slash. If they missed it off then they were NOT using ODP data, were they, right?

If it was a third-party company spidering your site then I fail to see why the ODP or ODP editors are involved in any way whatsoever. This is half a story, using unconnected factors, speculation and guesswork, and undoubtedly leads to a wrong conclusion.

Sounds like a non-event to me.

mbauser2




msg:483376
 5:15 am on Apr 11, 2004 (gmt 0)

>In any case this is not a true 301.

You need to move past this notion of "a true 301". There is no law of physics or jurisprudence that requires a robot follow the redirect. For all you know, the bot could have been intentionally programmed to not follow redirects.

geekay




msg:483377
 6:34 pm on Apr 12, 2004 (gmt 0)

It is clear enough that this spider was/is indeed using ODP's collection of links as a source for its research tasks. I said that already in my msg #4 and I assure that conclusion is sufficiently well founded. I have, by now, also found the home page of this company. g1smd, in his first message, #9, explained that ODP's RDF file is freely available, and it certainly must be a good source (one of many) for this statistical research.

I said in my msg #8 that it is obvious that this spider collects data for statistical purposes, e.g. how many requests to servers worldwide return a 301. mbauser2 has not read that message of mine. Of course this spider is programmed not to follow 301's, as it just needs to count the number of 301 responses it gets (and 200's etc). But I did not know all that when I started this thread.

However, it does look like this spider programme shortens links coded as ending with "index.html", e.g. "/dir/index.html", to just /dir, thus creating itself a 301 status. This can be considered a flaw in a research programme, as it causes a bias in the statistics the company prepares. I argued additionally that a redirect caused by a missing trailing slash should in no case be counted as a true redirect. Nearly all servers automatically redirect requests for page /dir to page /dir/ even if there is no mod_rewrite.
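The difference between the two ways of shortening such a link can be sketched as follows. The function name `trim_index` and the list of index file names are assumptions for illustration; nothing is known about how the spider in question actually rewrites URLs.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: these are the file names treated as a directory's default page.
INDEX_NAMES = ("index.html", "index.htm")


def trim_index(url, keep_slash=True):
    """Drop a trailing index file name from a URL.

    keep_slash=True  -> /dir/index.html becomes /dir/  (no redirect expected)
    keep_slash=False -> /dir/index.html becomes /dir   (typically answered with a 301)
    """
    parts = urlsplit(url)
    head, _, tail = parts.path.rpartition("/")
    if tail not in INDEX_NAMES:
        return url  # nothing to trim
    new_path = head + "/" if keep_slash else head
    return urlunsplit(parts._replace(path=new_path))


with_slash = trim_index("http://www.example.com/dir/index.html")
without_slash = trim_index("http://www.example.com/dir/index.html", keep_slash=False)
```

`with_slash` is http://www.example.com/dir/ and `without_slash` is http://www.example.com/dir; only the second form would trigger the trailing-slash redirect on a typical server.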

The last point has, of course, nothing to do with ODP; I simply found it noteworthy. The above "flaw" caused me to post my question in the first place. Although my posting later proved to be unnecessary, I am disappointed by the discouraging treatment I received here from one fellow member. I find it useless to continue this discussion.

hutcheson




msg:483378
 8:53 pm on Apr 12, 2004 (gmt 0)

>I assure that conclusion is sufficiently well founded.

As founded as the assumption that your site was about to be dropped from the ODP?

Sorry, we are interested in the phenomenon, but real data will get you a lot further, and a lot more help, than empty assurances and wild assumptions.

RFranzen




msg:483379
 6:17 am on Apr 13, 2004 (gmt 0)

Hutch, let it drop, please. Geekay is not attacking, nor has he been. His first post was a confused question, that's all. Since post #4, it seems to me he understands the situation, but several posters seemingly can't get past post #1.

-- Rich

tschild




msg:483380
 8:41 pm on Apr 14, 2004 (gmt 0)

One important note to make, to address webmasters' concerns: whether a site is spidered by Robozilla or by one of the several editor-run checking tools, the result of this spidering will not cause the listing to be automatically dropped, i.e. without human intervention. Rather it will at most be marked in place, or listed in a list/report for editors, as a listing that needs to be looked at.

hutcheson




msg:483381
 10:00 pm on Apr 14, 2004 (gmt 0)

tschild is correct. However, the converse isn't true. An editor can drop a "non-responding" site without waiting for the autochecker. This isn't REAL likely to happen--I've seen it happen maybe once or twice as a result of webmaster error (returning an odd status because of a misconfigured server). Also, editors tend to trim the trailing filename from the URL if that works -- we'd rather have the shorter URL, and give the webmaster the ability to change the home file name, at the expense of a 301. This has occasionally been known to be done even when it broke the URL. Best-practice recommendation: make sure the directory and the index.htm file are both acceptable URLs.
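A webmaster could check that recommendation with something like the sketch below. It is illustrative only: `check_listing_forms` is a hypothetical helper, the set of "acceptable" status codes is an assumption, and the status lookup is passed in as a function so the example runs without touching the network.

```python
# Assumption: a form is "acceptable" if it answers 200 or redirects (301/302).
ACCEPTABLE = {200, 301, 302}


def check_listing_forms(dir_url, fetch_status, index_name="index.htm"):
    """Report which URL forms of one directory page answer acceptably.

    fetch_status is any callable mapping a URL to an HTTP status code.
    """
    if not dir_url.endswith("/"):
        dir_url += "/"
    forms = {
        "directory": dir_url,               # http://host/dir/
        "no_slash": dir_url.rstrip("/"),    # http://host/dir
        "index_file": dir_url + index_name,  # http://host/dir/index.htm
    }
    return {name: fetch_status(url) in ACCEPTABLE for name, url in forms.items()}


# Hypothetical responses standing in for a real server.
FAKE_RESPONSES = {
    "http://www.example.com/dir/": 200,
    "http://www.example.com/dir": 301,
    "http://www.example.com/dir/index.htm": 404,
}

report = check_listing_forms(
    "http://www.example.com/dir/",
    lambda url: FAKE_RESPONSES.get(url, 404),
)
```

In this made-up case the directory forms pass but the index.htm form does not, so a listing trimmed or expanded to that file name would break.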
