Forum Moderators: open
http://resource-zone.com/forum/showthread.php?t=51786&page=2
Apparently, pmoz is one of the two major bots that check websites listed in the ODP. It comes from a shared hosting range, and it does NOT read robots.txt. If you feed it a 403 for any reason, then after 2-3 visits your website will be automatically removed from the ODP until an editor manually reviews it (which can take weeks or months).
The developer(s) of this bot indicate that they don't intend to change the way any of this works.
The thread also hints at link-checking tools disguised as browsers. The same removal process may apply if these stealth tools are denied access to your site (i.e., guilty until proven innocent). However, this has not been confirmed by the editors.
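For contrast, a well-behaved checker consults robots.txt before fetching anything. A minimal sketch using Python's standard `urllib.robotparser` (the rule set and bot names are made up for illustration, and the function takes the robots.txt text directly so no network access is needed):

```python
# Hedged sketch: how a polite checker would honor robots.txt.
# pmoz reportedly skips this step entirely.
import urllib.robotparser

def allowed_to_fetch(robots_txt_lines, user_agent, url):
    """Parse robots.txt content and report whether user_agent may fetch url."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt_lines)  # accepts an iterable of lines
    return rp.can_fetch(user_agent, url)

# Illustrative rule set (not pmoz's, not any real site's):
rules = [
    "User-agent: BadBot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /private/",
]
```

A checker that ran this test first would simply skip blocked URLs instead of triggering a 403 and a delisting.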
[edited by: incrediBILL at 5:14 pm (utc) on July 24, 2008]
[edit reason] cleaned up link [/edit]
I run a directory and also link check without checking robots.txt.
My members have submitted the page they want listed, so it's obviously approved to be checked in advance by the mere act of submission into my directory.
If my link checker fails to check your page, just like DMOZ, your listing gets dropped.
Assuming you wish to stay in DMOZ's listings, drop the 403 and forget about robots.txt
Concerning link checkers and robots.txt: since a link checker should hit exactly once per link it checks, and should not index anything (other than, say, a checksum or the IP) but merely see whether the request succeeds (it might even do a HEAD), I'd be fine with it not honoring robots.txt. As soon as it starts following links on my site, it's no longer a checker but a robot, and it had better obey my wishes.
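The "hit exactly once, record success or failure, follow nothing" behaviour described above can be sketched as one HEAD request per URL. The user agent string and timeout are illustrative assumptions, not pmoz's actual values:

```python
# Hedged sketch of a non-crawling link checker: a single HEAD request,
# returning only the status code. It follows no links on the page.
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def check_link(url, timeout=10):
    """Return the HTTP status for one HEAD request, or None on network failure."""
    req = Request(url, method="HEAD",
                  headers={"User-Agent": "ExampleLinkChecker/1.0"})  # made-up UA
    try:
        with urlopen(req, timeout=timeout) as resp:
            return resp.status
    except HTTPError as e:
        return e.code   # e.g. 403 or 404: the listing would be flagged
    except URLError:
        return None     # DNS/connection failure
```

Whether a 403 here should drop a listing automatically is exactly the dispute in this thread.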
My members have submitted the page they want listed, so it's obviously approved to be checked in advance by the mere act of submission into my directory.
Just playing devil's advocate here... How about directories in which the owner did not purposefully submit the site? Supposedly ODP editors add sites of their own volition. So the site owners don't always expect the link checker. Or, they don't know how the link checker will look and/or where it will come from.
If my link checker fails to check your page, just like DMOZ, your listing gets dropped
Continuing the devil's advocacy... imagine that your link checker has a user agent of libwww or Xenu, or it comes from a spammy colo, or it's missing some important headers. You send your link checker over to wilderness's paranoid website. Your link checker gets blocked automatically by user agent or IP address. Listing gets dropped. Meanwhile, 99% of humans can still access wilderness's website.
Or you employ a citizen of another country and his ISP is spammer-friendly. You run a directory intended for the UK and the US. Your employee with his spammy foreign IP can't access wilderness's paranoid website. Listing gets dropped. Meanwhile, 99% of the UK and US can still see the site.
Tough luck? Probably. But I'm sure that Google has many stealth bots going around and they don't automatically drop sites from Google's index after a 403, as Google is pragmatic enough to realize that just because their stealth checker is blocked, it doesn't mean the site is cloaking. It just means the website's security system is doing its job. Could you imagine the uproar if it were otherwise?
I'm not sure if I've reached some absurd status quo or should take this as a back-handed compliment.
In any event it's "website (s)" ;)
BTW, link checkers are out, with the exception of two widget orgs.
[dmoz.org...]
If the Pmoz checker removes listings automatically then it should be officially stated by DMOZ so that those who submit to the directory will know what to expect and can allow it access.
Perhaps I missed it, but anything less would seem unreasonable.
...
On the one hand I need to keep my content fresh, yet on the other I need to keep my bandwidth down, so others doing the very same thing I'm doing get slammed.
I don't have a good solution but less than 100 sites seem to snare my link checker out of 37K listings, and most of it is one specific host that seems to have their own service-wide bot blocker that's blocking my hosting company, not my user agent.
FWIW, it's not just HTTP errors like 400s and 500s that good link checkers evaluate. The page content is also examined, as many improperly configured sites have soft 404 errors or pages that end in "400error.asp" yet return a "200 OK". Not to mention a large profile of domain-park pages, viruses, etc., and a whole lot more.
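A hedged sketch of that content check: a response can say "200 OK" yet still be a soft 404 or a parked domain, so the body gets scanned for tell-tale phrases. The phrase list here is purely illustrative; a real checker would maintain a much larger profile:

```python
# Hedged sketch: flag "200 OK" responses whose content betrays an error
# or a parked domain. Phrases are illustrative examples only.
SOFT_404_PHRASES = (
    "page not found",
    "404 error",
    "no longer available",
    "this domain is for sale",   # typical domain-park giveaway
)

def looks_like_soft_404(status, body):
    """Flag responses that claim success but whose content says otherwise."""
    if status != 200:
        return False  # a real HTTP error is handled by the status check
    text = body.lower()
    return any(phrase in text for phrase in SOFT_404_PHRASES)
```

The status guard matters: a genuine 404 is already caught, so this check only fires on pages that misreport success.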
Anyway, if someone runs a link checker from a spam host it deserves to get blocked IMO just for using such a shady location.
The problem you have is spam hosts tend to be cheap and most people are unaware of their reputation, so once again it's the innocents suffering thanks to the actions of others.
74.208.16.* "Mozilla/5.0 (compatible; pmoz.info ODP link checker; +http://pmoz.info/doc/botinfo.htm)"
On their page they claim:
Visits from this process generally come from IPs 74.208.25.118 or 216.15.74.85.
Those weren't the IPs that hit my site, so either the info is outdated or they use other IPs as well.
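A quick way to test whether a visiting address matches either the documented pmoz IPs or the 74.208.16.* range observed in the logs above, using Python's standard `ipaddress` module (the function name is my own, not part of any blocking tool):

```python
# Hedged sketch: match a visitor's IP against the documented pmoz
# addresses and the 74.208.16.* range seen in this thread's logs.
import ipaddress

DOCUMENTED = {ipaddress.ip_address("74.208.25.118"),
              ipaddress.ip_address("216.15.74.85")}
OBSERVED_NET = ipaddress.ip_network("74.208.16.0/24")

def is_pmoz_ip(addr):
    """True if addr matches a known or observed pmoz source address."""
    ip = ipaddress.ip_address(addr)
    return ip in DOCUMENTED or ip in OBSERVED_NET
```

Of course, as the post notes, the published list may already be outdated, which is exactly what makes IP-based allowlisting for this bot unreliable.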
The developer of the bot has said the "two highest-impact bots ODP uses [are] the official robozilla bot, and my pmoz.info bot"
The problem you have is spam hosts tend to be cheap and most people are unaware of their reputation
I don't know how spammy that host is, but the developer wrote, "Yes, my tools are hosted on a shared server. I cannot control what trouble my fellow hostees get into".
[edited by: Umbra at 12:53 am (utc) on July 25, 2008]
Whatever, with variable and suspect IPs it's blocked.
I thought dmoz had died anyway. I was unable to update entries for a couple of years because of a duff submission form so I just gave up on it. As far as I can tell the only serious user of dmoz now is google, and right now even that's wrong.
the official robozilla bot, and my pmoz.info bot
So in order to stay listed in DMOZ you have to cater for an unofficial robot that is not mentioned anywhere on the DMOZ site and which seems to be the personal project of an individual admin.
I don't recall this being mentioned when you submit a site, but then I haven't bothered lately.
...
As far as I can tell the only serious user of dmoz now is google, and right now even that's wrong.
Over on the Google forum, there's a thread about a PageRank update. So I checked out the Google Directory. I don't remember how it was before, but now many directory pages have a low or zero PageRank. Then over on Dmoz, also low PageRanks for many subcategories. The PR update may still be in flux, but it would be interesting to keep an eye on this. Assuming that Google applies its algorithms equally across the board, then perhaps this is a sign that directories ARE less important. Which would be a relief to anyone at the mercy of a bot like pmoz.
In light of pmoz, which we now ban, it will be interesting to see what happens next. Still, I'm not really holding my breath over a dmoz update anyway - a human editor for every one of millions of sites? Yeah, right!