Forum Moderators: open

Message Too Old, No Replies

ODP/dmoz.org link checker

pmoz bot does not read robots.txt


Umbra

1:04 pm on Jul 24, 2008 (gmt 0)

10+ Year Member



From this forum thread:
http://resource-zone.com/forum/showthread.php?t=51786&page=2

Apparently, Pmoz is one of two major bots that check websites listed in the ODP. It comes from a shared hosting range, and it does NOT read robots.txt. If you feed it a 403 for any reason, then after 2-3 visits your website will be automatically removed from the ODP until an editor manually reviews it (which can take weeks or months).

The developer(s) of this bot indicate that they don't intend to change the way any of this works.

The thread also hints at link-checking tools disguised as browsers. The same removal process may apply if these stealth tools are denied access to your site (i.e., guilty until proven innocent). However, this has not been confirmed by the editors.

[edited by: incrediBILL at 5:14 pm (utc) on July 24, 2008]
[edit reason] cleaned up link [/edit]

incrediBILL

5:10 pm on Jul 24, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A link checker isn't a spider so I don't think robots.txt is appropriate.

I run a directory and also link check without checking robots.txt.

My members have submitted the page they want listed, so it's obviously approved to be checked in advance by the mere act of submission into my directory.

If my link checker fails to check your page, just like DMOZ, your listing gets dropped.

Assuming you wish to stay in DMOZ's listings, drop the 403 and forget about robots.txt

Umbra

6:56 pm on Jul 24, 2008 (gmt 0)

10+ Year Member



This was just intended as a heads-up. Personally, I'm undecided on the etiquette of robots.txt and link checkers. I generally do block many shared hosts, thanks to scrapers and all... who knew shared hosting would host something as important as an ODP checker?

janharders

7:11 pm on Jul 24, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



yeah, that truly is a little weird and unprofessional.

Concerning link checkers and robots.txt: since a link checker should hit a site exactly once per link it checks and should not index anything (other than, say, a checksum or the IP) but merely see whether the request succeeds (it might even do a HEAD), I'd be fine with it not honoring robots.txt. As soon as it starts following links on my site, it's not a checker anymore but a robot, and it had better obey my wishes.
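A polite link checker of the kind described above can be sketched in Python. This is purely a hypothetical illustration (pmoz's actual implementation is not public); the user-agent string here is made up, but it follows the convention of identifying the bot and pointing at an info page:

```python
import urllib.request
import urllib.error

def check_link(url, timeout=10):
    """Issue a single HEAD request and report whether the link is alive.

    Returns (status_code, ok). Makes exactly one request per URL and
    never follows links found in the page, so it indexes nothing.
    """
    req = urllib.request.Request(url, method="HEAD", headers={
        # Identify honestly so site owners can recognize and whitelist the bot.
        "User-Agent": "ExampleLinkChecker/1.0 (+http://example.com/botinfo)",
    })
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, 200 <= resp.status < 400
    except urllib.error.HTTPError as e:
        return e.code, False   # e.g. 403 or 404: the listing is at risk
    except urllib.error.URLError:
        return None, False     # DNS failure, timeout, connection refused
```

The key design point is the one janharders makes: one HEAD request per listed URL, no crawling of discovered links, so there is arguably nothing for robots.txt to govern.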

Umbra

8:28 pm on Jul 24, 2008 (gmt 0)

10+ Year Member



My members have submitted the page they want listed, so it's obviously approved to be checked in advance by the mere act of submission into my directory.

Just playing devil's advocate here... How about directories in which the owner did not purposefully submit the site? Supposedly ODP editors add sites of their own volition. So the site owners don't always expect the link checker. Or, they don't know how the link checker will look and/or where it will come from.

If my link checker fails to check your page, just like DMOZ, your listing gets dropped

Continuing the devil's advocacy... imagine that your link checker had a user agent of libwww or Xenu, or it came from a spammy colo, or it's missing some important headers. You send your link checker over to wilderness' paranoid website. Your link checker gets blocked automatically by user agent or IP address. Listing gets dropped. Meanwhile, 99% of humans can still access Wilderness's website.

Or you employ a citizen of another country and his ISP is spammer friendly. You run a directory intended for the UK and the US. Your employee with his spammy foreign IP can't access Wilderness' paranoid website. Listing gets dropped. Meanwhile, 99% of the UK and US can still see the site.

Tough luck? Probably. But I'm sure Google has many stealth bots going around, and it doesn't automatically drop sites from its index after a 403. Google is pragmatic enough to realize that a blocked stealth checker doesn't mean the site is cloaking; it just means the website's security system is doing its job. Could you imagine the uproar if it were otherwise?

wilderness

9:41 pm on Jul 24, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Continuing the devil's advocacy... imagine that your link checker had a user agent of libwww or Xenu, or it came from a spammy colo, or it's missing some important headers. You send your link checker over to wilderness' paranoid website. Your link checker gets blocked automatically by user agent or IP address. Listing gets dropped. Meanwhile, 99% of humans can still access Wilderness's website.

Or you employ a citizen of another country and his ISP is spammer friendly. You run a directory intended for the UK and the US. Your employee with his spammy foreign IP can't access Wilderness' paranoid website. Listing gets dropped. Meanwhile, 99% of the UK and US can still see the site.

I'm not sure if I've reached some absurd status quo or should take this as a back-handed compliment.

In any event it's "website (s)" ;)

BTW, link checkers are out, with the exception of two widget orgs.

Samizdata

10:20 pm on Jul 24, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Robozilla is the only link checker mentioned by DMOZ themselves as far as I am aware.

[dmoz.org...]

If the Pmoz checker removes listings automatically then it should be officially stated by DMOZ so that those who submit to the directory will know what to expect and can allow it access.

Perhaps I missed it, but anything less would seem unreasonable.

...

incrediBILL

11:32 pm on Jul 24, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




My paranoid site blocks link checkers although it runs one, how's that for hypocrisy? ;)

On one hand I need to keep my content fresh, yet on the other I need to keep my bandwidth down, so others doing the very same thing I'm doing get slammed.

I don't have a good solution, but fewer than 100 sites out of 37K listings seem to snare my link checker, and most of those are on one specific host that appears to run a service-wide bot blocker that's blocking my hosting company, not my user agent.

FWIW, it's not just HTTP errors like 400s and 500s that good link checkers evaluate. The page content is also examined, since many improperly configured sites serve soft 404s: pages ending in "400error.asp" that still return a "200 OK". Then there's a whole profile of domain-park pages, virus pages, and a lot more.
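The content checks described above can be sketched in Python. The patterns below are hypothetical heuristics for illustration only; a production link checker would use a far larger, tuned profile:

```python
import re

# Hypothetical heuristics: signals that a "200 OK" page is really
# an error page or a parked domain. Real checkers tune these lists.
ERROR_PATTERNS = [
    r"page not found",
    r"404 error",
    r"no longer available",
]
PARKED_PATTERNS = [
    r"this domain (?:is|may be) for sale",
    r"buy this domain",
]

def looks_like_soft_error(final_url, status, body):
    """Return True if a response still looks like a dead listing."""
    if status != 200:
        return True                  # hard HTTP error (4xx/5xx)
    lowered = body.lower()
    if re.search(r"\d{3}error\.asp", final_url.lower()):
        return True                  # redirected to an error-page URL
    if any(re.search(p, lowered) for p in ERROR_PATTERNS):
        return True                  # soft 404 served with 200 OK
    if any(re.search(p, lowered) for p in PARKED_PATTERNS):
        return True                  # domain-park page
    return False
```

Note this check requires a full GET of the body, unlike a status-only checker that could get away with a HEAD request.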

Anyway, if someone runs a link checker from a spam host it deserves to get blocked IMO just for using such a shady location.

The problem you have is spam hosts tend to be cheap and most people are unaware of their reputation so once again it's the innocents suffering thanks to the actions of others.

wilderness

11:39 pm on Jul 24, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My paranoid site blocks link checkers although it runs one, how's that for hypocrisy? wink

Bill,
Once a year I use Xenu to check my sites' links for errors, and for that run I comment out the deny line.

My worst fear is the slim chance that another Xenu will appear during the short interim ;)

Don

incrediBILL

12:23 am on Jul 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My worst fear is the slim chance that another Xenu will appear during the short interim

I have a copy of Xenu and modified the binary (just typed in a new UA) to use a unique name for checking certain aspects of my bot blocker.

incrediBILL

12:32 am on Jul 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



For those interested in specifics, here are the details from when PMOZ visited my site from 1&1 Internet:

74.208.16.* "Mozilla/5.0 (compatible; pmoz.info ODP link checker; +http://pmoz.info/doc/botinfo.htm)"

On their page they claim:

Visits from this process generally come from IPs 74.208.25.118 or 216.15.74.85.

Neither of which matches the IP that hit my site, so either that info is outdated or they use other IPs as well.
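For anyone who wants to keep a blanket block on shared-hosting ranges but still let this checker through, one approach is to exempt its user-agent before the IP rule fires. A Python sketch under stated assumptions: the UA substring comes from the log line above, and the "known" IPs are the two pmoz documents, which (as just noted) may be outdated:

```python
import ipaddress

# UA substring from the log line above; documented IPs from pmoz's
# botinfo page, plus the /24 actually observed hitting the site.
PMOZ_UA_MARKER = "pmoz.info ODP link checker"
PMOZ_KNOWN_IPS = {"74.208.25.118", "216.15.74.85"}
PMOZ_OBSERVED_NET = ipaddress.ip_network("74.208.16.0/24")

def should_block(remote_ip, user_agent, blocked_nets):
    """Decide whether to 403 a request, exempting the ODP checker first."""
    ip = ipaddress.ip_address(remote_ip)
    if PMOZ_UA_MARKER in user_agent:
        # Trust the UA only if the source IP is plausibly pmoz's,
        # since user-agents are trivially forged.
        if remote_ip in PMOZ_KNOWN_IPS or ip in PMOZ_OBSERVED_NET:
            return False
    return any(ip in net for net in blocked_nets)
```

The same logic could of course live in an .htaccess allow/deny pair; the point is only that the exemption has to be evaluated before the shared-hosting range block.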

Umbra

12:51 am on Jul 25, 2008 (gmt 0)

10+ Year Member



If the Pmoz checker removes listings automatically then it should be officially stated by DMOZ so that those who submit to the directory will know what to expect and can allow it access.

Perhaps I missed it, but anything less would seem unreasonable.

The developer of the bot has said the "two highest-impact bots ODP uses [are] the official robozilla bot, and my pmoz.info bot"

The problem you have is spam hosts tend to be cheap and most people are unaware of their reputation

I don't know how spammy that host is, but the developer wrote, "Yes, my tools are hosted on a shared server. I cannot control what trouble my fellow hostees get into".

[edited by: Umbra at 12:53 am (utc) on July 25, 2008]

dstiles

2:28 am on Jul 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



And it uses a .info TLD? I thought only spammers used those!

Whatever, with variable and suspect IPs it's blocked.

I thought dmoz had died anyway. I was unable to update entries for a couple of years because of a duff submission form so I just gave up on it. As far as I can tell the only serious user of dmoz now is google, and right now even that's wrong.

Samizdata

9:13 am on Jul 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



the official robozilla bot, and my pmoz.info bot

So in order to stay listed in DMOZ you have to cater for an unofficial robot that is not mentioned anywhere on the DMOZ site and which seems to be the personal project of an individual admin.

I don't recall this being mentioned when you submit a site, but then I haven't bothered lately.

...

Umbra

8:36 pm on Jul 25, 2008 (gmt 0)

10+ Year Member



As far as I can tell the only serious user of dmoz now is google, and right now even that's wrong.

Over on the Google forum, there's a thread about a PageRank update. So I checked out the Google Directory. I don't remember how it was before, but now many directory pages have a low or zero PageRank. Over on Dmoz, too, many subcategories show low PageRank. The update may still be in flux, but it would be interesting to keep an eye on this. Assuming Google applies its algorithms equally across the board, perhaps this is a sign that directories ARE becoming less important, which would be a relief to anyone at the mercy of a bot like pmoz.

dstiles

8:43 pm on Jul 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Because google is currently taking the description for one of our home pages from dmoz, despite it never having done so before, we have just re-submitted the correct description to dmoz (yes, we now use noodp). (Oddly, google only gets the description wrong for SOME keyword sets. For others it gets it right - go figure!)

In light of pmoz, which we now ban, it will be interesting to see what happens next. Still, not really holding breath over a dmoz update anyway - human editor for every one of millions of sites? Yeah, right!

janharders

8:50 pm on Jul 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



dstiles, iirc, you can tell google whether it should take dmoz's description for the serps.

<meta name="robots" content="noodp" />

should do its magic.

edit: oops, yeah, should have read more carefully :)