DMOZ Robots.txt

Forum Moderators: goodroi

Crawling no longer allowed

Yidaki

3:31 pm on Oct 14, 2003 (gmt 0)

Just tried to crawl a small category of the dmoz and surprise - they now disallow all bots except some of the big boys (another surprise, Fast isn't big enough ;).

That's new. Obviously they try to save server load. Now i have to either email them or load/extract the rdf ... sigh.

>To be added to our list of allowed robots, please email the staff programmer

You think they let a small, new niche search engine crawl a small category as a starting point?
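For anyone who wants to check up front whether their bot is on the allowed list, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text below is a made-up stand-in in the spirit of what is described above (one named crawler allowed, everyone else disallowed), not the actual dmoz file, and `MyNicheBot` is a hypothetical user agent.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, NOT the real dmoz file: one named crawler is
# allowed everywhere, every other user agent is shut out.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "MyNicheBot"):
        allowed = can_crawl(SAMPLE_ROBOTS_TXT, agent, "http://dmoz.org/Arts/")
        print(agent, "allowed" if allowed else "disallowed")
```

Against a live site you would call `set_url()` and `read()` on the parser instead of feeding it a string.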

bcolflesh

3:34 pm on Oct 14, 2003 (gmt 0)

Why not use the RDF file?:

rdf.dmoz.org/

Yidaki

3:39 pm on Oct 14, 2003 (gmt 0)

>Why not use the RDF file?

Because i only need < 0.0001% of the categories - too much to fetch them manually though. ;)
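If only a few categories are needed, the dump can be streamed and filtered rather than loaded whole. A rough sketch, assuming each listing is an `ExternalPage` element with an `about` attribute holding the URL and a `topic` child holding the category path; those names, the namespace, and the sample data are assumptions for illustration, so check them against the real file first.

```python
import io
import xml.etree.ElementTree as ET

# Tiny made-up sample standing in for the dump; the namespace is hypothetical.
SAMPLE_RDF = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns="http://example.org/hypothetical-ns#">
  <ExternalPage about="http://example.com/a">
    <topic>Top/Arts/Music</topic>
  </ExternalPage>
  <ExternalPage about="http://example.com/b">
    <topic>Top/Science/Physics</topic>
  </ExternalPage>
</RDF>
"""


def local_name(tag: str) -> str:
    """Strip an ElementTree '{namespace}' prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def urls_in_category(source, category: str):
    """Yield listing URLs filed under category or any of its subcategories."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "ExternalPage":
            continue
        topic = next(
            (child.text for child in elem if local_name(child.tag) == "topic"),
            None,
        )
        if topic and (topic == category or topic.startswith(category + "/")):
            yield elem.get("about")
        # Drop the element's children and attributes once it has been handled.
        elem.clear()


if __name__ == "__main__":
    for url in urls_in_category(io.BytesIO(SAMPLE_RDF), "Top/Arts"):
        print(url)
```

For the real dump, pass an open file (or a `gzip.open(...)` handle) in place of the `BytesIO` object.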

Yidaki

3:49 pm on Oct 14, 2003 (gmt 0)

Hehehehehe, directory.google.com has no robots.txt ... :)

trillianjedi

3:51 pm on Oct 14, 2003 (gmt 0)

>Hehehehehe, directory.google.com has no robots.txt ... :)

I was going to suggest that, then realised that it's so far out of date it may not be the best thing.

Some of the other dmoz mirrors, however, are bang up to date.

Yidaki

3:56 pm on Oct 14, 2003 (gmt 0)

That's true but most of them rewrite the listed URLs to cgi-bin/anyscript?id=123 and disallow that. Na, i just need the URLs to start a bigger crawl so i can live with the out of date links ... i hope ... i'll see ... ;)

hutcheson

4:06 pm on Oct 14, 2003 (gmt 0)

This may seem like a bizarre suggestion, but why not mail the staff programmer? I think on something like this, a legitimate SE, even if small, would get a fairly quick response.

Yidaki

4:23 pm on Oct 14, 2003 (gmt 0)

>a legitimate SE, even if small, would get a fairly quick response

Absolutely hutcheson. But i really don't think they bother changing their robots.txt just for me - not yet a search engine - i'm at the beginning of building it, the domain has been registered this morning ... ;)

John_Caius

4:30 pm on Oct 14, 2003 (gmt 0)

Rumour has it that one of the causes of the major technical server difficulties at dmoz in the last year was over-enthusiastic spidering by robots. I'd have thought that most webmasters here would welcome the fact that dmoz appears to have put in place measures to ensure that submissions and public page updates work properly again.

Yidaki

4:38 pm on Oct 14, 2003 (gmt 0)

>I'd have thought that most webmasters here would welcome the fact that dmoz appears to have put in place measures to ensure that submissions and public page updates work properly again.

I'm one of those webmasters who are happy with that fact. I perfectly accept and understand what they do. I wasn't moaning. ;)

hutcheson

5:05 pm on Oct 14, 2003 (gmt 0)

Yes, if you don't have a search engine, but only a search engine research project, then directory.google.com is the place to spider. Google has a few (thousand) more servers than the ODP.

Or grab an RDF and use it for data. Rumor is, that's what Google did.

hutcheson

5:38 pm on Oct 14, 2003 (gmt 0)

I should react to another Bolshevik canard. (ODP is the real revolution, brothers, the others are just banditti and wannabe-warlord-capitalist oppressors!)

It has nothing to do with "big boys" or "little boys". AOL was a "big boy" when they started using the ODP. Google was NOT. We were glad to see both of them.

It has to do with "bots that are known to the staff to be legitimate and well-behaved." Fast either hasn't asked, or was observed to be misbehaving.

And, by the way, the cretins who were selling scripts so all their littermates could spider the ODP EVERY DAY -- are the REAL jerks who REALLY got the legitimate "little guys" shut out. For five years, the ODP was completely open to spiders. So -- put your abuse and contempt where it so deservedly belongs.

I'm not staff, and can't read their mind, but it's obvious that their concern is server load, and their concrete questions will be along the lines of:

-- How often is the spider going to run?
-- How hard will it pound the server?
-- Is its purpose to do something antisocial and sleazy, like collecting e-mail addresses or expired domains for resale to bottom-feeding spammers?
-- Who else will ever use the spider?
-- Do you really need to hit dmoz.org, or will a mirror or RDF be good enough?

If Fast cares, they could undoubtedly get official permission to spider. They are a known SE, with a legitimate need, and they could surely make their bot behave.
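On the "how hard will it pound the server" question, the usual answer is a fixed minimum gap between requests. A minimal sketch of such a throttle follows; the `Throttle` class and the 2-second interval are illustrative assumptions, not anything the ODP staff have specified.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        # clock and sleep can be swapped out, which makes this testable
        # without real delays.
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = Throttle(min_interval=2.0)  # hypothetical: one request per 2 s
    for page in ("page-1", "page-2", "page-3"):
        throttle.wait()
        print("would fetch", page)  # the actual HTTP request would go here
```

Calling `wait()` before every request keeps the bot at or below one request per interval regardless of how fast the rest of the loop runs.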
