Forum Moderators: goodroi

DMOZ Robots.txt

Crawling no longer allowed

   
3:31 pm on Oct 14, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just tried to crawl a small category of dmoz and, surprise, they now disallow all bots except some of the big boys (another surprise: Fast isn't big enough ;).

That's new. Obviously they're trying to reduce server load. Now I have to either email them or download the RDF and extract what I need ... sigh.

>To be added to our list of allowed robots, please email the staff programmer

Do you think they'd let a small, new niche search engine crawl a small category as a starting point?
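
For anyone who wants to check a robots.txt before sending a bot out, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text is a made-up sample in the style described above (one named bot allowed, everyone else shut out), not the actual dmoz.org file, and `NicheBot` is a hypothetical user-agent name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a short allow-list of named bots, all others
# disallowed. This is an illustration, NOT the real dmoz.org file.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://dmoz.org/Some/Category/"
    for agent in ("Googlebot", "NicheBot"):
        print(agent, can_crawl(SAMPLE_ROBOTS_TXT, agent, url))
```

Against a live site you would call `set_url(...)` and `read()` on the parser instead of feeding it text with `parse()`.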

3:34 pm on Oct 14, 2003 (gmt 0)

Why not use the RDF file?

rdf.dmoz.org/

3:39 pm on Oct 14, 2003 (gmt 0)

>Why not use the RDF file?

Because I only need < 0.0001% of the categories. Still too many to fetch manually, though. ;)

3:49 pm on Oct 14, 2003 (gmt 0)

Hehehehehe, directory.google.com has no robots.txt ... :)

3:51 pm on Oct 14, 2003 (gmt 0)

WebmasterWorld Senior Member trillianjedi is a WebmasterWorld Top Contributor of All Time 10+ Year Member



>Hehehehehe, directory.google.com has no robots.txt ... :)

I was going to suggest that, then realised that it's so far out of date it may not be the best thing.

Some of the other dmoz mirrors, however, are bang up to date.

TJ

3:56 pm on Oct 14, 2003 (gmt 0)

That's true, but most of them rewrite the listed URLs to cgi-bin/anyscript?id=123 and disallow that. Nah, I just need the URLs to start a bigger crawl, so I can live with the out-of-date links ... I hope ... I'll see ... ;)

4:06 pm on Oct 14, 2003 (gmt 0)

This may seem like a bizarre suggestion, but why not mail the staff programmer? I think on something like this, a legitimate SE, even if small, would get a fairly quick response.

4:23 pm on Oct 14, 2003 (gmt 0)

>a legitimate SE, even if small, would get a fairly quick response

Absolutely, hutcheson. But I really don't think they'd bother changing their robots.txt just for me. It's not yet a search engine: I'm at the beginning of building it, and the domain was registered this morning ... ;)

4:30 pm on Oct 14, 2003 (gmt 0)

Rumour has it that one of the causes of the major technical server difficulties at dmoz in the last year was over-enthusiastic spidering by robots. I'd have thought that most webmasters here would welcome the fact that dmoz appears to have put in place measures to ensure that submissions and public page updates work properly again.

4:38 pm on Oct 14, 2003 (gmt 0)

>I'd have thought that most webmasters here would welcome the fact that dmoz appears to have put in place measures to ensure that submissions and public page updates work properly again.

I'm one of those webmasters who are happy with that fact. I perfectly accept and understand what they do. I wasn't moaning. ;)

5:05 pm on Oct 14, 2003 (gmt 0)

Yes, if you don't have a search engine, but only a search engine research project, then directory.google.com is the place to spider. Google has a few (thousand) more servers than the ODP.

Or grab an RDF and use it for data. Rumor is, that's what Google did.
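
Pulling a handful of categories out of a big RDF dump doesn't have to mean loading the whole thing. Here is a rough sketch that streams the file with `xml.etree.ElementTree.iterparse` and keeps only URLs filed under one category. It assumes a layout where each listing is an `ExternalPage` element with an `about` attribute and a `topic` child; the tiny sample document and its namespace are made up for illustration, so check the real dump's structure before relying on this.

```python
import io
import xml.etree.ElementTree as ET

# Made-up miniature dump for illustration; the real file's namespaces and
# extra child elements may differ from this assumption.
SAMPLE_RDF = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns="http://example.org/rdf/">
  <ExternalPage about="http://example.com/a">
    <topic>Top/Science/Physics</topic>
  </ExternalPage>
  <ExternalPage about="http://example.org/b">
    <topic>Top/Arts/Music</topic>
  </ExternalPage>
</RDF>
"""


def local_name(tag: str) -> str:
    """Strip an ElementTree '{namespace}' prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def urls_in_category(source, category: str):
    """Yield listing URLs filed under category or any of its subcategories."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "ExternalPage":
            continue
        topic = next(
            (child.text or "" for child in elem if local_name(child.tag) == "topic"),
            "",
        )
        if topic == category or topic.startswith(category + "/"):
            url = elem.get("about")
            if url:
                yield url
        # Drop the listing's children and attributes once it has been handled.
        elem.clear()


if __name__ == "__main__":
    for url in urls_in_category(io.BytesIO(SAMPLE_RDF), "Top/Science"):
        print(url)
```

For a real dump you would pass an open file (or a gzip stream) instead of the `io.BytesIO` sample.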

5:38 pm on Oct 14, 2003 (gmt 0)

I should react to another Bolshevik canard. (ODP is the real revolution, brothers, the others are just banditti and wannabe-warlord-capitalist oppressors!)

It has nothing to do with "big boys" or "little boys". AOL was a "big boy" when they started using the ODP. Google was NOT. We were glad to see both of them.

It has to do with "bots that are known to the staff to be legitimate and well-behaved." Fast either hasn't asked, or was observed to be misbehaving.

And, by the way, the cretins who were selling scripts so all their littermates could spider the ODP EVERY DAY -- are the REAL jerks who REALLY got the legitimate "little guys" shut out. For five years, the ODP was completely open to spiders. So -- put your abuse and contempt where it so deservedly belongs.

I'm not staff, and can't read their minds, but it's obvious that their concern is server load, and their concrete questions will be along the lines of:

-- How often is the spider going to run?
-- How hard will it pound the server?
-- Is its purpose to do something antisocial and sleazy, like collecting e-mail addresses or expired domains for resale to bottom-feeding spammers?
-- Who else will ever use the spider?
-- Do you really need to hit dmoz.org, or will a mirror or RDF be good enough?

If Fast cares, they could undoubtedly get official permission to spider. They are a known SE, with a legitimate need, and they could surely make their bot behave.
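
On the "how hard will it pound the server?" question, one way a small bot can behave is to enforce a minimum delay between requests to the same host. A minimal sketch follows; `HostThrottle` and the 5-second figure are my own illustration, not anything the ODP staff has specified. The clock and sleep functions are injectable so the logic can be tried without actually waiting.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum gap between successive requests to one host."""

    def __init__(self, min_delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time the last request was released

    def wait(self, url: str) -> float:
        """Block until url's host may be hit again; return seconds waited."""
        host = urlsplit(url).netloc.lower()
        now = self._clock()
        last = self._last_request.get(host)
        waited = 0.0
        if last is not None:
            waited = max(0.0, self.min_delay - (now - last))
            if waited > 0:
                self._sleep(waited)
        self._last_request[host] = now + waited
        return waited


if __name__ == "__main__":
    throttle = HostThrottle(5.0)  # hypothetical delay; pick something gentle
    for url in ("http://dmoz.org/a", "http://dmoz.org/b"):
        print(url, "waited", throttle.wait(url))
        # fetch(url) would go here
```

Call `wait(url)` right before each fetch; requests to different hosts do not delay each other.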