Home / Forums Index / Search Engines / Sitemaps, Meta Data, and robots.txt
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
DMOZ Robots.txt
Crawling no longer allowed
Yidaki




msg:1528138
 3:31 pm on Oct 14, 2003 (gmt 0)

Just tried to crawl a small category of the dmoz and surprise - they now disallow all bots except some of the big boys (another surprise, fast isn't big enough ;).

That's new. Obviously they try to save server load. Now i have to either email them or load/extract the rdf ... sigh.

To be added to our list of allowed robots, please email the staff programmer

You think they let a small, new niche search engine crawl a small category as a starting point?
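For readers who want to see what such a whitelist means for their own crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text and the bot name `MyNicheBot` are made up for illustration; they are not the actual dmoz.org file, which is not quoted in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in the spirit of the one described above:
# a couple of named bots are let in, everyone else is shut out.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: *
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("Googlebot", "MyNicheBot"):
        allowed = can_crawl(SAMPLE_ROBOTS_TXT, bot, "http://dmoz.org/Arts/")
        print(bot, "allowed" if allowed else "blocked")
```

Against a live site you would call `set_url()` and `read()` on the parser instead of `parse()`, and check before every fetch.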

 

bcolflesh




msg:1528139
 3:34 pm on Oct 14, 2003 (gmt 0)

Why not use the RDF file?:

rdf.dmoz.org/

Yidaki




msg:1528140
 3:39 pm on Oct 14, 2003 (gmt 0)

>Why not use the RDF file?

Because i only need < 0.0001% of the categories - too much to fetch them manually though. ;)

Yidaki




msg:1528141
 3:49 pm on Oct 14, 2003 (gmt 0)

Hehehehehe, directory.google.com has no robots.txt ... :)

trillianjedi




msg:1528142
 3:51 pm on Oct 14, 2003 (gmt 0)

>Hehehehehe, directory.google.com has no robots.txt ... :)

I was going to suggest that, then realised that it's so far out of date it may not be the best thing.

Some of the other dmoz mirrors, however, are bang up to date.

TJ

Yidaki




msg:1528143
 3:56 pm on Oct 14, 2003 (gmt 0)

That's true but most of them rewrite the listed URLs to cgi-bin/anyscript?id=123 and disallow that. Na, i just need the URLs to start a bigger crawl so i can live with the out of date links ... i hope ... i'll see ... ;)

hutcheson




msg:1528144
 4:06 pm on Oct 14, 2003 (gmt 0)

This may seem like a bizarre suggestion, but why not mail the staff programmer? I think on something like this, a legitimate SE, even if small, would get a fairly quick response.

Yidaki




msg:1528145
 4:23 pm on Oct 14, 2003 (gmt 0)

>a legitimate SE, even if small, would get a fairly quick response

Absolutely hutcheson. But i really don't think they bother changing their robots.txt just for me - not yet a search engine - i'm at the beginning of building it, the domain has been registered this morning ... ;)

John_Caius




msg:1528146
 4:30 pm on Oct 14, 2003 (gmt 0)

Rumour has it that one of the causes of the major technical server difficulties at dmoz in the last year was over-enthusiastic spidering by robots. I'd have thought that most webmasters here would welcome the fact that dmoz appears to have put in place measures to ensure that submissions and public page updates work properly again.

Yidaki




msg:1528147
 4:38 pm on Oct 14, 2003 (gmt 0)

>I'd have thought that most webmasters here would welcome the fact that dmoz appears to have put in place measures to ensure that submissions and public page updates work properly again.

I'm one of those webmasters who are happy with that fact. I perfectly accept and understand what they do. I wasn't moaning. ;)

hutcheson




msg:1528148
 5:05 pm on Oct 14, 2003 (gmt 0)

Yes, if you don't have a search engine, but only a search engine research project, then directory.google.com is the place to spider. Google has a few (thousand) more servers than the ODP.

Or grab an RDF dump and use it for data. Rumor is, that's what Google did.
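Pulling one small category out of the dump does not require loading the whole file. The sketch below streams the XML and keeps only the URLs filed under a given category prefix. The element names (`ExternalPage`, `topic`, the `about` attribute) and the sample data are assumptions about the dump layout made for illustration, so check them against the real file before relying on this.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the dump. The structure is an assumption, not a copy
# of the real file; the URLs and titles are invented.
SAMPLE_RDF = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf/">
  <ExternalPage about="http://example.com/didgeridoo">
    <d:Title>Example one</d:Title>
    <topic>Top/Arts/Music/Instruments</topic>
  </ExternalPage>
  <ExternalPage about="http://example.org/chess">
    <d:Title>Example two</d:Title>
    <topic>Top/Games/Board_Games</topic>
  </ExternalPage>
</RDF>
"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def urls_in_category(stream, category_prefix: str):
    """Yield the URL of each page whose topic starts with category_prefix."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "ExternalPage":
            continue
        topic = next(
            (child.text or "" for child in elem if local_name(child.tag) == "topic"),
            "",
        )
        if topic.startswith(category_prefix):
            yield elem.get("about")
        # Drop the element's children so memory use stays small on a big file.
        elem.clear()


if __name__ == "__main__":
    for url in urls_in_category(io.BytesIO(SAMPLE_RDF), "Top/Arts/Music"):
        print(url)
```

For the real dump you would pass an open file (or a `gzip.open(...)` handle) instead of the `BytesIO` object.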

hutcheson




msg:1528149
 5:38 pm on Oct 14, 2003 (gmt 0)

I should react to another Bolshevik canard. (ODP is the real revolution, brothers, the others are just banditti and wannabe-warlord-capitalist oppressors!)

It has nothing to do with "big boys" or "little boys". AOL was a "big boy" when they started using the ODP. Google was NOT. We were glad to see both of them.

It has to do with "bots that are known to the staff to be legitimate and well-behaved." Fast either hasn't asked, or was observed to be misbehaving.

And, by the way, the cretins who were selling scripts so all their littermates could spider the ODP EVERY DAY -- are the REAL jerks who REALLY got the legitimate "little guys" shut out. For five years, the ODP was completely open to spiders. So -- put your abuse and contempt where it so deservedly belongs.

I'm not staff, and can't read their mind, but it's obvious that their concern is server load, and their concrete questions will be along the lines of:

-- How often is the spider going to run?
-- How hard will it pound the server?
-- Is its purpose to do something antisocial and sleazy, like collecting e-mail addresses or expired domains for resale to bottom-feeding spammers?
-- Who else will ever use the spider?
-- Do you really need to hit dmoz.org, or will a mirror or RDF be good enough?

If Fast cares, they could undoubtedly get official permission to spider. They are a known SE, with a legitimate need, and they could surely make their bot behave.
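The first two questions in that list (how often, how hard) come down to spacing requests out. Here is a minimal sketch of a throttle that enforces a minimum gap between fetches. The class name `PoliteThrottle` and the two-second interval are illustrative choices, not anything the ODP staff specified.

```python
import time


class PoliteThrottle:
    """Enforce a minimum interval between successive requests."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        # clock and sleep are injectable so the class can be tested without waiting.
        self._clock = clock
        self._sleep = sleep
        self._last_request_at = None

    def wait(self):
        """Block until the next request is allowed; return seconds slept."""
        waited = 0.0
        if self._last_request_at is not None:
            elapsed = self._clock() - self._last_request_at
            remaining = self.min_interval_s - elapsed
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_request_at = self._clock()
        return waited


if __name__ == "__main__":
    throttle = PoliteThrottle(min_interval_s=2.0)
    for path in ("/Arts/", "/Games/", "/Science/"):
        slept = throttle.wait()
        # A real crawler would fetch the page here; this only prints the plan.
        print(f"slept {slept:.1f}s, would now fetch {path}")
```

Call `wait()` immediately before each fetch. A server-friendly crawler would also honour robots.txt and back off when it sees errors.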


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved