How can Google spider DMOZ?

It takes me 10-20 tries to get the home page...

         

swerve

8:49 pm on Apr 10, 2003 (gmt 0)

10+ Year Member



When trying to access any dmoz.org pages, I almost always get a "page not found" error. It usually takes 10-20 tries before I get a page to display. I'm glad I'm not Googlebot, trying to request 300,000+ pages....

taxpod

8:55 pm on Apr 10, 2003 (gmt 0)

10+ Year Member



Good point!

jeffb711

9:01 pm on Apr 10, 2003 (gmt 0)

10+ Year Member



I'm pretty sure dmoz is running Personal Web Server with Access as their DB over an ODBC connection.

ncw164x

9:05 pm on Apr 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google might even download the RDF file and spider it from its own servers.

jrobbio

9:09 pm on Apr 10, 2003 (gmt 0)

10+ Year Member



That would make more sense, ncw.

rfgdxm1

9:18 pm on Apr 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google may be spidering through a backdoor that it knows about but the public doesn't. Editors use one to access the ODP without timeouts. Also, Google may just spider the mirror [ch.dmoz.org,...] and treat it as dmoz.org. Try the mirror yourself. Nice and quick.

ct2000

9:27 pm on Apr 10, 2003 (gmt 0)

10+ Year Member



yeah but no PR ..

jrobbio

9:33 pm on Apr 10, 2003 (gmt 0)

10+ Year Member



DMOZ has so many mirrors, and a surefire reason for people wanting to get listed. Plain and simple, if the site doesn't work, then it won't get crawled. Why should Google give DMOZ special treatment with a backdoor? DMOZ may be big, and webmasters give it huge importance, but at the end of the day, if Google can't access the site, then it won't crawl it, or it will just keep trying.
As someone said, if a PR6 site goes down, then Google will recheck it until it works again.

bakedjake

9:38 pm on Apr 10, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I'm fairly certain Google uses the DMOZ RDF dump. :) It doesn't spider the site.

The DMOZ Resource Zone actually has a dedicated RDF forum, which can provide you with more info if you're interested.

rfgdxm1

9:51 pm on Apr 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>I'm fairly certain Google uses the DMOZ RDF dump. :) It doesn't spider the site.

Wrong. During that long period when the ODP wasn't doing RDF dumps, I was seeing new sites I had added as an ODP editor show up in the link: command after the Google dances. Google does spider the ODP. As for the person who commented that the mirror has no PR: Google may have just hand-tweaked things so that, when it crawls the mirror, it treats it as dmoz.org. Obviously, Google considers the ODP very important: they use it as their own directory. Thus, quite plausible Google would go to special effort to crawl the ODP.

Critter

9:58 pm on Apr 10, 2003 (gmt 0)

10+ Year Member



Maybe Google spidering DMOZ is the reason that DMOZ is slow :)

Perhaps someone over at Google pulled out the stops on the page-request limiter for DMOZ, and made poor newhoo sh*t the bed.

Hehe (could happen)

Peter

ncw164x

10:03 pm on Apr 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>quite plausible Google would go to special effort to crawl the ODP<<

If that is the case, then why does Dmoz use Robozilla to check the sites in the directory when they could use the Google-spidered version, which would have any 404s or errors removed?

Dmoz give the rdf file free and as a favour back from Google have a clean database of sites - but this does not happen?

bakedjake

10:08 pm on Apr 10, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I thought for sure Google would grab the RDF, then clean it, then use it as their directory. But rfg, I defer to you as the elder editor. :)

If they don't grab the RDF, that would explain why they can't submit a clean one back to DMOZ, never mind any political reasons involved.

jrobbio

10:10 pm on Apr 10, 2003 (gmt 0)

10+ Year Member



ncw, maybe DMOZ has never asked for the favour in return. It sounds like a clever idea, but it would mean that DMOZ has made itself dependent on Google, like many others, and that is dangerous for anyone.

rfgdxm1

10:20 pm on Apr 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>Dmoz give the rdf file free and as a favour back from Google have a clean database of sites - but this does not happen?

ODP policy is never to remove sites by automation. We do everything by hand review. One problem is that Google could find dead sites, but they were just temporarily offline. At the ODP, we want to make sure they are really, truly dead before deleting them.
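
To illustrate the "temporarily offline vs. truly dead" distinction, here is a rough sketch of how a link checker could hold back from flagging a site until it has failed several checks in a row over a long enough stretch. This is not how Robozilla or Google actually work; the class name `SiteHealth`, the thresholds (3 failures, 14 days) and the example URL are all assumptions made up for the illustration. Note that it only flags a site for hand review and never deletes anything.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class SiteHealth:
    """Tracks consecutive failed checks for one listed URL (hypothetical helper)."""
    url: str
    first_failure: Optional[datetime] = None
    consecutive_failures: int = 0

    def record(self, ok: bool, when: datetime) -> None:
        """Record the outcome of one check; any success resets the failure streak."""
        if ok:
            self.first_failure = None
            self.consecutive_failures = 0
            return
        if self.first_failure is None:
            self.first_failure = when
        self.consecutive_failures += 1

    def needs_review(
        self,
        now: datetime,
        min_failures: int = 3,                    # assumed threshold
        min_span: timedelta = timedelta(days=14)  # assumed threshold
    ) -> bool:
        """True only if the site failed repeatedly AND has been failing long enough."""
        if self.first_failure is None:
            return False
        long_enough = now - self.first_failure >= min_span
        return self.consecutive_failures >= min_failures and long_enough


# Usage: a site that fails three checks within one afternoon is not flagged,
# but one that keeps failing across two weeks is queued for an editor to look at.
start = datetime(2003, 4, 10)
brief_outage = SiteHealth("http://example.com/")
for hours in (0, 1, 2):
    brief_outage.record(False, start + timedelta(hours=hours))

long_outage = SiteHealth("http://example.com/")
for days in (0, 7, 14):
    long_outage.record(False, start + timedelta(days=days))

print(brief_outage.needs_review(start + timedelta(hours=2)))
print(long_outage.needs_review(start + timedelta(days=14)))
```

Requiring both a failure count and a minimum time span is one simple way to avoid treating a short outage as a dead site; the final call still goes to a human editor.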

ncw164x

11:09 pm on Apr 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>At the ODP, we want to make sure they are really, truly dead before deleting them<<

Yes, you're right, rfgdxm1. Fair comment.

steveb

1:58 am on Apr 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



bakedjake, they do both: they crawl it and they use the RDF.