How to improve site performance of dmoz.org

... a constructive approach

         

tombola

9:09 am on Sep 23, 2003 (gmt 0)

10+ Year Member



It's a fact that sometimes you can access dmoz.org easily, and sometimes you can't access it at all.

Are the folks at dmoz.org (I mean tech people, NOT editors) aware of the fact that a great part of traffic doesn't come from regular (individual) users, but from sites that use live ODP data?
These sites query the dmoz.org server on every request instead of caching files or using the RDF dump.

See:
http://dmoz.org/Computers/Internet/Searching/Directories/Open_Directory_Project/Sites_Using_ODP_Data/

Here are some possibilities to improve site performance of dmoz.org:

1. Every site that uses live ODP data should access a separate server rather than the main one (dmoz.org).
To identify such a site, all sites that use live ODP data must use a particular user-agent.
Just deny access to unidentified bots, spiders, programs etc.

2. Only a few sites use the RDF dump because it's too large to manage, so it would be a better idea to provide a live XML feed (from another server), as Amazon does.

3. If setting up another server is not an option, why not force sites that use ODP data to cache files (deny access to the server when a particular web site requests the same page/category more than once a day)?
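
A rough sketch of how suggestions 1 and 3 might fit together, in Python. The user-agent token `odp-live-client`, the class name `DataSiteGate`, and the 24-hour window are assumptions made up for illustration, not anything dmoz.org is known to run. It only covers programs that present themselves as data-using sites; telling bots apart from ordinary browsers is left out.

```python
import time

# Hypothetical allow-list of user-agent tokens registered by sites using live ODP data.
REGISTERED_AGENTS = {"odp-live-client"}
ONE_DAY = 24 * 60 * 60  # seconds


class DataSiteGate:
    """Decides whether a data-using site may fetch a given page right now."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._last_fetch = {}  # (agent, path) -> time of the last allowed fetch

    def allow(self, agent, path):
        if agent not in REGISTERED_AGENTS:
            return False  # suggestion 1: unidentified programs are denied
        now = self._clock()
        key = (agent, path)
        last = self._last_fetch.get(key)
        if last is not None and now - last < ONE_DAY:
            return False  # suggestion 3: same page again within a day, use your cache
        self._last_fetch[key] = now
        return True
```

A site that caches properly would never notice the limit, because it would ask for each category at most once a day.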

What do you think?...

hutcheson

1:32 pm on Sep 23, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>Are the folks at dmoz.org (I mean tech people, NOT editors) aware of the fact that a great part of traffic doesn't come from regular (individual) users, but from sites that use live ODP data?

My understanding (and this may be wrong, because the techies spend more time working on the problems than telling us about them) is that:

1) Individual users coming through screen-scrapers are not believed to be a load issue. An individual user typing dmoz.org/Arts is no different from an individual user typing my-dmoz-scraper.com/Arts.
2) There is a new public-server-caching-and-synchronization system.
3) It is not fully debugged.
4) Some editors think it may not be an optimal algorithm anyway -- for whatever THAT'S worth (not all editors are programmers, and of the ones that are, I don't know that they're all GOOD ones). There's some discussion of this internally, but I'm not sure the tech people yet consider this concern a high priority.
5) I fear that the caching algorithm will break down in the face of persistent rogue spidering (LRU cache vs. systematic access to every single page, FTR).
6) I'm sure the techies are sharp enough to realize #5. I'm not sure what they're planning to do about it, or whether they'll tell me when they do it.
7) There are still rumors that abusive spiders are active.
8) The synchronization behavior we are seeing is compatible with #5 and #7 above.
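
To make #5 concrete, here is a small Python simulation of an LRU cache under two access patterns. The numbers (1000 pages, a 100-page cache, 50 popular pages) are invented for the example and say nothing about the real dmoz.org setup; the point is only the shape of the result.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal LRU cache that only tracks which keys are resident."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self._items[key] = True
            if len(self._items) > self.capacity:
                self._items.popitem(last=False)  # evict least recently used


def hit_rate(accesses, capacity):
    cache = LRUCache(capacity)
    for page in accesses:
        cache.get(page)
    return cache.hits / (cache.hits + cache.misses)


pages = list(range(1000))
spider = pages * 5                        # every page, in order, five passes
popular = [p % 50 for p in range(5000)]   # readers revisiting 50 popular pages

spider_rate = hit_rate(spider, capacity=100)    # 0.0
popular_rate = hit_rate(popular, capacity=100)  # 0.99
```

The spider never gets a hit: by the time it comes back to a page, that page has already been pushed out by the 100 pages fetched after it. When spider and reader traffic are mixed, the spider's misses also evict the pages readers actually want.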

windharp

2:41 pm on Sep 23, 2003 (gmt 0)

10+ Year Member



Everybody agrees that there is a lot of room for optimization in the system as it runs right now, and I am quite sure the situation will not stay like this forever. Hutcheson is right: the traffic is not coming from people browsing the ODP through a redirector. Lots of traffic comes from people actually spidering the ODP without complying with robots.txt. To get parts of the directory (where you don't want the whole RDF), spidering is okay, as long as it is not done too frequently and is done with limited bandwidth.
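
For what "okay" spidering could look like, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt rules below are invented for the example (they are not dmoz.org's real file), and `PoliteFetcher` and its `fetch` callable are hypothetical names; plug in whatever actually downloads a page.

```python
import time
import urllib.robotparser

# Example rules, made up for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /editors/
Crawl-delay: 2
"""


def make_parser(robots_text):
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


class PoliteFetcher:
    """Fetches only what robots.txt allows, pausing between requests."""

    def __init__(self, parser, agent, fetch, default_delay=1.0,
                 sleep=time.sleep, clock=time.monotonic):
        self.parser = parser
        self.agent = agent
        self._fetch = fetch  # callable(url) -> page content
        # Use the site's Crawl-delay if it gives one, else our own default.
        self.delay = parser.crawl_delay(agent) or default_delay
        self._sleep = sleep
        self._clock = clock
        self._next_allowed = 0.0

    def get(self, url):
        if not self.parser.can_fetch(self.agent, url):
            return None  # disallowed by robots.txt, so skip it
        wait = self._next_allowed - self._clock()
        if wait > 0:
            self._sleep(wait)  # keep the request rate down
        self._next_allowed = self._clock() + self.delay
        return self._fetch(url)
```

This only limits request frequency; capping bandwidth would need extra care in the download code itself.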

Some more thoughts apart from what hutcheson said:

- Remember that surfers come with new browsers and unknown agents all the time. Editors especially do that; lots of them use nonstandard browsers. Editor tools do it, too.

- Some frequently used external editor tools have to access pages without caching them, for several reasons, so you would have to implement exceptions. That is always bad for speed and reliability.

- Keep in mind that the tech people are most likely not totally familiar with this kind of problem: lots of HTML pages that are constantly changing, and lots of people accessing them to read. That is not an everyday problem.

rfgdxm1

2:38 am on Sep 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Why, pray tell, doesn't dmoz.org have a link to [ch.dmoz.org...] (which is a public mirror server) on the homepage? Same data.

hutcheson

2:37 pm on Sep 25, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



IMO an excellent suggestion. I've copied it to our internal Bugs&Features forum.

Caveats:
1) This won't help submission problems, which have to drill down to the editors' machine to get to the editing databases.
2) When times are rotten (for public access and updates), the mirrors may be affected (perhaps not in the same way as the main page).
3) Right now, times are rotten for the public access, and that's going to be a higher priority for our tech staff than changes to the home page.

But next time there are problems, this little feature might ameliorate some of the frustration.