
Asia and Pacific Region Forum

New Australian directory/crawler

 1:04 pm on May 28, 2003 (gmt 0)

The National Library of Australia has been spidering sites that have consented to have their content archived in a local version of the Wayback Machine. The requirement for Australia-relevant content is important.


There is a list of exclusions and only a smattering of commercial sites are being accepted.

- Ash




 6:36 pm on May 28, 2003 (gmt 0)

They have been adding to their archive for quite some time now... do you have a begin date on that? The first request to a site they wanted to include that I know of was at least a year ago... but as far as I know they do notify you if they wish to include you.


 8:28 pm on May 28, 2003 (gmt 0)

Any information on their spider?


 1:00 am on May 29, 2003 (gmt 0)

Any information on their spider?

Yeah - or whether they respect robots.txt (what user agent string?) so that we can avoid cases where "generous" people submit your site for you...
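For anyone wanting to test an exclusion rule before relying on it, here is a minimal sketch using Python's standard `urllib.robotparser`. Nothing in this thread confirms that the NLA's crawler reads robots.txt at all, and the `HTTrack` token and the `example.com` URLs below are assumptions chosen purely for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: shut out one named crawler entirely and keep
# every other crawler away from /private/ only.
SAMPLE_ROBOTS = """\
User-agent: HTTrack
Disallow: /

User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for agent in ("HTTrack", "Googlebot"):
        for path in ("/index.html", "/private/notes.html"):
            url = "http://example.com" + path
            print(agent, path, is_allowed(SAMPLE_ROBOTS, agent, url))
```

Note that this only shows what a well-behaved crawler would conclude from the file. A spider that ignores robots.txt is unaffected, which is why the user agent string matters for server-side blocking as well.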


 2:00 am on May 29, 2003 (gmt 0)

From reading some of the guidelines [pandora.nla.gov.au], it appears that inclusion is by invitation only, and only after a licence has been granted to the library to spider your site. A far more ethical process, in my opinion.



 2:44 am on May 29, 2003 (gmt 0)

Weird. I just went to this site, and all I got was a page with one link to the page I was on. Google's cache shows a directory-type structure, but all I get is a self-referencing link :(

Anyone else seeing that?


 8:27 am on May 29, 2003 (gmt 0)

Same here, Keeper, with Opera 7 presenting itself as Opera. Opera with the magic word MSIE in the ua-string shows a nice list of documents, manuals and other files. Obviously they've hired some clueless designer to build the site. Interesting doctype, too:

<!DOCTYPE HTML PUBLIC "-//SoftQuad Software//DTD HoTMetaL PRO 6.0::19990601::extensions to HTML 4.0//EN" "hmpro6.dtd">


 8:42 am on May 29, 2003 (gmt 0)

Kinda interesting that it has a link to the WayBack Machine. Maybe the same software?


 12:28 pm on May 29, 2003 (gmt 0)

Try this for further information:

The form to submit indexed Australian Internet publications is here:

Read the guidelines carefully (keep Australia clean! :)).

The only experience that I have with it is that they found an information site that they wanted to include and sent a request for permission to archive it.

Adding: Well ok.. so click the links offered to find out more.


 1:01 pm on May 29, 2003 (gmt 0)

I don't think I can quote email from a mailing list here, but it seems that the Wayback Machine founder, Brewster Kahle, did offer the code to the NLA and they did not take it.

Apparently only a few sites (~700) have been spidered in the past year. I noticed their spider (a machine that runs HTTrack) visit my site a few times. My site has an entry in the state-run directory, which was being spidered, and HTTrack was probably set to spider one external level.

I have since filled out the spidering request form which is linked from Pandora's main page as "Notification Form". Not for my business site (which is unlikely to be spidered, although they have chosen some iconic Aussie businesses as samples), but some noncommercial resources such as a magazine that I help to edit.

- Ash
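For those who want to check their own logs for similar visits, here is a rough sketch that counts requests per client whose user agent mentions HTTrack. It assumes the server writes the Apache "combined" log format and that the crawler leaves "HTTrack" in its user agent string (the string can be changed by whoever runs it). The sample lines and addresses are made up for illustration.

```python
import re
from collections import Counter

# Apache "combined" log format: host, identity, user, [time], "request",
# status, size, "referer", "user agent".
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def httrack_hits(lines):
    """Count requests per client host whose user agent mentions HTTrack."""
    hits = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match and "httrack" in match.group("agent").lower():
            hits[match.group("host")] += 1
    return hits


# Hypothetical log lines using documentation-range addresses.
SAMPLE_LOG = [
    '192.0.2.10 - - [28/May/2003:13:04:00 +0000] "GET / HTTP/1.0" 200 5120 '
    '"-" "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"',
    '192.0.2.10 - - [28/May/2003:13:04:02 +0000] "GET /about.html HTTP/1.0" '
    '200 2048 "-" "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"',
    '198.51.100.7 - - [28/May/2003:13:05:10 +0000] "GET / HTTP/1.1" 200 5120 '
    '"-" "Opera/7.11 (Windows NT 5.1; U) [en]"',
]

if __name__ == "__main__":
    for host, count in httrack_hits(SAMPLE_LOG).items():
        print(host, count)
```

To run it against a real file, pass the open file object instead of `SAMPLE_LOG`; the log path depends on your server setup.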

