Forum Moderators: phranque


Needed - URL Sorting Application...

...capable of handling 15,000+ URLs.

         

pendanticist

8:01 pm on Jan 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Greetings,

I've amassed about 7,000 unsorted URLs (which still need annotating) that I want to include in my next update. This update will add about 20 new indices as well.

I'm a one-person operation whose domain currently houses some 5,000 annotated URLs in 160+ indices.

  • I need a piece of software that will sort out the redundant URLs between those 5,000 currently uploaded and those 7,000 queued.

  • What I'm thinking about should be able to run all the URLs (and the accompanying verbiage) in one file, so that file might be verrrry hefty.

  • Validation of URLs is not an issue - all are valid.

  • Preferably free.

    and

  • Runs on w2k.

    A couple of associated questions:

  • Does FP-2000 have such a feature? I have FP-2000 (full monty) that I've never used. (I prefer hand coding, am not fond of one-time-only situationally specific learning curves or the code bloat FP adds.)

  • Would I be better off using Excel?

  • If so, will I need to remove the annotations for sorting purposes?

    Thank You.

    Pendanticist.

    jatar_k

    8:37 pm on Jan 12, 2003 (gmt 0)

    WebmasterWorld Administrator 10+ Year Member



    This could be done with a script, either storing the URLs in a flatfile or db and having one or a couple of scripts to manage the list.

    Take a look at sourceforge.net, they have tons of free scripts for various tasks. There may even be something there that does exactly what you need. If not you may have to customize something that is close.
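
    To make that concrete: a flatfile is just a plain text file with one entry per line. As a rough sketch only (the file names are placeholders, the language here happens to be Rebol, and it assumes each entry's URL runs up to the first space), a script that pulls out the queued URLs not already in the uploaded list might look something like this:

    rebol []

    ;; placeholder names: existing.txt holds the entries already uploaded
    ;; (url followed by its annotation), new.txt holds the queued entries
    existing: copy []
    foreach line read/lines %existing.txt [
        append existing first parse/all line " "     ;; collect just the URL part
    ]
    existing: make hash! existing                    ;; hash! gives fast membership tests

    fresh: copy []
    foreach line read/lines %new.txt [
        url: first parse/all line " "
        if not find existing url [append fresh line] ;; keep lines whose URL is not uploaded yet
    ]

    write/lines %new-only.txt sort unique fresh      ;; what still needs to be added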

    Macguru

    8:43 pm on Jan 12, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    If you don't find a free script to do that, consider FileMaker Pro instead of Excel. I use it for almost everything.

    pendanticist

    10:17 pm on Jan 12, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    jatar_k - This could be done with a script, either storing the URLs in a flatfile or db and having one or a couple of scripts to manage the list.

    Uh, what are flatfile and db (database)?

    Take a look at sourceforge.net, they have tons of free scripts for various tasks. There may even be something there that does exactly what you need. If not you may have to customize something that is close.

    I'm over there now. A bit tough to wade through if you don't understand half the titles/descriptions. (Computer/software acronyms aren't my strong suit.)

    Customize? <chuckle> That denotes a level of expertise I don't have. My experience has been more along the lines of download, install, and run.

    Macguru - If you don't find a free script to do that, consider FileMaker Pro instead of Excel. I use it for almost everything.


    I'm taking a look at that also. While it's not free (I can't find the price right off), it may be worth a download to see how difficult it is to use, and whether I can get the update done before the time runs out.

    Thanks, I appreciate it.

    If anyone else has something to contribute, feel free. I could use more suggestions.

    Pendanticist.

    victor

    10:21 pm on Jan 12, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    If you put all your 7000 URLs into one text (=flat) file -- ie one URL per line -- can you give an example of what a few lines in that file would look like?

    Because...

    www.widgets-blue.com
    www.widgets-red.com/about.html

    ...would be trivial to dedupe and sort. But if there is a lot of picking through the "annotations" and lots of rules about what is and isn't a duplicate, it'll be that much harder.
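
    For instance, with one bare URL per line, the whole dedupe-and-sort fits in a single line of Rebol (file names are placeholders):

    write/lines %dedupe-out.txt sort unique read/lines %dedupe-in.txt

    unique drops the repeated lines and sort alphabetizes what's left; it's the annotated lines that take more care.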

    pendanticist

    1:11 am on Jan 13, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    victor - If you put all your 7000 URLs into one text (=flat) file -- ie one URL per line -- can you give an example of what a few lines in that file would look like?

    Yep.

    Note: I've removed the http stuff in all instances.

    Currently, they either literally look like this:

    yvwiiusdinvnohii.net/other.html - Other Peoples' Paths
    www.wri.org/ - World Resources Institute
    www.law.ecel.uwa.edu.au/intlaw/arctic_and_antarctic.htm - Arctic and Antarctic Links
    www.antdiv.gov.au/ - Australian Antarctic Page
    www.geo.ed.ac.uk/home/giswww.html - WWW Resources: Geographic Information Systems (GIS)
    www.alphacdc.com/ien/subject.html - Indigenous Environmental Network

    Times 7,000 of those - representing the ones still needing annotations.

    ...or this:

    www.mip.berkeley.edu/cilc/bibs/toc.html">Bibliographies of Northern and Central California Indians</A><BR>The California Indian Library Collections has collected, duplicated, assembled, and shipped more than 11,000 textual documents, nearly 25,000 photographs, and over 3,400 audio tapes. There is reward and satisfaction in having prepared over 17,000 manuscript pages for finding guides to the collections and publishing these in 44 volumes. Now a Native Californian in a remote area of northern California may find a photograph of his or her grandmother or hear, for the first time, his grandfather sing or tell a story. Researchers in rural areas are using the collections for legal defense as well as research material for documentation of an important period in California history.

    www.blackfeetnation.com/">Blackfeet Nation -- Welcome to the Official Site of the Blackfeet Nation</A><BR>Official Site of the Blackfeet Nation based in Browning Montana.

    www.tlingit-haida.org/">CCTHITA.org</A><BR>Central Council is the tribal government representing over 24,000 Tlingit and Haida indians worldwide.

    www.sioux.org/">Cheyenne River Sioux Tribe/Native American/Indian Government</A><BR>The Cheyenne River Sioux Tribe is proud to introduce it's Website to the Internet! This site has been re-designed, programmed, prepared, and written by CRST Chairman, Gregg Bourland.

    www.chiefs-of-ontario.org/">Chiefs of Ontario</A><BR>In March of 1975, at the First Annual All Ontario Chiefs Conference, a joint First Nations Association Coordinating Committee was formed, constituting an unincorporated federation of the four major Ontario First Nation organizations. The purpose of the committee was to provide a single Ontario representative to the Assembly of First Nations (then, the National Indian Brotherhood). From this committee emerged the Chiefs of Ontario office whose basic purpose is to enable the political leadership to discuss and to decide on regional, provincial and national priorities affecting First Nation people in Ontario and to provide a unified voice on these issues. The Chiefs of Ontario office has become a vehicle to facilitate discussions between the Ontario government and First Nation people in Ontario.

    Times 5,000 of those - representing the ones already residing on my domain.

    I'm thinking that, in the above case, I'd have to strip out the URLs, sort for duplicates, re-merge the annotations with the stripped URLs while maintaining the sort order, and then co-mingle them with the newer URLs I wish to add...once those are annotated.

    As you can see, there are varying degrees of annotation, which adds to the potential 'confusion' surrounding the update.

    Because...
    <Snipped http stuff>
    www.widgets-blue.com
    www.widgets-red.com/about.html

    ...would be trivial to dedupe and sort. But if there is a lot of picking through the "annotations" and lots of rules about what is and isn't a duplicate, it'll be that much harder.

    So true and about what I figured. The only 'rules' I have are weeding out sites with similar/duplicate content - as opposed to sorting the sites I have from those I've been collecting.

    If I go with just bare URLs, the quandary is how to keep them separate, much less recall exactly where each URL points.

    While many URLs, such as www.fbi.gov, are easily recognized, other, longer URLs can be quite perplexing when it comes to recognizing what they point to.

    I should stress that the term 'duplication,' as I use it, relates to the URL itself. That is to say, on one hand I have 5,000 annotated links while the other holds 7,000 un-annotated links, all of which need sorting for duplicates, categorizing, annotating, and finally uploading in co-mingled fashion.

    Think of my domain in terms of a mini-DMOZ with a staunch academic focus that will more than double in size.

    My domain is: a .com, has held a steady PR6 as long as PR has been PR, has been featured in hard cover print and is listed by some of the finest organizations throughout the Academic World - including such organizations as the United Nations Criminal Justice Information Network. It is also crawled (on a regular basis) by just about every major (and not so major) SE bot known.

    I've wanted to complete this update for over two years, but I needed to finish my Bachelor's degree first (priorities, 'ya know). Now that I've achieved that goal, I'm on to this one.

    I trust this gives you an idea of my predicament? :o

    Thanks.

    Pendanticist.

    pendanticist

    1:47 am on Jan 13, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Macguru,

    I'm downloading FileMaker now.

    Good thing I'm on DSL, 'cause that puppy is 58.8 MB! Yeowzer!

    Pendanticist.

    andreasfriedrich

    1:59 am on Jan 13, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Seems to be a job for a practical extraction and report language ;)

    bcc1234

    2:57 am on Jan 13, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Seems to be a job for a practical extraction and report language

    Too bad it gets used for everything but these kinds of jobs :)

    andreasfriedrich

    3:12 am on Jan 13, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    One fine day you're gonna want a pearl too ;)

    victor

    10:13 am on Jan 13, 2003 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Here's a quick and dirty example of doing this in a language called Rebol (www.rebol.com).

    It assumes the data is in a file called dedupe-in.txt and that the URL ends at the first space (This isn't true for your second example, so you may need to adjust the data or tweak the program). You may also need to massage URLs with spaces in them to be %20s instead.

    It writes a file called dedupe-out.txt: this contains deduplicated URLs in alphabetical order. If we had a URL more than once, the annotations come from the first instance.

    WebmasterWorld will trash my indentation, but it is nicely laid out in the original


    rebol []

    ;; get the data in
    raw-data: read/lines %dedupe-in.txt

    ;; extract the URLs
    rest-of-line: copy []
    urls: copy []
    foreach line raw-data [
        temp: parse/all line " "
        append urls first temp
        append rest-of-line first temp
        append/only rest-of-line form next temp
    ]

    ;; sort and deduplicate them
    urls: unique sort urls
    rest-of-line: make hash! rest-of-line  ;; for speedy url -> annotation lookups

    ;; delete results file, if it exists
    error? try [delete %dedupe-out.txt]

    ;; write deduplicated list
    foreach url urls [
        write/lines/append %dedupe-out.txt join url [
            " "
            select rest-of-line url
        ]
    ]
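
    For the second format -- where the URL runs up to the first double quote rather than the first space -- one possible (untested) tweak is to swap the extraction loop above for a quote-splitting version:

    ;; variant extraction loop for the  url">Anchor</A><BR>description  lines:
    ;; everything before the first double quote is treated as the URL
    foreach line raw-data [
        either pos: find line {"} [
            append urls copy/part line pos            ;; URL = text before the quote
            append rest-of-line copy/part line pos
            append/only rest-of-line copy next pos    ;; annotation = the rest of the line
        ][
            append urls line                          ;; no quote: whole line is the URL
            append rest-of-line line
            append/only rest-of-line copy ""
        ]
    ]

    The rest of the script (the unique sort, the hash! table, and the output loop) stays the same; the output lines then carry the raw HTML annotation after each URL.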