Forum Moderators: phranque
I've amassed about 7,000 unsorted URLs (which I've yet to annotate) that I want to include in my next update. This update will add about 20 new indices as well.
I'm a one-person operation whose domain currently houses some 5,000 annotated URLs in 160+ indices.
And a couple of associated questions:
Thank You.
Pendanticist.
Take a look at sourceforge.net, they have tons of free scripts for various tasks. There may even be something there that does exactly what you need. If not you may have to customize something that is close.
jatar_k - this could be done with a script, either storing the URLs in a flatfile or db and having one or a couple of scripts to manage the list.
Uh, what are flatfile and db (database)?
Take a look at sourceforge.net, they have tons of free scripts for various tasks. There may even be something there that does exactly what you need. If not you may have to customize something that is close.
I'm over there now. A bit tough to wade thru if you don't understand half the titles/descriptions. (Computer/Software acronyms aren't my strong suit)
Customize? <chuckle> That denotes a level of expertise I don't have. My experience has been more along the lines of download, install and run.
Macguru - If you don't find a free script to do that, consider FileMaker Pro instead of Excel. I use it for almost everything.
Thanks, I appreciate it.
If anyone else has something to contribute, feel free. I could use more suggestions.
Pendanticist.
Because...
[widgets-blue.com...]
[widgets-red.com...]
...would be trivial to dedupe and sort. But if there is a lot of picking through the "annotations" and lots of rules about what is and isn't a duplicate, it'll be that much harder.
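For bare lists like those, the whole job really is a few lines. A quick sketch in Python (the function name and sample URLs are just for illustration):

```python
def dedupe(lines):
    """Return the unique URLs from a list of lines, sorted alphabetically."""
    return sorted(set(line.strip() for line in lines if line.strip()))
```

Point it at a one-URL-per-line file with something like `dedupe(open("urls.txt").readlines())` and write the result back out.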
victor - If you put all your 7,000 URLs into one text (=flat) file -- i.e. one URL per line -- can you give an example of what a few lines in that file would look like?
Yep.
Note: I've removed the http stuff in all instances.
Currently, they either literally look like this:
yvwiiusdinvnohii.net/other.html - Other Peoples' Paths
www.wri.org/ - World Resources Institute
www.law.ecel.uwa.edu.au/intlaw/arctic_and_antarctic.htm - Arctic and Antarctic Links
www.antdiv.gov.au/ - Australian Antarctic Page
www.geo.ed.ac.uk/home/giswww.html - WWW Resources: Geographic Information Systems (GIS)
www.alphacdc.com/ien/subject.html - Indigenous Environmental Network
X's 7,000 - representing those needing annotations.
...or this:
www.mip.berkeley.edu/cilc/bibs/toc.html">Bibliographies of Northern and Central California Indians</A><BR>The California Indian Library Collections has collected, duplicated, assembled, and shipped more than 11,000 textual documents, nearly 25,000 photographs, and over 3,400 audio tapes. There is reward and satisfaction in having prepared over 17,000 manuscript pages for finding guides to the collections and publishing these in 44 volumes. Now a Native Californian in a remote area of northern California may find a photograph of his or her grandmother or hear, for the first time, his grandfather sing or tell a story. Researchers in rural areas are using the collections for legal defense as well as research material for documentation of an important period in California history.www.blackfeetnation.com/">Blackfeet Nation -- Welcome to the Official Site of the Blackfeet Nation</A><BR>Official Site of the Blackfeet Nation based in Browning Montana.
www.tlingit-haida.org/">CCTHITA.org</A><BR>Central Council is the tribal government representing over 24,000 Tlingit and Haida indians worldwide.
www.sioux.org/">Cheyenne River Sioux Tribe/Native American/Indian Government</A><BR>The Cheyenne River Sioux Tribe is proud to introduce it's Website to the Internet! This site has been re-designed, programmed, prepared, and written by CRST Chairman, Gregg Bourland.
www.chiefs-of-ontario.org/">Chiefs of Ontario</A><BR>In March of 1975, at the First Annual All Ontario Chiefs Conference, a joint First Nations Association Coordinating Committee was formed, constituting an unincorporated federation of the four major Ontario First Nation organizations. The purpose of the committee was to provide a single Ontario representative to the Assembly of First Nations (then, the National Indian Brotherhood). From this committee emerged the Chiefs of Ontario office whose basic purpose is to enable the political leadership to discuss and to decide on regional, provincial and national priorities affecting First Nation people in Ontario and to provide a unified voice on these issues. The Chiefs of Ontario office has become a vehicle to facilitate discussions between the Ontario government and First Nation people in Ontario.
X's 5,000 - representing those already residing on my domain.
I'm thinking, in the above case, I'd have to strip out the URL, sort for duplicates, re-merge the annotation with the stripped URL while maintaining the sort order, and then co-mingle with the newer URLs I wish to add...once they are annotated.
As you can see, there are varying degrees of annotation, which adds to the potential 'confusion' surrounding the update.
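Both layouts above boil down to a (URL, annotation) pair per entry, so the strip/dedupe/re-merge step could be sketched like this in Python (the split rules below are guesses from the samples, not a tested parser for the real files):

```python
import re

def parse_entry(line):
    """Split one line into (url, annotation) for either sample format:
       new list:  'www.wri.org/ - World Resources Institute'
       old list:  'www.sioux.org/">Title</A><BR>Description...'
    """
    line = line.strip()
    if '">' in line:                          # old HTML-fragment format
        url, rest = line.split('">', 1)
        # drop the <A>/<BR> leftovers and collapse the whitespace
        note = " ".join(re.sub(r"</?A>|<BR>", " ", rest).split())
    elif " - " in line:                       # new dash-separated format
        url, note = line.split(" - ", 1)
    else:                                     # bare URL, not yet annotated
        url, note = line, ""
    return url.strip(), note.strip()

def merge(old_lines, new_lines):
    """Co-mingle both lists; the first annotation seen for a URL wins."""
    seen = {}
    for line in old_lines + new_lines:
        url, note = parse_entry(line)
        if url and url not in seen:
            seen[url] = note
    return [f"{url} - {seen[url]}" if seen[url] else url
            for url in sorted(seen)]
```

Entries from the old list are passed in first, so their (fuller) annotations survive when the same URL turns up in the new batch.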
Because...
<Snipped http stuff>
www.widgets-blue.com
www.widgets-red.com/about.html
...would be trivial to dedupe and sort. But if there is a lot of picking through the "annotations" and lots of rules about what is and isn't a duplicate, it'll be that much harder.
So true, and about what I figured. The only 'rules' I have are weeding out sites with similar/duplicate content - as opposed to sorting the sites I have from those I've been collecting.
If I go with just bare URLs, the quandary is how they'd be kept separate, much less how I'd recall exactly where each URL points.
While many URLs are easily recognized, such as www.fbi.gov, other, longer URLs can be quite perplexing when it comes to recognition.
I should stress that the term 'duplication' I use actually relates to the URL itself. Which is to say, on one hand I have 5,000 annotated links while the other hand holds 7,000 un-annotated links - all of which need sorting for duplicates, categorizing, annotating and finally uploading in co-mingled fashion.
Think of my domain in terms of a mini-DMOZ with a staunch academic focus that will more than double in size.
My domain is: a .com, has held a steady PR6 as long as PR has been PR, has been featured in hard cover print and is listed by some of the finest organizations throughout the Academic World - including such organizations as the United Nations Criminal Justice Information Network. It is also crawled (on a regular basis) by just about every major (and not so major) SE bot known.
I've wanted to complete this update for over two years, but I needed to finish my Bachelor's degree first (priorities, 'ya know). Now that I've achieved that goal, I'm on to this one.
I trust this gives you an idea of my predicament? :o
Thanks.
Pendanticist.
It assumes the data is in a file called dedupe-in.txt and that the URL ends at the first space (This isn't true for your second example, so you may need to adjust the data or tweak the program). You may also need to massage URLs with spaces in them to be %20s instead.
It writes a file called dedupe-out.txt: this contains deduplicated URLs in alphabetical order. If we had a URL more than once, the annotations come from the first instance.
WebmasterWorld will trash my indentation, but it is nicely laid out in the original
rebol []

;; get the data in
raw-data: read/lines %dedupe-in.txt

;; extract the URLs
rest-of-line: copy []
urls: copy []
foreach line raw-data [
    temp: parse/all line " "
    append urls first temp
    append rest-of-line first temp
    append/only rest-of-line form next temp
]

;; Sort and deduplicate them
urls: unique sort urls
rest-of-line: make hash! rest-of-line  ;; for speedy access

;; Delete results file, if it exists
error? try [delete %dedupe-out.txt]

;; write deduplicated list
foreach url urls [
    write/lines/append %dedupe-out.txt join url [
        " "
        select rest-of-line url
    ]
]
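For anyone without REBOL handy, a rough Python equivalent of the same logic (same assumptions: the URL ends at the first space, and the first instance's annotation wins; the filenames match the script above):

```python
def dedupe_file(in_path="dedupe-in.txt", out_path="dedupe-out.txt"):
    """Read one entry per line, split at the first space, keep the first
    annotation seen per URL, and write the URLs out in alphabetical order."""
    seen = {}
    with open(in_path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            url, _, rest = line.partition(" ")
            seen.setdefault(url, rest)   # first instance wins
    with open(out_path, "w") as f:
        for url in sorted(seen):
            f.write(f"{url} {seen[url]}\n")
```

As with the REBOL version, URLs containing literal spaces would need massaging into %20s first.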