analyzing distribution of the duplicate values in a list

hi guys

this reply is obliquely relevant to this post [webmasterworld.com], which relates to a specific question i have ... i have a file with over 5 million lines, each containing exactly 36 characters (only [ACGT]) ... i want to find out how many of the lines are replicates ... so i found a module on CPAN (List::MoreUtils) that provides the function "uniq" and used it in the following "testuniq" script:
-----------------------------
#!/usr/bin/perl

use strict;
use warnings;
use List::MoreUtils qw( uniq );
open (SEQS, '<', '5Mseqs') or die "cannot open 5Mseqs: $!";
print "this is the beginning of the 5meg unique test\n";
my @array = (<SEQS>);
my $uniquecount = uniq( @array );
print "these are the numer of unique lines in the 5,130,912 line file: $uniquecount \n";
print "this is the end\n";
---------------------------

in less than 60 seconds it printed the following:

------------------------

[root@N1233 List-MoreUtils-0.22]# perl testuniq
this is the beginning of the 5meg unique test
these are the numer of unique lines in the 5,130,912 line file: 4342967
this is the end

------------------------

compared to my own amateurish script without the "uniq" function this was amazingly efficient ... however, now i want some more information and am stymied about how to get it ... from the cpan page:
-------------------------
uniq LIST

Returns a new list by stripping duplicate values in LIST. The order of elements in the returned list is the same as in LIST. In scalar context, returns the number of unique elements in LIST.

my @x = uniq 1, 1, 2, 2, 3, 5, 3, 4; # returns 1 2 3 5 4
my $x = uniq 1, 1, 2, 2, 3, 5, 3, 4; # returns 5

--------------------------

but what about the "duplicate values in LIST"? i am actually very interested in knowing something about the distribution of the duplicate values ... or is losing that information part of the penalty i pay for using a C program that is "hardwired" instead of a comfy perl script? now if i had p_d at my side im sure he would solve the problem with real programming (ab initio) ... but is it possible to write a perl script that could match the speed of the uniq function called from the List::MoreUtils module? i mean 5 million is a lot of lines, no?

RudyS

[edited by: phranque at 7:31 am (utc) on Oct. 5, 2008]
[edit reason] disabled smileys ;) [/edit]

my %hash = ();
$| = 1;
print "Content-type: text/html\n\nStart...";
open(DATA,"testfile.txt");
flock(DATA,2);
while (<DATA>) {
    $count++ unless ($hash{$_}++);
    $lines++;
}
close(DATA);
print "Complete, found $count unique lines out of $lines total lines.";
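
the hash in the loop above already holds the occurrence count for every line, so the distribution of the duplicates can be read out of it with a second, much smaller hash. here is a minimal sketch, not a tuned implementation: the sub name duplicate_distribution and the script name dupdist.pl are made up for illustration, and i have not timed it against uniq on a 5 million line file. it reads the file line by line instead of slurping it into an array, but the hash of distinct lines still has to fit in memory.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how often each line occurs, then tally how many distinct
# lines occur once, twice, three times, and so on.
# (duplicate_distribution is a hypothetical helper name.)
sub duplicate_distribution {
    my ($fh) = @_;
    my %seen;    # line => number of occurrences
    while ( my $line = <$fh> ) {
        chomp $line;    # so a last line without "\n" still matches
        $seen{$line}++;
    }
    my %dist;    # number of occurrences => number of distinct lines
    $dist{$_}++ for values %seen;
    return ( \%seen, \%dist );
}

# Hypothetical usage: perl dupdist.pl 5Mseqs
if (@ARGV) {
    open( my $fh, '<', $ARGV[0] ) or die "cannot open $ARGV[0]: $!";
    my ( $seen, $dist ) = duplicate_distribution($fh);
    close $fh;

    printf "%d distinct lines\n", scalar keys %$seen;
    printf "%8d distinct lines occur %d time(s)\n", $dist->{$_}, $_
        for sort { $a <=> $b } keys %$dist;
}
```

the number of keys in %seen is the same figure that uniq returns in scalar context, and %dist is the distribution itself. to list the most frequent sequences, sort the keys of %seen by their counts in descending order and print the first few.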