this reply is obliquely relevant to this post [webmasterworld.com], which relates to a specific question i have ... i have a file with over 5 million lines containing exactly 36 characters (only [ACGT]) in each line ... i want to obtain information about the number of replicate lines ... so i found a MODULE at cpan (List::MoreUtils) that provides the code for the function "uniq" and used it in the following "testuniq" script :
-----------------------------
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw( uniq );
open (SEQS, '<5Mseqs');
print "this is the beginning of the 5meg unique test\n";
my @array = (<SEQS>);
my $uniquecount = uniq( @array );
print "these are the numer of unique lines in the 5,130,912 line file: $uniquecount \n";
print "this is the end\n";
---------------------------
in less than 60 seconds it printed the following:
------------------------
[root@N1233 List-MoreUtils-0.22]# perl testuniq
this is the beginning of the 5meg unique test
these are the numer of unique lines in the 5,130,912 line file: 4342967
this is the end
------------------------
compared to my own amateurish script without the "uniq" function this was amazingly efficient ... however, now i want some more information and am stymied about how to get it ... from the cpan page:
-------------------------
uniq LIST
Returns a new list by stripping duplicate values in LIST. The order of elements in the returned list is the same as in LIST. In scalar context, returns the number of unique elements in LIST.
my @x = uniq 1, 1, 2, 2, 3, 5, 3, 4; # returns 1 2 3 5 4
my $x = uniq 1, 1, 2, 2, 3, 5, 3, 4; # returns 5
--------------------------
but what about the the "duplicate values in LIST" ... i am actually very interested in knowing something about the distribution of the duplicate values ... or is this part of the penalty i pay for using a c program that is "hardwired" instead of a comfy perl script ? now if i had p_d at my side im sure he would solve the problem with real programming (ab initio) ... but is it possible to write a perl script that could match the speed of the uniq function called from the List::MoreUtils module? i mean 5 million is a lot of lines, no?
RudyS
[edited by: phranque at 7:31 am (utc) on Oct. 5, 2008]
[edit reason] disabled smileys ;) [/edit]
The following simple code will accomplish what you want in under 15 seconds on a 5 million line file where each line is 36 chars (tested on a low end server that was already loaded).
my %hash = (); $¦ = 1;
print "Content-type: text/html\n\n<p>Start...</p>";
open(DATA,"testfile.txt"); flock(DATA,2);
while (<DATA>) { $count++ unless ($hash{$_}++); $lines++; }
close(DATA);
print "<p>Complete, found <b>$count</b> unique lines out of <b>$lines</b> total lines.</p>";
You can examine the values in the resulting %hash to see where the dupes turn up.
[edited by: phranque at 7:31 am (utc) on Oct. 5, 2008]
[edit reason] disabled smileys ;) [/edit]
[root@N1233 List-MoreUtils-0.22]# perl testuniq
Content-type: text/html
<p>Start...
</p>
<p>Complete, found <b>4342967</b> unique lines out of <b>5130912</b> total lines.
</p>
the only mechanical glitch was copying the "pipe" character "¦" from this page to the vi editor ... just had to enter it with the local keyboard ...
now i "just" have to figure out what you meant about the %hash contents ... but you certainly convinced me that going to modules for shortcuts is no good replacement for writing intelligent scripts ... being a scientist i want to have control of my tools in case something interesting pops up ... unfortunately i came to bioinformatics late and am an amateur scripter ... but i will remember this experience!
RudyS
now i "just" have to figure out what you meant about the %hash contents
I'm not sure how far into perl scripting you've gotten so... A hash is a matrix. In this case it's a matrix where the rows (keys) are unique lines and the columns (values) are the number of times the data was repeated.
So lines that only appeared once would have a value of 1, lines that appear twice 2, etc...
%hash = ();
open(SEQS,"8lines");
while (<SEQS>) { $count++ unless ($hash{$_}++); $lines++; }
close(DATA);
print "found $count unique lines out of $lines total lines.\n";
print "unique lines as a fraction of total lines is: ";
$fractionunique = ($count / $lines);
printf '%.4f', "$fractionunique";
print "\n" ;
@allkeys = keys(%hash);
foreach $thekey (@allkeys)
{
print "$thekey";
}
print "\n";
@allvalues = values(%hash);
foreach $thevalue (@allvalues)
{
print "$thevalue"
}
-------------------------
it printed out:
c:\perl\bin>perl unique8get
found 5 unique lines out of 8 total lines.
unique lines as a fraction of total lines is: 0.6250
1
2
5
3
4
22121
--------------
so the keys are the original numbers in the file ... all the ones that were unique ... and the values are the number of occurences of each key ... what i am working toward is getting a list of occurences ... i want to find out what are the most highly repeated sequences in my original 5million line file ... its a little confusing to think of the sequences as the keys and the values as the number of occurences but it makes sense ... otherwise the trick of filling a hash to detect duplicates wouldnt work
will keep you updated
rudyS
[edited by: phranque at 2:37 am (utc) on Oct. 9, 2008]
[edit reason] disabled smileys ;) [/edit]
added it to your previous and it worked fine ... even on my little laptop here it gave me the answer from the 5megaline file ... top of the output was most valuable ...
read oversampling
CATACCATAGTAATAGTTATACACCGAAAAAGAAAT343
GCTATCAAAGAAAATTTCCCAAATTTGAAAAAGATG197
CATCCAACTTTGTTCTATTTCTTTGATTTGAAATAA173
GTTATCAGGGAATAGCTGCATACCATAGTAATAGTT172
GATTGTAAATAAGACAGAAAAAATGCTTTTACAATC163
CATAAATCAACAAAATCTTTATCTGGTAAGGGATTT163
CATAAATTGACTAAATCCTTCTCTGGCAAGAGACTT157
all of these reads are from a known transposon in mycobacteria ... with the top few hundred reads i should be able to assemble a transposon representative of this species ...
so you contributed to public health ... maybe ... we thank you
rudyS