Forum Moderators: coopster & phranque

Message Too Old, No Replies

analyzing distribution of the duplicate values in a list

using List::MoreUtils perl module?

         

RudyS

8:21 pm on Oct 4, 2008 (gmt 0)

10+ Year Member



hi guys

this reply is obliquely relevant to this post [webmasterworld.com], which relates to a specific question i have ... i have a file with over 5 million lines containing exactly 36 characters (only [ACGT]) in each line ... i want to obtain information about the number of replicate lines ... so i found a MODULE at cpan (List::MoreUtils) that provides the code for the function "uniq" and used it in the following "testuniq" script :
-----------------------------
#!/usr/bin/perl

use strict;
use warnings;
use List::MoreUtils qw( uniq );
open (SEQS, '<5Mseqs');
print "this is the beginning of the 5meg unique test\n";
my @array = (<SEQS>);
my $uniquecount = uniq( @array );
print "these are the numer of unique lines in the 5,130,912 line file: $uniquecount \n";
print "this is the end\n";
---------------------------

in less than 60 seconds it printed the following:

------------------------

[root@N1233 List-MoreUtils-0.22]# perl testuniq
this is the beginning of the 5meg unique test
these are the numer of unique lines in the 5,130,912 line file: 4342967
this is the end

------------------------

compared to my own amateurish script without the "uniq" function this was amazingly efficient ... however, now i want some more information and am stymied about how to get it ... from the cpan page:
-------------------------
uniq LIST

Returns a new list by stripping duplicate values in LIST. The order of elements in the returned list is the same as in LIST. In scalar context, returns the number of unique elements in LIST.

my @x = uniq 1, 1, 2, 2, 3, 5, 3, 4; # returns 1 2 3 5 4
my $x = uniq 1, 1, 2, 2, 3, 5, 3, 4; # returns 5

--------------------------

but what about the the "duplicate values in LIST" ... i am actually very interested in knowing something about the distribution of the duplicate values ... or is this part of the penalty i pay for using a c program that is "hardwired" instead of a comfy perl script ? now if i had p_d at my side im sure he would solve the problem with real programming (ab initio) ... but is it possible to write a perl script that could match the speed of the uniq function called from the List::MoreUtils module? i mean 5 million is a lot of lines, no?

RudyS

[edited by: phranque at 7:31 am (utc) on Oct. 5, 2008]
[edit reason] disabled smileys ;) [/edit]

IanKelley

11:18 pm on Oct 4, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Great example of a module being an inefficient way to go.

The following simple code will accomplish what you want in under 15 seconds on a 5 million line file where each line is 36 chars (tested on a low end server that was already loaded).


my %hash = (); $¦ = 1;
print "Content-type: text/html\n\n<p>Start...</p>";
open(DATA,"testfile.txt"); flock(DATA,2);
while (<DATA>) { $count++ unless ($hash{$_}++); $lines++; }
close(DATA);
print "<p>Complete, found <b>$count</b> unique lines out of <b>$lines</b> total lines.</p>";

You can examine the values in the resulting %hash to see where the dupes turn up.

[edited by: phranque at 7:31 am (utc) on Oct. 5, 2008]
[edit reason] disabled smileys ;) [/edit]

RudyS

1:59 pm on Oct 5, 2008 (gmt 0)

10+ Year Member



wow IanKelley ... looks like a lean mean script ... will test tomorrow and let you know ... thanks and glad i was able to help prove your point
RudyS

RudyS

6:08 pm on Oct 5, 2008 (gmt 0)

10+ Year Member



iankelley you are sooo right ... 11 seconds start to finish ...

[root@N1233 List-MoreUtils-0.22]# perl testuniq
Content-type: text/html

<p>Start...
</p>
<p>Complete, found <b>4342967</b> unique lines out of <b>5130912</b> total lines.
</p>

the only mechanical glitch was copying the "pipe" character "¦" from this page to the vi editor ... just had to enter it with the local keyboard ...

now i "just" have to figure out what you meant about the %hash contents ... but you certainly convinced me that going to modules for shortcuts is no good replacement for writing intelligent scripts ... being a scientist i want to have control of my tools in case something interesting pops up ... unfortunately i came to bioinformatics late and am an amateur scripter ... but i will remember this experience!

RudyS

phranque

8:07 am on Oct 6, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



from your example data in your OP, $hash->{1} would contain 2, $hash->{5} would contain 1, etc...

IanKelley

6:41 pm on Oct 6, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



now i "just" have to figure out what you meant about the %hash contents

I'm not sure how far into perl scripting you've gotten so... A hash is a matrix. In this case it's a matrix where the rows (keys) are unique lines and the columns (values) are the number of times the data was repeated.

So lines that only appeared once would have a value of 1, lines that appear twice 2, etc...

RudyS

7:11 pm on Oct 6, 2008 (gmt 0)

10+ Year Member



thanks iankelley/phranque
i phollowed phranque's comment and pasted the numbers
1, 1, 2, 2, 3, 5, 3, 4
into 8 lines in a file and ran the following script to try to understand how a hash works
---------------------------------------
#!/usr/bin/perl

%hash = ();
open(SEQS,"8lines");
while (<SEQS>) { $count++ unless ($hash{$_}++); $lines++; }
close(DATA);
print "found $count unique lines out of $lines total lines.\n";
print "unique lines as a fraction of total lines is: ";
$fractionunique = ($count / $lines);
printf '%.4f', "$fractionunique";
print "\n" ;
@allkeys = keys(%hash);
foreach $thekey (@allkeys)
{
print "$thekey";
}
print "\n";
@allvalues = values(%hash);
foreach $thevalue (@allvalues)
{
print "$thevalue"
}

-------------------------
it printed out:

c:\perl\bin>perl unique8get
found 5 unique lines out of 8 total lines.
unique lines as a fraction of total lines is: 0.6250
1
2
5
3
4

22121

--------------
so the keys are the original numbers in the file ... all the ones that were unique ... and the values are the number of occurences of each key ... what i am working toward is getting a list of occurences ... i want to find out what are the most highly repeated sequences in my original 5million line file ... its a little confusing to think of the sequences as the keys and the values as the number of occurences but it makes sense ... otherwise the trick of filling a hash to detect duplicates wouldnt work

will keep you updated

rudyS

[edited by: phranque at 2:37 am (utc) on Oct. 9, 2008]
[edit reason] disabled smileys ;) [/edit]

IanKelley

7:51 pm on Oct 6, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This (untested) code should produce a sorted list based on frequency of occurence:


foreach $k (sort {$hash{$b} <=> $hash{$a}} keys %hash)
{
print "$k\t$hash{$k}\n";
}

Swap $a and $b to sort ascending.

[edit: Updated for command line output]

[edited by: IanKelley at 8:03 pm (utc) on Oct. 6, 2008]

RudyS

2:58 am on Oct 7, 2008 (gmt 0)

10+ Year Member



nice iankelley

added it to your previous and it worked fine ... even on my little laptop here it gave me the answer from the 5megaline file ... top of the output was most valuable ...

read oversampling
CATACCATAGTAATAGTTATACACCGAAAAAGAAAT343
GCTATCAAAGAAAATTTCCCAAATTTGAAAAAGATG197
CATCCAACTTTGTTCTATTTCTTTGATTTGAAATAA173
GTTATCAGGGAATAGCTGCATACCATAGTAATAGTT172
GATTGTAAATAAGACAGAAAAAATGCTTTTACAATC163
CATAAATCAACAAAATCTTTATCTGGTAAGGGATTT163
CATAAATTGACTAAATCCTTCTCTGGCAAGAGACTT157

all of these reads are from a known transposon in mycobacteria ... with the top few hundred reads i should be able to assemble a transposon representative of this species ...

so you contributed to public health ... maybe ... we thank you

rudyS