Sounds simple huh? It really is, but try to get it to do this in under an hour is the big hoop to jump. I would prefer to just stuff them in a temp mysql DB but it's not available on this project.
Just pushing them on an array and comparing won't work, it's too huge. Right now I'm fiddling with List::Compare using an intersection, but for some reason I think it may be malfunctioning when I try to read in the large list.
I'm down to reading in a "chunk" of the larger list, comparing the chunk against the smaller list, then storing matches and errors incrementally. Still slow.
How would you approach something like this?
Thinking grep might give me the speed we're after though, that's next on the block. In the meantime I DID build a mysql alternate, the whole thing runs in about 20 minutes.
3 billion comparisons: check each email address in a file of 3000 lines against a file of 1 million. Eak!
In the first I import the files into mySQL and just do a single query. :-) About 8 minutes.
The other I store the smaller file in an array (about 3000 lines) and do a `grep` on the larger file for each item in the array.
For a Pathologically Eclectic Rubbish Lister I would think Perl would have more mechanisms for huge files. In all my years I've just never dealt with text files this large. Guess I always found a better way!
Doesn't diff just compare the files and displays what's different?
diff with -u option uses a unified output format in which all non-matching lines would begin with '+' or '-'.
pipe that into grep with -v option which inverts the match, returning non-matching items, which would therefore be common to both files.
no promises - all i can say is it's worth a try.
(store smaller list in array @arr)
foreach $v (@arr) {
undef($result);
$result = `grep $v $wdir/$target_file`;
if ($result) { $matches++; &log_matches($v); }
}
Then used wc to get the line count. I guess I've been spoiled by mySQL. :-(
I "guestimated" comparing a one million lines file against a second 1 million lines file to take about 12 hours when commands were nested in a program (which is probably due to my lack of proficiency in Unix commands.)
Using a mySQL-only solution - insert into two slim tables, apply different selects on them (joined and not joined-) - I thought would be the fastest, added up to about 50 hours! I say guestimate because I only let these run for half an hour and did the math.
I refused to give up and just say "that's as good as it gets." What I wound up doing was one insert of the first table (error-checking the syntax on the fly,) which tooks 3-5 minutes. Once that table's in, I didn't waste any time on inserting the second. I just opened the second file, cleaned the email up, then dropped it into a select statement
select email from table1 where email='$email'
Which doesn't force you to go through the entire list. At regular intervals, I popped in a value that indicated progress through the second list. As each line was brought up to bat, I also performed error checks for invalid email syntax, and dropped the malcontents into an error log table (one for each list.)
It turned out that this billion computation comparison takes around . . . .
45 minutes! :-)
That's on a Centos single processor machine. Probably would do better on a dual.