| iterating through csv to find lines different from another csv
|
tec4

msg:4502439 | 11:34 am on Oct 1, 2012 (gmt 0) | I'm currently iterating through lines in a CSV file to check if each line exists in a second CSV file and if it does not, I store the line in an array so I can prune the "mis-matched" items out of my system.

My Problem: This is working fine with smaller CSVs but now that I am working with two CSVs over 85,000 lines, my CPU is spiking and being used 70-85% on this single script and is taking a tremendous amount of time to finish. I am wondering if there is a better way of going about it to make what I am trying to do more efficient.

My Code:

//Two CSV files
$csv = "data.csv";
$csv_local = "local_data.csv";

//Parsing CSV data into arrays
$feed_info = parseData($csv);
$local_info = parseData($csv_local);

//Parsing Function
function parseData($csv_file){
    $file_pointer = fopen($csv_file, "r");
    $array = array();
    while($line = fgets($file_pointer)) {
        $array[] = trim($line);
    }
    return $array;
}

//Store mis-matched lines in array
if(count($feed_info) > 1 && count($local_info) > 1){
    $mis_match_array = array();
    foreach($local_info as $info){
        if(!in_array($info,$feed_info)){
            $mis_match_array[] = $info;
        }
    }
}

Can't think of a better/less resource intensive way of going about this - any thoughts? Thanks!
|
coopster

msg:4507995 | 7:19 pm on Oct 14, 2012 (gmt 0) | Maybe array_diff and/or array_intersect? [php.net...]
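
A minimal sketch of the array_diff() suggestion, assuming both files have already been read into arrays of trimmed lines the way parseData() does in the original post. The sample data is made up for illustration. array_diff() returns the values of its first argument that do not appear in the second, which is what the foreach/in_array loop collects:

```php
<?php
// Hypothetical sample data standing in for the two parsed CSV files.
$feed_info  = array("1,apple", "2,banana", "3,cherry");
$local_info = array("1,apple", "4,durian", "3,cherry", "5,elderberry");

// Lines present in $local_info but missing from $feed_info.
// array_diff() preserves the original keys, so array_values() reindexes from 0.
$mis_match_array = array_values(array_diff($local_info, $feed_info));

print_r($mis_match_array);
```

Note that both arrays are still held in memory in full, so this addresses the CPU time of the nested lookup rather than memory usage.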
|
swa66

msg:4508009 | 9:24 pm on Oct 14, 2012 (gmt 0) | I guess memory usage is going to cause your system to thrash. You could try to load just one file and parse the other line by line without loading it all into memory (parse it yourself); that should cut down your memory usage by about half.
|
penders

msg:4508486 | 4:36 pm on Oct 16, 2012 (gmt 0) | I would certainly look at the array functions, as coopster suggests. Another alternative is to store just one file "data.csv" as the keys of an array (not the values), then step through "local_data.csv" line by line (don't read into memory in its entirety) and check for its presence in the array using isset() - this is much more efficient than using in_array().
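
The keys-plus-isset() approach described above could look roughly like this. It is a sketch, not a drop-in replacement: findMissingLines() is a hypothetical helper name, the demo files and their contents are invented, and skipping blank lines is an added assumption. The gain comes from in_array() scanning the whole array on every call, while isset() on a key is a hash lookup.

```php
<?php
// Hypothetical helper: returns the lines of $local_file that are absent
// from $feed_file, or false if either file cannot be opened.
function findMissingLines($feed_file, $local_file) {
    // Load the feed file as array KEYS (not values) so each later
    // check is a hash lookup instead of a linear scan.
    $feed_keys = array();
    $fp = fopen($feed_file, "r");
    if ($fp === false) {
        return false;
    }
    while (($line = fgets($fp)) !== false) {
        $feed_keys[trim($line)] = true;
    }
    fclose($fp);

    // Stream the local file one line at a time instead of reading it whole.
    $missing = array();
    $fp = fopen($local_file, "r");
    if ($fp === false) {
        return false;
    }
    while (($line = fgets($fp)) !== false) {
        $line = trim($line);
        // Assumption: blank lines are not worth reporting as mis-matches.
        if ($line !== "" && !isset($feed_keys[$line])) {
            $missing[] = $line;
        }
    }
    fclose($fp);

    return $missing;
}

// Demo with two small temporary files (made-up data).
$feed  = tempnam(sys_get_temp_dir(), "feed");
$local = tempnam(sys_get_temp_dir(), "local");
file_put_contents($feed, "1,apple\n2,banana\n3,cherry\n");
file_put_contents($local, "1,apple\n4,durian\n3,cherry\n");

$mis_match_array = findMissingLines($feed, $local);
print_r($mis_match_array);

unlink($feed);
unlink($local);
```

Only the feed file's lines are kept in memory; the local file is never loaded as a whole, which also covers the memory concern raised earlier in the thread.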
|
tec4

msg:4509908 | 4:58 pm on Oct 19, 2012 (gmt 0) | Okay, awesome. Thank you all for your ideas and input. I'm going to give it a go later today and see how it improves.
|