
Forum Moderators: coopster & jatar k


iterating through csv to find lines different from another csv

     
11:34 am on Oct 1, 2012 (gmt 0)



I'm currently iterating through the lines of one CSV file to check whether each line exists in a second CSV file. If it does not, I store the line in an array so I can prune the "mis-matched" items out of my system.

My problem: this works fine with smaller CSVs, but now that I am working with two CSVs of over 85,000 lines, this single script drives CPU usage to 70-85% and takes a tremendous amount of time to finish. Is there a more efficient way to do what I am trying to do?

My Code:

// Two CSV files
$csv = "data.csv";
$csv_local = "local_data.csv";

// Parse each CSV file into an array of trimmed lines
$feed_info = parseData($csv);
$local_info = parseData($csv_local);

// Read a file line by line and return its lines as an array
function parseData($csv_file) {
    $file_pointer = fopen($csv_file, "r");

    $array = array();
    // Compare against false so a final line containing only "0" does not end the loop early
    while (($line = fgets($file_pointer)) !== false) {
        $array[] = trim($line);
    }
    fclose($file_pointer);

    return $array;
}

// Store mis-matched lines in an array
$mis_match_array = array();

if (count($feed_info) > 1 && count($local_info) > 1) {
    foreach ($local_info as $info) {
        // in_array() scans $feed_info from the start for every local line
        if (!in_array($info, $feed_info)) {
            $mis_match_array[] = $info;
        }
    }
}


I can't think of a better or less resource-intensive way of going about this. Any thoughts?

Thanks!
7:19 pm on Oct 14, 2012 (gmt 0)

coopster (WebmasterWorld Administrator)



Maybe array_diff and/or array_intersect?
[php.net...]
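
For illustration, here is a minimal sketch of how the array_diff() suggestion might look. The helper names readLines() and findMismatchesWithDiff() are hypothetical, and the sketch assumes that comparing whole trimmed lines as strings is what is wanted, as in the original code.

```php
<?php
// Hypothetical helper: read a file into an array of trimmed lines.
function readLines(string $path): array
{
    $lines = file($path, FILE_IGNORE_NEW_LINES);
    if ($lines === false) {
        throw new RuntimeException("Cannot read $path");
    }
    return array_map('trim', $lines);
}

// Hypothetical helper: return the lines of $localPath that do not appear in $feedPath.
function findMismatchesWithDiff(string $feedPath, string $localPath): array
{
    // array_diff() keeps the values of the first array that are absent from the second.
    // It preserves the original keys, so array_values() reindexes the result from 0.
    return array_values(array_diff(readLines($localPath), readLines($feedPath)));
}
```

Calling findMismatchesWithDiff("data.csv", "local_data.csv") would return the same kind of list the original loop builds in $mis_match_array. Both files are still loaded into memory in full, so this addresses the comparison cost rather than memory use.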
9:24 pm on Oct 14, 2012 (gmt 0)

swa66 (WebmasterWorld Senior Member)



I guess memory usage is going to cause your system to thrash.

You could try loading just one file and parsing the other line by line, without loading it all into memory (parse it yourself). That should cut your memory usage by about half.
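
A rough sketch of that idea, assuming a generator is an acceptable way to read a file one line at a time. The names streamLines() and findMismatchesStreaming() are made up for this example. Only the feed file is held in memory, while the local file is read lazily.

```php
<?php
// Hypothetical helper: yield the trimmed lines of a file one at a time.
function streamLines(string $path): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        while (($line = fgets($handle)) !== false) {
            yield trim($line);
        }
    } finally {
        // Runs when the generator finishes or is destroyed early.
        fclose($handle);
    }
}

// Hypothetical helper: keep only the feed in memory and stream the local file.
function findMismatchesStreaming(string $feedPath, string $localPath): array
{
    $feed = iterator_to_array(streamLines($feedPath), false);

    $mismatches = [];
    foreach (streamLines($localPath) as $line) {
        // Same lookup as the original code: a linear scan of $feed per line.
        if (!in_array($line, $feed, true)) {
            $mismatches[] = $line;
        }
    }
    return $mismatches;
}
```

Note that this only changes how much is held in memory. The in_array() scan is unchanged, so the number of comparisons stays the same as in the original loop.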
4:36 pm on Oct 16, 2012 (gmt 0)

penders (WebmasterWorld Senior Member)



I would certainly look at the array functions, as coopster suggests.

Another alternative is to store just one file, "data.csv", as the keys of an array (not the values), then step through "local_data.csv" line by line (without reading it into memory in its entirety) and check for each line's presence in the array using isset(). This is much more efficient than using in_array().
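
The key-lookup approach described above could be sketched roughly as follows. The function name findMismatchesByKey() is hypothetical, and the sketch assumes whole trimmed lines are compared, as in the original code.

```php
<?php
// Hypothetical helper: index the feed lines as array keys, then stream the local file.
function findMismatchesByKey(string $feedPath, string $localPath): array
{
    $feed = fopen($feedPath, 'r');
    if ($feed === false) {
        throw new RuntimeException("Cannot open $feedPath");
    }
    // Each feed line becomes a key, so a lookup is a hash access rather than a scan.
    $seen = [];
    while (($line = fgets($feed)) !== false) {
        $seen[trim($line)] = true;
    }
    fclose($feed);

    $local = fopen($localPath, 'r');
    if ($local === false) {
        throw new RuntimeException("Cannot open $localPath");
    }
    $mismatches = [];
    while (($line = fgets($local)) !== false) {
        $line = trim($line);
        // isset() on a key avoids scanning the whole feed, unlike in_array().
        if (!isset($seen[$line])) {
            $mismatches[] = $line;
        }
    }
    fclose($local);

    return $mismatches;
}
```

With this layout the feed is read once to build the index and the local file is read once to check it, so the work grows roughly with the total number of lines rather than with the product of the two file sizes.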
4:58 pm on Oct 19, 2012 (gmt 0)



Okay, awesome. Thank you all for your ideas and input. I'm going to give it a go later today and see how it improves.
 
