
Forum Moderators: coopster & jatar k


Iterating through a CSV to find lines different from another CSV

     
11:34 am on Oct 1, 2012 (gmt 0)

Junior Member

joined:June 16, 2011
posts: 79
votes: 0


I'm currently iterating through the lines of one CSV file, checking whether each line exists in a second CSV file. If it does not, I store the line in an array so I can prune the "mis-matched" items out of my system.

My Problem: This works fine with smaller CSVs, but now that I am working with two CSVs of over 85,000 lines each, this single script spikes the CPU to 70-85% and takes a tremendous amount of time to finish. Is there a better way of going about this to make it more efficient?

My Code:

//Two CSV files
$csv = "data.csv";
$csv_local = "local_data.csv";

//Parsing CSV data into arrays
$feed_info = parseData($csv);
$local_info = parseData($csv_local);

//Parsing function: read each line of the file into an array
function parseData($csv_file) {
    $file_pointer = fopen($csv_file, "r");

    $array = array();
    while ($line = fgets($file_pointer)) {
        $array[] = trim($line);
    }
    fclose($file_pointer); //release the file handle when done

    return $array;
}



//Store mis-matched lines in an array
if (count($feed_info) > 1 && count($local_info) > 1) {

    $mis_match_array = array();

    foreach ($local_info as $info) {
        if (!in_array($info, $feed_info)) {
            $mis_match_array[] = $info;
        }
    }
}


I can't think of a better, less resource-intensive way of going about this - any thoughts?

Thanks!
7:19 pm on Oct 14, 2012 (gmt 0)

Administrator

coopster

joined:July 31, 2003
posts:12533
votes: 0


Maybe array_diff and/or array_intersect?
[php.net...]
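A minimal sketch of this suggestion (the sample data here is made up, standing in for the parsed CSV arrays): array_diff() returns the values from its first array that are not present in the second, which is exactly the "mis-matched lines" set the original foreach/in_array loop builds.

```php
<?php
//Sample data standing in for the parsed CSV lines
$feed_info  = array("a,1", "b,2", "c,3");
$local_info = array("a,1", "d,4");

//Lines in local_data.csv that do not appear in data.csv
$mis_match_array = array_diff($local_info, $feed_info);

//array_diff() preserves the original keys, so reindex if needed
$mis_match_array = array_values($mis_match_array);
```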
9:24 pm on Oct 14, 2012 (gmt 0)

Senior Member

swa66

joined:Aug 7, 2003
posts:4783
votes: 0


I guess memory usage is going to cause your system to thrash.

You could try loading just one file, and parse the other line by line without loading it all into memory; that should cut your memory usage roughly in half.
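A sketch of this streaming approach (the function name is mine, not from the thread; the file names follow the original post): only data.csv is held in memory, while local_data.csv is read one line at a time.

```php
<?php
//Load only the reference file into memory, then stream the second file
//line by line instead of reading it into an array first.
function streamMismatches($feed_file, $local_file) {
    //Load the reference file once
    $feed_info = array();
    $fp = fopen($feed_file, "r");
    while (($line = fgets($fp)) !== false) {
        $feed_info[] = trim($line);
    }
    fclose($fp);

    //Stream the second file; never hold it in memory whole
    $mis_match = array();
    $fp = fopen($local_file, "r");
    while (($line = fgets($fp)) !== false) {
        $line = trim($line);
        if ($line !== "" && !in_array($line, $feed_info)) {
            $mis_match[] = $line;
        }
    }
    fclose($fp);

    return $mis_match;
}
```

Note that this halves memory but still uses in_array(), so the CPU cost stays quadratic; it combines well with the key-lookup idea below in the thread.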
4:36 pm on Oct 16, 2012 (gmt 0)

Senior Member

penders

joined:July 3, 2006
posts: 3123
votes: 0


I would certainly look at the array functions, as coopster suggests.

Another alternative is to store just one file, "data.csv", as the keys of an array (not the values), then step through "local_data.csv" line by line (don't read it into memory in its entirety) and check for each line's presence in the array using isset() - this is much more efficient than using in_array().
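A sketch of the key-lookup idea with sample data standing in for the real files: array_flip() turns the lines into array keys, and isset() does a constant-time hash lookup instead of in_array()'s linear scan over 85,000 entries.

```php
<?php
//Sample data standing in for the parsed CSV lines
$feed_info  = array("a,1", "b,2", "c,3");
$local_info = array("a,1", "d,4");

//Flip the reference lines into keys; the values (original indexes)
//are irrelevant - only key existence matters
$feed_keys = array_flip($feed_info);

$mis_match_array = array();
foreach ($local_info as $line) {
    //isset() on a key is O(1), versus in_array()'s O(n) scan
    if (!isset($feed_keys[$line])) {
        $mis_match_array[] = $line;
    }
}
```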
4:58 pm on Oct 19, 2012 (gmt 0)

Junior Member

joined:June 16, 2011
posts: 79
votes: 0


Okay, awesome. Thank you all for your ideas and input. I'm going to give it a go later today and see how it improves.