

LOAD DATA INFILE duplicate entries problem


Shanee

11:02 pm on Jan 26, 2010 (gmt 0)

10+ Year Member



Hello,
I have been googling for three hours trying to find a way to force LOAD DATA INFILE not to import duplicate records, or at least to make the query stop if any duplicate ID is found. Here is my query:

$query = "LOAD DATA INFILE 'file.csv'
          INTO TABLE members
          FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
          LINES TERMINATED BY '\r\n'
          IGNORE 1 LINES";

mysql_query($query) or die(mysql_error());

And here is the MySQL table:

CREATE TABLE IF NOT EXISTS `members` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`reg_date` date DEFAULT NULL,
`expiry_date` date DEFAULT NULL,
`username` varchar(100) DEFAULT NULL,
`password` varchar(100) DEFAULT NULL,
`username2` varchar(100) DEFAULT NULL,
`second_pwd` varchar(100) DEFAULT NULL,
`fname` varchar(100) DEFAULT NULL,
`mname` varchar(100) DEFAULT NULL,
`lname` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
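
For what it's worth, LOAD DATA INFILE itself accepts an IGNORE keyword (placed before INTO TABLE) that silently skips any input row whose PRIMARY KEY or UNIQUE value already exists in the table; with neither REPLACE nor IGNORE (and no LOCAL), MySQL instead aborts with an error on the first duplicate key, which is the "stop the query" behavior asked about. The skip only fires on indexed keys, so with the table above it only catches duplicate ids; deduplicating on another column such as username would first need a unique index. A minimal sketch, assuming the CSV's first column maps to id and that username is the extra column to keep unique:

// assumption: only needed if deduping on username as well; this ALTER
// fails if the table already contains duplicate usernames
mysql_query("ALTER TABLE members ADD UNIQUE KEY uniq_username (username)")
    or die(mysql_error());

// IGNORE (before INTO TABLE) skips rows that duplicate an existing
// unique-key value instead of aborting the load
$query = "LOAD DATA INFILE 'file.csv'
          IGNORE INTO TABLE members
          FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
          LINES TERMINATED BY '\r\n'
          IGNORE 1 LINES";
mysql_query($query) or die(mysql_error());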

jatar_k

4:22 pm on Jan 27, 2010 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



how big is the file?

if you're having trouble with LOAD DATA then maybe write a quick PHP script to weed out the duplicates before you insert.

Shanee

8:23 pm on Jan 27, 2010 (gmt 0)

10+ Year Member



not more than 7000 records.

jatar_k

3:13 pm on Jan 28, 2010 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



you should be able to burn through it quickly, I would think

you would need to:
open the file
get the next line
check for a duplicate
if it's a dupe, skip it; otherwise insert the row

the slow part would be the check
you could run a query to see if the key is in the db already
you could also maintain an in-memory list of already-inserted keys (a sketch of that option follows below)
you could even use the script just to find the duplicates and give you a list, then remove them manually or programmatically
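
A minimal sketch of the in-memory key list option, assuming the CSV's first column is the id being deduplicated, the column order matches the members table, and a mysql_connect()/mysql_select_db() connection is already open as in the original snippet:

<?php
$seen = array();    // keys inserted so far (or already present in the table)

// optionally pre-load the ids already in the table so re-runs are safe
$result = mysql_query("SELECT id FROM members") or die(mysql_error());
while ($row = mysql_fetch_row($result)) {
    $seen[$row[0]] = true;
}

$fh = fopen('file.csv', 'r') or die('cannot open file.csv');
fgetcsv($fh);       // skip the header line (the IGNORE 1 LINES equivalent)

while (($fields = fgetcsv($fh)) !== false) {
    if ($fields[0] === null) {
        continue;   // blank line in the CSV
    }
    $id = $fields[0];
    if (isset($seen[$id])) {
        continue;   // duplicate key: skip the row
    }
    $seen[$id] = true;

    // escape every field before building the INSERT
    $escaped = array_map('mysql_real_escape_string', $fields);
    $query   = "INSERT INTO members VALUES ('" . implode("','", $escaped) . "')";
    mysql_query($query) or die(mysql_error());
}
fclose($fh);
?>

At 7000 rows the isset() lookup on an array key is effectively free, so the only real cost is the INSERTs themselves.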

I guess it depends on whether there is any pattern to the duplicates. If the file is sorted, the duplicates may always appear right after each other, and you could just keep the last row's key for comparison (see the trimmed loop below). It also depends on whether this is a one-off or a job that needs to run constantly.
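
If the file really is sorted on the key, the in-memory list isn't needed at all; comparing each row against the previous key is enough. The loop from the sketch above, trimmed for that case:

$prev = null;
while (($fields = fgetcsv($fh)) !== false) {
    if ($fields[0] === $prev) {
        continue;   // same key as the previous row: a duplicate in sorted input
    }
    $prev = $fields[0];
    // escape the fields and INSERT as in the sketch above
}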

I also wonder why there are duplicates in the original data in the first place; if this import is going to run continuously, those issues should also be addressed on the front end.