
Databases Forum

    
Massive Update Statements
webfoo

5+ Year Member
Msg#: 4598004 posted 1:03 pm on Jul 31, 2013 (gmt 0)

I have a flat file transferred to my server every morning via FTP. It is the catalog for our supplier, containing over 300,000 items. I have a PHP script that works exceptionally well for importing it and pairing the fields to match my MySQL database columns. This whole process takes less than a minute.

BUT, the next morning, when it's time to update the data, it takes FOREVER (upwards of 40 hours, assuming nothing times out). The script goes line by line through the flat file, finds the associated database record (if there is one) and updates that record with the new information. If it doesn't exist in the database, I have to insert it.

There has to be a more efficient way to perform a massive update like this. What am I missing?

 

londrum

WebmasterWorld Senior Member 5+ Year Member
Msg#: 4598004 posted 1:17 pm on Jul 31, 2013 (gmt 0)

if it only takes one minute to import it, and you get a new file every day, can't you just completely empty the database and re-import the new one?

if you don't want the site to be down for a minute then maybe you could have two identical databases (with different names). once you've imported the new file to the second database switch over which database your script connects to. then swap back and forth each day
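
(A minimal sketch of that swap idea, done with two tables in one database and MySQL's atomic RENAME TABLE rather than two separate databases; the table names, columns and credentials are all placeholders.)

<?php
// Sketch only: fill a staging table with the new file, then swap it in
// atomically so the live table is never empty. "products"/"products_staging"
// and the connection details are hypothetical.
$db = mysqli_connect('localhost', 'db_user', 'db_pass', 'shop');

mysqli_query($db, "CREATE TABLE products_staging LIKE products");
// ... run the existing one-minute import against products_staging here ...

// Both renames happen as a single atomic operation, so readers always see a full table.
mysqli_query($db, "RENAME TABLE products TO products_old,
                                products_staging TO products");
mysqli_query($db, "DROP TABLE products_old");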

webfoo

5+ Year Member
Msg#: 4598004 posted 1:40 pm on Jul 31, 2013 (gmt 0)

Londrum, I was considering truncating the table and re-importing all 300,000 records because of how fast it is. However, I decided against it because each item in the table has an auto_increment primary key that other tables reference. For example, the orders table might say that a customer ordered item 90461. If I truncate the products table every morning, and then add all the rows back into it, item 90461 is not necessarily going to point to the same product it was before. Now we don't know which item the customer actually ordered.

In order to make this feasible, I'd need to reference each product with a dual key of supplier number and SKU number, from within the orders table. Any other thoughts before I rip apart my database design?

londrum

WebmasterWorld Senior Member 5+ Year Member
Msg#: 4598004 posted 1:52 pm on Jul 31, 2013 (gmt 0)

instead of comparing today's file to the database, maybe you could use php to compare it to yesterday's file instead. just search for each product name (or whatever) and if the line is different stick it in an 'update script'. if it's new then stick it in an 'import script'. then get php to run both scripts. that will reduce the load on your database by loads, i would imagine.

it will probably still take php a long time to compare the files (considering that it will have to loop through 300,000 lines 300,000 times), but you might be able to knock a load of hours off.

i would imagine that the line numbers for each product are roughly similar from day-to-day (give or take a thousand lines, or whatever), so you should start the script looking at a particular line, based on what its current line is. and stop it looking once it's done a thousand lines (or whatever). that way you can reduce the time even more -- you'll only have to loop through 300,000 lines 1000 times. in fact, that might be a good way to reduce the time taken when looking through the database as well, if you decide to stick with that way. don't let it look through the entire database, just the 1000 lines each side of its current line number.
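
(A rough sketch of the file-to-file comparison, keyed on an identifier with an associative array so each file is only read once, rather than the nested 300,000 x 300,000 loop; the CSV layout, with the SKU in the first column, and the file names are assumptions.)

<?php
// Index yesterday's file by SKU, then stream today's file and split lines
// into "changed" and "new". File names and column order are hypothetical.
$yesterday = array();
$fh = fopen('catalog_yesterday.csv', 'r');
while (($row = fgetcsv($fh)) !== false) {
    $yesterday[$row[0]] = $row;          // index yesterday's lines by SKU
}
fclose($fh);

$updates = fopen('update_lines.csv', 'w');   // changed rows -> UPDATEs
$inserts = fopen('insert_lines.csv', 'w');   // new rows     -> INSERTs
$fh = fopen('catalog_today.csv', 'r');
while (($row = fgetcsv($fh)) !== false) {
    if (!isset($yesterday[$row[0]])) {
        fputcsv($inserts, $row);             // SKU not seen yesterday
    } elseif ($yesterday[$row[0]] !== $row) {
        fputcsv($updates, $row);             // SKU exists but the line changed
    }                                        // identical lines are skipped
}
fclose($fh);
fclose($updates);
fclose($inserts);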

Dijkgraaf

WebmasterWorld Senior Member 5+ Year Member
Msg#: 4598004 posted 9:24 pm on Jul 31, 2013 (gmt 0)

When doing the update, what values are you using to find the record in the database?
Is there an index on these values?
If the answer is no, that's where your performance problem is as it would be doing a full table scan each time to try and find the matching record.
The solution would be to create an index on those values.
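
(A one-statement sketch of that suggestion; "products", "supplier" and "sku" are assumed names, and the index should only be UNIQUE if the supplier/SKU pair really is unique per row.)

<?php
// With a composite index on the columns used in the update's WHERE clause,
// each lookup becomes an index seek instead of a full table scan.
$db = mysqli_connect('localhost', 'db_user', 'db_pass', 'shop');
mysqli_query($db, "ALTER TABLE products
                   ADD UNIQUE INDEX idx_supplier_sku (supplier, sku)");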

DrDoc

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member
Msg#: 4598004 posted 9:28 pm on Jul 31, 2013 (gmt 0)

Import the flatfile into a (well indexed) temporary table, then use a single query and well designed table JOINs to update the data.
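
(A sketch of that approach under assumed table, column and file names: bulk-load the flat file into a temporary staging table, update matches with one JOIN, insert the leftovers with another. LOAD DATA LOCAL INFILE assumes local_infile is enabled on both client and server.)

<?php
$db = mysqli_connect('localhost', 'db_user', 'db_pass', 'shop');

// Bulk-load the flat file into a well-indexed staging copy of the table.
mysqli_query($db, "CREATE TEMPORARY TABLE staging LIKE products");
mysqli_query($db, "LOAD DATA LOCAL INFILE '/path/to/catalog.csv'
                   INTO TABLE staging
                   FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'");

// One set-based UPDATE for rows that already exist, matched on supplier/SKU.
mysqli_query($db, "UPDATE products p
                   JOIN staging s ON s.supplier = p.supplier AND s.sku = p.sku
                   SET p.price = s.price, p.description = s.description");

// One set-based INSERT for rows that are only in the new file.
mysqli_query($db, "INSERT INTO products (supplier, sku, price, description)
                   SELECT s.supplier, s.sku, s.price, s.description
                   FROM staging s
                   LEFT JOIN products p ON p.supplier = s.supplier AND p.sku = s.sku
                   WHERE p.sku IS NULL");
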
bhukkel
Msg#: 4598004 posted 9:28 pm on Jul 31, 2013 (gmt 0)

Perhaps you can use MySQL's "INSERT ... ON DUPLICATE KEY UPDATE" statement?
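
(A sketch of that route with a prepared statement; it assumes a UNIQUE key on (supplier, sku) and a four-column CSV, both of which are placeholders.)

<?php
$db = mysqli_connect('localhost', 'db_user', 'db_pass', 'shop');

// One statement per line: inserts new items, updates existing ones, relying
// on the unique (supplier, sku) key to detect duplicates.
$stmt = mysqli_prepare($db,
    "INSERT INTO products (supplier, sku, price, description)
     VALUES (?, ?, ?, ?)
     ON DUPLICATE KEY UPDATE price = VALUES(price), description = VALUES(description)");
mysqli_stmt_bind_param($stmt, 'isds', $supplier, $sku, $price, $description);

$fh = fopen('/path/to/catalog.csv', 'r');
while (($row = fgetcsv($fh)) !== false) {
    list($supplier, $sku, $price, $description) = $row;
    mysqli_stmt_execute($stmt);
}
fclose($fh);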

JD_Toims

WebmasterWorld Senior Member Top Contributors Of The Month
Msg#: 4598004 posted 12:10 am on Aug 1, 2013 (gmt 0)

I'd need to reference each product with a dual key of supplier number and SKU number

This sounds like the way to go...

I'm not sure if this is what you're doing currently or not, but it seems like you could index the supplier & sku cols (I'd set the supplier to unique unless they both will be unique, otherwise, unique on the supplier and index on the sku) in the database, grab the current supplier & sku from the flat file and update based on those or insert if they weren't present.

I'd personally try looping through with: fgets, mysqli_query [(UPDATE table USE INDEX (supplier,sku) SET (cols) VALUES ($from_fgets) WHERE supplier=$current_supplier AND sku=$current_sku);], then mysqli_affected_rows in a conditional [if(mysqli_affected_rows()<1) { mysqli_query(INSERT INTO table (cols) VALUES ($from_fgets)); }], so all you're doing is looping through the file one line at a time checking to see if a row was updated and inserting if it wasn't.

* Sorry for the spaghetti, not feeling like actually coding right now.
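
(For reference, a tidied-up version of that loop; the table/column names and CSV layout are assumptions. One gotcha: mysqli_affected_rows() also returns 0 when the row exists but nothing actually changed, so the fallback INSERT leans on the unique (supplier, sku) key, here via INSERT IGNORE, to reject those rows.)

<?php
$db = mysqli_connect('localhost', 'db_user', 'db_pass', 'shop');

$fh = fopen('/path/to/catalog.csv', 'r');
while (($row = fgetcsv($fh)) !== false) {
    list($supplier, $sku, $price, $description) = $row;
    $supplier = (int) $supplier;
    $price    = (float) $price;
    $sku  = mysqli_real_escape_string($db, $sku);
    $desc = mysqli_real_escape_string($db, $description);

    // Try the update first; the (supplier, sku) index makes this a seek.
    mysqli_query($db, "UPDATE products
                       SET price = $price, description = '$desc'
                       WHERE supplier = $supplier AND sku = '$sku'");

    // No row updated (or unchanged): insert, letting the unique key drop duplicates.
    if (mysqli_affected_rows($db) < 1) {
        mysqli_query($db, "INSERT IGNORE INTO products (supplier, sku, price, description)
                           VALUES ($supplier, '$sku', $price, '$desc')");
    }
}
fclose($fh);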

webfoo

5+ Year Member
Msg#: 4598004 posted 12:38 pm on Aug 1, 2013 (gmt 0)

I'd need to reference each product with a dual key of supplier number and SKU number
This sounds like the way to go...


Yes, this is ultimately what I decided to do. SKUs are unique within a supplier. So if the orders table says that a customer ordered SKU 30564 from supplier 9, it will always be the same product, regardless of which row id that product currently sits in.

The entire process of FTPing the file from the supplier, unzipping it, erasing the old records, and inserting the new records takes only 2 or 3 minutes. This is an amazing improvement on 40 hours. It could probably be made more efficient, but it is good enough for the purposes of this system. This process is scheduled to run at off-peak hours in the middle of the night so 2 minutes of downtime is not an issue. The system is only used during business hours anyway.
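
(A bare-bones outline of that nightly rebuild; the paths, columns and the use of LOAD DATA in place of the existing import script are all assumptions, with the script fired from cron in the middle of the night.)

<?php
$db = mysqli_connect('localhost', 'db_user', 'db_pass', 'shop');

// 1. Unpack the file the supplier's FTP job dropped off.
exec('unzip -o /data/incoming/catalog.zip -d /data/incoming/');

// 2. Erase yesterday's rows. Orders now reference (supplier, sku), not the
//    auto_increment id, so regenerated ids no longer matter.
mysqli_query($db, "TRUNCATE TABLE products");

// 3. Bulk-load today's catalog.
mysqli_query($db, "LOAD DATA LOCAL INFILE '/data/incoming/catalog.csv'
                   INTO TABLE products
                   FIELDS TERMINATED BY ','
                   (supplier, sku, price, description)");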

brotherhood of LAN

WebmasterWorld Administrator, WebmasterWorld Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month
Msg#: 4598004 posted 3:26 pm on Aug 1, 2013 (gmt 0)

An alternative...

> supplier number and SKU number as PK

Have your permanent product table, a hash table and an insert table

The hash table would contain supplier number/SKU as the primary key and a HASH index. It would also have your integer primary key from the permanent table, as a reference. Ideally here you would use a MEMORY table with a HASH index as those are very quick, but a MyISAM table (preferably using LOAD INDEX INTO CACHE) would be quick too.

- load data into insert table

- delete from insert table by referencing the hash table for existing products. join to permanent table if you want to do some updates first. this should be very quick depending on how much writing needs to be done, but since you're joining on the primary key of both tables... it'll be quick. You'll be left with only new products.

- SET @newincrement = (SELECT MAX(id) FROM permanenttable) to get your next auto increment value, which can be used to insert into the hash table too.

- insert new products into hash & permanent table

I do a similar process for some rank checking. The permanent table has 2 billion records and incoming data is between 200K and 1M rows. A fixed length reference for discerning new data works out to be quick, and whatever spare memory you can use also helps.
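
(A loose sketch of the above with invented names; the MEMORY table holds (supplier, sku) plus the permanent id so already-known rows can be discarded cheaply. It simplifies the steps slightly by backfilling the hash table from the permanent table afterwards instead of pre-computing the next auto_increment value.)

<?php
$db = mysqli_connect('localhost', 'db_user', 'db_pass', 'shop');

// One-time setup: in-memory lookup keyed on the natural key, with a HASH index.
mysqli_query($db, "CREATE TABLE product_hash (
                       supplier INT NOT NULL,
                       sku      VARCHAR(32) NOT NULL,
                       id       INT UNSIGNED NOT NULL,
                       PRIMARY KEY (supplier, sku) USING HASH
                   ) ENGINE=MEMORY");
mysqli_query($db, "INSERT INTO product_hash SELECT supplier, sku, id FROM products");

// Nightly run:
// 1. Bulk-load the flat file into the insert table.
mysqli_query($db, "CREATE TABLE insert_table LIKE products");
mysqli_query($db, "LOAD DATA LOCAL INFILE '/data/incoming/catalog.csv'
                   INTO TABLE insert_table (supplier, sku, price, description)");

// (Any updates to existing rows could be done here by joining insert_table
//  to products through product_hash, before the delete below.)

// 2. Throw away everything that is already known; only new products remain.
mysqli_query($db, "DELETE i FROM insert_table i
                   JOIN product_hash h ON h.supplier = i.supplier AND h.sku = i.sku");

// 3. Add the new products to the permanent table, then mirror them into the
//    hash table for tomorrow's run.
mysqli_query($db, "INSERT INTO products (supplier, sku, price, description)
                   SELECT supplier, sku, price, description FROM insert_table");
mysqli_query($db, "INSERT INTO product_hash
                   SELECT p.supplier, p.sku, p.id FROM products p
                   LEFT JOIN product_hash h ON h.supplier = p.supplier AND h.sku = p.sku
                   WHERE h.id IS NULL");
mysqli_query($db, "DROP TABLE insert_table");

(Note that a MEMORY table is emptied on a server restart, so the hash table would need rebuilding from the permanent table after one.)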
