
Writing to normalised database

..as opposed to how to normalise (mysql)

   
2:08 am on Nov 3, 2011 (gmt 0)

10+ Year Member



I need to log about 50K rows of data per day, extracted as a daily CSV from AdWords. Since most of it is repetitive keyword / campaign data, it's prime for normalisation.

I can find loads of posts about design for normalisation, but little about the best way to process your data for an insert.

If I have, say:

tbl_keywords: id,keyword - table of unique keywords
tbl_campaigns: id,campaign - table of unique campaign ids
tbl_performance: id,keyword_id,campaign_id,datapoint

.. and I need to process a fresh 50K rows of data, creating new keywords and campaigns in tbl_keywords and tbl_campaigns where appropriate and inserting the relevant keys into tbl_performance - what's the best way to do it?
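For reference, this is roughly how I picture the tables (the column types, UNIQUE keys and foreign keys below are only illustrative, nothing is fixed yet):

-- illustrative layout only, not a final design
CREATE TABLE tbl_keywords (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  keyword VARCHAR(255) NOT NULL,
  UNIQUE KEY uq_keyword (keyword)
);

CREATE TABLE tbl_campaigns (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  campaign VARCHAR(255) NOT NULL,
  UNIQUE KEY uq_campaign (campaign)
);

CREATE TABLE tbl_performance (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  keyword_id INT UNSIGNED NOT NULL,
  campaign_id INT UNSIGNED NOT NULL,
  datapoint VARCHAR(100),
  FOREIGN KEY (keyword_id) REFERENCES tbl_keywords (id),
  FOREIGN KEY (campaign_id) REFERENCES tbl_campaigns (id)
);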

Thanks in advance for your advice.
1:36 pm on Nov 3, 2011 (gmt 0)



Take a look at LOAD DATA INFILE:
[dev.mysql.com...]

IDEA 1:
If you're doing 50K rows at once, I'd consider breaking the data into CSVs in which the CSV values align with the table columns.


LOAD DATA INFILE 'campaigns.csv'
INTO TABLE tbl_campaigns
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';

LOAD DATA INFILE 'keywords.csv'
INTO TABLE tbl_keywords
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';

LOAD DATA INFILE 'performance.csv'
INTO TABLE tbl_performance
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';



IDEA 2:
If you can't or don't want to break your 50K-row CSV into individual CSVs, as suggested in idea #1, then you can make an import table.


CREATE TABLE adwords_csv_import (campaign_id INT, keyword VARCHAR(100), datapoint VARCHAR(100));
CREATE INDEX by_keyword ON adwords_csv_import(keyword);

Lay out the import table's columns exactly like the CSV.

Then load your CSV into that table:

LOAD DATA INFILE 'adwords.csv'
INTO TABLE adwords_csv_import
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';


Then insert the new stuff.
SLOW but simple, using a NOT IN clause:

INSERT INTO tbl_keywords (keyword)
SELECT DISTINCT adwords_csv_import.keyword
FROM adwords_csv_import
WHERE adwords_csv_import.keyword NOT IN (
SELECT keyword FROM tbl_keywords
);


FAST, using a LEFT JOIN:

INSERT INTO tbl_keywords (keyword)
SELECT DISTINCT adwords_csv_import.keyword
FROM adwords_csv_import
LEFT JOIN tbl_keywords ON tbl_keywords.keyword = adwords_csv_import.keyword
WHERE tbl_keywords.keyword IS NULL;

and so forth, importing keywords, campaigns, datapoints.
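For the tbl_performance rows themselves, you join the import table back to tbl_keywords to pick up the keyword ids. A sketch (assuming the campaign id in the CSV is the same id you store in tbl_campaigns, per the import table layout above):

-- sketch only: keyword ids come from the join, campaign ids straight from the CSV
INSERT INTO tbl_performance (keyword_id, campaign_id, datapoint)
SELECT k.id, i.campaign_id, i.datapoint
FROM adwords_csv_import i
JOIN tbl_keywords k ON k.keyword = i.keyword;

If the CSV carried a campaign name instead of an id, you'd add a second join against tbl_campaigns the same way.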

Then truncate the table (quickly delete all rows) when you're done, so it's empty and ready to use next time.

TRUNCATE TABLE adwords_csv_import;
9:32 am on Nov 4, 2011 (gmt 0)

10+ Year Member



Thanks for the input, both interesting.

The bit I'm really trying to understand is how to most efficiently write tbl_keywords.id and tbl_campaigns.id into the tbl_performance.keyword_id / campaign_id columns after updating tbl_keywords / tbl_campaigns.
10:27 am on Nov 4, 2011 (gmt 0)

10+ Year Member



Got it:

1: Load CSV into tmp table with extra columns for normalisation (e.g. keyword_id,campaign_id)

2: INSERT IGNORE INTO tbl_keywords (keyword) SELECT DISTINCT keyword FROM tmp (this relies on a UNIQUE index on tbl_keywords.keyword so existing keywords are skipped)

3: UPDATE tmp SET keyword_id = (SELECT .. etc)

And finally, populate tbl_performance with the new data in tmp, using the normalised id columns instead of the raw values, then truncate the tmp table.
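Spelled out, the whole run looks something like this (the tmp layout and exact queries are just how I'd sketch it; campaign_id would be handled the same way as keyword_id if it also needed looking up):

-- 1: tmp table mirrors the CSV, plus a column for the looked-up keyword id
CREATE TABLE tmp (
  campaign_id INT,
  keyword VARCHAR(255),
  datapoint VARCHAR(100),
  keyword_id INT
);

LOAD DATA INFILE 'adwords.csv'
INTO TABLE tmp
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(campaign_id, keyword, datapoint);

-- 2: add any keywords we haven't seen before
INSERT IGNORE INTO tbl_keywords (keyword)
SELECT DISTINCT keyword FROM tmp;

-- 3: copy the ids back into tmp (a JOIN works as well as a subquery here)
UPDATE tmp
SET keyword_id = (SELECT id FROM tbl_keywords WHERE tbl_keywords.keyword = tmp.keyword);

-- 4: write the normalised rows, then clear tmp for the next run
INSERT INTO tbl_performance (keyword_id, campaign_id, datapoint)
SELECT keyword_id, campaign_id, datapoint FROM tmp;

TRUNCATE TABLE tmp;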

All seems to work very fast.