
Writing to normalised database

..as opposed to how to normalise (mysql)

   
2:08 am on Nov 3, 2011 (gmt 0)

10+ Year Member



I need to log about 50K rows of data per day, extracted as a daily CSV from AdWords. Since most of it is repetitive keyword / campaign data, it's prime for normalisation.

I can find loads of posts about design for normalisation, but little about the best way to process your data for an insert.

If I have, say:

tbl_keywords: id,keyword - table of unique keywords
tbl_campaigns: id,campaign - table of unique campaign ids
tbl_performance: id,keyword_id,campaign_id,datapoint

.. and I need to process a fresh 50K rows of data, creating new keywords and campaigns in tbl_keywords and tbl_campaigns where appropriate and inserting the relevant keys into tbl_performance - what's the best way to do it?
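For reference, this is roughly how I picture the tables (the column types, UNIQUE keys and foreign keys below are only illustrative, nothing is fixed yet):

-- illustrative layout only, not a final design
CREATE TABLE tbl_keywords (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  keyword VARCHAR(255) NOT NULL,
  UNIQUE KEY uq_keyword (keyword)
);

CREATE TABLE tbl_campaigns (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  campaign VARCHAR(255) NOT NULL,
  UNIQUE KEY uq_campaign (campaign)
);

CREATE TABLE tbl_performance (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  keyword_id INT UNSIGNED NOT NULL,
  campaign_id INT UNSIGNED NOT NULL,
  datapoint VARCHAR(100),
  FOREIGN KEY (keyword_id) REFERENCES tbl_keywords (id),
  FOREIGN KEY (campaign_id) REFERENCES tbl_campaigns (id)
);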

Thanks in advance for your advice.
1:36 pm on Nov 3, 2011 (gmt 0)



Take a look at LOAD DATA INFILE:
[dev.mysql.com...]

IDEA 1:
If you're doing 50K rows at once, I'd consider breaking the data into CSVs in which the CSV values align with the table columns.


LOAD DATA INFILE 'campaigns.csv'
INTO TABLE tbl_campaigns
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';

LOAD DATA INFILE 'keywords.csv'
INTO TABLE tbl_keywords
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';

LOAD DATA INFILE 'performance.csv'
INTO TABLE tbl_performance
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';



IDEA 2:
If you can't or don't want to break your 50K-row CSV into individual CSVs, as suggested in idea #1, then you can make an import table.


CREATE TABLE adwords_csv_import (campaign_id INT, keyword VARCHAR(100), datapoint VARCHAR(100));
CREATE INDEX by_keyword ON adwords_csv_import(keyword);

Lay out the import table's columns exactly like the CSV.

Then load your CSV into that table:

LOAD DATA INFILE 'adwords.csv'
INTO TABLE adwords_csv_import
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';


Then insert the new stuff.
SLOW but simple, using a NOT IN clause:

INSERT INTO tbl_keywords (keyword)
SELECT DISTINCT adwords_csv_import.keyword
FROM adwords_csv_import
WHERE adwords_csv_import.keyword NOT IN (
SELECT keyword FROM tbl_keywords
);


FAST, using a LEFT JOIN:

INSERT INTO tbl_keywords (keyword)
SELECT DISTINCT adwords_csv_import.keyword
FROM adwords_csv_import
LEFT JOIN tbl_keywords ON tbl_keywords.keyword = adwords_csv_import.keyword
WHERE tbl_keywords.keyword IS NULL;

and so forth, importing keywords, campaigns, datapoints.
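For the tbl_performance rows themselves, you join the import table back to tbl_keywords to pick up the keyword ids. A sketch (assuming the campaign id in the CSV is the same id you store in tbl_campaigns, per the import table layout above):

-- sketch only: keyword ids come from the join, campaign ids straight from the CSV
INSERT INTO tbl_performance (keyword_id, campaign_id, datapoint)
SELECT k.id, i.campaign_id, i.datapoint
FROM adwords_csv_import i
JOIN tbl_keywords k ON k.keyword = i.keyword;

If the CSV carried a campaign name instead of an id, you'd add a second join against tbl_campaigns the same way.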

Then truncate the table (quickly delete all rows) when you're done, so it's empty and ready to use next time.

TRUNCATE TABLE adwords_csv_import;
9:32 am on Nov 4, 2011 (gmt 0)

10+ Year Member



Thanks for the input, both interesting.

The bit I'm really trying to understand is how to most efficiently write tbl_keywords.id and tbl_campaigns.id into the tbl_performance.keyword_id / campaign_id columns after updating tbl_keywords / tbl_campaigns.
10:27 am on Nov 4, 2011 (gmt 0)

10+ Year Member



Got it:

1: Load CSV into tmp table with extra columns for normalisation (e.g. keyword_id,campaign_id)

2: INSERT IGNORE INTO tbl_keywords (keyword) SELECT DISTINCT keyword FROM tmp (this relies on a UNIQUE index on tbl_keywords.keyword so existing keywords are skipped)

3: UPDATE tmp SET keyword_id = (SELECT .. etc)

And finally, populate tbl_performance with the new data in tmp, using the normalised id columns instead of the raw values, then truncate the tmp table.
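Spelled out, the whole run looks something like this (the tmp layout and exact queries are just how I'd sketch it; campaign_id would be handled the same way as keyword_id if it also needed looking up):

-- 1: tmp table mirrors the CSV, plus a column for the looked-up keyword id
CREATE TABLE tmp (
  campaign_id INT,
  keyword VARCHAR(255),
  datapoint VARCHAR(100),
  keyword_id INT
);

LOAD DATA INFILE 'adwords.csv'
INTO TABLE tmp
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(campaign_id, keyword, datapoint);

-- 2: add any keywords we haven't seen before
INSERT IGNORE INTO tbl_keywords (keyword)
SELECT DISTINCT keyword FROM tmp;

-- 3: copy the ids back into tmp (a JOIN works as well as a subquery here)
UPDATE tmp
SET keyword_id = (SELECT id FROM tbl_keywords WHERE tbl_keywords.keyword = tmp.keyword);

-- 4: write the normalised rows, then clear tmp for the next run
INSERT INTO tbl_performance (keyword_id, campaign_id, datapoint)
SELECT keyword_id, campaign_id, datapoint FROM tmp;

TRUNCATE TABLE tmp;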

All seems to work very fast.