


Writing to normalised database

..as opposed to how to normalise (mysql)

     

wrightee

2:08 am on Nov 3, 2011 (gmt 0)

10+ Year Member



I need to log about 50K rows of data per day, extracted as a CSV from AdWords. Since most of it is repetitive keyword/campaign data, it's prime for normalisation.

I can find loads of posts about design for normalisation, but little about the best way to process your data for an insert.

If I have, say:

tbl_keywords: id,keyword - table of unique keywords
tbl_campaigns: id,campaign - table of unique campaign ids
tbl_performance: id,keyword_id,campaign_id,datapoint

.. and I need to process a new batch of 50K rows, creating new keywords and campaigns in tbl_keywords and tbl_campaigns where appropriate and inserting the relevant keys into tbl_performance - what's the best way to do it?
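
For concreteness, the tables look roughly like this (just a sketch - the column types, AUTO_INCREMENT ids and unique keys are how I'm picturing it, not a fixed design):

CREATE TABLE tbl_keywords (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  keyword VARCHAR(255) NOT NULL,
  UNIQUE KEY uq_keyword (keyword)
);

CREATE TABLE tbl_campaigns (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  campaign VARCHAR(255) NOT NULL,
  UNIQUE KEY uq_campaign (campaign)
);

CREATE TABLE tbl_performance (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  keyword_id INT UNSIGNED NOT NULL,
  campaign_id INT UNSIGNED NOT NULL,
  datapoint VARCHAR(100)
);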

Thanks in advance for your advice.

Jstanfield

1:36 pm on Nov 3, 2011 (gmt 0)



Take a look at LOAD DATA INFILE:
[dev.mysql.com...]

IDEA 1:
If you're loading 50K rows at once, I'd consider breaking the data into separate CSVs whose values align with each table's columns.


LOAD DATA INFILE 'campaigns.csv'
INTO TABLE tbl_campaigns
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';

LOAD DATA INFILE 'keywords.csv'
INTO TABLE tbl_keywords
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';

LOAD DATA INFILE 'performance.csv'
INTO TABLE tbl_performance
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';



IDEA 2:
If you can't or don't want to break your 50K-row CSV into individual CSVs, as suggested in idea #1, then you can make an import table.


CREATE TABLE adwords_csv_import (campaign_id INT, keyword VARCHAR(100), datapoint VARCHAR(100));
CREATE INDEX by_keyword ON adwords_csv_import(keyword);

Lay out the import table's columns exactly like the CSV.

Then load your CSV into that table:

LOAD DATA INFILE 'adwords.csv'
INTO TABLE adwords_csv_import
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';


Then insert the new stuff.
SLOW but simple, using a NOT IN clause:

INSERT INTO tbl_keywords (keyword)
SELECT DISTINCT adwords_csv_import.keyword
FROM adwords_csv_import
WHERE adwords_csv_import.keyword NOT IN (
SELECT keyword FROM tbl_keywords
);


FAST, using a LEFT JOIN:

INSERT INTO tbl_keywords (keyword)
SELECT DISTINCT adwords_csv_import.keyword
FROM adwords_csv_import
LEFT JOIN tbl_keywords ON tbl_keywords.keyword = adwords_csv_import.keyword
WHERE tbl_keywords.keyword IS NULL;

And so forth, importing keywords, campaigns and datapoints.

Then truncate the table (quickly delete all rows) when you're done, so it's empty and ready to use next time.

TRUNCATE TABLE adwords_csv_import;

wrightee

9:32 am on Nov 4, 2011 (gmt 0)

10+ Year Member



Thanks for the input, both interesting.

The bit I'm really trying to understand is how to most efficiently write the tbl_keywords.id and tbl_campaigns.id values into the tbl_performance.keyword_id / campaign_id columns after updating tbl_keywords / tbl_campaigns.

wrightee

10:27 am on Nov 4, 2011 (gmt 0)

10+ Year Member



Got it:

1: Load CSV into tmp table with extra columns for normalisation (e.g. keyword_id,campaign_id)

2: INSERT IGNORE INTO tbl_keywords (keyword) SELECT DISTINCT keyword FROM tmp (this relies on a unique index on tbl_keywords.keyword so IGNORE skips existing values; same again for tbl_campaigns)

3: UPDATE tmp SET keyword_id = (SELECT.. etc)

4: Finally, populate tbl_performance from tmp using the normalised id columns instead of the raw data, then truncate the tmp table. A sketch of the whole sequence is below.
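
In SQL terms, roughly this (a sketch rather than the exact statements - column names and types are illustrative, and it assumes unique indexes on tbl_keywords.keyword and tbl_campaigns.campaign so INSERT IGNORE can skip values that already exist):

-- 1: import table laid out like the CSV, plus the two id columns
CREATE TABLE tmp (
  campaign VARCHAR(255),
  keyword VARCHAR(255),
  datapoint VARCHAR(100),
  keyword_id INT UNSIGNED,
  campaign_id INT UNSIGNED
);

LOAD DATA INFILE 'adwords.csv'
INTO TABLE tmp
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(campaign, keyword, datapoint);

-- 2: add any new keywords / campaigns; IGNORE skips ones that already exist
INSERT IGNORE INTO tbl_keywords (keyword) SELECT DISTINCT keyword FROM tmp;
INSERT IGNORE INTO tbl_campaigns (campaign) SELECT DISTINCT campaign FROM tmp;

-- 3: look the ids up back into tmp
UPDATE tmp SET keyword_id = (SELECT id FROM tbl_keywords WHERE tbl_keywords.keyword = tmp.keyword);
UPDATE tmp SET campaign_id = (SELECT id FROM tbl_campaigns WHERE tbl_campaigns.campaign = tmp.campaign);

-- 4: write the normalised rows, then clear the import table for next time
INSERT INTO tbl_performance (keyword_id, campaign_id, datapoint)
SELECT keyword_id, campaign_id, datapoint FROM tmp;

TRUNCATE TABLE tmp;

(An UPDATE with a JOIN would also do step 3 in one pass.)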

All seems to work very fast.
 
