Forum Moderators: coopster
What's the most efficient way to do it?
I have a script working away right now, it's been running for over 6 hours!
Here's what it does:
1) it opens a file.
2) parses all the words with a Regex.
3) loops thru the list of words,
a) SELECT from the database, to see if the word already exists.
b) if not, then it INSERTs the word.
4) loops back to 1) to open the next file.
My database table looks like:
ID - word
---------
1 abalone
2 tweezers
3 patrimony
4 kafuffle
So...
Which is most efficient?
1) SELECT to see if a word is there, then INSERT if it's not found. (do this millions of times)
2) INSERT all the words blindly, then after it's done, remove any multiples. (this would create a monstrously huge table)
3) Use the word itself as a Primary Key. Would I get a database error? or would the database (mySQL) ignore INSERTs that try to use a multiple primary key?
4) ... some other method?
efficiency is key, since as I said, this is parsing millions of words taken from HTML pages, and it will be done rather often (maybe once a day).
just adding UNIQUE to the database column more than doubled the speed of the script. Pages are taking about 3 seconds each, instead of 10-12 seconds.
Then I removed the SELECT query, so it's not checking first and adding conditionally. No errors. mySQL is just ignoring the duplicate words. No noticeable increase in speed however.
Now the script is going through each page in about 2 seconds.
thanks!
Now, I think I would like to add a COUNT to the words, so whenever it's hit, it's incremented by one. Is there a streamlined method for that?
INSERT INTO wordtable (word,count) VALUES ($word,"add one to this, please")
1) insert all available words into temp table - you can use fast bulk load from file into table
2) delete those words in temp table that are already present in main table
3) insert new words (remaining in temp table) into main table
Well, it all depends on actual numbers, you might try this trick as well (if you have few words):
1) select all existing words from table into hash
2) while looping thru file check if word is in hash (faster than queries database)
3) if its not insert into hash AND into table
Other way is a combined approach where by you calculate frequency of words and only keep top 10000 most frequent words in hash optimistically hoping they are there, otherwise insert into database.
The key to database performance is to avoid doing atomic (row by row) operations and try to group the whole lot into sets that you operate with.
[edited by: Lord_Majestic at 7:34 pm (utc) on Aug. 24, 2004]
I originally had "mysql_query($yadda) or DIE($yadda)". I habitually die() after every bad SQL query, in this case I needed to suppress that suicidal urge.
After removing the "die()", the script runs smoothly. the INSERT is ignored when the word exists, I don't die, and the UPDATE goes through every time.
since adding the UNIQUE, it's going really fast. Your solution does it with 2 lines.
Now...
How do I use the ON DUPLICATE KEY UPDATE feature?
The manual shows a "->" with a second line, that makes no sense to me. I'm happy with the script as-is, but I'm also really trying to understand the obscure intricacies of mySQL syntax, and can use all the help i can get.
[edited by: httpwebwitch at 7:45 pm (utc) on Aug. 24, 2004]
INSERT INTO wordtable (word,count) VALUES ('$word',1) ON DUPLICATE KEY UPDATE count = count + 1 Where did you see that in the manual? The INSERT [dev.mysql.com] page describes use of this feature.
Oh wait, I understand, you are looking at that page. That is a command line continuation prompt (printed, no doubt, by the programmer that developed the documentation). You'll see that quite often. MySQL will let you keep keying in and pressing Enter until you terminate the command with either a semicolon or \g [dev.mysql.com].
[edited by: coopster at 7:49 pm (utc) on Aug. 24, 2004]
1) export all current rows from table into file A
2) process your files and append all new words to file A
3) use unix command "uniq" to generate list of unique words into file B
4) delete all rows from current table (this might make your site inaccessible if table is usuall use, in which case you need to use 2 tables one of which will be active at any time and other will be used for updates)
5) insert contents of file B into table
[edited by: Lord_Majestic at 7:52 pm (utc) on Aug. 24, 2004]
A command need not be given all on a single line, so lengthy commands that require several lines are not a problem. mysql determines where your statement ends by looking for the terminating semicolon, not by looking for the end of the input line. (In other words, mysql accepts free-format input: it collects input lines but does not execute them until it sees the semicolon.)Resource:
[dev.mysql.com...]