Forum Moderators: coopster

Message Too Old, No Replies

adding unique things to a list

using a primary key?

         

httpwebwitch

6:31 pm on Aug 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Let's say I have 100,000 pages of text in HTML files.
I want to create a lexicon in my database of all the unique words used.

What's the most efficient way to do it?

I have a script working away right now, it's been running for over 6 hours!

Here's what it does:
1) it opens a file.
2) parses all the words with a Regex.
3) loops thru the list of words,
a) SELECT from the database, to see if the word already exists.
b) if not, then it INSERTs the word.
4) loops back to 1) to open the next file.

My database table looks like:
ID - word
---------
1 abalone
2 tweezers
3 patrimony
4 kafuffle

So...
Which is most efficient?

1) SELECT to see if a word is there, then INSERT if it's not found. (do this millions of times)

2) INSERT all the words blindly, then after it's done, remove any multiples. (this would create a monstrously huge table)

3) Use the word itself as a Primary Key. Would I get a database error? or would the database (mySQL) ignore INSERTs that try to use a multiple primary key?

4) ... some other method?

efficiency is key, since as I said, this is parsing millions of words taken from HTML pages, and it will be done rather often (maybe once a day).

coopster

7:07 pm on Aug 24, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



How about adding a
UNIQUE
keyword to the
word
column and attempting an
INSERT
. If it fails, big deal. Unless, of course, you are counting the number of times the word is found and want to update the total tally by one. Then, upon
INSERT
failure,
UPDATE
the row.

httpwebwitch

7:23 pm on Aug 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



woah.
that made a big difference.

just adding UNIQUE to the database column more than doubled the speed of the script. Pages are taking about 3 seconds each, instead of 10-12 seconds.

Then I removed the SELECT query, so it's not checking first and adding conditionally. No errors. mySQL is just ignoring the duplicate words. No noticeable increase in speed however.

Now the script is going through each page in about 2 seconds.

thanks!

Now, I think I would like to add a COUNT to the words, so whenever it's hit, it's incremented by one. Is there a streamlined method for that?

INSERT INTO wordtable (word,count) VALUES ($word,"add one to this, please")

coopster

7:28 pm on Aug 24, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Actually, your insert would be the first row, so
INSERT INTO wordtable (word,count) VALUES ($word,1)
would be appropriate. Then, upon INSERT failure
UPDATE wordtable SET count = count + 1 WHERE word = '$word'
should do the trick.

Lord Majestic

7:31 pm on Aug 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you want to make it faster then I suggest do it in two steps:

1) insert all available words into temp table - you can use fast bulk load from file into table
2) delete those words in temp table that are already present in main table
3) insert new words (remaining in temp table) into main table

Well, it all depends on actual numbers, you might try this trick as well (if you have few words):

1) select all existing words from table into hash
2) while looping thru file check if word is in hash (faster than queries database)
3) if its not insert into hash AND into table

Other way is a combined approach where by you calculate frequency of words and only keep top 10000 most frequent words in hash optimistically hoping they are there, otherwise insert into database.

The key to database performance is to avoid doing atomic (row by row) operations and try to group the whole lot into sets that you operate with.

[edited by: Lord_Majestic at 7:34 pm (utc) on Aug. 24, 2004]

httpwebwitch

7:32 pm on Aug 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



that's working very well.

I originally had "mysql_query($yadda) or DIE($yadda)". I habitually die() after every bad SQL query, in this case I needed to suppress that suicidal urge.

After removing the "die()", the script runs smoothly. the INSERT is ignored when the word exists, I don't die, and the UPDATE goes through every time.

since adding the UNIQUE, it's going really fast. Your solution does it with 2 lines.

Now...
How do I use the ON DUPLICATE KEY UPDATE feature?
The manual shows a "->" with a second line, that makes no sense to me. I'm happy with the script as-is, but I'm also really trying to understand the obscure intricacies of mySQL syntax, and can use all the help i can get.

[edited by: httpwebwitch at 7:45 pm (utc) on Aug. 24, 2004]

coopster

7:42 pm on Aug 24, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Assuming you are at MySQL >= 4.1.0

INSERT INTO wordtable (word,count) VALUES ('$word',1) ON DUPLICATE KEY UPDATE count = count + 1

Where did you see that in the manual? The INSERT [dev.mysql.com] page describes use of this feature.

Oh wait, I understand, you are looking at that page. That is a command line continuation prompt (printed, no doubt, by the programmer that developed the documentation). You'll see that quite often. MySQL will let you keep keying in and pressing Enter until you terminate the command with either a semicolon or \g [dev.mysql.com].

[edited by: coopster at 7:49 pm (utc) on Aug. 24, 2004]

httpwebwitch

7:48 pm on Aug 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



aw dang, mine is 3.23.56

[dead end]

thanks for all your help!

Lord Majestic

7:51 pm on Aug 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you are on Unix then you can do this in a batch file:

1) export all current rows from table into file A
2) process your files and append all new words to file A
3) use unix command "uniq" to generate list of unique words into file B
4) delete all rows from current table (this might make your site inaccessible if table is usuall use, in which case you need to use 2 tables one of which will be active at any time and other will be used for updates)
5) insert contents of file B into table

[edited by: Lord_Majestic at 7:52 pm (utc) on Aug. 24, 2004]

coopster

7:51 pm on Aug 24, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



You can still use the two-line syntax described earlier though. Also, I found this in the tutorial...


A command need not be given all on a single line, so lengthy commands that require several lines are not a problem. mysql determines where your statement ends by looking for the terminating semicolon, not by looking for the end of the input line. (In other words, mysql accepts free-format input: it collects input lines but does not execute them until it sees the semicolon.)

Resource:
[dev.mysql.com...]