Welcome to WebmasterWorld Guest from 3.92.92.168

Forum Moderators: open

Message Too Old, No Replies

MySQL: check if colB exists, and if so return colA

     
6:47 am on Sep 24, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Mar 15, 2013
posts: 1205
votes: 120


I have a table with 2 columns:

ref_id: INT(11) (PRIMARY, AutoIncrement)
message: TEXT

Column "message" holds the actual message from Private Messages, so it can be a very long field.

I'm trying to change my structure so that I have one table (tableA) for id, ref_id, username, sender, date, and subject, and then another table (tableB) for ref_id and message. Then, ref_id in the first table will reference ref_id in the second table to get the message. This should cut my database size in less than half (from about 10 million rows currently, and 2.3GB, to less than 5 million rows).

So the question is, when I'm adding a new message, what's the best way to see if the message already exists, and if so then return the ref_id?

The only way I can think to do it is to put "message" in an index of its own, then do:

$query = sprintf("SELECT ref_id FROM tableB WHERE message = '%s' LIMIT 1"...);
$sth = mysqli_query($dbh, $query);


where %s equals the message submitted.

Then, do:

if ($sth && mysqli_num_rows($sth)) > 0) {
list($ref_id) = mysqli_fetch_row($sth);

// insert in to tableA
}

else {
// insert in to tableB with AutoIncrement, and then
// insert in to tableA with LAST_INSERT_ID()
}


The problem with that, other than multiple queries, is that the index for "message" is going to be huge! And I worry that the SELECT query will take a long time if it has to compare 5 million rows, so it would totally defeat the purpose of me splitting it in to two tables in the first place.

Any better suggestions? Query speed is the biggest concern here.
8:57 am on Sept 24, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 25, 2005
posts:2091
votes: 370


Took me a while, but I think I now understand that there are just lots of duplicate messages in the database? If these are indeed exact duplicates, have you considered creating (and storing) a hash of each message? It'll require some additional space, but the index will be much smaller, and look-ups should be less complicated.

Having said that, is saving roughly 1GB of space worth the hassle? 1GB is not all that much for MySQL.
9:24 am on Sept 24, 2018 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:5046
votes: 60


A hash sounds like a good way to address it
6:28 am on Sept 25, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Mar 15, 2013
posts: 1205
votes: 120


I'm trying to reply but I keep getting a 403 Forbidden error...
7:18 pm on Sept 25, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Mar 15, 2013
posts: 1205
votes: 120


I first wrote this in 2012, when I was brand spanking new to MySQL. The current database has 8 columns:

id
username
sender
status
folder
postdate
subject
message


(I had the types in there, but that was causing the post to throw an error... sorry!)

Each time a message is sent it saves 2 rows: the first row would be saved with the recipient's username under "username", the sender's username under "sender", folder as "inbox", and status as "New". Then the second row would be saved with the sender's username as "username", the recipient's username as "sender", folder as "sent", and status as NULL.

So in 99.9999% of the cases, the same message is saved in two rows.

My plan was to cut the number of rows in half using a second table. It will also be marginally smaller when I change "status" to a TINYINT(1) and then make it either 0 or 1 instead of null or "New".

There are other times when someone might send the same message multiple times; eg, someone might send the same message to 50+ people in the Personals, just copy and pasting over and over.

The problem I'm trying to resolve is that some accounts with a lot of rows are running particularly slow, even though it seems to be indexed properly. For example, my admin account has 7,348 rows where folder=inbox, but I just ran a query that took 9.4454s:

SELECT * FROM private_messages WHERE username = 'example' AND folder = 'inbox' ORDER BY postdate DESC


It uses an index that contains username, status, and folder. Using Explain shows that "key_len" is 104, "rows" is 5706, and "extra" says "Using where; Using filesort". I'm guessing that a new index that didn't include "status" might load faster, but last time I tried to add an index it took too long and ended up crashing the database...

Knowing this, do you think it's still the most efficient to store 2 rows for each message like that, instead of creating a separate table? If so, do you think that it's still a good choice to store "message" as a hash?

Assuming that I should move forward with a hash, this is a new one for me... I find people storing passwords in a hash, is it the same concept?

[dev.mysql.com...]
8:41 am on Sept 26, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 25, 2005
posts:2091
votes: 370


Each time a message is sent it saves 2 rows: the first row would be saved with the recipient's username under "username", the sender's username under "sender", folder as "inbox", and status as "New". Then the second row would be saved with the sender's username as "username", the recipient's username as "sender", folder as "sent", and status as NULL.

Wouldn't the first row be sufficient to put the same message into the proper folders of both sender and recipient, if you disregard the "folder" column? If the logged in user is in the "username" column, it's an inbox message, if he/she is in the "sender" column, it's a sent message. It'll get more complicated if you have more types of folders, perhaps even custom ones, but if it's just inbox and sent then all you'd need to change is a bit of code and you could remove all duplicate rows.

If it is more complex (or going to be), or you just want to put things right, I would indeed put all the messages in a separate table, and then put things together with link tables, e.g. to link a message to a user and a folder. That way you could avoid repeating a message in storage that's sent to multiple recipients, and it would certainly speed things along. Ideally you would also store user IDs along with each message, rather than repeating the full username every time.

It uses an index that contains username, status, and folder. Using Explain shows that "key_len" is 104, "rows" is 5706, and "extra" says "Using where; Using filesort". I'm guessing that a new index that didn't include "status" might load faster, but last time I tried to add an index it took too long and ended up crashing the database...

Note that the order of columns in combined indexes is important. If you put 1 index on the 3 columns "username", "status" and "folder" (in that specific order), then this will only help with queries that select first by "username", then (optionally) by "status" and then (optionally) by "folder". With that index, your query of "WHERE username = 'example' AND folder = 'inbox'" is only going to be able to use the "username" part of your index, and so the "folder = 'inbox'" part will be relatively slow. You use combined indexes only when you usually query the data in the same way and order. If you query it in many different ways, indexes on individual columns are probably a better choice. If that crashes your database, have you checked your error log? Perhaps you're running out of memory, or there are too many users altering the database at the same time and some downtime would be required.

Assuming that I should move forward with a hash, this is a new one for me... I find people storing passwords in a hash, is it the same concept?

Sort of, but with passwords it's about encryption so that you don't have to store the actual password. In your case it would be creating a hash string that uniquely identifies each message, the same way that a CRC checksum is often used to verify that a file is unchanged (if the checksum is different than expected, the file was changed). So with every new message, you could have the hash string calculated, and then search the database for that hash to see if an exact match of that message already exists. All so that you don't have to put the complete message in a WHERE clause.

However, if you were to store messages in a separate table, then you wouldn't really need hashes; they could still be used if you have a lot of duplicate messages, but once you've de-duplicated the message rows and have the ability to link a single message to multiple recipients, I doubt that it would really give you any benefit.

All in all, it sounds like a proper rewrite of the whole database structure is in order :-)
7:56 pm on Sept 26, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Mar 15, 2013
posts: 1205
votes: 120


Wouldn't the first row be sufficient to put the same message into the proper folders of both sender and recipient, if you disregard the "folder" column?

In 99.9999% of the cases it would, but I have a very few exceptions.

In order to prevents scammers, hackers, and viruses, I have a handful of filters on new accounts that check for certain keywords or IP addresses that seem fishy. Maybe once a week or two, I'll wake up to see that a new account tried to send a few hundred messages, and they were all filtered! But I don't want the scammer to see that they didn't go through, or they'll just keep revising their verbiage until they find a way through.

So because of those relatively rare instances, I have them in 2 separate rows.

It'll get more complicated if you have more types of folders, perhaps even custom ones...

That's exactly how it is... I let the user create their own folders, so many users have dozens of folders! The only limitation is the length is set to 50.

Ideally you would also store user IDs along with each message, rather than repeating the full username every time.

Yeah, I really wish I'd set up an ID for each username in the beginning! Doing it now gets a bit complicated because I have to modify so many programs, but I do think it will make things run faster... so I'll probably just get on it and do it.

Note that the order of columns in combined indexes is important. If you put 1 index on the 3 columns "username", "status" and "folder" (in that specific order), then this will only help with queries that select first by "username", then (optionally) by "status" and then (optionally) by "folder".

That's good to know, I didn't realize that!

Question: when you say "optionally" status and "optionally" folder, does it work the same if I don't use status in the query? It's been complicated to get a side-by-side comparison and I'm not sure how to do a query without using the cache, but I THINK that by adding status to the query it got faster...

Sort of, but with passwords it's about encryption so that you don't have to store the actual password. In your case it would be creating a hash string that uniquely identifies each message, the same way that a CRC checksum is often used to verify that a file is unchanged

This is still getting over my head... I looked through the MySQL docs, and all I really find on hashing is about password encryption. So I'm not getting the purpose or value of it here.

Do you mean that I would have a 3rd column (eg, "id", "hash", and "message"), then use MD5 on "message" to create a 128-bit hash? Then just query for a matching hash instead of a matching message?

Right now I'm leaning towards not worrying about the hash, but understanding the concept might make some other tables run better when I have a UNIQUE index.
8:12 am on Sept 27, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 25, 2005
posts:2091
votes: 370


Question: when you say "optionally" status and "optionally" folder, does it work the same if I don't use status in the query? It's been complicated to get a side-by-side comparison and I'm not sure how to do a query without using the cache, but I THINK that by adding status to the query it got faster...

If your index is in the order "username", "status", "folder", but your query only calls for "username" and "folder", then the index will only be consulted for the "username" part, because "status" is missing in the middle. So it works only left-to-right, basically. It will work for "username, status, folder", "username, status" or "username", but not for "status, folder", "username, folder", "folder", etc. So it's pretty limiting actually, but if you often query the database using those three columns in the same order, a combined index will be faster than separate indexes on those columns.

Do you mean that I would have a 3rd column (eg, "id", "hash", and "message"), then use MD5 on "message" to create a 128-bit hash? Then just query for a matching hash instead of a matching message?

Exactly. It's not something from the MySQL docs, just a trick you could use to avoid querying the full message. Assuming the majority of messages aren't duplicates, however, I think you can probably do without this.

If most of the duplication occurs when people send the same message to multiple recipients, then you could just store the message once:

message_id.........message_sender.........message_text
3544...............52363.................."Hello world"

...and then use another table to link the message to multiple recipients:

message_id.........message_recipient
3544...............79237
3544...............63723
3544...............97427

In that scenario, it's probably not worth it to store a hash of each message, as the space the hashes take up may very well be larger than what they'd save you on message duplicates.

Still, if you were to use a hash, e.g. because a relatively large number of messages tend to be duplicates, you might split that first table:

message_id.........message_text.........message_hash
3544..............."Hello world"........8bd69e52

...and create another table linking a message to 1 or more senders:

message_id.........message_sender
3544...............52363
3544...............78373

...so that if multiple people send a message "Hello world", you would only need to store it once. You'd calculate the hash of the submitted message in PHP, look for it in the database, and if it exists you just link that message ID to another sender and recipient.

But you should probably ignore that ;-)