Welcome to WebmasterWorld Guest from 54.167.62.170

Forum Moderators: open

Message Too Old, No Replies

MySQL: Index size

     
8:53 am on Apr 12, 2017 (gmt 0)

Preferred Member

Top Contributors Of The Month

joined:Mar 15, 2013
posts: 570
votes: 47


I have a column that's
ENUM('book', 'home', 'recipe', 'sports', 'tech')
.

Do I understand correctly that when I create an index that includes this column, I could set the size to 1 since they all have a unique first character?

If I'm following it correctly, the index would then only save the first letter of the column instead of the entire word *.

The table has 14 million rows, the data is 462.6M, and the index is 611M (just one UNIQUE index), so I'm trying to cut down storage space, and hopefully increase speed a little if I can.

* In retrospect, I do realize that I could just make the column ENUM('b', 'h', ...) and save even more space, but the answer to this will help me with future structures, too.
4:39 pm on Apr 12, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member topr8 is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 19, 2002
posts:3355
votes: 39


* In retrospect, I do realize that I could just make the column ENUM('b', 'h', ...) and save even more space, but the answer to this will help me with future structures, too.


you won't save space by changing to ENUM('b','h','r','s','t') because MySQL creates a numerical index (not to be confused with a db sorting index) of the possible values and inserts a value of 1, 2, 3, 4 or 5 into the column when you do an insert ... or 0 = empty (it then looks up the ENUM index if the column is selected and converts the number back to the string value to return the result) ...this is why it can be confusing to have enum values that are numbers as if you insert them as integers rather than strings you may well get unexpected results!
6:30 pm on Apr 12, 2017 (gmt 0)

Preferred Member

Top Contributors Of The Month

joined:Mar 15, 2013
posts: 570
votes: 47


Ahh, I see. Why is the index 25% larger than the data, then, when it's only one index? The table has 5 columns and the index has 4 of them (necessary to be unique).

And for future reference, would this be different if I were using VARCHAR instead of ENUM? I have another table that is similar but the column has 51 different possible options so I left it as VARCHAR. But the first 3 characters would be unique, so would I save time and space by setting that column's index size to 3?
8:03 am on Apr 14, 2017 (gmt 0)

Preferred Member

Top Contributors Of The Month

joined:Mar 15, 2013
posts: 570
votes: 47


Just as an FYI, I did a test run... first I changed the VARCHAR's index size to 3 instead of 12, but the index grew by almost 100%! Then I changed it back to not specifying a size, and instead changed the VARCHAR to ENUM. And you're right, the index dropped in size by 50% of the original.

Using ENUM is a little bit of a pain for me, since as I grow to create new categories the ENUM list will have to be updated manually. But the smaller index should have a significant impact on speed, so it's a fair trade off.
9:54 am on Apr 14, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member topr8 is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 19, 2002
posts:3355
votes: 39


since as I grow to create new categories the ENUM list will have to be updated manually.


in order to change the list of values in the enum you would need to: ALTER TABLE.

... are you using the enum column to join to another table?
if so, it is much quicker to join ENUM to ENUM than ENUM to VARCHAR
7:43 pm on Apr 14, 2017 (gmt 0)

Preferred Member

Top Contributors Of The Month

joined:Mar 15, 2013
posts: 570
votes: 47


No, but it's possible that there's a better way to do what I'm doing.

My site has several different message boards (one for books, one for home, etc). All of the posts are stored in the same table, and the script just determines which posts to show based on the URI.

So example.com/book would show posts with "book" in the category column, example.com/sports would show posts with "sports" in the category column, and so on.

But it's entirely possible that, in the future, I'll add a new message board, since all it really takes on my end is adding a few variables to the script and, now, altering the table to add it to the ENUM. But the table itself is 2.4G, so that alteration is pretty major.

It's also notable that I have 8 tables that have the same ENUM list, so I have to make that same alteration to each of them.

This isn't something that's going to happen daily or anything, but it DOES happen. Which is why I originally used VARCHAR. But the index is significantly faster now, so it's worth the trade unless you guys can suggest something better.
9:42 pm on Apr 14, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member topr8 is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 19, 2002
posts:3355
votes: 39


personally i'd have created a table message_boards with 2 columns, one a primary key tinyint, the second a varchar name_of_category

then when referencing the category in other tables i'd use the tinyint primarykey, this could easily be hard coded into your code if there are not many categories, but it means you can add categories easily.

why have you got 8 tables with the same enum list?
12:20 am on Apr 15, 2017 (gmt 0)

Preferred Member

Top Contributors Of The Month

joined:Mar 15, 2013
posts: 570
votes: 47


Interesting, topr8. If I do it that way and LEFT JOIN, would it be just as fast as having it in 1 table?

For the example I posted, I have another table for "subjects" (which includes the author, last poster, size count, and subject of the thread), a "backup" (which is a complete copy of the posts table, including posts removed by a moderator), an "unfiltered" table (similar to backup, but includes posts that were automatically filtered for profanity or a site violation). Then there's another table that counts raw "pageviews" for each category and ID, so the author can see how many people have read their thread.

I have 50 domains parked on top of the main domain, and the PHP scripts read the domain then show relevant content; this lets me use one set of scripts for essentially 51 websites. So I have another table for "subjects_category" that shows which subjects belong to which website; one thread can be relevant for multiple sites, so this went in a table of its own.

That means I also have VARCHAR sections with 1 of 51 words in it. I tried changing one of those to an ENUM but the index actually became larger; which isn't a shock, I read somewhere that we start to lose efficiency on an ENUM after 20 values. So the idea of a separate table that gives each of them an ID, and then cross-referencing that ID to the domain, might work well for that, too. I'll have to do some experimenting, I think :-)