
Creating Index on Large Table

     
7:30 am on Nov 1, 2014 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 22, 2004
posts: 120
votes: 0


I'm having trouble creating indexes on this one table. It is a large table, 40 billion rows, 1.7 TB. It lives on a 3 TB drive, so with formatting, there is about 1 TB of free space.

So I started creating the index, and it began copying all of the data into a temp table. Of course, this eventually filled up the drive, long before it was done. So my questions are: A) what is happening now, and B) how can I create this index? I have another index to create after this one too. I should be able to fit them both on this drive, since they only index the first few characters of a binary text field.

The current state of things is that the index creation query is still running; it's on the "copy to tmp" stage. However, on my server there is no longer any significant disk activity and very little MySQL CPU usage. It doesn't look like the server is doing anything. It looks like once the drive filled up, the task "paused" itself.

What do I do?
4:04 pm on Nov 1, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member whoisgregg is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Dec 9, 2003
posts:3416
votes: 0


I don't have direct experience using the built-in feature, but I think this is the sort of problem for which partitioning [dev.mysql.com] was designed.
9:37 pm on Nov 1, 2014 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 22, 2004
posts: 120
votes: 0


I don't see how partitioning would help me, since my table easily fits on its drive?
10:39 pm on Nov 1, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4864
votes: 13


How are you measuring disk activity?

If that 1.7 TB of data wasn't put in all at once and it's on an HDD, the data could be fragmented and the operation could be taking longer than you'd expect. If you're on *nix, there's a program called iotop that will show you how busy the disk is and what's causing the I/O.
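
You can also sanity-check from inside MySQL what the ALTER thread is actually doing. These are standard commands; the table name below is just a placeholder for yours:

SHOW FULL PROCESSLIST;                 -- state of the running ALTER thread
SHOW TABLE STATUS LIKE 'your_table';   -- current data and index sizes (placeholder table name)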
10:43 pm on Nov 1, 2014 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 22, 2004
posts: 120
votes: 0


I'm on OS X and am using Activity Monitor to see raw in and out, and activity is essentially flatlined. Also the 1.7 TB was put in all at once, on a freshly formatted drive.
11:45 pm on Nov 1, 2014 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14663
votes: 99


When you add an index, the database basically has to rebuild the table. How this is accomplished varies by database, but the rule of thumb is that you need double the space of the file.

Sometimes these operations run out of memory, disk space, etc. simply because the code behind them is poorly written.

The way I would approach this is to make a new table and import the data into it, but again, you need double the space.

More disk space is the best answer to this problem, plain and simple.
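
If you do go the new-table route, the shape of it is roughly the following (all names here are placeholders, since I don't know the actual schema):

CREATE TABLE new_table LIKE old_table;                    -- copy the structure only
ALTER TABLE new_table ADD INDEX prefix_idx (bin_col(4));  -- define the prefix index while it's still empty
INSERT INTO new_table SELECT * FROM old_table;            -- bulk copy; the slow, space-hungry part
DROP TABLE old_table;
RENAME TABLE new_table TO old_table;

Either way, both copies exist on disk at the same time at some point, which is why the double-space rule applies.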
11:57 pm on Nov 1, 2014 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 22, 2004
posts: 120
votes: 0


That is kind of a bummer. I originally tried inserting my data into an already-indexed table, but the insert speed was so much slower that it would have taken a year to build the table instead of a month and a half. So there's no alternative way to create an index on a table that doesn't require temporarily duplicating it?
12:08 am on Nov 2, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4864
votes: 13


Partitioning might actually be relevant, if there's a sensible field you can partition on. If there's an integer primary key, then perfect.

For example, consider dumping the data into files, say 256 of them, and creating a new table with 256 partitions, with the partition based on the modulus of the primary key. Dump the data into the files, deleting from the original table if necessary. Then sort each individual file and insert it into your partitioned table. Finally, add the index.

It really depends on the data and table types.

Adding an index on a table of that size will invariably take a long time.
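
To give a rough idea of the partitioned half of that, the DDL and the dump/reload flow would look something like this (purely illustrative; the column names, file paths, and partition count are all assumptions):

-- new table split on the modulus of an integer primary key
CREATE TABLE new_table (
  id  BIGINT UNSIGNED NOT NULL,
  val VARBINARY(255)  NOT NULL,
  PRIMARY KEY (id)
) ENGINE=MyISAM
PARTITION BY HASH(id) PARTITIONS 256;   -- HASH on an integer key is effectively MOD(id, 256)

-- dump one chunk per partition, sort each file externally, then reload
SELECT id, val INTO OUTFILE '/tmp/chunk_000.tsv' FROM old_table WHERE MOD(id, 256) = 0;
LOAD DATA INFILE '/tmp/chunk_000.tsv' INTO TABLE new_table;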
12:55 am on Nov 2, 2014 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14663
votes: 99


Adding an index on a table of that size will invariably take a long time.


It shouldn't.

I used to work on DB tech in the '80s and '90s and even built a couple of databases from scratch. Building an index shouldn't take much more than the time it takes to read the original data to extract the keys and do a quick sort of the key data to generate a new index file; I've done it myself.

I don't know why modern tables work the way they do, as adding an ad-hoc index should be a trivial and quick undertaking, but it's not.

I'd really have to see the actual data to come up with a plan to reindex within the same space. I've managed to do it before, but we had to jump through hoops, and doubling the disk space is the path of least resistance.
1:09 am on Nov 2, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4864
votes: 13


Yeah, there's a lot of "it depends" with recent RDBMSs, I guess. You can tune the DB to speed up a big insert, though generally it's going to take quite a bit longer than the time a hard disk would take to simply read that many megabytes.

Storage engines like InnoDB insist on a primary key and will create one if you don't supply it. The table is clustered on the primary key and secondary indexes carry a copy of it, so those tables quickly get bigger than your original data size even without any extra indexes.
1:10 am on Nov 2, 2014 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 22, 2004
posts: 120
votes: 0


Well since everyone wants to see the data, here's three rows:

+----+----------------------------------+------------------------------------------+
| 1d | cc7a1e792bca8ccb1946b7a07f6dbc03 | 11a2757082428311f587b7664fa9840376137f80 |
| 1e | c5f590d18e1ea128362eeafb7192cc21 | 0859d0c4f1baa7dcb07cb1aec4a1263ec39229e4 |
| 1f | 0cde80d4d65fc5307a07f86ac9dfa34c | bd3890d2a3e58ad32d304b28c6b00ba03dad5e91 |
+----+----------------------------------+------------------------------------------+

There is no primary key. This data is created all at once by a script, and I will never edit or change any of it once it's done. I just need a way to search the database.

ALSO, I'm not really storing the hex values shown; I'm storing the raw binary data, so those fields are half the size they appear.

If it's not clear, column 1 is the word, column 2 is the MD5 hash, and column 3 is the SHA1 hash. I'm ultimately trying to make two indexes: one on the first 4 bytes of the md5 field (i.e. the first 8 hex characters), and one on the first 5 bytes (10 hex characters) of the sha1 field.

This is the query I'm using to make the indexes:
ALTER TABLE `hash_db` ADD INDEX `md5_quarter` (`md5`(4))
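
The second one will presumably be the same form on the sha1 column, and I'm assuming both can go in a single ALTER so the table only gets copied once (the sha1 index name is just a placeholder):

ALTER TABLE `hash_db`
  ADD INDEX `md5_quarter` (`md5`(4)),
  ADD INDEX `sha1_quarter` (`sha1`(5));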
1:12 am on Nov 2, 2014 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 22, 2004
posts: 120
votes: 0


Also note that it is a MyISAM table, mainly because the insert speed was many times faster than on InnoDB tables.

Also, my ultimate goal here was to completely fill a 3 TB hard drive with the largest hash DB I could fit on it, and then possibly repeat on another drive or two, making things even larger.
1:25 am on Nov 2, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4864
votes: 13


Fixed-length and unchanging is great; you can optimise a lot with that.

It seems the first column has the potential to be a primary key.

It'd be much more efficient to store the MD5 and SHA1 in binary columns, half the size of what they are at the moment: an MD5 should be a BINARY(16) and a SHA1 a BINARY(20). That makes it more space efficient and quicker to search against. HEX() and UNHEX() would be used to transform the values back and forth. Your resultant index would only be 2 bytes in length.

For a lookup table like that, I'd be inclined to have the MD5s and SHA1s in separate partitioned tables, using the first 12 bits of the hash to decide which partition each row goes into (MySQL has a limit of 8192 partitions per table). Those hashing schemes will ensure data is evenly distributed across all partitions. I'd maybe even store just the first 4 bytes of each hash alongside the PK, do the lookups against that, and then join on the PK to get the full hashes.

Definitely use BINARY columns whatever you do; that in itself will halve your disk space usage.
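
To make that concrete, the column types and a lookup would be roughly along these lines (table and column names are just examples, not the real schema):

-- hashes stored as raw bytes rather than hex strings
CREATE TABLE md5_lookup (
  md5  BINARY(16)     NOT NULL,
  word VARBINARY(255) NOT NULL,
  INDEX md5_prefix (md5(4))      -- prefix index on the first 4 bytes
) ENGINE=MyISAM;

-- UNHEX() turns the hex you have on hand into raw bytes for the comparison
SELECT word
FROM md5_lookup
WHERE md5 = UNHEX('cc7a1e792bca8ccb1946b7a07f6dbc03');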
1:36 am on Nov 2, 2014 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 22, 2004
posts: 120
votes: 0


Yes, read below the table sample: I AM storing in binary already. I convert to hex for display purposes.

But I still don't see why I even need a primary key, or a partition, if it's all going on one big drive anyway. What is the benefit of partitioning?
7:39 am on Nov 2, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4864
votes: 13


Gotcha, I wasn't 100% certain that you'd stored them as such.

Partitions are just like tables, but they save you from creating lots of tables with the same schema. Using my example of 4096 partitions (split by hash) and your 4-byte index, your queries would run against data and indexes 1/4096th the size of your full table.
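
As a rough sketch of the DDL, reusing the earlier md5_lookup example (note this uses MySQL's built-in KEY partitioning, which hashes the column internally rather than literally taking the first 12 bits):

-- repartition the table into 4096 pieces; this rebuilds the table
ALTER TABLE md5_lookup PARTITION BY KEY(md5) PARTITIONS 4096;

-- EXPLAIN PARTITIONS (MySQL 5.6 and earlier) shows that an equality lookup only touches one partition
EXPLAIN PARTITIONS SELECT word FROM md5_lookup
WHERE md5 = UNHEX('cc7a1e792bca8ccb1946b7a07f6dbc03');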
7:23 pm on Nov 7, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member whoisgregg is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Dec 9, 2003
posts:3416
votes: 0


First, there's already a ton of prior work available online for this problem. Search github (or Google) for "rainbow table" and you may be able to save yourself a bunch of work.

Let's imagine we were building this system as a flat-file store; along the way, we might learn why partitioning can help us.

Basic Approach: Take your list of words and create a hash for each one. Create a file with the hash as its name and the word as the file's contents. Example:

/md5/cc7a1e792bca8ccb1946b7a07f6dbc03.txt
/md5/c5f590d18e1ea128362eeafb7192cc21.txt
/md5/0cde80d4d65fc5307a07f86ac9dfa34c.txt


Problem: Operating systems, particularly GUIs, have a lot of trouble with this many files in one directory (because they are also trying to build indexes and sort things, just like a database). If you did this with billions of files and then tried to open the folder, your machine would just hang forever. (Actually, the script doing the work would also have trouble, and you'd never be able to finish populating this directory.)

Getting Closer: Let's take the first couple characters and make subdirectories based on those:

/md5/cc/7a1e792bca8ccb1946b7a07f6dbc03.txt
/md5/c5/f590d18e1ea128362eeafb7192cc21.txt
/md5/0c/de80d4d65fc5307a07f86ac9dfa34c.txt


We've just reduced the number of files in each directory by a factor of 256! However, since you are talking about billions of rows, your OS will probably still have some issues. We have to go deeper!

Overkill: We slice up the directories by every two characters until all we have left is a two character file name nested under a bunch of directories:

/md5/cc/7a/1e/79/2b/ca/8c/cb/19/46/b7/a0/7f/6d/bc/03.txt
/md5/c5/f5/90/d1/8e/1e/a1/28/36/2e/ea/fb/71/92/cc/21.txt
/md5/0c/de/80/d4/d6/5f/c5/30/7a/07/f8/6a/c9/df/a3/4c.txt


Now instead of needing to deal with billions of files in one insanely huge directory, your computer is dealing with directories that each contain no more than 256 files/directories.

So what does this have to do with partitioning?
Modern computers are powerful enough to deal with more than 256 files at a time, which is why doing the above is overkill. A good partitioning scheme splits the data into chunks of the "right size" to be quickly searchable, and the split should be based on how you are going to perform your queries:

`md5_cc7a`
`md5_c5f5`
`md5_0cde`


Each of those tables ONLY contains hashes that start with the four characters in its name. So when you take your input (a hash), you take its first four characters and search the appropriate table. Doing this means you've already narrowed the number of possible rows by a factor of 65,536 before even asking the database to do any work!
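
A sketch of that scheme in SQL terms (table names follow the pattern above; the application picks the table from the first four hex characters of the input hash, and column names are only examples):

-- one of 65,536 small tables, each named after a 4-hex-character prefix
CREATE TABLE md5_cc7a (
  md5  BINARY(16)     NOT NULL,
  word VARBINARY(255) NOT NULL,
  INDEX (md5)
) ENGINE=MyISAM;

-- the input hash starts with 'cc7a', so only this table gets searched
SELECT word FROM md5_cc7a
WHERE md5 = UNHEX('cc7a1e792bca8ccb1946b7a07f6dbc03');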

Some other suggestions:
1. Break your md5 and sha1 hashes into entirely separate tables. (You are never searching on both at once, right? Keep them separate for smaller table sizes.)
2. Put the hash in the first column and the word in the second column. (Fixed-length columns should come before variable-length columns, and columns you search on should come before columns you return.)