First, there's already a ton of prior work available online for this problem. Search github (or Google) for "rainbow table" and you may be able to save yourself a bunch of work.
Let's imagine we're building this system as a flat file store and, along the way, see why partitioning can help us.

Basic Approach:
Take your list of words and create a hash for each one. Create a file with the hash as its name and the word as its contents. Example:
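The original example isn't shown here, but a minimal sketch of the idea in Python might look like the following (the wordlist and the `store` directory name are made up for illustration, and MD5 is assumed as the hash):

```python
import hashlib
import os

words = ["password", "hunter2", "letmein"]  # hypothetical wordlist

os.makedirs("store", exist_ok=True)
for word in words:
    digest = hashlib.md5(word.encode()).hexdigest()
    # The file is named by the hash; its contents are the plaintext word.
    with open(os.path.join("store", digest), "w") as f:
        f.write(word)
```

Looking up a hash is then just opening the file with that name and reading it.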
Operating systems, particularly their GUIs, have a lot of trouble with this many files in one directory (because they are also trying to build indexes and sort things, just like a database). If you did this with billions of files and then tried to open the folder, your machine would hang forever. (Actually, the script doing the work would also struggle, and you'd never finish populating the directory.)

Getting Closer:
Let's take the first couple characters and make subdirectories based on those:
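A sketch of that change, again with hypothetical names: the first two hex characters of the digest pick the subdirectory (256 possible values), and the rest of the digest becomes the file name.

```python
import hashlib
import os

words = ["password", "hunter2", "letmein"]  # hypothetical wordlist

for word in words:
    digest = hashlib.md5(word.encode()).hexdigest()
    # First two hex characters select one of 256 subdirectories.
    subdir = os.path.join("store2", digest[:2])
    os.makedirs(subdir, exist_ok=True)
    # The remaining characters become the file name.
    with open(os.path.join(subdir, digest[2:]), "w") as f:
        f.write(word)
```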
We've just reduced the number of files in each directory by a factor of 256! However, since we're talking about billions of rows, your OS will probably still have some issues. We have to go deeper!

Overkill:
We slice up the directories by every two characters until all we have left is a two character file name nested under a bunch of directories:
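A sketch of the fully nested layout (directory and function names are made up for illustration): every two-character slice of the digest becomes a directory level, except the last slice, which becomes the file name.

```python
import hashlib
import os

def nested_path(digest: str) -> str:
    # Split the hex digest into two-character pieces: all but the last
    # become nested directories, the last becomes the file name.
    parts = [digest[i:i + 2] for i in range(0, len(digest), 2)]
    return os.path.join("store3", *parts)

word = "password"
digest = hashlib.md5(word.encode()).hexdigest()
path = nested_path(digest)
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w") as f:
    f.write(word)
```

For a 32-character MD5 digest this gives 15 levels of directories with a 2-character file at the bottom, so no directory ever holds more than 256 entries.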
Now instead of dealing with billions of files in one insanely huge directory, your computer deals with directories that each contain no more than 256 files/directories.

So what does this have to do with partitioning?
Modern computers are powerful enough to deal with far more than 256 files at a time, which is why doing the above is overkill. A good partitioning scheme splits the data into chunks of the "right size" to be quickly searchable, and it makes sense to base the partitions on how you are going to perform your queries:
Each of those tables ONLY contains hashes that start with the same four characters as the name of the table. So when you take your input (a hash), you take its first four characters, then search in the appropriate table. Doing this means you've already narrowed the number of possible rows by a factor of 65,536 before even asking the database to do any work!
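The lookup logic above can be sketched in Python with sqlite3; the `md5_` prefix table-naming scheme is an assumption for illustration, not part of the original answer. (Building table names with f-strings is safe here only because the prefix is four hex characters, never untrusted input.)

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")

def table_for(digest: str) -> str:
    # Hypothetical scheme: one table per 4-character hash prefix.
    return f"md5_{digest[:4]}"

def insert(word: str) -> None:
    digest = hashlib.md5(word.encode()).hexdigest()
    table = table_for(digest)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (hash TEXT PRIMARY KEY, word TEXT)")
    conn.execute(f"INSERT OR IGNORE INTO {table} VALUES (?, ?)", (digest, word))

def lookup(digest: str):
    table = table_for(digest)
    try:
        row = conn.execute(
            f"SELECT word FROM {table} WHERE hash = ?", (digest,)
        ).fetchone()
    except sqlite3.OperationalError:
        # No table for this prefix means no stored hash can match.
        return None
    return row[0] if row else None

insert("password")
print(lookup(hashlib.md5(b"password").hexdigest()))  # prints "password"
```

Note that a missing table is itself an answer: if no table exists for the prefix, the hash definitely isn't in the store.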
Some other suggestions:
1. Break your md5 and sha1 hashes into separate tables entirely. (You're never searching on both at once, right? Keeping them separate means smaller tables.)
2. Have the hash be the first column and the word be the second column. (Fixed-length columns should come before variable-length columns, and columns you search on should come before columns you return.)