Forum Moderators: coopster
I wouldn't want to be in the position of writing a solution that used up the 32,768 keys per month their SMALLINT serial number generator (just a guess) was producing and left thousands of customers stranded on Christmas day. [story [cincypost.com]]
I've put this function together that (as shown) has 3636 possible keys, or a little over 1.46 quindecillion. (It scales easily to smaller or longer keys.)
function produceFib($length) {
$val = '';
for($i=0;$i<$length;$i++) {
$str = 'abcdefghijklmnopqrstuvwxyz0123456789';
$shuffled = str_shuffle($str);
$val .= substr($shuffled, 0, 1);
}
return $val;
}
echo produceFib(36); So here's my questions:
If you are absolutely certain the primary key won't change, you can use something other than a serialized number value. But you must really think it through in your design. For example, let's say you have a user table and you set the primary key as the username because, hey, each and every user must be unique, right? Well, Jane Johnson signs up as janeJohnson and later that year marries John Doe. Now she wants her username to be janeDoe -- too late for you as you cannot change the primary key, not without some difficulty anyway, because all your other related tables are linked by Primary-to-Foreign key by username relationship.
As far as the handling of a serialized number, there are many ways to go about it. Increasing the maximum value by changing column type is one option as suggested and should meet most if not all needs. If you are getting beyond that you can definitely come up with your own system or possibly even revisit your application's structure.
If I didn't want to use a serial autoincrement or thought I was going to exceed the unsigned bigint number, I would probably do something like an MD5 hash that took the time and user's first name as input. That would result in very few random collisions and would still give you 3632 combinations and if that wasn't enough, I could do something like concatenate a hash of the time and hash of the last name for 3664, but I hope I'll retire before I ever see a database as big as either of those!
Or is there a different benefit to using the MD5 hash?
Also, thanks to all of you for your input -- I've been working in the dark on this concept because there just aren't practical discussions out there to reference.
How does concatenating the MD5 hash of the time and user name reduce collision compared to just concatenating the time and user name directly?
Each hash has a 1 in 3632 chance of collision. With two concatenated hashes, I need to pull that off twice instead of just once in order for the key in question to have a collision.
In other words, there is a 1 in 3632 that md5("stringA") and md5("stringB") will have a collision and a 1 in 3632 that md5("stringC" and md5("stringD") will have a collision, which is a 1 in 3632 * 1 in 3632 chance that the two of them concatenated will have a collision.
There is, however, a 1 in 3632 chance that md5("stringAstringC") will collide with md5("stringBstringD")
Or is there a different benefit to using the MD5 hash?
Just that
1. it's a built-in function, so likely very fast
2. it's highly portable
3. it can be seeded with a non-random string (like a timestamp converted to a string) to reduce the chance of collisions
It has no problems yet, and i can find the correct ID of a row, without seaching the database because the md5 will make the correct number for me.