Forum Moderators: coopster

Message Too Old, No Replies

Generating Unique Primary Keys for MySQL with PHP

Serial Numbers? Seriously?

         

whoisgregg

11:12 pm on Jul 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm looking at some of my MySQL databases that rely on auto-incremented serial numbers for primary keys and I've decided it's time to come up with a better way as my projects are becoming increasingly complex.

I wouldn't want to be in the position of writing a solution that used up the 32,768 keys per month their SMALLINT serial number generator (just a guess) was producing and left thousands of customers stranded on Christmas day. [story [cincypost.com]]

I've put this function together that (as shown) has 3636 possible keys, or a little over 1.46 quindecillion. (It scales easily to smaller or longer keys.)

function produceFib($length) {
$val = '';
for($i=0;$i<$length;$i++) {
$str = 'abcdefghijklmnopqrstuvwxyz0123456789';
$shuffled = str_shuffle($str);
$val .= substr($shuffled, 0, 1);
}
return $val;
}
echo produceFib(36);

So here's my questions:

  • Is this overkill?
  • Do I need to do something more complex?
  • How do you do it?
  • BarryStCyr

    11:54 pm on Jul 7, 2005 (gmt 0)

    10+ Year Member



    If you don't like smallint make it unsigned bigint with auto_increment. That will give you a number up to 18446744073709551615.

    Hope this helps.

    whoisgregg

    4:32 am on Jul 8, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    I don't have a problem with smallint -- just the idea of using serial numbers. Are serial numbers considered best practices for mysql databases?

    I feel like I've been shortcutting so far, but I'm very open to the idea that I'm already doing things the right way. :)

    coopster

    12:15 pm on Jul 8, 2005 (gmt 0)

    WebmasterWorld Administrator 10+ Year Member



    I'm unclear as to whether the discussion is in regards to the practice of using a serialized number, or a better way to handle the practice. I'll chime in on the first ...

    If you are absolutely certain the primary key won't change, you can use something other than a serialized number value. But you must really think it through in your design. For example, let's say you have a user table and you set the primary key as the username because, hey, each and every user must be unique, right? Well, Jane Johnson signs up as janeJohnson and later that year marries John Doe. Now she wants her username to be janeDoe -- too late for you as you cannot change the primary key, not without some difficulty anyway, because all your other related tables are linked by Primary-to-Foreign key by username relationship.

    As far as the handling of a serialized number, there are many ways to go about it. Increasing the maximum value by changing column type is one option as suggested and should meet most if not all needs. If you are getting beyond that you can definitely come up with your own system or possibly even revisit your application's structure.

    ergophobe

    5:09 pm on Jul 8, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    I realize this is rather far down the road, but with your method, once you hit say 1.45 quindecillion records, it could take days to generate a unique key couldn't it? In other words, it looks like you're generating a random string, then I assume checking elsewhere that it's unique and if so, using it, if not, rejecting it and trying again.

    If I didn't want to use a serial autoincrement or thought I was going to exceed the unsigned bigint number, I would probably do something like an MD5 hash that took the time and user's first name as input. That would result in very few random collisions and would still give you 3632 combinations and if that wasn't enough, I could do something like concatenate a hash of the time and hash of the last name for 3664, but I hope I'll retire before I ever see a database as big as either of those!

    whoisgregg

    7:41 pm on Jul 8, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    How does concatenating the MD5 hash of the time and user name reduce collision compared to just concatenating the time and user name directly?

    Or is there a different benefit to using the MD5 hash?

    Also, thanks to all of you for your input -- I've been working in the dark on this concept because there just aren't practical discussions out there to reference.

    whoisgregg

    7:51 pm on Jul 8, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    One idea from an offline database person... Using the random string generator above, couldn't I just increase the character length by one each time there was a collision? In other words, the first time the 36 characters have a match, the length is increased to 37 characters. (Using this method, the system could start with as few as 5 or 6 characters and just expand over time to accomodate any quantity of records.)

    ergophobe

    8:54 pm on Jul 8, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




    How does concatenating the MD5 hash of the time and user name reduce collision compared to just concatenating the time and user name directly?

    Each hash has a 1 in 3632 chance of collision. With two concatenated hashes, I need to pull that off twice instead of just once in order for the key in question to have a collision.

    In other words, there is a 1 in 3632 that md5("stringA") and md5("stringB") will have a collision and a 1 in 3632 that md5("stringC" and md5("stringD") will have a collision, which is a 1 in 3632 * 1 in 3632 chance that the two of them concatenated will have a collision.

    There is, however, a 1 in 3632 chance that md5("stringAstringC") will collide with md5("stringBstringD")


    Or is there a different benefit to using the MD5 hash?

    Just that
    1. it's a built-in function, so likely very fast
    2. it's highly portable
    3. it can be seeded with a non-random string (like a timestamp converted to a string) to reduce the chance of collisions

    mogenshoj

    9:00 pm on Jul 8, 2005 (gmt 0)

    10+ Year Member



    I use MD5 as a "serial number" in my database, i recently started a free online counter (non english), it has about 300.000 mysql calls a day. And 98% using the serial number to find the correct data.

    It has no problems yet, and i can find the correct ID of a row, without seaching the database because the md5 will make the correct number for me.

    whoisgregg

    7:26 pm on Jul 9, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Thanks very much for the insights everybody! I have a much better idea now of how to go about setting up primary keys. :)