Forum Moderators: coopster

Message Too Old, No Replies

Rand() related

Regards probality

         

henry0

6:36 pm on Feb 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I do not know how to perform a probability check.

This is indirectly related to PHP.

I would like generating a rand ID using rand() and a combo of letters and number
To keep it simple I wish to only use 6 chars
So the question is:
What is the probability that out of 100k such rand IDs a duplicate could be generated?
I could indeed match it VS previously DB stored but if the frequency of possible duplication is somehow great then it becomes detrimental and I need finding a way with less duplication potentiality

Hope it's clear :)

jatar_k

6:44 pm on Feb 21, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



it depends on how many you plan on using out of the total number, as you use them up you increase your probability of hitting a dupe.

if you're using letters a-z and 0-9 then th possible permutations is much higher than 100K

eelixduppy

6:48 pm on Feb 21, 2007 (gmt 0)




if you're using letters a-z and 0-9 then th possible permutations is much higher than 100K

2,176,782,336 I believe ;)

The odds are slim, but still possible. Why not use some type of auto-increment?

henry0

6:51 pm on Feb 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks
Sorry I was not clear enough
those are not going to be generated ahead of time
but a conservative projection calls for at least 100K
member id.

henry0

6:53 pm on Feb 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Could not explain the logic calling for rand
but I cannot use increment.

jatar_k

7:03 pm on Feb 21, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



>> 2,176,782,336 I believe

1,402,410,240 I believe

n!/(n-m)!

where
n=36 (possible chars per position - a-z+0-9)
m=6 (number of chars in set)

;)

100k may be ok, that is quite a small set for that number of possibilities

if you wanted to reduce the probability increase the number of chars to 8 or 10

8 would be 1.22 x 10 to the power of 12

10 would be 9.22 x 10 to power of 14

either increases the number of possibilities by quite a bit

eelixduppy

7:14 pm on Feb 21, 2007 (gmt 0)



noooooooooooooooooooo....goes to show you how well I do with probability and statistics!

What he said!...umm...yea! ;)

henry0

7:16 pm on Feb 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks jatar_k
I learned something today!
and I will increase it from 6 to 8.

jatar_k

8:24 pm on Feb 21, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



I knew what it was called which made it easy to look up the actual equations

high school math was an awfully long time ago ;)

eelixduppy

10:27 pm on Feb 21, 2007 (gmt 0)



I don't want this thread to go off topic, but I think I may have been actually correct. Permutations are pretty much without replacement, so by saying "we have 36 characters taken 6 at a time" means that the same character cannot be repeated. Let's put it this way, let's say you have three students A, B, and C, and you want to know how many permutations taken 2 at a time. So you have:

AB
BA
AC
CA
BC
CB

But I don't think that AA, BB or CC are valid. If this is the case, then what you are saying is that there are however many (1.whatever billion) permutations without repeating a character. Therefore, I believe the method I originally used


36*36*36*36*36*36 = 2,176,782,336 to be correct.

But then again, I could be wrong :)

Sorry for taking the thread off topic

jatar_k

10:36 pm on Feb 21, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



ah, henry is done so we can mess with his thread ;)

true, I was looking at combinations, so you didn't get an id like aaaaaa, which may be an issue

so yes, you're exactly right eelix

henry0

11:05 pm on Feb 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'll mess yours next :)
You guys are showing resulting single permutations number.

but the real trick is to find the propability that any rand result will/could/might come back twice if you hit refresh 100K

let the experts scratch their head.

eelixduppy

11:11 pm on Feb 21, 2007 (gmt 0)



Well the probability that one id is selected to match another is 1/2,176,782,336. So if you are talking about 100k ids then I believe it would be 100,000/2,176,782,336 = 4.59 x 10^-5 which is 0.0046 %. If you ask me, it is a small probability.

I did get C's in my stats class; you may want to consult an expert ;)

henry0

11:24 pm on Feb 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Unless I am wrong it will result in 5 matches
so I also need to add some sort of if exists....
Well, anyway I won't feel ok if I did not test for dupe.
thanks all that stuff is kind of fun!

eelixduppy

11:32 pm on Feb 21, 2007 (gmt 0)




Unless I am wrong it will result in 5 matches

Well, it doesn't have to as it's based on probability, but it is probably the best idea to check for duplicates anyway.

jatar_k

12:52 am on Feb 22, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



sends 'em all into a frenzy, good stuff henry

my honest opinion is that random ids are never the best

you can add logic to them as well

letters from first and/or last name
date of birth
append a short random set of digits

any other info you have about the member can be truncated and used to build something that will be truly unique

especially if you can't use an auto increment it makes me think you will use the actual number somehow where the user can see it

makes it easier for people to remember if there is logic around it

henry0

11:56 am on Feb 22, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Now we are speaking!
This is really a neat way to look at it, good solution.
your suggestion should be a model
bookmark it for future ref.

whoisgregg

5:24 pm on Feb 23, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here's one idea I've tossed around for the issue of a public facing key that appears to be random. Instead of generating concatenated keys or iterating through random strings until I find one that hasn't been used yet, I figured an md5 hash of the auto-incrementing serial number itself would be sufficient.

32 characters is a lot if you also expect people to retype the key. If you want fewer characters, just use a substring of the md5 hash. Here's how many unique public keys you'll get before experiencing a collision based on the length of the substring (assuming you start at position 0, starting at different positions will yield different results):

8 characters: 82,944
9 characters: 400,703
10 characters: 441,590
11 characters: 6,283,303

I didn't go any higher than 11 characters, but I would assume that even just a 14-16 character substring would yield a suitably high number of uniques for many applications.

I generated these numbers by populating a table with md5 hashes of the serial numbers with different length UNIQUE keys set on the md5 column which I assume to be identical to inserting a substring(md5($id),0,11) into a VARCHAR(11) field.

henry0

7:21 pm on Feb 23, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This is a great solution
but something I do not get is:
eelix and jatar_k after a bit of fine tuning came to agree that the number of permutations using 6 was equal to: 1,402,410,240

but even with 8 your number is lower?

I am just trying to understand the math because your idea sounds real easy and making sense.

Could you define the math behind your number?
thanks

eelixduppy

8:07 pm on Feb 23, 2007 (gmt 0)



Remember you can also make the ID case sensitive, which brings in a whole new slew of possibilities ;)

whoisgregg

8:13 pm on Feb 23, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's not a probability figure... I basically ran substring(md5($id), 0, 8) on auto-incrementing serial numbers until it failed at record 82,945. Then I ran substring(md5($id), 0, 9) until it failed at record 400,703. Etc...

Instead of the possibility of collision points (although small) for every new record, the idea I tossed out there gives you clearly defined collision points so you won't have to iterate through random attempts until you find an unused primary key.

I worry about comparing the uniqueness of random keys [webmasterworld.com] because of what ergophobe pointed out to me a couple years back:

once you hit say 1.45 quindecillion records, it could take days to generate a unique key couldn't it?

Even concatenated keys make me worry because I have seen concatenated keys like "first two characters of the first name, last character of the last name, middle three digits of the phone number, and a random 4 digit number" end up with a shocking number of collisions over just a hundred thousand records.

jatar_k

8:29 pm on Feb 23, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



which is why I never use random keys

I do like adding logic to my keys when necessary but the numbers used in them are always sequential in some way

whenever I contemplate randomness I always think of rand functions when I started programming. If you didn't manually seed them they always came up with the same sequence of numbers.