Forum Moderators: coopster


choose random item from list but with different weighting

without huge array


ergophobe

5:30 pm on May 12, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I want to choose an item at random, but have different probabilities for each item. One easy way is just to create an array like

$a[] = 1;
$a[] = 2;
$a[] = 3;
$a[] = 3;

That way a random pick from those four entries would give the weighting I want, but it seems like there has to be a much smarter way to do this. I've seen some algorithms for random distributions, but they don't quite do what I want. Oh yeah - it must not require a DB; this should work with values from a CSV or similar.
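Spelled out, that expanded-array idea looks like this (the item names and weights below are just made-up examples):

```php
<?php
// Naive approach: repeat each item in proportion to its weight,
// then pick uniformly from the expanded array. Names/weights are
// illustrative, not from the thread.
$weights = ['red' => 1, 'green' => 2, 'blue' => 3];

$expanded = [];
foreach ($weights as $item => $weight) {
    for ($i = 0; $i < $weight; $i++) {
        $expanded[] = $item;
    }
}

// A uniform pick now follows the weighted distribution:
// 'blue' comes back half the time on average.
$choice = $expanded[array_rand($expanded)];
```

The drawback is exactly the one raised below: the array size is the sum of all the weights, not the number of items.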

jezra

8:00 pm on May 12, 2005 (gmt 0)

10+ Year Member



If the weight of the item is in the csv file, for example:

item_name,item_weight
widget_blue,1
widget_yellow,1
widget_red,2
widget_purple,2
widget_green,3
widget_white,3
widget_black,4
widget_orange,4

You could try something like this

$randNum = rand(1,15);
switch ($randNum)
{
    case 1:
        $weightNum = 1;
        break;
    case 2:
    case 3:
        $weightNum = 2;
        break;
    case 4:
    case 5:
    case 6:
    case 7:
        $weightNum = 3;
        break;
    default:
        $weightNum = 4;
}

Now read the CSV file and create an array of the CSV items whose weight matches $weightNum. Then perform array_rand [us3.php.net] on your new, hopefully small, array. I hope this helps.
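Putting both stages together, it might look roughly like this (the CSV is inlined here for a self-contained sketch, and the draw is corrected to 1..10 so each weight class gets exactly its share; the off-by-one in the snippet above is discussed below):

```php
<?php
// Two-stage pick: choose a weight class in proportion to the weight,
// then pick uniformly among the items in that class.
$csv = <<<CSV
item_name,item_weight
widget_blue,1
widget_yellow,1
widget_red,2
widget_purple,2
widget_green,3
widget_white,3
widget_black,4
widget_orange,4
CSV;

// Map a uniform draw 1..10 onto weight classes 1..4
// (1 slot for weight 1, 2 for weight 2, and so on).
$slots = [1 => 1, 2 => 2, 3 => 2, 4 => 3, 5 => 3, 6 => 3,
          7 => 4, 8 => 4, 9 => 4, 10 => 4];
$weightNum = $slots[rand(1, 10)];

// Collect the items whose weight matches, then array_rand on the subset.
$matches = [];
$lines = explode("\n", $csv);
array_shift($lines); // drop the header row
foreach ($lines as $line) {
    [$name, $weight] = str_getcsv($line);
    if ((int)$weight === $weightNum) {
        $matches[] = $name;
    }
}
$choice = $matches[array_rand($matches)];
```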

ergophobe

8:55 pm on May 12, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks for the idea. Which way you go - big array or set of case statements - depends on the number of items versus the number of weights allowed.

Your method gets the array size down and would definitely help in the case of many items and only a small choice of different weightings. With a larger number of weights, though, it gets cumbersome fast, since the total line count comes to:

(n^2 - n) / 2 case statements
+ 1 for the default
+ n assignment statements
+ (n-1) break statements

where n is the number of different weightings allowed. So in the case you gave, I get

(4^2 - 4) / 2 = 6 case + 1 default + 4 assignments + 3 breaks = 14 lines.

(you have 15, but that's just 'cause you accidentally have one too many cases in the last set). That's pretty reasonable.

However, it doesn't scale very well. If I wanted to do what Launchcast does and allow weightings from 1-100, I end up with

(100^2 - 100) / 2 = 4950 + 1 + 100 + 99 = 5150 lines of code.

If I have 100 items with an average weight of 50, I end up with an array of 5000 elements, all small integers, which is probably better than 5150 lines of code (concurrent requests are not an issue here, so caching code versus generating a new array for each user wouldn't have any advantage/disadvantage).

The problem with both methods is that they scale poorly:
- mine scales as the number of items times the average weight
- yours scales quadratically with the number of weightings

There must be a good way to make this scale more linearly, just as sorts scale poorly or well depending on the algorithm (that thought got me considering tree data structures, BTW, but I couldn't see how they'd give an advantage in this case).
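For the record, there is a standard linear-time technique for exactly this (a sketch, not something proposed in the thread; the item names are made up): total the weights, draw one random number up to the total, then walk the list subtracting each weight until the draw is used up.

```php
<?php
// Linear weighted pick: one pass to total the weights, one pass to find
// the item the draw lands on. No expanded array, no case statements;
// work is proportional to the number of items, not the sum of weights.
function weightedPick(array $weights)
{
    $total = array_sum($weights);
    $draw = rand(1, $total);
    foreach ($weights as $item => $weight) {
        $draw -= $weight;
        if ($draw <= 0) {
            return $item;
        }
    }
}

$choice = weightedPick(['widget_blue' => 1, 'widget_red' => 2, 'widget_green' => 3]);
```

With the Launchcast-style scenario (100 items, weights 1-100), this is a 100-element array and a single short loop per pick.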

dmorison

9:19 pm on May 12, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If disk space isn't a problem, and if each record in the CSV file is of similar length, how about pre-processing the file to create a second CSV file with each record duplicated n times, where n is proportional to that record's weight?

Then, when you want to choose, fseek() into the file a random number of bytes using something like rand(0, filesize($file) - 1), then fgets() to scan to the end of the line you landed on, and finally fgets() once more to read a full record.
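A sketch of that seek-and-scan pick (assuming the pre-expanded file already exists and is non-empty; note the first fgets() will usually land mid-line, and a seek into the final line needs a rewind back to the start):

```php
<?php
// Seek to a random byte offset, discard the (probably partial) line you
// land in, and read the next full line. Wraps to the first line when the
// seek lands inside the last one. The file path is illustrative.
function randomLine(string $path): string
{
    $fh = fopen($path, 'r');
    $size = filesize($path);
    fseek($fh, rand(0, $size - 1));
    fgets($fh);                 // skip to the end of the current line
    $line = fgets($fh);         // read the next complete record
    if ($line === false) {      // ran off the end: wrap to the first line
        rewind($fh);
        $line = fgets($fh);
    }
    fclose($fh);
    return rtrim($line, "\n");
}
```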

dmorison

9:40 pm on May 12, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



...alternatively, if the error introduced because the records are not all the same length is too great (or disk space is an issue), how about using the same idea but only creating an index file pointing at the original CSV.
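One way that index could work (a sketch under assumed conventions, not dmorison's spec): store each record's byte offset into the original CSV as a fixed-width string, repeated once per unit of weight, so a random slot can be located with simple arithmetic.

```php
<?php
// Fixed-width offset index: one SLOT-digit entry per unit of weight.
// The slot width and record layout here are illustrative assumptions.
const SLOT = 10; // digits per offset entry

function buildIndex(array $records): string
{
    // $records: list of [byteOffset, weight] pairs for the original CSV
    $index = '';
    foreach ($records as [$offset, $weight]) {
        $index .= str_repeat(sprintf('%0' . SLOT . 'd', $offset), $weight);
    }
    return $index;
}

function pickOffset(string $index): int
{
    $slots = strlen($index) / SLOT;       // every entry is SLOT bytes wide
    $slot  = rand(0, $slots - 1);
    return (int)substr($index, $slot * SLOT, SLOT);
}
```

Since every entry is the same width, the seek error from unequal record lengths disappears, and the index only costs SLOT bytes per unit of weight.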

ergophobe

10:05 pm on May 12, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You cheater!

Good idea though. I'll have to look into that, especially since there are no actual limitations in terms of file format and I don't think the file would ever get that big - 1000 integer-indexed items with an average weight of 50 should still only be around 50KB (or perhaps 100KB). In any case, relatively small.

Just because I'm stubborn that way, I'll keep exploring for algorithms. I wish I had a copy of Knuth - 10:1 if you look in there you'd find the most efficient algo known to man for doing just this.