An ASCII character primary key, where the lowest number of characters are used for starters, then put together, then in 3's etc until all combinations run out......basically the smallest primary keys possible. It's to be used for "word id's" in a data dictionary.
The dictionary has already been sorted out into word frequency, so all i need is a script to do the necessaries with the ASCII primary key.
Can anyone offer some tips as to how to go about this? :)
What do you mean by small? Small for the database, or small when transferred over the net? If the latter, I guess you're right. If you mean the former, then you're wrong.
When you say ASCII, I assume that you mean "char".
0-9, a-z, A-Z = 1-62 = for 1 char
00-99, 0a-9a, 0A-0Z... = 1-62^2 = 3844 for 2 chars
000 - ZZZ gives you 62^3 = 238328 for 3 chars
That gives you 242234 for 1-3 chars.
In terms of transfer over the net, you have some advantage in that three bytes let you transfer a key as large as you could with six bytes using integers. Of course, optimizing *one kb* off of an image makes up for 333.3 primary keys in integers/vs ASCII. Eliminate indents from your pages and you will likely gain more unless you are making zillions of queries and they all send different primary keys over the net. Generally you should only have to send the key a few times (one or two per link).
In terms of DB storage requirements, you actually lose and are wasting space and slowing things down.
If you use col type char, that's 3 bytes / record for the primary key (or 2-4 bytes if you use varchar).
However, MEDINT also takes up three bytes and lets you store numbers up to 16777215. To make that easier to read
16777215 with medint
242234 with 3-col
65535 with smallint (2 bytes)
Both take up three bytes. You have a huge advantage with integers since each byte gets you roughly 256^(number of bytes) while each ASCII char only gets you rougly 62^(number of bytes).
Tom
This is the situation....
The keys will be a primary key for word id's, for a web directory structure, where the filenames/titles/keywords etc will be substituted with the above word id's.
Think of a huge category directory structure and substitute all unique words with numbers. Does the scale make a difference?
The thing is, I've been reading about Huffman coding and related stuff....its just to make a streamline yet large directory.
The maths you posted was something I just start to play around with on excel after I found out how to concatenate :) Not so sure this has its use now after what you said.....
I assume you meant an integer would be more wise in a mySQL type db? Would it be different if it was flatfile...ie the chars you mention 0-9 a-z A-Z...I just thought that each character took up an equal amount of space...and using integers meant that there would be less (smaller) variations to use.
If I understand brotherhood_of_LAN correctly than what he wants is a script that produces something like A - B - AA - AB - BA - BB - AAA - AAB - BBA - BBB - BAA - BAB - ABA - ABB - AAAA - AAAB - BBBA - BBBB - BBAA - BBAB - BABA - BABB - BAAA - BAAB - ABBA - ABBB - ABAA - ABAB - AABA - AABB - AAAAA - AAAAB - BBBBA - BBBBB - BBBAA - BBBAB - BBABA - BBABB - BBAAA - BBAAB. I use such a function in my cms to generate the id attributes of certain elements used for DHTML.
Hereīs the code I use:
#!/usr/bin/perluse POSIX;
use strict;my @abc = qw(A B);
for (my $j=0;$j<40;$j++){
print get_index($j), " - ";
}sub get_index {
my $n = (shift)+1;
my $b = @abc+0;
my ($c, $i) = 0;
my $ind = '';
# How many slots do we need?
for ($i=1;$c<$n;$i++) {
$c += $b**$i;
}
$i--;
# Calculate slots from left to right
for ($i;$i>1;$i--) {
my $iv = $b**($i-1);
my $x = ($n-$iv)/$iv;
$x = ceil($x)-1;
$ind .= _get_slot($x);
}
# Calculate last slot
return join '', $ind, _get_slot($n-1);
}sub _get_slot {
my $n = shift;
$n = 0 if $n < 0;
my $b = @abc;
return $abc[$n - ($b * (int($n / $b) + 1))];
}
Comments are welcome. If you have a more elegant solution, please share it with us.
More elegant? I don't think so, but it is quite different. My solution
- uses PHP not Perl
- uses recursion not looping
- allows custom lists of "digits"
Recursion is the obvious solution for this problem (in fact, probably the only solution for a real Huffman tree) and would work really well in Scheme or something, but it I'm not sure PHP does recursion well. My quick check shows, however, that memory usage is low, so maybe it's not that bad.
By the way, There is a good section on Huffman encoding [mitpress.mit.edu] in Abelson and Sussman's SICP [mitpress.mit.edu] (which many people consider the best programming book ever written and which is now entirely available on the web). This is a book worth owning in print though.
[edited by: jatar_k at 9:04 pm (utc) on Sep. 13, 2002]
[edit reason] buggy version of script removed as per author. See below for correct version [/edit]
- allows custom lists of "digits"
Thatīs what the @abc array is for.
my @abc = qw(a r o n 1 4);would use just those "digits".
- uses recursion not looping
Thatīs what I wanted to avoid.
When I ran your script, it printed the following code which does not seem to be quite right.
3, 4, 5, 6, 7, 8, 9, a, b,
c, d, e, f, g, h, i, j, k, l,
m, n, p, q, r, s, t, u, v, w,
x, y, z, A, B, C, D, E, F, G,
H, I, J, K, L, M, N, P, Q, R,
S, T, U, V, W, X, Y, Z, , 1, <--- no digit gets added
3, 4, 5, 6, 7, 8, 9, a, b, c,
d, e, f, g, h, i, j, k, l, m,
...
Andreas, I really am curious about your "aversion to recursion" (couldn't resist the alliteration) but I don't think the discussion belongs here, so I opened a new thread on it. I would appreciate it if you checked it out in the new thread on recursion [webmasterworld.com]
<?
function keygen($key, $count=1) {
// array allows custom vals - for example can omit lowercase "l"
// as well as "o" and "O" to avoid confusion with numbers 1 and 0).
$key_vals = array("0", "1",
"2", "3", "4", "5", "6", "7", "8", "9", "a", "b", "c", "d", "e", "f", "g",
"h", "i", "j", "k", "l", "m", "n", "p", "q", "r", "s", "t", "u", "v", "w",
"x", "y", "z", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L",
"M", "N", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z");
$max_val = 500; // number of keys to generate
$length = strlen($key);
$AddDigit = TRUE;
for ($i=0; $i<$length; $i++) {
$index[$i] = array_search(substr($key, -($i+1), 1), $key_vals);
$position = $length-$i;
if ($index[$i] < (sizeof($key_vals) -1)) {
$new_key_val = $key_vals[($index[$i] + 1)];
$key = substr($key, 0, ($position-1)) . $new_key_val . substr($key, $position, ($length - $position));
$AddDigit = FALSE;
break;
} else {
echo "<br>length=$length - position = $position<br>";
$key = substr($key, 0, ($position-1)) . $key_vals[0] . substr($key, $position, ($length - $position));
}
}
if ($AddDigit) {
$key = $key_vals[1];
for($i=1; $i<=$length; $i++){
$key = $key . $key_vals[0];
}
}
$count++;
if ($count > $max_val) {
echo $key;
return;
} else {
echo " $key,";
if (($count % 10) == 0) {
echo "<br>";
}
return keygen($key, $count);
}
}
keygen("1");
?>