Forum Moderators: coopster & phranque

Message Too Old, No Replies

ASCII Primary Key Generator - Automated Script

In look of one to create v.small primary keys

         

brotherhood of LAN

7:49 pm on Sep 10, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Pretty much what the title says :)

An ASCII character primary key, where the lowest number of characters are used for starters, then put together, then in 3's etc until all combinations run out......basically the smallest primary keys possible. It's to be used for "word id's" in a data dictionary.

The dictionary has already been sorted out into word frequency, so all i need is a script to do the necessaries with the ASCII primary key.

Can anyone offer some tips as to how to go about this? :)

Nick_W

8:00 pm on Sep 10, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



As you know, I'm a bit thick....er....I don't get it!

Why wont an AUTO_INCREMENT key work?

Nick

brotherhood of LAN

8:06 pm on Sep 10, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Because an auto_increment id is small, but not as small as an ASCII primary key? :)

Nick_W

8:20 pm on Sep 10, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, can't see that it would make that much difference... We'll have to wait to see what the others have to say...

Nick

ergophobe

9:55 pm on Sep 10, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



BOL,

What do you mean by small? Small for the database, or small when transferred over the net? If the latter, I guess you're right. If you mean the former, then you're wrong.

When you say ASCII, I assume that you mean "char".

0-9, a-z, A-Z = 1-62 = for 1 char
00-99, 0a-9a, 0A-0Z... = 1-62^2 = 3844 for 2 chars
000 - ZZZ gives you 62^3 = 238328 for 3 chars

That gives you 242234 for 1-3 chars.

In terms of transfer over the net, you have some advantage in that three bytes let you transfer a key as large as you could with six bytes using integers. Of course, optimizing *one kb* off of an image makes up for 333.3 primary keys in integers/vs ASCII. Eliminate indents from your pages and you will likely gain more unless you are making zillions of queries and they all send different primary keys over the net. Generally you should only have to send the key a few times (one or two per link).

In terms of DB storage requirements, you actually lose and are wasting space and slowing things down.

If you use col type char, that's 3 bytes / record for the primary key (or 2-4 bytes if you use varchar).

However, MEDINT also takes up three bytes and lets you store numbers up to 16777215. To make that easier to read

16777215 with medint
242234 with 3-col
65535 with smallint (2 bytes)

Both take up three bytes. You have a huge advantage with integers since each byte gets you roughly 256^(number of bytes) while each ASCII char only gets you rougly 62^(number of bytes).

Tom

brotherhood of LAN

10:16 pm on Sep 10, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I see, I stand corrected Nick :) It seemed so real when I posted it.....

This is the situation....

The keys will be a primary key for word id's, for a web directory structure, where the filenames/titles/keywords etc will be substituted with the above word id's.

Think of a huge category directory structure and substitute all unique words with numbers. Does the scale make a difference?

The thing is, I've been reading about Huffman coding and related stuff....its just to make a streamline yet large directory.

The maths you posted was something I just start to play around with on excel after I found out how to concatenate :) Not so sure this has its use now after what you said.....

I assume you meant an integer would be more wise in a mySQL type db? Would it be different if it was flatfile...ie the chars you mention 0-9 a-z A-Z...I just thought that each character took up an equal amount of space...and using integers meant that there would be less (smaller) variations to use.

andreasfriedrich

11:50 am on Sep 11, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The question of whether such an approach is advisable or not aside Iīm surprised that nobody gave any hints on how to generate such a key or posted any code. There surely must be quite a few members who have a degree in CS and who would certainly have done something like that at university. To me, a lawyer, this problem seems like a nice, tricky CS problem. I donīt own Donald E. Knuthīs Art of Computer Programming but wouldnīt be surprised if such an algorithm could be found there.

If I understand brotherhood_of_LAN correctly than what he wants is a script that produces something like A - B - AA - AB - BA - BB - AAA - AAB - BBA - BBB - BAA - BAB - ABA - ABB - AAAA - AAAB - BBBA - BBBB - BBAA - BBAB - BABA - BABB - BAAA - BAAB - ABBA - ABBB - ABAA - ABAB - AABA - AABB - AAAAA - AAAAB - BBBBA - BBBBB - BBBAA - BBBAB - BBABA - BBABB - BBAAA - BBAAB. I use such a function in my cms to generate the id attributes of certain elements used for DHTML.

Hereīs the code I use:

#!/usr/bin/perl 

use POSIX;
use strict;

my @abc = qw(A B);
for (my $j=0;$j<40;$j++){
print get_index($j), " - ";
}

sub get_index {
my $n = (shift)+1;
my $b = @abc+0;
my ($c, $i) = 0;
my $ind = '';
# How many slots do we need?
for ($i=1;$c<$n;$i++) {
$c += $b**$i;
}
$i--;
# Calculate slots from left to right
for ($i;$i>1;$i--) {
my $iv = $b**($i-1);
my $x = ($n-$iv)/$iv;
$x = ceil($x)-1;
$ind .= _get_slot($x);
}
# Calculate last slot
return join '', $ind, _get_slot($n-1);
}

sub _get_slot {
my $n = shift;
$n = 0 if $n < 0;
my $b = @abc;
return $abc[$n - ($b * (int($n / $b) + 1))];
}

Comments are welcome. If you have a more elegant solution, please share it with us.

ergophobe

1:52 am on Sep 12, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Oops. I got sidetracked by the size issue.

More elegant? I don't think so, but it is quite different. My solution
- uses PHP not Perl
- uses recursion not looping
- allows custom lists of "digits"

Recursion is the obvious solution for this problem (in fact, probably the only solution for a real Huffman tree) and would work really well in Scheme or something, but it I'm not sure PHP does recursion well. My quick check shows, however, that memory usage is low, so maybe it's not that bad.

By the way, There is a good section on Huffman encoding [mitpress.mit.edu] in Abelson and Sussman's SICP [mitpress.mit.edu] (which many people consider the best programming book ever written and which is now entirely available on the web). This is a book worth owning in print though.

[edited by: jatar_k at 9:04 pm (utc) on Sep. 13, 2002]
[edit reason] buggy version of script removed as per author. See below for correct version [/edit]

andreasfriedrich

1:12 pm on Sep 12, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for the link to the book by Abelson and Sussman. It seems to be an interesting read.

- allows custom lists of "digits"

Thatīs what the @abc array is for.

my @abc = qw(a r o n 1 4);
would use just those "digits".

- uses recursion not looping

Thatīs what I wanted to avoid.

When I ran your script, it printed the following code which does not seem to be quite right.


3, 4, 5, 6, 7, 8, 9, a, b,
c, d, e, f, g, h, i, j, k, l,
m, n, p, q, r, s, t, u, v, w,
x, y, z, A, B, C, D, E, F, G,
H, I, J, K, L, M, N, P, Q, R,
S, T, U, V, W, X, Y, Z, , 1, <--- no digit gets added
3, 4, 5, 6, 7, 8, 9, a, b, c,
d, e, f, g, h, i, j, k, l, m,
...

ergophobe

8:09 pm on Sep 13, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Damn! Sorry about that. It was working perfectly, but I pasted the wrong version in. The correct version follows.

Andreas, I really am curious about your "aversion to recursion" (couldn't resist the alliteration) but I don't think the discussion belongs here, so I opened a new thread on it. I would appreciate it if you checked it out in the new thread on recursion [webmasterworld.com]

<?

function keygen($key, $count=1) {

// array allows custom vals - for example can omit lowercase "l"
// as well as "o" and "O" to avoid confusion with numbers 1 and 0).
$key_vals = array("0", "1",
"2", "3", "4", "5", "6", "7", "8", "9", "a", "b", "c", "d", "e", "f", "g",
"h", "i", "j", "k", "l", "m", "n", "p", "q", "r", "s", "t", "u", "v", "w",
"x", "y", "z", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L",
"M", "N", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z");

$max_val = 500; // number of keys to generate

$length = strlen($key);
$AddDigit = TRUE;

for ($i=0; $i<$length; $i++) {
$index[$i] = array_search(substr($key, -($i+1), 1), $key_vals);
$position = $length-$i;

if ($index[$i] < (sizeof($key_vals) -1)) {
$new_key_val = $key_vals[($index[$i] + 1)];
$key = substr($key, 0, ($position-1)) . $new_key_val . substr($key, $position, ($length - $position));
$AddDigit = FALSE;
break;
} else {
echo "<br>length=$length - position = $position<br>";
$key = substr($key, 0, ($position-1)) . $key_vals[0] . substr($key, $position, ($length - $position));
}
}
if ($AddDigit) {
$key = $key_vals[1];
for($i=1; $i<=$length; $i++){
$key = $key . $key_vals[0];
}
}

$count++;
if ($count > $max_val) {
echo $key;
return;
} else {
echo " $key,";
if (($count % 10) == 0) {
echo "<br>";
}
return keygen($key, $count);
}
}

keygen("1");

?>

andreasfriedrich

10:40 pm on Sep 13, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That looks far better ;)

Nice and enlightening discussion btw.