If it wasn't for them the DB size would be much smaller. So here is a plan: would appreciate comments, pitfalls, suggestions on it.
1) 99% of characters in a title or description are alphanumeric.
a) Convert letters into 5bits instead of 8bits to try save space. ie 32 possibilities, 26 letters, some spare bits for commas, full stops, flags etc
b) Use a "flag" i.e. 11111 to indicate a number and another 5 bits to indicate its length. i.e. A12 = 00001-11111-00010-011000
c) Use another "flag" i.e. 11110 to indicate a non-alphanumeric sequence of letters...
So roughly, a 100Mb text DB could save around 30% in space to about 70Mb by doing the above.
Would anyone advise/advise against this, or have a similar set up to reduce DB size? If I get the all clear, I'd carry out the above and use if-modified-since on all the directory categories to avoid stress on the system (just to avoid converting the binary into text every page view)
//added
I guess something like huffman is more space efficient, I probably should huff encode it instead but it's a hassle and using a fixed length set of bits will make it easier for now, and more or less has the same effect ;)
If you use 6bit-charcter-encoding you have 64 possibilities. 36 of them are used for letters and numbers, several for commas, full stops and so on. This should leave about 20 possibilities for the most used sequences, which of course will depend on the language.
BTW: Experiments with gzip and bzip2 showed that both programms are making short strings even longer.
I had a blast with upper and lower alphanumberics to take 62, but needed a little more space for full stops, commas, hyphens etc, might just leave the caps out or add another bit.
I have planned to only have URL's with lower-case letters, underscores and dashes, which fitted into 32.
function fivebit2bin(&$word)
{
$search = array("'a'","'b'","'c'","'d'","'e'","'f'","'g'","'h'","'i'","'j'","'k'","'l'","'m'","'n'",
"'o'","'p'","'q'","'r'","'s'","'t'","'u'","'v'","'w'","'x'","'y'","'z'","'-'","'_'","'/'");
$replace = array("00000","00001","00010","00011","00100","00101","00110","00111","01000","01001","01010","01011","01100","01101","01110","01111",
"10000","10001","10010","10011","10100","10101","10110","10111","11000","11001","11010","11011","11100");
$word = preg_replace($search,$replace,$word);
}function fivebit2word(&$word)
{
$len = strlen($word);
$binary = $word;
$word = '';
for($i = 0;$i < $len;$i = $i + 5)
{
$search = array("'00000'","'00001'","'00010'","'00011'","'00100'","'00101'","'00110'","'00111'","'01000'","'01001'","'01010'",
"'01011'","'01100'","'01101'","'01110',"'01111'","'10000'","'10001'"
,"'10010'","'10011'","'10100'","'10101'","'10110'","'10111'","'11000'","'11001'","'11010'","'11011'","'11100'");
$replace = array("a","b","c","d","e","f","g","h","i","j","k","l","m","n",
"o","p","q","r","s","t","u","v","w","x","y","z","-","_","/");
$word .= preg_replace($search,$replace,substr($binary,$i,5));
}
}
It's handy for listing "related categories" that don't fit into the tree by saving space, and generally saving space when listing the cat name in the DB. I doubt storing "web based" URL's will be as simple.....
Might end up using 7 bits for the title and description.....there must be a "preferred" way of storing this sort of info for a dir.
Using 5 bits I have 4 slots left. Maybe making one of them "switch to upper/lower" would work, and having the other three for different punctuation (or maybe even to switch into different languages?)
Might be a good way to encode the body of a page too...having all the punctuation included.
Seems fast enough when trying it....wonder what this would be liker on a larger scale :)
The whole search "thing" would be a completely seperate work, more of a word dictionary, with words matching cats bringing up the relevant cats.
Though I'd be working towards having a decent search function.....still pondering how to deal with more data if I were to spider and index the whole page, instead of a directory like annotation.
I suppose more encoding could be squeezed from the ordering of this type of info?
//added
just trying gzcompress just now
echo strlen(gzcompress("AAAAAAAAXXXXAAAAAAAA"));
now wondering if doing gz and the function in my last post might squeeze more ;)
Have you got down to writing the code yet?
grahamstewart:
Do you need this text to be searchable? If not then why don't you just gzip it before you store it?
Compressing means to use the redundancy in your data. For very short strings you have to take a detailed look and to use the redundancy which is typical for _your_ data.
Here you do a frequency analysis on your data. Then you represent each word with a series of bits, the more common the word, the fewer bits used.
This technique is called a Huffman Encoding.
Do a bit of a Google search and you should find loads about it. Heres one I dug up [groovyweb.uklinux.net] myself.
This technique is called a Huffman Encoding.
probably should huff encode it instead but it's a hassle and using a fixed length set of bits will make it easier for now
argh! ;) Nah, if I read Huffman once more I should be able to actually write code that fits :) I guess it's really the best way after all. That or LZW, they seem to be used in most cases of compression.
I doubt it'll get any easier to code when we make flags? There probably won't need to be any I guess since the Huffman will save the space in its own coding.
I just wonder when it comes to decoding the titles/snippets of directory listings, if it will be slow or fast enough. Any experiences or hindisights there? :)
RE: short strings
I suppose having more than one encoding dictionary would help? e.g. a few dozen different encoding dictionaries, with the best-case encoding being used each time sort of thing?
The mind boggles, just to save a bit here and there, I'm sure it will become more pronounced as the directory grows. Getting prepared now is important :)