Forum Moderators: coopster & phranque

Message Too Old, No Replies

Web Directories - Storage of ASCII

And making it smaller.

         

brotherhood of LAN

9:09 am on May 3, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just finishing off some directory software I've been working on for ages, the one prob I have bones with is storing ASCII text like titles and descriptions of listings.

If it wasn't for them the DB size would be much smaller. So here is a plan: would appreciate comments, pitfalls, suggestions on it.

1) 99% of characters in a title or description are alphanumeric.
a) Convert letters into 5bits instead of 8bits to try save space. ie 32 possibilities, 26 letters, some spare bits for commas, full stops, flags etc
b) Use a "flag" i.e. 11111 to indicate a number and another 5 bits to indicate its length. i.e. A12 = 00001-11111-00010-011000
c) Use another "flag" i.e. 11110 to indicate a non-alphanumeric sequence of letters...

So roughly, a 100Mb text DB could save around 30% in space to about 70Mb by doing the above.

Would anyone advise/advise against this, or have a similar set up to reduce DB size? If I get the all clear, I'd carry out the above and use if-modified-since on all the directory categories to avoid stress on the system (just to avoid converting the binary into text every page view)

//added
I guess something like huffman is more space efficient, I probably should huff encode it instead but it's a hassle and using a fixed length set of bits will make it easier for now, and more or less has the same effect ;)

Fischerlaender

1:05 pm on May 3, 2003 (gmt 0)

10+ Year Member



I have thought of the same problem for a longer time but til now haven't written any code. My idea is similiar to yours; one thing I would test is to encode typical 2- or 3-character sequences in one byte (or in a 6bit-pseudo-byte).

If you use 6bit-charcter-encoding you have 64 possibilities. 36 of them are used for letters and numbers, several for commas, full stops and so on. This should leave about 20 possibilities for the most used sequences, which of course will depend on the language.

BTW: Experiments with gzip and bzip2 showed that both programms are making short strings even longer.

brotherhood of LAN

2:29 pm on May 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Have you got down to writing the code yet? ;)

I had a blast with upper and lower alphanumberics to take 62, but needed a little more space for full stops, commas, hyphens etc, might just leave the caps out or add another bit.

I have planned to only have URL's with lower-case letters, underscores and dashes, which fitted into 32.

function fivebit2bin(&$word)
{
$search = array("'a'","'b'","'c'","'d'","'e'","'f'","'g'","'h'","'i'","'j'","'k'","'l'","'m'","'n'",
"'o'","'p'","'q'","'r'","'s'","'t'","'u'","'v'","'w'","'x'","'y'","'z'","'-'","'_'","'/'");
$replace = array("00000","00001","00010","00011","00100","00101","00110","00111","01000","01001","01010","01011","01100","01101","01110","01111",
"10000","10001","10010","10011","10100","10101","10110","10111","11000","11001","11010","11011","11100");
$word = preg_replace($search,$replace,$word);
}

function fivebit2word(&$word)
{
$len = strlen($word);
$binary = $word;
$word = '';
for($i = 0;$i < $len;$i = $i + 5)
{
$search = array("'00000'","'00001'","'00010'","'00011'","'00100'","'00101'","'00110'","'00111'","'01000'","'01001'","'01010'",
"'01011'","'01100'","'01101'","'01110',"'01111'","'10000'","'10001'"
,"'10010'","'10011'","'10100'","'10101'","'10110'","'10111'","'11000'","'11001'","'11010'","'11011'","'11100'");
$replace = array("a","b","c","d","e","f","g","h","i","j","k","l","m","n",
"o","p","q","r","s","t","u","v","w","x","y","z","-","_","/");
$word .= preg_replace($search,$replace,substr($binary,$i,5));
}
}

It's handy for listing "related categories" that don't fit into the tree by saving space, and generally saving space when listing the cat name in the DB. I doubt storing "web based" URL's will be as simple.....

Might end up using 7 bits for the title and description.....there must be a "preferred" way of storing this sort of info for a dir.

Xuefer

6:42 pm on May 8, 2003 (gmt 0)

10+ Year Member



to avoid upper case:
"This is a, and b. This is c."
compress as: "this is a, and b. this is c."
and can be auto uncompress to origin format

"He is John."
compress as "he is <switchcase>j<switchcase>ohn."
which <switchcase> is a 5bit flag to switch upper/lower case

brotherhood of LAN

7:49 pm on May 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



good idea, i'll go with that.

Using 5 bits I have 4 slots left. Maybe making one of them "switch to upper/lower" would work, and having the other three for different punctuation (or maybe even to switch into different languages?)

Might be a good way to encode the body of a page too...having all the punctuation included.

Seems fast enough when trying it....wonder what this would be liker on a larger scale :)

grahamstewart

10:29 pm on May 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Do you need this text to be searchable? If not then why don't you just gzip it before you store it? (e.g. using gzencode() or gzcompress() in php).

brotherhood of LAN

10:56 pm on May 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hadn't planned to use the tables for search but thanks for the gz suggestions - something to read up on.

The whole search "thing" would be a completely seperate work, more of a word dictionary, with words matching cats bringing up the relevant cats.

Though I'd be working towards having a decent search function.....still pondering how to deal with more data if I were to spider and index the whole page, instead of a directory like annotation.

I suppose more encoding could be squeezed from the ordering of this type of info?

//added
just trying gzcompress just now
echo strlen(gzcompress("AAAAAAAAXXXXAAAAAAAA"));
now wondering if doing gz and the function in my last post might squeeze more ;)

Fischerlaender

6:40 pm on May 9, 2003 (gmt 0)

10+ Year Member



Have you got down to writing the code yet?

No, I'm sorry. It has a very low priority on my list - especially now as you are doing all the work. ;-)

grahamstewart:

Do you need this text to be searchable? If not then why don't you just gzip it before you store it?

gzip (and even worse bzip2) aren't doing well on compressing short strings. In fact, any string as short as about 200 characters isn't compressed, but will get even longer. (Example: gzipping a file containing the string "search engine" gives a "compressed" file which is 44 bytes long.)

Compressing means to use the redundancy in your data. For very short strings you have to take a detailed look and to use the redundancy which is typical for _your_ data.

grahamstewart

9:20 am on May 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



For short strings your best approach is probably a dictionary based compression.

Here you do a frequency analysis on your data. Then you represent each word with a series of bits, the more common the word, the fewer bits used.

This technique is called a Huffman Encoding.

Do a bit of a Google search and you should find loads about it. Heres one I dug up [groovyweb.uklinux.net] myself.

brotherhood of LAN

12:40 pm on May 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This technique is called a Huffman Encoding.

probably should huff encode it instead but it's a hassle and using a fixed length set of bits will make it easier for now

argh! ;) Nah, if I read Huffman once more I should be able to actually write code that fits :) I guess it's really the best way after all. That or LZW, they seem to be used in most cases of compression.

I doubt it'll get any easier to code when we make flags? There probably won't need to be any I guess since the Huffman will save the space in its own coding.

I just wonder when it comes to decoding the titles/snippets of directory listings, if it will be slow or fast enough. Any experiences or hindisights there? :)

RE: short strings
I suppose having more than one encoding dictionary would help? e.g. a few dozen different encoding dictionaries, with the best-case encoding being used each time sort of thing?

The mind boggles, just to save a bit here and there, I'm sure it will become more pronounced as the directory grows. Getting prepared now is important :)