Forum Moderators: coopster
After looking for ways to convert words into smallish numbers in a quick way, crc32 caught my eye. Looking at the PHP manual, or a Google search for "crc32", I couldnt find much info or background on it, just diff. versions for different languages.
I'm wondering if crc32 is safe to use, I've heard about different words "colliding" into the same number, and issues like this, so I'm not sure to carry on using it and pay the price later.
Since I can't find a way to convert a crc32 checksum back into a word, I assume that these collisions do happen, or else there'd be an option to convert the number into a word :)
Any info,experiences or folklore about crc32 would be appreciated.
CRC is "Cyclic Redundancy Check" (or Code). It is not a compression method per se, but rather an error-checking method. For the simplest example of this class, you could take even parity checking, where the bits in a binary word are simply added together modulo-2, and the result - a 1 or a 0 - is appended to the data. The result of this is that the new longer data word is always even, and this can be verified. These codes are used to help verify that data is stored and retrieved, or sent and received, correctly - to varying degrees of certainty.
With parity, EDC, CRCn, LRC, and VRC, there is no way to reconstitute the original data, since - as you suspected - the chances of two documents resulting in the same check code range from moderate to high. The resulting code does not contain enough information to reconstitute the original data, only enough to verify to a high degree of certainty that it was retrieved successfully.
Check out ZIP, GZIP, and compression techniques of their type - They are more suited to your needs.
HTH,
Jim
>Check out ZIP, GZIP, and compression techniques of their type
Yes thanks, in the long run this is probably where I will end up. From your answer and what I've read about huffman/LZ, if compression is the way to go then this is what I will need.
The pages I'm reading/storing are being stored in a mysql database (probably not practical for what I need to do), but in the meantime it's handy if the rows are fixed length numbers to do some comparisons of word densities across a page/domain sort of thing.
Below is a table of a part of a spiderd page. There are 3 columns, one for pageID, one to indicate a new paragraph and one for the crc32 checksum.
pageid positionid wordid
1 NULL -1242291485
1 -2017851279
1 -1742391944
1 1657866020
1 143088812
1 NULL 1011183078
1 1282654113
1 1182287066
1 -522636464
After indexing page by page, word frequency, phrases, sentence lengths etc etc can be measured, and fairly quickly too since they are all INT or NULL fields.
The beauty of it just now is when I'm trundling through a queue of pages to be stripped and indexed, no external reference is needed to make a wordid (by using crc32). This makes the whole process quite quick.
When all the word frequencies etc are done, I'd hope to use some of the compression techniques you mentioned, for instance ordering the wordid checksums and delta encoding them (with pointers to relevant docs).
Perhaps there is another way to turn words into INT's?
Thanks for the answer JD.
My background is in hardware and I've implemented many of these error-checking codes in logic circuits, but I'm not very strong in scripting languages; others may be better able to answer your question.
Jim
alongside the suggestions of hash functions (md5 or something like that?) someone has suggested using the ASCII character codes as a way to generate a unique number.
I've thought about doing it this way. Using a-z as a 5 bit code and storing sets of 6 characters as a 4 byte INT. Any words longer than six chars looks up a seperate table.
The plan is to keep these words as numbers for a dictionary lookup, comparing of words, phrase creation etc within a large set of data, so using numbers instead of ASCII is really the goal. Seems there are a few ways to go about doing this, although a 4 byte INT is longer than the word "and" hopefully in a larger table the INT is faster to lookup.
If only the alphabet was in base 10 this would be alot easier I guess :)