

Fast Hashing of Short Strings


l008comm

9:15 am on May 30, 2017 (gmt 0)

10+ Year Member Top Contributors Of The Month



I know this is a well discussed topic, but I need a fast way to hash strings. Let me explain what I'm doing and why, first.

I'm processing the URL index from commoncrawl.org, which is a mostly JSON-based list of 2.9 billion URLs. I'm parsing those files and adding the URLs into my own database. Pretty simple process, with one catch: the overall size of this database. It's going to be about 225 GB, give or take. That's way too big; 1/10th of that size would easily be sufficient. Now the easiest way to trim it to 1/10th would be to pick a random number from 1 to 10, and if it's 1, add the URL to my database. But the problem is, every time I run the script I'd get different results; that is, it would be another random 10% of the total. There would be some overlap, but it's not a great solution.

So I got to thinking: if I took a hash or checksum of each URL, something that returned a purely integer result, I could take the last digit of that result and use it, instead of a random number, to decide what gets kept and what gets dumped.
And if I ever wanted to, say, double the size of the database, it would be super easy to modify just one 'if' clause to get it to accept all '1' and '2' hashes.
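A minimal sketch of that idea, assuming PHP's built-in crc32() (the function and variable names here are just illustrative):

```php
<?php
// Deterministic ~10% sample: keep a URL only when the last decimal
// digit of its CRC32 checksum is 0. The same URL always hashes to
// the same digit, so repeated runs select the same subset.
// (Assumes 64-bit PHP, where crc32() returns a non-negative int.)
function keep_url(string $url, int $buckets = 1): bool
{
    // crc32() returns an integer; take its last decimal digit (0-9).
    $digit = crc32($url) % 10;
    // $buckets = 1 keeps ~10%; bumping it to 2 keeps ~20%, etc.
    return $digit < $buckets;
}
```

Widening the sample later is just keep_url($url, 2); the original 10% stays included, since every digit-0 URL still passes.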

Thus, I need a hash, or a checksum, or something of the sorts. Most discussions on this topic always center around "all hashes are so fast, it doesn't matter". But given the fact that every time I run this script, I'll be calculating the hash about 3 billion times, it is worth it to me to seek out the fastest possible way to do it.

CRC32 was my first thought, because it's really more of a checksum than a hash, right? And its built-in PHP function actually returns the value as an integer, not a hex string. But I wanted to make sure I'm not overlooking some vastly simpler way of doing this. Like, could I convert each character in the string to its ASCII-code equivalent and just add them up? Or is that actually more complicated than what a CRC32 hash does?
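For what it's worth, the ASCII-sum idea is simpler but distributes much worse than CRC32. A quick sketch showing why: any two anagrams collide, because addition ignores character positions, while CRC32 mixes positions in.

```php
<?php
// Naive checksum: sum the byte values of the string.
function ascii_sum(string $s): int
{
    // unpack('C*', ...) yields each byte as an unsigned int.
    return array_sum(unpack('C*', $s));
}

// Reordering characters never changes the sum, so near-identical
// URLs collide constantly:
var_dump(ascii_sum('ab') === ascii_sum('ba'));   // true: both are 97 + 98
// CRC32 is position-sensitive, so the same pair differs:
var_dump(crc32('ab') === crc32('ba'));           // false
```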

Or perhaps there's a better way to do this entirely? But I like the idea that, doing it this way, I'll get the same results on a per-URL basis every time.

Peter_S

10:50 am on May 30, 2017 (gmt 0)

5+ Year Member Top Contributors Of The Month



If you want to compare the speed of the different hashing algorithms that ship with PHP, you can look at this comment:

[php.net...]

Now, keep in mind that hashing will not necessarily produce a uniform distribution, even if you keep only the last digit. So it will be "roughly" 10%, but it might be more or less. I assume that in your case this is not important.

Sometimes what I do is take the "middle" character of a string: $string[(int)(strlen($string) / 2)]. (Casting to int rounds it down, equivalent to floor but without the expense of calling a function.) But it might not be suitable for your need.
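For example, with a made-up string (note the cast truncates toward zero, i.e. rounds down for positive lengths):

```php
<?php
$string = "example";                            // strlen = 7
// 7 / 2 = 3.5, and (int) truncates that to 3, so this picks index 3:
$middle = $string[(int)(strlen($string) / 2)];
echo $middle;                                   // "m"
```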

l008comm

11:00 am on May 30, 2017 (gmt 0)

10+ Year Member Top Contributors Of The Month



Given how many items I'm passing through this test, it will be statistically CloseEnough™. But exactitude doesn't matter here anyway. My goal is simply to take about 10% of the total available data, with a distribution as random as anything else, but one that gives the same result every time. I'm not sure how your middle-character trick would work here. I suspect that, because characters in URLs don't occur in a truly random pattern, you'd get tendencies that are less random than the last digit of a CRC32. That's my assumption; I have no evidence to back it up.