I know this is a well-discussed topic, but I need a fast way to hash strings. Let me explain what I'm doing, and why, first.
I'm processing the URL index from commoncrawl.org, which is a mostly JSON-based list of about 2.9 billion URLs. I'm parsing those files and adding the URLs into my own database. It's a pretty simple process, with one catch: the overall size of the database. It's going to be about 225 GB, give or take, which is way too big; a tenth of that size would easily be sufficient. Now, the easiest way to trim it down to a tenth would be to pick a random number from 1 to 10 for each URL and only add it to my database if the number is 1. The problem is that every time I ran the script, I'd get different results, i.e. another random 10% of the total. There would be some overlap between runs, but it's not a great solution.
So I got to thinking: if I took a hash or checksum of each URL, something that returned a purely integer result, I could use the last digit of that result, instead of a random number, to decide what gets kept and what gets dumped.
And if I ever wanted to, say, double the size of the database, it would be super easy to modify just one 'if' clause so it accepts every URL whose hash ends in '1' or '2'.
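To make that concrete, here's a rough sketch of what I have in mind (keepUrl() is just a made-up helper name, and crc32() is only a placeholder for whatever hash turns out to be fastest):

```php
<?php
// Decide whether to keep a URL based on the last decimal digit of an
// integer hash. The same URL always produces the same digit, so every
// run keeps exactly the same subset.
function keepUrl(string $url, int $keepDigits = 1): bool
{
    // crc32() is a stand-in; any hash returning an integer would work.
    // (On 64-bit PHP builds crc32() always returns a non-negative integer.)
    $digit = crc32($url) % 10;      // last decimal digit of the checksum
    return $digit < $keepDigits;    // keep digits 0 .. ($keepDigits - 1)
}

// Keep roughly 10% of URLs:
if (keepUrl('http://example.com/some/page')) {
    // insert into the database
}

// Later, to roughly double the database, widen the threshold to two digits:
// keepUrl($url, 2);
```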
Thus, I need a hash, or a checksum, or something of the sort. Most discussions on this topic center around "all hashes are so fast, it doesn't matter", but given that every time I run this script I'll be calculating the hash about 3 billion times, it's worth it to me to seek out the fastest possible way to do it.
CRC32 was my first thought, because it's really more of a checksum than a hash, right? And PHP's built-in crc32() function actually returns the value as an integer, not a hex string. But I wanted to make sure I wasn't overlooking some vastly simpler way of doing this. Like, could I convert each character in the string to its ASCII code and just add them up? Or is that actually more complicated than what CRC32 does?
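Just to show what I'm comparing, here's a rough sketch of both (asciiSum() is a made-up helper, purely for illustration); my suspicion is that looping over every character in userland PHP is actually slower than a single call into the C-implemented crc32(), even though the math is simpler:

```php
<?php
// Naive alternative: sum the byte values of the string in userland PHP.
function asciiSum(string $url): int
{
    $sum = 0;
    $len = strlen($url);
    for ($i = 0; $i < $len; $i++) {
        $sum += ord($url[$i]);   // ASCII/byte value of each character
    }
    return $sum;
}

$url = 'http://example.com/some/page';

echo crc32($url) % 10, PHP_EOL;     // last digit of the CRC32 checksum (one C-level call)
echo asciiSum($url) % 10, PHP_EOL;  // last digit of the byte sum (a PHP loop per URL)
```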
Or perhaps there's a better way to do this entirely? But I like the idea that, doing it this way, I'll get the same result for each URL every time.