How to manage a huge list of URL ?

Forum Moderators: open

Message Too Old, No Replies

How to manage a huge list of URL ?

ROLAND_F

10:40 pm on Feb 13, 2002 (gmt 0)

I've heard that hashing is the best technique to have quick access to data and that you have to write a function that take your data as input and return an index value. So your hash function can take for example an URL and return an ID that you can match more quickly in a database for example.
I'm looking for a hash function that give good result even if you have millions of url.

If you could help me to find this kind of hash it would be great.

Brett_Tabke

12:59 pm on Feb 14, 2002 (gmt 0)

Yes, some form of hash key is always going to be faster. Basically, what you do is "compress" whatever you are inputing into a "close to unique" id key. Then use that key to lookup in your db.

Depending on the data, sometimes it is as simple as adding the bytes together. Say you have urls to work with. Add the ascii values together and use as a hash. After you do a search for that hash, you compare the real deal with the db results and you find the exact matches.

Other times, you can do things like eor (exclusive or) each byte with the next.

There are all kinds of simple tricks like that you can use to generate a hash key. It doesn't have to be a perfect match, but it is far quicker to search with a one-??? byte key and get 100 matches and sort through them, than it is to try to compare a url to a million in a db at once.

For example:

Urls contain all letters of the lower english A-Z alphabet (26 chars)+ the dash char, the slash, period. That's 29 chars.

If we account for question marks, extra path slashes, and maybe the underline char used in filenames. That gives us a total of 32 characters to deal with.

256/4 = 32

Meaning, we can store 4 characters of a url in 1 byte. Or, we can generate a unique hash key for any domain name on the net, in 8 bytes. (63chars max/8=8+1 hash char for the tld)

So now instead of matching up to 64 chars, we can instead search for matches to only a maximum of 8 characters (less if the domain is shorter). Of course, it's a url, so we will have to deal with more characters in the end, but you get the idea - compress it into a hash value.

Further, if we generate a hash key index file out of the database urls, sorted by value, we can check for a match in very few comparisons.

The index file would be:

hash value : db id pointer

It would be arranged numerically by the hash value. Some db's take it the next step further and build index files into their index files.

So, we have a url. We compress it down into a hash. We compare it with the first key file. It says that hash values in that range should be in hash key file 5. We go to hash file five an look for a match to our hash value. If it matches, we then dig out the matches from the db and compare the raw urls to see if it's a perfect match.

At one time, I had it figured that if you have 1 million urls, you can determine if your url is in the db in 8 comparisons.

Tapolyai

1:47 pm on Feb 14, 2002 (gmt 0)

A long long time ago I worked out a simple database (for dictionary lookup). There might be an additional way to compress the top level domains since they are so specific. I have further worked on this by using RLE on the data elements then adding the branch. The only problem was the escape character in my research since all 256 characters might appear. In URLs this is not an issue since many characters are invalid which can be used as escape. Let me know if you implemented any of this.

This is how it works (note that this is a clip from a 1992 posting I made in comp.compression):

The
whole database is structured as a multi-branch tree. Each
node on each branch can contain the following : a
character, a record #, and a pointer to a list of branch
pointers. This compression/dictionary system relies on the
fact that all words repeat part of other words (like every
other compression does). Lets say our dictionary contains
the following words: CONTROL, CONTROLLER, DATA, DEAD, APPLE,
APARTMENT, CAT, CAT. The tree would look like this :

Root
/ �\
/ � \
A /C D\
P \ /A O A E\
A P\ T# N T A\
R L\ T A# D#
T E# R
M O
E L#
N L
T# E
R#
Since there are 26 possible starting characters the root
will have 26 nodes maximum. At the beginning, while building
the tree there will be a lot of new branches, but later on
all possibilities, that is standard for a dictionary, have
been exhausted, and no more new characters will be added.
The # refers to the record numbers, for example in a
thesaurus. The worst case tree would be enormous.. it
would equal to 26 to the power of the longest word length,
but statistics steps in and saves this tree. How often do
you have a word with X following a Q? or ZXCS? or
GHDDFSXLDFKIW? Not too often; not in my dictionary at
least.

You can further increase the compression if you forego the
record numbers for extra space. If you would have Say a
word DEADCAT ( not that there is ), instead of adding the
CAT again to the end of the word DEAD, we can have a pointer
pointing back to the beginning of the word CAT. These
pointers are called wines by C&W. Unfortunately I have ran
into a problem at this point that if the CAT is actually has
a continuation of ALOG, making it into CATALOG, it also
allows for the validation of DEADCATALOG, which might not be
a valid word in the dictionary.

You have to remember, this statistical method is only valid
if you can count on the fact that some combinations will
never appear, and that the data is repetitious. Also, the
compression increases as the size of the dictionary
increases. The more words you add, the better the
compression gets! I figured with a 60,000 word dictionary
you can compress a single character under 0.04 bits! Of
course you might still have the letters like Z, and X, where
compression would lack, but better than 8 bits.

Tapolyai

1:52 pm on Feb 14, 2002 (gmt 0)

Hmmm... even with <fixed> all the spaces were taken out from the tree... To see the original post look here:

[groups.google.com...]