Forum Moderators: coopster

Message Too Old, No Replies

How to generate shortest possible string

         

impact

1:36 pm on Dec 22, 2009 (gmt 0)

10+ Year Member



Hello,

I am making url shortening site. Can any one tell me how do I generate unique short strings?

My plan is to have mysql database, when a request is made, store the mother url in the database. Generate the most smallest string, store and link it to the mother url in the database.

Using .httaccess every url of the form http://www.example.com/sfklj will be redirected to http://www.example.com/link/?url=sfklj . Then I will use mysql search to fetch the corresponding mother url and redirect.

So now, any help from any one for a PHP function which can generate unique string?

Thank you,

[edited by: dreamcatcher at 5:21 pm (utc) on Dec. 22, 2009]
[edit reason] use example.com, thanks. [/edit]

CyBerAliEn

5:06 pm on Dec 22, 2009 (gmt 0)

10+ Year Member



First, you would need a DB table to track the short URLs you already have. You seem to have this part in place.

Now, you need to add an 'extra step' in the processing when a new URL is created such that a short URL is created and used. The easiest way to do this would be to create a function/method to generate the short URL [such as: makeShortURL()]. Then, each time you create a new URL/page, you call this function appropriately such as:

$thisShortURL = makeShortURL();

This function would need to:
(a) Generate a URL.
(b) Check if the URL is already in use.
(c) Try again if the new URL is not unique.
(d) While maintaining the shortest length possible.

This brings you to some considerations.

One Approach
You could create a function that tries "a", then "b", "c", "d" ... "0", "1" ... "aa", "ab" (etc)

Basically it starts with the alphabet and increments 1 letter, then continues to the next letter, then numbers... and when no more are left it jumps to using 2 characters instead of 1. When no more combination of 2 characters are possible, it jumps to 3 characters. Etc.

How I would go about this would be to create a function ~like the following:

function makeShortURL()
{
$shortURL = NULL;
$values = array('a','b','c'---etc---'7','8','9','0');
$lastEntry = /*run a query-etc to retrieve last entry in DB for short urls in array form*/;
$lastURL = $lastEntry['shorturl']; /*get the last used "short url" value*/
$lastChar = substr($lastURL,(strlen($lastURL)-2),1);
$lastIndex = array_search($lastChar,$values);
if (array_key_exists(($lastIndex+1),$values)) { $nextChar = $values[($lastIndex+1)]; }
else { $nextChar = "{$lastChar}a"; }
$shortURL = substr($lastURL,0,(strlen($lastURL)-1)).$nextChar;
return $shortURL;
}

Note: This function is not copy-paste code, nor tested. It merely demonstrates the methodology and process you would have to follow!

The benefit to this first approach is that there is essentially zero iteration or excessive DB querying since the short URL is sequential. And since the next URL is always the next sequential step, it will always be the next shortest possibility.

Another Approach
Determine whether you want the short URLs to be all alpha, all numeric, or a combo of both. Then decide how long you want them to be.

Mathematically... Suppose you didn't mind if ALL of your short URLs were 4 characters long, then:
Alpha = 456976 possibilities
Numeric = 10000 possibilites
Both = 1679616 possibilities

I would recommend either strict alpha (ie: abgd, ijek, okis, etc) or both (ad4f, 6gy8, a36k, etc). Given that you are unlikely to need or have more than ~450k possible URLs (or ~1.7 million), you are safe to conclude that you will always have enough URLs available using strictly 4 characters.

How do you do this? I won't sketch out a function for you, but your process would be this:
1) create a random 4 character short URL --- you could do those by creating an array of possible values, and using 'array_rand' to randomly parse together a 4 character URL
2) check the DB if this new URL is being used
3) if it is available, use and set it!
4) if it is not available, try again and repeat! go back to step 1 (program-matically, you would do this with a while loop)

The issues with this approach is this: Say you do all alpha with 4 characters, that gives you say ~450k choices. Say you know you will use ~100k (unlikely). That means you have a ~25% chance of randomly creating a URL that is already in use each time you run the loop. Well, if you match, you run the query again. Thus the issue is that the closer you get to your "maximum # possibilities" the more DB usage you will experience.

In a realistic sense, this is not an issue. Largely because, you would still have a decent chance of beating ~25% odds assuming these URLs are not created on a rapid/continuous basis. But if you run an average site where these short URLs reflect pages... you'll only have a few dozen to a few thousand of these short URLs, so you would be VERY safe with this method (generally low resource usage and more than enough possibilities to meet demand).

The other issue with this method is suppose you do need or use more than 450k short URLs (why would you?)... you would eventually hit a ceiling where no more possible URLs exist. Then you would need to increase the number of characters allowed (from 4 to 5/6/7/etc). But you would WANT to avoid this. If you know you will need "X" amount of URLs, make sure you set it up to be able to handle significantly more than "X" possibilities.

Hopefully this can get you on track. It is reasonably easy and I have provided you the "knowledge" to do what you want... you only need to code it now. :)

impact

5:22 pm on Dec 22, 2009 (gmt 0)

10+ Year Member



Wow, this is a great help.

Thank you very much. In mean time, I made some search and found this

<snip>

This is post says that "Now unlike Database servers: webservers are easy to scale so you can let them do a bit of converting to ease the life of your users, while keeping your database fast with numbers (MySQL really likes them plain numbers ; )."

Is it really that, when mysql have to search through entire database it would prefer only INT instate or INT and STRING combination or just STRING.

Thank you,

[edited by: dreamcatcher at 4:55 am (utc) on Dec. 25, 2009]
[edit reason] No Urls Please. See TOS. [/edit]

CyBerAliEn

5:49 pm on Dec 22, 2009 (gmt 0)

10+ Year Member



My understanding is NO (~).

Ultimately, every character (whether alpha, numeric, or symbol) is represented by a unique 8-bit combination --- a '1' or a 'a' is really no different to the computer. Now it is possible that the mySQL engine has underlying architecture that might make that ~fuzzy... but I would be willing to bet that it would take A LOT of usage/data to see a difference between "aaaa" and "1111" type formats. [and even then, I would bet the differences are negligible until you reach super scales]

I think the idea behind the URL you posted is that you can provide the "public" a simple short string like "akdi" and the server then turns this "combined" identifier into a integer identifier (ie: 33944404433 etc). Likely for the purpose that you can set the database ID as integer and run the query against it.

To me, it seems like an overly complicated method of trying to be efficient when in reality, there is unlikely to be any efficiency gains.

When you consider having thousands or millions of these "identifiers" in your DB, turning entries like "ackd" into "99449494933328229999992" takes up two to three times as much disk space. That's a negative in my mind (however small). And part of me wonders if the "efficiency savings" might just be drowned out by going from a ~4 character string to a ~16 character+ integer.

If I was doing it... I would supply publicly the 'abcd' short URL and have a table that stores the URL as that. When someone requests that URL, you query the DB for that string. Simple.

Assuming you match your server(s) configuration to match your demand, you should be fine. (ie: dedicated server, multiple servers, etc... all dependent on demand)

impact

4:33 am on Dec 25, 2009 (gmt 0)

10+ Year Member




$length = 4;
$random_id = get_rand_id($length);

// Open database connection
createDatabaseConnection();

// Check if the random number already exist
$getRandomID = "SELECT * FROM shortURL WHERE randomID = '$random_id'";
$getRandomID2 = mysql_query($getRandomID) or die (mysql_error());
$getRandomID3 = mysql_fetch_array($getRandomID2);

while ($row = $getRandomID3['randomID']){
$randomID = $row['randomID'];
if($randomID == $random_id){
$random_id = get_rand_id($length);
}
}

I am bit confused. Do you think this is the best way to perform this while loop to check if the randomly generated 4 digit code is already exist?

Any help will be appreciated. Thank you.

FourDegreez

6:17 pm on Dec 26, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I use this function (inside of a class called Util with static variable $ch):

public static function shortURLGen($len=5) {
  if (!isset(self::$ch))
    self::$ch = array('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0');
  for ($i = 0; $i < $len; ++$i)
    $str .= self::$ch[rand(0, 61)];
  return $str;
}

By using mixed-case letters, a 4 char alphanumeric string will allow for over 14 million combinations. I use recursion to find an unused short URL. Something like:

function getUnusedShortURL() {
  $shortURL = Util::shortURLGen();

  //call to db here to check if it is used
  if ( //it is used )
    return getUnusedShortURL();
  else
    return $shortURL;
}

In the pseudocode above what I'm doing is getting a short URL, checking the DB, if it's in use then the function calls itself again to repeat the process, otherwise it returns the short URL.

BTW, what I'm calling $shortURL is really just the random string, not the full URL (sorry if that was causing confusion).