Forum Moderators: phranque

Message Too Old, No Replies

mysql - SET column type

How to avoid redundant information when used for bilingual website

         

andreasfriedrich

7:12 pm on Sep 15, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Wow, after 142 posts this is my first very own thread ;)

Suppose there is an accomodation database which had a column to store the type of accomodation in. There are hotels, b&b´s, lodging houses, appartments, farm holidays, campsites etc. And there are Hotels, Frühstückspensionen, Pensionen, Bauernhöfe, Campingplätze. I.e. the type of accomodation needs to be stored in different languages. Neither type is mutually exclusive.

Now my question is this:

Do I do my own bit bashing and de- and encoding on storage or retrieval?

Do I use said SET column type which does the bit bashing for me at the price of having to store redundant information in my database.

I know I can do it both ways. But what are the implications when it comes to speed, maintainability, etc?

jatar_k

5:22 pm on Sep 17, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



andreas, I have looked at this and puzzled over it since you posted it. Driving a site with one multi language db is tough.

I think the way I would personally do it would be to have some kind of translation table and store only numeric values for each accomodation type.

I was thinking about accomodation types being numbered 1-10 and then depending on what lang the part of the site is in it could use the de_type table or the en_type table, for example. On selects you could maybe do joins with the appropriate lang table and maybe this would be the quickest.

It really is a tough situation. Maybe you can carry on from there and we can see what ideas other people might have.

andreasfriedrich

10:32 pm on Sep 17, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for thinking about this and taking the time to reply jatar_k.

I was thinking about accomodation types being numbered 1-10 and then depending on what lang the part of the site is in it could use the de_type table or the en_type table, for example. On selects you could maybe do joins with the appropriate lang table and maybe this would be the quickest.

This approach would work quite well if the relationship between company and accomodation type would be 1:1. With the two languages added to the picture this would result in a 1:n relationship.

main table 
id ¦ name ¦ type
----------------
01 ¦ wxyz ¦ 0001
02 ¦ abcd ¦ 0002

en_type table 
id ¦ name
----------
01 ¦ hotel
02 ¦ b&b

de_type table 
id ¦ name
----------
01 ¦ Hotel
02 ¦ Frühst

However, since the relationship between company and accomodation type is 1:n we are left with a n:n relationship. We would have to use a separate table.

main table 
id ¦ name
---------
01 ¦ wxyz
02 ¦ abcd

company to types 
id_comp ¦ id_type
-----------------
0000001 ¦ 0000001 // company 1 is hotel and b&b
0000001 ¦ 0000002
0000002 ¦ 0000001 // company 2 is just hotel

en_type table 
id ¦ name
----------
01 ¦ hotel
02 ¦ b&b

de_type table 
id ¦ name
----------
01 ¦ Hotel
02 ¦ Frühst

I´d consider that pretty messy just for the sake of storing a dozen names in two languages.

we can see what ideas other people might have.

I´d really appreciate it if somebody could comment on that and perhaps the implications on speed, maintainability, etc.

ergophobe

10:11 pm on Sep 18, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




I´d consider that pretty messy just for the sake of storing a dozen names in two languages.

Isn't that what a relational database is for? I'm so far from an expert that it isn't even funny, but I've done a some work on a multi-lingual shopping cart and some other sites that use a lot of such tables.

SPEED - In terms of speed, it benchmarks pretty well if your queries are well-structured (use the MySQL EXPLAIN and also turn on the MySQL logging logs while developing and look at the logs to see which queries are hurting you). The query on the company_to_type table should be super fast as long as you index the company col (no point in indexing the type col as there will be too much redundancy) and so should the query on the types tables since they are integer queries on the primary key, not to mention that they are small tables.

MAINTAINIBILITY with many tables - I think it's better than the alternatives. Lets say you want to track user ratings of given hotels. Users rate them on a scale of 1-10. You track number of users and total points and use that to calculate rating. Now you work on a local copy of the site and want to upload new hotel data. If the rating info is in the same table as the hotel info, you have to synchronize the local and live tables. If they are separate, you simply overwrite hotel info table and the rating info take care of itself.

One small thing, I would suggest the table names
types_en
types_de

rather than en_type, de_type, since I think it's usually easier if the names are such that related tables sort together during alphabetical sorts.

Also, depending on what you're trying to do, you don't necessarily need a table for each language (this is the OSCommerce way; not sure whether it's better or worse)

main table
id ¦ name
---------
01 ¦ wxyz
02 ¦ abcd

company to types
id_comp ¦ id_type
-----------------
0000001 ¦ 0000001 // company 1 is hotel and b&b
0000001 ¦ 0000002
0000002 ¦ 0000001 // company 2 is just hotel

hotel_type table
id ¦ type ¦ language ¦ name
----------
01 ¦ 0001 ¦ 00000001 ¦ hotel
02 ¦ 0001 ¦ 00000002 ¦ hotel
03 ¦ 0002 ¦ 00000001 ¦ B&B
04 ¦ 0002 ¦ 00000002 ¦ Fruhst

languages
id ¦ language
----------
01 ¦ English
02 ¦ Deutsch

OSCommerce does this to store things like product descriptions which are only used 1x/language and are frequently set/changed/updated by regular employees rather than admins.

For interface items that the user would not typically change, they often use files with constants and the DB type is a constant name. So in PHP/MySQL, you have a constant in the default language

HOTEL
B_AND_B

And then you have files for each page that get included based on the language. So you can have
include("hotel_details/" . $language . "/hotel_details_words.php/");

hotel_details_words.php is just a list of constants like
DEFINE(B_AND_B, "Frühst");

I think the DB approach that just does everything using the power of the relation DB is much easier overall.

Tom

andreasfriedrich

12:22 am on Sep 19, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for your interesting thougths Tom. I agree with you that the database approach is the easiest way.

Isn't that what a relational database is for?

I guess you are right. And if there were hundreds or thousands of entries in the types table I probably wouldn´t have batted an eye about the whole setup.

I like your point on maintainability with many tables. Again this seems to be the whole point of normalizing your database.

Implementation I choose

There is just a type column of type int in the main table. On retrieval it is passed to the following function:

function make_description($data, $array) { 
$s = array();
foreach($array as $key => $value) {
if($data & $key) {
array_push($s, sprintf("<li>%s</li>", $value));
}
}
return '<ul>' . implode('', $s) . '</ul>';
}

$data is the number contained in the db. $array is the array contained in $lang['typ']['de'], $lang['typ']['en'], etc. as seen below.

$lang['typ']['de'] = array(1<<0 => 'Hotel', 
1<<1 => 'Pension',
...
1<<6 => 'Campingplatz');

This way I just have to do one simple query and get nicely formatted html output very easily. While I do realize that usually (if there are a lot of data involved) it is a good way to let the RDBMS do as much work as possible, the situation at hand seems different and IMO warrants a different approach.

Andreas

ergophobe

2:06 am on Sep 19, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




Again this seems to be the whole point of normalizing your database.

Actually the example I gave is sort of denormalized in order to increase maintainability. If it were normalized, each record would have only information that is uniquely related to the primary key, and all information that is uniquely related to the primary key. My idea is actually to violate that in order to make the whole thing easier to synchronize.

Out of curiousity, how many classifications are you going to have? Will users have the ability to make new types? If only the programmers will be adding types and it's only a few (say 20 or so), you could just put them straight into an array with no DB call.

$type_array = array(
"en" => array("1" => "Hotel", "2" => "B and B", ...)
"de" => array ("1" => "Hotel", "2" => "Fruhst", ....)
);

$type = $type_array[$lang][$type_num];

Can't get much faster than that! I use the associative index "1", "2", etc to override the automatic indexing so that if someone deletes an array element, it should still work. I believe that PHP is smart enough to avoid conflict if you have integers or strings that evaluate to integers as associative indices.

Tom

andreasfriedrich

1:33 pm on Sep 19, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



When I was saying that this seems to be the whole point of normalizing your database I was referring to the following statement of yours:
If the rating info is in the same table as the hotel info, you have to synchronize the local and live tables. If they are separate, you simply overwrite hotel info table and the rating info take care of itself.

Your second proposal comes close to my chosen implementation. I´m using bit operations to store more than one type (hotel and appartments) in one number. This comes at the price that I have to loop through the types array and bitwise AND the type number with the keys.

From your example I am not sure if you were intending to do the same. Would you still use the company2types table to account for the fact that one accomodation could be of more than one type? Then you could in fact do a simple lookup in the type hash.

There will be less than 20 types and the users cannot add new types. I will be using the same approach for other types of classifications too: offered services, leisure activities, etc.

Andreas

ergophobe

5:58 pm on Sep 19, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




Would you still use the company2types table to account for the fact that one accomodation could be of more than one type? Then you could in fact do a simple lookup in the type hash.

That's basically what I was suggesting, yes. You still need a mechanism for efficiently tracking all types of accomdations at each place and that seems like the simplest solution.


I´m using bit operations to store more than one type (hotel and appartments) in one number. This comes at the price that I have to loop through the types array and bitwise AND the type number with the keys.

I wonder how much you really save by eliminating multiple queries and then parsing the integer bitwise. I'd be curious to benchmark it against something like...

--- settings.php ---

define('DEFAULT_LANG', 1);
$lang_array('en' => 1, 'de' => 2);

if ($HTTP_GET_VARS['lang'] && in_array($HTTP_GET_VARS['lang'], $lang_array) {
$lang = $HTTP_GET_VARS['lang'];
} else {
$lang = DEFAULT_LANG;
}

$type_array = array(
"en" => array("1" => "Hotel", "2" => "B and B", ...)
"de" => array ("1" => "Hotel", "2" => "Fruhst", ....)
);

--- show_page.php ----

include(settings.php);

$query = "SELECT type_index as ind
. FROM accom_type at, companies c, companies_2_types c2t
. WHERE c.id = $company_id AND c2t.company_id = c.id AND c2t.accom_id = at.id";

$result = mysql_query($query);

$accomodation_options = "";

while ($type = mysql_fetch_array($result)) {
$accomodations_options .= $type_array[$lang][$type['ind']];
}

---------------

When I was saying that this seems to be the whole point of normalizing your database I was referring to the following statement of yours:


If the rating info is in the same table as the hotel info, you have to synchronize the local and live tables. If they are separate, you simply overwrite hotel info table and the rating info take care of itself.

I understood that. My point was that in the case you are quoting, the tables are actually denormalized. Usually people think of denormalized tables as combining tables that should be split. The example that I gave in fact splits tables that should be combined in order to be normalized. Since the page rating relates only to the primary key and to nothing else, full normalization would ask that both tables be one, with page_id as the primary key. I have denormalized it in order to facilitate synchronizing. Rather than normalizing and then synchronizing using SQL, I'm suggesting denormalizing and then just overwriting the part of the database that is no based on user input, but only designer input.

Cheers,

Tom

[edited by: Marcia at 6:27 pm (utc) on Sep. 19, 2002]

andreasfriedrich

6:17 pm on Sep 19, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi Tom,

now I got your point on the normalization thing. And you are right of course.

I will benchmark your proposed code against mine and let you know what I came up with.

Thanks for your time

Andreas

ergophobe

8:08 pm on Sep 19, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That would be great if you get a chance to post the results - it would be nice to know. Some solutions may scale better than others, so it can't be a definitive answer, but it would still be nice to see what you find.

Cheers,

Tom

andreasfriedrich

9:57 pm on Sep 19, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Results

Each accomodation was processed once in random order. The figures show the average time for processing one accomodation.

multiple queries..: 0,00166389942169 seconds 
bitwise comparison: 0,00118190050125 seconds
--------------------------------------------
difference........: 0,00048199892044 seconds

Test Script

<?php 

require_once("shared/DatabaseConnection.inc.php");
require_once("shared/Functions.inc.php");

function getmicrotime($t) {
list($usec, $sec) = explode(" ",$t);
return ((float)$usec + (float)$sec);
}

function populate() {
$t = <<<END
CREATE TABLE companies (
id int(10) unsigned NOT NULL auto_increment,
name char(50) default NULL,
mytype int(10) unsigned NOT NULL default '0',
PRIMARY KEY (id)
) TYPE=MyISAM;

CREATE TABLE companies_2_types (
id_comp int(10) unsigned NOT NULL default '0',
id_type int(10) unsigned NOT NULL default '0',
KEY id_comp_index (id_comp)
) TYPE=MyISAM;

CREATE TABLE accom_type (
id int(10) unsigned NOT NULL auto_increment,
type int(10) unsigned NOT NULL default '0',
PRIMARY KEY (id),
KEY type_index (type)
) TYPE=MyISAM
END;

$x = array(
array(
'name' => 'Aaron',
'type' => array(1,5),
),
array(
'name' => 'Nick',
'type' => array(1,4,8),
),
array(
'name' => 'Zack',
'type' => array(2,3),
),
array(
'name' => 'Tay',
'type' => array(6),
),
array(
'name' => 'Isaac',
'type' => array(5,7),
),
array(
'name' => 'Jesse',
'type' => array(2,4,6),
),
array(
'name' => 'Matt',
'type' => array(1,3),
),
array(
'name' => 'Chris',
'type' => array(5,6),
),
array(
'name' => 'Leo',
'type' => array(3,5),
),
array(
'name' => 'Gil',
'type' => array(2,7),
),
);

foreach(explode(';', $t) as $sql) {
my_db_query($sql);
}
foreach($x as $key => $value) {
$mytype=0;
foreach($value['type'] as $type) {
$mytype += 1<<($type-1);
}
$sql = "INSERT INTO companies SET name='$value[name]', mytype=$mytype";
$id_comp = my_db_query($sql);
foreach($value['type'] as $type) {
$sql = "INSERT INTO companies_2_types SET id_comp=$id_comp, id_type=$type";
my_db_query($sql);
}
for($i=1;$i<9;$i++) {
my_db_query("INSERT INTO accom_type SET type=$i");
}
}
}

#populate();exit;

$lang = 'en';

$type_array = array(
"en" => array(
"1" => "Hotel",
"2" => "B and B",
"3" => "Campsite",
"4" => "Appartments",
"5" => "Country House",
"6" => "Pub",
"7" => "Boarding House",
"8" => "Shack",
)
);

$company_ids = range (1,10);
srand ((float)microtime()*1000000);
shuffle ($company_ids);

$start = microtime();

while (list (, $company_id) = each ($company_ids)) {
$query = "SELECT type as ind
FROM accom_type at, companies c, companies_2_types c2t
WHERE c.id = $company_id AND c2t.id_comp = c.id AND c2t.id_type = at.id";
$result = mysql_query($query);
$s = "";

while ($type = mysql_fetch_assoc($result)) {
$s .= $type_array['en'][$type['ind']];
}
}

$end = microtime();
$t1 = (getmicrotime($end) - getmicrotime($start))/10;

echo "multiple queries: $t1 seconds<br>";

$lang = array();
$lang['typ']['de'] = array(1<<0 => 'Hotel',
1<<1 => 'B and B',
1<<2 => 'Campsite',
1<<3 => 'Appartments',
1<<4 => 'Country House',
1<<5 => 'Pub',
1<<6 => 'Boarding House',
1<<7 => 'Shack');

reset($company_ids);

$start = microtime();

while (list (, $company_id) = each ($company_ids)) {
$sql = "SELECT mytype FROM companies WHERE id='$company_id'";
$result = mysql_query($sql);
$type = mysql_fetch_assoc($result);

$s = "";
# while (list ($key, $value) = each ($lang['typ']['de'])) {
foreach($lang['typ']['de'] as $key => $value) {
if((int)$type['mytype'] & $key) {
$s .= $value;
}
}
# reset($lang['typ']['de']);
}

$end = microtime();
$t2 = (getmicrotime($end) - getmicrotime($start))/10;
$td = $t1 - $t2;

echo "bitwise comparison: $t2 seconds<br><br>difference: $td seconds";
?>

Uncomment populate line to build tables and insert data.

ergophobe

10:15 pm on Sep 19, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Cool! As a percentage, that's quite a big difference. In absolute terms I guess it depends on how many times it's done each page and how many hits the site receives.

Thanks for posting!

Tom