Forum Moderators: coopster


Translating a PHP app - best method?

         

httpwebwitch

3:21 am on May 4, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've suddenly been charged with the task of architecting I18N for a large web app. The site exists in English, and must be localized into Spanish, French, German, and Russian. And that's just the start... next year they want to add Dutch, Hindi, and Portuguese.
Thankfully they've left Chinese, Japanese, Arabic and Hebrew for last.

I've been involved with many bilingual web projects, and there are so many ways to do it...

Over in .NET world there's a more or less official according-to-hoyle I18N method they use. Not that everyone uses it mind you. Is there a similar "best method" for a site built in PHP?

Bear in mind that this app uses many, many thousands of phrases. Just the translation work will take many weeks, but that's being done by an agency. I (with helpers) have the enviable task of compiling the phrase list for them.

some requirements and preferences:
- every phrase must be translatable
- some kind of syntax for pluralizing and variants, eg. "your cart contains {%d} {plural:[item|items]}" or "{gender:[He|She]} has not logged in".
- very fast phrase retrieval and page rendering
- responsible memory usage
- phrases stored primarily in a MySQL database, but with appropriate caching or whatever
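For the pluralizing/variant syntax, I'm imagining something like this (untested sketch; `format_phrase` is a name I just made up):

```php
<?php
// untested sketch -- format_phrase() is an invented name
function format_phrase($phrase, $count = 0, $gender = 'm')
{
    // substitute the count for "{%d}"
    $phrase = str_replace('{%d}', (string) $count, $phrase);

    // "{plural:[singular|plural]}" -- pick a form by count
    $phrase = preg_replace_callback('/\{plural:\[([^|\]]*)\|([^\]]*)\]\}/',
        function ($m) use ($count) { return $count == 1 ? $m[1] : $m[2]; },
        $phrase);

    // "{gender:[masculine|feminine]}"
    $phrase = preg_replace_callback('/\{gender:\[([^|\]]*)\|([^\]]*)\]\}/',
        function ($m) use ($gender) { return $gender == 'f' ? $m[2] : $m[1]; },
        $phrase);

    return $phrase;
}

echo format_phrase('your cart contains {%d} {plural:[item|items]}', 3);
// your cart contains 3 items
```

One caveat I already see: a two-form singular/plural picker won't cover Russian, which has three plural forms, so the syntax will probably need to grow.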

I'm *just* getting started on this. It was assigned just hours ago.

Who's got advice for me?

httpwebwitch

3:32 am on May 4, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



one blog I read claims that SQLite is the fastest way to retrieve translated strings, faster than using INI files, XML, arrays, TXT, MySQL, etc.

eelixduppy

1:26 pm on May 4, 2010 (gmt 0)



A project I worked on last summer utilized Smarty to prepare for text translations; it's what the client asked for. As far as programming was concerned, it made things really simple. I just surrounded all text in the template with {t} tags and wrote a custom function that would look up the translation.

If you take a look on the web for Smarty translation techniques, there are some more complex ones that I haven't really looked into.

In any case, it's an option if you can afford the extra overhead that Smarty adds.

httpwebwitch

3:41 pm on May 4, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Smarty is used for a few small things peripheral to the site, but primarily it's marked up with plain old HTML in *.php files.

eelixduppy

7:52 pm on May 4, 2010 (gmt 0)



>> phrases stored primarily in a MySQL database

And those that aren't? How do you plan on handling translation? IMO it has to be one way or the other, not both. Either you create some function in PHP that grabs the translation from the database, or you serve up a different page of "static" content for each language. Depending on what is being translated and how much, the latter option has obvious performance advantages.

jatar_k

1:18 pm on May 5, 2010 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



I guess you could store and manage it all in a db and then gen files with language arrays from the db

oh no, not thousands of phrases, hehe, sorry not helpful

a file with an associative array of phrases is pretty common: one file per lang, preloaded into memory for speed. The catch is that after you change the db and generate new files, you'd then need a restart to pick them up.

Files can be broken down per section too so that the arrays aren't as huge.
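the array-file approach would look roughly like this (untested, names invented; in real life each lang file would just `return` its array and you'd `require` it, but it's inlined here for illustration):

```php
<?php
// untested sketch -- each language file would normally just `return`
// an array like this; inlined here so the example is self-contained
$phrases_es = array(
    'welcome_to_home_page' => 'Bienvenido a la página principal',
    'your_cart'            => 'Su carrito',
);

function t(array $phrases, $key)
{
    // fall back to the key itself so missing translations stand out
    return isset($phrases[$key]) ? $phrases[$key] : $key;
}

echo t($phrases_es, 'welcome_to_home_page');
// Bienvenido a la página principal
```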

httpwebwitch

1:40 pm on May 5, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've tried out a couple of prefab packages now. Most are utter crap, one of them is not bad, but has some annoying deficiencies... I think (as usual) if I want this done elegantly to my taste, I'll have to build something from scratch.

Here's an SEO question:
Should I force languages onto their own URLs? I'm thinking of using subdomains, à la "example.com" (English), "es.example.com" (Spanish), "fr.example.com" (French). That seems easier than registering country TLDs like "example.fr". The solution I'm testing right now merely stores the language in a session.

I don't have the translations yet. So while testing I'm translating the site into "pirate". Arrr.

CyBerAliEn

4:15 am on May 6, 2010 (gmt 0)

10+ Year Member



Should I force languages to be on their own URL? I'm thinking of using subdomains, ala "example.com" (english), "es.example.com" (spanish), "fr.example.com" (french). That seems like it would be easier than registering country tld's like "example.fr". The solution I'm testing right now merely stores the language in a session.


Though making the site "example.fr" would make it obvious it is French (to computers and humans), you really shouldn't experience any huge issues using "fr.example.com", as far as I know or can recall. The most important things are that (a) your content be in French, and (b), very helpful: your URLs be in French.

TheMadScientist

8:07 am on May 6, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The most important part would be that (a) your content be in French, and ...

(b) Google Changes Their Advice:
Use top-level domains: To help us serve the most appropriate version of a document, use top-level domains whenever possible to handle country-specific content. We're more likely to know that http://www.example.de contains Germany-focused content, for instance, than http://www.example.com/de or http://de.example.com.

[google.com...]

Personally, I think enough people use sub-domains and since you can target in WMT if you set them up as a separate site, you should probably be okay, but it was such a great lead-in I couldn't resist... :)

httpwebwitch

2:09 pm on May 6, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't want to sound pedantic, but my content isn't being localized for geography; it's being translated for linguistics only. So, es.example.com is for Spanish-speaking people worldwide, whether they're in Barcelona, Miami, Lima or Tokyo. To me, it seems like TLDs are best suited to geographic localism, where proximity matters and language is a side-effect of location.

For instance if I were to inhabit example.ca, it may still be bilingual in English and French. But my English content is just as relevant whether it's a com or a ca or a co.uk

So, I agree with Google's advice, but it doesn't apply in my situation, because I'm not offering country-specific content; nothing I have to offer is more relevant in one country or another. I'm just doing this for people who don't read&write in English.

Now, I'm at a point where I have the subdomains set up, and I'm writing a class that will use sessions, cookies, and headers to figure out if a first-time or repeat visitor should be sent to another language. I haven't figured out all the logic yet.
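The first-visit pick will probably look something like this (rough sketch, all names invented; it ignores q-value ordering in the Accept-Language header for simplicity):

```php
<?php
// rough sketch of the first-visit language pick: cookie first, then
// the Accept-Language header, then the site default (names invented;
// q-value ordering is ignored for simplicity)
function detect_language(array $supported, $default = 'en')
{
    if (isset($_COOKIE['lang']) && in_array($_COOKIE['lang'], $supported, true)) {
        return $_COOKIE['lang'];
    }
    $header = isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])
        ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : '';
    foreach (explode(',', $header) as $part) {
        // "es-MX;q=0.8" -> "es"
        $code = strtolower(substr(trim($part), 0, 2));
        if (in_array($code, $supported, true)) {
            return $code;
        }
    }
    return $default;
}
// a matched visitor would then get a 302 to es.example.com, etc.
```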

All the subdomains (es., fr., de., etc) are pointed to the same public_html folder. This - I know - could be a duplicate content disaster if not executed properly...

I'm not under any illusions that this is going to be easy, or quick. I18N of a big web site is a huge, scary task. I haven't even begun investigating currency, date and number formatting preferences.

TheMadScientist

3:37 pm on May 6, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don't want to sound pedantic

Then the first sentence would probably have been sufficient.

it was such a great lead-in I couldn't resist

I'll try to refrain from posting what I think is humorous. You should be doubly fine if you're not using them for geo targeting. I didn't really read through the thread that much. I just thought it was a funny different (b).

jatar_k

12:52 pm on May 7, 2010 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



<modnote>let's keep jibes to a minimum and concern ourselves with the topic at hand</modnote>

if you are specifically targeting languages, not geography, which version of french/english are you going with? I assume canadian for both but who knows, maybe that's not your market.

which pushes the thought: if en.example.com is Canadian English, then www.example.co.uk would be UK English; and for French, .fr would be "français de la France" and fr. would be French Canadian?

a hybrid approach can be useful depending on exactly how many things you need to target.

>> I've tried out a couple of prefab packages now. Most are utter crap,

absolutely, they program everything for everyone and usually end up doing everything poorly instead of one thing well

timster

1:38 pm on May 7, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've tried out a couple of prefab packages now. Most are utter crap,


(I'm a big fan of PHP but...)

If the language you're using doesn't have the tools you need, maybe it's time to port the app to one that does. A big i18n project is an opportunity for that. You seem to like the .NET i18n stuff. That's not the only choice.

httpwebwitch

5:38 pm on May 7, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



@jatar_k, the existing copy uses US english spellings. But since I'm setting up the framework for I18N, offering UK/Canadian spelling is feasible. But... not high priority :)

I'm using the standard ISO 639-1 list of language codes. I think the rest of the web does the same... here's the resource: [loc.gov...]

httpwebwitch

5:13 am on May 9, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Jim Plush, software architect, says this in his refreshingly unstyled blog:

So what is the best way to externalize your strings? I did some initial performance testing and this is what I've come up with so far:

1. INI FILE - store all your strings in a .ini file such as welcomeMsg = "Hi There" which can be parsed with parse_ini_file()

2. Native PHP Arrays - store your strings in a .php file using $msg['welcomeMsg'] = "Hi There"; which can be just included as a native php array

3. XML Format - store your strings as <string code="welcomeMsg">Hi There</string> and parsed using php5's simpleXML

4. SQL LITE - storing your strings in a SQL lite DB per language and having a function wrapper for queries

the winner?
well in a test of 50,000, 10,000 and 1,000 string files #4 SQL LITE was FIVE times as fast as SimpleXML (2nd place) and something like 20 Times faster than parsing INI files.
source [litfuel.net]

He provides test cases, and actual benchmark measurements.

So based on his advice, I'm building this translation using SQLite, with one table per language. The tables are named "text_en", "text_es", "text_fr", etc.

Working with SQLite isn't fun. A big part of today was spent trying to find a simple way to import/export data from an SQLite db, and so far I have not found one. I fear having to cut and paste phrases into a <textarea> by hand just so I can INSERT them into the db :(

The keys are varchar phrases, usually abbreviated versions of the english phrase, like "welcome_to_home_page". PRIMARY KEY index on the key, NOT NULL.

Next task is to create an optimized retrieval function that gets the translated text out of SQLite as efficiently as possible.

usage example:
<?php print(translate("welcome_to_home_page")); ?>
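A first cut might look like this (untested; the column names are guesses at my own schema, and the optional $pdo/$lang arguments are only there so the function can be exercised against a throwaway db):

```php
<?php
// untested first cut -- column names are guesses at my own schema;
// the optional $pdo/$lang arguments exist only to make this testable
function translate($key, PDO $pdo = null, $lang = 'en')
{
    static $default = null;
    if ($pdo === null) {
        if ($default === null) {
            $default = new PDO('sqlite:' . __DIR__ . '/lang/translations.sqlite');
        }
        $pdo = $default;
    }
    // one table per language: text_en, text_es, text_fr ...
    $stmt = $pdo->prepare("SELECT phrase FROM text_$lang WHERE \"key\" = ?");
    $stmt->execute(array($key));
    $row = $stmt->fetchColumn();
    // fall back to the key itself so untranslated phrases are visible
    return $row === false ? $key : $row;
}
```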

jatar_k

1:01 pm on May 11, 2010 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



just want to add that everyone needs to evaluate their particular environment; in my case I needed to move the translation out of the db to files because the db was too slow.

I am a little surprised at the differences he found between XML and file; my own tests in the past, again in the particular environments I was in, showed XML being slow.

good link, thanks

httpwebwitch

1:13 pm on May 11, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A recent I18N project I was involved with (in .NET) used a *massive* XML file as permanent storage for the translation keys. However it was parsed into an in-memory "dictionary" by the DLL, so its retrieval was incredibly fast since there was no disk i/o involved once it was all up & running.

Opening a connection to a regular InnoDB MySQL database to retrieve phrases, even with a primary key index, would be too slow for a high-load environment or a very large lexicon. Perhaps imperceptible under light loads or with a small index, but I can imagine it would get clumsy as it grows.

My tests with SQLite have been OK so far, but I haven't got 10,000 items in there yet, so we'll see. I'm concerned that this SQLite db I'm building is being loaded with each page hit. Plush's tests showed that it's the fastest for a single page view, but is it ideal for a heavy-traffic site?

A liberal application of memcached would alleviate a lot of those concerns... that's another ingredient that makes this project more complex than your typical PHP scripting job.
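Roughly what I have in mind (sketch only; the key scheme and TTL are arbitrary choices, and $mc is anything Memcached-shaped with get()/set(), so I left the type hint off):

```php
<?php
// sketch only -- fronts the per-phrase lookup with memcached so hot
// phrases skip the db; key scheme and TTL are arbitrary choices.
// $mc is any Memcached-like object with get()/set() (no type hint so
// a stub works too); get() must return false on a miss, like Memcached
function translate_cached($mc, $lang, $key, $lookup)
{
    $cache_key = "i18n:$lang:$key";
    $hit = $mc->get($cache_key);
    if ($hit !== false) {
        return $hit;
    }
    $phrase = call_user_func($lookup, $lang, $key);  // e.g. the SQLite lookup
    $mc->set($cache_key, $phrase, 3600);             // cache for an hour
    return $phrase;
}
```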

Matthew1980

1:39 pm on May 11, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi there httpwebwitch,

I have just read the link you posted from Jim Plush and found it really informative. I had never really looked at data retrieval times, as I didn't think a few µs here and there would make much difference.

But since reading this guy's research you can see the difference in speed between methods. I always thought I was safe using arrays for string storage, or even a file full of define('Welcome_text', "Welcome to my site"); type declarations. But you can see the difference in speeds, and when you take into account the potential page views per user, this has quite an impact on how quickly a site loads. I mean, how often have we all sat there complaining while a page loads, when it could come down to nothing more than the way the website's text is retrieved?

Yes, in the past I have ignored this advice about speeding up the delivery of a script's contents, but after reading this it's something I'll consider!

So thanks for posting a really informative link.

A note on DBs though: I have not used InnoDB or SQLite; I just stick with MySQL. And I haven't tried using XML yet, as I don't really know the intricacies of usage. One for the to-do list, methinks.

Cheers,
MRb

coopster

1:57 pm on May 11, 2010 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



just want to add that everyone needs to evaluate their particular environment


I am a little surprised at the differences he found between xml and file


Exactly and exactly. The read is 5 years old now. Nothing against the research or the author here, mind you, but much has changed.

Opening a connection to a regular INNOdb SQL database to retrieve phrases, even with a primary key index, would be too slow for a high-load environment or very large lexicon. Perhaps imperceptible under light loads or with a small index, but I can imagine it would get clumsy as it grows.


I know a few folks that would certainly disagree. I believe the reason the author used sqlite is the nature of the application. sqlite ITSELF is the RDBMS, not an API to a separate process. But this is why I agree with jatar_k ... evaluate your own environment.

jatar_k

2:59 pm on May 11, 2010 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



another note :)

>> But you can see the difference in speeds, and when you then take into account the potential load of page views per user, this has quite an impact on how quickly a site will load

but those speeds don't scale as is, there are too many other factors involved

when the db times start scaling upward, it is possible to have the file access times be isolated from that increase.

memcache, which is a very common implementation for lang files, also makes the tests on the other site irrelevant. Then testing memcached xml against arrays/ini style would be interesting. Memory footprint of the 2 formats is also something worth considering.

as coop mentioned, your SQLite will probably remain fairly fast due to its nature.

coopster

3:23 pm on May 11, 2010 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



I use MySQL -- utf8 and language identifiers in the table structures. Blazing fast. If you want it faster or less DB-intensive you can cache the pages, as mentioned earlier. I am not a fan of Smarty, no offense to the developers or anybody here who uses it. It's just that it is what I call a "layered language": Smarty uses PHP to make pages ... in PHP. Icky. My history with tools that develop software within the core software goes way back, and it has always left a bad taste. For caching, check out XCache [xcache.lighttpd.net]

httpwebwitch

4:56 pm on Jun 7, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



update on this:

I've implemented the i18n translation using an SQLite database. It seems pretty fast to me. I tried several open-source i18n classes and solutions, but they all had various deficiencies... so (sigh) I built my own. I'm getting started putting Spanish and French phrases in the db and tucking placeholders into the markup.

Unlike Jim Plush's example above, I've got all the phrases in one database, with a table per language. That was a sloppy move... instead I should have created one database per language, with a single table in each. I will switch that over before the project is launched.

Some details that go beyond simple PHP string lookups:

- I have English phrases hard-coded in JS scripts. I know two strategies for dealing with this: one is to create a JS script that defines a globally scoped array of phrases, which is loaded on the page and accessed from the static JS asset. The other is to parse the JS files with PHP. I've chosen the latter. I'll be sure to employ a little URL rewriting so the Spanish version of a JS file is cached properly as scriptname-es.js, so people who switch languages aren't served cached assets containing foreign strings.

- oh, and all those images with text in them. There aren't that many, but I do have a pile of graphical buttons. Each of these (e.g. "button-name.png") is being re-rendered as "button-name-en.png", "button-name-es.png", "button-name-fr.png", and the page script chooses the right one based on the current language.

- knowing phrase context in the key name. There are times when you need to know the context of a phrase before you can apply a translation. For instance, the link that says "home" goes to the home page, but the category called "home" that contains household products... they have different words in other languages. So it's insufficient to have a key in the database named "home" and use the same translated string in both situations. Similarly, a category named ">programming>javascript>libraries" is not the same as "services>municipal>libraries". You can't use the same word for both meanings in French or Spanish.

- I have not dealt with the Smarty templates yet. I'm using plain vanilla PHP to render pages, but Smarty is used for all the outgoing email notifications. I don't think that will be difficult; I'm just going to make multiple templates with language-suffixed names, and apply the appropriate one. Sort of like I'm already doing with the images.

- dates, currency, etc. I haven't done any of that localization stuff yet.

- the "sprintf" kinds of phrases, like "you have %d new messages". I haven't dealt with those yet, and there are many of them yet to be translated.

- not doing any caching yet. I think I'd like to build all this with no caching to see how clunky it gets, so I can marvel at the speed increase when I implement it. Like saving dessert for the end of a meal.

- URLs are NOT going to be translated. Sorry, that is just too much to ask at this juncture.
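For the record, the JS strategy I *didn't* pick (the global phrase array) would look roughly like this (sketch; function and file names invented):

```php
<?php
// sketch of the strategy I didn't pick: generate a per-language JS
// file (e.g. lang-es.js) that defines a global phrase map for static
// scripts to use (function and file names invented)
function build_js_lang_file(array $phrases)
{
    // json_encode handles the string escaping for us
    return "var LANG = " . json_encode($phrases) . ";\n"
         . "function t(key) { return LANG[key] || key; }\n";
}

// writing to the temp dir here just to keep the sketch runnable
file_put_contents(sys_get_temp_dir() . '/lang-es.js',
    build_js_lang_file(array('welcome_to_home_page' => 'Bienvenido')));
```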

This is a huge job. Gargantuan. I knew it would be. But I'm happy with the progress so far.

httpwebwitch

3:15 am on Jun 9, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, I just switched the SQLite schema; instead of one large database with all the languages in it, I have several smaller ones with one language in each. So, as a page is being rendered, there's only one of those floating in memory (because a page is only in one language at a time).

Any speed difference, if there is one, is imperceptible. I didn't bother doing any benchmarking, so I can't tell you if it saved a few milliseconds.