Localisation Information Language - preventing mistakes and increasing the richness of localisation
You might want to call this an announcement of sorts. This is something that has been brewing in my mind for many years now; it's just that I finally got irritated, found the time, sat down and began writing and planning.
I'm trying to design a language or system that allows localisation information to be conveyed to a translator: a language that is simple enough for a human to understand and act upon, such as "dont translate word 'IMAP'", yet regular and well formed enough to be machine readable, so that a localisation tool can understand the commands and act on them. In this case the tool can test that the acronym IMAP has indeed not been translated.
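To make the IMAP case concrete, here is a minimal sketch of how a tool might read such a note and test a translation against it. The note syntax is just the phrase quoted above, and the function names are invented for illustration; none of this is the final grammar.

```python
import re

# Hypothetical note syntax, borrowed from the phrase quoted above:
#   dont translate word 'IMAP'
NOTE_PATTERN = re.compile(r"dont translate word '([^']+)'", re.IGNORECASE)


def untranslatable_words(note):
    """Return every word the note says must be kept as-is."""
    return NOTE_PATTERN.findall(note)


def check_untranslated(note, source, target):
    """Return protected words that occur in the source but not in the target."""
    missing = []
    for word in untranslatable_words(note):
        pattern = re.compile(r"\b" + re.escape(word) + r"\b")
        if pattern.search(source) and not pattern.search(target):
            missing.append(word)
    return missing
```

An empty list means the translation kept every protected word; anything else is a list of words the translator should restore.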
The initial thoughts for this were inspired by the LOCALIZATION NOTES in Mozilla source code. In the Translate Toolkit we use the DONT_TRANSLATE command to drop messages that should not be translated. This reduces the burden on translators: they never see those messages, ever, which saves them time. But in my mind these notes have never reached their full potential, and this project is an attempt to push localisation notes further. This kind of intelligence forms part of our funded work for Mozilla and I'm keen to see if I can't use this towards my studies.
I've spent quite a bit of time fiddling with grammar and layout to get something I feel is usable. More about that in later posts.
Sorry that I'm a bit sketchy on the commands and grammar; they are still in a state where I move large chunks into common commands as I discover commonalities. For now let me describe, mostly in text, what I've got covered:
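As a rough illustration of the DONT_TRANSLATE idea, the sketch below drops entities from a DTD whose localisation note says DONT_TRANSLATE. It is not the Translate Toolkit's actual code: the regular expressions and the function name are assumptions, and only simple double-quoted entities are handled.

```python
import re

# Matches notes of the form:
#   <!-- LOCALIZATION NOTE (entity.name): DONT_TRANSLATE -->
NOTE_RE = re.compile(
    r"<!--\s*LOCALIZATION NOTE\s*\(([^)]+)\)\s*:\s*DONT_TRANSLATE.*?-->",
    re.DOTALL,
)
# Simplified: only handles <!ENTITY name "value"> with double quotes.
ENTITY_RE = re.compile(r'<!ENTITY\s+([\w.\-]+)\s+"([^"]*)"\s*>')


def translatable_entities(dtd_text):
    """Return {entity name: text} for entities not marked DONT_TRANSLATE."""
    skipped = set()
    for match in NOTE_RE.finditer(dtd_text):
        # Allow one note to name several entities, separated by commas.
        for name in match.group(1).split(","):
            skipped.add(name.strip())
    return {
        name: value
        for name, value in ENTITY_RE.findall(dtd_text)
        if name not in skipped
    }
```

Whatever is left in the returned dictionary is what the translator would actually be shown.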
- Whether you should translate something or not
- XML localisation: which parts can you change
- Character set limitations: what are the characters that you may use
- Lengths of strings: how long or short a string may be
- Localised items: numbers and other items that should be localised but you wish someone had told you
- Variables: a system to allow examples and descriptions of variables and to identify non-obvious variables
- Configuration options: sizes, valid options, etc
That's it for now. I'm still investigating things like: credit entries, language configuration entries and others.
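To give a feel for how two of the items above (character set limitations and string lengths) could become machine checks, here is a small sketch. The Constraints structure and its field names are made up for this example; they are not part of the language itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Constraints:
    """Hypothetical limits a localisation note might place on a string."""
    max_length: Optional[int] = None     # longest allowed translation
    allowed_chars: Optional[str] = None  # characters the translation may use


def check_constraints(target: str, limits: Constraints) -> List[str]:
    """Return a human-readable problem for each limit the target breaks."""
    problems = []
    if limits.max_length is not None and len(target) > limits.max_length:
        problems.append(f"too long: {len(target)} > {limits.max_length}")
    if limits.allowed_chars is not None:
        bad = sorted(set(target) - set(limits.allowed_chars))
        if bad:
            problems.append("disallowed characters: " + "".join(bad))
    return problems
```

The same message can be shown to the translator while they work and used by a tool such as pofilter after the fact.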
I have found this quite a fun exercise. I'm convinced that a language like this is more powerful in the hands of translators than programmers. The simple reason is that programmers simply don't understand all the issues localisers face. Translators, meanwhile, would love a way to leave instructions for those coming after them, whether on their own team or another, that prevent them from making the same mistakes.
A simple example of something I found in OpenOffice.org. OpenOffice.org has 11 styles of variables, and pofilter can test for all of them. Yet in one message the word DOMAIN appeared in capitals. It didn't match any of the OpenOffice.org variable styles, yet it looked like a variable, and you could not be sure. Checking other translations, it was a 50/50 split between those who had translated it and those who hadn't. Finally, after getting some feedback from the developers, we learnt that yes, it is a variable and thus should not be translated. The question is: how many people have localised it? How many had to investigate the issue, discovering over and over again that they should not translate the word DOMAIN? And how many new localisers will continue to translate the word because there is no comment to remind them of the finding? This localisation language would allow a new localiser to avoid that mistake and would allow tools to check that no further mistakes are made.
Fortunately, with FOSS a translator can get such a comment included in the source text and in that way communicate valuable information to the 50 other language teams that follow them. Thus, since translators are the people who care, and FOSS allows them to care, I'm convinced that localisers will be the best implementors of such a system.
I've worked hard to think about how we ensure that the comments can be placed at the root. Thus they must appear in the DTD files and in the C code, and they end up in PO and other intermediate formats after their extraction. Thus they have some permanence in that they ride with the code.
With this language we should be able to do a few things that have not been possible until now:
- Eliminate all errors in configuration settings: Sounds impossible, but by preventing errors as they're made and easily and quickly checking for them I'm convinced we can eliminate all of them. If some remain we need to extend our localisation language.
- Produce more complete localisations: as we can communicate to translators that they may and should localise things that in the past they have been too scared to do such as a date or a number we end up with localisations that are more closely aligned with the cultural conventions of the language.
- Produce higher quality translation involving variable data: by communicating more clearly which parts of a message are variables, what they do and examples of them in use we can allow a translator to fully grasp what they will be producing.
- Reduce or eliminate the complexity and limitations of some formats: work around localisation issues in certain source formats such as accesskey in Mozilla and sentences broken across multiple strings.
The language tries its best to be backward compatible; thus a DONT_TRANSLATE note from Mozilla source, although deprecated in the new system, will still work.
I spent large amounts of time reading the XLIFF spec and, to be honest, got a bit depressed about it at the same time. But it was a good exercise to get an understanding of the types of localisation information XLIFF is able to store and convey.
My general plan moving forward is to do the following:
- Continue to flesh out the syntax of the system, eliminate duplication, increase human readability and consistency
- Review all Mozilla files, OpenOffice.org and others to find common issues that are not covered and that can be handled by a machine
- Build a parser in Python
- Add some tests to pofilter to see if my concept works as expected
- Longer term: I'd like to take these findings to XLIFF and improve the QA component of XLIFF. With the vision for XLIFF 2.0 being to create a skeleton format on which to hang various extensions, I would see this language and QA in general as an ideal extension for XLIFF.
I hope that in the coming few weeks I can get a draft out for other people's input. For now please leave your comments; I'd especially like to hear about any problems that you encounter that would be worth trapping. Just a heads up: no, this is not an l20n-like project. I'm not trying to solve the issues of language, declension, gender, etc. Others who feel the pain of those issues are working on them, and the two are certainly not mutually exclusive. For me, localisation information interchange has some really low-hanging fruit that all can benefit from now, and we have the tools that can implement the checks that are needed, so I'm pushing ahead in that direction.
So, forward to a healthier localisation environment with Localisation Information Interchange - shall we call it LILy?
Don't forget to leave comments about your current pains and let's see if we can get them integrated into LILy.
- dwayne's blog

Re: Localisation Information Language
Hi, would this differ from ITS?
Andrew C.
Re: Localisation Information Language
ITS in its current form is about the extraction or identification of localisable content in XML. There are some areas of discussion in draft in their wiki that deal with content transformation or content explanation (I'll think of better words once I've posted this reply I'm sure), such as "this is a date". But currently none of that is in the spec, and the spec focuses on what should or should not be translated.
Where I see this as different is that it is not aimed at XML, yet. It is aimed at being a common language that can be embedded in any source format. Either it is used directly there, e.g. by a tool that edits monolingual DTD files, or it is carried across to the target format such as PO or XLIFF. I'm not interested yet in codifying that XML representation and still need to do some more research on how/if to align this with what is in XLIFF.
Should ITS cover this? Maybe, and maybe ITS is the right home to codify the XML representation of the language that I'm building here.