Localisation: How we guess the target translation language in Virtaal

In Virtaal, our desktop Computer Aided Translation (CAT) tool, we have a number of usability goals. One of those is to limit the configuration required to use the tool. Most of us think nothing of setting the target translation language in our CAT tool when requested. But we've always asked the question: can't the CAT tool work this out itself?

In this post I'll talk about how we've been able to correctly determine the target language for about 87% of the localisation files on a typical Linux system.

I'm a translator, how does this help me?

Most translators, who work in one language and one direction, are probably wondering why this is an issue. But anyone who translates in both directions, translates a number of languages or manages a number of translation teams will understand just how important this feature is. When they open a file, their language settings are changed automatically and should be correct.

The feature allows the CAT tool to configure itself without any intervention from the translator, apart from the simple act of opening a file for translation. Even a single-language translator will benefit from this feature, since translators often examine other translations to see how someone else translated the source text. In this case Virtaal's settings will change for the quick lookup and change back when the real translation begins, all without the translator doing anything.

I personally review a number of translated languages. I like using Virtaal as it simply reconfigures itself to the target language when I open a file. Mostly I don't even need to check that the selected target language is correct. My Machine Translation, Translation Memory, terminology and spell checking are automatically enabled for the correct target language.

A little history and some background information

We've been building this language guessing into Virtaal for some time now. Our aim is to do the right thing with minimal user input. When first run, Virtaal tries to determine the target language by examining the environment, which mostly involves looking at your locale. This was our first effort to get the language right.
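
As a rough sketch of what "looking at your locale" can mean on a Unix-like system (this is not Virtaal's actual code), the function below checks the environment variables that Gettext itself consults, in Gettext's order of precedence. The function name is made up for this example, and it deliberately drops the encoding and any modifier such as "@latin".

```python
import os

def language_from_environment(environ=os.environ):
    """Return a locale-style language code from the environment, or None."""
    # Gettext's precedence: LANGUAGE, then LC_ALL, LC_MESSAGES and LANG.
    for variable in ("LANGUAGE", "LC_ALL", "LC_MESSAGES", "LANG"):
        value = environ.get(variable, "")
        # LANGUAGE may hold a colon-separated priority list, e.g. "af:en".
        for candidate in value.split(":"):
            # Strip the encoding (".UTF-8") and any modifier ("@latin").
            code = candidate.split(".")[0].split("@")[0]
            if code and code not in ("C", "POSIX"):
                return code
    return None
```

Passing a plain dictionary instead of the real environment, e.g. language_from_environment({"LANG": "af_ZA.UTF-8"}), makes the function easy to try out.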

The Translate Toolkit, on which Virtaal is built, allows us to determine the source and target languages of a number of file formats (TMX, XLIFF, Qt). Thus once we load a file we're able to look at the file metadata to determine the language pair. But this doesn't work on PO files since there isn't any target language information in the header.

The missing target language information in Gettext was why I proposed that we add a language header to Gettext PO files. Fortunately this idea was accepted upstream and it has been implemented in Gettext. However, we're still waiting for this new version of Gettext to be released and once released we'll still need to wait quite some time for it to gain wide adoption.

So while we waited for Gettext 0.18 to be released we implemented ngram matching as another way to guess the target language. This works quite well, but we need a language model for each language that we want to guess. Ngrams are still useful for us in Virtaal: we add the ngram-guessed language to our language pair chooser, so if the target language is incorrectly indicated, the ngram-guessed pair will still appear in the language chooser list.
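
To illustrate the ngram idea (this is a toy, not the guesser that ships with Virtaal), the sketch below builds character trigram profiles from tiny sample sentences and picks the language whose profile overlaps most with the input. The sample texts, the function names and the overlap score are all assumptions made for this example; a usable guesser needs models trained on far more text.

```python
from collections import Counter

def trigram_profile(text):
    """Count overlapping character trigrams in whitespace-normalised text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

# Toy training text: hypothetical stand-ins for real language models.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat "
          "on the mat with the other animals",
    "af": "die vinnige bruin jakkals spring oor die lui hond en die kat "
          "sit op die mat met die ander diere",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et "
          "le chat est assis sur le tapis avec les autres",
}
MODELS = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}

def guess_language(text):
    """Return the code of the model sharing the most trigrams with text."""
    profile = trigram_profile(text)

    def overlap(model):
        # A Counter returns 0 for trigrams the model has never seen.
        return sum(min(count, model[gram]) for gram, count in profile.items())

    return max(MODELS, key=lambda lang: overlap(MODELS[lang]))
```

The sketch also shows why a model is needed per language: a language without an entry in MODELS can never be returned.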

Virtaal language chooser popup

Realising that we couldn't wait around for Gettext 0.18 to be released and then filter down into distributions over 1-2 years, we decided to look at other ways to determine the target language more reliably from information in the file header.

Language-Team header analysis

We've looked at analysing the Gettext 'Language-Team' header entry to help determine the target language. To do this analysis our script msgunfmt'ed the 15,000+ MO files I have on my Fedora 12 installation. This created long lists of the potential Language-Team headers that we then ran through our guesser. We added information and improved the guesser as we identified patterns in the extracted headers.
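
Our script used msgunfmt, but a similar list of headers can be collected with nothing more than Python's standard gettext module. The sketch below is an alternative under that assumption, not the script we actually ran; the function names are invented for the example.

```python
import gettext
from collections import Counter
from pathlib import Path

def language_team(mo_path):
    """Return the Language-Team header of one MO file, or None if absent."""
    with open(mo_path, "rb") as mo_file:
        catalog = gettext.GNUTranslations(mo_file)
    # info() exposes the catalog header as a dict with lower-cased keys.
    return catalog.info().get("language-team")

def collect_language_teams(root):
    """Count each distinct Language-Team value in *.mo files under root."""
    teams = Counter()
    for mo_path in Path(root).rglob("*.mo"):
        try:
            # Files without the header are counted under the key None.
            teams[language_team(mo_path)] += 1
        except (OSError, UnicodeError):
            # Skip files that are unreadable or not valid MO catalogs.
            continue
    return teams
```

On many Linux systems the installed catalogs live under /usr/share/locale, so collect_language_teams("/usr/share/locale").most_common(20) would be one way to get a first look at the most frequent headers.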

In the analysis we found the following:

  • A Language-Team of English is almost always a false positive. E.g. "Kannada <en@li.org>": an English email address for the Kannada team is unlikely.
  • Small languages almost always get this header wrong. E.g. a Hawaiian translation has the header "English <en@translate.freefriends.org>".
  • Some meta language translation projects don't distinguish between the languages that they are translating. This mostly affects Indic languages, e.g. "<info.gist@cdac.in>" is used for a number of Indic translations.
  • Some projects use generic contact information. Examples include wxWidgets, Novell, Compiz and openSUSE. Technically there is nothing wrong with this and we can work around it if the actual target language is mentioned, but often the target language isn't mentioned.
  • Even with these issues we can safely guess 87% of the target languages from the headers with minimal false positives.

In the cases where we can't guess the language we're almost always dealing with: missing or default header information, English headers, or personal email addresses that we've excluded.

Here are some of the details of our analysis:

  • Analysed 15244 MO files.
  • Could not classify 7,5% (1133).
  • Incorrect language classification for 5,5% (848) of the files. Many of these are cases where translators have indicated regional variants, e.g. de vs de_DE, af vs af_ZA, bn vs bn_IN, or different scripts, e.g. sr vs sr@latin.
  • Only 1,8% (287) are true misclassifications. Most of these are due to incorrect language information in the headers. This probably says more about how reliable the data is than it highlights any issue with the guesser.

So combining this data we can safely and correctly guess 87% of the language teams based simply on the team header. We expect 5,5% to be incorrect or to not capture the regional and encoding information. We can't guess 7,5% of the headers.
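
The arithmetic behind these figures can be checked from the raw counts. The sketch below assumes, as the list above suggests, that the 287 true misclassifications are a subset of the 848 incorrect guesses; the variable names are only for illustration. Recomputed from the counts, the shares come out within about a tenth of a percentage point of the rounded figures quoted in this post.

```python
TOTAL = 15244         # MO files analysed
UNCLASSIFIED = 1133   # no guess possible
INCORRECT = 848       # wrong guess, including region and script variants
TRUE_MISSES = 287     # assumed to be the part of INCORRECT that is plainly wrong

def share(count, total=TOTAL):
    """Return count as a percentage of total."""
    return 100.0 * count / total

# Everything that is neither unclassified nor incorrect was guessed correctly.
correct = TOTAL - UNCLASSIFIED - INCORRECT

for label, count in [
    ("correct", correct),
    ("unclassified", UNCLASSIFIED),
    ("incorrect", INCORRECT),
    ("true misclassifications", TRUE_MISSES),
]:
    print(f"{label}: {count} files, {share(count):.1f}%")
```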

Even though we'll misguess some target languages, the translator will still be able to set their target language within Virtaal. This will allow them to correct any bad classifications and also ensure that when saved the file will use the correct Gettext 'Language' header. We won't need to guess the language again.

How does our guesser use the Language-Team header to guess the target language?

Our analysis of existing headers was to help build our actual Language-Team guesser. We guess the target language as follows:

  1. Before we even try to analyse Language-Team we look for the Language header, then for headers used by Poedit. These headers are likely to be correct, as they are set by users to actually indicate their target language. If we don't find those headers then we move on to the Language-Team analysis.
  2. Our first step in the Language-Team header is to check a number of regular expressions for common language team email addresses. Thus "<fr@li.org>" is easily identified as French. By using a regex we also future-proof the guesses and can detect teams that emerge later.
  3. Then we use snippets of contact information which are almost always email addresses and sometimes URLs. These are essentially team contacts that can't be detected with our regular expressions.
  4. Lastly, we use snippets of language names both in English and the target language, e.g. Dutch and Nederlands.
  5. If all that fails we give up guessing.
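
The steps above can be sketched as a simple cascade. This is an illustration only, not the code in team.py: the patterns and lookup tables below are small, hypothetical examples (the real data lives in LANG_TEAM_CONTACT_SNIPPETS and LANG_TEAM_LANGUAGE_SNIPPETS, whose structure is not shown here), and the use of "X-Poedit-Language" as the Poedit header is an assumption.

```python
import re

# Hypothetical, heavily trimmed lookup data for illustration only.
TEAM_EMAIL_PATTERNS = [
    re.compile(r"<(?P<code>[a-z]{2,3}(?:_[A-Z]{2})?)@li\.org>"),
    re.compile(r"<translation-team-(?P<code>[a-z]{2,3})@lists\.sourceforge\.net>"),
]
CONTACT_SNIPPETS = {"traduc@traduc.org": "fr", "vertaling@vrijschrift.org": "nl"}
LANGUAGE_SNIPPETS = {"french": "fr", "français": "fr",
                     "dutch": "nl", "nederlands": "nl"}

def guess_target_language(headers):
    """Return a language code guessed from PO header fields, or None."""
    # 1. Headers set explicitly by tools are trusted first.
    language = headers.get("Language", "").strip()
    if language:
        return language
    poedit = headers.get("X-Poedit-Language", "").strip().lower()
    if poedit in LANGUAGE_SNIPPETS:
        return LANGUAGE_SNIPPETS[poedit]

    team = headers.get("Language-Team", "")
    # 2. Regular expressions for common team email address schemes.
    for pattern in TEAM_EMAIL_PATTERNS:
        match = pattern.search(team)
        if match:
            return match.group("code")
    lowered = team.lower()
    # 3. Known contact snippets that the patterns cannot detect.
    for snippet, code in CONTACT_SNIPPETS.items():
        if snippet in lowered:
            return code
    # 4. Language names in English or in the target language.
    for name, code in LANGUAGE_SNIPPETS.items():
        if name in lowered:
            return code
    # 5. Nothing matched, so give up.
    return None
```

Ordering the checks from most to least specific means a loose language-name match can never override a more reliable header or team address. With this sample data, the untouched default header "LANGUAGE <LL@li.org>" matches nothing and yields None.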

Can I see this in action?

You can see this in action if you run a recent version of Virtaal with Translate Toolkit 1.7.0 (which was released on 2010-05-12). Windows users will need to wait for a new release of Virtaal (>v0.6.0).

How can you help?

We think we've got most of the data sorted out. If you can help us reduce the 5,5% misclassified and 7,5% unclassifiable entries, that would be great.

If you are a translator then please have a look at our team.py file and check that your team's email address (see LANG_TEAM_CONTACT_SNIPPETS) and your language name, variants or other defining information (see LANG_TEAM_LANGUAGE_SNIPPETS) are listed.

But probably the easiest and best way that you can help is to use a good localisation tool, such as Virtaal, Pootle or Poedit, that captures the target language information in the header. The next best thing is to use standard contact information for your team so that it's easy to guess your language.
