The explosion of Artificial Intelligence over the last year has generated increasing interest in Natural Language Understanding (NLU) technologies for building systems capable of interacting with end customers in their own language in a truly “natural” way. In recent months, we have seen many user interfaces turn into conversational chatbots that can be controlled or interacted with using natural language.
However, these interfaces have been developed successfully for only a small set of languages, mainly English. Getting these devices to detect languages other than English has proven challenging in itself, but the next frontier will be interfaces capable of understanding and managing multiple languages, or even of dealing with a real mixture of different languages at the same time. This is where language identification comes to the fore.
Language identification techniques commonly assume that every document is written in one of a closed set of known languages for which there is training data. The task is thus formulated as selecting the most likely language from that set.
At Bitext, we want to remove this monolingual assumption, and address the problem of language identification in documents that may contain text from more than one language from the candidate set. In fact, at this very moment we are devising a method that concurrently detects whether a document is multilingual and estimates the proportion of the document that is written in each language for a set of more than 50 languages and language variants.
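To make the idea of "proportion of the document per language" concrete, here is a minimal Python sketch. It is not Bitext's method: it assumes that some earlier step has already assigned a language label to each token, and the function name `language_proportions` and the `min_share` threshold are hypothetical choices made for illustration.

```python
from collections import Counter


def language_proportions(token_labels, min_share=0.1):
    """Return (proportions, is_multilingual) from per-token language labels.

    token_labels: list of language codes, with None for unlabelled tokens.
    min_share: hypothetical threshold; a language must cover at least this
               share of the labelled tokens to count as really present.
    """
    counts = Counter(label for label in token_labels if label is not None)
    total = sum(counts.values())
    if total == 0:
        return {}, False
    proportions = {lang: n / total for lang, n in counts.items()}
    # The document is treated as multilingual when at least two languages
    # each reach the minimum share.
    significant = [lang for lang, share in proportions.items() if share >= min_share]
    return proportions, len(significant) >= 2


# Toy input: labels that a token-level identifier might have produced.
labels = ["en", "en", "en", "es", "es", None, "en", "es"]
proportions, is_multilingual = language_proportions(labels)
print(proportions, is_multilingual)
```

The threshold keeps a single stray foreign word from turning a monolingual document into a "multilingual" one; the right value would depend on the texts being analyzed.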
- Script identification
Of course, the first clue in language identification is the script. The script is the set of characters used to write a given language, and it depends on the writing system in which that language is written.
Not a translation: this sign shows the city of Belgrade's Serbian name in both Cyrillic and Latin script. (Source: The Cultural Spotter)
A non-negligible number of languages have their own script, which identifies them unambiguously (Armenian, Georgian, Gujarati, Telugu…), while other scripts are shared by many different languages. Latin script is used for many European languages, but several Asian languages use it as well, such as Indonesian, Malaysian, Vietnamese, Tagalog… Similarly, Cyrillic script is used by some Eastern European languages but also by Asian languages like Kazakh or Kyrgyz.
Then again, some languages can be written in more than one script. This is the case of Serbian, whose official script is Cyrillic, although a large number of speakers use the Latin alphabet. It is also the case of Malaysian, which is most widely written in Latin script but can also be written in a script derived from Arabic.
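As a rough illustration of script detection, the Python sketch below guesses the dominant script of a string. It relies on a heuristic, not on an official API: Python's standard library does not expose the Unicode Script property, so the code takes the first word of each character's Unicode name (for example "CYRILLIC" in "CYRILLIC CAPITAL LETTER BE") as a stand-in. The function names are hypothetical.

```python
import unicodedata
from collections import Counter


def char_script(ch):
    """Approximate the script of one character from its Unicode name.

    Heuristic: the first word of the name ("LATIN", "CYRILLIC", ...).
    Non-alphabetic characters (digits, punctuation, spaces) return None.
    """
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def dominant_script(text):
    """Return the most frequent script label in the text, or None."""
    counts = Counter(s for s in map(char_script, text) if s)
    return counts.most_common(1)[0][0] if counts else None


for sample in ["Београд", "Beograd", "Հայերեն"]:
    print(sample, dominant_script(sample))
```

A script label alone settles the question only for languages with a script of their own, such as Armenian. For the Belgrade example it narrows the candidates to languages written in Cyrillic or in Latin, and a further step is still needed to pick the language.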
- Language identification without training
Traditional approaches to language detection rely on statistical techniques that require a training set covering all the languages to be identified. It is relatively easy to find or produce monolingual training sets for language identification, but producing equivalent training sets for multilingual content has proven much harder.
In this case, Bitext has one of the widest lexical and morphological databases, covering 50+ languages (and increasing every day). With this database as a starting point, we have designed an extremely efficient way to access the full database, which makes it possible to produce excellent results, in both quality and performance, for language identification in multilingual environments without the need for any training set built specifically for the task.
In this way, any of the 50+ languages can be identified at the same time, allowing multilingual texts to be analyzed. Furthermore, each part written in any of the different languages can be isolated.
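The general idea of dictionary-based identification, and of isolating the part written in each language, can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not Bitext's implementation: `TOY_LEXICON` is a hypothetical two-language word list standing in for a real lexical database, and tokens that match no language (or several) simply inherit the label of the preceding token.

```python
import re
from itertools import groupby

# Hypothetical toy lexicon; a real system would query a full lexical database.
TOY_LEXICON = {
    "en": {"the", "weather", "is", "nice", "today", "but"},
    "es": {"pero", "mañana", "va", "a", "llover", "el"},
}


def label_token(token):
    """Return the single language whose word list contains the token, else None."""
    matches = [lang for lang, words in TOY_LEXICON.items() if token.lower() in words]
    return matches[0] if len(matches) == 1 else None


def split_by_language(text):
    """Split text into (language, segment) pairs of consecutive same-language tokens."""
    tokens = re.findall(r"\w+", text)
    resolved, last = [], None
    for token in tokens:
        # Unknown or ambiguous tokens inherit the previous token's label.
        label = label_token(token) or last
        resolved.append(label)
        last = label
    segments = []
    for lang, group in groupby(zip(resolved, tokens), key=lambda pair: pair[0]):
        segments.append((lang, " ".join(tok for _, tok in group)))
    return segments


print(split_by_language("The weather is nice today pero mañana va a llover"))
```

No training step is involved: the only resource is the word list, which is why coverage and fast lookup matter so much in this kind of approach.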
Of course, the system works perfectly for large documents, but we are also obtaining excellent results for shorter texts such as tweets or reviews.
- Language variants identification
The ultimate challenge in language identification is to differentiate among language variants of the same base language (for instance, the Spanish variant spoken in Spain and the ones spoken in Argentina, Mexico or Costa Rica), which has also been achieved with Bitext technology.
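One simple way to picture variant identification is to look for vocabulary that is typical of one variant. The Python sketch below does this for three Spanish variants. It is an assumption-laden toy, not the Bitext approach: the `VARIANT_MARKERS` word lists are hypothetical, tiny, and only illustrative (several of these words are also used in other countries), and the function name `guess_variant` is invented for the example.

```python
import re
from collections import Counter

# Hypothetical marker words; illustrative only, neither exhaustive nor exclusive.
VARIANT_MARKERS = {
    "es-ES": {"ordenador", "zumo", "vosotros"},
    "es-MX": {"alberca", "platicar", "chamba"},
    "es-AR": {"pileta", "colectivo", "vos"},
}


def guess_variant(text):
    """Return the variant with the most marker-word hits, or None if there are none."""
    tokens = re.findall(r"\w+", text.lower())
    hits = Counter()
    for variant, markers in VARIANT_MARKERS.items():
        hits[variant] = sum(1 for token in tokens if token in markers)
    variant, count = hits.most_common(1)[0]
    return variant if count > 0 else None


print(guess_variant("Voy a apagar el ordenador y tomar un zumo"))
print(guess_variant("Nos vemos en la alberca para platicar"))
```

Short texts often contain no marker word at all, in which case the sketch returns `None`; this is one reason why telling variants apart is harder than telling languages apart.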