Smarter AI to Identify Languages: When Scripts Are Not Enough

Did you hear that many people were banned on Twitter just for typing in Cyrillic? The reason was that thousands of Russian bots had sent a flood of tweets in the two days preceding the EU referendum. It’s true that the Russian language uses letters from the Cyrillic script, but the same is true of more than 20 languages around the world!

Some months ago, Twitter suspended many accounts and hid tweets in Bulgarian simply because they used the Cyrillic alphabet. It seems that Twitter tuned its algorithm to wipe out Russian bots and trolls, and since the Russian language uses Cyrillic, the very use of this alphabet ended up being banned.

Among other European and non-European languages, Cyrillic is the standard script for Slavic languages such as Macedonian, Bulgarian or Russian. While it is true that Russia accounts for about half of the people in Eurasia using Cyrillic as the official alphabet for their national language, this is not a sufficient reason to punish the whole group for a member’s mistake.

As is so often the case, one size does not fit all. When we glance at a text and see an alphabet other than Latin, we tend to associate it with a majority language. Yet that alphabet, as in the case of Cyrillic, is very likely the writing system for many other lesser-known languages.




A poor language identification system analyzes a text by merely paying attention to its alphabet. If the writing system is not Latin, it will probably assign the text to the language most commonly written in that script. The most challenging cases are languages that are very similar to each other and even share the same characters. Relying on the script alone can be a dangerous practice, as the Twitter case shows, especially nowadays, when anything may go viral in a matter of seconds.
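To make the problem concrete, here is a minimal Python sketch of the script-only approach described above. It is an illustration, not any real product's code: the MAJORITY_LANGUAGE table and the function names are assumptions made up for this example. It reads the script from each character's Unicode name and maps it to a single "majority" language.

```python
import unicodedata
from collections import Counter

# Hypothetical table (an assumption for illustration): each script is
# mapped to the one language a naive system would assume for it.
MAJORITY_LANGUAGE = {"CYRILLIC": "Russian", "LATIN": "English", "GREEK": "Greek"}


def dominant_script(text):
    """Return the most frequent script among the letters of `text`, or None."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Unicode names begin with the script, e.g. "CYRILLIC SMALL LETTER A".
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


def naive_identify(text):
    """Guess the language from the script alone (deliberately flawed)."""
    return MAJORITY_LANGUAGE.get(dominant_script(text), "unknown")


# A Bulgarian greeting is labeled Russian, because only the script is checked.
print(naive_identify("Добър ден, как сте?"))
```

Any Bulgarian, Ukrainian or Macedonian text gets the same label here, which is exactly the kind of mistake that penalizes every Cyrillic writer.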

One example of a challenging language to detect is Standard Serbian, which uses both the Cyrillic and Latin scripts. When written in Latin script, Serbian is also very similar to Bosnian and Croatian, with only minute differences. Both challenges can be overcome with the Bitext language identification tool, which can identify more than 77 languages and variants. The tool takes linguistic and morphological information into account when detecting words from each language. To do that, it needs linguistic resources such as lexical or data dictionaries.
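As a rough illustration of why lexical resources help, the Python sketch below scores a text against small per-language word lists instead of looking at the script. This is not how the Bitext tool is implemented: the LEXICONS word lists and the identify_by_lexicon function are hypothetical, and a real system would rely on full dictionaries and morphological analysis rather than a handful of words.

```python
import re

# Tiny hypothetical word lists (assumptions for illustration only).
# All three languages are written here in the same Cyrillic script.
LEXICONS = {
    "Russian": {"и", "не", "да", "что", "это", "как", "привет", "спасибо"},
    "Bulgarian": {"и", "не", "да", "че", "това", "как", "е", "здравей", "благодаря"},
    "Serbian": {"и", "не", "да", "шта", "ово", "сам", "здраво", "хвала"},
}


def identify_by_lexicon(text):
    """Pick the language whose word list matches the most tokens."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in LEXICONS.items()
    }
    best = max(scores.values())
    winners = [lang for lang, score in scores.items() if score == best]
    # No evidence, or a tie between languages: refuse to guess.
    if best == 0 or len(winners) > 1:
        return "unknown"
    return winners[0]


print(identify_by_lexicon("Благодаря, това е чудесно"))
print(identify_by_lexicon("Спасибо, это что"))
```

Returning "unknown" on a tie is a deliberate choice in this sketch: when the only matched words are shared by several languages, declining to answer is safer than defaulting to the majority language.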

On Twitter, people from all over the world write in every language. With a suitable language identification tool, Bulgarian and Russian could easily have been told apart, avoiding the awkward situation of banning every Cyrillic writer.


